Recently, when performing planned test scenarios with different hardware parts, our QA team identified a kernel panic issue during read operations on the Areca ARC-1883 SAS RAID Adapter. We notified Areca, and thanks to their fast reaction we were able to resolve the problem quickly. Here's an overview.
The problem
During sequential read operations, a kernel panic occurred on Linux. As it turned out, the newer the kernel version, the sooner the system would hang.
Call trace from the dying system:
BUG: unable to handle kernel paging request at ffff8800ffffffc8
IP: [<ffffffffa01be89d>] arcmsr_drain_donequeue+0xd/0x70 [arcmsr]
PGD 1a86063 PUD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/pci0000:00/0000:00:1d.0/usb2/2-1/2-1.4/speed
CPU 12
Modules linked in: arcmsr(U) autofs4 sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tab]
Pid: 3576, comm: dd Not tainted 2.6.32-431.11.2.el6.x86_64 #1 Supermicro X9DRH-7TF/7F/iTF/iF/X9DRH-7TF/7F/iTF/iF
RIP: 0010:[<ffffffffa01be89d>] [<ffffffffa01be89d>] arcmsr_drain_donequeue+0xd/0x70 [arcmsr]
RSP: 0018:ffff88089c483e38 EFLAGS: 00010082
RAX: ffffc90016ea00c8 RBX: ffff8810731885e0 RCX: ffffc90016ea0020
RDX: 0000000000000001 RSI: ffff8800ffffffb0 RDI: ffff8810731885e0
RBP: ffff88089c483e48 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffffc90016ea0030
R13: 0000000000000008 R14: 0000000000000010 R15: 0000000000000001
FS: 00007f7f733e5700(0000) GS:ffff88089c480000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffff8800ffffffc8 CR3: 0000000f2b97c000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process dd (pid: 3576, threadinfo ffff8809c7172000, task ffff8810712b8ae0)
Stack:
ffff88089c483e48 ffffffff81095628 ffff88089c483eb8 ffffffffa01beeff
<d> 0000000000000005 ffff88089c483e90 ffff88107318acd8 ffffc90016ea0020
<d> ffffc90016ea00c8 ffffc90016ea0030 0000000000000100 ffff88107026dc40
Call Trace:
<IRQ>
[<ffffffff81095628>] ? schedule_work+0x18/0x20
[<ffffffffa01beeff>] arcmsr_interrupt+0x5ff/0x6a0 [arcmsr]
[<ffffffffa01befb1>] arcmsr_do_interrupt+0x11/0x20 [arcmsr]
[<ffffffff810e6eb0>] handle_IRQ_event+0x60/0x170
[<ffffffff8107a93f>] ? __do_softirq+0x11f/0x1e0
[<ffffffff810e980e>] handle_edge_irq+0xde/0x180
[<ffffffff8100c30c>] ? call_softirq+0x1c/0x30
[<ffffffff8100faf9>] handle_irq+0x49/0xa0
[<ffffffff815315fc>] do_IRQ+0x6c/0xf0
[<ffffffff8100b9d3>] ret_from_intr+0x0/0x11
<EOI>
[<ffffffff81136bc9>] ? activate_page+0x189/0x1a0
[<ffffffff81136bb9>] ? activate_page+0x179/0x1a0
[<ffffffff81136c21>] mark_page_accessed+0x41/0x50
[<ffffffff811213c3>] generic_file_aio_read+0x2c3/0x700
[<ffffffff811c4841>] blkdev_aio_read+0x51/0x80
[<ffffffff81188e7c>] ? do_sync_read+0xec/0x140
[<ffffffff81188e8a>] do_sync_read+0xfa/0x140
[<ffffffff8109b290>] ? autoremove_wake_function+0x0/0x40
[<ffffffff812334d6>] ? selinux_file_permission+0x26/0x150
[<ffffffff812335ab>] ? selinux_file_permission+0xfb/0x150
[<ffffffff81226496>] ? security_file_permission+0x16/0x20
[<ffffffff81189775>] vfs_read+0xb5/0x1a0
[<ffffffff811975bd>] ? path_put+0x1d/0x40
[<ffffffff811898b1>] sys_read+0x51/0x90
[<ffffffff810e1e4e>] ? __audit_syscall_exit+0x25e/0x290
[<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
Code: ff ff c6 02 00 48 8d 7a 01 40 b6 5f e9 5f ff ff ff 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 48 83 ec 10 0f 1f 44 00 00 <4c> 8b 46 18 49 39 f
RIP [<ffffffffa01be89d>] arcmsr_drain_donequeue+0xd/0x70 [arcmsr]
RSP <ffff88089c483e38>
CR2: ffff8800ffffffc8
Preliminary tests showed that the same scenario performed on an older kernel runs longer, but ultimately produces the same result: a kernel panic.
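For the record, the failing workload was nothing exotic: the call trace above shows dd issuing plain sequential reads against the block device (blkdev_aio_read). A minimal C equivalent is sketched below; the device path and the 1 MiB read size are placeholders, and the panic itself of course only occurs with the affected driver loaded.
/* Minimal sequential-read loop, roughly what dd does on the RAID volume.
 * /dev/sdX is a placeholder for the Areca-backed block device. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    const size_t bs = 1 << 20;           /* 1 MiB per read, an assumption */
    char *buf = malloc(bs);
    ssize_t n;
    int fd = open("/dev/sdX", O_RDONLY); /* hypothetical device path */

    if (fd < 0 || !buf) {
        perror("setup");
        return 1;
    }
    while ((n = read(fd, buf, bs)) > 0)  /* read sequentially until EOF */
        ;
    if (n < 0)
        perror("read");
    close(fd);
    free(buf);
    return 0;
}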
What was the solution?
Our development team analyzed the driver and immediately pointed Areca to the part of the code where the issue occurred. The problem was caused by deriving the wrong Command Control Block (CCB) pointer value in the arcmsr_hbaC_postqueue_isr function.
In the arcmsr_hbaC_postqueue_isr function of the old driver:
flag_ccb = readl(&phbcmu->outbound_queueport_low);
ccb_cdb_phy = (flag_ccb & 0xFFFFFFF0); /* frame must be 32 bytes aligned */
arcmsr_cdb = (struct ARCMSR_CDB *)(acb->vir2phy_offset + ccb_cdb_phy);
ccb = container_of(arcmsr_cdb, struct CommandControlBlock, arcmsr_cdb); /* <- ccb points to the wrong address */
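To make the failure mode easier to picture, here is a small user-space sketch of the same arithmetic. The struct layout, the value read from the register and the vir2phy_offset are all made up; only the sequence of operations mirrors the snippet above. The driver code only tells us that ccb ends up pointing at the wrong address, so the sketch simply feeds in an invalid flag_ccb to show how it propagates into a bogus ccb pointer.
/* Simplified, user-space illustration of how a bad flag_ccb becomes a bogus
 * ccb pointer. All concrete values below are hypothetical. */
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct ARCMSR_CDB { char bytes[32]; };        /* stand-in for the real CDB */
struct CommandControlBlock {
    void *list_head;                          /* bookkeeping fields before the CDB */
    struct ARCMSR_CDB arcmsr_cdb;             /* embedded CDB, 32-byte aligned */
};

int main(void)
{
    /* Assume the outbound queue register returns something that is not a
     * valid CDB bus address. */
    uint32_t flag_ccb = 0xFFFFFFFFu;
    uint64_t ccb_cdb_phy = flag_ccb & 0xFFFFFFF0;          /* -> 0xFFFFFFF0 */

    /* Made-up virtual-to-physical offset, in the same range as the kernel's
     * direct mapping on the failing machine. */
    uint64_t vir2phy_offset = 0xffff880000000000ULL;

    /* arcmsr_cdb = acb->vir2phy_offset + ccb_cdb_phy */
    uint64_t arcmsr_cdb = vir2phy_offset + ccb_cdb_phy;

    /* container_of() walks back from the embedded member to the enclosing
     * struct, i.e. it subtracts offsetof(..., arcmsr_cdb). */
    uint64_t ccb = arcmsr_cdb - offsetof(struct CommandControlBlock, arcmsr_cdb);

    /* With these made-up numbers the result lands in the same neighbourhood
     * as the CR2 address in the oops above; the first field that
     * arcmsr_drain_donequeue() touches through such a pointer is unmapped,
     * hence the "unable to handle kernel paging request". */
    printf("arcmsr_cdb = 0x%016llx\n", (unsigned long long)arcmsr_cdb);
    printf("ccb        = 0x%016llx\n", (unsigned long long)ccb);
    return 0;
}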
Upon Areca’s request, our kernel developers also tested this scenario on different Linux systems (including Ubuntu and CentOS) and provided further information about the test results.
Devices affected
We tested other Areca adapters; however, this behavior only occurred on the Areca ARC-1883 SAS RAID Adapter.
Operating systems affected
All Linux-based systems that use an Areca driver version lower than 1.30.0X.18-140417 are affected.
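If you are not sure which driver version a system is running, modinfo arcmsr (or, while the module is loaded, /sys/module/arcmsr/version) reports it. Below is a small C sketch of the latter check, assuming the module exposes its version string via MODULE_VERSION, as current arcmsr releases do.
/* Print the version of the currently loaded arcmsr module, assuming it
 * exposes /sys/module/arcmsr/version (i.e. it was built with MODULE_VERSION). */
#include <stdio.h>

int main(void)
{
    char version[128];
    FILE *f = fopen("/sys/module/arcmsr/version", "r");

    if (!f) {
        perror("arcmsr not loaded or no version attribute");
        return 1;
    }
    if (fgets(version, sizeof(version), f))
        printf("loaded arcmsr version: %s", version);  /* compare against 1.30.0X.18-140417 */
    fclose(f);
    return 0;
}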
What to do if you want to use Open-E DSS V7 with the Areca ARC-1883 SAS RAID Adapter?
Please contact our support team, which already has a fix for this issue. Additionally, Areca has prepared a driver that fixes the issue: [http://www.areca.com.tw/support/s_linux/linux.htm]
All in all, we have to say that we were impressed with Areca's technical support. They reacted very quickly from the moment our QA team first reported the problem. Then, upon Areca's request, our team further investigated the issue. Once the problem was identified, our technology partner Areca quickly prepared a driver that resolves it. Great job, guys!