Decoding a Linux kernel oops panic due to DMAR error

Mon Feb 3 04:35:27 EST 2014

Hello,

I have a server with onboard Intel 10G ports (82599). When I load the
kernel module driver for these ports, everything is fine: I can see the
newly created ethX devices using "ip addr show". However, after I assign
an IP address, and right after I issue the command to bring up the port,
I get a kernel panic related to DMAR (DMA remapping) in the VFIO (Virtual
Function I/O) module. I am not even sure why I am getting this panic,
since this Intel kernel module does not use VFIO. I do know the immediate
cause: NULL is passed as a parameter to vfio_group_get(), which then
dereferences it. I know NULL is passed because register RDI, which is
used to pass the first argument to a function, contains 0.
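
The RDI check can be reproduced mechanically. The following is only a
minimal sketch, assuming Python 3; the helper name parse_registers() is
made up for illustration, and the register lines are copied from the
trace in this mail with the timestamps dropped.

```python
import re

# Register lines copied from the oops in this mail (timestamps removed).
OOPS_REGS = """\
RAX: ffff881fd6740680 RBX: 0000000000000000 RCX: ffff88204157ec00
RDX: 0000000000000084 RSI: 0000000001f5327a RDI: 0000000000000000
RBP: ffff881f52453d10 R08: ffff881f5327abe0 R09: 0000000000000000
CR2: 0000000000000000 CR3: 0000001fd686d000 CR4: 00000000001407f0
"""


def parse_registers(text):
    """Hypothetical helper: map each 'NAME: 16-hex-digit' pair to an int."""
    pairs = re.findall(r"\b([A-Z][A-Z0-9]{1,2}): ([0-9a-f]{16})\b", text)
    return {name: int(value, 16) for name, value in pairs}


regs = parse_registers(OOPS_REGS)

# On x86-64 the first integer/pointer argument is passed in RDI.
first_arg = regs["RDI"]
# CR2 holds the address whose access caused the page fault.
fault_addr = regs["CR2"]

print(f"RDI={first_arg:#x} CR2={fault_addr:#x}")  # prints: RDI=0x0 CR2=0x0
```

Both values are zero, which matches "NULL pointer dereference at (null)"
in the first lines of the oops.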

Linux kernel 3.6.11

The stack trace of the panic follows:

# [11036.855410] BUG: unable to handle kernel [11036.887249] ixgbe 0000:84:00.0: eth6: detected SFP+: 3
NULL pointer dereference at           (null)
[11037.010224] IP: [<ffffffffa006615a>] vfio_group_get+0x9/0x27 [vfio]
[11037.085047] PGD 1fd6b5b067 PUD 20404b1067 PMD 0
[11037.140181] Oops: 0000 [#1] SMP
[11037.178676] Modules linked in: ixgbe(O) nfsv3 autofs4 nfsd nfs_acl nfs lockd sunrpc vfio_pci vfio_iommu_type1 vfio i2c_mux i2c_smbus i2c_dev container ide_pci_generic ide_core uhci_hcd isci ata_generic
[11037.393137] CPU 0
[11037.414974] Pid: 14045, comm: kworker/0:0 Tainted: G           O 3.6.11
[11037.539628] RIP: 0010:[<ffffffffa006615a>]  [<ffffffffa006615a>] vfio_group_get+0x9/0x27 [vfio]
[11037.643521] RSP: 0018:ffff881f52453d00  EFLAGS: 00010282
[11037.706886] RAX: ffff881fd6740680 RBX: 0000000000000000 RCX: ffff88204157ec00
[11037.792053] RDX: 0000000000000084 RSI: 0000000001f5327a RDI: 0000000000000000
[11037.877221] RBP: ffff881f52453d10 R08: ffff881f5327abe0 R09: 0000000000000000
[11037.962394] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88204157f800
[11038.024995] ixgbe 0000:84:00.0: eth6: NIC Link is Up 10 Gbps, Flow Control: RX/TX
[11038.025144] IPv6: ADDRCONF(NETDEV_CHANGE): eth6: link becomes ready
[11038.211671] R13: 0000000000000084 R14: 0000000000000000 R15: 0000000000000000
[11038.296842] FS:  0000000000000000(0000) GS:ffff88204f000000(0000) knlGS:0000000000000000
[11038.393430] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[11038.461988] CR2: 0000000000000000 CR3: 0000001fd686d000 CR4: 00000000001407f0
[11038.547156] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[11038.632326] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[11038.717496] Process kworker/0:0 (pid: 14045, threadinfo ffff881f52452000, task ffff882034d61950)
[11038.822392] Stack:
[11038.846298]  0000000000000084 ffff881fd6740680 ffff881f52453d30 ffffffffa006618a
[11038.934688]  0000000001f5327a ffff882035e23e00 ffff881f52453d50 ffffffffa0066442
[11039.023078]  ffff881f52453d70 ffff881fd6740680 ffff881f52453d70 ffffffffa0072072
[11039.111465] Call Trace:
[11039.140571]  [<ffffffffa006618a>] vfio_device_get+0x12/0x30 [vfio]
[11039.214324]  [<ffffffffa0066442>] vfio_device_get_from_dev+0x19/0x1f [vfio]
[11039.297425]  [<ffffffffa0072072>] vfio_pci_dmar_error_handler+0x13/0x4a [vfio_pci]
[11039.387796]  [<ffffffff81420cc6>] dmar_fault_do_one+0xd4/0xf1
[11039.456366]  [<ffffffff8104175d>] process_one_work+0x1c2/0x311
[11039.525968]  [<ffffffff81041568>] ? manage_workers+0x23a/0x24c
[11039.595566]  [<ffffffff81420bf2>] ? dmar_get_fault_reason+0x52/0x52
[11039.670354]  [<ffffffff81041b42>] worker_thread+0x26c/0x34a
[11039.736840]  [<ffffffff810418d6>] ? process_scheduled_works+0x2a/0x2a
[11039.813710]  [<ffffffff8104583a>] kthread+0x86/0x8e
[11039.871891]  [<ffffffff81604bf4>] kernel_thread_helper+0x4/0x10
[11039.942524]  [<ffffffff810457b4>] ? kthread_freezable_should_stop+0x4d/0x4d
[11040.025618]  [<ffffffff81604bf0>] ? gs_change+0xb/0xb
[11040.085865] Code: 48 8b 00 48 8b 40 20 48 85 c0 74 0c 55 48 8b 7f 40 48 89 e5 ff d0 eb 08 48 c7 c0 ea ff ff ff c3 5d c3 55 48 89 e5 53 48 89 fb 52 <8b> 07 85 c0 75 11 be 2a 00 00 00 48 c7 c7 38 76 06 a0 e8 32 84
[11040.312869] RIP  [<ffffffffa006615a>] vfio_group_get+0x9/0x27 [vfio]
[11040.388722]  RSP <ffff881f52453d00>
[11040.430282] CR2: 0000000000000000
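
The "Code:" line above supports the same conclusion. In it, the byte in
angle brackets is the one at the faulting RIP, and "vfio_group_get+0x9/0x27"
means offset 0x9 into a function 0x27 bytes long. The sketch below (Python
3.8 or later, helper name split_at_fault() made up for illustration) splits
the dump at that marker. The instruction names in the comments are decoded
by hand from the opcode bytes, not produced by a disassembler.

```python
# "Code:" bytes copied from the oops in this mail.
CODE_LINE = (
    "48 8b 00 48 8b 40 20 48 85 c0 74 0c 55 48 8b 7f 40 48 89 e5 ff d0 "
    "eb 08 48 c7 c0 ea ff ff ff c3 5d c3 55 48 89 e5 53 48 89 fb 52 "
    "<8b> 07 85 c0 75 11 be 2a 00 00 00 48 c7 c7 38 76 06 a0 e8 32 84"
)

FAULT_OFFSET = 0x9  # taken from "vfio_group_get+0x9/0x27"


def split_at_fault(code_line):
    """Hypothetical helper: return (bytes before RIP, bytes from RIP on)."""
    tokens = code_line.split()
    # The token wrapped in <...> marks the byte at the faulting RIP.
    idx = next(i for i, tok in enumerate(tokens) if tok.startswith("<"))
    before = bytes(int(tok, 16) for tok in tokens[:idx])
    at_rip = bytes(int(tok.strip("<>"), 16) for tok in tokens[idx:])
    return before, at_rip


before, at_rip = split_at_fault(CODE_LINE)

# The last FAULT_OFFSET bytes before RIP are the start of vfio_group_get():
#   55        push %rbp
#   48 89 e5  mov  %rsp,%rbp
#   53        push %rbx
#   48 89 fb  mov  %rdi,%rbx
#   52        push %rdx
prologue = before[-FAULT_OFFSET:]
print(prologue.hex(" "))    # prints: 55 48 89 e5 53 48 89 fb 52

# Faulting instruction: 8b 07 is mov (%rdi),%eax, a read through RDI.
print(at_rip[:2].hex(" "))  # prints: 8b 07
```

So the first memory access in vfio_group_get() reads through RDI, and with
RDI = 0 that read faults at address 0, as CR2 shows.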

- Can someone please help me understand the dmar/vfio related function
  calls in the back trace, and why they are getting invoked? I know what
  causes a DMAR error, but I am not sure how this could be happening,
  since none of the devices is managed by VFIO.

- Looking at the source code, it seems dmar_fault_do_one() is called from
  the interrupt handler dmar_fault(). Why is dmar_fault() not part of the
  stack trace?

- What is the significance of the "?" in front of some of the functions
  in the backtrace (e.g. dmar_get_fault_reason())?

Thank you,
Ahmed.