STATE: TASK_UNINTERRUPTIBLE (PANIC)

Wed Nov 11 11:25:14 EST 2020

Hi,

I have an issue with the 'bcache' Linux subsystem (block I/O cache). I
hit a kernel panic when using this software, and I've reported that
upstream on the "linux-bcache" mailing list:
https://www.spinics.net/lists/linux-bcache/msg09069.html

I'd like to contribute and learn more about how to debug this myself.
Here is the output from 'crash' on a dumpfile from this panic:

  SYSTEM MAP: /home/marc.smith/Downloads/System.map-esos.prod
DEBUG KERNEL: /home/marc.smith/Downloads/vmlinux-esos.prod (5.4.69-esos.prod)
    DUMPFILE: /home/marc.smith/Downloads/dumpfile-1604062993
        CPUS: 8
        DATE: Fri Oct 30 09:02:56 2020
      UPTIME: 2 days, 12:38:15
LOAD AVERAGE: 9.48, 8.89, 7.69
       TASKS: 980
    NODENAME: node-10cccd-2
     RELEASE: 5.4.69-esos.prod
     VERSION: #1 SMP Thu Oct 22 19:45:11 UTC 2020
     MACHINE: x86_64  (2799 Mhz)
      MEMORY: 24 GB
       PANIC: "Oops: 0002 [#1] SMP NOPTI" (check log for details)
         PID: 18272
     COMMAND: "kworker/2:13"
        TASK: ffff88841d9e8000  [THREAD_INFO: ffff88841d9e8000]
         CPU: 2
       STATE: TASK_UNINTERRUPTIBLE (PANIC)

crash> bt
PID: 18272  TASK: ffff88841d9e8000  CPU: 2   COMMAND: "kworker/2:13"
 #0 [ffffc90000100938] machine_kexec at ffffffff8103d6b5
 #1 [ffffc90000100980] __crash_kexec at ffffffff8110d37b
 #2 [ffffc90000100a48] crash_kexec at ffffffff8110e07d
 #3 [ffffc90000100a58] oops_end at ffffffff8101a9de
 #4 [ffffc90000100a78] no_context at ffffffff81045e99
 #5 [ffffc90000100ae0] async_page_fault at ffffffff81e010cf
    [exception RIP: atomic_try_cmpxchg+2]
    RIP: ffffffff810d3e3b  RSP: ffffc90000100b98  RFLAGS: 00010046
    RAX: 0000000000000000  RBX: 0000000000000003  RCX: 0000000000080006
    RDX: 0000000000000001  RSI: ffffc90000100ba4  RDI: 0000000000000a6c
    RBP: 0000000000000010   R8: 0000000000000001   R9: ffffffffa0418d4e
    R10: ffff88841c8b3000  R11: ffff88841c8b3000  R12: 0000000000000046
    R13: 0000000000000000  R14: ffff8885a3a0a000  R15: 0000000000000a6c
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
 #6 [ffffc90000100b98] _raw_spin_lock_irqsave at ffffffff81cf7d7d
 #7 [ffffc90000100bb8] try_to_wake_up at ffffffff810c1624
 #8 [ffffc90000100c08] closure_sync_fn at ffffffffa040fb07 [bcache]
 #9 [ffffc90000100c10] clone_endio at ffffffff81aac48c
#10 [ffffc90000100c40] call_bio_endio at ffffffff81a78e20
#11 [ffffc90000100c58] raid_end_bio_io at ffffffff81a78e69
#12 [ffffc90000100c88] raid1_end_write_request at ffffffff81a79ad9
#13 [ffffc90000100cf8] blk_update_request at ffffffff814c3ab1
#14 [ffffc90000100d38] blk_mq_end_request at ffffffff814caaf2
#15 [ffffc90000100d50] blk_mq_complete_request at ffffffff814c91c1
#16 [ffffc90000100d78] nvme_complete_cqes at ffffffffa002fb03 [nvme]
#17 [ffffc90000100db8] nvme_irq at ffffffffa002fb7f [nvme]
#18 [ffffc90000100de0] __handle_irq_event_percpu at ffffffff810e0d60
#19 [ffffc90000100e20] handle_irq_event_percpu at ffffffff810e0e65
#20 [ffffc90000100e48] handle_irq_event at ffffffff810e0ecb
#21 [ffffc90000100e60] handle_edge_irq at ffffffff810e494d
#22 [ffffc90000100e78] do_IRQ at ffffffff81e01900
#23 [ffffc90000100eb0] common_interrupt at ffffffff81e00a0a
#24 [ffffc90000100f38] __softirqentry_text_start at ffffffff8200006a
#25 [ffffc90000100fc8] irq_exit at ffffffff810a3f6a
#26 [ffffc90000100fd0] smp_apic_timer_interrupt at ffffffff81e020b2
bt: invalid kernel virtual address: ffffc90000101000  type: "pt_regs"
crash>

Looking at the call trace, the only 'bcache' function in it is
closure_sync_fn() at frame #8 (linux-5.4.69/drivers/md/bcache/closure.c):

static void closure_sync_fn(struct closure *cl)
{
        struct closure_syncer *s = cl->s;
        struct task_struct *p;

        rcu_read_lock();
        p = READ_ONCE(s->task);
        s->done = 1;
        wake_up_process(p);
        rcu_read_unlock();
}

I believe the frames above it in the backtrace (#7 try_to_wake_up and
#6 _raw_spin_lock_irqsave) come from the wake_up_process() call.
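
One detail in the register dump seems to fit this: RDI and R15 both hold
0xa6c, and the oops code 0002 indicates a kernel-mode write to a page that
is not present. If atomic_try_cmpxchg() was handed the address of the
task's lock, a value that small is what you would get when the task
pointer itself is NULL and only the member offset remains. The user-space
sketch below illustrates that arithmetic. 'struct fake_task' and
lock_address() are hypothetical names, and the 0xa6c padding is an
assumption chosen to mirror the register value; I have not verified the
real offset of pi_lock in this kernel's task_struct.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for a large kernel structure.
 * This is NOT the real task_struct layout; the padding size is an
 * assumption picked so the member lands at 0xa6c, the value seen in RDI.
 */
struct fake_task {
        char other_fields[0xa6c];
        int  pi_lock;           /* stand-in for the real lock member */
};

/*
 * Address that would be touched for base->pi_lock, computed with plain
 * integer arithmetic so nothing is dereferenced.
 */
static uintptr_t lock_address(uintptr_t base)
{
        return base + offsetof(struct fake_task, pi_lock);
}
```

With a base of 0, lock_address() returns 0xa6c. In crash,
'struct task_struct.pi_lock -o' should print the real member offset to
compare against the value in RDI.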

Is the panic perhaps because the task/process is already
gone/finished? I'm not sure where to look next. Any help would be
greatly appreciated.

--Marc