How to debug stuck read?

Sun Feb 6 07:48:09 EST 2022

On Sun, Feb 06, 2022 at 12:01:02PM +0100, FMDF wrote:
> On Wed, Feb 2, 2022 at 10:50 PM Dāvis Mosāns <davispuh at gmail.com> wrote:
> >
> > On Wed, Feb 2, 2022 at 21:13, Matthew Wilcox
> > (<willy at infradead.org>) wrote:
> > >
> > > On Wed, Feb 02, 2022 at 07:15:14PM +0200, Dāvis Mosāns wrote:
> > > > I have a corrupted file on BTRFS which has CoW disabled thus no
> > > > checksum. Trying to read this file causes the process to get stuck
> > > > forever. It doesn't return EIO.
> > > >
> > > > How can I find out why it gets stuck?
> > >
> > > > $ cat /proc/3449/stack | ./scripts/decode_stacktrace.sh vmlinux
> > > > folio_wait_bit_common (mm/filemap.c:1314)
> > > > filemap_get_pages (mm/filemap.c:2622)
> > > > filemap_read (mm/filemap.c:2676)
> > > > new_sync_read (fs/read_write.c:401 (discriminator 1))
> > >
> > > folio_wait_bit_common() is where it waits for the page to be unlocked.
> > > Probably the problem is that btrfs isn't unlocking the page on
> > > seeing the error, so you don't get the -EIO returned?
> >
> >
> > Yeah, but how do I find where that happens?
> > Anyway, by pure luck I found a memcpy that wrote outside of allocated
> > memory, and fixing that solved this issue, but I still don't know how
> > to debug this properly.
> >
> There is no special recipe for debugging "this properly" :)
> 
> You wrote that "by pure luck" you found a memcpy() that wrote beyond the
> limit of allocated memory. I suppose that you found that faulty memcpy()
> somewhere in one of the functions listed in the stack trace.

I very much doubt that.  The code flow here is:

userspace calls read() -> VFS -> btrfs -> block layer -> return to btrfs
-> return to VFS, wait for read to complete.  So by the time anyone's
looking at the stack trace, all you can see is the part of the call
chain in the VFS.  There's no way to see where we went in btrfs, nor
in the block layer.  We also can't see from the stack trace what
happened with the interrupt which _should have_ cleared the lock bit
and didn't.
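
To make that flow concrete, here is a toy userspace model in Python.
It is only a sketch and not kernel code: Page, completion_handler(),
read_page() and the "buggy" flag are names made up for illustration,
and a threading.Event stands in for the page lock bit.  The reader
thread does nothing but wait, much like folio_wait_bit_common(), while
the completion runs in a different context.  If the completion's error
path returns without unlocking, the reader hangs, and nothing on the
reader's stack tells you who failed to unlock.

```python
import threading


class Page:
    """Toy stand-in for a page-cache page (hypothetical, not kernel code)."""

    def __init__(self):
        self.unlocked = threading.Event()  # not set == page is "locked"
        self.error = None


def completion_handler(page, io_failed, buggy):
    """Runs in another context, like the I/O completion path."""
    if io_failed:
        page.error = "EIO"
        if buggy:
            return  # bug: error path returns without unlocking the page
    page.unlocked.set()  # "clear the lock bit" and wake the waiter


def read_page(io_failed, buggy, timeout=0.2):
    """Model of read(): submit I/O, then wait for the page to be unlocked."""
    page = Page()
    # "Submit" the I/O; completion happens asynchronously on another thread.
    worker = threading.Thread(
        target=completion_handler, args=(page, io_failed, buggy)
    )
    worker.start()
    # The reader's stack only ever contains this wait.
    woke = page.unlocked.wait(timeout)
    worker.join()
    if not woke:
        return "stuck"  # a real reader has no timeout and waits forever
    return page.error or "ok"


if __name__ == "__main__":
    print(read_page(io_failed=False, buggy=False))
    print(read_page(io_failed=True, buggy=False))
    print(read_page(io_failed=True, buggy=True))
```

The timeout exists only so the model terminates.  In the kernel the
wait is unbounded, which is why the process sits there forever instead
of getting -EIO.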