How to debug stuck read?

Sun Feb 6 16:22:16 EST 2022

On Sun, Feb 6, 2022 at 1:48 PM Matthew Wilcox <willy at infradead.org> wrote:
>
> On Sun, Feb 06, 2022 at 12:01:02PM +0100, FMDF wrote:
> > On Wed, Feb 2, 2022 at 10:50 PM Dāvis Mosāns <davispuh at gmail.com> wrote:
> > >
> > > On Wed, Feb 2, 2022 at 21:13, Matthew Wilcox
> > > (<willy at infradead.org>) wrote:
> > > >
> > > > On Wed, Feb 02, 2022 at 07:15:14PM +0200, Dāvis Mosāns wrote:
> > > > > I have a corrupted file on BTRFS which has CoW disabled thus no
> > > > > checksum. Trying to read this file causes the process to get stuck
> > > > > forever. It doesn't return EIO.
> > > > >
> > > > > How can I find out why it gets stuck?
> > > >
> > > > > $ cat /proc/3449/stack | ./scripts/decode_stacktrace.sh vmlinux
> > > > > folio_wait_bit_common (mm/filemap.c:1314)
> > > > > filemap_get_pages (mm/filemap.c:2622)
> > > > > filemap_read (mm/filemap.c:2676)
> > > > > new_sync_read (fs/read_write.c:401 (discriminator 1))
> > > >
> > > > folio_wait_bit_common() is where it waits for the page to be unlocked.
> > > > Probably the problem is that btrfs isn't unlocking the page on
> > > > seeing the error, so you don't get the -EIO returned?
> > >
> > >
> > > Yeah, but how to find where that happens.
> > > Anyway by pure luck I found memcpy that wrote outside of allocated
> > > memory and fixing that solved this issue but I still don't know how to
> > > debug this properly.
> > >
> > There is no special recipe for debugging "this properly" :)
> >
> > You wrote that "by pure luck" you found a memcpy() that wrote beyond the
> > limit of allocated memory. I suppose that you found that faulty memcpy()
> > somewhere in one of the functions listed in the stack trace.
>
> I very much doubt that.  The code flow here is:
>
> userspace calls read() -> VFS -> btrfs -> block layer -> return to btrfs
> -> return to VFS, wait for read to complete.  So by the time anyone's
> looking at the stack trace, all you can see is the part of the call
> chain in the VFS.  There's no way to see where we went in btrfs, nor
> in the block layer.  We also can't see from the stack trace what
> happened with the interrupt which _should have_ cleared the lock bit
> and didn't.
>
OK, I agree. This appears to be one of those special cases where merely
reading a stack trace cannot help much... :(

My argument is about a general approach to debugging unknown code by just
reading the call chain. Many times I've been able to find out what was wrong
with code I had never seen before by just following the chain of calls in
subsystems that I know nothing about (e.g., a bug in "tty" that was reported
by Syzbot).

In this special case, if the developer doesn't know about "the interrupt
which _should have_ cleared the lock bit and didn't", there is nothing that
one can deduce from a stack trace.
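
To make that concrete for myself, here is a minimal userspace sketch of the
pattern Matthew describes. It is a toy model in Python, not kernel or btrfs
code, and every name in it (Page, complete_read, read_page, demo) is
hypothetical. The reader waits for a "lock bit" that a completion path is
supposed to clear; if the error path forgets to clear it, the reader waits
forever, and its stack would show only the wait, never the place where the
unlock went missing. The timeout exists only so that the demo terminates;
as I understand it, the real wait has none.

```python
import threading


class Page:
    """Toy stand-in for a page-cache page that is 'locked' during I/O."""

    def __init__(self):
        self.unlocked = threading.Event()  # set == lock bit cleared
        self.error = False


def complete_read(page, io_failed, buggy):
    """Plays the role of the I/O completion path (runs in another context)."""
    if io_failed:
        page.error = True
        if buggy:
            return  # assumed bug: the error path forgets to unlock the page
    page.unlocked.set()  # clear the "lock bit" and wake any waiter


def read_page(page, timeout=0.5):
    """Plays the role of the reader waiting for the page to be unlocked."""
    # The timeout is only here so that the demo ends instead of hanging.
    if not page.unlocked.wait(timeout):
        return "stuck"
    return "EIO" if page.error else "ok"


def demo(io_failed, buggy):
    """Run one read with a completion handler in a separate thread."""
    page = Page()
    worker = threading.Thread(target=complete_read,
                              args=(page, io_failed, buggy))
    worker.start()
    result = read_page(page)
    worker.join()
    return result


if __name__ == "__main__":
    print(demo(io_failed=False, buggy=False))  # successful read
    print(demo(io_failed=True, buggy=False))   # failed read, page unlocked
    print(demo(io_failed=True, buggy=True))    # failed read, never unlocked
```

In the third case the only thing a stack dump of the reader could show is the
wait inside read_page(); the missing unlock lives in code that has already
returned.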

Here one needs to know how things work, well beyond the functions that are
listed in the trace. So, probably, if one needs a "recipe" for those cases,
the recipe is just to know the subsystem(s) at hand and how the kernel
manages interrupts.

Actually, I haven't looked into this issue in depth but, from what Matthew
writes, I doubt that a faulty memcpy() can be the culprit... Davis, are you
really sure that you've fixed that bug?

Regards,

Fabio