How to debug stuck read?

Sun Feb 6 18:21:53 EST 2022

svētd., 2022. g. 6. febr., plkst. 13:01 — lietotājs FMDF
(<fmdefrancesco at gmail.com>) rakstīja:
[...]
> There is no special recipe for debugging "this properly" :)
>
> You wrote that "by pure luck" you found a memcpy() that wrote beyond the
> limit of allocated memory. I suppose that you found that faulty memcpy()
> somewhere in one of the function listed in the stack trace.
>
> That's the right approach! You read the calls chain and find out where something
> looks wrong and then fix it. This is why stack traces are so helpful.
>
> It was not "pure luck". I think that you did what developers usually do after
> decoding a stack trace. If not, how did you find that faulty memcpy() buried
> somewhere in 40 millions lines of code?
>
> it seems that you've found the right way to figure out the problems in code
> that (probably) you had not ever worked on or read before you hit that bug.
>

I think there should be a way to see which locks (and by who/where)
have been taken for a long time. In this case it would be just this
one as no one should take a lock for 10+ mins and such. Here we have
several complications that make figuring this out difficult. Firstly
you can get crash only once per boot. Killing process and reading file
again you won't get any crash at all but process will always get
stuck. Not even dropping caches changes anything. That's why I
couldn't figure out what is the issue because I tried repeating it
several times and never got any new crash in dmesg. Secondly even if
you see some crash in dmesg, your process didn't crash and it's not
clear if it is related to your issue - how to link kernel side crash
with your stuck process. Finally the way how this bug is it was
possible for it to not cause any crash but I'm guessing process
probably wouldn't have stuck then.

> Have you sent a patch to the LKML? If not, please do it.
>
Yeah I did, see patch titled: "[PATCH 2/2] btrfs: prevent copying too
big compressed lzo segment"

svētd., 2022. g. 6. febr., plkst. 23:22 — lietotājs FMDF
(<fmdefrancesco at gmail.com>) rakstīja:
>
> On Sun, Feb 6, 2022 at 1:48 PM Matthew Wilcox <willy at infradead.org> wrote:
> >
> > On Sun, Feb 06, 2022 at 12:01:02PM +0100, FMDF wrote:
[...]
> >
> > I very much doubt that.  The code flow here is:
> >
> > userspace calls read() -> VFS -> btrfs -> block layer -> return to btrfs
> > -> return to VFS, wait for read to complete.  So by the time anyone's
> > looking at the stack trace, all you can see is the part of the call
> > chain in the VFS.  There's no way to see where we went in btrfs, nor
> > in the block layer.  We also can't see from the stack trace what
> > happened with the interrupt which _should have_ cleared the lock bit
> > and didn't.
> >
> OK, I agree. This appears to be is one of those special cases where the mere
> reading of a stack trace cannot help much... :(
>
> My argument is about a general approach to debugging some unknown code
> by just reading the calls chain. Many times I've been able to find out what was
> wrong with code I had never seen before by just following the chain of calls
> in subsystems that I know nothing of (e.g., a bug in "tty" that was reported by
> Syzbot).
>
> In this special case, if the developer doesn't know that "the interrupt [which]
> _should have_ cleared the lock bit and didn't." there is nothing that one can
> deduce from a stack trace.
>
> Here one need to know how things work, well beyond the functions that are
> listed in the trace. So, probably, if one needs a "recipe" for those cases, the
> recipe is just know the subsystem(s) at hand and know how the kernel manages
> interrupts.
>

Yeah, that's why I'm saying it was by luck :P The way how I fixed this
was pretty much accidental byproduct. Basically for this issue all I
saw in dmesg was "invalid lzo header". By looking at that part of code
I didn't see anything suspicious and also issue happened after this so
it wasn't culprit (but did hint it could be something related to
this). Anyway at this point I gave up on it, but I saw that I have one
crash in dmesg so I thought I'll fix that. I didn't thought it's
related to this issue, but once I fixed that this also was solved.

> Actually I haven't deepened this issue but, by reading what Matthew writes,
> I doubt that a faulty memcpy() can be the culprit... Davis, are you really sure
> that you've fixed that bug?
>

Yep, fully sure and tested :P I'm able to reproduce stuck process with
100% reliability. After applying my patch it returns EIO as expected
and doesn't get stuck.
If you look at copy_compressed_segment (in fs/btrfs/lzo.c) you'll see
  kaddr = kmap(cur_page);
  memcpy(dest + *cur_in - orig_in,
                 kaddr + offset_in_page(*cur_in), copy_len);
  kunmap(cur_page);

My guess is that kmap(cur_page) locks that page, then memcpy crashes
so that page never gets unmapped causing anyone that tries to map it
to wait forever. Hence this can be reproduced only once per boot. If I
knew how to find kernel thread that is running this we could check
it's stack and it should be stuck here on kmap.