Try/catch for modules?

Fri Oct 18 18:09:54 EDT 2019

Hi Valdis, thanks for the thorough response.

On Fri, Oct 18, 2019 at 18:53, Valdis Klētnieks
(<valdis.kletnieks at vt.edu>) wrote:
> Well..here's the thing.  Unless you have "panic_on_oops" set, hitting a null
> pointer will usually *NOT* panic the whole system. In fact, that #0000 in the
> panic message is a counter of how many times the kernel has OOPs'ed already.
> Way back in the dark mists of time, I had a system that managed to get it up to
> #1500 or so overnight.

Yes, and this is why my horribly hackish fix is to manually tamper
with panic_on_oops from inside a die notifier. I was hoping to find a
way to avoid doing this.

> The most graceful generic thing the kernel can do at that point is kill the execution
> thread that hit the error.  This can quickly go sideways if that thread held a lock
> or similar critical resource.  And no, even though the kernel knows all the locks
> the thread had, it *does not* know which ones, if any, are safe to unlock.

I'd rather have the kernel just return control to me, at the beginning
of the catch block, and give me a chance to fix things (or at least
log some debugging info). I imagine that's what Windows' __except
block is for. The kernel may not know which locks are safe to break,
but I do.
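
As a userspace analogy only (this is not kernel code, and nothing like
it is available to modules), the "return control to the catch block"
behaviour described above can be sketched with sigsetjmp/siglongjmp.
The names guarded_read, make_inaccessible_page, fault_handler,
catch_point and guard_armed are hypothetical, made up for this
illustration. The sketch is single-threaded and assumes a POSIX system
with anonymous mmap.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

static sigjmp_buf catch_point;             /* start of the "catch block" */
static volatile sig_atomic_t guard_armed;  /* 1 only inside the "try" region */

static void fault_handler(int sig)
{
    if (guard_armed)
        siglongjmp(catch_point, 1);        /* unwind back into guarded_read() */
    /* Fault outside the guarded region: restore the default action so the
     * re-executed instruction terminates the process as usual. */
    signal(sig, SIG_DFL);
}

/* Hypothetical demo helper: map one page that faults on any access. */
void *make_inaccessible_page(void)
{
    void *p = mmap(NULL, 4096, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Returns 0 and stores *p in *out, or -1 if reading *p faulted.
 * *out is left untouched when the read faults. */
int guarded_read(const volatile int *p, int *out)
{
    struct sigaction sa, old_segv, old_bus;
    volatile int ret = -1;

    assert(out != NULL);
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = fault_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, &old_segv);
    sigaction(SIGBUS, &sa, &old_bus);      /* some systems report SIGBUS here */

    if (sigsetjmp(catch_point, 1) == 0) {  /* "try" */
        guard_armed = 1;
        *out = *p;                         /* may fault */
        ret = 0;
    } else {                               /* "catch": the fault landed here */
        ret = -1;
    }

    guard_armed = 0;
    sigaction(SIGSEGV, &old_segv, NULL);   /* put the previous handlers back */
    sigaction(SIGBUS, &old_bus, NULL);
    return ret;
}
```

Calling guarded_read() on a valid pointer returns 0; calling it on the
page from make_inaccessible_page() returns -1 and the process keeps
running. Note that the sketch only regains control. It does nothing
about locks or other state held at the time of the fault, which is
exactly the concern raised in the quoted text.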

Whether a kernel left in an unstable state is less desirable than a
panic is debatable on a case-by-case basis and, IMHO, outside the scope
of this discussion.

> And if you actually *think* about it - a 'try/catch' is semantically *identical* to
> coding a parameter test before the event or checking a return code after.

I humbly disagree. Return codes aren't possible in all cases, which is
why there are things like native_read_msr_safe which implement some
form of exception handling through _ASM_EXTABLE.
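
To make the idea behind _ASM_EXTABLE concrete, here is a toy userspace
model of an exception table: each entry pairs the address of an
instruction that is allowed to fault with the address to resume at if
it does. All toy_* names are hypothetical and the layout is simplified
on purpose (as far as I know the real x86 entries store relative
offsets plus handler information), so treat it as a sketch of the
lookup, not the kernel's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_extable_entry {
    uintptr_t insn;    /* address of an instruction that may fault */
    uintptr_t fixup;   /* where to resume execution if it does */
};

enum toy_outcome { TOY_RESUME_AT_FIXUP, TOY_OOPS };

/* Binary search over a table sorted by insn.
 * Returns the fixup address, or 0 when fault_ip has no entry. */
uintptr_t toy_search_extable(const struct toy_extable_entry *tbl, size_t n,
                             uintptr_t fault_ip)
{
    size_t lo = 0, hi = n;

    assert(tbl != NULL || n == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (tbl[mid].insn == fault_ip)
            return tbl[mid].fixup;
        if (tbl[mid].insn < fault_ip)
            lo = mid + 1;
        else
            hi = mid;
    }
    return 0;
}

/* Models the decision taken on a fault: if the faulting instruction was
 * registered in the table, continue at its fixup; otherwise it is an oops. */
enum toy_outcome toy_handle_fault(const struct toy_extable_entry *tbl, size_t n,
                                  uintptr_t fault_ip, uintptr_t *resume_ip)
{
    uintptr_t fixup = toy_search_extable(tbl, n, fault_ip);

    if (fixup == 0)
        return TOY_OOPS;
    *resume_ip = fixup;
    return TOY_RESUME_AT_FIXUP;
}
```

The point of the model is that the "catch" is attached to one specific
instruction that was registered ahead of time, which is why it works
for a single MSR read but does not generalize to wrapping arbitrary
module code.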

> Also - say you have a try/catch around a statement.  For some exceptions, such
> as an end-of-file or a dropped network connection, it's reasonably easy to know
> how to clean up and continue. But what if the statement hits a null pointer
> error.   What do you do to clean things up?   You have a bad pointer, and you
> have *no way to actually fix it and continue normally*.

But then I can choose to let my process die, plus log some useful info
and maybe even do some minor cleanups, without raising a panic. My
particular module just reads some hardware registers and returns the
info to userspace, so it's not something essential for the system. As
a user, I would hate it if a non-essential module crashes the whole
system like that. Perhaps the real problem is that panic_on_oops
affects all of the kernel, rather than a given module.

In any case, I think I already have my answer. Thanks for the response
& discussion.