Worth Trying to Take this Upstream?

Fri Mar 7 12:13:56 EST 2014

Hi Folks,

I'm currently making my living working with a system derived from RHEL
6.1, complete with proprietary kernel modules. That's based on the
2.6.32 kernel, which is, of course, old news in the Linux world.

I recently encountered and band-aided a rather nasty semi-deadlock
that was being aggravated by some of the proprietary code - but which
appears to me (so far only from reading source code) to be present in
current versions of the Linux kernel. It is, however, hard to tickle
even with our proprietary module, and doubtless harder without it.

The problem, in a nutshell, is that whenever the tasklist_lock is held
for read there's a tiny chance that jiffies won't advance - increasing
the longer the lock is held - and there's code running in softirqs
(potentially interrupting a holder of the tasklist_lock) that relies
on jiffies as a method of limiting how much work it does before
allowing the interrupted task to proceed. 
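
To make the idiom concrete, here is a minimal userspace sketch of a
work loop that uses jiffies as its only stop condition. It is a model,
not kernel code: the jiffies variable, drain_with_time_limit(), and the
"one tick per 100 items" rule are all assumptions made up for
illustration. Only the wraparound-safe comparison follows the kernel's
time_after() macro.

```c
#include <stdbool.h>

/* Stand-in for the kernel's jiffies counter. In this model it only
 * advances when the loop below is told to advance it. */
static volatile unsigned long jiffies;

/* Wraparound-safe "a is after b", same idea as the kernel macro. */
#define time_after(a, b) ((long)((b) - (a)) < 0)

/*
 * Hypothetical softirq-style handler: process queued items until the
 * queue is empty or the time budget (in ticks) has expired.
 * ticks_advance models whether anything is updating jiffies.
 * Returns the number of items processed.
 */
static unsigned long drain_with_time_limit(unsigned long pending,
                                           unsigned long budget_ticks,
                                           bool ticks_advance)
{
    unsigned long end = jiffies + budget_ticks;
    unsigned long done = 0;

    while (pending > 0) {
        if (time_after(jiffies, end))
            break;              /* budget used up, yield */
        pending--;
        done++;
        /* Model assumption: one tick elapses per 100 items. */
        if (ticks_advance && done % 100 == 0)
            jiffies++;
    }
    return done;
}
```

With ticks advancing, the loop gives up after a bounded amount of work.
With jiffies frozen, the time limit never fires and the loop runs until
the work itself is exhausted, which is the failure described above.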

The mechanism is quite straightforward, at least with the combination
of timer options we're using - which, as far as I know, are probably
the RHEL defaults for x86_64. It appears that the job of updating
jiffies is assigned to one particular core. Jiffies won't advance while
that core has local interrupts blocked. Unfortunately, the spinlock
routine used to get the tasklist_lock in write mode leaves interrupts
blocked while spinning. So all we need to get in trouble are two CPUs.
A gets the tasklist_lock in read mode, is then interrupted, and
starts running code that uses jiffies to limit its run time, and
happens to have so much work available that only the time limit is
likely to stop it. B is in charge of updating jiffies; it runs a
thread that calls write_lock_irq(&tasklist_lock). B now spins with
interrupts blocked, so jiffies stops advancing, A's time limit never
expires, and A never gets back to release the lock. The next thing
that happens is that whatever watchdog mechanisms are enabled (and
don't use jiffies ;-)) proceed to go off - in our case 60 seconds later.
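
The two-CPU interaction can be walked through with a small
deterministic simulation. This is a sketch under simplifying
assumptions, not a model of real kernel timing: simulate(), the
ten-step tick period, and the deadline of 2 are all invented for
illustration, and CPU A is assumed to drop the read lock as soon as its
softirq work stops.

```c
#include <stdbool.h>

/*
 * Simulate CPU A (reader, stuck in a jiffies-limited softirq) and
 * CPU B (jiffies updater, spinning for the write lock).
 * irqs_on_while_spinning says whether B can service the timer tick
 * while it waits. Returns the step at which B gets the write lock,
 * or -1 if that never happens within max_steps (the watchdog case).
 */
static long simulate(bool irqs_on_while_spinning, long max_steps)
{
    unsigned long jiffies = 0;
    const unsigned long deadline = 2; /* A stops once jiffies > deadline */
    bool a_holds_read_lock = true;    /* A locked, then got interrupted */

    for (long step = 1; step <= max_steps; step++) {
        /* CPU B: the lock is free only after A has released it. */
        if (!a_holds_read_lock)
            return step;

        /* A timer tick reaches B every 10 steps, but B only updates
         * jiffies if its interrupts are enabled while spinning. */
        if (step % 10 == 0 && irqs_on_while_spinning)
            jiffies++;

        /* CPU A: unlimited softirq work, bounded only by jiffies.
         * Simplification: the reader unlocks right after it ends. */
        if (jiffies > deadline)
            a_holds_read_lock = false;
    }
    return -1;
}
```

In this model, simulate(false, 1000) never makes progress, while
simulate(true, 1000) lets B take the lock shortly after the third tick.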

What I'm asking here (other than feedback on the explanation, if folks
have any) is whether there's any point taking the issue to LKML. The
norm there seems to be to communicate via proposed patches for real
bugs someone's actually encountering on current kernels. And there's
an understandable aversion to patches submitted by unknowns which
touch fundamental core kernel mechanisms. Also, judging by the
reaction of my coworkers to my suggestions for a real fix, the Linux
kernel is seen as very fragile - likely to have code relying on
unintended behaviors, so that a change that's theoretically correct
may expose all kinds of nastiness.  

With a little work, I can determine empirically whether the issue is
potentially present on current kernels. But there's no way I have the
machine farm to make this happen reproducibly on a standard kernel. Our
QA team managed to replicate this 3 times in 8 months, with the help
of some proprietary code using the same idiom of limiting its run time
via jiffies and running in softirq context. And any work I do on this
would be on my own time - management is happy with the proprietary
code being changed to use a different technique to limit its run time. 

The fix I'm inclined to propose is in the reader-writer spinlock code:
re-enable interrupts while spinning. Of all the fixes we considered,
this is the only one that doesn't potentially cost performance or
latency - the only extra work done is while already busy-waiting. 

It's not a general fix for all RW spinlocks - the tasklist_lock has
the useful property that no one ever tries to get it for write with
interrupts already blocked. (They use write_lock_irq(), and never
write_lock_irqsave().) But it is general and targeted at the
mechanism, not the symptom.
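
As a rough illustration of the shape of such a fix, here is a userspace
model written with C11 atomics. It is not a patch and does not use the
kernel's real rwlock implementation: every model_* name, the lock
encoding, and the max_spins bound are hypothetical, and the interrupt
flag is just a variable. It relies on the property noted above, that
the caller never arrives with interrupts already blocked.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Model lock: 0 = free, >0 = number of readers, -1 = held for write. */
typedef struct { atomic_int state; } model_rwlock_t;

static bool model_irqs_enabled = true;   /* stand-in for the CPU flag */
static unsigned long spins_with_irqs_on; /* demo counter only */

static void model_local_irq_disable(void) { model_irqs_enabled = false; }
static void model_local_irq_enable(void)  { model_irqs_enabled = true; }

/* One attempt: succeeds only if there are no readers and no writer. */
static bool model_write_trylock(model_rwlock_t *l)
{
    int expected = 0;
    return atomic_compare_exchange_strong(&l->state, &expected, -1);
}

/*
 * Hypothetical variant of write_lock_irq(): on success it returns with
 * the lock held and interrupts disabled, but between failed attempts
 * it re-enables interrupts so a timer tick could be serviced.
 * max_spins bounds the loop for demonstration; returns false (with
 * interrupts enabled) if the lock was not obtained.
 */
static bool model_write_lock_irq(model_rwlock_t *l, unsigned long max_spins)
{
    for (unsigned long i = 0; i < max_spins; i++) {
        model_local_irq_disable();
        if (model_write_trylock(l))
            return true;
        model_local_irq_enable(); /* window for pending interrupts */
        spins_with_irqs_on++;
    }
    return false;
}
```

The only extra work happens on the failed-attempt path, which matches
the argument that the cost is paid while already busy-waiting.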

Thanks for any advice,

---
Arlie
(Arlie Stephens					arlie at worldash.org)