Question on mutex code

Davidlohr Bueso dave at stgolabs.net
Sat Mar 14 21:04:32 EDT 2015


On Sun, 2015-03-15 at 01:05 +0200, Matthias Bonne wrote:
> On 03/10/15 15:03, Yann Droneaud wrote:
> > Hi,
> >
> > On Wednesday, 04 March 2015 at 02:13 +0200, Matthias Bonne wrote:
> >
> >> I am trying to understand how mutexes work in the kernel, and I think
> >> there might be a race between mutex_trylock() and mutex_unlock(). More
> >> specifically, the race is between the functions
> >> __mutex_trylock_slowpath and __mutex_unlock_common_slowpath (both
> >> defined in kernel/locking/mutex.c).
> >>
> >> Consider the following sequence of events:
> >>
> [...]
> >>
> >> The end result is that the mutex count is 0 (locked), although the
> >> owner has just released it, and nobody else is holding the mutex. So it
> >> can no longer be acquired by anyone.
> >>
> >> Am I missing something that prevents the above scenario from happening?
> >> If not, should I post a patch that fixes it to LKML? Or is it
> >> considered too "theoretical" and cannot happen in practice?
> >>
> >
> > I haven't looked at your explanations, but you should have come with a
> > reproducible test case to demonstrate the issue (perhaps involving
> > slowing down one CPU?).
> >
> > Anyway, such deep knowledge of the mutex implementation is to be found
> > on lkml.
> >
> > Regards.
> >
> 
> Thank you for your suggestions, and sorry for the long delay.
> 
> I see now that my explanation was unnecessarily complex. The problem is
> this code from __mutex_trylock_slowpath():
> 
>          spin_lock_mutex(&lock->wait_lock, flags);
> 
>          prev = atomic_xchg(&lock->count, -1);
>          if (likely(prev == 1)) {
>                  mutex_set_owner(lock);
>                  mutex_acquire(&lock->dep_map, 0, 1, _RET_IP_);
>          }
> 
>          /* Set it back to 0 if there are no waiters: */
>          if (likely(list_empty(&lock->wait_list)))
>                  atomic_set(&lock->count, 0);
> 
>          spin_unlock_mutex(&lock->wait_lock, flags);
> 
>          return prev == 1;
> 
> The above code assumes that the mutex cannot be unlocked while the
> spinlock is held. However, mutex_unlock() sets the mutex count to 1
> before taking the spinlock (even in the slowpath). If this happens
> between the atomic_xchg() and the atomic_set() above, and the mutex has
> no waiters, then the atomic_set() will set the mutex count back to 0
> after it has been unlocked by mutex_unlock(), but mutex_trylock() will
> still return failure. So the mutex will remain locked forever.
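
(For illustration, here is a deterministic userspace replay of the
interleaving described above, written with C11 atomics. The names and
the omission of the wait_lock/wait_list are my simplifications, not
the kernel's code; since the unlocker stores 1 before it ever takes
the wait_lock, holding the spinlock cannot order these stores anyway.)

/*
 * Replay of the suspected lost-unlock interleaving, in plain C11.
 * count follows the kernel convention: 1 unlocked, 0 locked,
 * negative means locked with possible waiters.
 */
#include <stdatomic.h>
#include <stdio.h>

static atomic_int count;

int main(void)
{
        atomic_store(&count, 0);                /* owner holds the mutex  */

        /* trylock slowpath, step 1: */
        int prev = atomic_exchange(&count, -1); /* prev == 0, will fail   */

        /* unlock slowpath runs here, before trylock's atomic_set(): */
        atomic_store(&count, 1);                /* unlocked... briefly    */

        /* trylock slowpath, step 2 (wait_list is empty): */
        atomic_store(&count, 0);                /* overwrites the unlock  */

        printf("trylock returned %d, count is now %d\n",
               prev == 1, atomic_load(&count));
        return 0;
}

Running this prints "trylock returned 0, count is now 0": the count
says locked, yet nobody owns the mutex, which is the stuck state
described above.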

Good analysis, but not quite accurate, for one simple reason: mutex
trylocks _only_ use fastpaths (which just depend on cmpxchg'ing the
counter to 0), so you never fall back to the slowpath you are
mentioning; thus the race is nonexistent. Please see the arch code.
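
(For reference, the fastpath being referred to boils down to a single
cmpxchg of the count from 1 to 0. The sketch below is a userspace
approximation in C11; the function name is mine, not the kernel's, and
the real implementations live under arch/*/include/asm/ and
include/asm-generic/.)

#include <stdatomic.h>
#include <stdio.h>

static atomic_int count;

/* Sketch of an x86-style trylock fastpath: one cmpxchg, with no
 * fall-through into any slowpath on failure. */
static int mutex_trylock_fastpath(void)
{
        int expected = 1;       /* succeed only if currently unlocked */
        return atomic_compare_exchange_strong(&count, &expected, 0);
}

int main(void)
{
        atomic_store(&count, 1);                           /* unlocked */
        printf("first trylock:  %d\n", mutex_trylock_fastpath()); /* 1 */
        printf("second trylock: %d\n", mutex_trylock_fastpath()); /* 0 */
        return 0;
}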



