Question on mutex code

Tue Mar 10 09:03:59 EDT 2015

Hi,

Le mercredi 04 mars 2015 à 02:13 +0200, Matthias Bonne a écrit :

> I am trying to understand how mutexes work in the kernel, and I think
> there might be a race between mutex_trylock() and mutex_unlock(). More
> specifically, the race is between the functions
> __mutex_trylock_slowpath and __mutex_unlock_common_slowpath (both
> defined in kernel/locking/mutex.c).
> 
> Consider the following sequence of events:
> 
> 0. Suppose a mutex is locked by task A and has no waiters.
> 
> 1. Task B calls mutex_trylock().
> 
> 2. mutex_trylock() calls the architecture-specific
>     __mutex_fastpath_trylock(), with __mutex_trylock_slowpath() as
>     fail_fn.
> 
> 3. According to the description of __mutex_fastpath_trylock() (for
>     example in include/asm-generic/mutex-dec.h), "if the architecture
>     has no effective trylock variant, it should call the fail_fn
>     spinlock-based trylock variant unconditionally". So
>     __mutex_fastpath_trylock() may now call __mutex_trylock_slowpath().
> 
> 4. Task A releases the mutex.
> 
> 5. Task B, in __mutex_trylock_slowpath, executes:
> 
>          /* No need to trylock if the mutex is locked. */
>          if (mutex_is_locked(lock))
>                  return 0;
> 
>     Since the mutex is no longer locked, the function continues.
> 
> 6. Task C, which runs on a different cpu than task B, locks the mutex
>     again.
> 
> 7. Task B, in __mutex_trylock_slowpath(), continues:
> 
>          spin_lock_mutex(&lock->wait_lock, flags);
> 
>          prev = atomic_xchg(&lock->count, -1);
>          if (likely(prev == 1)) {
>                  mutex_set_owner(lock);
>                  mutex_acquire(&lock->dep_map, 0, 1, _RET_IP_);
>          }
> 
> At this point task B holds mutex->wait_lock, prev is 0 (because there
> are no waiters other than task B, so the count was 0) and the mutex
> count is set to -1.
> 
> 5. Task C calls mutex_unlock() to unlock the mutex.
> 
> 6. mutex_unlock() calls the architecture-specific function
>     __mutex_fastpath_unlock(), which fails (because the mutex count is
>     -1), so it now calls __mutex_unlock_slowpath(), which calls
>     __mutex_unlock_common_slowpath().
> 
> 7. __mutex_unlock_common_slowpath() sets the mutex count to 1
>     unconditionally, before spinning on mutex->wait_lock.
> 
> 8. Task B, in __mutex_trylock_slowpath, continues:
> 
>          /* Set it back to 0 if there are no waiters: */
>          if (likely(list_empty(&lock->wait_list)))
>                  atomic_set(&lock->count, 0);
> 
>          spin_unlock_mutex(&lock->wait_lock, flags);
> 
>          return prev == 1;
> 
>     mutex->wait_list is still empty, so the code sets the mutex count to
>     zero (which means the mutex is locked), releases mutex->wait_lock,
>     and returns 0 (which means that the mutex is locked by someone else,
>     and cannot be acquired).
> 
> 9. Task C, in __mutex_unlock_common_slowpath, acquires
>     mutex->wait_lock, unlocks it immediately (because there are no
>     waiters to wake up) and returns.
> 
> The end result is that the mutex count is 0 (locked), although the
> owner has just released it, and nobody else is holding the mutex. So it
> can no longer be acquired by anyone.
> 
> Am I missing something that prevents the above scenario from happening?
> If not, should I post a patch that fixes it to LKML? Or is it
> considered too "theoretical" and cannot happen in practice?
> 

I haven't looked at your explanations, you should have come with a 
reproductible test case to demonstrate the issue (involving slowing 
down one CPU ?).

Anyway, such deep knowledge on the mutex implementation has to be found
on lkml.

Regards.

-- 
Yann Droneaud
OPTEYA