Question on mutex code
Matthias Bonne
lemonlime51 at gmail.com
Tue Mar 3 19:13:04 EST 2015
Hello,
I am trying to understand how mutexes work in the kernel, and I think
there might be a race between mutex_trylock() and mutex_unlock(). More
specifically, the race is between the functions
__mutex_trylock_slowpath and __mutex_unlock_common_slowpath (both
defined in kernel/locking/mutex.c).
Consider the following sequence of events:
0. Suppose a mutex is locked by task A and has no waiters.
1. Task B calls mutex_trylock().
2. mutex_trylock() calls the architecture-specific
__mutex_fastpath_trylock(), with __mutex_trylock_slowpath() as
fail_fn.
3. According to the description of __mutex_fastpath_trylock() (for
example in include/asm-generic/mutex-dec.h), "if the architecture
has no effective trylock variant, it should call the fail_fn
spinlock-based trylock variant unconditionally". So
__mutex_fastpath_trylock() may now call __mutex_trylock_slowpath() (a
minimal sketch of such a fastpath follows the walkthrough below).
4. Task A releases the mutex.
5. Task B, in __mutex_trylock_slowpath, executes:
/* No need to trylock if the mutex is locked. */
if (mutex_is_locked(lock))
return 0;
Since the mutex is no longer locked, the function continues.
6. Task C, which runs on a different CPU than task B, locks the mutex
again (taking the count from 1 back to 0).
7. Task B, in __mutex_trylock_slowpath(), continues:
spin_lock_mutex(&lock->wait_lock, flags);
prev = atomic_xchg(&lock->count, -1);
if (likely(prev == 1)) {
mutex_set_owner(lock);
mutex_acquire(&lock->dep_map, 0, 1, _RET_IP_);
}
At this point task B holds mutex->wait_lock, prev is 0 (task C has just
locked the mutex and there are no waiters, so the count was 0), and the
mutex count is now -1.
8. Task C calls mutex_unlock() to unlock the mutex.
9. mutex_unlock() calls the architecture-specific
__mutex_fastpath_unlock(), which fails (because the mutex count is
-1), so it now calls __mutex_unlock_slowpath(), which calls
__mutex_unlock_common_slowpath().
10. __mutex_unlock_common_slowpath() sets the mutex count to 1
unconditionally, before spinning on mutex->wait_lock.
11. Task B, in __mutex_trylock_slowpath(), continues:
/* Set it back to 0 if there are no waiters: */
if (likely(list_empty(&lock->wait_list)))
atomic_set(&lock->count, 0);
spin_unlock_mutex(&lock->wait_lock, flags);
return prev == 1;
mutex->wait_list is still empty, so the code sets the mutex count to
zero (which means the mutex is locked), releases mutex->wait_lock,
and returns 0 (which means that the mutex is locked by someone else,
and cannot be acquired).
12. Task C, in __mutex_unlock_common_slowpath(), acquires
mutex->wait_lock, unlocks it immediately (because there are no
waiters to wake up) and returns.
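For reference, the "no effective trylock variant" case mentioned in step 3
boils down to a fastpath helper that simply forwards to fail_fn. A minimal
sketch of what such a helper would look like, assuming the usual asm-generic
signature (my own illustration, not copied from any particular
architecture's header):

static inline int
__mutex_fastpath_trylock(atomic_t *count, int (*fail_fn)(atomic_t *))
{
        /* No cheap atomic trylock available: always take the
         * spinlock-based slowpath, i.e. __mutex_trylock_slowpath(). */
        return fail_fn(count);
}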
The end result of the sequence above is that the mutex count is 0
(locked), even though the owner has just released it and nobody else is
holding it. The mutex can therefore never be acquired again.
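To make this easy to check, here is a small stand-alone program (my own
model, not kernel code: the names and the reduction of both slowpaths to
their operations on ->count are simplifications) that replays the count
transitions in exactly the order of steps 0-12 above and prints the final
state. The count encoding is the one used throughout: 1 means unlocked, 0
means locked with no waiters, negative means locked with possible waiters.

#include <stdatomic.h>
#include <stdio.h>

int main(void)
{
        atomic_int count;

        atomic_init(&count, 0);          /* step 0: locked by task A, no waiters */

        atomic_fetch_add(&count, 1);     /* step 4: task A's fastpath unlock, 0 -> 1 */

        /* step 5: task B's trylock slowpath sees mutex_is_locked() == false
         * and keeps going */

        atomic_fetch_sub(&count, 1);     /* step 6: task C locks the mutex, 1 -> 0 */

        /* step 7: task B, now holding wait_lock */
        int prev = atomic_exchange(&count, -1);   /* prev == 0, count == -1 */

        /* steps 8-10: task C's fastpath unlock fails (the count is negative)
         * and __mutex_unlock_common_slowpath() sets the count to 1 before it
         * can take wait_lock */
        atomic_store(&count, 1);         /* count == 1 (unlocked) */

        /* step 11: task B, still holding wait_lock, sees an empty wait_list
         * and "restores" the count */
        atomic_store(&count, 0);         /* count == 0 (locked!) */
        int acquired = (prev == 1);      /* 0: trylock reports failure */

        /* step 12: task C takes and drops wait_lock; nothing to wake up */

        printf("final count = %d, trylock returned %d\n",
               atomic_load(&count), acquired);
        /* prints "final count = 0, trylock returned 0": the mutex looks
         * locked although nobody holds it */
        return 0;
}

The single-threaded replay cannot show the race itself, of course; it only
shows that, once wait_lock serializes B and C into this particular order,
the operations on the count are enough to end up with count == 0 and no
owner.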
Am I missing something that prevents the above scenario from happening?
If not, should I post a patch that fixes it to LKML? Or would it be
considered too "theoretical" to ever happen in practice?
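In case it helps the discussion: in the toy replay above, the unlock is
lost because task B blindly overwrites the count with 0 after task C's
unlock path has already set it to 1. Within that model only (I am not
claiming this is a sufficient fix for the real code, where other CPUs can
still decrement the count through the lock fastpath while wait_lock is
held), restoring 0 only if the count is still the -1 that B itself
installed keeps C's unlock visible:

        /* instead of:  atomic_store(&count, 0); */
        int expected = -1;
        atomic_compare_exchange_strong(&count, &expected, 0);

With this change the replay ends with count == 1 (unlocked), while in the
non-racing case the count is still -1 at that point, so nothing changes.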
Thanks.