Unexpected scheduling with mutexes

Fri Mar 29 16:01:58 EDT 2019

On Wed, Mar 27, 2019 at 11:56:51AM +0100, Martin Christian wrote:
> Hi,
> 
> I've written a linux kernel module for an USB device. The USB driver
> provides 2 read-only character devices, which can be opened only
> exclusively by one process:
>  - `/dev/cdev_a`
>  - `/dev/cdev_b`

"exclusive" opening really isn't that, unless you go through _HUGE_
gyrations to try to control this.  It's not worth it in the end, so just
let userspace deal with it.  If it wants to interleave data to a device
node, let it, it can handle the fallout.

As an example of this, serial ports are not "exclusively owned", right?

> The USB device can only handle one request at a time.
> 
> The test setup is a follows:
>  - Processes A reads data from 1st device: `dd if=/dev/cdev_a of=/tmp/a
> bs=X`
>  - Processes B reads data from 2nd device: `dd if=/dev/cdev_b of=/tmp/b
> bs=X`
>  - Process A and B run in parallel
>  - After 10 seconds both processes are killed and size of both output
> files is compared.
> 
> For certain values of `X` there is a significant difference in size
> between the two files, which I don't expect.
> 
> A read call to the driver does the following:
>  1. `mutex_lock_interruptible(iolock)`
>  2. `usb_bulk_msg(dev, pipe, buf, X, timeout)`
>  3. `mutex_unlock(iolock)`
>  4. `copy_to_user(buf)`

What are these values of X that cause differences here?

> What I would expect is the following:
>  1. Proc A: `mutex_lock_interruptible(iolock)`
>  2. Proc A: `usb_bulk_msg(dev, pipe, buf, X, timeout)`
>  3. Scheduling: A -> B
>  4. Proc B: `mutex_lock_interruptible(iolock)` -> blocks
>  5. Scheduling: B -> A
>  6. Proc A: `mutex_unlock(iolock)`
>  7. Proc A: `copy_to_user(buf)`
>  8. Proc A: `mutex_lock_interruptible(iolock)` -> blocks
>  9. Scheduling: A -> B
>  10. Proc B: `usb_bulk_msg(dev, pipe, buf, X, timeout)`
> 
> But what I see with ftrace is that in step 8, process A still continues.
> And it seems that for certain values of X the time inside the critical
> region is a multiple of the time slice, so that process B always gets
> the time slice when the critical region is blocked. What would be a best
> practise solution for this? I was thinking of calling `schedule()` each
> time after copying to user space or playing with nice values or using
> wait_queues?

Step back a second and let me ask what exactly you are trying to solve
here?  If you are just playing around and want to watch mutexes being
grabbed and passed of, that's fine, and fun.  You are getting a good
education with how scheduling works and of course, the hell^Wmess that
USB really is.

But if you are trying to somehow create a real api that you have to
enforce the passing off of writing data from two different character
devices in an interleaved format, you are doing this totally wrong, as
this is not going to work with a simple mutex, as you have found out.

There are so many different variables in play here, that you are trying
to somehow sync up with just a single lock.  You have, just off the top
of my head:
	- scheduling issues of the different userspace programs
	- variability in USB transports (usb_bulk_msg() is just about
	  the most inefficient way of ever sending data on a USB device,
	  you end up wasting loads of time sleeping and waking up and
	  waiting for things to happen, and the USB pipeline is almost
	  totally empty for most of it.)
	- scheduling issues of the USB wakequeue that handles the data
	  to be sent
	- sleeping/memory fault issues when copying data to/from
	  userspace can play a huge factor for some systems.
and I am sure there are more.

A mutex isn't always "fair" in that it instantly gives up and passes
control to someone else who is holding it at the same time.  Sometimes
it can be grabbed by someone else, like the person who just dropped it,
based on a whole raft of factors that have been worked out over the
years to provide a robust and scalable general purpose operating system.

Schedulers too are not always "fair", that depends on a whole raft of
things like what CPU is running when, where your task happens to be
living at the moment, and of course, what else is happening in the
system at the exact same time (I'm sure other things are happening,
right?)  Again, this is all due to the way CPUs work, and how Linux
manages tasks in order to try to keep all resources used best at the
moment.

So, I really haven't answered your question here except to say, "it's
complicated" and "you aren't measuring what you think you are measuring" :)

Try to take USB out of the picture as well as userspace, and try running
two kernel threads trying to grab a mutex and then print out "A" or "B"
to the kernel log and then give it up.  Is that output nicely
interleaved or is there some duplicated messages.[1]

Again, what are you really trying to determine here?  Odds are there is
a better way to do it, given that your above sequence of events is
highly variable for a whole raft of reasons.

thanks,

greg k-h

[1] Extra bonus points for those that recognize this task...