Invoking a system call from within the kernel

Demi Marie Obenour demiobenour at
Sat Nov 18 13:15:27 EST 2017

On Thu, Nov 16, 2017 at 10:54:24AM +0100, Greg KH wrote:
> On Wed, Nov 15, 2017 at 09:16:35PM -0500, Demi Marie Obenour wrote:
> > I am looking to write my first driver.  This driver will create a single
> > character device, which can be opened by any user.  The device will
> > support one ioctl:
> > 
> >         long ioctl_syscall(int fd, long syscall, long args[6]);
> > 
> > This is simply equivalent to:
> > 
> >         syscall(syscall, args[0], args[1], args[2], args[3], args[4],
> >                 args[5]);
> Wait, why?  Why do you want to do something like this, what problem are
> you trying to solve that you feel that something like this is the
> solution?  Let's step back and see if there isn't a better way to do
> this.
You are correct that there is a different problem that I really want to
solve.

Here is the different problem:  I want to have a new device (let's call
it `/dev/async_syscall`), with root:root owner and 0600 permissions.
When the user opens the device, the returned file descriptor can be used
to submit an async syscall request using the following ioctl:

        /* Fixed-size types to avoid a 32-bit compat layer */
        struct linux_async_syscall {
                __u64 syscall;
                __u64 args[6];
                __u64 user1;
                __u64 user2;
        };

        /* arguments is really a struct linux_async_syscall * */
        /* n_syscalls is really a size_t */
        int ioctl(int fd, LINUX_ASYNC_SYSCALL, __u64 n_syscalls,
                  __u64 arguments, __u64 num_succeeded);

Here `arguments` is an array of `struct linux_async_syscall` with
size `n_syscalls`, and `num_succeeded` is a pointer to an `int` that
receives the number of successfully submitted system calls.
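
As a rough illustration of the request layout, here is a minimal userspace
sketch of filling in one array entry for a pread(2) call. It is only a
sketch under stated assumptions: `fill_pread` is a hypothetical helper,
`uint64_t` stands in for `__u64`, the caller is assumed to be 64-bit (so
the file offset fits in one argument), and `user1` is assumed to be an
opaque tag for matching completions, which the message does not spell out.
The ioctl submission itself is left out because no request number for
`LINUX_ASYNC_SYSCALL` is defined here.

```c
#include <assert.h>      /* static_assert (C11) */
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h> /* SYS_pread64 */

/* Mirrors the struct above, with uint64_t standing in for __u64. */
struct linux_async_syscall {
        uint64_t syscall;   /* system call number */
        uint64_t args[6];   /* up to six arguments, widened to 64 bits */
        uint64_t user1;     /* ASSUMPTION: opaque caller data */
        uint64_t user2;     /* ASSUMPTION: opaque caller data */
};

/* Nine 64-bit fields and no padding, so 32-bit and 64-bit callers agree. */
static_assert(sizeof(struct linux_async_syscall) == 9 * sizeof(uint64_t),
              "unexpected padding in struct linux_async_syscall");

/* Hypothetical helper: describe pread(fd, buf, count, offset) as a request. */
void fill_pread(struct linux_async_syscall *req, int fd, void *buf,
                size_t count, uint64_t offset, uint64_t tag)
{
        memset(req, 0, sizeof(*req));           /* unused args stay zero */
        req->syscall = SYS_pread64;
        req->args[0] = (uint64_t)fd;
        req->args[1] = (uint64_t)(uintptr_t)buf; /* pointer carried as u64 */
        req->args[2] = (uint64_t)count;
        req->args[3] = offset;
        req->user1 = tag;
}
```

A caller would build an array of such entries and pass its address as
`arguments`, with the entry count in `n_syscalls`.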

In the kernel, this does the following:

1. Check that the parameters make sense
2. Copy them into kernel memory, and place the memory somewhere where it
   will be freed if the process terminates.
3. For each `struct linux_async_syscall` passed:
   1. Run seccomp filters to ensure that the process can actually make
      the syscall.
   2. Check the syscall against a whitelist of system calls that can be
      made asynchronously.
4. Call the in-kernel implementation of clone(), creating a new
   kernel thread.
5. In the parent, return success if and only if the thread creation was
   successful.
6. In the child, for each `struct linux_async_syscall` passed, invoke
   the system call, as if from userspace.  Upon return, post a message
   to the file descriptor, which the userspace process can then
   retrieve with read(2).
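
The parameter check (step 1) and the whitelist check (step 3.2) above
could look roughly like the following. This is a userspace sketch of the
logic only, not kernel code: the whitelist contents, the
`MAX_ASYNC_BATCH` limit, the function names, and the choice of error
codes are all assumptions made for illustration. Seccomp filtering,
copying from userspace, and thread creation are not modelled.

```c
#include <assert.h>      /* static_assert (C11) */
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h> /* SYS_* numbers */

/* Same layout as the struct in the message, uint64_t in place of __u64. */
struct linux_async_syscall {
        uint64_t syscall;
        uint64_t args[6];
        uint64_t user1;
        uint64_t user2;
};

static_assert(sizeof(struct linux_async_syscall) == 9 * sizeof(uint64_t),
              "unexpected padding in struct linux_async_syscall");

/* ASSUMPTION: an upper bound on batch size; the message gives none. */
#define MAX_ASYNC_BATCH 1024

/* ASSUMPTION: illustrative whitelist; the message does not list one. */
static const uint64_t async_whitelist[] = {
        SYS_read, SYS_write, SYS_pread64, SYS_pwrite64, SYS_fsync,
};

/* Step 3.2: return 1 if the syscall may be made asynchronously. */
int async_syscall_allowed(uint64_t nr)
{
        size_t n = sizeof(async_whitelist) / sizeof(async_whitelist[0]);

        for (size_t i = 0; i < n; i++)
                if (async_whitelist[i] == nr)
                        return 1;
        return 0;
}

/* Steps 1 and 3.2: return 0 if the batch is acceptable, else -errno. */
int check_batch(const struct linux_async_syscall *reqs, uint64_t n)
{
        if (reqs == NULL || n == 0 || n > MAX_ASYNC_BATCH)
                return -EINVAL;
        for (uint64_t i = 0; i < n; i++)
                if (!async_syscall_allowed(reqs[i].syscall))
                        return -EPERM;
        return 0;
}
```

Rejecting the whole batch before creating the thread keeps the parent's
return value meaningful: nothing is started unless every entry passed.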

I am sure there are more optimizations to be made, or possibly an
entirely different and superior approach.
> > and indeed I want it to behave *identically* to that.  That means that
> > ptracers are notified about the syscall (and given the opportunity to
> > update its arguments), and that seccomp_bpf filters are applied.
> > Furthermore, it means that all arguments to the syscall need full
> > validation, as if they came from userspace (because they do).
> > 
> > Is there an in-kernel API that allows one to invoke an arbitrary syscall
> > with arguments AND proper ptrace/seccomp_bpf filtering?  If not, how
> > difficult would it be to create one?
> Wouldn't creating such an interface be more work than just using the
> correct user/kernel interface in the first place?  :)
Yes, it would. :)

However, the ioctl I actually want to implement (see above) does the
system call asynchronously.  That isn’t possible using the existing
user/kernel interface.
> Again, what is the problem you are trying to solve here.
See above :)  Basically, I am trying to improve performance and reduce
complexity of programs that need to do a lot of buffered file I/O.
> thanks,
> greg k-h
Thank you, Greg!

