the cost of inlining?

Thu Dec 4 20:32:35 EST 2014

I hope this isn't too far off topic, but I figure it might be of interest to other kernel developers, and it has me a bit baffled.

The primary benefit of inlining functions is to avoid the cost of making function calls. At least that's how I've understood it.

So I was playing with a bit of sample code:

$ cat atomic_read.c

#include <asm/atomic.h>
#include <asm/system.h>

int samp_atomic_read(atomic_t *v)
{
        int val;

        val = atomic_read(v);
        return val;
}

atomic_read() is declared like so:

static inline int atomic_read(const atomic_t *v)
{
        return v->counter;
}

So I figured that compiling my sample code would result in no function call within samp_atomic_read().

But after I build the above with the following Makefile:

$ cat Makefile
obj-m += atomic_read.o

all:
	$(MAKE) -C /lib/modules/$(shell uname -r)/build M=$(PWD) modules

clean:
	$(MAKE) -C /lib/modules/$(shell uname -r)/build M=$(PWD) clean

and dump the resulting .ko, I get this:

> objdump -S -M intel atomic_read.ko

atomic_read.ko:     file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <samp_atomic_read>:
#include <asm/atomic.h>
#include <asm/system.h>

int samp_atomic_read(atomic_t *v)
{
   0:   55                      push   rbp
   1:   48 89 e5                mov    rbp,rsp
   4:   e8 00 00 00 00          call   9 <samp_atomic_read+0x9>
 *
 * Atomically reads the value of @v.
 */
static inline int atomic_read(const atomic_t *v)
{
        return v->counter;
   9:   8b 07                   mov    eax,DWORD PTR [rdi]
    int val;

        val = atomic_read(v);
        return val;
}
   b:   c9                      leave
   c:   c3                      ret
   d:   90                      nop
   e:   90                      nop
   f:   90                      nop

I think I understand most of it. The first two instructions save the caller's base pointer and set up a new one for samp_atomic_read().

The instruction at offset 9 reads the contents of v->counter into eax to return to the caller.

The instruction at offset 0xb restores the caller's base pointer and stack pointer, and the ret at offset 0xc returns execution to the caller. I am guessing the nops at the end pad the code so that the next function lands on an 8-byte boundary (this is for an x86_64 target).
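
To check that padding guess against the offsets in the listing, here is a small sketch of the round-up arithmetic. The helper names align_up and next_function_offset are made up for this illustration; they are not taken from any kernel header, and the alignment values tried are assumptions rather than anything the toolchain reported.

```c
#include <stddef.h>

/*
 * Round offset up to the next multiple of align.
 * align must be a power of two. align_up is a hypothetical helper
 * used only for this sketch.
 */
static size_t align_up(size_t offset, size_t align)
{
        return (offset + align - 1) & ~(align - 1);
}

/*
 * 0xd is the first byte after the ret at offset 0xc in the listing,
 * i.e. where the nop padding starts.
 */
size_t next_function_offset(size_t align)
{
        return align_up(0xd, align);
}
```

Both next_function_offset(8) and next_function_offset(16) come out to 0x10, which matches the three nops at 0xd through 0xf, so the listing by itself cannot tell an 8-byte alignment from a 16-byte one.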

But what is that call instruction at offset 4 for?

It would seem to accomplish nothing: without it, execution would proceed to the mov at offset 9 as I'd expect, and since no new stack frame gets set up inside atomic_read() itself, the leave/ret returns control to the caller of samp_atomic_read() anyway.

If atomic_read() were a macro, we wouldn't have this seemingly superfluous call instruction.
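
To make the macro comparison concrete, here is a minimal userspace sketch of the two forms side by side. It is not kernel code: my_atomic_t, MY_ATOMIC_READ, my_atomic_read and the two wrapper functions are hypothetical stand-ins invented for this example, and the struct only mirrors the counter field used above.

```c
/*
 * Userspace sketch only. my_atomic_t and both accessors are
 * hypothetical stand-ins, not the kernel's definitions.
 */
typedef struct {
        int counter;
} my_atomic_t;

/* Macro form: expanded textually by the preprocessor. */
#define MY_ATOMIC_READ(v) ((v)->counter)

/* Inline-function form: type-checked, and the compiler decides whether to inline it. */
static inline int my_atomic_read(const my_atomic_t *v)
{
        return v->counter;
}

/* Wrapper that reads the counter through the macro. */
int read_via_macro(const my_atomic_t *v)
{
        return MY_ATOMIC_READ(v);
}

/* Wrapper that reads the counter through the inline function. */
int read_via_inline(const my_atomic_t *v)
{
        return my_atomic_read(v);
}
```

Building this with something like gcc -O2 -S and comparing the assembly for read_via_macro and read_via_inline would show whether the two forms differ outside the kernel build. That should help separate anything caused by inlining from anything the kernel build adds.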

Anybody know why it's there?

Thanks,

Jeff Haran