February 24th, 2024

Let's start with a simple C function:

extern int get_int();

int foo(int a) {
    int b = get_int();
    int c = get_int();
    int d = get_int();
    return a * (b + c + d);
}

The assembly code generated by gcc without optimisation looks like this:

foo:
  push  rbp
  mov  rbp, rsp
  sub  rsp, 32
  mov  DWORD PTR -20[rbp], edi
  mov  eax, 0
  call  get_int@PLT
  mov  DWORD PTR -12[rbp], eax
  mov  eax, 0
  call  get_int@PLT
  mov  DWORD PTR -8[rbp], eax
  mov  eax, 0
  call  get_int@PLT
  mov  DWORD PTR -4[rbp], eax
  mov  edx, DWORD PTR -12[rbp]
  mov  eax, DWORD PTR -8[rbp]
  add  edx, eax
  mov  eax, DWORD PTR -4[rbp]
  add  eax, edx
  imul  eax, DWORD PTR -20[rbp]
  leave
  ret

We see the prologue:

push rbp
mov rbp, rsp

This pushes the value of rbp (the base pointer register) onto the stack, then copies the value of the stack pointer register rsp into rbp.

This basically creates a stack frame. Now the function can use memory space for its own variables without messing up the caller's stack frame.

Body of foo

In the body we have pretty regular assembly code.

Setting up memory and function arguments

sub rsp, 32: reserves 32 bytes of stack space for the function's variables.
mov dword ptr [rbp-20], edi: saves int a in rbp-20.

Setting up function variables in memory

mov eax, 0 & call get_int: fetches int b into eax.
mov dword ptr [rbp-12], eax: saves int b in rbp-12.
mov eax, 0 & call get_int: fetches int c into eax.
mov dword ptr [rbp-8], eax: saves int c in rbp-8.
mov eax, 0 & call get_int: fetches int d into eax.
mov dword ptr [rbp-4], eax: saves int d in rbp-4.
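A side note on the mov eax, 0 before each call: before C23, a declaration like get_int() with an empty parameter list means "arguments unspecified", so gcc has to treat the call as possibly variadic, and the System V x86-64 convention then expects al to hold the number of vector registers used (here 0). Below is a sketch of the same function with a (void) prototype, which as far as I can tell should make those mov eax, 0 instructions go away. The get_int stub is hypothetical and only there so the example links on its own.

```c
/* Same function as above, but get_int is declared with a (void)
   prototype, so the compiler knows it takes no arguments. */
int get_int(void);

int foo(int a) {
    int b = get_int();
    int c = get_int();
    int d = get_int();
    return a * (b + c + d);
}

/* Hypothetical stub (not from the original code): returns 1, 2, 3, ...
   on successive calls so that foo has something to work with. */
int get_int(void) {
    static int next = 0;
    return ++next;
}
```

Compiling this with gcc -S -masm=intel and comparing against the listing above is an easy way to check the difference on your own compiler version.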

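Note that the compiler knew we had 4 integers to store in memory. Integers occupy 4 bytes each, so we need 4 * 4 = 16 bytes to hold them all. gcc reserves 32 bytes rather than 16 because the slots it picked extend down to rbp-20, and the stack pointer has to stay 16-byte aligned for the calls to get_int.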

The 8 bytes at rbp hold the caller's rbp, which the prologue pushed, so our variables live below it: the slots start at rbp-4 and go down to rbp-20. The slot at rbp-16 is left unused.
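To make the 32 in sub rsp, 32 less mysterious, here is a small sketch of the rounding involved. It assumes the frame size is simply the lowest offset used (20 bytes) rounded up to a multiple of 16; round_up_16 is a name I made up for illustration, and gcc's real frame layout logic is more involved than this.

```c
#include <stddef.h>

/* Rounds n up to the next multiple of 16, the stack alignment the
   System V x86-64 calling convention requires at a call instruction. */
size_t round_up_16(size_t n) {
    return (n + 15) & ~(size_t)15;
}
```

With this, round_up_16(20) gives 32, which matches the value in the listing.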

Calculating the result

mov edx, DWORD PTR -12[rbp]: moves b into edx
mov eax, DWORD PTR -8[rbp]: moves c into eax
add edx, eax: adds b + c
mov eax, DWORD PTR -4[rbp]: moves d into eax
add eax, edx: calculates b + c + d into eax
imul eax, DWORD PTR -20[rbp]: multiplies sum by a.

Epilogue

Finally we execute leave, which is the equivalent of:

mov rsp, rbp
pop rbp

After that, ret pops the return address off the stack and jumps back to the caller.

What is the function call overhead?

The overhead is mostly the cost of creating and destroying the stack frame. push and pop access memory, which is slower than working on registers alone.

Additionally, before calling a function, the caller might have to save register values on the stack, because the callee is free to overwrite certain registers (the volatile, or caller-saved, registers).

Argument passing also has a cost. Note how our argument int a was passed in register edi: the caller might have to shuffle registers to get a into edi if the value is not already there.
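Here is a small sketch of that register shuffling. The names callee and caller are hypothetical, with callee standing in for foo. Under the System V x86-64 calling convention the first two integer arguments arrive in edi and esi, so a caller that forwards its second argument has to copy it into edi first.

```c
/* Hypothetical callee: expects its only argument in edi. */
int callee(int a) {
    return a * 2;
}

/* x arrives in edi and y in esi. To pass y on to callee, the
   compiler has to copy it from esi into edi before the call. */
int caller(int x, int y) {
    (void)x; /* unused, only here to push y into the second slot */
    return callee(y);
}
```

With optimisation enabled the compiler may well inline callee and skip the copy entirely, so this is best observed without optimisation, as with the listing above.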

How to minimise function call overhead?

The compiler might decide on its own to inline trivial functions. If your function is complex and the compiler didn't inline it for you, you can inline it by hand. This is most practical when the function is only called from a single place.
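As a rough illustration, here are the two options side by side. The names sum3, scaled_sum and scaled_sum_manual are made up for this sketch. Keep in mind that static inline is only a hint, and the compiler is still free to emit a real call.

```c
/* Option 1: a small helper marked static inline, which invites the
   compiler to paste its body into each call site. */
static inline int sum3(int b, int c, int d) {
    return b + c + d;
}

int scaled_sum(int a, int b, int c, int d) {
    return a * sum3(b, c, d);
}

/* Option 2: the helper inlined by hand. There is no call to sum3,
   so no call overhead regardless of what the compiler decides. */
int scaled_sum_manual(int a, int b, int c, int d) {
    return a * (b + c + d);
}
```

Both versions compute the same result. The manual one trades a bit of readability for the guarantee that no call is made.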