February 24th, 2024
Let's start with a simple C function:
extern int get_int();
int foo(int a) {
    int b = get_int();
    int c = get_int();
    int d = get_int();
    return a * (b + c + d);
}
The assembly code generated by gcc without optimisation looks like this:
foo:
push rbp
mov rbp, rsp
sub rsp, 32
mov DWORD PTR -20[rbp], edi
mov eax, 0
call get_int@PLT
mov DWORD PTR -12[rbp], eax
mov eax, 0
call get_int@PLT
mov DWORD PTR -8[rbp], eax
mov eax, 0
call get_int@PLT
mov DWORD PTR -4[rbp], eax
mov edx, DWORD PTR -12[rbp]
mov eax, DWORD PTR -8[rbp]
add edx, eax
mov eax, DWORD PTR -4[rbp]
add eax, edx
imul eax, DWORD PTR -20[rbp]
leave
ret
We see the prologue:
push rbp
mov rbp, rsp
This pushes the current value of rbp (the base pointer register) onto the stack, then copies the value of the stack pointer register rsp into rbp.
Together these two instructions create a stack frame. Now the function can use stack memory for its own variables without messing up the caller's stack frame.
The body is fairly ordinary assembly code:
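sub rsp, 32: reserves 32 bytes of stack memory.
mov dword ptr [rbp-20], edi: saves int a at rbp-20.
mov eax, 0 & call get_int: fetches int b into eax.
mov dword ptr [rbp-12], eax: saves int b at rbp-12.
mov eax, 0 & call get_int: fetches int c into eax.
mov dword ptr [rbp-8], eax: saves int c at rbp-8.
mov eax, 0 & call get_int: fetches int d into eax.
mov dword ptr [rbp-4], eax: saves int d at rbp-4.
The mov eax, 0 before each call is there because get_int is declared without a prototype. Under the System V AMD64 calling convention, al tells a possibly variadic callee how many vector registers carry arguments, and here that number is zero.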
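Note that the compiler knew we had 4 integers we needed to store in memory.
Integers occupy 4 bytes each, so 4 * 4 = 16 bytes are enough to hold all of them.
The 8 bytes at rbp store the caller's saved base pointer, not the stack pointer, so our locals live below it, from rbp-4 down to rbp-20. gcc leaves the slot at rbp-16 unused.
Because the lowest slot sits 20 bytes below rbp and the stack pointer must stay 16-byte aligned at each call, the reservation is rounded up to 32 bytes.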
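The jump from the 20 bytes spanned by the locals to the 32 bytes reserved by sub rsp, 32 can be checked with a few lines of C. This is only a sketch: it assumes the System V AMD64 rule that rsp is 16-byte aligned at each call, and round_up and frame_size are hypothetical helper names, not anything gcc exposes.

```c
#include <assert.h>
#include <stddef.h>

/* Round `bytes` up to the next multiple of `align`.
   `align` must be a non-zero power of two. */
static size_t round_up(size_t bytes, size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    return (bytes + align - 1) & ~(align - 1);
}

/* Hypothetical helper: size of the reservation when the lowest local
   sits `lowest_offset` bytes below rbp.
   Assumption: the frame is rounded up to a 16-byte boundary. */
static size_t frame_size(size_t lowest_offset) {
    return round_up(lowest_offset, 16);
}
```

With the lowest local at rbp-20, frame_size(20) gives 32, which matches the sub rsp, 32 in the listing.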
mov edx, DWORD PTR -12[rbp]: moves b into edx.
mov eax, DWORD PTR -8[rbp]: moves c into eax.
add edx, eax: adds b + c, leaving the sum in edx.
mov eax, DWORD PTR -4[rbp]: moves d into eax.
add eax, edx: calculates b + c + d into eax.
imul eax, DWORD PTR -20[rbp]: multiplies the sum by a.
Finally, the leave instruction tears down the stack frame. It is the equivalent of:
mov rsp, rbp
pop rbp
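The arithmetic in the function body can also be traced in C, one statement per instruction. This is an illustration rather than compiler output: the variables eax and edx merely stand in for the registers, and foo_traced is a hypothetical name that takes b, c and d as parameters instead of calling get_int.

```c
#include <assert.h>

/* Mirrors the unoptimised instruction sequence step by step.
   Each local variable plays the role of the register it is named after. */
static int foo_traced(int a, int b, int c, int d) {
    int edx = b;   /* mov  edx, DWORD PTR -12[rbp] */
    int eax = c;   /* mov  eax, DWORD PTR -8[rbp]  */
    edx += eax;    /* add  edx, eax   -> b + c     */
    eax = d;       /* mov  eax, DWORD PTR -4[rbp]  */
    eax += edx;    /* add  eax, edx   -> b + c + d */
    eax *= a;      /* imul eax, DWORD PTR -20[rbp] */

    /* The register shuffle computes the original expression. */
    assert(eax == a * (b + c + d));
    return eax;    /* the return value travels back in eax */
}
```

Every operand is loaded from the stack before it is used, which is typical of unoptimised output and is part of what an optimising build would remove.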
The overhead is mostly the cost of creating and destroying the stack frame. Pushing values to memory and popping them back is slower than operating on registers, so every extra memory access adds to the cost of the call.
Additionally, before calling a function, the caller might have to save register values on the stack, because the code in the function itself is free to change the value of certain registers (the volatile, or caller-saved, registers).
Argument passing is also a problem: note how our argument int a was passed in register edi. The caller might have to do some register manipulation to put a into edi if that value is not there already.
The compiler may decide on its own to inline trivial functions. If your function is complex and the compiler didn't inline it for you, you can inline it by hand. This is most practical when your function is only called from a single place.
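As a rough sketch of what inlining by hand looks like, the snippet below shows one caller that goes through foo and one with foo's body pasted in. The get_int here is a hypothetical stand-in that returns 1, 2, 3, ... so the example is self-contained, and caller_with_call, caller_inlined and check_equivalence are made-up names for illustration.

```c
#include <assert.h>

/* Hypothetical stand-in for the external get_int(): returns 1, 2, 3, ... */
static int next_value = 0;
static int get_int(void) { return ++next_value; }

/* Original form: a separate function the compiler may or may not inline. */
static int foo(int a) {
    int b = get_int();
    int c = get_int();
    int d = get_int();
    return a * (b + c + d);
}

/* Caller that pays for a call to foo(). */
static int caller_with_call(int a) {
    return foo(a) + 1;
}

/* Same caller with foo's body pasted in by hand: no call to foo. */
static int caller_inlined(int a) {
    int b = get_int();
    int c = get_int();
    int d = get_int();
    return a * (b + c + d) + 1;
}

/* Quick check that both callers compute the same value. */
static void check_equivalence(void) {
    next_value = 0;
    int via_call = caller_with_call(2);
    next_value = 0;
    int by_hand = caller_inlined(2);
    assert(via_call == by_hand);
}
```

The trade-off is duplication: with a single call site the pasted body replaces the function outright, while with several call sites each copy has to be kept in sync by hand.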