Created at: 2024-11-12
perf(1): Performance analysis tools for Linux
First make sure you have this set:
# Allow use of almost all events by all users.
# This will be reset if you reboot your computer.
sudo sysctl -w kernel.perf_event_paranoid=-1
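Before changing the setting, it can help to see what the current value is. The sketch below reads it from /proc/sys/kernel/perf_event_paranoid and prints a short description of the level. The function name describe_paranoid is made up for this example, and the descriptions paraphrase the kernel's sysctl documentation rather than quote it.

```shell
#!/bin/sh
# describe_paranoid is a hypothetical helper, not part of perf.
# It maps a kernel.perf_event_paranoid value to a short description.
describe_paranoid() {
    level=$1
    if [ "$level" -le -1 ]; then
        echo "all events allowed for all users"
    elif [ "$level" -eq 0 ]; then
        echo "raw tracepoint access restricted for unprivileged users"
    elif [ "$level" -eq 1 ]; then
        echo "CPU event access also restricted for unprivileged users"
    else
        echo "kernel profiling also restricted for unprivileged users"
    fi
}

# Read the current value if the file exists (Linux only).
f=/proc/sys/kernel/perf_event_paranoid
if [ -r "$f" ]; then
    current=$(cat "$f")
    echo "perf_event_paranoid=$current: $(describe_paranoid "$current")"
fi
```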
# Sample on-CPU functions for the specified command, at 99 Hertz:
perf record -F 99 command
# Sample CPU stack traces (via frame pointers) system-wide for 10 seconds:
perf record -F 99 -a -g -- sleep 10
# Sample CPU stack traces for the PID, using dwarf (dbg info) to unwind stacks:
perf record -F 99 -p PID --call-graph dwarf -- sleep 10
# Record new process events via exec (might need sudo):
perf record -e sched:sched_process_exec -a
# Record context switch events for 10 seconds with stack traces:
perf record -e sched:sched_switch -a -g -- sleep 10
# Sample CPU migrations for 10 seconds:
perf record -e migrations -a -- sleep 10
# Record all CPU migrations for 10 seconds:
perf record -e migrations -a -c 1 -- sleep 10
# Show perf.data as a text report, with data coalesced and counts and
# percentages:
perf report -n --stdio
# List all perf.data events, with data header (recommended):
perf script --header
# Show PMC statistics for the entire system, for 5 seconds:
perf stat -a -- sleep 5
# Show PMC statistics for the command:
perf stat command
# Show CPU last level cache (LLC) statistics for the command:
perf stat -e LLC-loads,LLC-load-misses,LLC-stores,LLC-prefetches command
# Show memory bus throughput system-wide every second:
perf stat -e uncore_imc/data_reads/,uncore_imc/data_writes/ -a -I 1000
# Show the rate of context switches per-second:
perf stat -e sched:sched_switch -a -I 1000
# Show the rate of involuntary context switches per-second (previous state was
# TASK_RUNNING):
perf stat -e sched:sched_switch --filter 'prev_state == 0' -a -I 1000
# Show the rate of mode switches and context switches per second:
perf stat -e cpu_clk_unhalted.ring0_trans,cs -a -I 1000
# Record a scheduler profile for 10 seconds:
perf sched record -- sleep 10
# Show per-process scheduler latency from a scheduler profile:
perf sched latency
# List per-event scheduler latency from a scheduler profile:
perf sched timehist
# Generate a flame graph from a system-wide profile (added in Linux 5.8):
perf record -F 99 -a -g -- sleep 10
perf script report flamegraph
$BROWSER flamegraph.html
# Also available as a single command:
perf script flamegraph -a -F 99 sleep 10 && $BROWSER flamegraph.html
To see a list of the available events that can be counted:
perf list
branch-instructions OR branches [Hardware event]
branch-misses [Hardware event]
bus-cycles [Hardware event]
cache-misses [Hardware event]
cache-references [Hardware event]
[...]
cpu:
L1-dcache-loads OR cpu/L1-dcache-loads/
L1-dcache-load-misses OR cpu/L1-dcache-load-misses/
L1-dcache-stores OR cpu/L1-dcache-stores/
L1-icache-load-misses OR cpu/L1-icache-load-misses/
LLC-loads OR cpu/LLC-loads/
[...]
You can select some of them and pass them to perf stat -e <events> <command>.
The -e flag takes a comma-separated list of the events you want to count.
EVENTS="instructions,cycles,branch-instructions,branch-misses,LLC-loads"
perf stat -e $EVENTS man echo
Performance counter stats for 'man echo':
1,124,851,601 instructions:u # 1.39 insn per cycle
808,224,561 cycles:u
247,251,658 branch-instructions:u
4,341,929 branch-misses:u # 1.76% of all branches
1,905,146 LLC-loads:u
1.615330906 seconds time elapsed
0.221092000 seconds user
0.090197000 seconds sys
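The annotations on the right of that output (insn per cycle, % of all branches) are simple ratios of the raw counts. As a rough sketch, they can be recomputed by hand with awk; the helper name ratio is an assumption made for this example and is not part of perf.

```shell
#!/bin/sh
# ratio is a hypothetical helper: prints scale * numerator / denominator
# with two decimals. The scale defaults to 1; pass 100 for a percentage.
ratio() {
    awk -v n="$1" -v d="$2" -v scale="${3:-1}" 'BEGIN {
        if (d == 0) { print "n/a"; exit }
        printf "%.2f\n", scale * n / d
    }'
}

# Counts copied from the perf stat output above (commas removed).
instructions=1124851601
cycles=808224561
branches=247251658
branch_misses=4341929

echo "IPC: $(ratio "$instructions" "$cycles")"
echo "branch miss rate: $(ratio "$branch_misses" "$branches" 100)%"
```

With the counts above this prints an IPC of 1.39 and a branch miss rate of 1.76%, matching what perf stat reported.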
Some interesting events are:

L1-dcache-load-misses: Level 1 data cache load misses. This gives you a
measure of the memory load caused by the application, after some loads have
been returned from the Level 1 cache. It can be compared with other L1 event
counters to determine cache hit ratio.

LLC-load-misses: Last level cache load misses. Loads that miss the last level
cache go to main memory, so this is a measure of main memory load. The
difference between this and L1-dcache-load-misses gives an idea of the
effectiveness of the CPU caches beyond Level 1, but other counters are needed
for completeness.

dTLB-load-misses: Data translation lookaside buffer load misses. This shows the
effectiveness of the MMU at caching page mappings for the workload, and can
measure the size of the memory workload (working set).
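
As an illustration of comparing L1 counters, the sketch below derives an L1 data cache load hit ratio as 1 - misses / loads. It assumes the counts come from perf stat's CSV mode (perf stat -x,), where the first field is the count and the third is the event name. The helper name l1_hit_ratio and the sample counts are made up for this example.

```shell
#!/bin/sh
# l1_hit_ratio is a hypothetical helper. It reads perf stat CSV output
# (perf stat -x,) on stdin and prints the L1 data cache load hit ratio.
l1_hit_ratio() {
    awk -F, '
        # "+ 0" forces a number, so "<not supported>" becomes 0.
        $3 ~ /^L1-dcache-loads/       { loads = $1 + 0 }
        $3 ~ /^L1-dcache-load-misses/ { misses = $1 + 0 }
        END {
            if (loads > 0)
                printf "%.2f%%\n", 100 * (1 - misses / loads)
            else
                print "n/a"
        }'
}

# Made-up sample counts, for illustration only.
l1_hit_ratio <<'EOF'
1000000,,L1-dcache-loads:u,500000000,100.00,,
25000,,L1-dcache-load-misses:u,500000000,100.00,2.50,of all L1-dcache accesses
EOF
```

For a real measurement, one option is to write the CSV to a file with perf stat -x, -o stats.csv -e L1-dcache-loads,L1-dcache-load-misses command and then run l1_hit_ratio < stats.csv.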