uops.info
https://uops.info seems to be one of the best resources that exist for this. It contains the results of open source microbenchmarks on various Intel and AMD CPUs, notably establishing instruction latency and throughput.
For example, this is the results page for IMUL rax, rax: https://uops.info/html-instr/IMUL_R64_R64.html. If you are interested in AMD Zen 4 CPUs specifically, https://uops.info/html-instr/IMUL_R64_R64.html#ZEN4 tells us that:
- latency: 3, i.e. running a single IMUL instruction takes 3 cycles to produce its result. Note that some instructions have different latencies for different outputs, e.g. MUL produces its RAX output in 3 cycles, but takes 4 cycles to produce its RDX output: https://uops.info/html-instr/MUL_R64.html#ZEN4
- throughput: 1, i.e. if you run a bunch of independent IMUL instructions, the CPU can pipeline them and, best case, execute one IMUL per cycle.
We could also look e.g. at INC https://uops.info/html-instr/INC_R64.html#ZEN4 for comparison:
- latency: 1
- throughput: 0.25, i.e. the CPU can run 4 independent INC per cycle
The minimal experiments below may also help to make these concepts clearer.
Simple experimental setup that may give a meaningful result: determining the latency of MUL
As others have emphasized, it can be hard to determine what is going on due to various microarchitectural features, and due to limits on how precisely your experimental assembly can poke at them without triggering other effects. But here is a simplistic experimental attempt at determining the latency of MUL:
main.c

#include <stdlib.h>
#include <stdint.h>

int main(int argc, char **argv) {
    uint64_t max, i, x0;
    if (argc > 1) {
        max = strtoull(argv[1], NULL, 0);
    } else {
        max = 1;
    }
    i = max;
    x0 = 1;
/* The inline asm uses 64-bit registers, so it only works on x86_64. */
#if defined(__x86_64__)
    __asm__ (
        "mov %[x0], %%rax;"
        "mov $2, %%rbx;"
        "loop:"
        "mul %%rbx;"
        "dec %[i];"
        "jne loop;"
        "mov %%rax, %[x0];"
        : [i] "+r" (i),
          [x0] "+r" (x0)
        :
        : "rax",
          "rbx",
          "rdx"
    );
#endif
    return x0;
}
Also, a quick reminder of how MUL works, doing e.g.:
mul %rbx
multiplies:
RBX * RAX
and stores the result in two fixed registers:
- RDX: top 64 bits (the product of two 64-bit numbers can be up to 128 bits wide)
- RAX: lower 64 bits
Then compile and time it:
gcc -ggdb3 -O0 -std=c99 -Wall -Wextra -pedantic -o main.out main.c
time ./main.out 5000000000
I chose a value of 5 billion loop iterations because I know that this is more or less the clock frequency of my AMD Ryzen 7 7840U CPU (Zen 4 microarchitecture), so the run takes on the order of 1 second, which should average well enough without taking too much time. You can find your CPU frequency on Linux as mentioned at: https://askubuntu.com/questions/218567/any-way-to-check-the-clock-speed-of-my-processor
The result was almost exactly 3 seconds:
real 0m3.014s
user 0m3.010s
sys 0m0.003s
Now, let's remove the "mul %%rbx;" multiplication from the loop to see how long it takes without it. For this variant the loop counts up instead, so i must start at 0 and max must be passed as an extra input operand:

    "loop:"
    "inc %[i];"
    "cmp %[max], %[i];"
    "jb loop;"
I get about 1 s, as expected for a tiny trivial loop running at 1 iteration per cycle:
real 0m1.006s
user 0m1.005s
sys 0m0.001s
This therefore suggests that MUL has a latency of 3 cycles, as many sources report. It is likely not 2 cycles (the naive difference between the runs with and without MUL) because the CPU has separate functional units for simple arithmetic like INC/DEC and for MUL, so under superscalar execution both instructions can run at the same time:
+---------+
| CPU |
| |
| +-----+ |
| | ADD | |
| +-----+ |
| |
| +-----+ |
| | MUL | |
| +-----+ |
| |
+---------+
and then MUL just ends up taking the longest and dominating the overall runtime.
Estimating how many simultaneous MUL the CPU can do
Next, for fun, we can try to estimate how many MUL functional units the CPU has, which determines how many MUL it can do at once.
To do that, let's use assembly of the form:
    __asm__ (
        "mov $2, %%rbx;"
        "loop:"
        "mov %[x0], %%rax;"
        "mul %%rbx;"
        "mov %%rax, %[x0];"
        "mov %[x1], %%rax;"
        "mul %%rbx;"
        "mov %%rax, %[x1];"
        "dec %[i];"
        "jne loop;"
        : [i] "+r" (i),
          [x0] "+r" (x0),
          [x1] "+r" (x1)
        :
        : "rax",
          "rbx",
          "rdx"
    );
Here we will test whether the CPU can run 2 MUL instructions simultaneously.
We add some MOV instructions to ensure that the input of the second MUL, which also uses RAX, does not depend on the output of the first one, so that the CPU gets a chance to run them in parallel.
You may object that they cannot possibly run in parallel since both take input from and output to RAX. But that is not true, because CPUs have register renaming to deal with exactly this type of issue: the CPU notices that the two RAX usages are independent and backs them with different physical registers, even though both nominally modify RAX.
When I benchmark this I get once again about 3 seconds:
real 0m3.091s
user 0m3.087s
sys 0m0.002s
so we conclude that the CPU managed to run two MUL "in parallel", without extra overhead. There are a few possible explanations for this:
1. the CPU has two MUL ALUs, and each MUL executes in either of them over three cycles
2. the CPU has one MUL ALU, and MUL is split into several uops, only one of which uses the MUL ALU. The 3-cycle latency exists because the multiple uops must execute in series, while the MUL-ALU uop itself runs in one cycle. Two instructions manage to run at once because they take turns using the MUL ALU while the other is executing its other uops
3. the MUL ALU itself is pipelined: a MUL uop takes 3 cycles, but 3 of them can be in flight at the same time in different stages
We can try to decide between those possibilities by increasing the number of independent MUL instructions in the loop and seeing how many MULs the CPU can do per second (throughput). I did that using this helper script, and the final plot was:

[plot: wall-clock time vs. number of independent MUL instructions per loop iteration]
This plot excludes possibility 1, in which there are two MUL ALUs each taking 3 cycles, because from the slope of the curve, each MUL that we add to the loop adds 1 second to the runtime. Therefore the CPU is able to run one MUL per cycle on average.
If we had 2 non-pipelined MUL ALUs taking 3 cycles each, throughput would instead average 1 MUL per 1.5 cycles.
It is not easy to decide between possibilities 2 and 3 from this graph alone.
The one thing that is not so neat about the experiment is that we weren't able to run 3 MULs in the same time as 2, even though the 3-cycle latency suggests this should be possible; it seems not to happen due to interference from how the experiment is structured.
Doing this experiment with the simpler IMUL instruction, which can write its output to a single register of our choice and therefore needs no MOVs, e.g.:

    imul $3, %rax, %rax
we obtain instead:

[plot: wall-clock time vs. number of independent IMUL (and INC) instructions per loop iteration]
and the expected 3-instruction plateau is perfectly visible. I've also added an INC instruction for fun.
The following resources contain other reverse engineering efforts:
- https://en.wikichip.org/wiki/amd/microarchitectures/zen_4
- https://chipsandcheese.com/p/amds-zen-4-part-1-frontend-and-execution-engine
- https://uops.info/ measures 1c throughput, 3c latency for mul r64, so one fully-pipelined integer multiply execution unit. https://uops.info/html-tp/ZEN4/MUL_R64-Measurements.html shows the instruction sequences they used to measure throughput: xor rax, rax / mul r8 to break the dependency chain, so there are many short dep chains, not a loop-carried one, and less overhead than 2x mov. (The xor/mul is inside an unrolled loop.) They also do latency measurements from each input to each output separately, and found 3 cycles from RAX input to RAX output, like this answer measures, but 4 cycles from inputs to the RDX high-half output. Perhaps it's written by the second uop, which somehow obtains it from a hidden result of the first uop that widening-mul decodes to?
- https://agner.org/optimize/ (instruction tables) agrees, but didn't measure the RDX output latency.
- https://github.com/InstLatx64/InstLatx64/blob/master/AuthenticAMD/AuthenticAMD0A10F81_K19_StormPeak_01_InstLatX64.txt is a new-enough Zen 4 result for their testing methodology to measure correct throughput numbers for widening-mul, not just bottlenecked by latency on the implicit RAX operand like in early Zen 4 InstLatX64 results. It also confirms uops.info's measurement of an extra cycle of latency to the RDX output.
Tested on Ubuntu 25.04.
Related:
xchg with memory has an implicit lock prefix. All other instructions need a lock prefix to be atomic with respect to observation by other CPUs, but the non-locked version can be useful on uniprocessor systems, which is probably why lock isn't implicit for things like cmpxchg.