uops.info
https://uops.info seems to be one of the best resources that exist for this. It contains the results of open source microbenchmarks on various Intel and AMD CPUs, notably establishing instruction latency and throughput.
For example, this is the results page for IMUL rax, rax: https://uops.info/html-instr/IMUL_R64_R64.html. If you are interested in AMD Zen 4 CPUs specifically, https://uops.info/html-instr/IMUL_R64_R64.html#ZEN4 tells us that:
- latency: 3, i.e. running a single IMUL instruction takes 3 cycles to produce its result. Note that some instructions have different latencies for different outputs, e.g. MUL produces its RAX output in 3 cycles, but takes 4 cycles to produce its RDX output: https://uops.info/html-instr/MUL_R64.html#ZEN4
- throughput: 1, i.e. if you run a bunch of independent IMUL instructions, the CPU can pipeline them and, best case, execute one IMUL per cycle.
We could also look e.g. at INC https://uops.info/html-instr/INC_R64.html#ZEN4 for comparison:
- latency: 1
- throughput: 0.25, i.e. the CPU can run 4 independent INC per cycle
The minimal experiments below may also help to make these concepts clearer.
Simple experimental setup that may give a meaningful result: determining the latency of MUL
As others have emphasized, it can be hard to determine what is going on due to various microarchitectural features, and due to limits on how precisely your experimental assembly can poke at them without triggering other effects. But here is a simplistic experimental attempt at determining the latency of MUL:
main.c

#include <stdlib.h>
#include <stdint.h>

int main(int argc, char **argv) {
    uint64_t max, i, x0;
    if (argc > 1) {
        max = strtoull(argv[1], NULL, 0);
    } else {
        max = 1;
    }
    i = max;
    x0 = 1;
/* The inline asm uses 64-bit registers, so it only works on x86_64. */
#if defined(__x86_64__)
    __asm__ (
        "mov %[x0], %%rax;"
        "mov $2, %%rbx;"
        "loop:"
        "mul %%rbx;"
        "dec %[i];"
        "jne loop;"
        "mov %%rax, %[x0];"
        : [i] "+r" (i),
          [x0] "+r" (x0)
        :
        : "rax",
          "rbx",
          "rdx"
    );
#endif
    return x0;
}
Also, a quick reminder of how MUL works, doing e.g.:
mul %rbx
multiplies:
RBX * RAX
and stores the result in two fixed registers:
- RDX: top 64 bits (the product of two 64-bit numbers can be up to 128 bits wide)
- RAX: lower 64 bits
Then compile and time it:
gcc -ggdb3 -O0 -std=c99 -Wall -Wextra -pedantic -o main.out main.c
time ./main.out 5000000000
I chose a value of 5 billion loop iterations because I know that this is more or less the clock frequency of my AMD Ryzen 7 7840U CPU (Zen 4 microarchitecture), so the run takes on the order of 1 second, which should average well enough without taking too much time. You can find your CPU frequency on Linux as mentioned at: https://askubuntu.com/questions/218567/any-way-to-check-the-clock-speed-of-my-processor
The result was almost exactly 3 seconds:
real 0m3.014s
user 0m3.010s
sys 0m0.003s
Now, let's remove the "mul %%rbx;" multiplication from the loop to see how long it takes without it. For this variant the loop counts up instead, so i must start at 0 and max must be passed as an extra input operand:

    "loop:"
    "inc %[i];"
    "cmp %[max], %[i];"
    "jb loop;"
I get about 1 s, as expected for a tiny trivial loop running at 1 iteration per cycle:
real 0m1.006s
user 0m1.005s
sys 0m0.001s
This therefore suggests that MUL has a latency of 3 cycles, as many sources report. It is likely not 2 cycles (the naive difference between the runs with and without MUL) because the CPU has separate functional units for simple arithmetic like INC/DEC and for MUL, so under superscalar execution both instructions can run at the same time:
+---------+
| CPU |
| |
| +-----+ |
| | ADD | |
| +-----+ |
| |
| +-----+ |
| | MUL | |
| +-----+ |
| |
+---------+
and then MUL just ends up taking the longest and dominating the overall runtime.
Estimating how many simultaneous MUL the CPU can do
Next, for fun, we can try to estimate how many MUL functional units the CPU has, which determines how many MUL it can do at once.
To do that, let's use assembly of the form:
    __asm__ (
        "mov $2, %%rbx;"
        "loop:"
        "mov %[x0], %%rax;"
        "mul %%rbx;"
        "mov %%rax, %[x0];"
        "mov %[x1], %%rax;"
        "mul %%rbx;"
        "mov %%rax, %[x1];"
        "dec %[i];"
        "jne loop;"
        : [i] "+r" (i),
          [x0] "+r" (x0),
          [x1] "+r" (x1)
        :
        : "rax",
          "rbx",
          "rdx"
    );
Here we will test whether the CPU can run 2 MUL instructions simultaneously.
We add some MOV instructions to ensure that the input of the second MUL, which also uses RAX, does not depend on the output of the first one, so that the CPU gets a chance to run them in parallel.
You may object that they cannot possibly run in parallel since both take input from and output to RAX. But that is not true, because CPUs have register renaming to deal with exactly this type of issue: the CPU notices that the two RAX usages are independent and backs them with different physical registers, even though both nominally modify RAX.
When I benchmark this I get once again about 3 seconds:
real 0m3.091s
user 0m3.087s
sys 0m0.002s
so we conclude that the CPU managed to run two MUL "in parallel", without extra overhead. There are a few possible explanations for this:
1. the CPU has two MUL ALUs, and each MUL executes in either of them over three cycles
2. the CPU has one MUL ALU, and MUL is split into several uops, only one of which uses the MUL ALU. The 3-cycle latency exists because the multiple uops must execute in series, while the MUL-ALU uop itself runs in one cycle. Two instructions manage to run at once because they take turns using the MUL ALU while the other is executing its other uops
3. the MUL ALU itself is pipelined: a MUL uop takes 3 cycles, but 3 of them can be in flight at the same time in different stages
We can try to decide between those possibilities by increasing the number of independent MUL instructions in the loop and seeing how many MULs the CPU can do per second (throughput). I did that using this helper script, and the final plot was:

[plot: wall-clock time vs. number of independent MUL instructions per loop iteration]
This plot excludes possibility 1, in which there are two MUL ALUs each taking 3 cycles, because from the slope of the curve, each MUL that we add to the loop adds 1 second to the runtime. Therefore the CPU is able to run one MUL per cycle on average.
If we had 2 non-pipelined MUL ALUs taking 3 cycles each, throughput would instead average 1 MUL per 1.5 cycles.
It is not easy to decide between possibilities 2 and 3 from this graph alone.
The one thing that is not so neat about the experiment is that we weren't able to run 3 MULs in the same time as 2, even though the 3-cycle latency suggests this should be possible; it seems not to happen due to interference from how the experiment is structured.
Doing this experiment with the simpler IMUL instruction, which can write its output to a single register of our choice and therefore needs no MOVs, e.g.:

    imul $3, %rax, %rax
we obtain instead:

[plot: wall-clock time vs. number of independent IMUL (and INC) instructions per loop iteration]
and the expected 3-instruction plateau is perfectly visible. I've also added an INC instruction for fun.
The following resources contain other reverse engineering efforts:
- https://en.wikichip.org/wiki/amd/microarchitectures/zen_4
- https://chipsandcheese.com/p/amds-zen-4-part-1-frontend-and-execution-engine
- https://uops.info/ measures 1c throughput, 3c latency for mul r64, so one fully-pipelined integer multiply execution unit. https://uops.info/html-tp/ZEN4/MUL_R64-Measurements.html shows the instruction sequences they used to measure throughput: xor rax, rax / mul r8 to break the dependency chain, so there are many short dep chains, not a loop-carried one, and less overhead than 2x mov. (The xor/mul is inside an unrolled loop.) They also do latency measurements from each input to each output separately, and found 3 cycles from RAX input to RAX output, like this answer measures, but 4 cycles from inputs to the RDX high-half output. Perhaps it's written by the second uop, which somehow obtains it from a hidden result of the first uop that widening-mul decodes to?
- https://agner.org/optimize/ (instruction tables) agrees, but didn't measure the RDX output latency.
- https://github.com/InstLatx64/InstLatx64/blob/master/AuthenticAMD/AuthenticAMD0A10F81_K19_StormPeak_01_InstLatX64.txt is a new-enough Zen 4 result for their testing methodology to measure correct throughput numbers for widening-mul, not just bottlenecked by latency on the implicit RAX operand like in early Zen 4 InstLatX64 results. It also confirms uops.info's measurement of an extra cycle of latency to the RDX output.
Tested on Ubuntu 25.04.
Related:
xchg with memory has an implicit lock prefix. All other instructions need a lock prefix to be atomic with respect to observation by other CPUs, but the non-locked version can be useful on uniprocessor systems, which is probably why lock isn't implicit for things like cmpxchg.