Nowadays, I doubt you could find a single processor being manufactured with an L4 cache.
If you have only a single level of cache, that is by convention your L1 cache; I don't see the point in calling it L4. Unless, that is, there is another underlying question from which this weird one originates (the XY problem).
Since there is a Line Fill Buffer (LFB) associated with the L1 cache, an entry would be allocated in the LFB to track this miss and to assemble the cache line as the data arrives.
The memory controller then forwards the request to the appropriate DRAM chip, which activates the row containing the data and then accesses the proper column.
This part of the procedure is responsible for most of the latency you mentioned.
Since the bus in your scenario transfers 16 bytes at a time, the burst transfer of the 64-byte line happens in 4 separate bus cycles (all part of one request/response transaction) from the DRAM to the processor.
This is not 4 different requests from the processor, rather one request for a cache line from the processor, and 4 separate chunks of data which are assembled later by the LFB and inserted into the L1. (Or assembled by the memory controller and sent over wider busses inside the CPU.)
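To make the arithmetic above concrete, here is a toy sketch (the 64-byte line and 16-byte bus are the numbers from your scenario; the "data" is just a stand-in) of the line fill split into beats and assembled in an LFB-like buffer:

```python
# Numbers from the question's scenario: 64-byte cache line, 16-byte bus.
CACHE_LINE_BYTES = 64
BUS_WIDTH_BYTES = 16

beats = CACHE_LINE_BYTES // BUS_WIDTH_BYTES  # bus cycles per line fill

# Toy model of the LFB entry assembling the line chunk by chunk.
line_fill_buffer = bytearray(CACHE_LINE_BYTES)
for beat in range(beats):
    chunk = bytes([beat] * BUS_WIDTH_BYTES)  # stand-in for data off the bus
    offset = beat * BUS_WIDTH_BYTES
    line_fill_buffer[offset:offset + BUS_WIDTH_BYTES] = chunk

print(beats)  # → 4
```

One request goes out; `beats` chunks come back and are merged into a single line before insertion into the L1.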
These transfers occur in "burst mode": after receiving the initial address, the DRAM automatically sends the sequential chunks, so the memory controller issues just a single command to retrieve the entire cache line. The reason is the high memory access latency (most of the 100 ns in your scenario), which needs to be amortized by transferring more than a single bus width of data per request, data which could later prove useful.
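A back-of-the-envelope calculation shows the amortization. The 100 ns figure is from your scenario; the 2.5 ns per-beat bus cycle time is an assumption I made up purely for illustration:

```python
# Amortizing the fixed access latency over a burst (assumed numbers).
INITIAL_LATENCY_NS = 100.0  # row activate + column access + first word (from the question)
BUS_CYCLE_NS = 2.5          # assumed time per 16-byte beat (hypothetical)
BEATS = 4                   # 64-byte line / 16-byte bus

single_beat_total = INITIAL_LATENCY_NS + BUS_CYCLE_NS          # fetch 16 B only
burst_total = INITIAL_LATENCY_NS + BEATS * BUS_CYCLE_NS        # fetch the whole line

per_byte_single = single_beat_total / 16
per_byte_burst = burst_total / 64

print(f"{per_byte_single:.2f} ns/B vs {per_byte_burst:.2f} ns/B")  # → 6.41 ns/B vs 1.72 ns/B
```

The fixed latency dominates either way, so pulling in the whole line costs barely more than pulling in one bus width, which is exactly why bursts pay off.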
It's not a coincidence that DDR SDRAM's burst size is 64 bytes (or 32 for a chopped burst), the same as the cache line size of typical CPUs. (DDR SDRAM's data width is only 64 bits, so your hypothetical system has a wider data path.)
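The match falls out of the DDR parameters. A sketch of the arithmetic, using the standard DDR3/DDR4 burst length of 8 and burst chop of 4:

```python
# DDR SDRAM burst arithmetic: a 64-bit (8-byte) data interface with
# burst length 8 delivers exactly one 64-byte cache line.
DDR_INTERFACE_BYTES = 8  # 64-bit data bus
BURST_LENGTH = 8         # standard DDR3/DDR4 burst length
BURST_CHOP = 4           # the shortened "burst chop" mode

print(DDR_INTERFACE_BYTES * BURST_LENGTH)  # → 64 bytes, one full cache line
print(DDR_INTERFACE_BYTES * BURST_CHOP)    # → 32 bytes, the short burst
```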