1,083 questions
0
votes
1
answer
64
views
Cache Allocation Technology in 13th Generation Core i9 13900E Intel CPU [closed]
I am trying to test Cache Allocation Technology's impact on my CPU. However, when I use either lscpu to see whether my CPU supports it, or cpuid -l 0x10, the output is false.
How is this possible?
How ...
1
vote
0
answers
76
views
Cache-efficient partitioning for multithreaded processing on ARM
Suppose you are processing a large data set using several cores in parallel. I am looking for the most memory-efficient way to break up the data among the processors.
Specifically, this would be for ...
0
votes
1
answer
87
views
On x86-64 if you modify code at runtime and don't flush the icache, can the old code run indefinitely? [duplicate]
There is a lot of documentation online suggesting that when you modify code at runtime that you should flush the instruction cache. However, it's unclear how much this is about making sure the old ...
6
votes
0
answers
247
views
What is an explanation for the performance characteristics of CLWB when sharing data between cores (Tigerlake)?
Goal
I would like to transfer a 32KiB buffer between two cores (C1 and C2) as fast as possible, performing loads and stores at both cores.
Observation
A simple benchmark is devised: one core performs ...
1
vote
1
answer
118
views
Cache line sizes for AMD Zen 3 Architecture
I wanted to see if I am correctly interpreting the attached diagram.
It shows the AMD Zen 3's cache lines.
OC Fetch is Opcode Cache,
IC Fetch is Instruction Cache.
I am just unable to make sense of ...
1
vote
0
answers
49
views
When to use DMB before invalidating cache
I am trying to understand the invalidating DMA buffers example from page D7-2450 in the Architecture Reference Manual ARMv7-A and ARMv7-R edition.
What is the memory barrier before the second ...
2
votes
1
answer
127
views
CPU cache invalidation control from application - clear cache store queues (?) for x86/x64 architectures (Invalidate data after read, skip write-back)
We have some multimedia processing applications designed as a set of filters for processing data buffers. If the temporary data between filters is not very large and can fit in the L1 or L2/L3 caches - the ...
1
vote
0
answers
55
views
What is the baseline implementation for the "allocate on update" cache policy for an L2 cache
Hello.
I’m analyzing a fully-featured L2 cache with the following properties:
Non-blocking
Write-allocate
Write-back
For simplification: Only full-cacheline stores are allowed (every store allocates ...
-1
votes
1
answer
197
views
In the CHI cache coherence protocol, how does the RN decide which READ transaction to send
In the AMBA CHI cache coherence protocol, the RN sends instructions (to the HN) like ReadClean, ReadNotSharedDirty, ReadShared, ReadUnique, etc. But the CPU has sent only a READ instruction to the RN, so how ...
1
vote
0
answers
82
views
Why does cache-friendly design matter in lock-free queues if threads trash their cache anyway?
I'm trying to understand the practical value of "cache-friendly" design in lock-free queues. I often see people go to great lengths to pad structures, align data, and avoid false sharing — ...
0
votes
2
answers
112
views
Are cache coherence protocols only active when explicitly using certain types in your code?
When I read a description of cache coherence protocols it talks about how separate CPU cores keep track of memory address ranges modified, I think, through methods like bus snooping. The end result of ...
6
votes
0
answers
97
views
How does the caching system in a CPU decide which memory is to be stored in the L1/L2/L3 caches as it is being accessed
I am trying to understand in a general sense how L1/L2/L3 caches are updated and how the updates are propagated in a multi-core x86/x86-64 CPU.
Assuming a 4 core CPU and 2 pairs of L1/L2 caches, where ...
1
vote
0
answers
93
views
Inserting MOV instruction to prefetch data to L1 cache
While implementing tiled matrix multiplication with avx512 and _mm512_i32gather_pd targeting 32 KiB L1 $ and 1 MiB L2 $, I’ve noticed a significant increase in matmul time when the value of kend ...
0
votes
1
answer
90
views
Cost of cache miss and the number of memory round trips
Consider a CPU with 64 bytes (512 bits) cache block size, 16 bytes (128 bits) data bus, and a single level of cache (let's say only L4, and no L1-L3).
How does the CPU fill up the cache? Obviously the ...
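The transfer-count arithmetic implied by this question can be sketched as follows (a minimal illustration, using only the line and bus sizes stated in the excerpt):

```python
# A 64-byte cache line filled over a 16-byte data bus requires
# several back-to-back bus transfers per line fill.
LINE_SIZE = 64   # cache block size in bytes (from the question)
BUS_WIDTH = 16   # data bus width in bytes (from the question)

transfers_per_fill = LINE_SIZE // BUS_WIDTH
print(transfers_per_fill)  # 4 transfers per cache-line fill
```

Whether those four transfers count as one memory round trip or four depends on whether the bus pipelines them as a single burst, which is what the question is really asking.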
0
votes
0
answers
45
views
Monitoring Cache Events Using libpfm
I am working on an Intel Xeon Gold 6338 server running Ubuntu 20.04.6 LTS.
I am trying to monitor cache events when running a certain C workload.
In particular, I am trying to measure the amount of ...
7
votes
1
answer
171
views
Pointer chasing benchmark - unexpected lack of out of order execution?
I wanted a reliable benchmark which has a lot of cache misses, and the obvious go-to seemed to be a pointer chasing benchmark. I ended up trying google/multichase but the results don't seem to be what ...
1
vote
0
answers
68
views
Where is the directory memory of a cache coherence protocol's directory controller stored in a real chip?
In the MESI_Three_Level protocol of GEM5 simulator, there are L0, L1, L2, dir and dma state machines. L0 and L1 cache controllers simulate private caches for processor cores, which are implemented ...
0
votes
1
answer
120
views
How page alignment & page borders affect performance when traversing & reading objects
I ran a test to better understand how pages affect performance when traversing & reading objects in memory but I got the opposite of what I was expecting. I'm hoping someone can explain this ...
-1
votes
1
answer
138
views
If cache invalidation happens every time memory mappings change, why not opt for VIVT?
As far as I know, L1 is VIPT for at least Intel chips. VIVT caches don't depend on address translation, so they can fully operate in parallel with TLB lookup. VIPT can also achieve some parallelism by ...
2
votes
2
answers
112
views
What's the benefit of bringing a frequently-accessed array address into the cache?
Recently I have been looking over someone's Limit Order Book implementation. There is one place where the author left a comment, and I don't quite understand how it is going to benefit performance-wise.
Let me ...
0
votes
0
answers
46
views
Main memory bandwidth measurement for split cache
I have an exercise that asks me to calculate the bandwidth of a CPU with split cache memory for instructions and data. I have the references per second, miss and hit ratio, and the block size for both ...
2
votes
0
answers
128
views
Matrix multiply fastest with -O0 [duplicate]
I timed a fairly naive BLAS-like matrix multiplication (DGEMM) function:
void dgemm_naive(const int M, const int N, const int K, const double alpha,
const double *A, const int lda, ...
3
votes
2
answers
199
views
Interpreting part of an array as an object by casting a pointer to an array element
Say you have an array of floats representing raw data, and several types representing 2D shapes that only have floats as members, like this:
#include <iostream>
#include <array>
struct ...
3
votes
1
answer
94
views
Does INVLPG instruction or mprotect() affect the CPU cache state while invalidating TLB entries?
I am working on some code that involves L1/2/3 cache eviction & TLB entry invalidation. I'm trying to use the INVLPG instruction to invalidate TLB entries and verify some results achieved by ...
1
vote
1
answer
65
views
With the given information about a direct-mapped cache (including a trace and hit/miss status), how do I find the number of tag bits and offset bits?
I am doing a problem set on direct-mapped caches, and I need help finding the number of offset bits and tag bits. I don't know how to calculate the number of tag and offset bits. The solution key ...
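For reference, the general relationship behind such problems can be sketched as below (the cache parameters used are hypothetical, not the ones from this problem set):

```python
import math

def cache_bits(address_bits, block_size, num_sets):
    """Split an address into (tag, index, offset) widths for a
    direct-mapped cache: offset selects the byte within a block,
    index selects the set, and the tag is whatever is left over."""
    offset_bits = int(math.log2(block_size))
    index_bits = int(math.log2(num_sets))
    tag_bits = address_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits

# Hypothetical example: 32-bit addresses, 16-byte blocks, 64 sets
print(cache_bits(32, 16, 64))  # (22, 6, 4)
```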
0
votes
0
answers
52
views
Programmatically determine L1 icache size, line size?
I have already searched for related issues, but all of them are related to dcache. I can't think of a way to determine the characteristics of L1 icache.
1
vote
0
answers
81
views
Why am I seeing two L1D cache misses in a multithreaded setup during read-modify-store operations?
I have a multithreaded setup where one thread acts as a writer and the other as a reader. The writer performs a read-modify-store operation on a shared std::atomic variable, and the reader ...
1
vote
1
answer
283
views
CLDEMOTE support in intel CPUs
I am new to Intel nomenclature, and it is not very clear to me which CPUs support the CLDEMOTE instruction.
The Intel® 64 and IA-32 Architectures Software Developer’s Manual states that CLDEMOTE is supported in
...
0
votes
1
answer
29
views
Collecting cached data to characterize the residing 0/1 bits
I am a 1st year Ph.D. student (Research Assistant). I am trying to increase the transfer rate between cache and DRAM. To do so I am planning to integrate a good compression technique (or some other ...
0
votes
0
answers
169
views
How do I calculate the L3 cache miss rate and find number of trips to main memory using perf?
I'm trying to measure the L3 cache miss rate using the following formula:
I found that LLC misses can be obtained using this perf command from How to catch the L3-cache hits and misses by perf tool ...
2
votes
1
answer
116
views
Python, numpy and the cacheline
I try to follow https://igoro.com/archive/gallery-of-processor-cache-effects/ in python using numpy.
However, it does not work, and I don't quite understand why...
numpy has fixed size dtypes, such as np....
4
votes
0
answers
126
views
How do CPUs evenly occupy a PIPT L2 larger than associativity x page size?
Context
I was researching memcpy bandwidth of different platforms on different buffer sizes, and some of the runs showed much worse results, despite me doing what seemed like an appropriate amount of ...
0
votes
2
answers
155
views
MESI: why do we need a write miss to move from Shared to Modified
The book "Computer Architecture", by Hennessy/Patterson, 6th ed, on page 394, includes an example with true sharing and false sharing misses with 2 processors.
Here is the example from the ...
1
vote
0
answers
65
views
Relationship between memory accesses and instructions in the MIPS architecture
I'm currently studying computer architecture, following the Hennessy-Patterson books (Quantitative Approach 5 and Organization and Design 4), and I want to check if I'm understanding some cache ...
2
votes
0
answers
77
views
Perf event issues with the hardware prefetcher (all_pf_data_rd and pf_l2_data_rd)
My platform is a 2nd-generation scalable Xeon, equipped with a non-inclusive cache. I ran a series of tests that had the L2 stream prefetcher aggressively prefetching.
I use Perf to monitor performance, ...
0
votes
2
answers
264
views
MOESI Protocol: What happens when Owned is dirty and other processors read the line in Shared?
I've been thinking about the "owned" state of the MOESI protocol. So let's say the following situation exists:
P0 has line A in O state.
P1 has line A in S state.
P0 writes to line A in its ...
1
vote
1
answer
130
views
How to store items in the LIFO stack in a cache-friendly manner?
EDIT / DISCLAIMER:
It appears that my understanding of cache lines was wrong, which makes this question not relevant and misleading. I thought that whenever the CPU tries to fetch memory at a specific index, ...
5
votes
1
answer
276
views
Optimization Challenge Due to L1 Cache with Numba
I've been working on optimizing the calculation of differences between elements in NumPy arrays. I have been using Numba for performance improvements, but I get a 100-microsecond jump when the array ...
2
votes
0
answers
128
views
Is it possible to fetch data into the CPU cache while the CPU works on something else?
I wonder if it's possible to improve performance by getting the CPU to load something into the cache while it still works on something else. I'm not very knowledgeable about the inner workings of a CPU and ...
4
votes
0
answers
225
views
How CPUs Use the LOCK Prefix to Implement Cache Locking and ensure memory consistency
In Java, adding the volatile keyword to a variable guarantees memory consistency (or visibility).
On the x86 platform, the Hotspot virtual machine implements volatile variable memory consistency by ...
0
votes
0
answers
130
views
How to check whether the PCIe Memory-mapped BAR region is cacheable or uncacheable
I want to know how to check whether a PCIe memory-mapped BAR region is cacheable or not.
Is there any way to check the configured value? Or is it just configured as uncacheable in hardware?
(I saw ...
1
vote
2
answers
458
views
Are RISC-V SH and SB instructions allowed to communicate with the cache?
Are RISC-V instructions such as sb and sh allowed to access the cache? Or do they communicate directly with main memory? I have seen the Wstrb signal in main memory structures, but generally not in ...
1
vote
1
answer
659
views
Performance implications of aliasing in VIPT cache
What are the performance implications of virtual address synonym (aliasing) in a VIPT cache? I'm specifically interested in recent x86_64 architectures but knowing more about others wouldn't hurt.
...
0
votes
2
answers
176
views
Why do fast memory writes when run over multiple threads take much more time vs when they are run on a single thread?
I have a program which allocates some memory (200 million integers), does some quick compute and then writes the data to the allocated memory.
When run on a single thread the process takes about 1 ...
1
vote
0
answers
49
views
question regarding the behavior of the program in Meltdown attack
I am doing the Meltdown attack lab using Ubuntu 16.04 32-bit, and an old CPU (Intel i5 7th Gen). There is a secret value 83 stored in 0xfbce3000 by a kernel module, the user program cannot directly ...
0
votes
1
answer
514
views
OS cache/memory hierarchy: How does writing to a new file work?
I know how read/load operations are theoretically supposed to work in OSes. A read instruction causes a TLB lookup, then a look through caches, then a look in main memory, and finally a read from disk ...
0
votes
2
answers
584
views
Can there be a cache block with the same Tag-ID in different Sets?
I am currently investigating some previous exams for my CA course.
There is one question which I found really confusing; here is the data to work with:
Considering a 32-bit address (tag 20bits, ...
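The core idea behind this question can be illustrated with a quick sketch: the same tag can legally appear in different sets, because tag and set index are independent fields of the address (the 20-bit tag comes from the excerpt; the index/offset split below is illustrative, since the rest of the question is truncated):

```python
# 32-bit address split into tag | index | offset fields.
# Tag width is from the question; the 7/5 index/offset split is
# an illustrative choice that fills the remaining 12 bits.
TAG_BITS, INDEX_BITS, OFFSET_BITS = 20, 7, 5

def make_addr(tag, index, offset=0):
    """Assemble an address from its cache-relevant fields."""
    return (tag << (INDEX_BITS + OFFSET_BITS)) | (index << OFFSET_BITS) | offset

# Same 20-bit tag, different set indices => two distinct addresses,
# so both blocks can live in the cache at the same time.
a = make_addr(tag=0xABCDE, index=3)
b = make_addr(tag=0xABCDE, index=4)
print(hex(a), hex(b), a != b)
```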
1
vote
0
answers
84
views
why is there a need to stop prefetching to pages when a write happens to it?
I read in this StackOverflow answer that prefetching does not happen for dirty pages.
In which condition DCU prefetcher start prefetching?
It seems to me that the prefetcher is receiving the dirty ...
0
votes
1
answer
174
views
Is it possible for a CPU to have several L3 caches?
Must the cores of a multi-core CPU all share one L3 cache? Is it possible that a CPU has several L3 caches? For example, suppose a CPU has 24 cores, and no three cores share an L3 cache, so there ...
0
votes
1
answer
215
views
Are 64-byte CPU cache line reads aligned on 64-byte boundaries? [duplicate]
CPU cache lines are typically 64-bytes. When a CPU (say modern Intel processor) reads a cache line from memory, does the CPU read from 64-byte aligned blocks of memory, or any contiguous 64-byte block?...
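As a quick sketch of the aligned-block premise the question is probing (assuming the typical 64-byte line it mentions): the line containing an address starts at that address rounded down to a 64-byte boundary, so fills come from naturally aligned blocks rather than arbitrary contiguous 64-byte windows.

```python
LINE = 64  # assumed cache-line size in bytes

def line_base(addr):
    """Base address of the 64-byte-aligned line containing addr."""
    return addr & ~(LINE - 1)

# An arbitrary address maps to a line starting on a 64-byte boundary:
print(hex(line_base(0x1234)))     # 0x1200
print(line_base(0x1234) % LINE)   # 0 -- always 64-byte aligned
```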