0 votes
1 answer
64 views

I am trying to test Cache Allocation Technology's impact on my CPU. However, when I use either lscpu to see whether my CPU supports it, or cpuid -l 0x10, the output is false. How is this possible? How ...
asked by Ali Hosseini
1 vote
0 answers
76 views

Suppose you are processing a large data set using several cores in parallel. I am looking for the most memory-efficient way to break up the data among the processors. Specifically, this would be for ...
asked by Zzyzx (53)
0 votes
1 answer
87 views

There is a lot of documentation online suggesting that when you modify code at runtime that you should flush the instruction cache. However, it's unclear how much this is about making sure the old ...
asked by Joseph Garvin
6 votes
0 answers
247 views

Goal: I would like to transfer a 32 KiB buffer between two cores (C1 and C2) as fast as possible, performing loads and stores at both cores. Observation: a simple benchmark is devised: one core performs ...
asked by doliphin (1,044)
1 vote
1 answer
118 views

I wanted to see if I am correctly interpreting the attached diagram. It shows the AMD Zen 3's cache lines. OC Fetch is Opcode Cache, IC Fetch is Instruction Cache. I am just unable to make sense of ...
asked by Kush Jenamani
1 vote
0 answers
49 views

I am trying to understand the invalidating DMA buffers example from page D7-2450 in the Architecture Reference Manual ARMv7-A and ARMv7-R edition. What is the memory barrier before the second ...
asked by 93Iq2Gg2cZtLMO
2 votes
1 answer
127 views

We have some multimedia processing applications designed as a set of filters for processing data buffers. If the temporary data between filters is not very large and can fit in the L1 or L2/L3 caches - the ...
asked by DTL2020 (101)
1 vote
0 answers
55 views

Hello. I'm analyzing a fully-featured L2 cache with the following properties: non-blocking, write-allocate, write-back. For simplification, only full-cacheline stores are allowed (every store allocates ...
asked by Konstantin Kazartsev
-1 votes
1 answer
197 views

In the AMBA CHI cache coherence protocol, the RN sends instructions (to the HN) like ReadClean, ReadNotSharedDirty, ReadShared, ReadUnique, etc. But the CPU has sent only a READ instruction to the RN, so how ...
asked by R71 (4,573)
1 vote
0 answers
82 views

I'm trying to understand the practical value of "cache-friendly" design in lock-free queues. I often see people go to great lengths to pad structures, align data, and avoid false sharing — ...
asked by SpeakX (427)
0 votes
2 answers
112 views

When I read a description of cache coherence protocols it talks about how separate CPU cores keep track of memory address ranges modified, I think, through methods like bus snooping. The end result of ...
asked by Zebrafish (16.3k)
6 votes
0 answers
97 views

I am trying to understand in a general sense how L1/L2/L3 caches are updated and how the updates are propagated in a multi-core x86/x86-64 CPU. Assuming a 4 core CPU and 2 pairs of L1/L2 caches, where ...
asked by Giles Ramoni
1 vote
0 answers
93 views

While implementing tiled matrix multiplication with avx512 and _mm512_i32gather_pd targeting 32 KiB L1 $ and 1 MiB L2 $, I’ve noticed a significant increase in matmul time when the value of kend ...
asked by sridoo (53)
0 votes
1 answer
90 views

Consider a CPU with a 64-byte (512-bit) cache block size, a 16-byte (128-bit) data bus, and a single level of cache (let's say only an L4, and no L1-L3). How does the CPU fill up the cache? Obviously the ...
asked by martian17 (520)
0 votes
0 answers
45 views

I am working on an Intel Xeon Gold 6338 server running Ubuntu 20.04.6 LTS. I am trying to monitor cache events when running a certain C workload. In particular, I am trying to measure the amount of ...
asked by tomercory
7 votes
1 answer
171 views

I wanted a reliable benchmark which has a lot of cache misses, and the obvious go-to seemed to be a pointer chasing benchmark. I ended up trying google/multichase but the results don't seem to be what ...
asked by Box Box Box Box
1 vote
0 answers
68 views

In the MESI_Three_Level protocol of GEM5 simulator, there are L0, L1, L2, dir and dma state machines. L0 and L1 cache controllers simulate private caches for processor cores, which are implemented ...
asked by ben liu (21)
0 votes
1 answer
120 views

I ran a test to better understand how pages affect performance when traversing & reading objects in memory but I got the opposite of what I was expecting. I'm hoping someone can explain this ...
asked by greenlagoon
-1 votes
1 answer
138 views

As far as I know, L1 is VIPT for at least Intel chips. VIVT caches don't depend on address translation, so they can fully operate in parallel with TLB lookup. VIPT can also achieve some parallelism by ...
asked by Devashish (193)
2 votes
2 answers
112 views

Recently I have been looking over someone's Limit Order Book implementation. There is one place where the author left a comment, and I don't quite understand how it is going to benefit performance-wise. Let me ...
asked by Love Cute Shiba
0 votes
0 answers
46 views

I have an exercise that asks me to calculate the bandwidth of a CPU with split cache memory for instructions and data. I have the references per second, miss and hit ratio, and the block size for both ...
asked by papitas (21)
2 votes
0 answers
128 views

I timed a fairly naive BLAS-like matrix multiplication (DGEMM) function: void dgemm_naive(const int M, const int N, const int K, const double alpha, const double *A, const int lda, ...
asked by ligro (29)
3 votes
2 answers
199 views

Say you have an array of floats representing raw data, and several types representing 2D shapes that only have floats as members, like this: #include <iostream> #include <array> struct ...
asked by greenlagoon
3 votes
1 answer
94 views

I am working on some code that involves L1/2/3 cache eviction & TLB entry invalidation. I'm trying to use the INVLPG instruction to invalidate TLB entries and verify some results achieved by ...
asked by Mani (110)
1 vote
1 answer
65 views

I am doing a problem set on direct-mapped caches and need help finding the number of offset bits and tag bits. I don't know how to calculate them. The solution key ...
asked by Chris Wang
0 votes
0 answers
52 views

I have already searched for related issues, but all of them relate to the dcache. I can't think of a way to determine the characteristics of the L1 icache.
asked by boat (23)
1 vote
0 answers
81 views

I have a multithreaded setup where one thread acts as a writer and the other as a reader. The writer performs a read-modify-store operation on a shared std::atomic variable, and the reader ...
1 vote
1 answer
283 views

I am new to Intel nomenclature, and it is not very clear to me which CPUs support the CLDEMOTE instruction. The Intel® 64 and IA-32 Architectures Software Developer's Manual states that CLDEMOTE is supported in ...
asked by Priyanshu Yadav
0 votes
1 answer
29 views

I am a 1st year Ph.D. student (Research Assistant). I am trying to increase the transfer rate between cache and DRAM. To do so I am planning to integrate a good compression technique (or some other ...
asked by Sadman Sakib Akash
0 votes
0 answers
169 views

I'm trying to measure the L3 cache miss rate using the following formula: I found that LLC misses can be obtained using this perf command from How to catch the L3-cache hits and misses by perf tool ...
asked by Sherlock
2 votes
1 answer
116 views

I am trying to follow https://igoro.com/archive/gallery-of-processor-cache-effects/ in Python using numpy. However, it does not work and I don't quite understand why... numpy has fixed-size dtypes, such as np....
asked by Stefan (1,962)
4 votes
0 answers
126 views

Context I was researching memcpy bandwidth of different platforms on different buffer sizes, and some of the runs showed much worse results, despite me doing what seemed like an appropriate amount of ...
asked by aolo2 (101)
0 votes
2 answers
155 views

The book "Computer Architecture", by Hennessy/Patterson, 6th ed., on page 394, includes an example with true-sharing and false-sharing misses with 2 processors. Here is the example from the ...
asked by User710
1 vote
0 answers
65 views

I'm currently studying computer architecture, following the Hennessy-Patterson books (Quantitative Approach 5th ed. and Organization and Design 4th ed.), and I want to check if I'm understanding some cache ...
asked by Paul (515)
2 votes
0 answers
77 views

My platform is 2nd generation scalable Xeon, equipped with a non-inclusive cache. I run a series of tests that had the L2 stream prefetcher aggressively prefetching. I use Perf to monitor performance, ...
asked by grayxu (154)
0 votes
2 answers
264 views

I've been thinking about the "owned" state of the MOESI protocol. So let's say the following situation exists: P0 has line A in O state. P1 has line A in S state. P0 writes to line A in its ...
asked by jkang (579)
1 vote
1 answer
130 views

EDIT / DISCLAIMER: It appears that my understanding of cache lines was wrong, which makes this question irrelevant and misleading. I thought that whenever the CPU tries to fetch memory at a specific index, ...
asked by sleeptightAnsiC
5 votes
1 answer
276 views

I've been working on optimizing the calculation of differences between elements in NumPy arrays. I have been using Numba for performance improvements, but I get a 100-microsecond jump when the array ...
asked by user25656250
2 votes
0 answers
128 views

I wonder if it's possible to improve performance by getting the CPU to load something into cache while it still works on something else. I'm not very knowledgeable about the inner workings of a CPU, and ...
asked by StackOverflowToxicityVictim
4 votes
0 answers
225 views

In Java, adding the volatile keyword to a variable guarantees memory consistency (or visibility). On the x86 platform, the Hotspot virtual machine implements volatile variable memory consistency by ...
asked by Triassic
0 votes
0 answers
130 views

I want to know how to check whether a PCIe memory-mapped BAR region is cacheable or not. Is there any way to check the configured value? Or is it just configured as uncacheable in hardware? (I saw ...
asked by horse-master
1 vote
2 answers
458 views

Are RISC-V instructions such as sb and sh allowed to access the cache? Or do they communicate directly with main memory? I have seen the Wstrb signal in main memory structures, but generally not in ...
asked by Kamer Kırali
1 vote
1 answer
659 views

What are the performance implications of virtual address synonym (aliasing) in a VIPT cache? I'm specifically interested in recent x86_64 architectures but knowing more about others wouldn't hurt. ...
asked by Jason Nordwick
0 votes
2 answers
176 views

I have a program which allocates some memory (200 million integers), does some quick compute and then writes the data to the allocated memory. When run on a single thread the process takes about 1 ...
asked by zcoderz (11)
1 vote
0 answers
49 views

I am doing the Meltdown attack lab using Ubuntu 16.04 32-bit, and an old CPU (Intel i5 7th Gen). There is a secret value 83 stored in 0xfbce3000 by a kernel module, the user program cannot directly ...
asked by Heisenbug
0 votes
1 answer
514 views

I know how read/load operations are theoretically supposed to work in OSes. A read instruction causes a TLB lookup, then a look through caches, then a look in main memory, and finally a read from disk ...
asked by wxz (2,626)
0 votes
2 answers
584 views

I am currently reviewing some previous exams for my CA course. There is one question which I found really confusing; here is the data to work with: considering a 32-bit address (tag 20 bits, ...
asked by wiliam969
1 vote
0 answers
84 views

I read in this StackOverflow answer ("In which condition does the DCU prefetcher start prefetching?") that prefetching does not happen for dirty pages. It seems to me that the prefetcher is receiving the dirty ...
asked by Sai Aravind
0 votes
1 answer
174 views

Must the cores of a multi-core CPU all share the L3 cache? Is it possible that a CPU has several L3 caches? For example, suppose a CPU has 24 cores, and no three cores share an L3 cache, so there ...
asked by 拉克克
0 votes
1 answer
215 views

CPU cache lines are typically 64 bytes. When a CPU (say, a modern Intel processor) reads a cache line from memory, does it read from 64-byte aligned blocks of memory, or from any contiguous 64-byte block?...
