1,083 questions
0
votes
1
answer
64
views
Cache Allocation Technology in 13th Generation Core i9 13900E Intel CPU [closed]
I am trying to test Cache Allocation Technology's impact on my CPU. However, when I use either lscpu to see whether my CPU supports it, or cpuid -l 0x10, the output is false.
How is this possible?
How ...
1
vote
0
answers
76
views
Cache-efficient partitioning for multithreaded processing on ARM
Suppose you are processing a large data set using several cores in parallel. I am looking for the most memory-efficient way to break up the data among the processors.
Specifically, this would be for ...
0
votes
1
answer
87
views
On x86-64 if you modify code at runtime and don't flush the icache, can the old code run indefinitely? [duplicate]
There is a lot of documentation online suggesting that when you modify code at runtime that you should flush the instruction cache. However, it's unclear how much this is about making sure the old ...
6
votes
0
answers
247
views
What is an explanation for the performance characteristics of CLWB when sharing data between cores (Tigerlake)?
Goal
I would like to transfer a 32KiB buffer between two cores (C1 and C2) as fast as possible, performing loads and stores at both cores.
Observation
A simple benchmark is devised: one core performs ...
1
vote
1
answer
118
views
Cache line sizes for AMD Zen 3 Architecture
I wanted to see if I am correctly interpreting the attached diagram.
It shows the AMD Zen 3's cache lines.
OC Fetch is Opcode Cache,
IC Fetch is Instruction Cache.
I am just unable to make sense of ...
1
vote
0
answers
49
views
When to use DMB before invalidating cache
I am trying to understand the invalidating DMA buffers example from page D7-2450 in the Architecture Reference Manual ARMv7-A and ARMv7-R edition.
What is the memory barrier before the second ...
2
votes
1
answer
127
views
CPU cache invalidation control from application - clear cache store queues (?) for x86/x64 architectures (Invalidate data after read, skip write-back)
We have some multimedia processing applications designed as a set of filters for processing data buffers. If the temporary data between filters is not very large and can fit in the L1 or L2/L3 caches - the ...
1
vote
0
answers
55
views
What is the baseline implementation for the "allocate on update" cache policy for an L2 cache
Hello.
I’m analyzing a fully-featured L2 cache with the following properties:
Non-blocking
Write-allocate
Write-back
For simplification: Only full-cacheline stores are allowed (every store allocates ...
-1
votes
1
answer
197
views
In the CHI cache coherence protocol, how does the RN decide which READ transaction to send
In the AMBA CHI cache coherence protocol, the RN sends instructions (to the HN) like ReadClean, ReadNotSharedDirty, ReadShared, ReadUnique, etc. But the CPU has sent only a READ instruction to the RN, so how ...
1
vote
0
answers
82
views
Why does cache-friendly design matter in lock-free queues if threads trash their cache anyway?
I'm trying to understand the practical value of "cache-friendly" design in lock-free queues. I often see people go to great lengths to pad structures, align data, and avoid false sharing — ...
0
votes
2
answers
112
views
Are cache coherence protocols only active when explicitly using certain types in your code?
When I read a description of cache coherence protocols it talks about how separate CPU cores keep track of memory address ranges modified, I think, through methods like bus snooping. The end result of ...
6
votes
0
answers
97
views
How does the caching system in a CPU decide which memory is to be stored in the L1/L2/L3 caches as it is being accessed
I am trying to understand in a general sense how L1/L2/L3 caches are updated and how the updates are propagated in a multi-core x86/x86-64 CPU.
Assuming a 4 core CPU and 2 pairs of L1/L2 caches, where ...
1
vote
0
answers
93
views
Inserting MOV instruction to prefetch data to L1 cache
While implementing tiled matrix multiplication with avx512 and _mm512_i32gather_pd targeting 32 KiB L1 $ and 1 MiB L2 $, I’ve noticed a significant increase in matmul time when the value of kend ...
0
votes
1
answer
90
views
Cost of cache miss and the number of memory round trips
Consider a CPU with 64 bytes (512 bits) cache block size, 16 bytes (128 bits) data bus, and a single level of cache (let's say only L4, and no L1-L3).
How does the CPU fill up the cache? Obviously the ...
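The transfer-count arithmetic implied by this question can be sketched as follows (a minimal illustration, using only the line and bus sizes stated in the excerpt):

```python
# A 64-byte cache line filled over a 16-byte data bus requires
# several back-to-back bus transfers per line fill.
LINE_SIZE = 64   # cache block size in bytes (from the question)
BUS_WIDTH = 16   # data bus width in bytes (from the question)

transfers_per_fill = LINE_SIZE // BUS_WIDTH
print(transfers_per_fill)  # 4 transfers per cache-line fill
```

Whether those four transfers count as one memory round trip or four depends on whether the bus pipelines them as a single burst, which is what the question is really asking.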
0
votes
0
answers
45
views
Monitoring Cache Events Using libpfm
I am working on an Intel Xeon Gold 6338 server running Ubuntu 20.04.6 LTS.
I am trying to monitor cache events when running a certain C workload.
In particular, I am trying to measure the amount of ...
7
votes
1
answer
171
views
Pointer chasing benchmark - unexpected lack of out of order execution?
I wanted a reliable benchmark which has a lot of cache misses, and the obvious go-to seemed to be a pointer chasing benchmark. I ended up trying google/multichase but the results don't seem to be what ...
1
vote
0
answers
68
views
Where is the directory memory of a cache coherence protocol's directory controller stored in a real chip?
In the MESI_Three_Level protocol of GEM5 simulator, there are L0, L1, L2, dir and dma state machines. L0 and L1 cache controllers simulate private caches for processor cores, which are implemented ...
0
votes
1
answer
120
views
How page alignment & page borders affect performance when traversing & reading objects
I ran a test to better understand how pages affect performance when traversing & reading objects in memory but I got the opposite of what I was expecting. I'm hoping someone can explain this ...
-1
votes
1
answer
138
views
If cache invalidation happens every time memory mappings change, why not opt for VIVT?
As far as I know, L1 is VIPT for at least Intel chips. VIVT caches don't depend on address translation, so they can fully operate in parallel with TLB lookup. VIPT can also achieve some parallelism by ...
2
votes
2
answers
112
views
What's the benefit of bringing a frequently-accessed array address into the cache?
Recently I have been looking over someone's Limit Order Book implementation. There is one place where the author left a comment, and I don't quite understand how it is going to benefit performance-wise.
Let me ...
0
votes
0
answers
46
views
Main memory bandwidth measurement for split cache
I have an exercise that asks me to calculate the bandwidth of a CPU with split cache memory for instructions and data. I have the references per second, miss and hit ratio, and the block size for both ...
2
votes
0
answers
128
views
Matrix multiply fastest with -O0 [duplicate]
I timed a fairly naive BLAS-like matrix multiplication (DGEMM) function:
void dgemm_naive(const int M, const int N, const int K, const double alpha,
const double *A, const int lda, ...
3
votes
2
answers
199
views
Interpreting part of an array as an object by casting a pointer to an array element
Say you have an array of floats representing raw data, and several types representing 2D shapes that only have floats as members, like this:
#include <iostream>
#include <array>
struct ...
3
votes
1
answer
94
views
Does INVLPG instruction or mprotect() affect the CPU cache state while invalidating TLB entries?
I am working on some code that involves L1/2/3 cache eviction & TLB entry invalidation. I'm trying to use the INVLPG instruction to invalidate TLB entries and verify some results achieved by ...
1
vote
1
answer
65
views
With the given information about a direct-mapped cache (including a trace and hit/miss status), how do I find the number of tag bits and offset bits?
I am doing a problem set on direct-mapped caches, and I need help finding the number of offset bits and tag bits. I don't know how to calculate the number of tag and offset bits. The solution key ...
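For reference, the general relationship behind such problems can be sketched as below (the cache parameters used are hypothetical, not the ones from this problem set):

```python
import math

def cache_bits(address_bits, block_size, num_sets):
    """Split an address into (tag, index, offset) widths for a
    direct-mapped cache: offset selects the byte within a block,
    index selects the set, and the tag is whatever is left over."""
    offset_bits = int(math.log2(block_size))
    index_bits = int(math.log2(num_sets))
    tag_bits = address_bits - index_bits - offset_bits
    return tag_bits, index_bits, offset_bits

# Hypothetical example: 32-bit addresses, 16-byte blocks, 64 sets
print(cache_bits(32, 16, 64))  # (22, 6, 4)
```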
0
votes
0
answers
52
views
Programmatically determine L1 icache size, line size?
I have already searched for related issues, but all of them are related to dcache. I can't think of a way to determine the characteristics of L1 icache.
1
vote
0
answers
81
views
Why am I seeing two L1D cache misses in a multithreaded setup during read-modify-store operations?
I have a multithreaded setup where one thread acts as a writer and the other as a reader. The writer performs a read-modify-store operation on a shared std::atomic variable, and the reader ...
1
vote
1
answer
283
views
CLDEMOTE support in intel CPUs
I am new to Intel nomenclature, and it is not very clear to me which CPUs support the CLDEMOTE instruction.
The Intel® 64 and IA-32 Architectures Software Developer’s Manual states that CLDEMOTE is supported in
...
0
votes
1
answer
29
views
Collecting cached data to characterize the residing 0/1 bits
I am a 1st year Ph.D. student (Research Assistant). I am trying to increase the transfer rate between cache and DRAM. To do so I am planning to integrate a good compression technique (or some other ...
0
votes
0
answers
169
views
How do I calculate the L3 cache miss rate and find number of trips to main memory using perf?
I'm trying to measure the L3 cache miss rate using the following formula:
I found that LLC misses can be obtained using this perf command from How to catch the L3-cache hits and misses by perf tool ...
2
votes
1
answer
116
views
Python, numpy and the cacheline
I try to follow https://igoro.com/archive/gallery-of-processor-cache-effects/ in python using numpy.
However, it does not work, and I don't quite understand why...
numpy has fixed size dtypes, such as np....
4
votes
0
answers
126
views
How do CPUs evenly occupy a PIPT L2 larger than associativity x page size?
Context
I was researching memcpy bandwidth of different platforms on different buffer sizes, and some of the runs showed much worse results, despite me doing what seemed like an appropriate amount of ...
0
votes
2
answers
155
views
MESI: why do we need a write miss to move from Shared to Modified
The book "Computer Architecture", by Hennessy/Patterson, 6th ed, on page 394, includes an example with true sharing and false sharing misses with 2 processors.
Here is the example from the ...
1
vote
0
answers
65
views
Relationship between memory accesses and instructions in the MIPS architecture
I'm currently studying computer architecture, following the Hennessy-Patterson books (Quantitative Approach 5 and Organization and Design 4), and I want to check if I'm understanding some cache ...
2
votes
0
answers
77
views
Perf event issues with the hardware prefetcher (all_pf_data_rd and pf_l2_data_rd)
My platform is a 2nd-generation scalable Xeon, equipped with a non-inclusive cache. I ran a series of tests that had the L2 stream prefetcher aggressively prefetching.
I use Perf to monitor performance, ...
0
votes
2
answers
264
views
MOESI Protocol: What happens when Owned is dirty and other processors read the line in Shared?
I've been thinking about the "owned" state of the MOESI protocol. So let's say the following situation exists:
P0 has line A in O state.
P1 has line A in S state.
P0 writes to line A in its ...
1
vote
1
answer
130
views
How to store items in the LIFO stack in a cache-friendly manner?
EDIT / DISCLAIMER:
It appears that my understanding of cache lines was wrong, which makes this question not relevant and misleading. I thought that whenever the CPU tries to fetch memory at a specific index, ...
5
votes
1
answer
276
views
Optimization Challenge Due to L1 Cache with Numba
I've been working on optimizing the calculation of differences between elements in NumPy arrays. I have been using Numba for performance improvements, but I get a 100-microsecond jump when the array ...
2
votes
0
answers
128
views
Is it possible to fetch data into the CPU cache while the CPU works on something else?
I wonder if it's possible to improve performance by getting the CPU to load something into the cache while it still works on something else. I'm not very knowledgeable about the inner workings of a CPU and ...
4
votes
0
answers
225
views
How CPUs Use the LOCK Prefix to Implement Cache Locking and ensure memory consistency
In Java, adding the volatile keyword to a variable guarantees memory consistency (or visibility).
On the x86 platform, the Hotspot virtual machine implements volatile variable memory consistency by ...
0
votes
0
answers
130
views
How to check whether the PCIe Memory-mapped BAR region is cacheable or uncacheable
I want to know how to check whether a PCIe memory-mapped BAR region is cacheable or not.
Is there any way to check the configured value? Or is it just configured as uncacheable in hardware?
(I saw ...
1
vote
2
answers
458
views
Are RISC-V SH and SB instructions allowed to communicate with the cache?
Are RISC-V instructions such as sb and sh allowed to access the cache? Or do they communicate directly with main memory? I have seen the Wstrb signal in main memory structures, but generally not in ...
1
vote
1
answer
659
views
Performance implications of aliasing in VIPT cache
What are the performance implications of virtual address synonym (aliasing) in a VIPT cache? I'm specifically interested in recent x86_64 architectures but knowing more about others wouldn't hurt.
...
0
votes
2
answers
176
views
Why do fast memory writes when run over multiple threads take much more time vs when they are run on a single thread?
I have a program which allocates some memory (200 million integers), does some quick compute and then writes the data to the allocated memory.
When run on a single thread the process takes about 1 ...
1
vote
0
answers
49
views
question regarding the behavior of the program in Meltdown attack
I am doing the Meltdown attack lab using Ubuntu 16.04 32-bit, and an old CPU (Intel i5 7th Gen). There is a secret value 83 stored in 0xfbce3000 by a kernel module, the user program cannot directly ...
0
votes
1
answer
514
views
OS cache/memory hierarchy: How does writing to a new file work?
I know how read/load operations are theoretically supposed to work in OSes. A read instruction causes a TLB lookup, then a look through caches, then a look in main memory, and finally a read from disk ...
0
votes
2
answers
584
views
Can there be a cache block with the same Tag-ID in different Sets?
I am currently investigating some previous exams for my CA course.
There is one question which I found really confusing; here is the data to work with:
Considering a 32-bit address (tag 20bits, ...
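The core idea behind this question can be illustrated with a quick sketch: the same tag can legally appear in different sets, because tag and set index are independent fields of the address (the 20-bit tag comes from the excerpt; the index/offset split below is illustrative, since the rest of the question is truncated):

```python
# 32-bit address split into tag | index | offset fields.
# Tag width is from the question; the 7/5 index/offset split is
# an illustrative choice that fills the remaining 12 bits.
TAG_BITS, INDEX_BITS, OFFSET_BITS = 20, 7, 5

def make_addr(tag, index, offset=0):
    """Assemble an address from its cache-relevant fields."""
    return (tag << (INDEX_BITS + OFFSET_BITS)) | (index << OFFSET_BITS) | offset

# Same 20-bit tag, different set indices => two distinct addresses,
# so both blocks can live in the cache at the same time.
a = make_addr(tag=0xABCDE, index=3)
b = make_addr(tag=0xABCDE, index=4)
print(hex(a), hex(b), a != b)
```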
1
vote
0
answers
84
views
why is there a need to stop prefetching to pages when a write happens to it?
I read in this StackOverflow answer that prefetching does not happen for dirty pages.
In which condition DCU prefetcher start prefetching?
It seems to me that the prefetcher is receiving the dirty ...
0
votes
1
answer
174
views
Is it possible for a CPU to have several L3 caches?
Must the cores of a multi-core CPU all share one L3 cache? Is it possible that a CPU has several L3 caches? For example, suppose a CPU has 24 cores, and no three cores share an L3 cache, so there ...
0
votes
1
answer
215
views
Are 64-byte CPU cache line reads aligned on 64-byte boundaries? [duplicate]
CPU cache lines are typically 64-bytes. When a CPU (say modern Intel processor) reads a cache line from memory, does the CPU read from 64-byte aligned blocks of memory, or any contiguous 64-byte block?...
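As a quick sketch of the aligned-block premise the question is probing (assuming the typical 64-byte line it mentions): the line containing an address starts at that address rounded down to a 64-byte boundary, so fills come from naturally aligned blocks rather than arbitrary contiguous 64-byte windows.

```python
LINE = 64  # assumed cache-line size in bytes

def line_base(addr):
    """Base address of the 64-byte-aligned line containing addr."""
    return addr & ~(LINE - 1)

# An arbitrary address maps to a line starting on a 64-byte boundary:
print(hex(line_base(0x1234)))     # 0x1200
print(line_base(0x1234) % LINE)   # 0 -- always 64-byte aligned
```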