I'm trying to understand the practical value of "cache-friendly" design in lock-free queues. I often see people go to great lengths to pad structures, align data, and avoid false sharing — especially around head/tail pointers or buffer elements.
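To make concrete what I mean by padding/alignment, here's the kind of layout I keep seeing (my own sketch, C++17; `kCacheLine` and the struct names are mine, and the 64-byte fallback is an assumption about typical x86/ARM line sizes):

```cpp
#include <atomic>
#include <cstddef>
#include <new>

// Use the standard constant where available; 64 bytes is an assumed fallback.
#if defined(__cpp_lib_hardware_interference_size)
constexpr std::size_t kCacheLine = std::hardware_destructive_interference_size;
#else
constexpr std::size_t kCacheLine = 64;
#endif

// alignas pads each index out to a full cache line, so the producer's writes
// to tail never invalidate the line holding head (and vice versa).
struct alignas(kCacheLine) PaddedIndex {
    std::atomic<std::size_t> value{0};
};

struct RingBufferIndices {
    PaddedIndex head;  // consumer-owned
    PaddedIndex tail;  // producer-owned
};

static_assert(sizeof(PaddedIndex) == kCacheLine,
              "each index occupies exactly one cache line");
```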
However, in a real-world, high-throughput system with multiple threads (say 10+), each thread is doing a lot of processing after dequeuing a value. For example:
- A thread pops a value from the queue
- Looks up other data in unordered_maps
- Does string manipulations
- Allocates temporary memory
- Performs various calculations
- ...and so on
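A sketch of the per-item work I'm describing (all names here are hypothetical, just to show the cache-hostile access pattern):

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical post-dequeue processing: a hash-map lookup (pointer chasing),
// string building, a temporary allocation, and some arithmetic. Each step
// pulls fresh data into L1/L2, evicting whatever lines the queue touched.
int process(int key, const std::unordered_map<int, std::string>& lookup) {
    std::string result;                  // string manipulation
    auto it = lookup.find(key);          // lookup in an unordered_map
    if (it != lookup.end()) result += it->second;
    std::vector<int> scratch(256, key);  // temporary memory allocation
    long sum = 0;
    for (int v : scratch) sum += v * v;  // various calculations
    return static_cast<int>(sum % 97) + static_cast<int>(result.size());
}
```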
All this downstream activity thrashes the thread's local L1/L2 caches anyway, evicting whatever lines the queue touched. So what’s the point of carefully optimizing the cache layout of the queue?
If thread A is constantly running and working with new data, and thread B is doing the same, doesn’t that mean any "cache locality" or "cache line isolation" will be short-lived or useless?
To be clear, I'm not questioning the theory behind false sharing — I understand that writing to the same cache line from multiple cores causes coherence traffic. But in practice, does padding and aligning in the queue really matter when everything gets evicted from cache almost immediately during downstream processing?
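For concreteness, the false-sharing effect I'm referring to is the kind shown by microbenchmarks like this one (my own sketch, C++11; the struct and function names are mine, and 64 bytes is an assumed line size). Timing `hammer` on `SharedLine` vs `PaddedLines` is what typically shows the coherence-traffic gap:

```cpp
#include <atomic>
#include <cstddef>
#include <thread>

constexpr std::size_t kCacheLine = 64;  // assumption: typical line size

// Unpadded: both counters sit on one cache line, so every increment on one
// core invalidates the other core's copy of that line (coherence ping-pong).
struct SharedLine {
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

// Padded: each counter gets its own line; the two threads never contend.
struct PaddedLines {
    alignas(kCacheLine) std::atomic<long> a{0};
    alignas(kCacheLine) std::atomic<long> b{0};
};

// Two threads hammer independent counters; only the layout differs.
template <typename Counters>
long hammer(Counters& c, long iters) {
    std::thread t1([&] {
        for (long i = 0; i < iters; ++i)
            c.a.fetch_add(1, std::memory_order_relaxed);
    });
    std::thread t2([&] {
        for (long i = 0; i < iters; ++i)
            c.b.fetch_add(1, std::memory_order_relaxed);
    });
    t1.join();
    t2.join();
    return c.a.load() + c.b.load();
}
```

Both versions compute the same result; only the wall-clock time differs, which is exactly why I'm unsure the gap survives once the real workload is cache-hostile anyway.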
Would love clarification from someone who has benchmarked or dealt with this in production.