EDIT / DISCLAIMER:

It appears that my understanding of cache lines was wrong, which makes this question irrelevant and misleading. I thought that whenever the CPU fetches memory at a specific index, it also grabs the indexes right after it, which is NOT true. See How do cache lines work? for reference.


I was about to write a data container for storing a contiguous, re-sizable memory block, in which items can only be accessed from one side by pushing or popping - basically a LIFO stack. I've been reading a lot about CPU caches and memory access patterns and wanted said container to be as cache-friendly as possible.

So I've taken a look at common implementations of LIFO stacks on the internet. Most of them suggest using a dynamic array as a base and accessing the data by appending or removing items from the end of the array. In this case the empty capacity gap is stored after the data in the following way:

 stack bottom            stack top
 array begin             V  array end
 V                       V  V
|V          data         V |V     gap     |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|8 |7 |6 |5 |4 |3 |2 |1 |0 |  |  |  |  |  |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
                        |       cache line       |
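The common implementation the diagram describes might look like the following minimal sketch in C (names like `Stack` and `stack_push` are illustrative, not from any particular library): live elements sit at the front of the allocation, and the spare capacity - the gap - sits after them.

```c
#include <assert.h>
#include <stdlib.h>

/* Minimal sketch of the common layout: data at the front of the
 * allocation, spare capacity (the "gap") after it. */
typedef struct {
    int   *data;
    size_t size;      /* number of live elements       */
    size_t capacity;  /* total slots in the allocation */
} Stack;

static void stack_init(Stack *s) {
    s->data = NULL;
    s->size = 0;
    s->capacity = 0;
}

static void stack_push(Stack *s, int value) {
    if (s->size == s->capacity) {              /* out of room: grow */
        size_t new_cap = s->capacity ? s->capacity * 2 : 8;
        int *p = realloc(s->data, new_cap * sizeof *p);
        assert(p && "allocation failure");
        s->data = p;
        s->capacity = new_cap;
    }
    s->data[s->size++] = value;                /* append at the end */
}

static int stack_pop(Stack *s) {
    assert(s->size > 0);
    return s->data[--s->size];                 /* remove from the end */
}
```

Pushing appends at the end of the in-use region and popping removes from the same end, so the gap always trails the data.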

However, this does not seem very cache-friendly to me. Whenever one looks up the item on the top of the stack, the CPU would fetch memory from the gap, which contains garbage - or at least this is how I understood it.

Would the approach where the gap appears at the beginning of the memory block and the data at the end be more performant? In this case the indexes of the items would be reversed, as would bottom and top, and the cache line could hit more items, like so:

                stack top               stack bottom
                V                       V
|      gap     |V          data         V |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
|  |  |  |  |  |0 |1 |2 |3 |4 |5 |6 |7 |8 |
+--+--+--+--+--+--+--+--+--+--+--+--+--+--+
               |       cache line      |

Is my understanding correct here? Which approach is more performant and cache-friendly?

I could be totally wrong with my assumptions. As I said, most of the implementations I saw use the first approach, so there must be something to it.

Cheers!

  • The top of stack changes from time to time. Whether your stack grows toward higher or lower memory addresses, the number of items on the stack in the same cache line as the top of stack will change, going through all possible values. The direction of growth does not affect that unless the particular place your containing array starts or ends and the frequency distribution of the number of items on the stack happen to fit particular patterns relative to cache lines. Commented Jun 30, 2024 at 21:26
  • So the short answer is "it depends"? Commented Jun 30, 2024 at 21:43
  • Two questions. Does it matter? Do you need to micro-optimize? If yes, find a different approach. Commented Jun 30, 2024 at 21:45
  • @0___________ No, not really. I was just wondering if it would actually matter and make any difference. Commented Jun 30, 2024 at 21:49
  • I think more relevant to performance is that you want to use a re-sizable memory block, which means whenever you reach the max capacity you need to copy the whole array to a newly allocated, bigger one. That will cost you (unless you avoid it by limiting the max size and allocating that amount of memory in the first place). Commented Jul 1, 2024 at 7:16

1 Answer


LIFO stacks using an array are equally cache-friendly for either growth direction, unless you have a specific pattern of sizes that happens to favour an odd-sized allocation and growing down.

APIs for efficient reallocation (without copying the data) like Linux mremap or C realloc are designed around adding new space at the end of an allocation, not the start, so that's a point in favour of the normal way if your stack can grow beyond the initial allocation. At least if you're using an allocation API that supports those, not like C++'s crippled new/delete which essentially forces std::vector to suck.
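To make the asymmetry concrete, here's a hedged sketch of what growing at the front costs you. Growing at the end is exactly what `realloc` is designed for: existing bytes keep their offsets and the allocator may even extend the block in place. Growing at the front forces you to shift every element toward the end of the enlarged block yourself (the helper name `grow_at_front` is made up for illustration):

```c
#include <stdlib.h>
#include <string.h>

/* Grow a gap-before-data array from old_cap to new_cap int slots.
 * realloc keeps old contents at the SAME offsets (the front), so a
 * grows-down stack must then move everything to the back itself. */
static int *grow_at_front(int *data, size_t old_cap, size_t new_cap) {
    int *p = realloc(data, new_cap * sizeof *p);
    if (!p) return NULL;
    /* shift the old contents to the end of the enlarged block;
     * memmove because the regions may overlap */
    memmove(p + (new_cap - old_cap), p, old_cap * sizeof *p);
    return p;
}
```

With the normal gap-after-data layout, the `memmove` disappears entirely: the `realloc` call alone is the whole growth operation.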

Other than that, they're equal unless your high water mark is commonly 1 or 2 elements more than a multiple of 64 bytes, like you've shown here to make your other version look good. In that version, the first element pushed is in a cache line by itself, but the top-of-stack hasn't yet crossed into a new cache line.

If you push one more element in your second diagram, it will be the only in-use element in its cache line. It only looks good because you picked a number of elements for the current stack state that happens to work well with the misalignment (relative to cache-line boundaries) you chose for the first element (the "bottom" of the stack, at the highest address for your grows-down stack).
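To see the cycling behaviour in numbers, here's a small hypothetical helper (assuming 4-byte elements and 64-byte cache lines; the function name is made up) that counts how many in-use elements share a cache line with the top of stack, given the byte offset of the first element within its line:

```c
#include <stddef.h>

/* How many in-use elements share a cache line with the top of stack,
 * for a stack of n >= 1 4-byte elements whose first element starts at
 * byte offset `base` within its 64-byte cache line. */
static size_t elems_in_top_line(size_t base, size_t n) {
    size_t top        = base + (n - 1) * 4;  /* byte offset of top element   */
    size_t line_start = top - top % 64;      /* start of top's cache line    */
    size_t first      = line_start > base ? line_start : base;
    return (top - first) / 4 + 1;
}
```

With an aligned first element the count climbs from 1 to 16 and wraps back to 1 as the top crosses each line boundary; a misaligned first element only shifts where the wrap happens, it doesn't avoid it.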

With your grows-down gap-at-the-front design, potentially the first element pushed (highest address) is in the same cache line as some other useful data, so that cache-line staying hot can be helpful (if your stack frequently becomes empty so you actually access that element).
Memory allocators often round up allocations to multiples of some larger size like 16 bytes, but there could still be other allocations in one cache line.
Still, unless you know something special about your expected stack sizes and/or what comes next, normally you'd want to allocate a chunk of memory that's a multiple of the cache-line size.
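A hedged sketch of that allocation strategy, using C11 `aligned_alloc` (which itself wants the size to be a multiple of the alignment; 64-byte lines are an assumption that holds on mainstream x86 and ARM, and the helper name is illustrative):

```c
#include <stdlib.h>

#define CACHE_LINE 64  /* assumption: 64-byte lines (mainstream x86/ARM) */

/* Round a byte count up to whole cache lines, then request storage
 * that starts on a line boundary, so no line is shared with
 * neighbouring allocations. */
static void *alloc_line_aligned(size_t bytes) {
    size_t rounded = (bytes + CACHE_LINE - 1) / CACHE_LINE * CACHE_LINE;
    return aligned_alloc(CACHE_LINE, rounded);
}
```

Memory from this helper always starts at a line boundary and spans whole lines, so the only partially-used line is the one containing the current top of stack.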

Fun fact: the asm callstack grows down on mainstream ISAs (and in calling conventions for ISAs like MIPS which don't favour one stack-growth direction); e.g. x86 push and call decrement the stack pointer register (RSP). The reasons for this aren't cache efficiency; they're historical, dating back to before CPUs had caches. It doesn't present any particular problems for CPU caches, though.

CPU/asm stacks are allocated such that the highest-address cache line is fully used, unlike yours where the end of the allocation isn't the end of a cache line. (Typically "the stack" is a multiple of the page size, so there won't be other stuff next to it in the same line anyway.)


2 Comments

Thanks for the explanation and references. Your point about realloc is enough to make the 2nd approach not very useful. "you've shown here to make your other version look good" - this is true, I was wrong about how cache lines work (added an EDIT about it).
@ilikebananas: Oh I see; yeah, cache lines are naturally aligned so the boundaries are fixed. Stacks that grow down can work if you reserve address-space below them to grow into (by mapping more pages there), so there's nothing fundamentally problematic, which is why asm callstacks do in fact work that way without it being a performance problem. But doing that for your own data structure in a high-level language would require a lot of low-level rolling-your-own work.
