Optimising 2D array indexing for cache line

Question

I'm trying to optimise the indexing of a large 2D (well, 1D treated as 2D) byte array, to maximise the number of consecutive lookups from the same cache line with a size of 64 bytes. Each lookup is one away from the previous, alternating between moving horizontal and vertical. Whether the movement is positive or negative can be treated as random (really it follows Langton's ant rulestring RLR, but I don't think that information is strictly relevant), meaning the path meanders chaotically, tending to stay in the same general area for a reasonably long time.

With normal indexing a row at a time, a horizontal movement is probably within the same cache line but a vertical movement never is. My solution is to index the array in 8x8 blocks, here's an example as if the cache line size was 9 with a 6x6 array:

 24 25 26 33 34 35
 21 22 23 30 31 32
 18 19 20 27 28 39
  6  7  8 15 16 17
  3  4  5 12 13 14
  0  1  2  9 10 11

It doesn't show as well with 3x3 blocks, but it should allow the cache line to be re-used much more:

  .
  .
  .
 56 57 58 59 60 61 62 63
 48 49 50 51 52 53 54 55
 40 41 42 43 44 45 46 47
 32 33 34 35 36 37 38 39
 24 25 26 27 28 29 30 31
 16 17 18 19 20 21 22 23
  8  9 10 11 12 13 14 15
  0  1  2  3  4  5  6  7 ...

I've benchmarked with normal indexing and this indexing, and this indexing is slower. It may be because it has to do more to work out the desired index (it's in a tight loop, see here for the normal indexed version: How to optimise this Langton's ant sim? ). I can't quite rule out the potential for this indexing being more efficient (working out the new index may be optimisable, thinking about cache is new to me and I'm probably doing something badly).

1) Sanity check: Is what I'm trying to do sensible, is it likely to work? Would it work under different conditions?

2) Longshot: Is there a magic gcc compiler flag that re-orders the indexing for you, to try and optimise for a 2D array instead of 1D?

3) Can I (or do I need to) do anything to try and keep certain cache lines in the CPU? Currently I'm assuming the most recent data is kept until overwritten.

4) If you have a better solution, please describe it.

64 bit linux, gcc, i5-2500k

Edit: It turns out that: 1) This way of thinking was not sensible, 2) N/A, 3) See accepted answer, 4) See accepted answer

Then you would need only a single byte to reach any element of 16x16 matrix this way, using the already-known starting address. Assuming the single byte is not a valid addressing mode in assembly, compiler may be casting it to int or long to have a legal addressing, causing the slowdowns. But indexing needs 1/4(even 1/8) memory space, this could be good for weak-memory installations. — huseyin tugrul buyukisik
– huseyin tugrul buyukisik, Commented Jul 13, 2013 at 0:10
What does your indexing look like? Just the most obvious thing, something like (x&7) + ((x & ~7) << (as required for block size)) + ((y&7) << 3) + ((y &~7) << (as required for block size)) (extemporaneously, could be wrong)? — Tommy
– Tommy, Commented Jul 13, 2013 at 0:52
This isn't entirely clear. Is the aim of the reordering to minimise the probability of walking beyond the current cache line? Bear in mind that if you stay "local" (in a geometric sense), then it's likely that the cache will end up holding a bunch of "local" data, thus minimising the cache miss rate. Thus there's probably no need to do anything fancy with your indexing. — Oliver Charlesworth
– Oliver Charlesworth, Commented Jul 13, 2013 at 6:04

Eric Postpischil · Accepted Answer · 2013-07-13 11:07:14Z

I do not see justification for maximizing consecutive use of one cache line. The cache does not operate “one line at a time”, and there is generally no advantage to using one cache line repeatedly versus using any of the lines that are in cache.

A better goal is to maximize the number of accesses that are served from a line in L1 cache, instead of needing to be fetched from slower cache or memory. As long as an access “hits” a line currently in cache, we do not care which cache line it is.

The i5-2500K is a Sandy Bridge processor. The Sandy Bridge L1 data cache is 32 KiB and is eight-way associative, with 64-byte cache lines. This means the 32,768-byte cache has 512 lines, and they are organized into 64 sets of eight lines each. Each memory address maps to exactly one set, as shown below. Within each set, eight cache lines are retained, from lines that have been recently used in that set. (The replacement algorithm is not least-recently-used, but it is an attempt to be useful and may have similar results to least-recently-used.)

Cache lookups work in this way:

Given a byte address x, let t = floor(x/64) (due to cache line size).
Let s = t % 64 (to select the set).
Check set s to see whether it contains the byte at address x.

Consider the effect of row length on these cache lookups. With a row length of 65,536 bytes, the addresses of array elements a[i][j] and a[i+1][j] differ by 65,536 bytes. That means their values for t in the above lookup procedure differ by exactly 1024, and their values for s are identical. Therefore, they map to the same set.

Once the algorithm moves up or down by more than eight rows, without changing columns outside of a cache line, the single cache set being used cannot handle the nine recently used cache lines. One of them must be evicted. In effect, the cache size is eight lines (512 bytes) instead of 512 lines (32,768 bytes).

A simple way to address this is to pad the array so that the rows are 65,536+p bytes long, for some padding amount p. The array would be allocated with extra space and defined with longer rows than normal. The extra columns may generally be ignored. There is no need to initialize them; we do not care about their contents, just the effect they have on the addresses. (Alternately, they could be used for supplementary data if that is convenient for the program.)

With this padding, the distance between a[i][j] and a[i+1][j] is 65,536+p bytes, so the difference in t values is 1024+p/64, and the difference in s values is p/64 % 64. E.g., if p is 64 or 320, the difference in s values is 1 or 5, respectively.

I suggest testing 9*64 for p. Any value 64 or greater will ensure that array elements in the same column in consecutive rows map to different cache sets. However, the algorithm described in the question wanders in columns as well as in rows. So, if p were small, our fix to make consecutive rows map to different cache sets might be negated by column-wandering that meanders back to the same cache set. Other values of p should be tried as well.

This is not intended to be a complete solution to the problem, as there are many factors that affect performance.

That's very helpful. My justification for trying to stay in the same cache line was to maximise the 'likely to be used' data in the L1 cache. 8x8 aligned blocks roughly centred around a location better cover the next x moves than 64x1 blocks do. I didn't consider that the L1 cache was large enough for 64x1 blocks to be efficient (16KiB and up allows 64x1 to perform pretty well). I also understand your explanation of cache lookups, and know what to read up on. Thank you :D

user555045 · Accepted Answer · 2013-07-15 10:31:59Z

This is probably not useful, but it may be interesting.

You could use Z-order addressing. It will map aligned 8x8 blocks to cache lines, so as long as you stay within one aligned 8x8 block you're always using the same cache line. But weird things sometimes happen when you cross from one block into the next.

Generating a Z-order address from an (x, y) pair is a little annoying:

static uint Interleave(uint x, uint y)
{
    y = (y | (y << 1)) & 0x00FF00FF;
    y = (y | (y << 2)) & 0x0F0F0F0F;
    y = (y | (y << 4)) & 0x33333333;
    y = (y | (y << 8)) & 0x55555555;

    x = (x | (x << 1)) & 0x00FF00FF;
    x = (x | (x << 2)) & 0x0F0F0F0F;
    x = (x | (x << 4)) & 0x33333333;
    x = (x | (x << 8)) & 0x55555555;

    return x | (y << 1);
}

(this is C#, should be easy to convert to C)

Less annoying if you have a CPU that supports PDEP, which is, so far, only Haswell.

But you probably don't need to do that often. You can directly increment or decrement the x or y part of a Z-order address (it's generalizable to adding any pair of constants (c1, c2) to a Z-address, if they are both nonzero it takes a little more code), like this: (those don't do any bounds checking)

static uint IncX(uint z)
{
    uint xsum = (z | 0xAAAAAAAA) + 1;
    return (xsum & 0x55555555) | (z & 0xAAAAAAAA);
}

static uint IncY(uint z)
{
    uint ysum = (z | 0x55555555) + 2;
    return (ysum & 0xAAAAAAAA) | (z & 0x55555555);
}

static uint DecX(uint z)
{
    uint xsum = (z & 0x55555555) - 1;
    return (xsum & 0x55555555) | (z & 0xAAAAAAAA);
}

static uint DecY(uint z)
{
    uint ysum = (z & 0xAAAAAAAA) - 2;
    return (ysum & 0xAAAAAAAA) | (z & 0x55555555);
}

You can even throw in some kinds of bounds-checking. I have routines for saturating increment/decrement, I'm only about 90% sure that they work though. Wrapping modulo a power of two is trivial, just do a binary AND on the result.

The addressing with the Z-coordinate is trivial, just add it to the base of the array. Moving is slightly more complicated than in (x, y) space, but if you combine this with the idea from your other post (looking up an area) you never even need to move (except in the computation of that lookup table, obviously). Packing a good surrounding area may be harder. But there is a less-good surrounding area that becomes trivial: offset the Z coordinate directly in Z-space in both directions and take everything in between (say, from Z-8 to Z+7). That will simulate fewer steps at once on average since it's not usually a square block and the current position is usually not in the middle, but the index into the lookup table would be easier to compute.

edit: it's probably better to take an aligned block instead of a range, because the ant can never walk from one of the "parts" of an unaligned range into an other "part" (the parts are, at best, diagonally connected, so it would have to step outside). That's also easy, just AND the least significant bits off of the Z-coordinate to get the start of the aligned block. The lookup table will need those least significant bits though so they have to become part of the index.

I don't expect this approach to win. But it's interesting, IMO.

Collectives™ on Stack Overflow

Optimising 2D array indexing for cache line

2 Answers 2

1 Comment

Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

1 Comment

Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related