I'm trying to optimise the indexing of a large 2D (well, 1D treated as 2D) byte array, to maximise the number of consecutive lookups from the same cache line with a size of 64 bytes. Each lookup is one away from the previous, alternating between moving horizontal and vertical. Whether the movement is positive or negative can be treated as random (really it follows Langton's ant rulestring RLR, but I don't think that information is strictly relevant), meaning the path meanders chaotically, tending to stay in the same general area for a reasonably long time.
With normal indexing a row at a time, a horizontal movement is probably within the same cache line but a vertical movement never is. My solution is to index the array in 8x8 blocks, here's an example as if the cache line size was 9 with a 6x6 array:
24 25 26 33 34 35
21 22 23 30 31 32
18 19 20 27 28 39
6 7 8 15 16 17
3 4 5 12 13 14
0 1 2 9 10 11
It doesn't show as well with 3x3 blocks, but it should allow the cache line to be re-used much more:
.
.
.
56 57 58 59 60 61 62 63
48 49 50 51 52 53 54 55
40 41 42 43 44 45 46 47
32 33 34 35 36 37 38 39
24 25 26 27 28 29 30 31
16 17 18 19 20 21 22 23
8 9 10 11 12 13 14 15
0 1 2 3 4 5 6 7 ...
I've benchmarked with normal indexing and this indexing, and this indexing is slower. It may be because it has to do more to work out the desired index (it's in a tight loop, see here for the normal indexed version: How to optimise this Langton's ant sim? ). I can't quite rule out the potential for this indexing being more efficient (working out the new index may be optimisable, thinking about cache is new to me and I'm probably doing something badly).
1) Sanity check: Is what I'm trying to do sensible, is it likely to work? Would it work under different conditions?
2) Longshot: Is there a magic gcc compiler flag that re-orders the indexing for you, to try and optimise for a 2D array instead of 1D?
3) Can I (or do I need to) do anything to try and keep certain cache lines in the CPU? Currently I'm assuming the most recent data is kept until overwritten.
4) If you have a better solution, please describe it.
64 bit linux, gcc, i5-2500k
Edit: It turns out that: 1) This way of thinking was not sensible, 2) N/A, 3) See accepted answer, 4) See accepted answer
(x&7) + ((x & ~7) << (as required for block size)) + ((y&7) << 3) + ((y &~7) << (as required for block size))(extemporaneously, could be wrong)?