void foo(Node* p[], int size){

    _uint64 arr_of_values[_MAX_THREADS];

    for (int i = 0 ; i < size ; i++ ){
         arr_of_values[i] = p[i]->....;

         // much code here
         //
    }
}

vs

void foo(Node* p[], int size){

    _uint64 arr_of_values[_MAX_THREADS];

    Node** p_end = p + size;   // one past the last pointer in the array
    int i = 0;                 // only used to pick the destination slot
    for ( ; p != p_end ; ){
         arr_of_values[i] = (*p)->.....;
         p++;
         i++;

         // much code here
         //
    }
}

I created this function in two versions to demonstrate what I am asking:

What is more efficient from a cache-efficiency standpoint: indexing with p[i] or dereferencing with *p++?

(I'll never use p[i-x] in the rest of the code, but I may use p[i] or *p in the calculation that follows.)

  • What is more efficient from a cache aspect: increasing p and then loading *p, or increasing the counter i and reading *(p+i)? Commented Jun 27, 2012 at 13:48
  • Cachewise, there's no difference. At most there could be a data-dependency difference (but the compiler should optimize that). Commented Jun 27, 2012 at 14:21

2 Answers


The most important thing is to avoid false sharing in arr_of_values. Each thread writes into its own slot, but 8 or 16 slots share a cache line (depending on the CPU), causing a massive false-sharing problem. Add padding between the slots so that each thread's slot occupies its own cache line (a sketch of this is shown after the code below), or accumulate on the stack and write only once at the end:

void foo(Node* p[], int size){

    _uint64 arr_of_values[_MAX_THREADS];

    _uint64 temp = 0;                 // accumulate on the stack
    Node** p_end = p + size;          // one past the last pointer in the array
    for ( ; p != p_end ; ){
         temp = .....;
         p++;
         // much code here
         //
    }
    arr_of_values[i] = temp;          // write the shared slot only once
                                      // (i is this thread's slot index)
}

The question of access by pointer versus access by index is irrelevant with today's compilers.

Your next actions should be: grab a copy of The Software Optimization Cookbook. Read it. Measure. Fix the measured hotspots, not the guesstimates.


7 Comments

This would be a great answer if there were any threads involved at all... but this appears to be a question about sequential code.
@_BenVoigt: Agreed, but I took a leap of faith and assumed that the array being dimensioned by [_MAX_THREADS] is used as one slot-per-thread.
Remus, you are right. Actually, this snippet is based on complicated code with a hierarchy of multiple threads that merge items from multiple queues into local queues, and then more threads merge items from those local queues into a central one. My code has bad performance, and I believe the bottleneck is this function, which iterates over multiple queues: I have the option of increasing the pointer on each queue or iterating over it with an index. False sharing is very relevant here.
But really, your best bet is to measure right now. VTune, F1 Profiler. Find the hot spots and then address them.
@Michael: You're asking a question about performance, and cache behavior specifically, and didn't mention that you have multiple threads? Please try to include important information in your future questions.

The problem from a cache point of view isn't the way you are accessing the elements. In this case, using a pointer or an array index is equivalent.

BTW, Node* p[] is an array of pointers, so your Node objects could have been allocated in distant memory areas (for example by calling new Node() several times). The best cache performance is achievable if (see the sketch after this list):

  1. Your Node objects are stored contiguously in memory
  2. Node size doesn't exceed the cache size.
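
A minimal sketch of point 1, assuming Node is small and copyable (the std::vector layout and the sum_values function are illustrative, not from the question's code):

#include <cstdint>
#include <vector>

struct Node {
    std::uint64_t value;                         // placeholder for the real Node members
};

std::uint64_t sum_values(const std::vector<Node>& nodes){
    std::uint64_t acc = 0;
    // The vector stores the nodes back to back in a single allocation, so this
    // loop walks memory sequentially and the hardware prefetcher can keep up,
    // unlike chasing Node* pointers that may land in scattered heap blocks.
    for (const Node& n : nodes)
        acc += n.value;
    return acc;
}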

