
I have basically two vectors: one holding a large number of elements, and a second holding a small number of probes used to sample data from the elements. I stumbled over the question of which order to nest the two loops. Naturally I thought having the outer loop run over the larger vector would be beneficial.

Implementation 1:

for(auto& elem: elements) {
    for(auto& probe: probes) {
        probe.insertParticleData(elem);
    }
}

However, it seems that the second implementation takes only half the time.

Implementation 2:

for(auto& probe: probes) {
    for(auto& elem: elements) {
        probe.insertParticleData(elem);
    }
}

What could be the reason for that?

Edit:

Timings were generated by the following code:

clock_t t_begin_ps = std::clock();
... // timed code
clock_t t_end_ps = std::clock();
double elapsed_secs_ps = double(t_end_ps - t_begin_ps) / CLOCKS_PER_SEC;
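
For reference, a wall-clock equivalent using std::chrono (assuming C++11 is available; this is only a sketch, not the code actually used) would look like this. Note that std::clock measures CPU time, whereas steady_clock measures elapsed wall time.

#include <chrono>

auto t_begin_ps = std::chrono::steady_clock::now();
... // timed code
auto t_end_ps = std::chrono::steady_clock::now();
double elapsed_secs_ps =
    std::chrono::duration<double>(t_end_ps - t_begin_ps).count();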

and when inserting an element's data I basically do two things: test whether the distance to the probe is below a limit, and compute an average

bool probe::insertParticleData(const elem& pP) {
   if (!isInside(pP.position())) {return false;}
   ... // compute alpha and beta
   avg_vel = alpha*avg_vel + beta*pP.getVel();
   return true;
}

To get an idea of the memory usage: I have approx. 10k elements, which are objects with 30 double data members. For the test I used 10 probes containing 15 doubles each.
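
For scale, that is roughly 10 000 × 30 × 8 bytes ≈ 2.4 MB of element data, while the probes together occupy only about 10 × 15 × 8 bytes ≈ 1.2 KB, so the probes fit easily in the L1 cache while the element data does not.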

  • First of all, how did you measure the timings? Commented Nov 26, 2014 at 8:02
  • That may also depend on what probe.insertParticleData(elem); does. Commented Nov 26, 2014 at 8:02
  • It really depends on the number and sizes of the elements, and the memory access pattern. Could you add some more detail? Commented Nov 26, 2014 at 8:09
  • @Jarod42 added implementation of insertParticleData Commented Nov 26, 2014 at 8:23
  • @Bathsheba, I included the code for the timings Commented Nov 26, 2014 at 8:23

2 Answers


Today's CPUs are heavily optimized for linear access to memory. Therefore a few long loops will beat many short loops. You want the inner loop to iterate over the long vector.
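
To restate the point in terms of the question's loops (the sizes are taken from the question's edit; the comments spell out the reasoning, they are not measurements):

for (auto& probe: probes) {            // 10 outer iterations
    // the current probe (~120 bytes) stays hot in L1/registers
    for (auto& elem: elements) {       // one long loop of ~10k iterations:
        // elements are read front-to-back, so the hardware prefetcher
        // can stream the ~2.4 MB of element data ahead of the loop
        probe.insertParticleData(elem);
    }
}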


2 Comments

What does linear access mean exactly?
Linear in this context generally means contiguous - loading whole cache lines and using only part of each cache line is inefficient.

My guess: if insertParticleData is virtual, the compiler will treat the function's address as a constant within the inner loop and move the vtable fetch outside the inner loop. I.e. it will effectively generate code which looks like:

   for (auto& probe: probes) {
      funcPtr p = probe.insertParticleData;
      for (auto& elem: elements) {
        (*p)(elem);
      }
   }

whereas in the first version, p would be computed for every inner iteration.
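
For contrast, the same pseudo-code view of the first implementation (using the same hypothetical funcPtr notation as above) would be:

   for (auto& elem: elements) {
      for (auto& probe: probes) {
        funcPtr p = probe.insertParticleData;  // lookup repeated on
        (*p)(elem);                            // every inner iteration
      }
   }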

1 Comment

No, it is not a virtual function.
