
I'm running a piece of code to test whether the log() function scales. I ran it on a 4-core machine, and the result shows that it does not scale. My code is below:

#include<iostream>
#include<cmath>
#include<omp.h>
#include<chrono>
using namespace std;

typedef std::chrono::milliseconds ms;

int main(){
        #pragma omp parallel for schedule(static)
        for(int i=0;i<4;i++){
                auto start = std::chrono::high_resolution_clock::now();

                double tmp=1.0;
                for(double j=0.0;j<10000000;j++){
                        tmp=log(j);
                }

                auto end = std::chrono::high_resolution_clock::now();
                #pragma omp critical
                {
                        cout<<"tmp="<<tmp<<endl;
                        cout<<"Thread "<<omp_get_thread_num()<<" calculated tmp, time used: "<<std::chrono::duration_cast<ms>(end - start).count() << "ms" << endl;
                }
        }

        return 0;
}

If I use 4 threads, the result is:

Thread 1 calculated tmp, time used: 21ms
Thread 0 calculated tmp, time used: 21ms
Thread 2 calculated tmp, time used: 21ms
Thread 3 calculated tmp, time used: 21ms

If only using 1 thread, the result is:

Thread 0 calculated tmp, time used: 20ms
Thread 0 calculated tmp, time used: 16ms
Thread 0 calculated tmp, time used: 16ms
Thread 0 calculated tmp, time used: 15ms

So when running in parallel, each thread takes longer than when running sequentially. Does anybody know why it doesn't scale? How does std::log actually work (maybe there's something the threads have to share)? Is there any way to get a log() function that scales? Thanks!

EDIT1: I increased the number of iterations to 10e10, but the result shows that the parallel version is still slower, so maybe it's not the thread creation time that is dominant here. 4 threads:

Thread 0 calculated tmp, time used: 17890ms
Thread 2 calculated tmp, time used: 17890ms
Thread 1 calculated tmp, time used: 17892ms
Thread 3 calculated tmp, time used: 17892ms

1 thread:

Thread 0 calculated tmp, time used: 15664ms
Thread 0 calculated tmp, time used: 15659ms
Thread 0 calculated tmp, time used: 15660ms
Thread 0 calculated tmp, time used: 15647ms

EDIT2: I now print the tmp variable at the end, so that the log() calls can't be optimized out. But the result is still like before. Any other ideas?

EDIT3: The speed-up of the total execution time is 3.5, and it does not get any higher even if the number of iterations increases. I'm not sure whether that's a reasonable speed-up; I was expecting something like 3.7 or 3.8 for a simple program like this.

Comments:
  • Thread creation is expensive. You need to do more work in the thread vs. its creation time. Commented Jul 9, 2016 at 8:03
  • Your log calculation probably gets optimized out. It doesn't do anything observable. Commented Jul 9, 2016 at 8:07
  • What happens if you make the loop iterate over integers, not doubles? Commented Jul 9, 2016 at 8:22
  • What is the total execution time in each case? Even if each thread takes longer, they can still work faster together. Commented Jul 9, 2016 at 8:37
  • Is this a 4-core machine, or a 2-core machine with hyperthreading? Does it have single-core boost, where it overclocks a core when only that core is running? In short, what exactly is your hardware? Finally, multithreaded code rarely scales perfectly; the case where it does is the surprising one. Commented Jul 9, 2016 at 12:15

1 Answer


In short

You are completely right. But to fully measure the multicore performance improvement, you shouldn't rely solely on timing individual threads: you should also measure the overall execution time.

Multicore architectures seem to achieve higher throughput at the expense of a slight slowdown of each core when several of them are active. Here is another benchmark (using std::thread instead of OpenMP) with similar observations.

So each individual log calculation doesn't scale, but the overall system does very well.

Additional details

If you run some overall end-to-end measurements:

int main()
{
    auto common_start = std::chrono::high_resolution_clock::now();
    ... 
    auto common_end = std::chrono::high_resolution_clock::now();
    cout << "Overall calculated time : " << std::chrono::duration_cast<ms>(common_end - common_start).count() << "ms" << endl;
    return 0; 
}

you'd certainly observe a better overall performance in parallel.

Here are my own timings with 4 threads:

Thread 2 calculated tmp, time used: 269ms
Thread 3 calculated tmp, time used: 274ms
Thread 0 calculated tmp, time used: 281ms
Thread 1 calculated tmp, time used: 289ms
Overall calculated time : 296ms

and with one:

Thread 0 calculated tmp, time used: 218ms
Thread 0 calculated tmp, time used: 218ms
Thread 0 calculated tmp, time used: 229ms
Thread 0 calculated tmp, time used: 224ms
Overall calculated time : 903ms

As you already observed, with one thread each chunk of calculations is performed about 22% faster. But overall, the single-threaded run takes three times as long as the multithreaded one to do the same number of calculations.

So it's all about throughput: only about 44 K iterations/ms single-threaded, compared to 135 K iterations/ms with multithreading.
