I'm running a piece of code to test whether the log() function scales. I ran it on a 4-core machine and the result shows that it does not scale. My code is as below:
#include <iostream>
#include <cmath>
#include <omp.h>
#include <chrono>
using namespace std;
typedef std::chrono::milliseconds ms;

int main() {
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < 4; i++) {
        auto start = std::chrono::high_resolution_clock::now();
        double tmp = 1.0;
        for (double j = 0.0; j < 10000000; j++) {
            tmp = log(j);
        }
        auto end = std::chrono::high_resolution_clock::now();
        #pragma omp critical
        {
            cout << "tmp=" << tmp << endl;
            cout << "Thread " << omp_get_thread_num() << " calculated tmp, time used: "
                 << std::chrono::duration_cast<ms>(end - start).count() << "ms" << endl;
        }
    }
    return 0;
}
If I use 4 threads, the result is:
Thread 1 calculated tmp, time used: 21ms
Thread 0 calculated tmp, time used: 21ms
Thread 2 calculated tmp, time used: 21ms
Thread 3 calculated tmp, time used: 21ms
If I use only 1 thread, the result is:
Thread 0 calculated tmp, time used: 20ms
Thread 0 calculated tmp, time used: 16ms
Thread 0 calculated tmp, time used: 16ms
Thread 0 calculated tmp, time used: 15ms
So when running in parallel, each thread takes longer than when running sequentially. Does anybody know why it doesn't scale? How does std::log actually work (maybe there's something that the threads have to share)? Is there any way to have a log() function that scales? Thanks!
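For reference, the timings above are per loop iteration, not for the whole program. A minimal sketch of how the whole loop could be timed instead is below. The helper name total_time_ms and its parameters are hypothetical, the inner loop starts at 1 rather than 0, and I assume the thread count is set from outside through the OMP_NUM_THREADS environment variable.

```cpp
#include <chrono>
#include <cmath>
#include <vector>

// Hypothetical helper (not part of the program above): runs `chunks`
// independent chunks of `iters` log() calls and returns the wall-clock
// time of the whole loop in milliseconds.
double total_time_ms(int chunks, long iters, std::vector<double>& results) {
    results.assign(chunks, 0.0);
    auto start = std::chrono::steady_clock::now();
    // Without OpenMP enabled the pragma is ignored and the loop runs serially.
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < chunks; i++) {
        double tmp = 1.0;
        for (long j = 1; j <= iters; j++) {
            tmp = std::log(static_cast<double>(j));
        }
        results[i] = tmp;  // each chunk writes only its own slot
    }
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(end - start).count();
}
```

Dividing the value measured with OMP_NUM_THREADS=1 by the value measured with OMP_NUM_THREADS=4 would give the speed-up of the total execution time, which is a different number from the per-thread times printed above.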
EDIT1: I increased the number of iterations to 10e10, but the result shows that the parallel version is even slower, so maybe it's not the thread creation time which is dominant here.
4 threads:
Thread 0 calculated tmp, time used: 17890ms
Thread 2 calculated tmp, time used: 17890ms
Thread 1 calculated tmp, time used: 17892ms
Thread 3 calculated tmp, time used: 17892ms
1 thread:
Thread 0 calculated tmp, time used: 15664ms
Thread 0 calculated tmp, time used: 15659ms
Thread 0 calculated tmp, time used: 15660ms
Thread 0 calculated tmp, time used: 15647ms
EDIT2: I now print the tmp variable at the end, so that the log() calls can't be optimized out. But the result is still the same as before. Any other ideas?
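Since only the last value of tmp is printed, a compiler might in principle still be allowed to skip the earlier calls. One variant I could try is to accumulate every result, so that each log() call feeds into the value that gets printed. This is only a sketch: sum_logs is a hypothetical helper, and it starts at j = 1 rather than 0 (an assumption on my side) because log(0) is -inf and would make the sum meaningless.

```cpp
#include <cmath>

// Hypothetical helper: sums log(j) for j = 1..n, so every call
// contributes to the returned value.
double sum_logs(long n) {
    double acc = 0.0;
    for (long j = 1; j <= n; j++) {
        acc += std::log(static_cast<double>(j));
    }
    return acc;
}
```

Each thread would call sum_logs with its iteration count and print the returned sum inside the critical section, in place of tmp.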
EDIT3: So the speed-up of the total execution time is 3.5, and it does not get any higher even if the number of iterations increases. I'm not sure whether that is a reasonable speed-up, because I was expecting something like 3.7 or 3.8 for a simple program like this.
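The 3.5 figure is consistent with the per-thread timings from EDIT1, as the sketch below shows. It assumes that the serial run takes the sum of the four chunk times, that the parallel run finishes when the slowest thread does, and that thread start-up is ignored. The function name estimated_speedup is hypothetical.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Speed-up estimated from per-chunk timings (all in milliseconds).
// Serial total   = sum of the chunk times (chunks run back to back).
// Parallel total = slowest thread (all chunks run at the same time).
double estimated_speedup(const std::vector<double>& serial_ms,
                         const std::vector<double>& parallel_ms) {
    double serial_total =
        std::accumulate(serial_ms.begin(), serial_ms.end(), 0.0);
    double parallel_total =
        *std::max_element(parallel_ms.begin(), parallel_ms.end());
    return serial_total / parallel_total;
}
```

With the EDIT1 numbers this is 62630 ms / 17892 ms, which is about 3.50, or a parallel efficiency of roughly 0.875 on 4 cores. In other words, the roughly 14% longer per-thread time in the 4-thread run accounts for the whole gap to a speed-up of 4.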