I recently started looking into OpenMP, since I will be working on a highly computationally expensive image analysis project. I use Windows 7 with an Intel i7 (8 cores) and mingw64 gcc 4.8.1. I code in Code::Blocks and have set everything up so it compiles and runs. In several parts of my code I do pixel-wise operations, which I thought would be good candidates for parallel processing. To my surprise, it turns out that sequential is faster than parallel processing. I tried different versions of gcc (4.7 - 4.8), both 32-bit and 64-bit, on two separate computers, but I always get the same performance issue. I then tried running it with the old Visual Studio 2008 that I had on one of the two computers, and there I get a performance increase as expected. Therefore, my question is: why am I not able to see the same effect using gcc? Is there anything I am doing wrong?
Here is a minimal working example.
#include <omp.h>
#include <cstdlib>
#include <iostream>

int main(int argc, char * argv[])
{
    /* process a stack of images - set the number to 1000 for testing */
    int imgStack = 1000;

    double start_t = omp_get_wtime();

    for (int img = 0; img < imgStack; img++)
    {
        omp_set_num_threads(8);
        #pragma omp parallel for default(none)
        for (int y = 0; y < 1000000000; y++) /* increased the number of pixels to make it worthwhile and to see a difference */
        {
            for (int x = 0; x < 1000000000; x++)
            {
                unsigned char pixel[4];
                pixel[0] = 1;
                pixel[1] = 2;
                pixel[2] = 3;
                pixel[3] = 4;
                /* here I would do much more but removed it for testing purposes */
            }
        }
    }

    double end_t = (omp_get_wtime() - start_t) * 1000.0;
    std::cout << end_t << "ms" << std::endl;
    return 0;
}
In the build log I have the following:
x86_64-w64-mingw32-g++.exe -Wall -O2 -fopenmp -c C:\Code\omptest\main.cpp -o obj\Release\main.o
x86_64-w64-mingw32-g++.exe -o bin\Release\omptest.exe obj\Release\main.o -s C:\mingw-builds\x64-4.8.1-posix-seh-rev5\mingw64\bin\libgomp-1.dll
The output is the following:
for 1 thread : 43ms
for 8 threads: 594ms
I also tried turning off optimisation (-O0) in case the compiler was doing some loop unrolling. I read about the false-sharing issue, so I kept every variable inside the loop (making them private) to make sure that is not the problem. I'm not good at profiling, so I can't tell what is going on underneath, such as internal locks that cause all the threads to wait.
I can't figure out what I'm doing wrong here.
- Edit -
Thanks to everyone. In my real code I have an image stack with 2000 images, each 2000x2000 pixels in size. I tried to simplify the example so that everyone could easily reproduce the issue, but I simplified it too much and ended up introducing other problems. You were all completely right. In my real code I use Qt for opening and displaying my images, together with my own image manager that loads the stack and iterates through it to give me one image at a time. I thought providing the whole thing would just be too much and complicate matters (i.e. it would not be a minimal working example).
I pass all the variables (imageHeight, imageWidth, etc.) as const, and only the pointer to my image as shared. Initially that was a pointer to a QImage. In the loop I set the final pixel value using qtimg->setPixel(...), and it seems that the MSVC compiler deals with that differently from the gcc compiler. Finally, I replaced the QImage pointer with a pointer to an unsigned char array, which gave me the performance increase I expected.
@Hristo Iliev: Thanks for the information about the thread pool. That's really good to know.
QImage::setPixel() uses an internal lock, i.e. the operation is made thread-safe, so calling it from multiple threads at once would only serialise their execution.