
I recently started looking into OpenMP, since I will be working on a computationally expensive image analysis project. I use Windows 7 with an Intel i7 (8 cores) and mingw64 gcc 4.8.1. I code in Code::Blocks and have set everything up so that it compiles and runs. In several parts of my code I do some pixel-wise operations, which I thought would be a good candidate for parallel processing. To my surprise, it turns out that sequential is faster than parallel processing. I tried different versions of gcc (4.7 - 4.8), both 32-bit and 64-bit, and on two separate computers, but I always get the same performance issue. I then tried running it with my old Visual Studio 2008, which I still had on one of these two computers, and got a performance increase as expected. Therefore, my question is: why am I not able to see the same effect using gcc? Is there anything I am doing wrong?

Here is a minimum working example.

#include <omp.h>
#include <cstdlib>
#include <iostream>

int main(int argc, char * argv[])
{
   /* process a stack of images - set the number to 1000 for testing */
   int imgStack = 1000;

   double start_t = omp_get_wtime();
   for (int img = 0; img < imgStack; img++)
   {
      omp_set_num_threads(8);
      #pragma omp parallel for default(none)
      for (int y = 0; y < 1000000000; y++) /* increased the number of pixels to make it worthwhile and to see a difference */
      {
         for (int x = 0; x < 1000000000; x++)
         {
            unsigned char pixel[4];
            pixel[0] = 1;
            pixel[1] = 2;
            pixel[2] = 3;
            pixel[3] = 4;

            /* here I would do much more but removed it for testing purposes */

         }
      }
   }
   double end_t = (omp_get_wtime() - start_t) * 1000.0;
   std::cout << end_t << "ms" << std::endl;

   return 0;
}

In the build log I have the following:

x86_64-w64-mingw32-g++.exe -Wall -O2 -fopenmp -c C:\Code\omptest\main.cpp -o obj\Release\main.o
x86_64-w64-mingw32-g++.exe -o bin\Release\omptest.exe obj\Release\main.o -s C:\mingw-builds\x64-4.8.1-posix-seh-rev5\mingw64\bin\libgomp-1.dll

The output is the following:

for 1 thread :   43ms
for 8 threads:  594ms

I also tried turning off optimisation (-O0) in case the compiler does some loop unrolling. I read about the false-sharing issue, so I kept every variable within the loop private to make sure that this is not the problem. I'm not good at profiling, so I can't tell what is going on underneath, such as internal locks that cause all the threads to wait.

I can't figure out what I'm doing wrong here.

- Edit -

Thanks to everyone. In my real code I have an image stack with 2000 images, each 2000x2000 pixels in size. I tried to simplify the example so that everyone could easily reproduce the issue, but I simplified it too much, with the consequence of causing other issues. You were all completely right. In my real code I use Qt for opening and displaying my images, as well as my own image manager that loads the stack and iterates through it to give me one image at a time. I thought providing the whole sample would just be too much and complicate things (i.e. it would no longer be a minimal working example).

I pass all the variables (imageHeight, imageWidth, etc.) as const and only the pointer to my image as shared. Initially that was a pointer to a QImage. In the loop I set the final pixel value using qtimg->setPixel(...), and it seems that the MSVC compiler deals with that differently than the gcc compiler. Finally I replaced the QImage pointer with a pointer to an unsigned char array, which gave me the performance increase I expected.

@Hristo Iliev: Thanks for the information about the thread pool. That's really good to know.

  • You're not doing anything at all in your inner loop. The compiler should optimize that away completely, so all you'll see is the cost of setting up the threads and distributing (no) work to them. Commented Oct 19, 2013 at 22:05
  • If QImage::setPixel() uses an internal lock, e.g. to make the operation thread-safe, calling it from multiple threads at once would only serialise their execution. Commented Oct 20, 2013 at 15:13

2 Answers


Since pixel is only assigned to and then never used, the whole inner loop gets completely removed by GCC's optimiser with -O2, as one can easily verify by enabling the tree dumps:

; Function <built-in> (main._omp_fn.0, funcdef_no=1036, decl_uid=21657, cgraph_uid=256)

<built-in> (void * .omp_data_i)
{
<bb 2>:
  return;

}

So what you are effectively measuring is the OpenMP runtime overhead.

With -O0 all the code is kept in place and the run time scales with the number of threads as expected, but I doubt that you have ever tested it with a 1000000000 x 1000000000 image.




Given the code example, I can't reproduce your result. You have to show your real stack size and image size, because if the work can be done in only 5 ms with 1 thread, multithreading won't make it quicker. Launching multiple threads introduces a large overhead, especially when you launch them imgStack times.

1 Comment

For many years now GCC, MSVC, Intel and most others have implemented the worker threads in their OpenMP runtimes with thread pools. Only the very first parallel region is expensive. Unless more threads are required later, subsequent entries into parallel regions are not as expensive as you might expect.
