Why is my parallel code slower than sequential code?

Question

I have implemented a parallel merge sort in C using OpenMP. It runs in 3.9 seconds, which is slower than the sequential version of the same code (which takes 3.6 seconds). I am trying to optimise the code as far as possible but can't improve the speedup. Can you please help with this? Thanks.

 void partition(int arr[],int arr1[],int low,int high,int thread_count)
 {
     int tid,mid;

     #pragma omp if
     if(low<high)
     {
         if(thread_count==1)
         {
             mid=(low+high)/2;
             partition(arr,arr1,low,mid,thread_count);
             partition(arr,arr1,mid+1,high,thread_count);
             sort(arr,arr1,low,mid,high);
         }
         else
         {
             #pragma omp parallel num_threads(thread_count)
             {
                 mid=(low+high)/2;
                 #pragma omp parallel sections
                 {
                     #pragma omp section
                     {
                         partition(arr,arr1,low,mid,thread_count/2);
                     }
                     #pragma omp section
                     {
                         partition(arr,arr1,mid+1,high,thread_count/2);
                     }
                 }
             }
             sort(arr,arr1,low,mid,high);
         }
     }
 }

It would be nice to have a version we could compile and test. Your source is missing includes for stdio and stdlib and the definition of "sort", which would probably be better named "merge". Also the main function "partition" would be better named "sort" or "mergesort".
– Daniel Landau, Commented Sep 16, 2012 at 9:43
@DanielLandau Have added the full version. Hope you could give me a good solution :)
– Rigorous implementation, Commented Sep 16, 2012 at 11:47
The code does not compile because of several errors. For instance it seems not to be C conforming (return type of ‘main’ is not ‘int’) and #pragma omp if does not exist.
– Massimiliano, Commented Sep 16, 2012 at 12:03
possible duplicate of Parallel Merge Sort with threads /much/ slower than Seq. Merge Sort. Help
– Bo Persson, Commented Sep 16, 2012 at 12:38
This is not really a duplicate, since the attempts at solution are very different.
– Daniel Landau, Commented Sep 16, 2012 at 13:25

Massimiliano · Accepted Answer · 2012-09-16 17:28:55Z

As was correctly noted, there are several mistakes in your code that prevent it from executing correctly, so I would first suggest reviewing these errors.

Anyhow, considering only how OpenMP performance scales with the number of threads, an implementation based on task directives may fit better, as it overcomes the limit already pointed out in a previous answer:

Since the sections directive only has two sections, I think you won't get any benefit from spawning more threads than two in the parallel clause

You can find a sketch of such an implementation below:

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <sys/time.h>

void getTime(double *t) {

  struct timeval tv;

  gettimeofday(&tv, 0);
  *t = tv.tv_sec + (tv.tv_usec * 1e-6);
}

int compare( const void * pa, const void * pb ) {

  const int a = *((const int*) pa);
  const int b = *((const int*) pb);

  return (a > b) - (a < b);  /* avoids the overflow that a-b can hit */
}

void merge(int * array, int * workspace, int low, int mid, int high) {

  int i = low;
  int j = mid + 1;
  int l = low;

  while( (l <= mid) && (j <= high) ) {
    if( array[l] <= array[j] ) {
      workspace[i] = array[l];
      l++;
    } else {
      workspace[i] = array[j];
      j++;
    }
    i++;
  }
  if (l > mid) {
    for(int k=j; k <= high; k++) {
      workspace[i]=array[k];
      i++;
    }
  } else {
    for(int k=l; k <= mid; k++) {
      workspace[i]=array[k];
      i++;
    }
  }
  for(int k=low; k <= high; k++) {
    array[k] = workspace[k];
  }
}

void mergesort_impl(int array[],int workspace[],int low,int high) {

  const int threshold = 1000000;

  if( high - low > threshold  ) {
    int mid = (low+high)/2;
    /* Recursively sort on halves */
#ifdef _OPENMP
#pragma omp task 
#endif
    mergesort_impl(array,workspace,low,mid);
#ifdef _OPENMP
#pragma omp task
#endif
    mergesort_impl(array,workspace,mid+1,high);
#ifdef _OPENMP
#pragma omp taskwait
#endif
    /* Merge the two sorted halves */
#ifdef _OPENMP
#pragma omp task
#endif
    merge(array,workspace,low,mid,high);
#ifdef _OPENMP
#pragma omp taskwait
#endif
  } else if (high - low > 0) {
    /* Coarsen the base case */
    qsort(&array[low],high-low+1,sizeof(int),compare);
  }

}

void mergesort(int array[],int workspace[],int low,int high) {
  #ifdef _OPENMP
  #pragma omp parallel
  #endif
  {
#ifdef _OPENMP
#pragma omp single nowait
#endif
    mergesort_impl(array,workspace,low,high);
  }
}

const size_t largest = 100000000;
const size_t length  = 10000000;

int main(int argc, char *argv[]) {

  int * array = NULL;
  int * workspace = NULL;

  double start,end;

  printf("Largest random number generated: %d \n",RAND_MAX);
  printf("Largest random number after truncation: %zu \n",largest);
  printf("Array size: %zu \n",length);
  /* Allocate and initialize random vector */
  array     = (int*) malloc(length*sizeof(int));
  workspace = (int*) malloc(length*sizeof(int));
  assert(array != NULL && workspace != NULL);
  for( size_t ii = 0; ii < length; ii++)
    array[ii] = rand()%largest;
  /* Sort */  
  getTime(&start);
  mergesort(array,workspace,0,length-1);
  getTime(&end);
  printf("Elapsed time sorting: %g sec.\n", end-start);
  /* Check result */
  for( int ii = 1; ii < length; ii++) {
    if( array[ii] < array[ii-1] ) printf("Error:\n%d %d\n%d %d\n",ii-1,array[ii-1],ii,array[ii]);
  }
  free(array);
  free(workspace);
  return 0;
}

Notice that if you are after performance, you also have to make sure that the base case of your recursion is coarse enough to avoid substantial overhead from recursive function calls. Other than that, I would suggest profiling your code so you get a good hint as to which parts are really worth optimizing.
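
To make the point about a coarse base case concrete, here is a minimal sketch with two separate cutoffs. It is not the code above: it swaps in an insertion sort for the base case (instead of qsort) and uses the `if` clause of the task directive so that small ranges do not create deferred tasks. The names `cutoff_mergesort`, `serial_cutoff` and `task_cutoff` are made up for this illustration, and good values for them have to be found by measuring on your machine.

```c
#include <stdlib.h>
#include <string.h>

/* Insertion sort for small ranges: a cheap base case with no recursion. */
static void insertion_sort(int *a, int n)
{
    for (int i = 1; i < n; i++) {
        int key = a[i];
        int j = i - 1;
        while (j >= 0 && a[j] > key) {
            a[j + 1] = a[j];
            j--;
        }
        a[j + 1] = key;
    }
}

/* Merge sorted a[0..mid) and a[mid..n) through tmp, then copy back. */
static void merge_halves(int *a, int *tmp, int mid, int n)
{
    int i = 0, j = mid, k = 0;
    while (i < mid && j < n)
        tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i < mid) tmp[k++] = a[i++];
    while (j < n)   tmp[k++] = a[j++];
    memcpy(a, tmp, (size_t)n * sizeof(int));
}

/* serial_cutoff: ranges of at most this size are sorted by insertion sort.
   task_cutoff:   ranges of at most this size create no deferred tasks
                  (the if clause makes the encountering thread run them). */
static void sort_range(int *a, int *tmp, int n,
                       int serial_cutoff, int task_cutoff)
{
    if (n < 2 || n <= serial_cutoff) {
        insertion_sort(a, n);
        return;
    }
    int mid = n / 2;
    #pragma omp task if(n > task_cutoff)
    sort_range(a, tmp, mid, serial_cutoff, task_cutoff);
    #pragma omp task if(n > task_cutoff)
    sort_range(a + mid, tmp + mid, n - mid, serial_cutoff, task_cutoff);
    #pragma omp taskwait
    merge_halves(a, tmp, mid, n);
}

/* Hypothetical entry point: returns 0 on success, -1 if malloc fails. */
int cutoff_mergesort(int *a, int n, int serial_cutoff, int task_cutoff)
{
    if (n < 2) return 0;
    int *tmp = malloc((size_t)n * sizeof(int));
    if (tmp == NULL) return -1;
    #pragma omp parallel
    {
        #pragma omp single nowait
        sort_range(a, tmp, n, serial_cutoff, task_cutoff);
    }
    free(tmp);
    return 0;
}
```

A call could look like `cutoff_mergesort(data, n, 32, 100000)`. The block also compiles without OpenMP, in which case the pragmas are ignored and the sort runs sequentially.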

+1 for suggesting OpenMP tasks. Although OpenMP 3.0 is already 4 years old, still almost nobody seems to know about OpenMP tasks.

Daniel Landau · Accepted Answer · 2012-09-16 16:16:14Z

It took some figuring out, which is a bit embarrassing, since the answer is so simple once you see it.

As it stands in the question, the program doesn't work correctly: on some runs it duplicates some numbers and loses others. This appears to be purely a parallelism error, since it doesn't arise when running the program with thread_count == 1.

The pragma "parallel sections" is a combined parallel and sections directive, which in this case means that it starts a second parallel region inside the previous one. As a result, every thread of the outer region runs both sections itself, so several threads sort the same halves of the array at the same time. Parallel regions inside other parallel regions are fine, but I think most implementations don't give you extra threads when they encounter a nested parallel region.

The fix is to replace

 #pragma omp parallel sections

with

 #pragma omp sections
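
The difference between the two pragmas can be observed with a small counter. This is only a sketch: the function names are made up for the illustration, and the work of each section is replaced by incrementing a counter. In the nested variant every outer thread opens its own inner region and runs the first section itself; in the fixed variant the sections are shared among the threads of the one region, so the first section runs exactly once.

```c
/* Hypothetical helper: counts how often the first section body runs when
   "parallel sections" is nested inside a parallel region, as in the question. */
int count_runs_nested(int outer_threads)
{
    int runs = 0;

    #pragma omp parallel num_threads(outer_threads)
    {
        /* Combined construct: each outer thread starts its own inner region. */
        #pragma omp parallel sections
        {
            #pragma omp section
            {
                #pragma omp atomic
                runs++;
            }
            #pragma omp section
            {
                /* the second half of the work would go here */
            }
        }
    }
    return runs;
}

/* Same structure with the fix applied: plain "sections" inside the region. */
int count_runs_fixed(int outer_threads)
{
    int runs = 0;

    #pragma omp parallel num_threads(outer_threads)
    {
        /* Worksharing construct: the sections are divided among the threads. */
        #pragma omp sections
        {
            #pragma omp section
            {
                #pragma omp atomic
                runs++;
            }
            #pragma omp section
            {
                /* the second half of the work would go here */
            }
        }
    }
    return runs;
}
```

With OpenMP enabled, `count_runs_nested(2)` reports one run per outer thread the runtime actually provides, while `count_runs_fixed(2)` reports a single run. In the real program each extra run is another thread sorting the same data.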

After this fix, the program gives correct answers, and on a two-core system with a million numbers I get the following timings.

One thread:

time taken: 0.378794

Two threads:

time taken: 0.203178

Since the sections directive only has two sections, I think you won't get any benefit from spawning more threads than two in the parallel clause, so change num_threads(thread_count) -> num_threads(2)

But because at least the two implementations I tried do not spawn new threads for nested parallel regions, the program as it stands doesn't scale to more than two threads.

2 Comments

Rigorous implementation Over a year ago

My bad. The parallel keyword did make a big difference and I didn't realise it. The load balancing will be skipped if I use 4 threads, but I am still using them in order to test the machine's maximum capability.

Hristo Iliev Over a year ago

Most OpenMP implementations would give you more threads for nested parallel regions if you enable nested parallelism by calling omp_set_nested(1) or by setting the environment variable OMP_NESTED to true.
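
A rough way to check what your runtime does with nested regions is to count the threads that reach the inner region. The sketch below is an assumption-laden illustration: `count_inner_threads` is a made-up name, and it uses omp_set_max_active_levels rather than omp_set_nested, because omp_set_nested and OMP_NESTED were deprecated in OpenMP 5.0. On older runtimes you may still need omp_set_nested(1) as described above.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical helper: returns how many threads executed the inner region.
   max_levels = 1 asks for no nested parallelism, 2 allows one nested level. */
int count_inner_threads(int max_levels)
{
    int total = 0;

#ifdef _OPENMP
    omp_set_max_active_levels(max_levels);
#else
    (void)max_levels;  /* compiled without OpenMP: everything runs serially */
#endif

    #pragma omp parallel num_threads(2)
    {
        #pragma omp parallel num_threads(2)
        {
            #pragma omp atomic
            total++;
        }
    }
    return total;
}
```

If `count_inner_threads(2)` is larger than `count_inner_threads(1)`, the runtime did hand out extra threads for the nested region. The exact numbers depend on the implementation and on how many threads it is willing to provide.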
