I have a problem with the parallelization of the following matrix multiplication program. The optimized versions are slower than, or only marginally faster than, the sequential one. I have already searched for the mistake but couldn't find it. I also tested it on another machine, with the same result.
Thanks in advance for your help.
Main:
int main(int argc, char** argv){
    if((matrixA).size != (matrixB).size){
        fprintf(ResultFile, "\tError for %s and %s - Matrix A and B are not of the same size ...\n", argv[1], argv[2]);
    }
    else{
        allocateResultMatrix(&resultMatrix, matrixA.size, 0);
        if(*argv[5] == '1'){ /* sequential execution */
            begin = clock();
            matrixMultSeq(&matrixA, &matrixB, &resultMatrix);
            end = clock();
        }
        if(*argv[5] == '2'){ /* execution with OpenMP */
            printf("Max number of threads: %i \n", omp_get_max_threads());
            begin = clock();
            matrixMultOmp(&matrixA, &matrixB, &resultMatrix);
            end = clock();
        }
        if(*argv[5] == '3'){ /* execution with PThreads */
            pthread_t threads[NUMTHREADS];
            pthread_attr_t attr;
            int i;
            struct parameter arg[NUMTHREADS];
            pthread_attr_init(&attr); /* initialize the thread attribute */
            begin = clock();
            for(i = 0; i < NUMTHREADS; i++){ /* set up and start the individual threads */
                arg[i].id = i;
                arg[i].num_threads = NUMTHREADS;
                arg[i].dimension = matrixA.size;
                arg[i].matrixA = &matrixA;
                arg[i].matrixB = &matrixB;
                arg[i].resultMatrix = &resultMatrix;
                pthread_create(&threads[i], &attr, worker, (void *)(&arg[i]));
            }
            pthread_attr_destroy(&attr);
            for(i = 0; i < NUMTHREADS; i++){ /* wait for the threads to return */
                pthread_join(threads[i], NULL);
            }
            end = clock();
        }
        t = end - begin;
        t /= CLOCKS_PER_SEC;
        if(*argv[5] == '1')
            fprintf(ResultFile, "\tTime for sequential multiplication: %0.10f seconds\n\n", t);
        if(*argv[5] == '2')
            fprintf(ResultFile, "\tTime for OpenMP multiplication: %0.10f seconds\n\n", t);
        if(*argv[5] == '3')
            fprintf(ResultFile, "\tTime for PThread multiplication: %0.10f seconds\n\n", t);
    }
}
void matrixMultOmp(struct matrix * matrixA, struct matrix * matrixB, struct matrix * resultMatrix){
    int i, j, k, l;
    double sum = 0;
    l = (*matrixA).size;
    #pragma omp parallel for private(j, k) firstprivate(sum)
    for(i = 0; i < l; i++){            /* < l, not <= l: valid rows are 0..l-1 */
        for(j = 0; j < l; j++){
            sum = 0;
            for(k = 0; k < l; k++){
                sum = sum + (*matrixA).matrixPointer[i][k] * (*matrixB).matrixPointer[k][j];
            }
            (*resultMatrix).matrixPointer[i][j] = sum;
        }
    }
}
void mm(int thread_id, int numthreads, int dimension, struct matrix* a, struct matrix* b, struct matrix* c){
    int i, j, k;
    double sum;
    i = thread_id;                     /* each thread handles rows thread_id, thread_id + numthreads, ... */
    while(i < dimension){              /* < dimension, not <=: valid rows are 0..dimension-1 */
        for(j = 0; j < dimension; j++){
            sum = 0;
            for(k = 0; k < dimension; k++){
                sum = sum + (*a).matrixPointer[i][k] * (*b).matrixPointer[k][j];
            }
            (*c).matrixPointer[i][j] = sum;
        }
        i += numthreads;
    }
}
void * worker(void * arg){
    struct parameter * p = (struct parameter *) arg;
    /* num_threads: same field name as filled in by main() */
    mm((*p).id, (*p).num_threads, (*p).dimension, (*p).matrixA, (*p).matrixB, (*p).resultMatrix);
    pthread_exit((void *) 0);
}
Here is the output with the times:

Starting calculating resultMatrix for matrices/SimpleMatrixA.txt and matrices/SimpleMatrixB.txt ...
Size of matrixA: 6 elements
Size of matrixB: 6 elements
Time for sequential multiplication: 0.0000030000 seconds
Starting calculating resultMatrix for matrices/SimpleMatrixA.txt and matrices/SimpleMatrixB.txt ...
Size of matrixA: 6 elements
Size of matrixB: 6 elements
Time for OpenMP multiplication: 0.0002440000 seconds
Starting calculating resultMatrix for matrices/SimpleMatrixA.txt and matrices/SimpleMatrixB.txt ...
Size of matrixA: 6 elements
Size of matrixB: 6 elements
Time for PThread multiplication: 0.0006680000 seconds
Starting calculating resultMatrix for matrices/ShortMatrixA.txt and matrices/ShortMatrixB.txt ...
Size of matrixA: 100 elements
Size of matrixB: 100 elements
Time for sequential multiplication: 0.0075190002 seconds
Starting calculating resultMatrix for matrices/ShortMatrixA.txt and matrices/ShortMatrixB.txt ...
Size of matrixA: 100 elements
Size of matrixB: 100 elements
Time for OpenMP multiplication: 0.0076710000 seconds
Starting calculating resultMatrix for matrices/ShortMatrixA.txt and matrices/ShortMatrixB.txt ...
Size of matrixA: 100 elements
Size of matrixB: 100 elements
Time for PThread multiplication: 0.0068080002 seconds
Starting calculating resultMatrix for matrices/LargeMatrixA.txt and matrices/LargeMatrixB.txt ...
Size of matrixA: 1000 elements
Size of matrixB: 1000 elements
Time for sequential multiplication: 9.6421155930 seconds
Starting calculating resultMatrix for matrices/LargeMatrixA.txt and matrices/LargeMatrixB.txt ...
Size of matrixA: 1000 elements
Size of matrixB: 1000 elements
Time for OpenMP multiplication: 10.5361270905 seconds
Starting calculating resultMatrix for matrices/LargeMatrixA.txt and matrices/LargeMatrixB.txt ...
Size of matrixA: 1000 elements
Size of matrixB: 1000 elements
Time for PThread multiplication: 9.8952226639 seconds
Starting calculating resultMatrix for matrices/HugeMatrixA.txt and matrices/HugeMatrixB.txt ...
Size of matrixA: 5000 elements
Size of matrixB: 5000 elements
Time for sequential multiplication: 1981.1383056641 seconds
Starting calculating resultMatrix for matrices/HugeMatrixA.txt and matrices/HugeMatrixB.txt ...
Size of matrixA: 5000 elements
Size of matrixB: 5000 elements
Time for OpenMP multiplication: 2137.8527832031 seconds
Starting calculating resultMatrix for matrices/HugeMatrixA.txt and matrices/HugeMatrixB.txt ...
Size of matrixA: 5000 elements
Size of matrixB: 5000 elements
Time for PThread multiplication: 1977.5153808594 seconds
Instead of clock(), use the omp_get_wtime() or gettimeofday() functions. clock() is not precise and accurate for parallel codes. 4 cores should be okay to get clearly visible performance differences for the said dimensions.

man clock(3): [...] The clock() function returns an approximation of processor time used by the program. [...]

Note the processor time: clock() measures CPU time summed over all threads, so a parallel run can never appear faster than the sequential one, no matter how well it scales in wall-clock time.