I can't figure out why the performance of this function is so bad. I have a core 2 Duo machine and I know its only creating 2 trheads so its not an issue of too many threads. I expected the results to be closer to my pthread results.
these are my compilation flags (purposely not doing any optimization flags) gcc -fopenmp -lpthread -std=c99 matrixMul.c -o matrixMul
These are my results
Sequential matrix multiply: 2.344972
Pthread matrix multiply: 1.390983
OpenMP matrix multiply: 2.655910
CUDA matrix multiply: 0.055871
Pthread Test PASSED
OpenMP Test PASSED
CUDA Test PASSED
void openMPMultiply(Matrix* a, Matrix* b, Matrix* p)
{
//int i,j,k;
memset(*p, 0, sizeof(Matrix));
int tid, nthreads, i, j, k, chunk;
#pragma omp parallel shared(a,b,p,nthreads,chunk) private(tid,i,j,k)
{
tid = omp_get_thread_num();
if (tid == 0)
{
nthreads = omp_get_num_threads();
}
chunk = 20;
// #pragma omp parallel for private(i, j, k)
#pragma omp for schedule (static, chunk)
for(i = 0; i < HEIGHT; i++)
{
//printf("Thread=%d did row=%d\n",tid,i);
for(j = 0; j < WIDTH; j++)
{
//#pragma omp parallel for
for(k = 0; k < KHEIGHT ; k++)
(*p)[i][j] += (*a)[i][k] * (*b)[k][j];
}
}
}
}
Thanks for any help.