
I have this very simple, embarrassingly parallel code that I am using to learn OpenMP. However, I don't get the superlinear, or at least linear, speedup I expected.

#pragma omp parallel num_threads(cores) 
{
   int id = omp_get_thread_num(); 
   cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, row, column, column, 1.0, MatrixA1[id], column, MatrixB[id], column, 0.0, Matrixmultiply[id], column); 
} 

On Visual Studio, using the Intel C++ Compiler XE 15.0 and computing sgemm (matrix multiplication) on 288 by 288 matrices, I get 350 microseconds for cores=1 and 1177 microseconds for cores=4, which just looks like sequential execution. I set the Intel MKL property to Parallel (also tested with Sequential) and the language setting to Generate Parallel Code (/Qopenmp). Any way to improve this? I am running on a quad-core Haswell processor.

  • What's your CPU load like? (Using cores=1 and cores=4) Commented Mar 17, 2015 at 14:01
  • @Pixelchemist it's a quadcore machine Commented Mar 17, 2015 at 14:05
  • Yeah and what utilization do you see using cores=1 and cores=4? 25% and 100% respectively? Commented Mar 17, 2015 at 14:06
  • @Pixelchemist yes approximately. Commented Mar 17, 2015 at 14:09
  • What Pixelchemist is getting at is that if your CPU load on a quad-core machine is at 25%, that is probably one core at 100%. You can have a multi-threaded program that only gets allocated to one CPU core; each thread will get some CPU time, but overall performance will be roughly the same as a single-threaded application. Commented Mar 17, 2015 at 14:29

1 Answer


If your input takes only a few hundred microseconds to compute, as you say, there is no way 4 threads can finish in less time than that. Essentially, your input data is too small to benefit from parallelization, because creating and synchronizing threads has its own overhead, which at this scale dominates the actual computation.

Try increasing the input size until the computation takes a few seconds, and repeat the experiment.

You might also then run into false sharing, for example, but at these input sizes that is not the main concern.

What you can do to improve performance is vectorize the code (but in this case you can't directly, because you are using a library call, i.e. you'd have to write the function yourself).
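If you did write the kernel yourself, the inner loop could carry an OpenMP SIMD hint. A sketch of that route (the function name is made up; this is nowhere near a tuned replacement for MKL, which also blocks for cache and transposes operands):

```c
/* hand-written sgemm-style kernel with a SIMD hint on the
   reduction loop; compile with -fopenmp or -fopenmp-simd */
void sgemm_simd(int n, const float *A, const float *B, float *C) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            float sum = 0.0f;
            #pragma omp simd reduction(+:sum)
            for (int k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}
```

Note that the strided access to B (stride n along k) limits what the vectorizer can do here; real BLAS kernels reorder the data precisely to avoid this.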


2 Comments

Basically I run the code 5000 times and take the average. I want to run sgemm sequentially on each core: I have 4 288x288 matrices that I want to compute on 4 different cores, independently of each other. However, even after setting MKL to use the sequential library, the results suggest the calls still depend on each other, i.e. each core seems to wait for the previous one to complete. I want to remove this dependency.
Even if that holds and you can get rid of the dependency, you won't benefit much from parallel computing here; that's my point.
