1

I'm currently facing a performance issue when calling intel mkl inside an openmp loop. Let me explain my problem in more detail after posting a simplified code.

program Test
use omp_lib
implicit none
double complex, allocatable :: RhoM(:,:), Rho1M(:,:)
integer :: ik, il, ij, N, M, Y

M = 20
Y = 2000000
N = 500

allocate(RhoM(M,N),Rho1M(M,N))
RhoM = (1.0d0,0.0d0)
Rho1M = (0.0d0,1.0d0)

call omp_set_num_threads(4)

do il=1,Y
Rho1M = (0.0d0,1.0d0)
!$omp parallel do private(ik)
 do ik=1,N
  call zaxpy(M, (1.0d0,0.0d0), RhoM(:,ik:ik), 1, Rho1M(:,ik:ik), 1)
 end do
 !$omp end parallel do
end do    
end program Test

Basically, this program does an in-place matrix summation. However, it does not make any sense, it is just a simplified code. I'm running Windows 10 Pro and using the intel fortran compiler(Version 19.1.0.166). I compile with: ifort -o Test.exe Test.f90 /fast /O3 /Qmkl:sequential /debug:all libiomp5md.lib /Qopenmp. Since the "vectors" used by zaxpy aren't that large, I tried to use openmp in order to speed up the program. I checked the running time with the vtune tool from intel (thats the reason for the debug all flag). I have a i5 4430 meaning 4 threads and 4 physical cores.

Time with openmp: 107s; Time without openmp: 44s

The funny thing is that with increasing amount of threads, the program is slower. Vtune tells me that more threads are used, however, the computational time increases. This seems to be very counter intuitive.

Of course, I am not the first one facing problems like this. I will attach some links and discuss why it did not work for me.

Intel provides information about how to choose parameters (https://software.intel.com/en-us/articles/recommended-settings-for-calling-intel-mkl-routines-from-multi-threaded-applications). However, I'm linking with the sequential intel mkl. If I try the suggested parameters with parallel intel mkl, I'm still slow.

It seems to be important to switch on omp_set_nested(1) (Number of threads of Intel MKL functions inside OMP parallel regions). Firstly, this parameter is deprecated. When I use omp_set_max_active_levels() I cannot see any difference.

This is probably the most suitable question (Calling multithreaded MKL in from openmp parallel region). However, I use sequential intel mkl and have not to care about the mkl threads.

This one here (OpenMP parallelize multiple sequential loops) says I should try using schedule. I tried dynamic and static with different values of chunk size, however, it did not help at all, since the amount of work that has to be done per thread it exactly the same.

It would be very nice, if you have an idea why the program slows down by increasing the thread size.

If you need any further information, please tell me.

1 Answer 1

1

Seems to be the case that openmp destroys and creates the splitting into the threads 2000000 times. That causes the additional computational time. See the post from Andrew (https://software.intel.com/en-us/forums/intel-fortran-compiler/topic/733673) and the post from Jim Dempsey.

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.