Call sequential intel mkl from openmp loop

Question

I'm currently facing a performance issue when calling Intel MKL inside an OpenMP loop. I will explain the problem in more detail after the simplified code below.

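program Test
  use omp_lib
  implicit none
  double complex, allocatable :: RhoM(:,:), Rho1M(:,:)
  integer :: ik, il, ij, N, M, Y

  M = 20        ! length of each column vector passed to zaxpy
  Y = 2000000   ! number of outer (sequential) repetitions
  N = 500       ! number of columns, i.e. iterations of the parallel loop

  allocate(RhoM(M,N), Rho1M(M,N))
  RhoM  = (1.0d0,0.0d0)
  Rho1M = (0.0d0,1.0d0)

  call omp_set_num_threads(4)

  do il = 1, Y
    Rho1M = (0.0d0,1.0d0)
    ! A new parallel region is opened and closed on every outer iteration
    !$omp parallel do private(ik)
    do ik = 1, N
      call zaxpy(M, (1.0d0,0.0d0), RhoM(:,ik:ik), 1, Rho1M(:,ik:ik), 1)
    end do
    !$omp end parallel do
  end do
end program Test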

Basically, this program does an in-place matrix summation. It does nothing meaningful in itself; it is just a simplified example. I'm running Windows 10 Pro and using the Intel Fortran compiler (version 19.1.0.166). I compile with: ifort -o Test.exe Test.f90 /fast /O3 /Qmkl:sequential /debug:all libiomp5md.lib /Qopenmp. Since the "vectors" used by zaxpy aren't that large, I tried to use OpenMP to speed up the program. I measured the running time with Intel's VTune tool (that's the reason for the /debug:all flag). I have an i5 4430, meaning 4 threads and 4 physical cores.

Time with openmp: 107s; Time without openmp: 44s

The funny thing is that the program gets slower as the number of threads increases. VTune tells me that more threads are used, yet the run time goes up. This seems very counterintuitive.

Of course, I am not the first one facing problems like this. I will attach some links and discuss why it did not work for me.

Intel provides information about how to choose parameters (https://software.intel.com/en-us/articles/recommended-settings-for-calling-intel-mkl-routines-from-multi-threaded-applications). However, I'm linking with the sequential Intel MKL. If I try the suggested parameters with the parallel Intel MKL, the program is still slow.

It seems to be important to switch on omp_set_nested(1) (Number of threads of Intel MKL functions inside OMP parallel regions). However, this routine is deprecated, and when I use omp_set_max_active_levels() instead I cannot see any difference.

This is probably the most suitable question (Calling multithreaded MKL in from openmp parallel region). However, I use the sequential Intel MKL and so do not have to care about the MKL threads.

This one here (OpenMP parallelize multiple sequential loops) says I should try using schedule. I tried dynamic and static with different chunk sizes, but it did not help at all, since the amount of work per thread is exactly the same.

It would be very nice if you had an idea why the program slows down as the number of threads increases.

If you need any further information, please tell me.

user553052 · Accepted Answer · 2020-03-31 20:54:52Z

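It seems that OpenMP sets up and tears down the parallel region (the split of the work across the thread team) 2000000 times, once per iteration of the outer loop. Since each parallel region only covers 500 short zaxpy calls, that repeated overhead likely accounts for the additional run time. See the post from Andrew (https://software.intel.com/en-us/forums/intel-fortran-compiler/topic/733673) and the post from Jim Dempsey.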
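If the repeated region setup is indeed the cause, one thing to try is to open the parallel region once, outside the outer loop, and keep only the worksharing construct inside it. The following is a minimal sketch under that assumption, not a measured fix: the program name TestHoisted and the timing variables t0 and t1 are illustrative, and the reset of Rho1M is moved into the parallel loop so that no thread writes to the whole array while others are working. It still needs to be linked against MKL (or another BLAS) for zaxpy.

```fortran
program TestHoisted
  use omp_lib
  implicit none
  double complex, allocatable :: RhoM(:,:), Rho1M(:,:)
  integer :: ik, il, N, M, Y
  double precision :: t0, t1   ! illustrative timing variables

  M = 20
  Y = 2000000
  N = 500

  allocate(RhoM(M,N), Rho1M(M,N))
  RhoM  = (1.0d0,0.0d0)
  Rho1M = (0.0d0,1.0d0)

  call omp_set_num_threads(4)
  t0 = omp_get_wtime()

  ! The thread team is created once; every thread runs the outer loop.
  !$omp parallel private(il, ik)
  do il = 1, Y
    ! Only the worksharing loop is repeated. Its implicit barrier at
    ! "end do" keeps the outer iterations in step across threads.
    !$omp do schedule(static)
    do ik = 1, N
      ! Reset the column here instead of resetting the whole array
      ! outside the loop, so each thread touches only its own columns.
      Rho1M(:,ik) = (0.0d0,1.0d0)
      call zaxpy(M, (1.0d0,0.0d0), RhoM(:,ik), 1, Rho1M(:,ik), 1)
    end do
    !$omp end do
  end do
  !$omp end parallel

  t1 = omp_get_wtime()
  print *, 'elapsed (s): ', t1 - t0
  ! Printing a result element keeps the compiler from discarding the work.
  print *, 'Rho1M(1,1) = ', Rho1M(1,1)
end program TestHoisted
```

Note that this still synchronizes the threads once per outer iteration, so with only 500 columns of length 20 per iteration the parallel version may remain slower than the sequential one; comparing the printed elapsed time against the original program would show whether the change helps on a given machine.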

