Consider three versions of the following Fortran code using coarrays.
Version 1:
program test
  implicit none
  integer, parameter :: dp = kind(0.d0)
  complex(dp), allocatable :: G(:,:,:,:,:,:,:)[:]
  complex(dp), allocatable :: s_fft(:,:,:,:,:)
  real(dp) :: L3
  integer :: L, n, u, v, a, b, idx
  L = 6
  L3 = L**3
  allocate(G(L,L,L,5,5,4,4)[*])
  allocate(s_fft(L,L,L,5,4))
  G = (0.d0,0.d0)
100 idx = this_image()
  do n = 1, 10000
    s_fft = (1.d0,0.d0) ! in practice this is updated by a very fast FFT routine
    do u = 1, 4
      do v = 1, 4
        do a = 1, 5
          do b = 1, 5
200         G(:,:,:,b,a,v,u) = G(:,:,:,b,a,v,u) + s_fft(:,:,:,b,v)*conjg(s_fft(:,:,:,a,u))/L3
          end do
        end do
      end do
    end do
  end do
end program
Version 2: Same as version 1 but with
200 G(:,:,:,b,a,v,u)[idx] = G(:,:,:,b,a,v,u)[idx] + s_fft(:,:,:,b,v)*conjg(s_fft(:,:,:,a,u))/real(L**3,dp)
Version 3: Same as version 2 but with
100 idx = 1 + modulo( this_image()+1 , num_images() )
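To make the target of the remote write concrete, here is a small serial sketch that evaluates the formula from line 100 of Version 3 for every image. The name n_images and the loop variable me are stand-ins for num_images() and this_image(); the value 8 is an assumption matching the 8 coarray images used for the timings.

```fortran
program idx_mapping
  implicit none
  integer, parameter :: n_images = 8   ! assumed image count (stand-in for num_images())
  integer :: me                        ! stand-in for this_image()

  do me = 1, n_images
    ! Same expression as line 100 of Version 3, with this_image() -> me
    print '(a,i0,a,i0)', 'image ', me, ' -> idx ', 1 + modulo(me + 1, n_images)
  end do
end program idx_mapping
```

Each image ends up targeting the image two positions ahead, wrapping around at the end (image 7 writes to image 1, image 8 to image 2), so no two images write to the same copy of G.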
Compiling with ifx -O3 -coarray I find on my machine (8 coarray images) the following performance:
V1 : 2 seconds
V2 : 15 seconds
V3 : 100 seconds
I have two questions:
- Why is V2 slower than V1 when they perform the same task (writing to the local copy of G)?
- Is there anything I can do to improve the performance of V3 (writing to another image's copy of G)?
In practice I want V3: I am implementing a parallel tempering Markov chain Monte Carlo algorithm. Each image keeps a copy of the state of the Markov chain, and different images can swap their temperatures, here represented by idx. Measurements performed at each step of the Markov chain need to be collected by temperature. Each image keeps the data arrays (here G) from its initial temperature, so that these arrays do not need to be copied each time temperatures are swapped. However, I am finding that line 200 consumes about 80% of the runtime of my entire code (i.e. it takes 5x longer than running the Markov chain updates, making and saving other measurements, and performing FFTs).
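One restructuring that could be tried is to accumulate into an ordinary local array and only push the accumulated block to the remote image every so often. The sketch below is untested for performance and is only meant to show the bookkeeping. The names G_loc, n_flush and n_steps are hypothetical, and it assumes idx stays fixed between flushes and that no two images target the same image at the same time, as in Version 3.

```fortran
program batched_flush
  implicit none
  integer, parameter :: dp = kind(0.d0)
  integer, parameter :: L = 6, n_steps = 10000
  integer, parameter :: n_flush = 100   ! hypothetical batch size; divides n_steps here
  complex(dp), allocatable :: G(:,:,:,:,:,:,:)[:]
  complex(dp), allocatable :: G_loc(:,:,:,:,:,:,:)   ! hypothetical local buffer (not a coarray)
  complex(dp), allocatable :: s_fft(:,:,:,:,:)
  real(dp) :: L3
  integer :: n, u, v, a, b, idx

  L3 = real(L**3, dp)
  allocate(G(L,L,L,5,5,4,4)[*])
  allocate(G_loc(L,L,L,5,5,4,4), s_fft(L,L,L,5,4))
  G = (0.d0,0.d0)
  G_loc = (0.d0,0.d0)
  sync all   ! make sure every image has zeroed its G before any remote update

  idx = 1 + modulo(this_image()+1, num_images())

  do n = 1, n_steps
    s_fft = (1.d0,0.d0)   ! placeholder for the FFT update
    do u = 1, 4
      do v = 1, 4
        do a = 1, 5
          do b = 1, 5
            ! Purely local accumulation: no coindexed access in the hot loop
            G_loc(:,:,:,b,a,v,u) = G_loc(:,:,:,b,a,v,u) &
                 + s_fft(:,:,:,b,v)*conjg(s_fft(:,:,:,a,u))/L3
          end do
        end do
      end do
    end do
    ! One large remote update per batch instead of 400 small ones per step
    if (modulo(n, n_flush) == 0) then
      G(:,:,:,:,:,:,:)[idx] = G(:,:,:,:,:,:,:)[idx] + G_loc
      G_loc = (0.d0,0.d0)
    end if
  end do

  sync all   ! all remote updates are complete before G is read locally
  print *, 'image', this_image(), 'checksum', real(sum(G), dp)
end program batched_flush
```

In the real code the buffer would also have to be flushed just before idx changes in a temperature swap, so that each batch lands in the array of the temperature it was measured at. The extra memory for G_loc is the same size as G.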
ifx... I do not see issues in gfortran while analysing the resulting code. Besides, the code produced by ifx is very inefficient because it does not vectorize the code and calls pow because of L**3 (it does not vectorize the code even when L*L*L is used instead). Moreover, it looks like ifx uses a library for coarray operations by default. Thus, the generated code might not be optimal, and this may explain the difference, since the inner loop is rather cheap (only 6*6*6 computed items). Note I am not an expert in coarrays. I was just curious to see what happens.

... s_fft has an incompatible dimension between its use, declaration and its allocation) and G is modified but never read, so a clever compiler can just remove the expensive main loop (gfortran does that). It would help to have a fully reproducible code.

... G array are too big to fit in the L1/L2 cache of most CPUs and this is done in a hot loop).
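As a sketch of what a reproducible timing of Version 1 might look like, the variant below reads G after the loop (so the loop cannot simply be optimized away) and measures the elapsed time with system_clock. The program name, the variables t0, t1, rate and the checksum printout are additions for illustration and not part of the original code.

```fortran
program test_timed
  implicit none
  integer, parameter :: dp = kind(0.d0)
  integer, parameter :: i8 = selected_int_kind(18)
  integer, parameter :: L = 6, n_steps = 10000
  complex(dp), allocatable :: G(:,:,:,:,:,:,:)[:]
  complex(dp), allocatable :: s_fft(:,:,:,:,:)
  real(dp) :: L3, elapsed
  integer :: n, u, v, a, b
  integer(i8) :: t0, t1, rate

  L3 = real(L**3, dp)
  allocate(G(L,L,L,5,5,4,4)[*])
  allocate(s_fft(L,L,L,5,4))
  G = (0.d0,0.d0)

  call system_clock(count=t0, count_rate=rate)
  do n = 1, n_steps
    s_fft = (1.d0,0.d0)   ! placeholder for the FFT update
    do u = 1, 4
      do v = 1, 4
        do a = 1, 5
          do b = 1, 5
            G(:,:,:,b,a,v,u) = G(:,:,:,b,a,v,u) &
                 + s_fft(:,:,:,b,v)*conjg(s_fft(:,:,:,a,u))/L3
          end do
        end do
      end do
    end do
  end do
  call system_clock(count=t1)
  elapsed = real(t1 - t0, dp) / real(rate, dp)

  ! Reading G here keeps the accumulation loop observable to the compiler
  print '(a,i0,a,f0.3,a,es14.6)', 'image ', this_image(), ': ', elapsed, &
        ' s, checksum ', real(sum(G), dp)
end program test_timed
```

Replacing the update line with the coindexed forms of Versions 2 and 3 (plus a sync all before the checksum) would give the corresponding timings for those versions.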