Consider three versions of the following Fortran code using coarrays.

Version 1:

program test

    implicit none

    integer, parameter :: dp = kind(0.d0)

    complex(dp), allocatable :: G(:,:,:,:,:,:,:)[:]                           
    complex(dp), allocatable :: s_fft(:,:,:,:,:)                         

    real(dp) :: L3

    integer :: L, u, v, a, b, idx, n

    L  = 6
    L3 = L**3

    allocate(G(L,L,L,5,5,4,4)[*])
    allocate(s_fft(L,L,L,5,4))

    G   = (0.d0,0.d0)

100 idx = this_image()

    do n = 1, 10000

        s_fft = (1.d0,0.d0) !in practice this is updated by a very fast FFT routine
    
        do u = 1, 4
            do v = 1, 4
                do a = 1, 5
                    do b = 1, 5
200                     G(:,:,:,b,a,v,u) = G(:,:,:,b,a,v,u) + s_fft(:,:,:,b,v)*conjg(s_fft(:,:,:,a,u))/L3
                    end do
                end do
            end do
        end do
    end do

end program

Version 2: Same as version 1 but with

200   G(:,:,:,b,a,v,u)[idx] = G(:,:,:,b,a,v,u)[idx] + s_fft(:,:,:,b,v)*conjg(s_fft(:,:,:,a,u))/real(L**3,dp)

Version 3: Same as version 2 but with

100  idx = 1 + modulo( this_image()+1 , num_images() )
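To make the Version 3 target concrete, here is a small serial sketch (no coarrays needed) that prints the mapping for the 8-image case. The names `nimg` and `me` are stand-ins introduced here for `num_images()` and `this_image()`; they are not part of the original code.

```fortran
program idx_mapping_sketch
    implicit none
    integer, parameter :: nimg = 8   ! hypothetical stand-in for num_images()
    integer :: me, idx

    do me = 1, nimg                  ! me plays the role of this_image()
        ! Same expression as label 100 in Version 3
        idx = 1 + modulo(me + 1, nimg)
        print '(a,i0,a,i0)', 'image ', me, ' -> idx ', idx
    end do
end program idx_mapping_sketch
```

With 8 images, images 1 through 6 target images 3 through 8, and images 7 and 8 wrap around to 1 and 2, so every image writes to a different remote image and never to itself.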

Compiling with ifx -O3 -coarray I find on my machine (8 coarray images) the following performance:

V1 : 2   seconds
V2 : 15  seconds
V3 : 100 seconds

I have two questions:

  1. Why is V2 slower than V1 when they perform the same task (writing to the local copy of G)?
  2. Is there anything I can do to improve the performance of V3 (writing to another image's copy of G)?

In practice I want V3: I am implementing a parallel tempering Markov chain Monte Carlo algorithm. Each image keeps a copy of the state of the Markov chain, and different images can swap their temperatures, here represented by idx. Measurements performed at each step of the Markov chain need to be collected by temperature. Each image keeps the data arrays (here G) from its initial temperature, so that these arrays do not need to be copied each time temperatures are swapped. However, I am finding that line 200 consumes about 80% of the runtime of my entire code (i.e. it takes 5x longer than running the Markov chain updates, making and saving other measurements, and performing FFTs combined).

  • I wonder if this comes from ifx... I do not see issues in gfortran when analysing the resulting code. Besides, the code produced by ifx is very inefficient because it does not vectorize the loop and it calls pow because of L**3 (it does not vectorize the loop even when L*L*L is used instead). Moreover, it looks like ifx uses a library for coarray operations by default. Thus, the generated code might not be optimal, which may explain the difference, since the inner loop is rather cheap (only 6*6*6 computed items). Note that I am not an expert in coarrays. I was just curious to see what happens. Commented Feb 28 at 14:18
  • Note that the current code does not build, btw (for example, s_fft has incompatible dimensions between its use, its declaration and its allocation), and G is modified but never read, so a clever compiler can just remove the expensive main loop (gfortran does that). It would help to have a fully reproducible example. Commented Feb 28 at 14:21
  • For the last version, a significant overhead seems normal to me since there are data transfers between images (the whole loop should already be memory-bound: the G arrays are too big to fit in the L1/L2 cache of most CPUs, and this is done in a hot loop). Commented Feb 28 at 14:24
  • @JérômeRichard Apologies, I have updated the code to work correctly now. I actually re-ran it and got even worse performance for the third version. Thanks for your comments though. I'll try testing with another compiler and vectorization options and see if I get anything different. Commented Feb 28 at 14:55
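
Following the comments about reproducibility, a self-timing variant of Version 1 might look like the sketch below. It is only an illustration under stated assumptions: the step count `nsteps` is reduced from the question's 1e4, and the timing variables and the final checksum print are additions that are not in the original code. The checksum reads G so that the accumulation stays observable and is harder for the compiler to discard.

```fortran
program coarray_bench_sketch
    use, intrinsic :: iso_fortran_env, only: int64
    implicit none

    integer, parameter :: dp = kind(0.d0)
    integer, parameter :: nsteps = 100   ! assumption: fewer steps than the question's 1e4

    complex(dp), allocatable :: G(:,:,:,:,:,:,:)[:]
    complex(dp), allocatable :: s_fft(:,:,:,:,:)
    real(dp) :: L3
    integer :: L, u, v, a, b, n
    integer(int64) :: t0, t1, rate       ! timing variables (not in the original)

    L  = 6
    L3 = real(L, dp)**3

    allocate(G(L,L,L,5,5,4,4)[*])
    allocate(s_fft(L,L,L,5,4))
    G = (0.d0, 0.d0)

    sync all                             ! start all images together
    call system_clock(t0, rate)

    do n = 1, nsteps
        s_fft = (1.d0, 0.d0)             ! placeholder for the FFT update
        do u = 1, 4
            do v = 1, 4
                do a = 1, 5
                    do b = 1, 5
                        ! Version 1 update: purely local access to G
                        G(:,:,:,b,a,v,u) = G(:,:,:,b,a,v,u) &
                            + s_fft(:,:,:,b,v)*conjg(s_fft(:,:,:,a,u))/L3
                    end do
                end do
            end do
        end do
    end do

    call system_clock(t1)

    ! Reading G here keeps the main loop from being dead code
    print '(a,i0,a,f10.4,a,es12.5)', 'image ', this_image(), ': ', &
        real(t1 - t0, dp)/real(rate, dp), ' s, checksum ', sum(real(G, dp))
end program coarray_bench_sketch
```

Built with the same `ifx -O3 -coarray` flags as in the question, each image prints its own elapsed time. Replacing the update with the `[idx]` forms from Versions 2 and 3 should let the three variants be compared on equal footing.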
