I am currently trying to port a large portion of a Fortran code to GPU devices with OpenMP. I have a working version for AMD, specifically for the MI300A, which features unified shared memory. I achieve a good speedup on this platform given the simulation parameters I need to use. The exact same version can also be compiled to target NVIDIA platforms, in which case explicit data transfer directives are activated. I use nvfortran with the options -O3 -mp=gpu -Minfo=mp -gpu=cc90 (I am targeting H100 GPUs). The issue is that the kernel is painfully slow, and I cannot pinpoint the cause even after profiling with nsys and ncu. I also made a fine-grained check of the data transfers with NV_ACC_NOTIFY to make sure that no implicit, unwanted transfers occur.
For this reason, I created a minimal example that adds two arrays, an embarrassingly parallel operation. Even with an array size of 1e9, the GPU version is slower than the CPU version. Here is the test program:
program vector_addition
  use omp_lib
  implicit none
  integer, parameter :: n = 1000000000
  real, allocatable, dimension(:) :: a, b, c, c_cpu
  real :: start_time, end_time
  integer :: i

  ! Allocate arrays
  allocate(a(n), b(n), c(n), c_cpu(n))

  ! Initialize arrays
  call random_number(a)
  call random_number(b)

  ! ==========================================================
  ! CPU execution (serial loop, no OpenMP directive)
  ! ==========================================================
  write(*,*) 'Starting CPU computation...'
  call cpu_time(start_time)
  do i = 1, n
    c_cpu(i) = a(i) + b(i)
  end do
  call cpu_time(end_time)
  print *, 'Time taken for CPU: ', end_time - start_time, ' seconds'

  ! ==========================================================
  ! OpenMP offload to GPU (timing includes the data transfers)
  ! ==========================================================
  call cpu_time(start_time)
  !$omp target teams distribute parallel do map(to:a,b) map(from:c)
  do i = 1, n
    c(i) = a(i) + b(i)
  end do
  !$omp end target teams distribute parallel do
  call cpu_time(end_time)
  print *, 'Time taken for GPU offload: ', end_time - start_time, ' seconds'

  ! Compare the results (c_cpu holds the CPU result, c the GPU result)
  do i = 1, n
    if (c(i) /= c_cpu(i)) then
      print *, 'Mismatch at index ', i, ': CPU = ', c_cpu(i), ', GPU = ', c(i)
      exit
    end if
  end do

  ! Deallocate arrays
  deallocate(a, b, c, c_cpu)
end program vector_addition
The output is the following:
Time taken for CPU: 0.9353199 seconds
Time taken for GPU offload: 1.690033 seconds
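One way to narrow this down would be to time the data mapping and the kernel separately, since the timed region above includes both. The sketch below is a hypothetical variant of the test program, not something the timings above were produced with. It assumes an enclosing target data region so that the arrays are already present on the device when the kernel runs, and it uses omp_get_wtime (wall-clock time) rather than cpu_time. The program name, the timer variables t0 to t3, and the smaller n are illustrative choices.

```fortran
program vector_addition_timed
  use omp_lib
  implicit none
  ! Smaller n than in the original test so the sketch runs quickly (assumption)
  integer, parameter :: n = 100000000
  real, allocatable, dimension(:) :: a, b, c
  double precision :: t0, t1, t2, t3
  integer :: i

  allocate(a(n), b(n), c(n))
  call random_number(a)
  call random_number(b)

  t0 = omp_get_wtime()
  ! Entering the data region allocates a, b, c on the device and copies a, b.
  ! As the first offload construct, it may also include device initialization.
  !$omp target data map(to:a,b) map(from:c)
  t1 = omp_get_wtime()

  ! The arrays are already present here, so no transfer is expected for the kernel
  !$omp target teams distribute parallel do
  do i = 1, n
    c(i) = a(i) + b(i)
  end do
  !$omp end target teams distribute parallel do
  t2 = omp_get_wtime()

  ! Leaving the data region copies c back to the host
  !$omp end target data
  t3 = omp_get_wtime()

  print *, 'Device setup + host-to-device: ', t1 - t0, ' seconds'
  print *, 'Kernel only:                   ', t2 - t1, ' seconds'
  print *, 'Device-to-host copy of c:      ', t3 - t2, ' seconds'

  deallocate(a, b, c)
end program vector_addition_timed
```

It can be built with the same nvfortran options as above. If most of the time shows up in the first and last intervals, the cost is in moving the data rather than in the kernel itself.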
Do you have any idea why even such a simple case is slower on the GPU? Am I missing a fundamental concept here?