Fortran OpenMP offloading painfully slow on NVIDIA architectures

Question

I am currently trying to port a large portion of a Fortran code to GPU devices with OpenMP. I have a working version for AMD, specifically for the MI300A, which features unified shared memory, and I achieve a good speedup on this platform with the simulation parameters I need. The exact same version can also be compiled for NVIDIA platforms, in which case explicit data transfer directives are activated. I use nvfortran with the options -O3 -mp=gpu -Minfo=mp -gpu=cc90 (I am targeting H100 GPUs). The issue is that the kernel is painfully slow, and I cannot pinpoint the cause even after profiling with nsys and ncu. I also made a fine-grained check of data transfers with NV_ACC_NOTIFY to make sure that no implicit, unwanted transfer takes place.

For this reason, I created a minimal example that adds two arrays, an embarrassingly parallel operation. Even with an array size of 1e9, the GPU version is slower. Here is the dummy program.

program vector_addition
  use omp_lib
  implicit none

  integer, parameter :: n = 1000000000
  real, allocatable, dimension(:) :: a, b, c, c_cpu
  real :: start_time, end_time
  integer i

  ! Allocate arrays
  allocate(a(n), b(n), c(n), c_cpu(n))

  ! Initialize arrays
  call random_number(a)
  call random_number(b)

  ! ==========================================================
  !        Sequential CPU execution
  ! ==========================================================
  write(*,*) 'Starting CPU computation...'
  call cpu_time(start_time)

  do i = 1, n
     c_cpu(i) = a(i) + b(i)
  end do

  call cpu_time(end_time)

  print *, 'Time taken for CPU: ', end_time - start_time, ' seconds'

  ! ==========================================================
  !        OpenMP Offload to GPU
  ! ==========================================================
  call cpu_time(start_time)

  !$omp target teams distribute parallel do map(to:a,b) map(from:c)
  do i = 1, n
     c(i) = a(i) + b(i)
  end do
  !$omp end target teams distribute parallel do

  call cpu_time(end_time)
  print *, 'Time taken for GPU offload: ', end_time - start_time, ' seconds'

  ! compare the results
  do i = 1, n
     if (c(i) /= c_cpu(i)) then
        print *, 'Mismatch at index ', i, ': CPU = ', c_cpu(i), ', GPU = ', c(i)
        exit
     end if
  end do

  ! Deallocate arrays
  deallocate(a, b, c, c_cpu)

end program vector_addition

The output is the following:

Time taken for CPU:    0.9353199      seconds
Time taken for GPU offload:     1.690033      seconds

Do you have any idea why even such a simple case is not working? Am I missing any fundamental concept here?

Please note that the CPU time is sequential while the whole GPU is used for the target part, so the benchmark is not fair. In fact, the GPU has no chance of being faster here, because this operation should saturate the RAM bandwidth on the CPU when multiple cores are used. The GPU cannot move data faster than the host RAM permits. It will actually be slower because of the interconnect, as pointed out by Joachim. You should avoid data transfers when using GPUs by keeping data in device RAM as much as possible. If you cannot, GPUs are useless unless the computational intensity is pretty high.
– Jérôme Richard, Commented Jul 29 at 18:38
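
A fairer host baseline, along the lines of the comment above, would run the loop on all cores and time it with a wall clock. The following is only a sketch under a few assumptions: the module and subroutine names are made up for illustration, n is reduced to 1e8 so that the arrays fit on smaller machines, and system_clock is used for timing (omp_get_wtime from omp_lib would serve the same purpose). cpu_time is avoided here because it reports processor time, which typically adds up the time of all threads.

```fortran
module cpu_baseline_mod
  implicit none
contains
  ! Hypothetical helper: c = a + b using all host threads.
  subroutine add_parallel(a, b, c)
    real, intent(in)  :: a(:), b(:)
    real, intent(out) :: c(:)
    integer :: i, n

    n = size(c)
    !$omp parallel do
    do i = 1, n
       c(i) = a(i) + b(i)
    end do
    !$omp end parallel do
  end subroutine add_parallel
end module cpu_baseline_mod

program cpu_baseline
  use iso_fortran_env, only: int64, real64
  use cpu_baseline_mod
  implicit none

  ! Assumption: smaller than the 1e9 used in the question.
  integer, parameter :: n = 100000000
  real, allocatable :: a(:), b(:), c_cpu(:)
  integer(int64) :: t0, t1, rate

  allocate(a(n), b(n), c_cpu(n))
  call random_number(a)
  call random_number(b)

  ! Wall-clock timing around the multithreaded loop.
  call system_clock(count=t0, count_rate=rate)
  call add_parallel(a, b, c_cpu)
  call system_clock(count=t1)

  print *, 'Wall time for multicore CPU: ', &
       real(t1 - t0, real64) / real(rate, real64), ' seconds'

  deallocate(a, b, c_cpu)
end program cpu_baseline
```

To run the loop on several cores, the sketch can be built with, for example, nvfortran -O3 -mp=multicore or gfortran -O3 -fopenmp. Without an OpenMP flag the directives are treated as comments and the loop runs sequentially.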

Joachim · Accepted Answer · 2025-07-29 18:14:19Z

The execution time of the target region is dominated by the data transfer to and from the GPU. The computation is trivial and has no chance to amortize the cost of data movements.

To check the cost of the individual steps, I split the code into enter data, compute, and exit data phases and added a time measurement to each part; see below.

The output is now something like:

 Starting CPU computation...
 Time taken for CPU:     1.135145      seconds
 Time taken for enter data:    0.8227592      seconds
 Time taken for GPU offload:    5.9058666E-03  seconds
 Time taken for exit data:    0.9464250      seconds

The computation itself is clearly faster on the GPU than on the CPU (about 6 ms versus 1.1 s); the time goes into the enter data and exit data transfers.
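
As a rough cross-check, the timings above can be converted into effective bandwidths. The sketch below is only back-of-the-envelope arithmetic: it assumes 4 bytes per default real, counts two arrays for enter data, one for exit data, and three for the kernel (two reads, one write), and uses a hypothetical helper gb_per_s. The enter data time also includes the device allocation, and the times come from cpu_time, so the figures are estimates rather than measurements of the interconnect.

```fortran
module bandwidth_mod
  use iso_fortran_env, only: dp => real64
  implicit none
contains
  ! Hypothetical helper: effective bandwidth in GB/s (1 GB = 1e9 bytes).
  pure function gb_per_s(bytes, seconds) result(bw)
    real(dp), intent(in) :: bytes, seconds
    real(dp) :: bw
    bw = bytes / seconds / 1.0e9_dp
  end function gb_per_s
end module bandwidth_mod

program transfer_estimate
  use bandwidth_mod
  implicit none

  real(dp), parameter :: n = 1.0e9_dp        ! elements per array
  real(dp), parameter :: real_bytes = 4.0_dp ! assumed size of a default real
  ! Timings copied from the output shown above.
  real(dp), parameter :: t_enter  = 0.8227592_dp
  real(dp), parameter :: t_kernel = 5.9058666e-3_dp
  real(dp), parameter :: t_exit   = 0.9464250_dp

  print '(a,f8.2,a)', 'enter data (a, b to device): ', &
       gb_per_s(2.0_dp * n * real_bytes, t_enter), ' GB/s'
  print '(a,f8.2,a)', 'kernel (read a, b; write c): ', &
       gb_per_s(3.0_dp * n * real_bytes, t_kernel), ' GB/s'
  print '(a,f8.2,a)', 'exit data (c to host):       ', &
       gb_per_s(n * real_bytes, t_exit), ' GB/s'
end program transfer_estimate
```

Under these assumptions the transfers run at roughly 9.7 GB/s and 4.2 GB/s, while the kernel streams data at about 2 TB/s, which illustrates why the data movement, not the computation, determines the total time.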

Here is the modified code. Note the alloc and delete/release mappings that become necessary when the bounded mapping semantics of the target region are split into stand-alone enter/exit data mappings. The map clauses still present on the target region have no effect, because the data is already on the device and the reference count for the mapping is larger than 0; they could also be removed. I also added a warm-up target region to make sure that the set-up time for connecting to the device is not included in the time measurement.

program vector_addition
  use omp_lib
  implicit none

  integer, parameter :: n = 1000000000
  real, allocatable, dimension(:) :: a, b, c, c_cpu
  real :: start_time, end_time
  integer i

  ! Allocate arrays
  allocate(a(n), b(n), c(n), c_cpu(n))

  ! Initialize arrays
  call random_number(a)
  call random_number(b)

  ! ==========================================================
  !        Sequential CPU execution
  ! ==========================================================
  write(*,*) 'Starting CPU computation...'
  call cpu_time(start_time)

  do i = 1, n
     c_cpu(i) = a(i) + b(i)
  end do

  call cpu_time(end_time)

  print *, 'Time taken for CPU: ', end_time - start_time, ' seconds'

  ! make sure GPU is set up
  !$omp target map(from:i)
    i = n
  !$omp end target
  ! ==========================================================
  !        OpenMP Offload to GPU
  ! ==========================================================

  call cpu_time(start_time)
  !$omp target enter data map(to:a,b) map(alloc:c)
  call cpu_time(end_time)

  print *, 'Time taken for enter data: ', end_time - start_time, ' seconds'

  call cpu_time(start_time)

  !$omp target teams distribute parallel do map(to:a,b) map(from:c)
  do i = 1, n
     c(i) = a(i) + b(i)
  end do
  !$omp end target teams distribute parallel do

  call cpu_time(end_time)

  print *, 'Time taken for GPU offload: ', end_time - start_time, ' seconds'

  call cpu_time(start_time)
  !$omp target exit data map(delete:a,b) map(from:c)
  call cpu_time(end_time)

  print *, 'Time taken for exit data: ', end_time - start_time, ' seconds'

  ! compare the results
  do i = 1, n
     if (c(i) /= c_cpu(i)) then
        print *, 'Mismatch at index ', i, ': CPU = ', c_cpu(i), ', GPU = ', c(i)
        exit
     end if
  end do

  ! Deallocate arrays
  deallocate(a, b, c, c_cpu)

end program vector_addition
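
In a real application the transfer cost is amortized by keeping the arrays on the device across many kernels. A minimal sketch of that pattern, assuming a structured target data region is acceptable and using made-up names (resident_mod, repeated_add, nsteps), could look like this. The arrays are transferred once in each direction no matter how often the loop runs; the repeated addition only stands in for a sequence of real kernels.

```fortran
module resident_mod
  implicit none
contains
  ! Hypothetical helper: runs c = a + b nsteps times (nsteps >= 1)
  ! while a, b and c stay mapped on the device.
  subroutine repeated_add(a, b, c, nsteps)
    real, intent(in)    :: a(:), b(:)
    real, intent(out)   :: c(:)
    integer, intent(in) :: nsteps
    integer :: i, n, step

    n = size(c)

    ! a and b are copied in once, c is copied back once at the end.
    !$omp target data map(to:a,b) map(from:c)
    do step = 1, nsteps
       ! The arrays are already present, so no transfer happens here.
       !$omp target teams distribute parallel do
       do i = 1, n
          c(i) = a(i) + b(i)
       end do
       !$omp end target teams distribute parallel do
    end do
    !$omp end target data
  end subroutine repeated_add
end module resident_mod

program resident_demo
  use resident_mod
  implicit none

  ! Assumption: a small problem size so the sketch runs anywhere.
  integer, parameter :: n = 10000000
  real, allocatable :: a(:), b(:), c(:)

  allocate(a(n), b(n), c(n))
  call random_number(a)
  call random_number(b)

  call repeated_add(a, b, c, 10)

  print *, 'Largest deviation from host result: ', maxval(abs(c - (a + b)))

  deallocate(a, b, c)
end program resident_demo
```

Timing the call for different values of nsteps should show the cost per step shrinking as the one-time transfers are spread over more kernels.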
