
I am writing code with PyOpenCL to offload heavy computations to the GPU. To optimize the algorithm, I would like to overlap some of the memory transfer operations with further calculations. However, I am observing blocking behaviour regardless:

import numpy as np
import pyopencl as cl
from timeit import default_timer as dt

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

a = np.random.random((1000, 1000, 500)).astype(np.float64)
mf = cl.mem_flags
start = dt()
a_buff = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=a)
print(f'Buffer creation time: {dt()-start:0.4f} s')

start = dt()
event1 = cl.enqueue_copy(queue, a_buff, a, is_blocking=False)
event1.wait()
print(f'Copy time blocking 1: {dt()-start:0.4f} s')

start = dt()
event2 = cl.enqueue_copy(queue, a_buff, a, is_blocking=False)
event2.wait()
print(f'Copy time blocking 2: {dt()-start:0.4f} s')

start = dt()
event3 = cl.enqueue_copy(queue, a_buff, a, is_blocking=False)
print(f'Copy time non-blocking 1: {dt()-start:0.4f} s')

Console output:

Buffer creation time: 0.8559 s
Copy time blocking 1: 1.1018 s
Copy time blocking 2: 0.4177 s
Copy time non-blocking 1: 0.4364 s

The times of the blocking and non-blocking copies are almost identical despite the is_blocking=False argument. I have read that if the NannyEvent object returned by the copy is not kept alive until the transfer finishes, the operation becomes blocking anyway, but keeping it did not help either. Also, the first copy to the buffer is substantially longer than the second.

My question is: how can I achieve non-blocking behaviour?

  • I don't know why it appears to be blocking, but OpenCL often copies a buffer into its own 4k-aligned host memory before transferring it to the GPU, so that might be why the second copy is faster than the first. You can see when it does this because it uses about twice as much memory as you'd expect. Maybe it realised in the second call that it doesn't have to make a third copy. I'm guessing though. Commented Dec 11, 2024 at 13:46
  • @SimonGoater this is apparently true. However, the main reason is that CUDA does not allow asynchronous memory transfers when the host memory is paged (non-pinned). Commented Jan 21 at 12:22

1 Answer


This behaviour is NVIDIA-specific and stems from limitations on memory transfers in the CUDA implementation (see the CUDA docs and the related PyOpenCL issue). Internally, OpenCL calls into the CUDA API, so these limitations propagate to OpenCL.

Firstly, host-allocated memory can be of two types: pinned (non-paged) and non-pinned (paged). Only transfers from pinned memory (memory that stays in RAM and is never paged out to disk) can be performed in a non-blocking or asynchronous manner in CUDA.
Secondly, if the memory is paged, CUDA first copies it to an internal pinned staging buffer and then transfers it to device memory, and the whole operation is blocking; see this StackOverflow answer. This presumably also explains the long copy time of the first transfer.
Therefore, to get asynchronous memory transfers and kernel execution, only pinned memory must be used. To obtain pinned memory, it has to be allocated by OpenCL itself: arrays created by Numpy normally live in paged memory, and Numpy has no functionality to explicitly request pinned memory.
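As a side note, one way to see whether a given copy really blocks on your device is to enable queue profiling and compare the wall-clock duration of the enqueue call with the device-measured transfer time. This is only a minimal diagnostic sketch (it assumes a CUDA-backed OpenCL platform and uses a smaller array than in the question), not part of the solution itself:

import numpy as np
import pyopencl as cl
from timeit import default_timer as dt

ctx = cl.create_some_context()
# PROFILING_ENABLE lets us read device-side timestamps from events
queue = cl.CommandQueue(ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)

a = np.random.random((1000, 1000, 100)).astype(np.float64)  # ordinary (paged) Numpy array
buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, size=a.nbytes)

start = dt()
evt = cl.enqueue_copy(queue, buf, a, is_blocking=False)
enqueue_time = dt() - start          # time spent inside the enqueue call on the host
evt.wait()
transfer_time = (evt.profile.end - evt.profile.start) * 1e-9  # device transfer time, ns -> s
print(f'Enqueue call: {enqueue_time:0.4f} s, device transfer: {transfer_time:0.4f} s')
# If the enqueue call takes about as long as the transfer itself,
# the "non-blocking" copy was effectively blocking.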

To get a Numpy array backed by pinned memory, the array has to be created on top of a buffer allocated by OpenCL.

The first step is to create a Buffer:
buffer = cl.Buffer(ctx, cl.mem_flags.READ_WRITE | cl.mem_flags.ALLOC_HOST_PTR, size=a.nbytes)
This allocates memory on both host and device. The ALLOC_HOST_PTR flag forces OpenCL to allocate pinned memory on host. Unlike with the COPY_HOST_PTR flag, this memory is created empty and is not tied to an existing Numpy array.

Then, the buffer has to be mapped to a Numpy array:
mapped, event = cl.enqueue_map_buffer(queue, buffer, cl.map_flags.WRITE, 0, shape=a.shape, dtype=a.dtype)
mapped is a Numpy array that can then be used like any other array in Python.

Finally, the mapped array can be filled with data from the target array:
mapped[...] = a

Now, running the same benchmark shows non-blocking behaviour:

import numpy as np
import pyopencl as cl
from timeit import default_timer as dt

ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)

a = np.random.random((1000, 1000, 500)).astype(np.float64)
mf = cl.mem_flags
start = dt()
size = a.size * a.itemsize
a_buff = cl.Buffer(ctx, mf.READ_WRITE | mf.ALLOC_HOST_PTR, size=size)
a_mapped, event = cl.enqueue_map_buffer(queue, a_buff, cl.map_flags.WRITE, 0, shape=a.shape, dtype=a.dtype)
a_mapped[:] = a
cl.enqueue_copy(queue, a_buff, a_mapped, is_blocking=False)
print(f'Buffer creation time: {dt()-start:0.4f} s')

start = dt()
event1 = cl.enqueue_copy(queue, a_buff, a_mapped, is_blocking=True)
print(f'Copy time blocking 1: {dt()-start:0.4f} s')

start = dt()
event2 = cl.enqueue_copy(queue, a_buff, a_mapped, is_blocking=False)
print(f'Copy time non-blocking (Host to Device): {dt()-start:0.4f} s')

start = dt()
event3 = cl.enqueue_copy(queue, a_mapped, a_buff, is_blocking=False)
print(f'Copy time non-blocking (Device to Host): {dt()-start:0.4f} s')

Result:

Buffer creation time: 1.8355 s
Copy time blocking 1: 0.3096 s
Copy time non-blocking (Host to Device): 0.0001 s
Copy time non-blocking (Device to Host): 0.0000 s
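
For completeness, here is a minimal sketch of how the non-blocking copy can then actually be overlapped with host-side work. It reuses ctx, queue, a_buff and a_mapped from the code above; the np.tanh call just stands in for any unrelated CPU computation:

start = dt()
# returns immediately because a_mapped is pinned memory
evt = cl.enqueue_copy(queue, a_mapped, a_buff, is_blocking=False)
# unrelated CPU work runs while the transfer is in flight
host_result = np.tanh(np.random.random((2000, 2000)))
# block only at the point where the transferred data is actually needed
evt.wait()
print(f'Copy + overlapped host work: {dt()-start:0.4f} s')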

PS: as the benchmark shows, getting non-blocking behaviour changes the underlying memory allocation. It requires refactoring all array-creation routines, so it cannot be bolted on without significant changes to the source code.
