I am looking for a way to use a memory pool with Thrust, because I want to limit the number of calls to cudaMalloc.
device_vector definitely accepts an allocator, but it's not so easy to deal with thrust::sort, which apparently allocates a temporary buffer internally.
Based on the answer to "How to use CUDA Thrust execution policy to override Thrust's low-level device memory allocator", it seems that Thrust can be hooked to use custom allocators by tweaking the execution policy. However, that answer is quite old, and I can't find any documentation on execution policies that explains exactly how to proceed.
For completeness, there is thrust/examples/cuda/custom_temporary_allocation.cu, but it's not very satisfying, as it hooks its memory pool up through a global variable.
I think it's quite likely that the Thrust developers have thought about this and included some mechanism for injecting a custom memory manager via the execution policy; I just can't find it.
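For reference, here is a minimal sketch of the pattern I mean, adapted from custom_temporary_allocation.cu but without the global: a caching allocator instance is created locally and passed into the execution policy via thrust::cuda::par(alloc), so Thrust's temporary buffers for sort are drawn from it. The cached_allocator name comes from that example; error checking is omitted for brevity.

```cuda
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/execution_policy.h>
#include <cstddef>
#include <map>

// Caching allocator satisfying the interface Thrust expects for
// temporary allocations: allocate(num_bytes) / deallocate(ptr, num_bytes).
struct cached_allocator
{
  typedef char value_type;

  // size-in-bytes -> cached block available for reuse
  std::multimap<std::ptrdiff_t, char*> free_blocks;

  char* allocate(std::ptrdiff_t num_bytes)
  {
    auto it = free_blocks.find(num_bytes);
    if (it != free_blocks.end())
    {
      char* p = it->second;
      free_blocks.erase(it);
      return p;                      // reuse a previously freed block
    }
    char* p = nullptr;
    cudaMalloc(&p, num_bytes);       // cache miss: fall back to cudaMalloc
    return p;
  }

  void deallocate(char* p, size_t num_bytes)
  {
    // Return the block to the cache instead of calling cudaFree.
    free_blocks.insert({static_cast<std::ptrdiff_t>(num_bytes), p});
  }

  ~cached_allocator()
  {
    for (auto& kv : free_blocks)
      cudaFree(kv.second);
  }
};

int main()
{
  thrust::device_vector<int> vec(1 << 20);

  cached_allocator alloc;            // local object, no global state

  // Temporary storage for both sorts comes from `alloc`; the second
  // call reuses the block cached by the first.
  thrust::sort(thrust::cuda::par(alloc), vec.begin(), vec.end());
  thrust::sort(thrust::cuda::par(alloc), vec.begin(), vec.end());
  return 0;
}
```

This avoids the global variable, but it still means threading the allocator through every algorithm call, which is what I was hoping the library had a cleaner mechanism for.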
You may want to look at cub::DeviceRadixSort (for primitive types) or cub::DeviceMergeSort (generally applicable), since CUB's device-wide algorithms let you pass in your own temporary storage. You may also find thrust/examples/mr_basic.cu to be of interest in terms of allocators in Thrust.
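To illustrate, here is a sketch of CUB's two-phase temporary-storage protocol with cub::DeviceMergeSort: the first call with a null pointer only writes the required size, so the scratch buffer can come from any pool you manage and be reused across many sorts. The plain cudaMalloc in phase 2 stands in for your pool, and error checking is omitted.

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <cstddef>

int main()
{
  const int num_items = 1 << 20;
  int* d_keys = nullptr;
  cudaMalloc(&d_keys, num_items * sizeof(int));

  void*  d_temp_storage     = nullptr;
  size_t temp_storage_bytes = 0;

  // Phase 1: d_temp_storage == nullptr, so no work is done;
  // only temp_storage_bytes is written.
  cub::DeviceMergeSort::SortKeys(d_temp_storage, temp_storage_bytes,
                                 d_keys, num_items, cub::Less());

  // Phase 2: supply the buffer yourself (here cudaMalloc; in practice,
  // hand out a block from your own memory pool) and run the sort.
  cudaMalloc(&d_temp_storage, temp_storage_bytes);
  cub::DeviceMergeSort::SortKeys(d_temp_storage, temp_storage_bytes,
                                 d_keys, num_items, cub::Less());

  cudaFree(d_temp_storage);
  cudaFree(d_keys);
  return 0;
}
```

Because the size query is cheap, you can query once for your largest problem size and keep a single scratch block alive for the lifetime of the application, eliminating per-sort cudaMalloc calls entirely.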