Allocation using Cuda with alignment

Question

What is the most efficient way to allocate memory using cudaMallocHost() and cudaMalloc() so that the memory is aligned to a certain value.

Just like __mm_malloc where you can pass the size with the alignment as parameters.

There's an automatic alignment to at least 256 bytes if I remember correctly. Is it not enough for you needs ? — wohlstad
– wohlstad, Commented Jan 24, 2023 at 12:02
what do you need? docs state: "The allocated memory is suitably aligned for any kind of variable." — 463035818_is_not_an_ai
– 463035818_is_not_an_ai, Commented Jan 24, 2023 at 12:09
I suggest you edit your question and add some information about your use case and specific needs. — wohlstad
– wohlstad, Commented Jan 24, 2023 at 12:10
At least for device side, I would leave the alignment to the CUDA runtime and just use the cudaMallocPitch function. — sgarizvi
– sgarizvi, Commented Jan 25, 2023 at 4:52

huseyin tugrul buyukisik · Accepted Answer · 2023-01-24 13:48:59Z

2

Pointers returned from cudaMallocHost() can be post-processed with a bit bigger total size for the allocation:

auto alignedPtr = ptrAsULL + 4096 - (ptrAsULL%4096);

then encapsulated like this:

struct AlignedCudaBuf
{
     char * ptrOriginal;
     char * ptrAligned;
     AlignedCudaBuf()
     {
        cudaMallocHost( &ptrOriginal, size + extra );
        ptrAligned = ((unsigned long long)ptrOriginal) + 4096 - (((unsigned long long)ptrOriginal)%4096);
     }

     ~AlignedCudaBuf()
     {
         if(ptrOriginal)
         {
             cudaFreeHost(ptrOriginal);
             ptrOriginal=0;
         }
     }
}

This does not put too much overhead unless they are frequently created/destructed (that makes CUDA-side add API latency too).

Maybe to evade accidental copies, you could use smart pointers instead of raw pointers so that the last remaining instance of AlignedCudaBuf would destroy it only once:

struct AlignedCudaBuf
{
     // will it be shared by multiple graphics cards & threads?
     std::shared_ptr<char> ptrOriginal;

     char * ptrAligned;
     AlignedCudaBuf()
     {


        char * tmp;
        cudaMallocHost( &tmp, size + extra );
        
        ptrAligned = ((unsigned long long)tmp) + 4096 - (((unsigned long long)tmp)%4096);

        ptrOriginal = std::shared_ptr<char>(
             tmp,
             [](char * ptr0){ cudaFreeHost(ptr0); } // custom destructor     
        );

     }

}

edited Jan 24, 2023 at 13:48

answered Jan 24, 2023 at 13:20

huseyin tugrul buyukisik

12k6 gold badges53 silver badges109 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

MIA Over a year ago

is there a reson why you chose to represent alignment as 4096? or this is just a random number

huseyin tugrul buyukisik Over a year ago

It's for page-alignment for any future pinning of regions for fast transfers between device and host. Pinned buffers go directly through pcie while non-pinned non-aligned ones require an extra copy to a pinned place before going pcie.

Collectives™ on Stack Overflow

Allocation using Cuda with alignment

1 Answer 1

2 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

2 Comments

Your Answer

Sign up or log in

Post as a guest

Related