0

What is the most efficient way to allocate memory using cudaMallocHost() and cudaMalloc() so that the memory is aligned to a certain value.

Just like __mm_malloc where you can pass the size with the alignment as parameters.

8
  • There's an automatic alignment to at least 256 bytes if I remember correctly. Is it not enough for you needs ? Commented Jan 24, 2023 at 12:02
  • @wohlstad in somecases I actually need more that this Commented Jan 24, 2023 at 12:08
  • what do you need? docs state: "The allocated memory is suitably aligned for any kind of variable." Commented Jan 24, 2023 at 12:09
  • 4
    I suggest you edit your question and add some information about your use case and specific needs. Commented Jan 24, 2023 at 12:10
  • 1
    At least for device side, I would leave the alignment to the CUDA runtime and just use the cudaMallocPitch function. Commented Jan 25, 2023 at 4:52

1 Answer 1

2

Pointers returned from cudaMallocHost() can be post-processed with a bit bigger total size for the allocation:

auto alignedPtr = ptrAsULL + 4096 - (ptrAsULL%4096);

then encapsulated like this:

struct AlignedCudaBuf
{
     char * ptrOriginal;
     char * ptrAligned;
     AlignedCudaBuf()
     {
        cudaMallocHost( &ptrOriginal, size + extra );
        ptrAligned = ((unsigned long long)ptrOriginal) + 4096 - (((unsigned long long)ptrOriginal)%4096);
     }

     ~AlignedCudaBuf()
     {
         if(ptrOriginal)
         {
             cudaFreeHost(ptrOriginal);
             ptrOriginal=0;
         }
     }
}

This does not put too much overhead unless they are frequently created/destructed (that makes CUDA-side add API latency too).

Maybe to evade accidental copies, you could use smart pointers instead of raw pointers so that the last remaining instance of AlignedCudaBuf would destroy it only once:

struct AlignedCudaBuf
{
     // will it be shared by multiple graphics cards & threads?
     std::shared_ptr<char> ptrOriginal;

     char * ptrAligned;
     AlignedCudaBuf()
     {


        char * tmp;
        cudaMallocHost( &tmp, size + extra );
        
        ptrAligned = ((unsigned long long)tmp) + 4096 - (((unsigned long long)tmp)%4096);

        ptrOriginal = std::shared_ptr<char>(
             tmp,
             [](char * ptr0){ cudaFreeHost(ptr0); } // custom destructor     
        );

     }

}
Sign up to request clarification or add additional context in comments.

2 Comments

is there a reson why you chose to represent alignment as 4096? or this is just a random number
It's for page-alignment for any future pinning of regions for fast transfers between device and host. Pinned buffers go directly through pcie while non-pinned non-aligned ones require an extra copy to a pinned place before going pcie.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.