
I'd like to share some constants between CPU and GPU in order to allow for execution of the same code (wrapped in between) on either. That includes some compile-time parameters which are most reasonably written as arrays and I'd like them to be constexpr such that the compiler can (ideally) elide the arrays during compilation.

When I try

#include <stdio.h>
using fe_ftype = double;
__device__  constexpr fe_ftype vars[2] = {100.0, 300.0};
//__constant__ constexpr fe_ftype vars[2] = {100.0, 300.0};
const fe_ftype l = 3.0;

__global__ void foo() {
    printf("%lf %lf %lf\n", vars[0], vars[1], l);
}

int main(void) {
    foo<<<1,1>>>();
    cudaDeviceSynchronize();
    printf("%lf %lf %lf\n", vars[0], vars[1], l);
}

(see https://godbolt.org/z/19bYj34f8)

It seems to work in that compilation succeeds and I get the same output from both the GPU and CPU side, with the constants either as constant read-only data for CPU or being moved into the registers directly on the GPU side. However, the compiler complains with warning #20091-D: a __device__ variable "vars" cannot be directly read in a host function. If it were not a constant, I would be accessing some random uninitialized memory on the host, but with the constexpr annotation it seems to successfully grok it on the host side. It also seems to work identically if I use __constant__ instead of __device__.

Can I rely on this behaviour? Does the warning generally not apply to constexpr variables? Extra condition: I can only use C++17, so no consteval.

  • Starting with CUDA ≥ 11.4, your code is OK. Check that your CUDA version meets that requirement, and that your host compiler is informed of the CUDA version and supports it. Otherwise this should be fine, as long as you are not taking the address of your constants, which would land them in memory and prevent the values from being inlined. Commented Jun 24 at 6:14
  • There have been two related questions recently: How are constexpr device variables accessible from host? and C++ builtin constexpr vs CUDA __constant__ for higher dimension array. Commented Jun 24 at 8:02
  • Arguably constexpr w/o __device__ might be the best solution as discussed below these questions. But special use cases could favor different combinations. Commented Jun 24 at 8:05
    Neat that neither question showed up in my searching. If I read that right, relying on constexpr w/o __device__ requires a consteval wrapper, which requires C++20, but I'm stuck on C++17. With C++17 I get a compile-time error, since the variable isn't visible on the device. The solution with cuda::std::array looks nice, though I'll have to wrap it for non-CUDA compilation. Commented Jun 24 at 9:05
    Two things: 1. I used consteval in that context because it made sense, but it should work with constexpr instead, assuming the argument is constexpr. That is why cuda::std::array works: its operator[] is constexpr. 2. cuda::std::array is available for host code as well; you might just have to tell your host compiler where to find CCCL. Commented Jun 25 at 7:51

1 Answer


Can I rely on this behaviour?

No, you cannot. NVIDIA forbids access to __device__ variables in host-side code. In fact, I have an open bug about (non-constexpr) __device__ variables being magically accessible in __host__ __device__ functions (bug 5307292; you probably can't access it, since NVIDIA's bugs are not public); and NVIDIA's current reply is that it is difficult for them to detect this happening, not that the behaviour is fine as it is.

Does the warning generally not apply to constexpr variables?

It does. You see, constexpr variables still have addresses, and it is possible to take their address and use it. So, in principle, you would have your host-side code reading directly from an address in device-global memory; that's not supposed to work. Compiler optimization might "make it happen", but unless NVIDIA guarantees that behaviour, it is basically a fluke.

So, how can you share constants?

Two possibilities to consider:

1. Drop __device__ and use a plain constexpr variable (or a cuda::std::array), accessed only in constant-expression contexts; the convenient consteval wrapper for this requires C++20.
2. Use a __constant__ variable for the device side and a separate constant for the host side, selecting between them with the preprocessor.


3 Comments

Second possibility should be predicated on a uniform access pattern.
The first one is sadly out, since I am stuck on C++17 (said in the comments; I've now edited the question, sorry). For the second one I'd have separate variables for the host and device side, right? I've coded up a small mock example at godbolt.org/z/59K6x63zj, and with __constant__ I obviously get loads from (constant) memory in the PTX for function work(). In work2, where I use the preprocessor to access either a host or a device constexpr variable, it seems to generate code equivalent to hardcoding the values.
Indeed, separate variables for the host and the device. But reading from constant memory is not as useful as having the data as proper compile-time constants, which the compiler can use in its optimizations.
