
I have a couple of "magic"¹ floating-point constants, which I want to use bit-exact in CUDA device-side computation, in the form of constexpr symbols. On the host side you'd use std::bit_cast<float>(0x........) for that. However, NVCC doesn't "like" std::bit_cast in device-side code.

In GLSL you'd use intBitsToFloat, however I see no built-in function in the CUDA C++ language extensions that can do this.


1: Well, they're not that "magic": basically they're the floating-point equivalent of 0.999…·2ⁿ, that is, all bits of the mantissa set to 1, with −(n+1) added to the biased exponent of "0" (i.e. an exponent field of 0x7E−n−1).

  • reinterpret_cast from unsigned int to float? Or better, a simple memcpy. Commented Mar 27, 2024 at 11:17
  • Yeah, reinterpret_cast seems to work for type-punning in CUDA (not sure if the compiler just doesn't use the strict aliasing rule in the absence of __restrict__ or if I just haven't seen a case where it can fail), but the correct/secure C++ way is using memcpy. In my experience the compiler is able to avoid the actual memcpy so one doesn't have to worry about performance. Commented Mar 27, 2024 at 13:05
  • But I admit that neither of those two methods is suited for initializing a constexpr float, which is probably what you are looking for? In this case aliasing (of a literal) isn't really a problem either. Commented Mar 27, 2024 at 13:16
  • @paleonix I'm well aware of the memcpy way. But I'm explicitly interested in a constexpr way. For what it's worth, I intend to pass that value by means of a template parameter (which only supports integer types). Commented Mar 27, 2024 at 14:14
  • @paleonix Yeah, I've seen that --expt-relaxed-constexpr suggestion message by NVCC, but the wording it comes with carries a big "Here be Dragons" vibe. Commented Mar 27, 2024 at 14:16

2 Answers


Update 2 - cuda::std::bit_cast is here!

The newest version of libcu++ has

Implemented and backported C++20 bit_cast. It is available in all standard modes and constexpr with compiler support

It is available in the CUDA Toolkit >= 12.8 and from the CCCL repo.


Update 1 - Why you might not want to use --expt-relaxed-constexpr

My view of --expt-relaxed-constexpr has changed after finding some funny behavior, similar to what is described in this issue, in an Nvidia project. I.e. they know about these problems, which might be the reason the flag is deemed experimental.

While I don't think that using std::bit_cast in particular in device code is problematic, compiling with this flag could cause accidental usage of other constexpr functions that are less basic and less safe. Also note that the flag does not only allow the usage of constexpr functions at compile time, as I previously thought, but also at runtime (i.e. with non-constexpr input), which is the cause of these issues. This was probably fine at the time of the flag's introduction, when constexpr functions were very restricted, but with newer C++ standards more and more functionality has become available in constexpr functions that is not available in device code and seems to be silently ignored, which is dangerous. With CUDA 12.8 Nvidia has added information regarding this issue to the documentation.


Initial answer - You could use --expt-relaxed-constexpr

Given a host compiler that supports it, you can use std::bit_cast in CUDA C++20 device code (i.e. CUDA >=12) to initialize a constexpr variable. You just need to tell nvcc to make it possible by passing --expt-relaxed-constexpr.

This flag is labeled an "Experimental flag", but to me that sounds more like "this flag might be removed/renamed in a future release" than like "here be dragons" in terms of its results. It is also already quite old, which gives me some confidence. See the CUDA 8.0 nvcc docs from 2016 (docs for even older versions are not available online as HTML, so I didn't check further back).

As constexpr code is evaluated by the compiler on the host independent of the surrounding device context, I would not expect this flag to be some brittle "black magic". It just needs to pass off the evaluation to the host compiler and use the resulting value/object.

Given all this context, I would rather expect the --expt-relaxed-constexpr behavior to become the default in some future CUDA version than to see it vanish without a replacement.

If you don't need constexpr

For anyone who needs a non-constexpr version of bit_cast, see Safe equivalent of std::bit_cast in C++11 (just add __device__).
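For reference, a sketch of that memcpy-based bit_cast (shown here as plain C++ so it is host-testable; in CUDA you would additionally mark it __host__ __device__ as the linked answer suggests):

```cpp
#include <cstdint>
#include <cstring>
#include <type_traits>

// memcpy-based type punning; compilers typically elide the copy entirely.
// In CUDA device code, prepend __host__ __device__ to the declaration.
template <class To, class From>
To bit_cast(const From& src) noexcept {
    static_assert(sizeof(To) == sizeof(From), "sizes must match");
    static_assert(std::is_trivially_copyable<To>::value, "");
    static_assert(std::is_trivially_copyable<From>::value, "");
    To dst;
    std::memcpy(&dst, &src, sizeof(To));
    return dst;
}
```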



In CUDA __device__ code there are the intrinsic functions __int_as_float and __float_as_int, which do what you need. The only limitation is that they are not constexpr.

As a workaround, I believe you could keep the constexpr value as an integer and call __int_as_float only at the point where you transition into the runtime code that executes on the GPU.
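A sketch of that workaround, with a host-testable stand-in: the memcpy line marks the spot where a real __global__ kernel would call __int_as_float; the function and constant names are illustrative, not from the answer:

```cpp
#include <cstdint>
#include <cstring>

// Keep the magic value as a constexpr integer so it can travel through
// a non-type template parameter (which supports integer types).
constexpr std::uint32_t kMagicBits = 0x3F7FFFFF; // largest float < 1.0

template <std::uint32_t Bits>
float use_magic() {                 // stand-in for a __global__ kernel
    const std::uint32_t bits = Bits;
    float f;
    // On the device this line would be: float f = __int_as_float(bits);
    std::memcpy(&f, &bits, sizeof f);
    return f;
}
```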

5 Comments

While I like learning about intrinsics, there doesn't seem to be any advantage compared to a custom bit_cast (using memcpy) as implemented in the Q&A linked at the end of my answer, which is also more general in terms of types. Or do these intrinsics somehow produce even better code?
One would have to compare the produced code to be certain, but in general CUDA "hates" taking addresses of local variables (or doing memcpy). As long as a variable is in a register it is fast and efficient, but if an address is needed the variable lands in local memory (which resides in global memory or cache). It may happen that the compiler figures out that bit_cast and memcpy can be resolved without actually touching memory, but you need to check the assembly to be certain. The --ptxas-options=-v nvcc option may also give a hint as to how many registers and how much local memory are actually used.
@paleonix I am fairly certain __int_as_float produces no code at all. It just informs the compiler that the given content of a register should now be treated as a float.
The point is that compilers will do the same (i.e. nothing) for memcpy. I tried it before on Compiler Explorer with nvcc.
@paleonix If that is the case in your use case - go for it! I am not sure it will work in all possible cases - sometimes the compiler can get lost on the programmer's intent - so this is something you have to monitor.
