How can I convert from float (1bit sign, 8bit exp, 23bit mantissa) to Bfloat16 (1bit sign, 8bit exp, 7bit mantissa) in C++?
3 Answers
As demonstrated in the answer by Botje, it is sufficient to copy the upper half of the float value, since the bit patterns are the same. However, the way it is done in that answer violates C++'s strict-aliasing rules. The way around that is to use memcpy to copy the bits.
static inline tensorflow::bfloat16 FloatToBFloat16(float float_val)
{
    tensorflow::bfloat16 retval;
#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    // Big endian: the high half of the float sits at the start of the object.
    memcpy(&retval, &float_val, sizeof retval);
#else
    // Little endian: the high half sits at the end, so offset the source pointer.
    memcpy(&retval,
           reinterpret_cast<char *>(&float_val) + sizeof float_val - sizeof retval,
           sizeof retval);
#endif
    return retval;
}
If it's necessary to round the result rather than truncating it, you can multiply by a magic value beforehand to push some of the discarded lower mantissa bits into the retained upper bits.
float_val *= 1.001957f;
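An alternative to the magic multiplier is explicit round-to-nearest-even on the raw bits. The following is a sketch only; the function name is mine, it returns the raw 16-bit pattern rather than tensorflow::bfloat16, and it omits the special handling that NaN inputs would need (the carry from the rounding addition can overflow a NaN's mantissa into the exponent).

```cpp
#include <cstdint>
#include <cstring>

// Round-to-nearest-even conversion sketch: add half of the discarded range
// (plus the parity of the last kept bit for ties-to-even), then truncate.
static inline std::uint16_t FloatToBFloat16Rounded(float float_val)
{
    std::uint32_t bits;
    std::memcpy(&bits, &float_val, sizeof bits); // sanctioned type punning
    std::uint32_t lsb = (bits >> 16) & 1u;       // parity of last kept bit
    bits += 0x7FFFu + lsb;                       // round to nearest, ties to even
    return static_cast<std::uint16_t>(bits >> 16);
}
```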
memcpy wouldn't compile for me in the little-endian case for some reason, so this is my solution. I wrapped it in a struct so that I can easily access the data and run through different ranges of values to confirm that it works properly.
struct bfloat16 {
    unsigned short int data;
public:
    bfloat16() {
        data = 0;
    }
    // cast to float
    operator float() {
        unsigned int proc = data << 16;
        return *reinterpret_cast<float*>(&proc);
    }
    // cast to bfloat16
    bfloat16& operator=(float float_val) {
        data = (*reinterpret_cast<unsigned int*>(&float_val)) >> 16;
        return *this;
    }
};
// an example that enumerates all the representable values between 1.0f and 300.0f
#include <iostream>
using namespace std;

int main() {
    bfloat16 x;
    for (x = 1.0f; x < 300.0f; x.data++) {
        cout << x.data << " " << x << endl;
    }
    return 0;
}
3 Comments
- operator >> overload for cin: stackoverflow.com/a/56017304/1413259
- reinterpret_cast for type punning invokes undefined behavior. You need to use std::bit_cast instead; see "Why was std::bit_cast added, if reinterpret_cast could do the same?"
- >> can lose NaNs when the lower bits are shifted out (a NaN is represented as a maximal exponent with any nonzero bits in the fraction/mantissa part, i.e. a bit pattern whose magnitude exceeds infinity's). To preserve NaNs, try ... = (float32AsUint >> 16) | (((float32AsUint & 0x7FFFFFFF) > 0x7F800000) ? 1 : 0);

From the Tensorflow implementation:
static inline tensorflow::bfloat16 FloatToBFloat16(float float_val) {
#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    return *reinterpret_cast<tensorflow::bfloat16*>(
        reinterpret_cast<uint16_t*>(&float_val));
#else
    return *reinterpret_cast<tensorflow::bfloat16*>(
        &(reinterpret_cast<uint16_t*>(&float_val)[1]));
#endif
}
8 Comments
- ... unsigned char's), it'll be right.
- Regarding float_val: if we are just reading and never writing through the different pointer type, is it still a violation or UB?
frexp can be used to break a float down into components. Assembling it back into whatever structure you call Bfloat16 is left as an exercise for the reader.
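A minimal sketch of the frexp approach (the wrapper name is mine; reassembling the pieces into a bfloat16-style struct is the exercise the comment leaves open):

```cpp
#include <cmath>

// frexp splits a float into a normalized fraction in [0.5, 1) and a binary
// exponent, so that value == frac * 2^exp -- no bit-level punning required.
inline float DecomposeFloat(float value, int* exponent)
{
    return std::frexp(value, exponent); // e.g. 6.0f -> 0.75f with *exponent == 3
}
```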