How can I convert from float (1bit sign, 8bit exp, 23bit mantissa) to Bfloat16 (1bit sign, 8bit exp, 7bit mantissa) in C++?
3 Answers
As demonstrated in the answer by Botje, it is sufficient to copy the upper half of the float value, since the bit patterns are the same. However, the way it is done in that answer violates C++'s strict-aliasing rules. The way around that is to use memcpy to copy the bits.
static inline tensorflow::bfloat16 FloatToBFloat16(float float_val)
{
    tensorflow::bfloat16 retval;
#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    // Big endian: the high half of the float sits at the start of the object.
    memcpy(&retval, &float_val, sizeof retval);
#else
    // Little endian: the high half sits at the end, so offset the source pointer.
    memcpy(&retval,
           reinterpret_cast<char *>(&float_val) + sizeof float_val - sizeof retval,
           sizeof retval);
#endif
    return retval;
}
If it's necessary to round the result rather than truncating it, you can multiply by a magic value beforehand to push some of the discarded lower mantissa bits into the retained upper bits.
float_val *= 1.001957f;
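An alternative to the magic multiplier is explicit round-to-nearest-even on the raw bits. The following is a sketch only; the function name is mine, it returns the raw 16-bit pattern rather than tensorflow::bfloat16, and it omits the special handling that NaN inputs would need (the carry from the rounding addition can overflow a NaN's mantissa into the exponent).

```cpp
#include <cstdint>
#include <cstring>

// Round-to-nearest-even conversion sketch: add half of the discarded range
// (plus the parity of the last kept bit for ties-to-even), then truncate.
static inline std::uint16_t FloatToBFloat16Rounded(float float_val)
{
    std::uint32_t bits;
    std::memcpy(&bits, &float_val, sizeof bits); // sanctioned type punning
    std::uint32_t lsb = (bits >> 16) & 1u;       // parity of last kept bit
    bits += 0x7FFFu + lsb;                       // round to nearest, ties to even
    return static_cast<std::uint16_t>(bits >> 16);
}
```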
memcpy wouldn't compile for me in the little-endian case for some reason, so this is my solution. I wrapped it in a struct so that I can easily access the data and run through different ranges of values to confirm that it works properly.
struct bfloat16 {
    unsigned short int data;
public:
    bfloat16() {
        data = 0;
    }
    // cast to float
    operator float() {
        unsigned int proc = data << 16;
        return *reinterpret_cast<float*>(&proc);
    }
    // cast to bfloat16
    bfloat16& operator=(float float_val) {
        data = (*reinterpret_cast<unsigned int*>(&float_val)) >> 16;
        return *this;
    }
};
// an example that enumerates all the representable values between 1.0f and 300.0f
#include <iostream>
using namespace std;

int main() {
    bfloat16 x;
    for (x = 1.0f; x < 300.0f; x.data++) {
        cout << x.data << " " << x << endl;
    }
    return 0;
}
3 Comments
- operator >> overload for cin: stackoverflow.com/a/56017304/1413259
- reinterpret_cast for type punning invokes undefined behavior. You need to use std::bit_cast instead; see "Why was std::bit_cast added, if reinterpret_cast could do the same?"
- >> can lose NaNs when the lower bits are shifted out (a NaN is represented as a maximal exponent with any nonzero bits in the fraction/mantissa part, i.e. a bit pattern whose magnitude exceeds infinity's). To preserve NaNs, try ... = (float32AsUint >> 16) | (((float32AsUint & 0x7FFFFFFF) > 0x7F800000) ? 1 : 0);

From the Tensorflow implementation:
static inline tensorflow::bfloat16 FloatToBFloat16(float float_val) {
#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
    return *reinterpret_cast<tensorflow::bfloat16*>(
        reinterpret_cast<uint16_t*>(&float_val));
#else
    return *reinterpret_cast<tensorflow::bfloat16*>(
        &(reinterpret_cast<uint16_t*>(&float_val)[1]));
#endif
}
8 Comments
- ... unsigned char's), it'll be right.
- Regarding float_val: if we are just reading and never writing through the different pointer type, is it still a violation or UB?
frexp can be used to break a float down into components. Assembling it back into whatever structure you call Bfloat16 is left as an exercise for the reader.
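A minimal sketch of the frexp approach (the wrapper name is mine; reassembling the pieces into a bfloat16-style struct is the exercise the comment leaves open):

```cpp
#include <cmath>

// frexp splits a float into a normalized fraction in [0.5, 1) and a binary
// exponent, so that value == frac * 2^exp -- no bit-level punning required.
inline float DecomposeFloat(float value, int* exponent)
{
    return std::frexp(value, exponent); // e.g. 6.0f -> 0.75f with *exponent == 3
}
```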