The IEEE 754 formats for float64, float32 and float16 use a sign-magnitude significand and a biased exponent. As a student designing hardware architectures, it makes more sense to me to use two's complement for both the significand and the exponent.
For example, a 32-bit (single-precision) float is defined such that the first bit is the sign, the next 8 bits are the exponent (biased by 127) and the last 23 bits are the mantissa. To implement addition/multiplication (of negative numbers), we need to convert the mantissa to two's complement and back. The resulting hardware is quite complicated.
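To make the layout above concrete, here is a minimal Python sketch that splits a value into the three fields and rebuilds it. The helper names `float32_fields` and `fields_to_value` are my own, and the reconstruction assumes a normal number (no zeros, subnormals, infs or NaNs).

```python
import struct

def float32_fields(x):
    """Round x to single precision and return (sign, biased_exponent, fraction)."""
    # Reinterpret the 4 bytes of the float as an unsigned 32-bit integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                  # 1 bit
    biased_exp = (bits >> 23) & 0xFF   # 8 bits, bias 127
    fraction = bits & 0x7FFFFF         # 23 bits, implicit leading 1 not stored
    return sign, biased_exp, fraction

def fields_to_value(sign, biased_exp, fraction):
    """Rebuild the value from its fields (normal numbers only)."""
    significand = 1 + fraction / 2**23
    return (-1) ** sign * significand * 2.0 ** (biased_exp - 127)

sign, biased_exp, fraction = float32_fields(-6.5)
print(sign, biased_exp, hex(fraction))
print(fields_to_value(sign, biased_exp, fraction))
```

Note that the stored fraction is always a non-negative magnitude; the sign lives only in the top bit.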
Instead, consider a format in which the first 8 bits represent the exponent and the last 24 bits represent the mantissa, both in two's complement. Bit shifting, adding and multiplying are relatively straightforward and the hardware is less complicated. In addition, the significand has a unique zero (the sign-bit representation has two zeros).
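A rough behavioural model of this proposed format, in Python, might look like the sketch below. It assumes the value is mantissa × 2^exponent with no hidden bit and that an oversized product is truncated by arithmetic right shift; neither detail is fixed by the description above, and all function names are hypothetical.

```python
EXP_BITS, MAN_BITS = 8, 24

def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value

def fits(value, bits):
    """True if value is representable as a `bits`-wide two's complement integer."""
    return -(1 << (bits - 1)) <= value < (1 << (bits - 1))

def pack(exp, man):
    """Pack a signed exponent and signed mantissa into one 32-bit word."""
    if not (fits(exp, EXP_BITS) and fits(man, MAN_BITS)):
        raise OverflowError("exponent or mantissa out of range")
    return ((exp & 0xFF) << MAN_BITS) | (man & 0xFFFFFF)

def unpack(word):
    """Return (exponent, mantissa) as signed Python integers."""
    return to_signed(word >> MAN_BITS, EXP_BITS), to_signed(word, MAN_BITS)

def value(word):
    """Assumed interpretation: mantissa * 2**exponent, no hidden bit."""
    exp, man = unpack(word)
    return man * 2.0 ** exp

def multiply(a, b):
    """Signed mantissas multiply directly and exponents add."""
    ea, ma = unpack(a)
    eb, mb = unpack(b)
    exp, man = ea + eb, ma * mb      # product may need up to 48 bits
    while not fits(man, MAN_BITS):   # renormalize back to 24 bits
        man >>= 1                    # arithmetic shift, truncates toward -inf
        exp += 1
    return pack(exp, man)            # raises if the exponent overflows

a = pack(-2, 3)    # 3 * 2**-2 = 0.75
b = pack(1, -5)    # -5 * 2**1 = -10.0
print(value(multiply(a, b)))
```

The renormalization loop is only a software convenience; in hardware it would presumably be a leading-bit detector followed by a fixed shifter.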
I searched for months to find reasons for these design decisions and found these:
- 2's complement representations are more difficult to compare.
This is true: we need an adder (subtractor) to compare two's complement numbers. However, for pipelined architectures such as GPUs and my own FPGA-based CNN accelerator, we need to avoid variable delay. Comparing a sign-magnitude representation bit by bit, iteratively, makes it impossible to predetermine the delay. In my opinion, a subtraction is better in this case.
- Historical reasons: handling NaNs and infs
Maybe we could allocate one or two bits for this and make the significand 23 bits.
- Signed zeros (+0 and -0), such that 1/+0 = +inf and 1/-0 = -inf
Now this is a valid reason. It's not really applicable to my use case, but I wonder whether it would have been better if they had implemented this with an additional bit.
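On the comparison point in the list above, the sketch below shows how the ordering of IEEE 754 single-precision values (excluding NaNs) can be reduced to one unsigned integer comparison after a fixed bit transformation. The function names are my own, and this is only an illustration of the encoding's ordering property, not a claim about how any particular FPU implements it.

```python
import struct

def float32_bits(x):
    """Return the single-precision bit pattern of x as an unsigned integer."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits

def order_key(x):
    """Map x to an unsigned integer whose order matches numeric order.

    NaNs are not handled. Note that -0.0 gets a smaller key than +0.0,
    whereas IEEE 754 comparison treats the two zeros as equal.
    """
    bits = float32_bits(x)
    if bits & 0x80000000:
        return bits ^ 0xFFFFFFFF   # negative: invert all bits to reverse the order
    return bits | 0x80000000       # non-negative: set the top bit to sort above negatives

values = [3.0, -1.0, 0.5, -2.5, 0.0]
print(sorted(values, key=order_key))
```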
My use case
I am building a CNN accelerator on an FPGA. Having predefined delays for multiplication and addition and minimizing hardware complexity are crucial for me. I don't perform division, and I don't have to worry about infs and NaNs.
Therefore, I have decided to use a custom internal floating-point representation based on two's complement, as described above. Are there any obvious disadvantages I should be careful about?