
The IEEE 754 standard for float64, float32, and float16 uses a sign-magnitude significand and a biased exponent. As a student designing hardware architectures, it makes more sense to me to use two's complement for both the significand and the exponent.

For example, a 32-bit (single precision) float is defined such that the first bit represents the sign, the next 8 bits the exponent (biased by 127), and the last 23 bits the mantissa. To implement addition/multiplication of negative numbers, we need to convert the mantissa to two's complement and back. The resulting hardware is quite complicated.
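To make that layout concrete, here is a minimal decode sketch covering normal values only (zeros, subnormals, infinities, and NaNs deliberately omitted); the function name is my own:

```python
def decode_float32(word):
    """Decode a normal IEEE 754 single-precision value from a 32-bit word."""
    sign = word >> 31                       # bit 31
    exponent = (word >> 23) & 0xFF          # bits 30..23, biased by 127
    fraction = word & 0x7FFFFF              # bits 22..0
    significand = 1 + fraction / (1 << 23)  # implied leading 1 for normals
    return (-1) ** sign * significand * 2.0 ** (exponent - 127)

assert decode_float32(0x3F800000) == 1.0   # +1.0: exponent field 127, frac 0
assert decode_float32(0xC0000000) == -2.0  # -2.0: sign 1, exponent field 128
```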

Instead, consider a layout where the first 8 bits represent the exponent and the last 24 bits represent the mantissa, both in two's complement. Bit shifting, adding, and multiplying are relatively straightforward, and the hardware is less complicated. In addition, we get a unique zero for the significand (versus two zeros in a sign-magnitude representation).
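A minimal sketch of this proposed layout, with the field widths from the description above (an 8-bit two's-complement exponent in bits 31..24 and a 24-bit two's-complement significand in bits 23..0); the helper names are my own:

```python
def to_twos(value, bits):
    """Pack a signed integer into an unsigned two's-complement field."""
    assert -(1 << (bits - 1)) <= value < (1 << (bits - 1))
    return value & ((1 << bits) - 1)

def from_twos(field, bits):
    """Unpack an unsigned two's-complement field back to a signed integer."""
    sign_bit = 1 << (bits - 1)
    return (field ^ sign_bit) - sign_bit

def encode(exponent, significand):
    return (to_twos(exponent, 8) << 24) | to_twos(significand, 24)

def decode(word):
    return from_twos(word >> 24, 8), from_twos(word & 0xFFFFFF, 24)

# Round-trips, including the zero significand, which is now unique:
assert decode(encode(-3, -5)) == (-3, -5)
assert decode(encode(0, 0)) == (0, 0)
```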

I searched for months to find reasons for these design decisions and found these:

  1. Two's complement representations are more difficult to compare.

This is true: we need an adder (subtractor) to compare two's-complement numbers. However, for pipelined architectures such as GPUs and my own FPGA-based CNN accelerator, we need to avoid variable delay. Comparing a sign-magnitude representation bit by bit iteratively makes it impossible to predetermine the delay. In my opinion, a subtraction is better in this case.

  2. Historic reasons: handling NaNs and infs.

Maybe we could allocate one or two bits for this and make the significand 23 bits.

  3. +0 and -0, such that 1/+0 = +inf and 1/-0 = -inf.

Now this is a valid reason. It's not really applicable to my use case, but I wonder whether it would have been better to implement this with an additional bit.
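On the comparison point above: a fixed-latency signed compare falls out of a single N-bit subtraction, deriving "less than" from the sign of the difference XORed with the signed-overflow flag. A sketch at the 24-bit significand width (names are illustrative):

```python
BITS = 24
MASK = (1 << BITS) - 1
SIGN = 1 << (BITS - 1)

def signed_less_than(a, b):
    """LT = N xor V: N is the sign of (a - b), V the signed-overflow flag."""
    diff = (a - b) & MASK
    # Overflow: operand signs differ and the difference's sign differs from a's.
    overflow = bool((a ^ b) & (a ^ diff) & SIGN)
    negative = bool(diff & SIGN)
    return negative != overflow

def as_signed(u):
    """Reference interpretation of an N-bit two's-complement field."""
    return (u ^ SIGN) - SIGN

# Agrees with ordinary signed comparison, including overflow corner cases:
samples = [0, 1, SIGN - 1, SIGN, SIGN + 1, MASK - 1, MASK]
for a in samples:
    for b in samples:
        assert signed_less_than(a, b) == (as_signed(a) < as_signed(b))
```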

My use case

I am building a CNN accelerator on an FPGA. Having predefined delays for multiplication and addition and minimizing hardware complexity are crucial for me. I don't perform division, and I also don't have to worry about infs and NaNs.

Therefore I have decided to use a custom internal representation of floating points using two's complement representation as described above. Are there any obvious disadvantages I should be careful about?

  • Your approach loses tons of useful properties, such as unique representation of nonzero numbers, or symmetry between representable positive and negative numbers, or simple comparison (the order relationship between two bit patterns interpreted as IEEE 754 floats is the same as the order relationship between those bit patterns interpreted as sign-magnitude ints, as long as the floats are finite). Commented Jul 13, 2019 at 7:00
  • Can you elaborate how those useful properties translate to lesser hardware complexity or processing time? Commented Jul 13, 2019 at 7:06
  • Do you think those are the only things the IEEE 754 designers cared about? Commented Jul 13, 2019 at 7:08
  • Forgive me, I don't understand. The whole point of coming up with a floating-point representation is to represent the continuum of real numbers on limited hardware and perform operations on them using minimal hardware while retaining maximum possible precision, right? Commented Jul 13, 2019 at 7:13
  • Why do you need to convert the significand to two's complement to do multiplications? Wouldn't you just treat the significands as unsigned? Commented Jul 13, 2019 at 8:34
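The ordering property mentioned in the first comment (finite IEEE 754 floats order the same as their bit patterns read as sign-magnitude integers) is easy to check, e.g. for float64:

```python
import struct

def bits(x):
    """Raw 64-bit pattern of a float64."""
    return struct.unpack('<Q', struct.pack('<d', x))[0]

def sign_magnitude(u):
    """Read a 64-bit pattern as a sign-magnitude integer."""
    magnitude = u & ((1 << 63) - 1)
    return -magnitude if u >> 63 else magnitude

# For finite floats, the two orderings agree (even across -0.0 and 0.0):
samples = [-1e300, -3.5, -1.0, -0.0, 0.0, 5e-324, 1.0, 2.0, 1e300]
for a in samples:
    for b in samples:
        assert (a < b) == (sign_magnitude(bits(a)) < sign_magnitude(bits(b)))
```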

3 Answers


This is a well-studied topic, and there are systems that are using 2's complement floating-point representations; typically those that predate IEEE-754, though recent incarnations are available too. See this paper for a study of the properties of such a system: https://hal.archives-ouvertes.fr/hal-00157268/document

Kahan himself (the principal architect of the IEEE 754 standard) argued that having separate +0 and -0 is important for the approximations floating point is typically used for, where it matters whether a floating-point 0 result is essentially positive or negative. See https://people.freebsd.org/~das/kahan86branch.pdf for details.

So, yes: it is entirely possible to have two's-complement floats, but the standard picked a sign-magnitude representation. Whichever you pick, some operations will be easy and some will be harder, comparison being the most obvious. Of course, there's nothing stopping you from picking whatever representation suits your needs best if you're designing your own hardware! In particular, you can even go with so-called unums and posits, where the exponent and significand portions are not fixed size but instead depend on where you land in the range. See here: https://www.johndcook.com/blog/2018/04/11/anatomy-of-a-posit-number/




The reason two's complement is used for integer operations is that it allows the same hardware and instructions to be used for both signed and unsigned operations, with just a tiny difference in how overflow is detected. With floating point, no one cares about "unsigned" floating point, so there's no benefit (savings) to using two's complement if you're implementing it at the bit level. The only way I can see an advantage to two's complement is if you are using hardware that already has two's-complement ALUs of some kind.
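The "same hardware for signed and unsigned" point can be shown in a few lines: the adder produces one bit pattern, and only the interpretation differs (an 8-bit sketch, names my own):

```python
BITS = 8
MASK = (1 << BITS) - 1

def hw_add(a, b):
    """What the adder does: N-bit wraparound addition on raw bit patterns."""
    return (a + b) & MASK

def as_signed(u):
    """Reinterpret an 8-bit pattern as a signed two's-complement value."""
    return (u ^ 0x80) - 0x80

# One addition, two readings of the same result bits:
raw = hw_add(0xF0, 0x20)
assert raw == (240 + 32) & MASK    # unsigned view: 272 mod 256 = 16
assert as_signed(raw) == -16 + 32  # signed view: (-16) + 32 = 16
```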

Two's complement also has a major asymmetry in its representation (there are more representable values below zero than above), which causes all kinds of mathematical stability issues in any situation that requires rounding or potential loss of precision, which is exactly what floating point is commonly used for.
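The asymmetry is easy to see at a small width: 8-bit two's complement covers -128..127, so -128 has no positive counterpart and negating it wraps around:

```python
BITS = 8
MASK = (1 << BITS) - 1

def neg(u):
    """Two's-complement negation: invert and add one, modulo 2**BITS."""
    return (~u + 1) & MASK

assert neg(0x01) == 0xFF  # -(+1) == -1, as expected
assert neg(0x80) == 0x80  # -(-128) overflows back to -128: no +128 exists
```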



So I would note several issues:

1. The implied bit is no longer always 1; in two's complement it is `!sign`, the logical negation of the sign bit. This is just an implementation detail and not, in and of itself, hard to deal with.

2. NaN payload issues: consistent behavior will be troublesome, as the quiet-vs-signaling NaN bit would potentially also need to be `!sign` for quietness, and when changing data-type sizes, the consequences of how to sign-extend negative numbers become relevant.

3. Of course, the biggest issue is the ambiguity in representation when `sign == 1 && mantissa == 0`, which leads to two options:
a. Don't strictly follow two's complement, and treat that pattern as negative powers of 2, or as negative infinity. This is by far the easiest option.

b. Don't have a negative zero, and instead accept ambiguity in all negative powers of 2, with a special case for the negative power of 2 at the maximum exponent. This makes the exponent range asymmetric between positive and negative numbers. It also raises further questions about infinity: essentially there would only be a positive infinity, and everything else would be some sort of NaN.
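Point 1 above can be sketched numerically. This assumes the convention that a normalized two's-complement significand lies in [1, 2) when positive and [-2, -1) when negative; field widths and names are illustrative:

```python
FRAC_BITS = 23

def significand_value(sign, frac):
    """Value of the significand  sign . implied . frac  read as a
    two's-complement fixed-point number with FRAC_BITS fraction bits.
    With implied = !sign, positives land in [1, 2), negatives in [-2, -1)."""
    implied = 0 if sign else 1
    raw = (sign << (FRAC_BITS + 1)) | (implied << FRAC_BITS) | frac
    if sign:
        raw -= 1 << (FRAC_BITS + 2)  # subtract the two's-complement sign weight
    return raw / (1 << FRAC_BITS)

assert significand_value(0, 0) == 1.0   # 01.000... is the smallest positive
assert significand_value(1, 0) == -2.0  # 10.000... is the most negative
```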

