
The typical reason given for using a biased exponent (also known as offset binary) in floating-point numbers is that it makes comparisons easier.

By arranging the fields so that the sign bit occupies the most significant bit, the biased exponent the middle bits, and the significand the least significant bits, the resulting value is ordered properly whether it is interpreted as a floating-point number or as an integer. The purpose of this is to enable high-speed comparisons between floating-point numbers using fixed-point hardware.

However, because the sign bit of IEEE 754 floating-point numbers is 1 for negative numbers and 0 for positive numbers, the unsigned integer representation of any negative floating-point number is greater than that of any positive one. If the convention were reversed, this would not be the case: every positive floating-point number interpreted as an unsigned integer would be greater than every negative one.

I understand this wouldn't completely trivialize comparisons because NaN != NaN, which must be handled separately (although whether or not this is even desirable is questionable as discussed in that question). Regardless, it's strange that this is the reason given for using a biased exponent representation when it is seemingly defeated by the specified values of the sign and magnitude representation.

There is more discussion on the questions "Why do we bias the exponent of a floating-point number?" and "Why IEEE floating point number calculate exponent using a biased form?" From the first, the accepted answer even mentions this (emphasis mine):

The IEEE 754 encodings have a convenient property that an order comparison can be performed between two positive non-NaN numbers by simply comparing the corresponding bit strings lexicographically, or equivalently, by interpreting those bit strings as unsigned integers and comparing those integers. This works across the entire floating-point range from +0.0 to +Infinity (and then it's a simple matter to extend the comparison to take sign into account).

I can imagine two reasons: first, using a sign bit of 1 for negative values allows the definition of IEEE 754 floating-point numbers in the form (-1)^s × 1.f × 2^(e−b); and second, the floating-point number corresponding to a bit string of all 0s is equal to +0 instead of -0.

I don't see either of these as being meaningful especially considering the common rationale for using a biased exponent.

  • All-zero bits being +0 might actually be a desirable property since x/-0 is -INF, right? Same with stuff like copysign. Being able to memset(x, 0) is a nice feature. Having the bit pattern and various effects be different from *x = 0.f could create some gotchas down the line. Commented Mar 1, 2023 at 10:08
  • It might be useful for some applications at the least but I'm not sure if that would have played any role in the IEEE standardization process or rationale. In any case it only really changes the memset(..., 0) behavior, slightly, which for what it's worth is undefined behavior in C for integer types. Commented Mar 1, 2023 at 10:21
  • It's not just memset btw. It's also the default initialization of global variables, static arrays, etc. Commented Mar 1, 2023 at 11:09
  • No, refer to C99 §6.7.8 Initialization, "If an object that has static storage duration is not initialized explicitly, then: [...] if it has arithmetic type, it is initialized to (positive or unsigned) zero" and later "all subobjects that are not initialized explicitly shall be initialized implicitly the same as objects that have static storage duration" etc. Using memset to initialize objects is generally undefined, and would only ever be valid for floating-point types if __STDC_IEC_559__ is defined (in theory). Commented Mar 2, 2023 at 0:52
  • The same clause is used to specify the default initialization of pointer types to null pointers. It is a common misconception that the C standard specifies that a null pointer must have an object representation of all null characters (corresponding to an all 0 bit string), the same as for integer types. As far as I know all modern systems do work this way, but it's still not a good idea because it is unnecessary and introducing undefined behavior can in some cases cause bad compiler optimizations. Commented Mar 2, 2023 at 0:56

2 Answers


Back in the day, signed integers were encoded using 2's complement (ubiquitous today), 1s' complement and signed magnitude - with some variations on -0 and trap values.

All 3 could be realized well enough in hardware with similar performance and hardware complexity. A sizeable amount of hardware and software designs exist for all 3.

IEEE Floating point can do compares quite easily when viewed as signed magnitude.

OP's suggested "If this were reversed" creates a 4th integer encoding.


Why do IEEE 754 floating-point numbers use a sign bit of 1 for negative numbers?

To mimic the symmetry of signed magnitude integers, take advantage of prior art and not yet another encoding.


8 Comments

Isn't it the case that all floating-point formats preceding those defined by IEEE-754 and going back to the 22-bit floating-point format used in Zuse's Z3 computer (designed in 1938 and completed in 1941), continuing with IBM and DEC floating-point formats and many others, already used the convention: sign bit=1 is negative, sign bit=0 is positive? Which means the IEEE-754 committee applied the principle of "least surprise" by continuing the convention. While compatibility with integer conventions may have played a role in some earlier decisions I am not aware of a primary source that says so.
@njuffa What I do recall, circa late 70's, was learning how to convert an integer in 1 of 3 formats to the other 2 both in HW/SW. Also hearing then FP was like signed-magnitude. IEEE-754 committee needed to gain acceptance and using a known integer like format (echoing your "least surprise") was certainly a motivation. IIRC, earliest mechanical/electronic HW was all signed magnitude given its simplistic match to how we do math by hand.
@njuffa It is only changes like 2's complement, with its somewhat simpler HW realization (and 2^n values), that caused it to win the integer race, yet that is asymmetric for FP.
@njuffa Now if we could all agree on a common endian....
C.R. Severance, "IEEE 754: An Interview with William Kahan", Computer 31(3):114-115: "WK: The existing DEC VAX format had the advantage of a broadly installed base. Originally, the DEC double-precision format [...] too few exponent bits for some double-precision computations. DEC addressed this by introducing its G double-precision format, which supported an 11-bit exponent and which was the same as the CDC floating-point format. With the G format, the major remaining difference between the Intel format and the VAX format was gradual underflow."

I found the reference "Radix Tricks" on the Wikipedia article for the IEEE 754 standard, where in the section titled "Floating point support" the author describes the steps necessary to compare two floating-point numbers as unsigned 2's complement integers (specifically, 32-bit IEEE 754 single-precision floating-point numbers).

In it, the author points out that simply flipping the sign bit is insufficient, because the encoded significand of a larger-magnitude negative number, interpreted as an unsigned integer, is greater than that of a smaller-magnitude one, when of course the larger-magnitude negative number should compare less. Similarly, a negative number with a larger biased exponent is actually less than one with a smaller biased exponent, so negative numbers with the unbiased exponent emax are less than those with the unbiased exponent emin.

In order to correct for this, the sign bit should be flipped for positive numbers, and all bits should be flipped for negative numbers. The author presents the following algorithm:

uint32_t cmp(uint32_t f1, uint32_t f2)
{
    /* Negative: flip all bits; positive: flip only the sign bit.
       (0u - x) is the well-defined unsigned negation of x.) */
    f1 ^= (0u - (f1 >> 31)) | 0x80000000u;
    f2 ^= (0u - (f2 >> 31)) | 0x80000000u;
    return f1 < f2;
}

The purpose of explaining this is to clarify that inverting the sign bit alone does not make it possible to compare finite floating-point numbers directly as unsigned 2's complement integers. By contrast, sign-magnitude hardware (which interprets the sign bit as a sign bit, not as part of an unsigned integer) requires no additional bitwise operations and should therefore result in the simplest, smallest, and most efficient design.

It is possible to create a floating-point format that uses 2's complement, and such encodings have been studied, as detailed in this paper. However, that is far beyond the scope of the question and involves many additional complexities and problems to be solved. Perhaps there is a better way, but the IEEE 754 design has the advantage of being demonstrably satisfactory in practice.

