Floating Point Representation in Hexadecimal using C Language

Question

I typed the following C code:

#include <stdio.h>

typedef unsigned char *byte_pointer;

/* Print len bytes starting at start, in memory order, as two-digit hex. */
void show_bytes(byte_pointer start, size_t len)
{
  size_t i;
  for (i = 0; i < len; i++)
    printf(" %.2x", start[i]);
  printf("\n");
}

void show_float(float x)
{
  show_bytes((byte_pointer) &x, sizeof(float));
}

int main(void)
{
  float f = 3.9;
  show_float(f);
  return 0;
}

The output of this code is: 0x4079999A

Manual calculation: 1.1111001100110011001100 × 2¹

M: 1111001100110011001100

E: 1 + 127 = 128(d) = 10000000(b)

Final Bits: 01000000011110011001100110011000

Hex: 0x40799998

Why is the last digit displayed as A instead of 8?

What happens if you have float f = 3.9f; ? Your 3.9 is a double value. Note that 3.9 cannot be converted exactly to a finite binary floating-point representation, so implementation details determine whether the approximation is rounded up or rounded down.
– Weather Vane, Commented Nov 4, 2021 at 10:42
@WeatherVane: it should not change much. Converting the double value 3.9 to float most likely gives the same value as 3.9f. I would expect the representation at higher precision to be 0x4079999999999..., which is normally rounded to the nearest representable floating-point value, 0x4079999A. The question is which manual calculation could lead to 0x40799998...
– Serge Ballesta, Commented Nov 4, 2021 at 10:53
@SergeBallesta yes, in practice I found no difference. A deleted comment queried the basis of the manual calculation ;)
– Weather Vane, Commented Nov 4, 2021 at 10:56
@WeatherVane: Just FYI, the shortest decimal numeral that differs when converted first to a double and then a float has seven significant digits, and there is only one of them, except for the sign.
– Eric Postpischil, Commented Nov 4, 2021 at 11:53

Eric Postpischil · Accepted Answer · 2021-11-04 11:44:02Z

Quoting the question: "As per manual calculations the answer in Hex should supposed to be: Output: 0x40799998"

Those undisclosed manual calculations must be wrong. The correct result is 4079999A₁₆.

In the format commonly used for float, IEEE-754 binary32 or “single precision,” numbers are represented as an integer with magnitude less than 2²⁴ multiplied by a power of two within certain limits. (The floating-point representation is often described in other forms, such as a sign, a 24-digit binary significand with radix point after the first digit, and a power of two. These forms are mathematically equivalent.)

The two numbers in this form closest to 3.9 are 16,357,785•2⁻²² and 16,357,786•2⁻²². These are, respectively, 3.8999998569488525390625 and 3.900000095367431640625. Lining them up, we can see the latter is closer to 3.9:

3.8999998569488525390625
3.9000000000000000000000
3.9000000953674316406250

as the former differs by about 1.4 at the seventh digit after the decimal point, whereas the latter differs by about 9.5 at the eighth digit after the decimal point.

Therefore, the best conversion of 3.9 to this float format produces 16,357,786•2⁻²². In hexadecimal, 16,357,786 is F9999A₁₆. In the encoding of the representation into the bits of a float, the low 23 bits of the significand are put into the primary significand field. The low 23 bits are 79999A₁₆, and that is what we should see in the primary significand field.

Also note we can easily see the binary for 3.9 is 11.11100110011001100110011001100110011001100110…₂. The first 24 bits, 11.1110011001100110011001, are the ones that fit in the float significand. Immediately after them is 1001…, which we can see ought to round up, since it exceeds half the value of the last kept bit, and therefore the last four bits of the significand should be 1010.

(Also note that good C implementations convert numerals in source text to the nearest representable number, especially for numbers without many decimal digits, but the C standard does not require this. It says “For decimal floating constants, and also for hexadecimal floating constants when FLT_RADIX is not a power of 2, the result is either the nearest representable value, or the larger or smaller representable value immediately adjacent to the nearest representable value, chosen in an implementation-defined manner.” However, the encoding shown in the question, 40799998₁₆, is not for either of the adjacent representable values, 40799999₁₆ and 4079999A₁₆. It is farther away than either.)
