IEEE 754 specifies four formats for representing floating-point values: single precision (32 bits), double precision (64 bits), single-extended precision (at least 43 bits, not commonly used) and double-extended precision (at least 79 bits, usually implemented with 80 bits). Only the 32-bit format is required by the standard; the others are optional. Many languages specify that they implement IEEE arithmetic, although sometimes it is optional. The C programming language, for example, allows but does not require IEEE arithmetic. In C implementations that do use it, float typically implements IEEE single precision and double implements IEEE double precision.
The standard is also known as IEEE Standard for Binary Floating-Point Arithmetic (ANSI/IEEE Std 754-1985) and as IEC 559: "Binary floating-point arithmetic for microprocessor systems".
Anatomy of a floating point number
The following is a description of the standard's format for floating-point numbers.
Bit Conventions
Bits within a word of width W are indexed with integers in the range 0 to W-1 inclusive. Bit 0 is drawn on the right. When a word, or a region within a word, is considered as a binary number, the lowest-indexed bit is usually also the least significant.
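Single Precision 32 bit
A single-precision binary floating-point number is stored in a 32-bit word divided into three fields:
S - sign, 1 bit (bit 31)
Exp - biased exponent, 8 bits (bits 30 to 23)
Fraction - fractional part of the mantissa, 23 bits (bits 22 to 0)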
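As an illustration of this layout, the following Python sketch splits a value into its S, Exp and Fraction fields. The function name float32_fields is hypothetical, and the sketch assumes the standard struct module is used to obtain the single-precision bit pattern (the value is first rounded to single precision).

```python
import struct

def float32_fields(x):
    """Return (S, Exp, Fraction) for x rounded to IEEE single precision."""
    # Pack as a big-endian IEEE single, then reread the same 4 bytes
    # as an unsigned 32-bit integer to get the raw bit pattern.
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    sign = word >> 31             # bit 31
    exp = (word >> 23) & 0xFF     # bits 30 to 23
    fraction = word & 0x7FFFFF    # bits 22 to 0
    return sign, exp, fraction

print(float32_fields(1.0))    # (0, 127, 0)
print(float32_fields(-2.5))   # (1, 128, 2097152)
```

Here -2.5 is -1.25 * 2^1, so Exp is 1 + 127 = 128 and the Fraction field holds binary .01, that is, bit 21 set.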
The set of possible data values can be divided into the following classes:
Zeroes
Denormalised numbers
Normalised numbers
Infinities
NaN (Not a Number)
The classes are distinguished mainly by the value of the Exp field; where Exp alone is ambiguous, the Fraction field decides. Consider the Exp and Fraction fields as unsigned binary numbers (for single precision, Exp is in the range 0-255):
Class                  Exp      Fraction
Zeroes                 0        0
Denormalised numbers   0        non-zero
Normalised numbers     1-254    any
Infinities             255      0
NaN (Not a Number)     255      non-zero
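The classification can be written as a short function. This is a minimal sketch for single precision only; the name classify and the returned strings are illustrative choices, not part of the standard.

```python
def classify(exp, fraction):
    """Classify a single-precision value from its Exp and Fraction fields."""
    if exp == 0:
        # Exp of all 0s: zero or denormalised, decided by Fraction.
        return "zero" if fraction == 0 else "denormalised"
    if exp == 255:
        # Exp of all 1s: infinity or NaN, decided by Fraction.
        return "infinity" if fraction == 0 else "NaN"
    # Any other Exp (1-254) is a normalised number, whatever the Fraction.
    return "normalised"

print(classify(127, 0))    # normalised
print(classify(255, 1))    # NaN
```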
For normalised numbers, the most common class, Exp is the biased exponent and Fraction is the fractional part of the mantissa. The number has value v:
v = s * 2^e * m
where
s = +1 (positive numbers) when S is 0
s = -1 (negative numbers) when S is 1
e = Exp - 127 (in other words, the exponent is stored with 127 added to it, also called "biased with 127")
m = 1.Fraction in binary (the binary digit 1, followed by the binary point, followed by the bits of Fraction). Note that 1 <= m < 2.
Denormalised numbers use the same formula, except that e = -126 and m = 0.Fraction, so 0 < m < 1. Note that -126 is also the smallest exponent of a normalised number.
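The formula v = s * 2^e * m can be checked against a real decoder. The sketch below evaluates it directly from a 32-bit pattern; decode_single is a hypothetical helper name, and infinities and NaNs are rejected because the formula does not apply to them.

```python
import struct

def decode_single(word):
    """Evaluate v = s * 2^e * m for a 32-bit single-precision bit pattern."""
    sign = word >> 31
    exp = (word >> 23) & 0xFF
    fraction = word & 0x7FFFFF
    if exp == 255:
        raise ValueError("infinity or NaN: the formula does not apply")
    s = -1.0 if sign else 1.0
    if exp == 0:
        # Zero or denormalised: e is fixed at -126 and m = 0.Fraction.
        e, m = -126, fraction / 2**23
    else:
        # Normalised: e = Exp - 127 and m = 1.Fraction.
        e, m = exp - 127, 1 + fraction / 2**23
    return s * 2.0**e * m

# Cross-check against the struct module's own decoding of the same bits.
word = 0x3FC00000
reference = struct.unpack(">f", struct.pack(">I", word))[0]
print(decode_single(word), reference)   # 1.5 1.5
```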
There are two Zeroes, +0 (S is 0) and -0 (S is 1), and two Infinities, +Inf (S is 0) and -Inf (S is 1).
Notice that NaNs and Infinities have all 1s in the Exp field.
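The special values can be built from their bit patterns to see how they behave. This is a sketch using Python's struct and math modules; bits_to_single is a hypothetical helper, and 0x7FC00000 is just one of many possible NaN patterns.

```python
import math
import struct

def bits_to_single(word):
    """Interpret a 32-bit pattern as an IEEE single-precision value."""
    return struct.unpack(">f", struct.pack(">I", word))[0]

pos_zero = bits_to_single(0x00000000)   # S = 0, Exp = 0,   Fraction = 0
neg_zero = bits_to_single(0x80000000)   # S = 1, Exp = 0,   Fraction = 0
pos_inf = bits_to_single(0x7F800000)    # S = 0, Exp = 255, Fraction = 0
neg_inf = bits_to_single(0xFF800000)    # S = 1, Exp = 255, Fraction = 0
nan = bits_to_single(0x7FC00000)        # Exp = 255, Fraction non-zero

print(pos_zero == neg_zero)             # True: the two zeroes compare equal
print(math.copysign(1.0, neg_zero))     # -1.0: the sign bit is still there
print(pos_inf, neg_inf)                 # inf -inf
print(math.isnan(nan))                  # True
```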
Double Precision 64 bit
Double precision is essentially the same, but the fields are wider: S is still 1 bit (bit 63), Exp is 11 bits (bits 62 to 52) and Fraction is 52 bits (bits 51 to 0).
NaNs and Infinities are represented with Exp being all 1s (2047).
For normalised numbers the exponent bias is +1023 (so e is Exp - 1023).
For denormalised numbers the exponent is -1022 (the minimum exponent for a normalised number).
Comparing floating point numbers
An interesting feature of this representation is that it makes comparing most numbers simple. For positive numbers a and b (sign bit 0), a < b exactly when the unsigned binary integers with the same bit patterns as a and b are ordered the same way. In other words, two positive floating-point numbers can be compared with an unsigned integer comparison of their bits. NaNs are the exception, since they do not compare as ordered with anything.
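This ordering property is easy to check for a few non-negative values, including a denormalised number and +Inf. The sketch below is only a spot check, not a proof; single_bits is a hypothetical helper built on Python's struct module, and the sample values are arbitrary.

```python
import struct

def single_bits(x):
    """Return the single-precision bit pattern of x as an unsigned integer."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    return word

# Non-negative samples, all exactly representable in single precision:
# zero, smallest denormalised, smallest normalised, ordinary values, +Inf.
values = [1024.0, 0.0, 1.5, 2.0**-149, float("inf"), 0.5, 2.0**-126, 1.0]

by_value = sorted(values)
by_bits = sorted(values, key=single_bits)
print(by_value == by_bits)   # True
```

The technique covers only values whose sign bit is 0; negative numbers order the opposite way when their bits are read as unsigned integers.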
See also:
Let's Go To The (Floating) Point by Chris Hecker
Also note that the IEEE 754 standard is currently (2004) under revision. See: http://grouper.ieee.org/groups/754/ and current draft at: http://754r.ucbtest.org/
The field layouts of the two basic formats, drawn with bit 0 on the right:
Single precision (32 bit), bias +127:
 1    8               23               width in bits
+-+--------+-----------------------+
|S|  Exp   |       Fraction        |
+-+--------+-----------------------+
31 30    23 22                    0    bit index (0 on right)
Double precision (64 bit), bias +1023:
 1     11                               52                           width in bits
+-+-----------+----------------------------------------------------+
|S|    Exp    |                      Fraction                      |
+-+-----------+----------------------------------------------------+
63 62       52 51                                                 0    bit index (0 on right)
NaNs are used to represent exceptional cases, such as the square root of a negative number.
This article (or an earlier version of it) contains material from FOLDOC's article on IEEE Floating Point Standard, used with permission.