IEEE floating-point standard

The IEEE floating-point standard (IEEE 754) is an IEEE standard, used by many CPUs and FPUs, which defines: formats for representing floating-point numbers; representations of special values (zero, infinity, very small values (denormal numbers), and bit combinations that do not represent a number (NaN)); five exceptions, including when they occur and what happens when they do; four rounding modes; and a set of floating-point operations that work identically on any conforming system.

IEEE 754 specifies four formats for representing floating-point values: single-precision (32-bit), double-precision (64-bit), single-extended precision (>= 43-bit, not commonly used) and double-extended precision (>= 79-bit, usually implemented with 80 bits). Only the 32-bit format is required by the standard; the others are optional. Many languages specify that they implement IEEE arithmetic, although sometimes it is optional. The C programming language, for example, allows but does not require IEEE arithmetic. Where a C implementation does use it, float commonly implements IEEE single precision and double implements IEEE double precision.

The standard is also known as the IEEE Standard for Binary Floating-Point Arithmetic (ANSI/IEEE Std 754-1985) and as IEC 559: "Binary floating-point arithmetic for microprocessor systems".

Anatomy of a floating point number

The following is a description of the standard's formats for floating-point numbers.

Bit Conventions

Bits within a word of width W are indexed with integers in the range 0 to W-1 inclusive. Bit 0 is drawn on the right. When the word, or a region within it, is read as a binary number, the lowest-indexed bit is usually also the least significant.
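
The indexing convention can be illustrated with a short Python sketch. The helper names bit and field are hypothetical, chosen only for this example; they are not part of the standard.

```python
def bit(word: int, index: int) -> int:
    """Return bit `index` of `word`, where bit 0 is the least significant."""
    return (word >> index) & 1


def field(word: int, low: int, width: int) -> int:
    """Return the `width`-bit region of `word` whose lowest bit has index `low`."""
    return (word >> low) & ((1 << width) - 1)


word = 0b10110000         # an 8-bit word, so W = 8 and indices run 0..7
print(bit(word, 0))       # bit 0, drawn on the right: 0
print(bit(word, 7))       # bit 7, drawn on the left: 1
print(field(word, 4, 4))  # bits 7..4 read as an unsigned number: 11
```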

Single Precision 32 bit

A binary floating point number is stored in a 32 bit word:

 1     8               23              width in bits
+-+--------+-----------------------+
|S|  Exp   |  Fraction             |
+-+--------+-----------------------+
31 30    23 22                    0    bit index (0 on right)
   bias +127

S - sign

Exp - exponent (stored in biased form)

Fraction - fractional part of the mantissa
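
As an illustration, the fields of the layout above can be pulled out of a value with Python's struct module. This is only a sketch: fields32 is a hypothetical helper name, and the input is first rounded to single precision by struct.pack.

```python
import struct


def fields32(x: float) -> tuple:
    """Split x, rounded to single precision, into (S, Exp, Fraction)."""
    # Pack as a big-endian single, then reread the same 4 bytes as an
    # unsigned 32-bit integer so the bits can be shifted and masked.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = bits >> 31               # bit 31
    exp = (bits >> 23) & 0xFF    # bits 30..23
    fraction = bits & 0x7FFFFF   # bits 22..0
    return s, exp, fraction


print(fields32(1.0))   # (0, 127, 0)
print(fields32(-2.5))  # (1, 128, 2097152)
```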

The set of possible data values can be divided into the following classes:

Zeroes
Normalised numbers
Denormalised numbers
Infinities
NaN (Not a Number)

(NaNs are used to represent exceptional cases, such as the square root of a negative number)

Each class can be distinguished by the value of the Exp field, together with whether the Fraction field is zero.

Consider the Exp and Fraction fields as unsigned binary numbers (Exp will be in the range 0-255):

Class                   Exp      Fraction

Zeroes                  0        0
Denormalised numbers    0        non zero
Normalised numbers      1-254    any
Infinities              255      0
NaN (Not a Number)      255      non zero
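
The classification can be written as a small Python function operating on a raw 32-bit pattern. This is a minimal sketch, and classify32 is a hypothetical name used only here.

```python
def classify32(bits: int) -> str:
    """Name the class of a single-precision bit pattern from Exp and Fraction."""
    exp = (bits >> 23) & 0xFF    # bits 30..23 as an unsigned number
    fraction = bits & 0x7FFFFF   # bits 22..0 as an unsigned number
    if exp == 0:
        return "zero" if fraction == 0 else "denormalised"
    if exp == 255:
        return "infinity" if fraction == 0 else "NaN"
    return "normalised"          # Exp in the range 1-254, any Fraction


print(classify32(0x3F800000))  # normalised (this is the pattern of 1.0)
print(classify32(0x7F800000))  # infinity
print(classify32(0x00000001))  # denormalised
```

The sign bit is ignored, since it does not affect which class a pattern belongs to.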

For normalised numbers, the most common, Exp is the biased exponent and Fraction is the fractional part of the mantissa. The number has value v:

v = s * 2^e * m

Where

s = 1 (positive numbers) when S is 0

s = -1 (negative numbers) when S is 1

e = Exp - 127 (in other words the exponent is stored with 127 added to it, also called "biased with 127")

m = 1.Fraction in binary (the binary number 1 followed by the point followed by the binary bits of Fraction). Note 1 <= m < 2
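
The formula can be checked with a short Python sketch that rebuilds v from a bit pattern. The name decode_normalised32 is hypothetical, and the code assumes Python's float is an IEEE double, which holds every single-precision value exactly.

```python
import struct


def decode_normalised32(bits: int) -> float:
    """Compute v = s * 2^e * m for a normalised single-precision pattern."""
    s_bit = bits >> 31
    exp = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if not 1 <= exp <= 254:
        raise ValueError("not a normalised number")
    s = -1 if s_bit else 1
    e = exp - 127              # remove the bias
    m = 1 + fraction / 2**23   # 1.Fraction in binary, so 1 <= m < 2
    return s * 2.0**e * m


print(decode_normalised32(0x3F800000))  # 1.0
print(decode_normalised32(0xC0200000))  # -2.5

# Cross-check against the interpretation done by the struct module.
(expected,) = struct.unpack(">f", struct.pack(">I", 0x40490FDB))
print(decode_normalised32(0x40490FDB) == expected)  # True
```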

Denormalised numbers are the same but e = -126 and m is 0.Fraction. Note that -126 is the smallest exponent for a normalised number.
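
As a worked case, the smallest positive denormalised number has Fraction = 1, giving 2^-126 * 2^-23 = 2^-149. The Python sketch below applies the denormalised rule to a bit pattern; decode_denormalised32 is a hypothetical name, and Python's float is assumed to be an IEEE double.

```python
def decode_denormalised32(bits: int) -> float:
    """Compute v = s * 2^-126 * m with m = 0.Fraction (requires Exp == 0)."""
    s_bit = bits >> 31
    exp = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exp != 0:
        raise ValueError("Exp must be 0 for a denormalised number")
    s = -1 if s_bit else 1
    m = fraction / 2**23       # 0.Fraction in binary, so 0 <= m < 1
    return s * 2.0**-126 * m


smallest = decode_denormalised32(0x00000001)
largest = decode_denormalised32(0x007FFFFF)
print(smallest == 2.0**-149)   # True
print(largest < 2.0**-126)     # True: just below the smallest normalised number
```

A Fraction of 0 yields zero under the same rule, which is consistent with zeroes sharing Exp = 0.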

There are two Zeroes, +0 (S is 0) and -0 (S is 1), and two Infinities +Inf (S is 0) and -Inf (S is 1).
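
The four bit patterns can be printed with a few lines of Python. This is a sketch only; bits32 is a hypothetical helper name.

```python
import math
import struct


def bits32(x: float) -> int:
    """Return the single-precision bit pattern of x as an unsigned integer."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits


for name, value in [("+0", 0.0), ("-0", -0.0),
                    ("+Inf", math.inf), ("-Inf", -math.inf)]:
    print(f"{name:>4}  0x{bits32(value):08X}")
# Prints 0x00000000, 0x80000000, 0x7F800000 and 0xFF800000 respectively.

# The two zeroes differ only in the sign bit, yet compare as equal.
print(0.0 == -0.0)  # True
```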

Notice that NaNs and Infinities have all 1s in the Exp field.

Double Precision 64 bit

Double precision is basically the same but the fields are wider:

 1     11                                52
+-+-----------+----------------------------------------------------+
|S|  Exp      |  Fraction                                          |
+-+-----------+----------------------------------------------------+
63 62       52 51                                                 0
   bias +1023

NaNs and Infinities are represented with Exp being all 1s (2047).

For Normalised numbers the exponent bias is +1023 (so e is Exp - 1023). For Denormalised numbers the exponent is -1022 (the minimum exponent for a normalised number).
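
The same decoding can be sketched for double precision in Python. The names fields64 and value64 are hypothetical, and the code assumes Python's float is an IEEE double so that results can be compared exactly. The mantissa is handled as the integer m * 2^52 to keep every step exact.

```python
import math
import struct


def fields64(x: float) -> tuple:
    """Split a double into (S, Exp, Fraction)."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    s = bits >> 63                      # bit 63
    exp = (bits >> 52) & 0x7FF          # bits 62..52
    fraction = bits & ((1 << 52) - 1)   # bits 51..0
    return s, exp, fraction


def value64(s: int, exp: int, fraction: int) -> float:
    """Rebuild a normalised or denormalised double from its fields."""
    if exp == 2047:
        raise ValueError("Exp of all 1s encodes an Infinity or NaN")
    sign = -1.0 if s else 1.0
    if exp == 0:
        # Denormalised: e = -1022 and m = 0.Fraction = fraction / 2^52.
        return sign * math.ldexp(fraction, -1022 - 52)
    # Normalised: e = Exp - 1023 and m = 1.Fraction = (2^52 + fraction) / 2^52.
    return sign * math.ldexp((1 << 52) + fraction, exp - 1023 - 52)


print(fields64(1.0))                    # (0, 1023, 0)
print(value64(*fields64(0.1)) == 0.1)   # True
```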

Comparing floating point numbers

An interesting feature of this particular representation is that it makes comparisons of most of the numbers simple. For positive numbers a and b (sign bit is 0), a < b whenever the unsigned binary integers with the same bit patterns as a and b are ordered the same way. In other words, to compare two positive floating-point numbers you can just use an unsigned binary integer comparison on the same bits.
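
This property can be checked for a handful of positive single-precision values in Python. The sketch below is illustrative only: bits32 is a hypothetical helper, and negative numbers and NaNs are deliberately left out because the rule above covers only positive numbers.

```python
import struct


def bits32(x: float) -> int:
    """Return the single-precision bit pattern of x as an unsigned integer."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return bits


# Positive values chosen to be exactly representable in single precision,
# covering zero, denormalised, normalised and infinite cases.
samples = [0.0, 2.0**-149, 2.0**-130, 0.5, 1.0, 1.5, 2.0**127, float("inf")]

for a in samples:
    for b in samples:
        # The floating-point order and the unsigned integer order agree.
        assert (a < b) == (bits32(a) < bits32(b))

print("orderings agree for all", len(samples) ** 2, "pairs")
```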

Also note that the IEEE 754 standard is currently (2004) under revision. See: http://grouper.ieee.org/groups/754/ and current draft at: http://754r.ucbtest.org/

This article (or an earlier version of it) contains material from FOLDOC's article on IEEE Floating Point Standard, used with permission.