Floating point

Floating-point numbers are the numbers most often used in a computer to represent non-integer values. They are approximations of real numbers.

A floating-point number has a sign s (in {-1, 1}), a mantissa m (also called the significand) and an exponent e. Such a triple represents the real number s · m · b^e, where b is the base of the representation (generally 2 on computers, but also 16 on certain old machines, 10 on many calculators, or possibly any other value). Varying e makes the point "float". Generally, m has a fixed size.
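
The decomposition into sign, mantissa and exponent can be observed directly. The following is a minimal sketch in Python, assuming base 2 and a mantissa normalized to 1 ≤ m < 2; the helper name `decompose` is hypothetical and chosen only for this illustration.

```python
import math

def decompose(x: float):
    """Return (s, m, e) with x == s * m * 2**e and 1 <= m < 2.

    Assumes x is finite and nonzero.
    """
    s = -1 if math.copysign(1.0, x) < 0 else 1
    # math.frexp returns (f, k) with abs(x) == f * 2**k and 0.5 <= f < 1.
    f, k = math.frexp(abs(x))
    # Rescale so that the mantissa lies in [1, 2).
    return s, f * 2, k - 1

s, m, e = decompose(-6.5)
print(s, m, e)          # -1 1.625 2
print(s * m * 2 ** e)   # -6.5
```

Here -6.5 is recovered as -1 × 1.625 × 2^2; changing only e would move the binary point without changing the digits of m.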

This contrasts with the fixed-point representation, where the exponent e is fixed.

Differences in the internal representation and behavior of floating-point numbers from one computer to another meant that scientific programs had to be carefully reworked to port them from one machine to another, until a standard was proposed by the IEEE.

Implementations

The IEEE 754 standard

The IEEE 754 standard (adopted as the international standard IEC 60559) specifies two floating-point number formats (and two optional extended formats) and the associated operations. Almost all current computer architectures, including IA32, PowerPC, and AMD64, include a hardware implementation of IEEE floating-point arithmetic directly in the microprocessor, guaranteeing fast execution.

The two formats fixed by the IEEE 754 standard are 32 bits wide ("single precision") and 64 bits wide ("double precision"). The bits are distributed as follows, where 1 ≤ m < 2:

  Format                        Sign     Exponent   Mantissa (stored)
  32 bits (single precision)    1 bit    8 bits     23 bits
  64 bits (double precision)    1 bit    11 bits    52 bits

The table above indicates the bits actually stored. Since the first bit of the mantissa of a normalized number is always 1, it is not stored in these two formats: this is called the implicit bit. For these two formats, the precisions are therefore 24 and 53 bits respectively.
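
The stored fields can be extracted from the raw bits. This is a sketch in Python using the standard `struct` module, assuming the usual IEEE 754 layout (1 sign bit, then 8 or 11 exponent bits, then 23 or 52 stored mantissa bits). The function names `fields32` and `fields64` are hypothetical, and the exponent bias mentioned in the comments (127 and 1023) is a detail of the standard that the text above does not state.

```python
import struct

def fields64(x: float):
    """Return the stored (sign, exponent, mantissa) fields of a double."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF       # 11 bits, stored with a bias of 1023
    mantissa = bits & ((1 << 52) - 1)     # 52 bits, implicit leading 1 not stored
    return sign, exponent, mantissa

def fields32(x: float):
    """Same for single precision; x is first rounded to 32 bits."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF        # 8 bits, stored with a bias of 127
    mantissa = bits & ((1 << 23) - 1)     # 23 bits, implicit leading 1 not stored
    return sign, exponent, mantissa

print(fields64(1.0))    # (0, 1023, 0)
print(fields32(-2.5))   # (1, 128, 2097152)
```

For 1.0 the stored mantissa field is zero: the leading 1 is the implicit bit and occupies no storage.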

Extended formats

Certain implementations add one or more types of higher precision (for example, IA32 has an 80-bit extended type). The IEEE 754 standard specifies minimum sizes for these extended types. These "extended" representations do not necessarily use the implicit bit of the mantissa.

In practice, only double extended precision is still used, in its minimal form (1+15+64 = 80 bits, the well-known extended format of IA32).

When IEEE floating-point numbers offer insufficient precision, one may have to resort to floating-point computation in higher precision. The MPFR library is a notable example.
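
The effect of working at higher precision can be illustrated without MPFR itself. The sketch below does not use MPFR (a C library); as a stand-in it uses Python's standard `decimal` module, which works in base 10 with a configurable number of digits. The choice of 50 digits is an arbitrary assumption for this illustration.

```python
import math
from decimal import Decimal, getcontext

getcontext().prec = 50           # work with 50 significant decimal digits

root_hp = Decimal(2).sqrt()      # square root of 2 at 50 digits
root_fp = math.sqrt(2.0)         # IEEE double precision, about 16 digits

print(root_hp)
print(root_fp)

# The double agrees with the high-precision value only in its leading digits.
error = abs(Decimal(root_fp) - root_hp)
print(error < Decimal("1e-15"))  # True
```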

Precautions for use

Floating-point calculations are convenient, but have various drawbacks, in particular:
  • their limited precision, which results in rounding errors (due to the operations, but also to implicit base conversions when the base is not 10) that can accumulate in troublesome ways. For this reason, accounting is not done in floating point, because every result must be exact to the cent. In particular, subtracting two very close numbers causes a large loss of relative precision: this is called "cancellation".
  • a limited exponent range, which can give rise to "overflows" (when the result of an operation is larger than the largest representable value) and to "underflows" (when a result is smaller, in absolute value, than the smallest positive normalized floating-point number), and then to results that no longer make any sense.
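
Each of these effects is easy to reproduce. The following sketch in Python assumes that `float` is an IEEE double precision number, which is the case on common platforms; the specific values (0.1, 1e-15) are arbitrary choices for this illustration.

```python
import sys

# Rounding: 0.1 has no exact binary representation, so errors accumulate.
print(sum([0.1] * 10) == 1.0)     # False
# Accounting in integer cents stays exact.
print(sum([10] * 10) == 100)      # True

# Cancellation: subtracting two nearby numbers exposes earlier rounding.
tiny = 1e-15
diff = (1.0 + tiny) - 1.0
rel_error = abs(diff - tiny) / tiny
print(rel_error)                  # roughly 0.11, i.e. an 11% relative error

# Overflow: the result exceeds the largest representable value.
overflowed = sys.float_info.max * 2
print(overflowed)                 # inf

# Underflow: the result is too small in absolute value to be represented.
underflowed = 5e-324 / 2
print(underflowed)                # 0.0
```

Keeping monetary amounts as integer numbers of cents is one common way to avoid the first problem; it is shown here only as an illustration, not as the sole remedy.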

It is tempting, for example, to rearrange floating-point expressions as one would rearrange mathematical expressions. This is not harmless, however, because floating-point operations, unlike operations on real numbers, are not associative. For example, in IEEE double precision, (2^60 + 1) - 2^60 does not give 1, but 0. The reason is that 2^60 + 1 is not exactly representable and is approximated by 2^60.
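
The loss of associativity can be checked directly, with 2^60 denoting 2 to the power 60. This is a minimal sketch in Python, assuming that `float` is an IEEE double precision number.

```python
big = 2.0 ** 60

# 2^60 + 1 needs 61 significant bits, more than the 53 available,
# so the sum is rounded back to 2^60.
print(big + 1.0 == big)     # True

# The same three operands, grouped in two different ways:
print((big + 1.0) - big)    # 0.0
print((big - big) + 1.0)    # 1.0
```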

One particular value of the exponent field is reserved for representing special values:

  • NaN ("not a number"), which is for example the result of attempting a floating-point division of zero by zero, or of taking the square root of a strictly negative number. NaNs propagate: most operations involving a NaN give NaN (exceptions are possible, such as NaN to the power 0, which can give 1).
  • positive infinity and negative infinity, which are for example the result of an overflow in round-to-nearest mode.

Another value of the exponent field is reserved for the (signed) zeros and for denormalized numbers.
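
These special values can be inspected as follows. This is a sketch in Python, assuming that `float` is an IEEE double precision number; the helper `exponent_field` is hypothetical and simply reads the 11 exponent bits. Note that Python itself raises an exception for `0.0 / 0.0` and `math.sqrt(-1.0)` instead of returning NaN, so the NaN is produced here in other ways.

```python
import math
import struct

nan = float("nan")
inf = float("inf")

print(nan != nan)                   # True: a NaN compares unequal even to itself
print(math.isnan(nan + 1.0))        # True: NaN propagates through operations
print(nan ** 0)                     # 1.0: one of the exceptions to propagation
print(math.isnan(inf - inf))        # True: an invalid operation gives NaN

print(0.0 == -0.0)                  # True: the two zeros compare equal...
print(math.copysign(1.0, -0.0))     # -1.0: ...but the sign is kept

def exponent_field(x: float) -> int:
    """Return the raw 11-bit exponent field of a double."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return (bits >> 52) & 0x7FF

# All bits set: reserved for infinities and NaN.
print(exponent_field(inf), exponent_field(nan))       # 2047 2047
# All bits clear: reserved for zeros and denormalized numbers.
print(exponent_field(-0.0), exponent_field(5e-324))   # 0 0
```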

External links

  • Course on floating-point arithmetic
  • Page for automatically encoding a number in floating point
  • Warning about various non-intuitive behaviors of floating-point numbers
