# The Joy and Perils of Floating-Point Computations

A tale about Floating Point Fairy, angry carrots and other stuff ## A long time ago in a galaxy far, far away.... USS Enterprise (NCC-1701)
"Power up the engines!!!"  "Sir yes sir!"

## What's going on? • Floating point precision is not uniformly distributed
• The further into the space, the bigger the interval between the two adjacent values
• At some point the added value is less than this interval, and the ship stops
• ...this, and many other confusing things

$python >>> 0.1 + 0.2 == 0.3 False >>> 0.1 + 0.2 0.30000000000000004 >>> 0.3 0.29999999999999999 >>> WAT!?..  • 0.1 is not "0.1" • I'ts "0.001001100110011001(1001)..." • None of these numbers (0.1, 0.2, 0.3) are represented exactly • Humans have ten fingers, computers have two • Almost all of the computing harware novadays ## Anecdote! In Soviet Russia it's ternary • Built in 1958 at the Moscow State University • Balanced ternary, usesing the three digits: -1, 0, and +1 • Natural representation of negative numbers (no "sign bit") • Some computations are more efficient • Donald Knuth argued it might be used in future hardware • ...actually not that bad idea! ### Decimal to binary conversion: whole numbers • Always finite representation • Iteratively divide by 2, modulo goes into binary digits, to the left • Current Decimal Current Binary 135 671 3311 16111 8 0111 4 00111 2 000111 1 1000111 10000111 • 135 (Decimal) == 10000111 (Binary) ### Decimal to binary conversion: fractions • Iteratively multiply by 2, leftmost digit goes into the binary digits, to the right • Current Decimal Current Binary 0.6875 0.3751 0.75 10 0.5101 1011 • 0.6875 (Decimal) == 0.1011 (Binary) ### Decimal to binary conversion: fractions • However, only finite when power of two in denominator • Current Decimal Current Binary 0.8 0.61 0.2 11 0.4110 0.81100 0.611001 ...... • Goes into infinite loop. 0.8 (Decimal) == 0.11001100(1100)... ### Inexact representation: rounding • What to do with these infinite fractions?.. • Chop, of course! • • The last two digits get modified because of rounding • The image shows the standard way of representing it, in "single precision" ## IEEE 754 Standard • The problem: A lot of different hardware, each with own pecularities. Anarchy. • Computer scientists could not rely on their code giving expected results. • William Kahan, "the Father of Floating Point". • Was closely tied to 8087 FP co-processor. "Intel's gift to society". ## IEEE 754 Standard: Scope • Basic and extended floating-point number formats • Add, subtract, multiply, divide, square root, remainder, and compare operations • Conversions between integer and floating-point • Conversion from/to decimal string • Exceptions and their handling ## Single precision: binary32 • The actual value on the picture: • s: sign, 1 bit • e: exponent, 8 bits • Exponent is "biased" by 127 (1111100 = 124, 124-127 = -3) • m: mantissa, 23 bits • One bit in mantissa is implicit, "always present" • This is called "normalized" representation ## "Seven digits after the dot" • "The amount of precision that a floating number can have" • Urban myth... but partially true: • 24 because: 23 bits for mantissa + 1 implicit bit • In C++, std::numeric_limits::digits10 (in C, macro FLT_DIG) gives 6. • Wikipedia says "6 to 9". Get the same number if converting Decimal->IEEE 754->Decimal • All decimals with 6 digits survive "trip and back" • But technically, from 0 to 100+ digits: denormals have fewer, exact numbers have more ## Double precision: binary64 • s: sign, 1 bit • e: exponent, 11 bits • m: mantissa, 52 bits • Same as binary32, but uses twice the amount of bits for everything • Considerably better precision ## Uneven density of the floating point numbers • Precision decreases with increasing magnitude • • Because of exponent/mantissa representation • That's why it is called "floating point" • Actually, logarithmic: • ### Rounding • If digits do not fit, they get chopped (rounded): • • The last two digits get modified because of rounding • There are different ways of rounding. IEEE 754 suggests 5 • The default rounding mode is "round to nearest, tie to even" ## Rounding modes Rounding mode +11.5 +12.5 -11.5 -12.5 to nearest, ties to even +12.0+12.0-12.0-12.0 to nearest, ties from zero+12.0+13.0-12.0-13.0 toward 0 +11.0+12.0-11.0-12.0 toward +inf +12.0+13.0-11.0-12.0 toward -inf +11.0+12.0-12.0-13.0 ## Subnormal numbers • "Gradual underflow", AKA "denormals" • Very useful for some algorithms (e.g. prevents divizion by 0) • But generally very slow! (because of special treatment) • No implicit leading 1! (i.e. binary not normalized) • Use less bits for mantissa. Different representation • For C++/FORTRAN, can be disabled on compiler level ("flush-to-zero" and "denormals-are-zero") ## Special values • +0 and -0 (they are equal by the standard!) • +Inf and -Inf • NaN (not-a-number), QNaN (quiet not-a-number) ## Exceptions • 5 types of exceptions should be signaled when detected • Invalid operation: involving NaN, sqrt(-1), float->int overflow • Division by zero. The result is signed Inf. • Overflow: when rounding and converting • Underflow • Inexact ## Anecdote! USS Yorktown (CG-48) • On 21 Sept. 1997, maneuvering off the coast of Cape Charles • Crewman entered a blank fiend into database • The blank was treated as zero • Caused Divide-by-Zero Exception • Which was not handled properly by the software • The exception was propagated to the system (WinNT 4.0) • ...the vessel paralyzed for 3 hours ## Exceptions: Traps • User should be able to specify a handler (or disable) • Many programming languages just don't care (such as Java) • C/C++ has an API: #include void main() { unsigned int fp_control_word, new_fp_control_word; _controlfp_s(&fp_control_word, 0, 0); // make the new fp env same as the old one, // except for the changes we're going to make new_fp_control_word = fp_control_word | _EM_INVALID | _EM_DENORMAL | _EM_ZERODIVIDE | _EM_OVERFLOW | _EM_UNDERFLOW | _EM_INEXACT; // update the control word with our changes _controlfp_s(&fp_control_word, new_fp_control_word, _MCW_EM) }  ## Anecdote! Ariane5 rocket launch • Launched in June 1996, 10 years and$7 billion to build
• Crashed in 36.7 seconds after launch
• 64 bit floating point number (horizontal velocity of the rocket) converted to 16 bit integer
• The number bigger than in Ariane4; unchecked overflow
• Internal software exception, shut down the navigation system
• ...the ultimate cause is the lack of proper FP exception handling

## Carrots Planting #define NUM_CARROTS 10000          /* the number of carrotts to plant */
const float BOBWALK_WIDTH  = 4.8f; /* the width of the bobwalk (feet) */
const float PLANT_INTERVAL = 1.25f;/* the planting interval (feet) */

float carrot_pos[NUM_CARROTS];     /* carrot positions go here */

int main() {
int i;
/* Plant the carrots! */
for (i = 0; i < NUM_CARROTS; i++) {
carrot_pos[i] = BOBWALK_WIDTH + i*PLANT_INTERVAL;
printf("Carrot %d goes to %g feet!\n", i, carrot_pos[i]);
}
return 0;
}


## Carrots Planting

const float EPSILON = 0.0001f; /* epsilon to check placement */
void test_carrots() {
/* verify that carrots are all planted evenly! */
int i, num_bad = 0;
for (i = 1; i < NUM_CARROTS; i++) {
float d = carrot_pos[i] - carrot_pos[i - 1];
if (fabs(d - PLANT_INTERVAL) > EPSILON) {
printf( "The carrot %d is planted wrong!!!\n" , i);
}
}
if (num_bad == 0) printf( "All carrots placed well!!!");
else printf("%d of %d carrots placed incorrectly... :(",
}

The carrot 3273 is planted wrong!!!
1 of 10000 carrots placed incorrectly... :( ## Carrots Planting

offs = single(4.8) + single(1.25)*[0:9999];
delta = offs(2:end) - offs(1:end-1);
error = delta - single(1.25);

plot(error, 'r-', 'LineWidth', 2);
grid on;
title('Planting distance error');
xlabel('Carrot index');
ylabel('Error (feet)'); ## Carrots Planting

octave> find(abs(error) >= 1e-4)
ans =  3273
octave> non_exact = find(abs(error) != 0)
non_exact =
9     48    201    816   3273
octave> non_exact_offsets = 4.8 + non_exact*1.25
non_exact_offsets =
16.050     64.800    256.050   1024.800   4096.050
octave> error(non_exact)
ans =
-9.5367e-007  3.8147e-006  -1.5259e-005  6.1035e-005  -2.4414e-00

• In majority of the cases the rounding errors cancel each other
• But in a few curious cases (around values 16, 64, 256, 1024, 4096) there is a catastrofic cancellation
• Rounding error in the lowest bits was enough
• "True story" (tm)

## Problem: Catastrofic cancellation

a = 31.006276686241002
b = 31.006276685299816
• a and b are close to each other
• a and b have some garbage in the lower bits
• After the substraction, the "good bits" cancel each other:
• 31.006276686241002 - 31.006276685299816 = 0.000000000941186
• The garbage gets promoted and claims to be a meaningful computation result:
• 10000000000 * 0.000000000941186 = 9.41186

## Problem: Going out of the definition domain ## Problem: Going out of the definition domain

import math
def dot((x1, y1, z1), (x2, y2, z2)):
""" Dot product of two 3-dimensional vectors"""
return x1*x2 + y1*y2 + z1*z2

def normalize((x,y,z)):
""" Normalizes a 3-dimensional vector"""
len = math.sqrt(dot((x, y, z), (x, y, z)))
return (x/len, y/len, z/len)

def angle(a, b):
""" Finds an angle between two 3-dimensional vectors"""
anorm = normalize(a)
bnorm = normalize(b)
return math.acos(dot(anorm, bnorm))


## Problem: Going out of the definition domain

>>> from dot import *
>>> v1=(16, 16, 16)
>>> v2=(32, 32, 32)
>>> v3=(1,2,3)
>>> angle(v1, v3)
0.38759668665518016
>>> angle(v1, v2)
Traceback (most recent call last):
File "", line 1, in
File "dot.py", line 16, in angle
return math.acos(dot(anorm, bnorm))
ValueError: math domain error

>>> dot(normalize(v1), normalize(v2))
1.0000000000000002

• Arccosine is only defined inside [-1, 1]
• Trying to take arccosine from 1.0000000000000002
• Result of accumulated rounding errors

## Problem: Bad conditioning • Bad conditioning number: small changes in input cause big changes in output
• Can happen e.g. when going "continuous"->discrete via decimal rounding/truncation
• The problem: number of inserts (pink dots) defined from the ("constant") length via rounding
• In one case: 2.499999786 -> 2, in other 2.50000000 -> 3
• Need different algorithm/architecture
• "True story" (tm)

## Precision vs Accuracy

Are not the same!!! ## Precision vs Accuracy

Are not the same!!! ## Double vs. Float, Precision vs. Accuracy

int main() {
float a;
double b;

a = 1.15f;
b = a;

printf("1.15 in float: %g\n", a);
printf("1.15 in double: %g", b);
return 0;
} Double has worse accuracy than float???..

## Double vs. Float, Precision vs. Accuracy

• 0.15 in single precision binary format:
• • 0.15 in double precision binary format:
• • 0.15 in float->double precision binary:
• • The digits were chopped off!
• The printing code does "the right thing" for float, but truncated double has lost information

## Floating point comparison

• Naive way that does not work:
• You don't just compare like that (0.1 + 0.2 != 0.3)
• Some programming languages (like ML) don't even allow that

## Floating point comparison

• A "better" way (does not really work either:
• Which epsilon to choose?

• A better way: relative error
• Should also take NaNs into account, when applicable
• Still can be improved

## Floating point comparison

• Can also compare as integers!
• Thanks to the way IEEE 754 representation is designed

• Machine epsilon: the distance between 1.0 and the next representable float number
• "Upper bound on the relative error"
• ULP (unit in the last place). Depends on the actual value.
• ## Remedies: Use higher precision

• Sometimes a valid thing to do
• Would help with carrots planting
• However: might be just postponing/hiding a problem
• However: twice the memory usage
• However: worse SIMD utilization
• Compromise: do selectively, in the critical places

## Anecdote! MMORPG server simulation time • Anarchy Online by Funcom, released in 2001
• Massive Multiplayer Online Role Playing Game (MMORPG)
• Client-server architecture, all simulation run on server
• Initially used single-precision floating point for server time
• ...crashed the server after a couple of weeks uptime

## Remedies: Fixed point computation

• Work with integers and scale when required
• Reality for embedded microprocessors (no FPU anyway)
• Most common binary and decimal scales (10^X or 2^X)
#define NUM_CARROTS 10000        /* the number of carrotts to plant */
const long BOBWALK_WIDTH  = 480; /* the width of the bobwalk (0.01 feet) */
const long PLANT_INTERVAL = 125; /* the planting interval (0.01 feet) */

long carrot_pos[NUM_CARROTS];     /* carrot positions go here */

int main() {
int i;
/* Plant the carrots! */
for (i = 0; i < NUM_CARROTS; i++) {
carrot_pos[i] = BOBWALK_WIDTH + i*PLANT_INTERVAL;
printf("Carrot %d goes to %d.%d feet!\n",
i, carrot_pos[i]/100, carrot_pos[i]%100);
}
return 0;
}


## Remedies: Fixed point computation


void test_carrots() {
/* verify that carrots are planted evenly! */
int i, num_bad = 0;
for (i = 1; i < NUM_CARROTS; i++) {
long d = carrot_pos[i] - carrot_pos[i - 1];
if (d != PLANT_INTERVAL) {
printf("The carrot %d is planted wrong!!!\n", i);
}
}
if (num_bad == 0) printf("All carrots placed well!!!");
else printf( "%d of %d carrots placed incorrectly... :(" ,
}

\$ ./carrots
Carrot 0 goes to 4.80 feet!
Carrot 1 goes to 6.5 feet!
...
Carrot 3272 goes to 4094.80 feet!
Carrot 3273 goes to 4096.5 feet!
...
Carrot 9998 goes to 12502.30 feet!
Carrot 9999 goes to 12503.55 feet!

All carrots placed well!!!


## Remedies: Rational Number Types

• Represent numbers as fractions, e.g. "0.1" is "1/10"
• Some programming languages (Clojure, Haskell, Mathematica, Perl), support natively
• Otherwise, libraries, e.g. GMP (http://gmplib.org)
• Clojure example:
Clojure 1.3.0
user=> (def a 0.1)
user=> (def b 0.2)
user=> (def c 0.3)
user=> (= c (+ a b))
false

user=> (def ar (/ 1 10))
user=> (def br (/ 2 10))
user=> (def cr (/ 3 10))
user=> (= cr (+ ar br))
true

user=> [a b c]
[0.1 0.2 0.3]
user=> [ar br cr]
[1/10 1/5 3/10]


## Remedies: Arbitrary-precision arithmetic

• Using as many digits of precision, "as needed"
• Theoretically limited by the system memory
• Usually quite sophisticated
• Trading correctness for performance
• Some programming languages support natively
• java.math.BigDecimal
• Libraries: Boost Multiprecision Library, MPFR, MAPM, TTMath
• Also, some standalone software (Maple, Mathematica)

### Remedies: Adapt the algorithms

Example: Real roots: -200.05, 0.05

Standard formula: can be inaccurate

Computed roots: -200.050003051758 , 0.0500030517578125, error: 3.05e-6

An alternate formula to fight the catastrofic cancellation:

Computed roots: -200.050003051758 , 0.0499999970197678, error: 3.7e-9

### Improving accuracy: Number summation

/* Naive summation */
float naive_sum(float* val_arr, int nval) {
int i;
float acc = 0.0f;
for (i = 0; i < nval; i++) acc += val_arr[i];
return acc;
}
#define NUM_VAL 10000
int main() {
int i;
float val[NUM_VAL];
for (i = 0; i < NUM_VAL; i++) val[i] = ((float)i + 1);

printf("Naive sum: %f\n", naive_sum(val, NUM_VAL));
return 0;
}
> Naive sum: 50002896.000000

### Improving accuracy: Kahan Summation

Naive summation is a way off:

/* Kahan summation */
float kahan_sum(float* val_arr, int nval) {
int i;
float acc = 0.0f, y, t;
float corr = 0.0f; /* running corrective value for rounding error*/
for (i = 0; i < nval; i++) {
y = val_arr[i] - corr; /* add the correction to the item */
t = acc + y;           /* increase the sum, bits may be lost */
corr = (t - acc) - y;  /* recover the lost bits*/
acc = t;
}
return acc;
}

> Kahan sum: 50005000.000000

## Floating Point Determinism

• Generally no guarantee that the same code would give the same results every time
• Things that might affect: hardware, OS, compiler, compilation flags, builds, 3rd party library behaviour
• Worse with multithreading
• Worse with x87
• Rounding/exception modes is a global per-thread setting
• But gets better with x64!

## Anecdote! Online game synchronization • Cossacks: European Wars by GSC Game World (2000)
• Real-Time Strategy Game (with online multiplayer)
• Run on Win32 PCs, different hardware
• Problems synchronizing
• ...rewrote all the simulation in fixed point!

## Anecdote! Intel vs GCC • Algorithm output matrix comparison
• The same C++ code, compiled with different compilers (GCC 4.7.1 and Intel Compiler)
• The scale is 10E-6. It's very close, but still not exactly the same.
• ...can be other factors as well.

## IEEE 754-2008, going forward

• IEEE 754 was introduced in 1985, quite long time ago
• Hardware capabilities, problems changed quite a bit
• IEEE 754-2008 is an upgrade of IEE 754 (backwards compatible!)
• Resolved minor ambiguities
• Extended with "half precision" (16-bit float) and "quad precision" (128-bit float)
• Arbirary precision formats
• Exception handling recommendations to the programming language implementers
• Fused Multiply-Add operation (A*B + C)
• Added transendental functions

## Anecdote! Not a Floating Point fault • or, "how to mess with people who've learned to expect rounding errors in floating-point math"

• \begin{aligned} e^\pi - \pi \approx 20 \\ \pi^2 \approx \frac {227} {23} , \pi^3 \approx 31 \end{aligned}
• ...not everything explained with "rounding errors"! ## Questions?!!.. 