A tale about Floating Point Fairy, angry carrots and other stuff

USS Enterprise (NCC-1701)

"Power up the engines!!!"

"Sir yes sir!"

- Floating-point precision is not uniformly distributed
- The further out into space, the bigger the gap between two adjacent representable values
- At some point the added velocity is less than this gap, and the ship stops
...this, and many other confusing things
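The starship scenario is easy to reproduce; a minimal Python sketch in double precision:

```python
# At 10^16, the gap between adjacent double-precision values is 2.0;
# adding 1.0 "foot" of thrust no longer changes the position.
position = 10.0 ** 16
print(position + 1.0 == position)  # True -- the ship has stopped
```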

```
$ python
>>> 0.1 + 0.2 == 0.3
False
>>> 0.1 + 0.2
0.30000000000000004
>>> 0.3
0.29999999999999999
>>> WAT!?..
```

- 0.1 is not "0.1"
- It's "0.000110011001100(1100)..." in binary
- None of these numbers (0.1, 0.2, 0.3) is represented exactly
- Humans have ten fingers, computers have two
- Almost all computing hardware nowadays is binary

- Built in 1958 at Moscow State University (the Setun computer)
- Balanced ternary, using three digits: -1, 0, and +1
- Natural representation of negative numbers (no "sign bit")
- Some computations are more efficient
- Donald Knuth argued it might be used in future hardware
...actually not that bad an idea!

- Always a finite representation
- Iteratively divide by 2; the remainder goes into the binary digits, to the left
- 135 (decimal) == 10000111 (binary)

Current Decimal | Current Binary
---|---
135 |
67 | 1
33 | 11
16 | 111
8 | 0111
4 | 00111
2 | 000111
1 | 0000111
0 | 10000111

- Iteratively multiply by 2; the integer part goes into the binary digits, to the right
- 0.6875 (decimal) == 0.1011 (binary)

Current Decimal | Current Binary
---|---
0.6875 |
[1] 0.375 | 1
[0] 0.75 | 10
[1] 0.5 | 101
[1] 0.0 | 1011

- However, the result is only finite when the denominator (in lowest terms) is a power of two
- Otherwise it goes into an infinite loop: 0.8 (decimal) == 0.11001100(1100)...

Current Decimal | Current Binary
---|---
0.8 |
[1] 0.6 | 1
[1] 0.2 | 11
[0] 0.4 | 110
[0] 0.8 | 1100
[1] 0.6 | 11001
... | ...
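The multiply-by-2 procedure above can be sketched in a few lines of Python (a minimal illustration, not production code; `frac_to_binary` is a made-up helper name):

```python
def frac_to_binary(x, max_digits=20):
    """Convert a fraction 0 <= x < 1 to a binary digit string
    by repeatedly multiplying by 2 and taking the integer part."""
    digits = []
    for _ in range(max_digits):
        if x == 0:
            break  # finite representation found
        x *= 2
        digits.append('1' if x >= 1 else '0')
        x -= int(x)  # keep only the fractional part
    return '0.' + ''.join(digits)

print(frac_to_binary(0.6875))  # '0.1011' -- terminates
print(frac_to_binary(0.8))     # '0.11001100110011001100' -- cut off at 20 digits
```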

- What to do with these infinite fractions?..
- Chop, of course!
- The last two digits get modified because of rounding
- The image shows the standard way of representing it, in "single precision"

- Z1, built in 1935-1938 by Konrad Zuse
- The first programmable machine to support floating-point computation
- Had all the basic ingredients of modern machines
- Binary number system
- 22-bit words: 14 mantissa bits (plus one extra implied), a 7-bit exponent, and a sign bit
- Followed by Z2, Z3, and the 32-bit Z4

- Announced in 1980
- A separate microprocessor working together with the 8086
- Sped up floating-point computations by up to 500%
- Uses a stack-based register model, unlike the main CPU
- 80-bit internal register precision
- Later CPUs integrated the functionality into the main CPU
- Triggered the IEEE 754 standardization effort

- A hardware problem with floating-point division
- Affected certain Intel P5 (Pentium) models introduced in 1993
- The problem was discovered by professor Thomas Nicely
- Rare: about 1 in 9 billion floating-point divides
- Could be verified via the division 4195835/3145727: 1.333739068902037589 instead of the correct 1.333820449136241002
...after a public outcry, Intel offered replacements

- The problem: a lot of different hardware, each with its own peculiarities. Anarchy.
- Computer scientists could not rely on their code giving the expected results.
- William Kahan, "the Father of Floating Point".
- The standard was closely tied to the 8087 FP coprocessor: "Intel's gift to society".

- Basic and extended floating-point number formats
- Add, subtract, multiply, divide, square root, remainder, and compare operations
- Conversions between integer and floating-point
- Conversion from/to decimal string
- Exceptions and their handling

- The actual value on the picture:
- s: sign, 1 bit
- e: exponent, 8 bits
- Exponent is "biased" by 127 (1111100 = 124, 124-127 = -3)
- m: mantissa, 23 bits
- One bit in mantissa is implicit, "always present"
- This is called "normalized" representation
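The sign/exponent/mantissa fields can be inspected directly; a small sketch using Python's struct module (`decode_binary32` is a hypothetical helper, named after the fields in the text):

```python
import struct

def decode_binary32(x):
    """Unpack an IEEE 754 single-precision value into (s, e, m) bit fields."""
    bits, = struct.unpack('>I', struct.pack('>f', x))
    s = bits >> 31          # sign, 1 bit
    e = (bits >> 23) & 0xFF # biased exponent, 8 bits
    m = bits & 0x7FFFFF     # mantissa, 23 bits (implicit leading 1 not stored)
    # For normalized numbers the value is (-1)^s * 1.m * 2^(e - 127)
    return s, e, m

s, e, m = decode_binary32(0.15625)  # 0.00101 in binary = 1.01 * 2^-3
print(s, e, m)  # 0 124 2097152 -- note the biased exponent 124 = -3 + 127
```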

- "The amount of precision that a floating-point number can have"
- An urban myth... but partially true:
- 24 because: 23 mantissa bits + 1 implicit bit
- In C++, std::numeric_limits<float>::digits10 (in C, the macro FLT_DIG) gives 6
- Wikipedia says "6 to 9": you get the same number back when converting decimal -> IEEE 754 -> decimal
- All decimals with 6 significant digits survive the "trip and back"
- But technically, anywhere from 0 to 100+ digits: denormals have fewer, exactly representable numbers have more

- s: sign, 1 bit
- e: exponent, 11 bits
- m: mantissa, 52 bits
- Same as binary32, but uses twice the amount of bits for everything
- Considerably better precision

- Precision decreases with increasing magnitude
- A consequence of the exponent/mantissa representation
- That's why it is called "floating point"
- The distribution is actually logarithmic:
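The growing gap between adjacent values can be measured with math.ulp (available since Python 3.9); a quick double-precision illustration:

```python
import math

# The distance to the next representable double grows with magnitude:
for x in (1.0, 1000.0, 1e6, 1e9, 1e15):
    print(x, math.ulp(x))
# At 1e15 the gap is already 0.125, so sub-millimetre precision is long gone.
```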

- If digits do not fit, they get chopped (rounded)
- The last digits get modified because of rounding
- There are different ways of rounding; IEEE 754 defines 5
- The default rounding mode is "round to nearest, ties to even"

Rounding mode | +11.5 | +12.5 | -11.5 | -12.5
---|---|---|---|---
to nearest, ties to even | +12.0 | +12.0 | -12.0 | -12.0
to nearest, ties away from zero | +12.0 | +13.0 | -12.0 | -13.0
toward 0 | +11.0 | +12.0 | -11.0 | -12.0
toward +Inf | +12.0 | +13.0 | -11.0 | -12.0
toward -Inf | +11.0 | +12.0 | -12.0 | -13.0
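Python's built-in round follows the same default, "round to nearest, ties to even" (so ".5" does not always round up):

```python
# Ties go to the even neighbour:
print(round(11.5), round(12.5))    # 12 12
print(round(-11.5), round(-12.5))  # -12 -12
```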

- "Gradual underflow", AKA "denormals"
- Very useful for some algorithms (e.g. prevents division by 0)
- But generally very slow! (because of special treatment in hardware)
- No implicit leading 1! (i.e. the mantissa is not normalized)
- Uses fewer bits for the mantissa; a different representation
- For C++/Fortran, can be disabled at the compiler level ("flush-to-zero" and "denormals-are-zero")

- +0 and -0 (they are equal by the standard!)
- +Inf and -Inf
- NaN (not-a-number), QNaN (quiet not-a-number)
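These special values behave in peculiar but standard-mandated ways; a short Python demonstration:

```python
import math

pos_inf = float('inf')
nan = float('nan')

print(0.0 == -0.0)        # True  -- the standard says +0 and -0 compare equal
print(1.0 / pos_inf)      # 0.0
print(pos_inf - pos_inf)  # nan   -- an invalid operation yields NaN
print(nan == nan)         # False -- NaN is not equal even to itself
print(math.copysign(1.0, -0.0))  # -1.0 -- yet -0 still remembers its sign
```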

- 5 types of exceptions should be signaled when detected
- Invalid operation: involving NaN, sqrt(-1), float->int overflow
- Division by zero. The result is signed Inf.
- Overflow: when rounding and converting
- Underflow
- Inexact
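Languages differ in which of these they surface. CPython, for instance, traps float division by zero and raises an exception instead of returning the signed Inf the standard prescribes:

```python
try:
    result = 1.0 / 0.0
except ZeroDivisionError:
    result = None  # Python refuses to hand back +Inf here

print(result)  # None
```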

- On 21 Sept. 1997, the USS Yorktown was maneuvering off the coast of Cape Charles
- A crewman entered a blank field into a database
- The blank was treated as zero
- This caused a divide-by-zero exception
- Which was not handled properly by the software
- The exception propagated to the operating system (Windows NT 4.0)
...leaving the vessel paralyzed for 3 hours

- The user should be able to specify a handler (or disable the exception)
- Many programming languages just don't care (such as Java)
- C/C++ has an API (here the MSVC variant, declared in `<float.h>`):

```
#include <float.h>

int main() {
    unsigned int fp_control_word, new_fp_control_word;
    _controlfp_s(&fp_control_word, 0, 0);
    /* make the new fp env same as the old one,
       except for the changes we're going to make */
    new_fp_control_word = fp_control_word | _EM_INVALID | _EM_DENORMAL |
        _EM_ZERODIVIDE | _EM_OVERFLOW | _EM_UNDERFLOW | _EM_INEXACT;
    /* update the control word with our changes */
    _controlfp_s(&fp_control_word, new_fp_control_word, _MCW_EM);
    return 0;
}
```

- Launched in June 1996; took 10 years and $7 billion to build
- Crashed 36.7 seconds after launch
- A 64-bit floating-point number (the horizontal velocity of the rocket) was converted to a 16-bit integer
- The value was bigger than on Ariane 4; the overflow was unchecked
- An internal software exception shut down the navigation system
...the ultimate cause was the lack of proper FP exception handling

```
#include <stdio.h>

#define NUM_CARROTS 10000 /* the number of carrots to plant */

const float BOBWALK_WIDTH = 4.8f;   /* the width of the bobwalk (feet) */
const float PLANT_INTERVAL = 1.25f; /* the planting interval (feet) */
float carrot_pos[NUM_CARROTS];      /* carrot positions go here */

int main() {
    int i;
    /* Plant the carrots! */
    for (i = 0; i < NUM_CARROTS; i++) {
        carrot_pos[i] = BOBWALK_WIDTH + i*PLANT_INTERVAL;
        printf("Carrot %d goes to %g feet!\n", i, carrot_pos[i]);
    }
    return 0;
}
```

```
#include <math.h> /* for fabs */

const float EPSILON = 0.0001f; /* epsilon to check placement */

void test_carrots() {
    /* verify that carrots are all planted evenly! */
    int i, num_bad = 0;
    for (i = 1; i < NUM_CARROTS; i++) {
        float d = carrot_pos[i] - carrot_pos[i - 1];
        if (fabs(d - PLANT_INTERVAL) > EPSILON) {
            printf("The carrot %d is planted wrong!!!\n", i);
            num_bad++;
        }
    }
    if (num_bad == 0) printf("All carrots placed well!!!");
    else printf("%d of %d carrots placed incorrectly... :(",
                num_bad, NUM_CARROTS);
}
```

```
The carrot 3273 is planted wrong!!!
1 of 10000 carrots placed incorrectly... :(
```

```
offs = single(4.8) + single(1.25)*[0:9999];
delta = offs(2:end) - offs(1:end-1);
error = delta - single(1.25);
plot(error, 'r-', 'LineWidth', 2);
grid on;
title('Planting distance error');
xlabel('Carrot index');
ylabel('Error (feet)');
```

```
octave> find(abs(error) >= 1e-4)
ans = 3273
octave> non_exact = find(abs(error) != 0)
non_exact =
9 48 201 816 3273
octave> non_exact_offsets = 4.8 + non_exact*1.25
non_exact_offsets =
16.050 64.800 256.050 1024.800 4096.050
octave> error(non_exact)
ans =
-9.5367e-007 3.8147e-006 -1.5259e-005 6.1035e-005 -2.4414e-00
```

- In the majority of cases the rounding errors cancel each other out
- But in a few curious cases (around the values 16, 64, 256, 1024, 4096) there is a catastrophic cancellation
- A rounding error in the lowest bits was enough
- "True story" (tm)

a = 31.006276686241002
b = 31.006276685299816

- a and b are close to each other
- a and b have some garbage in the lower bits
- After the subtraction, the "good bits" cancel each other out: 31.006276686241002 - 31.006276685299816 = 0.000000000941186
- The garbage gets promoted and claims to be a meaningful computation result: 10000000000 * 0.000000000941186 = 9.41186

```
import math

def dot(a, b):
    """ Dot product of two 3-dimensional vectors """
    return a[0]*b[0] + a[1]*b[1] + a[2]*b[2]

def normalize(v):
    """ Normalizes a 3-dimensional vector """
    length = math.sqrt(dot(v, v))
    return (v[0]/length, v[1]/length, v[2]/length)

def angle(a, b):
    """ Finds the angle between two 3-dimensional vectors """
    anorm = normalize(a)
    bnorm = normalize(b)
    return math.acos(dot(anorm, bnorm))
```

```
>>> from dot import *
>>> v1 = (16, 16, 16)
>>> v2 = (32, 32, 32)
>>> v3 = (1, 2, 3)
>>> angle(v1, v3)
0.38759668665518016
>>> angle(v1, v2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "dot.py", line 16, in angle
    return math.acos(dot(anorm, bnorm))
ValueError: math domain error
>>> dot(normalize(v1), normalize(v2))
1.0000000000000002
```

- Arccosine is only defined on [-1, 1]
- We are trying to take the arccosine of 1.0000000000000002
- The result of accumulated rounding errors
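A common defensive fix (a hypothetical patch, not from the original code) is to clamp the dot product back into acos's domain before the call:

```python
import math

def safe_acos(x):
    """acos that tolerates inputs nudged outside [-1, 1] by rounding error."""
    return math.acos(max(-1.0, min(1.0, x)))

print(safe_acos(1.0000000000000002))  # 0.0 instead of ValueError
```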

- A bad condition number: small changes in input cause big changes in output
- Can happen e.g. when going "continuous" -> discrete via decimal rounding/truncation
- The problem: the number of inserts (pink dots) is derived from the ("constant") length via rounding
- In one case: 2.499999786 -> 2, in the other: 2.50000000 -> 3
- Needs a different algorithm/architecture
- "True story" (tm)

1.15 in float and 1.15 in double are not the same!!!

```
#include <stdio.h>

int main() {
    float a;
    double b;
    a = 1.15f;
    b = a;
    printf("1.15 in float: %g\n", a);
    printf("1.15 in double: %g\n", b);
    return 0;
}
```

Double has worse accuracy than float???..

- 1.15 in single-precision binary format:
- 1.15 in double-precision binary format:
- 1.15 converted from float to double:
- The extra digits were chopped off!
- The printing code does "the right thing" for the float, but the float-derived double has lost information

- The naive way that does not work:
- You don't just compare like that (0.1 + 0.2 != 0.3)
- Some programming languages (like ML) don't even allow it

- A "better" way (does not really work either):
- Which epsilon to choose?
- A better way: relative error
- Should also take NaNs into account, when applicable
- Still can be improved

- Can also compare as integers!
- Thanks to the way IEEE 754 representation is designed
- Machine epsilon: the distance between 1.0 and the next representable float number
- "Upper bound on the relative error"
- ULP (unit in the last place). Depends on the actual value.
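A relative-error comparison can be sketched like this (`almost_equal` is a made-up helper; Python's standard library ships an equivalent, math.isclose):

```python
import math

def almost_equal(a, b, rel_eps=1e-9, abs_eps=1e-12):
    """Compare floats by relative error, with an absolute floor near zero."""
    if math.isnan(a) or math.isnan(b):
        return False  # NaN compares equal to nothing
    diff = abs(a - b)
    return diff <= max(rel_eps * max(abs(a), abs(b)), abs_eps)

print(almost_equal(0.1 + 0.2, 0.3))      # True
print(almost_equal(0.1 + 0.2, 0.30001))  # False
print(math.isclose(0.1 + 0.2, 0.3))     # True -- the stdlib version
```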

- Sometimes a valid thing to do
- Would help with the carrot planting
- However: it might just postpone/hide the problem
- However: twice the memory usage
- However: worse SIMD utilization
- Compromise: use doubles selectively, in the critical places

- Anarchy Online by Funcom, released in 2001
- Massively Multiplayer Online Role-Playing Game (MMORPG)
- Client-server architecture; all simulation runs on the server
- Initially used a single-precision float for the server time
...which crashed the server after a couple of weeks of uptime

- Work with integers and scale when required
- Reality for embedded microprocessors (no FPU anyway)
- Most common binary and decimal scales (10^X or 2^X)

```
#include <stdio.h>

#define NUM_CARROTS 10000 /* the number of carrots to plant */

const long BOBWALK_WIDTH = 480;  /* the width of the bobwalk (0.01 feet) */
const long PLANT_INTERVAL = 125; /* the planting interval (0.01 feet) */
long carrot_pos[NUM_CARROTS];    /* carrot positions go here */

int main() {
    int i;
    /* Plant the carrots! */
    for (i = 0; i < NUM_CARROTS; i++) {
        carrot_pos[i] = BOBWALK_WIDTH + i*PLANT_INTERVAL;
        printf("Carrot %d goes to %ld.%02ld feet!\n",
               i, carrot_pos[i]/100, carrot_pos[i]%100);
    }
    return 0;
}
```

```
void test_carrots() {
    /* verify that carrots are planted evenly! */
    int i, num_bad = 0;
    for (i = 1; i < NUM_CARROTS; i++) {
        long d = carrot_pos[i] - carrot_pos[i - 1];
        if (d != PLANT_INTERVAL) {
            printf("The carrot %d is planted wrong!!!\n", i);
            num_bad++;
        }
    }
    if (num_bad == 0) printf("All carrots placed well!!!");
    else printf("%d of %d carrots placed incorrectly... :(",
                num_bad, NUM_CARROTS);
}
```

```
$ ./carrots
Carrot 0 goes to 4.80 feet!
Carrot 1 goes to 6.05 feet!
...
Carrot 3272 goes to 4094.80 feet!
Carrot 3273 goes to 4096.05 feet!
...
Carrot 9998 goes to 12502.30 feet!
Carrot 9999 goes to 12503.55 feet!
All carrots placed well!!!
```

- Represent numbers as fractions, e.g. "0.1" is "1/10"
- Some programming languages (Clojure, Haskell, Mathematica, Perl) support them natively
- Otherwise, use libraries, e.g. GMP (http://gmplib.org)
- Clojure example:

```
Clojure 1.3.0
user=> (def a 0.1)
user=> (def b 0.2)
user=> (def c 0.3)
user=> (= c (+ a b))
false
user=> (def ar (/ 1 10))
user=> (def br (/ 2 10))
user=> (def cr (/ 3 10))
user=> (= cr (+ ar br))
true
user=> [a b c]
[0.1 0.2 0.3]
user=> [ar br cr]
[1/10 1/5 3/10]
```

- Uses as many digits of precision "as needed"
- Theoretically limited only by the system memory
- Usually quite sophisticated internally
- Trades performance for correctness
- Some programming languages support it natively
- Java: java.math.BigDecimal
- Libraries: Boost Multiprecision Library, MPFR, MAPM, TTMath
- Also, some standalone software (Maple, Mathematica)
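Python ships decimal arithmetic in its standard library; a quick taste of the same idea:

```python
from decimal import Decimal, getcontext

# Construct from strings -- Decimal(0.1) would inherit the binary error
a, b, c = Decimal('0.1'), Decimal('0.2'), Decimal('0.3')
print(a + b == c)  # True

getcontext().prec = 50  # precision is configurable, "as needed"
print(Decimal(1) / Decimal(7))  # 50 significant digits of 1/7
```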

Example: x^2 + 200x - 10.0025 = 0

Real roots: -200.05, 0.05

Standard formula x = (-b ± √(b² − 4ac)) / (2a): can be inaccurate

Computed roots: -200.050003051758, 0.0500030517578125; error: 3.05e-6

An alternate formula, x = 2c / (-b ∓ √(b² − 4ac)), fights the catastrophic cancellation:

Computed roots: -200.050003051758, 0.0499999970197678; error: 3.7e-9
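The two formulas can be compared by emulating single precision with ctypes.c_float (a sketch of the idea, not the tooling the original numbers came from):

```python
import ctypes
import math

def f32(x):
    """Round a double to the nearest single-precision value."""
    return ctypes.c_float(x).value

# x^2 + 200x - 10.0025 = 0, true roots: -200.05 and 0.05
a, b, c = 1.0, 200.0, -10.0025
sq = f32(math.sqrt(f32(b * b - 4.0 * a * c)))

# Standard formula: -b + sq cancels catastrophically for the small root
naive = f32(f32(-b + sq) / f32(2.0 * a))

# Alternate formula 2c / (-b - sq): no cancellation
stable = f32(f32(2.0 * c) / f32(-b - sq))

print(naive, stable)  # the alternate root is orders of magnitude closer to 0.05
```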

```
/* Naive summation */
float naive_sum(float* val_arr, int nval) {
int i;
float acc = 0.0f;
for (i = 0; i < nval; i++) acc += val_arr[i];
return acc;
}
```

```
#define NUM_VAL 10000
int main() {
int i;
float val[NUM_VAL];
for (i = 0; i < NUM_VAL; i++) val[i] = ((float)i + 1);
printf("Naive sum: %f\n", naive_sum(val, NUM_VAL));
return 0;
}
```

`> Naive sum: 50002896.000000`

Naive summation is way off (the correct sum of 1..10000 is 50005000):

```
/* Kahan summation */
float kahan_sum(float* val_arr, int nval) {
int i;
float acc = 0.0f, y, t;
float corr = 0.0f; /* running corrective value for rounding error*/
for (i = 0; i < nval; i++) {
y = val_arr[i] - corr; /* add the correction to the item */
t = acc + y; /* increase the sum, bits may be lost */
corr = (t - acc) - y; /* recover the lost bits*/
acc = t;
}
return acc;
}
```

`> Kahan sum: 50005000.000000`
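Python's math.fsum implements an exact (Shewchuk) compensated summation; even in double precision the effect is easy to see:

```python
import math

vals = [0.1] * 10
print(sum(vals))        # 0.9999999999999999 -- plain left-to-right summation
print(math.fsum(vals))  # 1.0 -- compensated summation recovers the lost bits
```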

- In general, there is no guarantee that the same code will give the same results every time
- Things that might affect it: hardware, OS, compiler, compilation flags, builds, 3rd-party library behaviour
- Worse with multithreading
- Worse with x87
- Rounding/exception modes are a global per-thread setting
- But it gets better with x64!

- Cossacks: European Wars by GSC Game World (2000)
- Real-time strategy game (with online multiplayer)
- Ran on Win32 PCs with different hardware
- Had problems keeping the simulations in sync
...so they rewrote all the simulation in fixed point!

- A comparison of algorithm output matrices
- The same C++ code, compiled with different compilers (GCC 4.7.1 and Intel Compiler)
- The scale is 1e-6: very close, but still not exactly the same
...there can be other factors as well.

- IEEE 754 was introduced in 1985, quite a long time ago
- Hardware capabilities and problems have changed quite a bit since
- IEEE 754-2008 is an upgrade of IEEE 754 (backwards compatible!)
- Resolved minor ambiguities
- Extended with "half precision" (16-bit float) and "quad precision" (128-bit float)
- Arbitrary precision formats
- Exception handling recommendations for programming language implementers
- Fused multiply-add operation (A*B + C)
- Added transcendental functions

- or, "how to mess with people who've learned to *expect* rounding errors in floating-point math"
...not everything is explained by "rounding errors"!

\[\begin{aligned} e^\pi - \pi \approx 20 \\ \pi^2 \approx \frac {227} {23} , \pi^3 \approx 31 \end{aligned} \]