# Floating Point Pitfalls

## by Eric Davies

# Visually Surprising Examples

## The Classic

`User: does 0.1 plus 0.2 equal 0.3?`

`Computer: nope`

`User: what's the difference?`

`Computer: 5.551115123125783e-17`

`User: holy **** why`
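The dialogue above is easy to reproduce; for example, in Python:

```python
# 0.1, 0.2, and 0.3 are each rounded to the nearest binary64 value,
# and the rounding errors don't cancel out
print(0.1 + 0.2 == 0.3)   # False
print((0.1 + 0.2) - 0.3)  # 5.551115123125783e-17
```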

## The Nick Special

`User: truncate 2.51 to two decimal places please`

`Computer: okay, here you go: 2.50`

`User: what the **** is wrong with you?`
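A naive truncation routine reproduces this. The function below is an illustrative stand-in (the code behind the original anecdote isn't shown), sketched in Python:

```python
import math

def trunc2(x):
    # Hypothetical naive truncation to two decimal places
    return math.trunc(x * 100) / 100

# 2.51 is stored as 2.5099999...; multiplying by 100 gives
# 250.99999999999997, which truncates down to 250
print(trunc2(2.51))  # 2.5
```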

# IEEE 754

### aka the IEEE Standard for Floating-Point Arithmetic

aka "computer math"

- Originally published in 1985; revised in 2008 and again in 2019
- Implemented on every computer you use
- Defines formats, rules/properties, arithmetic operations AND MORE!

## Formats

- Represent a number in memory
- Determine what number a block of memory represents
- Define arithmetic and ordering for those numbers
- The most common is binary64 (Python's float, C's double, Julia's Float64)

## Interchange Formats

- Don't need to define arithmetic
- Intended only for transport or storage of numbers in a compact form
- binary16 (Julia's Float16) is one of these

# IEEE binary64

## Introduction and Examples

## No Such Thing As 0.1

- There is no way to store 1/10 exactly in binary64
- 2.51 is actually recorded as 2.5099999999999997868371792719699442386627197265625, the closest representable binary value
- We think and write in decimal (base 10), but the computer computes in binary (base 2) and shows us decimal
- Sometimes programs show us a rounded version to be nice to our eyes and brains
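Python's decimal module can print the exact value a binary64 stores, which makes both surprises visible:

```python
from decimal import Decimal

# Decimal(float) converts the stored binary64 value exactly, without rounding
print(Decimal(0.1))
# 0.1000000000000000055511151231257827021181583404541015625
print(Decimal(2.51))
# 2.5099999999999997868371792719699442386627197265625
```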

## Julia

```
julia> 0.1 + 0.2
0.30000000000000004
```

## Python

```
>>> 0.1 + 0.2
0.30000000000000004
```

## R

```
> 0.1 + 0.2
[1] 0.3
> format(0.1 + 0.2, nsmall = 17)
[1] "0.30000000000000004"
```

## Octave/MATLAB

```
>> 0.1 + 0.2
ans = 0.30000
>> format long
>> 0.1 + 0.2
ans = 0.300000000000000
>> output_precision
ans = 15
>> output_precision(17)
>> 0.1 + 0.2
ans = 0.30000000000000004
```

# IEEE binary64

## Representation

- Sign s
- Exponent x
- Coefficient (significand) c
- Base is 2 (could be different, more later)

## Bit Layout

$n = (-1)^s \times c \times 2^x$

- Exponent is stored with a bias (1023 for binary64)
- The stored coefficient field holds the fractional bits of the significand; for normal numbers the integer part is an implicit leading 1
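This layout can be inspected directly. A small Python sketch (using binary64's field widths) that unpacks the three fields:

```python
import struct

def decompose(x):
    # Reinterpret the 8 bytes of a binary64 as an unsigned integer
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                  # 1 bit
    exponent = (bits >> 52) & 0x7FF    # 11 bits, biased by 1023
    fraction = bits & ((1 << 52) - 1)  # 52 bits, implicit leading 1
    return sign, exponent - 1023, fraction

print(decompose(1.0))   # (0, 0, 0): (-1)^0 * 1.0 * 2^0
print(decompose(-2.0))  # (1, 1, 0): (-1)^1 * 1.0 * 2^1
```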

## Values

- Every binary64 value has one standard representation (the one with the smallest exponent)
- There are two zeros, -0.0 and +0.0 (they compare equal)
- Special values:
  - -Inf and Inf: the infinities
  - NaN ("Not a Number"), which carries optional payload bits that almost no one uses

# IEEE binary64

## Properties

- Addition and multiplication are commutative
  - 0.1 + 0.2 == 0.2 + 0.1
- but not associative
  - (-1e20 + 1e20) + 3.14 == 3.14
  - -1e20 + (1e20 + 3.14) == 0.0
- Multiplication doesn't distribute over addition for the same reasons

## Arithmetic

- One consequence of these properties is that reordering operations can produce different levels of error
- Difficult for compilers to optimize operations without reordering, so some have a "fast math" option that sacrifices accuracy for speed
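The associativity failure runs exactly as described; for example, in Python:

```python
a, b, c = -1e20, 1e20, 3.14

# Grouping changes the result: in the second form, 3.14 is absorbed
# when added to 1e20, because it is far below one ulp of 1e20
print((a + b) + c)  # 3.14
print(a + (b + c))  # 0.0
```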

## Error

- Fixed number of bits ⟹ some numbers can't be expressed exactly
- Thus there is some (quantifiable) error inherent in calculations using this format
- Error is measured in multiples of "machine epsilon", which is determined by:
  - the number of bits in use
  - the base
  - the rounding method used to convert higher-precision calculation results to the storage precision
- Absolute error depends on the magnitude of the operands
  - a large difference in magnitude increases error: 1e20 + 3.14 == 1e20
- This is one of the reasons it's good to normalize and scale inputs to neural nets, and why layers have activation functions which scale the output to a particular range
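Machine epsilon for binary64 is exposed directly in most languages; in Python:

```python
import sys

eps = sys.float_info.epsilon  # 2**-52 for binary64
print(eps)                    # 2.220446049250313e-16
print(1.0 + eps > 1.0)        # True: eps is the gap to the next float above 1.0
print(1.0 + eps / 2 == 1.0)   # True: the result rounds back down to 1.0
```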

## Avoiding Error

- In modern computers, arithmetic operations are performed using higher precision internally, then rounded to the storage precision (transparently)
- Special machine instructions like "fma" (fused multiply-add) combine several operations with a single rounding step
- Other functions like log1p and sinpi use IEEE properties to implement more accurate and efficient versions of common combinations of operations
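Python's math module has the same log1p; a quick comparison of the naive and accurate forms:

```python
import math

x = 1e-10
# log(1 + x): forming 1 + x rounds first, losing most of x's digits
naive = math.log(1 + x)
# log1p avoids forming 1 + x explicitly
accurate = math.log1p(x)
print(naive, accurate)
print(naive != accurate)  # True: the naive version is measurably off
```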

## Evaluating Error

- Method 1: compare to the result of BigFloat calculations
  - absolute error
  - relative error
- Method 2: calculate the bounds due to known rounding error
  - interval arithmetic (IEEE 1788, ValidatedNumerics.jl)
- Method 3: calculate error analytically

## Evaluating Error: Validated Numerics

- Interval arithmetic:
  - given two intervals A and B and a function taking two values,
  - applying the function to the intervals generates an interval C which contains the result of the function applied to any pair of values selected from A and B
- IEEE 1788: interval arithmetic for IEEE 754
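A toy sketch of the idea in Python. Real interval libraries use directed rounding; widening each endpoint by one ulp with math.nextafter is only a stand-in for that:

```python
import math

def interval_add(a, b):
    # Add interval endpoints, then round outward so the true
    # result is guaranteed to lie inside the returned interval
    lo = math.nextafter(a[0] + b[0], -math.inf)
    hi = math.nextafter(a[1] + b[1], math.inf)
    return (lo, hi)

lo, hi = interval_add((0.1, 0.1), (0.2, 0.2))
print(lo <= 0.3 <= hi)  # True: the enclosure contains 0.3
```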

## Case Study: Summation

- Julia and NumPy both use "pairwise summation" instead of a simple iterative sum; it has less error with comparable speed
- Another method, Kahan summation, keeps track of small rounding errors and compensates for them
- Read more below
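Kahan (compensated) summation can be sketched as:

```python
def kahan_sum(xs):
    total = 0.0
    comp = 0.0  # running compensation for lost low-order bits
    for x in xs:
        y = x - comp            # subtract previously lost bits
        t = total + y           # big + small: low bits of y may be lost
        comp = (t - total) - y  # recover what was lost this step
        total = t
    return total

xs = [0.1] * 10
print(sum(xs))        # 0.9999999999999999
print(kahan_sum(xs))
```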

David Goldberg. "What Every Computer Scientist Should Know About Floating-Point Arithmetic,'' ACM Computing Surveys, Vol. 23, #1, March 1991, pp. 5-48.

https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html

# Alternatives

## to binary floating point

## Alternatives

- You can have trunc(2.51, 2) == 2.51!
- decimal64 is a fixed-precision decimal floating point format
- You can have:
  - arbitrary precision (set to any value), "configurable" precision (pick from a small set of values), and occasionally infinite precision
  - fixed point
  - decimal
  - in many combinations
- Each choice comes with speed and memory tradeoffs

## Alternatives in Julia

- Generic number code is first class in Julia
  - easy to write a new number type
  - easy to make it fast
  - easy to combine all sorts of types
- Many packages offer many number types
  - C library wrappers
  - pure Julia implementations

| Package | Floating Point | Precision | Base |
|---|---|---|---|
| FixedPointDecimals.jl | no | configurable | 10 |
| DecFP.jl (IEEE 754) | yes | configurable | 10 |
| FixedPointNumbers.jl | no | configurable | 2 |
| ArbFloats.jl | yes | arbitrary | 2 |
| BigFloat (Base) | yes | arbitrary | 2 |

## Bonus: Rational{BigInt}

- BigInt is an arbitrary-precision integer
- Rational{BigInt} can represent any rational number (limited only by available memory)
- Use it to get exact answers if you're curious
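Python's fractions.Fraction plays the same role as Rational{BigInt}: exact rational arithmetic over arbitrary-precision integers:

```python
from fractions import Fraction

# Exact: 1/10 + 2/10 really is 3/10
print(Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10))  # True

# Constructing from a float exposes the exact stored binary64 value
print(Fraction(0.1))  # Fraction(3602879701896397, 36028797018963968)
```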

# Number Types at Invenia

## General Guidelines

- Pick a type which matches what you are trying to represent
  - represent integers with integer types
  - represent decimal numbers (e.g., dollar values) with decimal types
  - consider the range and precision of possible values
- Test Float64s for "close enough" instead of equality, and don't expect them to be exactly equal to decimal literal inputs
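For example, in Python:

```python
import math

total = 0.1 + 0.2

# Don't compare floats to decimal literals with ==
print(total == 0.3)              # False
# Do test for "close enough" instead
print(math.isclose(total, 0.3))  # True (default rel_tol is 1e-09)
```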

## FixedPointDecimals.jl

- We use this for bid prices during financial calculations
- Represents a number using an integer value $x$ and a number of decimal places $p$ stored in the type:

$x \times 10^{-p}$
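Python's decimal module offers a similar guarantee in spirit to FixedPointDecimals.jl: base-10 quantities stay exact, so money math behaves as expected:

```python
from decimal import Decimal

# Construct from strings so the base-10 values are stored exactly
price = Decimal("2.51")
print(price - Decimal("0.01"))  # 2.50, exactly
print(Decimal("0.10") + Decimal("0.20") == Decimal("0.30"))  # True
```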

## Regular Old Integers

- We can use integers for a lot of operations where we used to use floating point numbers in MATLAB
- Compilers like integers
- Some operations are much faster (e.g., squaring)

#### Floating Point Pitfalls

By Eric Davies
