Floating Point Pitfalls

by Eric Davies

Visually Surprising Examples

The Classic

User: does 0.1 plus 0.2 equal 0.3?

Computer: nope

User: what's the difference?

Computer: 5.551115123125783e-17

User: holy **** why

The Nick Special

User: truncate 2.51 to two decimal places please

Computer: okay, here you go: 2.50

User: what the **** is wrong with you?

IEEE 754

aka the IEEE Standard for Floating-Point Arithmetic

aka "computer math"

Originally published in 1985, updated in 2008
Implemented on every computer you use
Defines formats, rules/properties, arithmetic operations AND MORE!

IEEE 754

Represent a number in memory
Determine what number a block of memory represents
Define arithmetic and ordering for those numbers
Most common is binary64 (Python's float, C's double, Julia's Float64)

Formats

Don't need to define arithmetic
Intended only for transport or storage of small numbers
binary16 (Julia's Float16) is one of these

Interchange Formats

IEEE binary64

Introduction and Examples

No way to store 1/10 in binary64
2.51 is actually recorded as 2.5099999999999997868371792719699442386627197265625 which is the closest binary number
We think and write in decimal (base 10) but the computer thinks in binary (base 2) and shows us decimal
Sometimes programs show us a rounded version to be nice to our eyes and brains

No Such Thing As 0.1

Julia

julia> 0.1 + 0.2
0.30000000000000004

Python

>>> 0.1 + 0.2
0.30000000000000004

R

> 0.1 + 0.2
[1] 0.3
> format(0.1 + 0.2, nsmall = 17)
[1] "0.30000000000000004"

Octave/MATLAB

>>> 0.1 + 0.2
ans =  0.30000
>>> format long
>>> 0.1 + 0.2
ans =  0.300000000000000
>>> output_precision
ans =  15
>>> output_precision(17)
>>> 0.1 + 0.2
ans =  0.30000000000000004

IEEE binary64

Representation

Sign s
Exponent x
Coefficient (significand) c
Base is 2 (could be different, more later)

Bit Layout

n = (-1)^s * c * 2^x

n = (-1)^s * c * 2^x

Exponent is offset with a bias (1023 for binary64)
Coefficient represents the fractional part of a decimal number as an integer, where the whole part is always 1

All binary64 values have one standard representation (smallest exponent)
There are -0.0 and +0.0
Special values:
- -Inf and Inf
  - infinities
- NaN
  - Not a Number
  - has optional extra info that no one uses

Values

IEEE binary64

Properties

Addition and multiplication are commutative
- 0.1 + 0.2 == 0.2 + 0.1
but not associative
- (-1e20 + 1e20) + 3.14 == 3.14
- -1e20 + (1e20 + 3.14) == 0.0
Multiplication doesn't distribute over addition for the same reasons
Difficult for compilers to optimize operations, so some have a "fast math" option that sacrifices accuracy for speed

Arithmetic

One consequence of these properties is that reordering operations can produce different levels of error
Difficult for compilers to optimize operations without reordering, so some have a "fast math" option that sacrifices accuracy for speed

Arithmetic

Fixed number of bits ⟹ some numbers can't be expressed exactly
Thus there is some (quantifiable) error inherent in calculations using this format
Error is measured in multiples of "machine epsilon", which is determined by:
- the number of bits in use
- the base
- the rounding method used to convert higher-precision calculation results to the storage precision

Error

Absolute error depends on the value of operands
Difference in magnitude increases error
1e20 + 3.14 == 0
This is one of the reasons it's good to normalize and scale inputs to neural nets, and why layers have activation functions which scale the output to a particular range

Error

In modern computers, arithmetic operations are performed using higher precision, then reduced to storage precision (transparent)
Special machine instructions like "fma" (fused multiply-add) that combine several operations in one
Other functions like log1p and sinpi that use IEEE properties to implement more accurate and efficient versions of combinations of operations

Avoiding Error

Method 1: Compare to result of BigFloat calculations
- absolute error
- relative error
Method 2: calculate the bounds due to known rounding error
- IEEE 1788
  - ValidatedNumerics.jl
  - interval arithmetic
Method 3: calculate error analytically

Evaluating Error

interval arithmetic
- given two intervals A and B and a function taking two values
- applying the function to the intervals generates an interval C which contains the results of the function applied to any pair of values selected from A and B
IEEE 1788: interval arithmetic for IEEE 754

Evaluating Error

Validated Numerics

Julia and numpy both use "pairwise summation" instead of a simple iterative sum, which has less error but is just as fast
Another method, Kahan summation, keeps track of small errors and compensates at the end
Read more below

Case Study: Summation

David Goldberg. "What Every Computer Scientist Should Know About Floating-Point Arithmetic,'' ACM Computing Surveys, Vol. 23, #1, March 1991, pp. 5-48.

https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html

Alternatives

to binary floating point

You can have trunc(2.51, 2) == 2.51!
decimal64 is fixed-precision floating point decimal
You can have:
- arbitrary precision (set to any value) or "configurable" precision (pick from a small set of values) and rarely infinite precision
- fixed point
- decimal
- in many combinations
Speed and memory tradeoffs

Alternatives

generic number code is first class in Julia
- easy to write a new number type
- easy to make it fast
- combine all sorts of types
many packages for many number types
- C library wrappers
- pure Julia

Alternatives in Julia

Package	Floating Point	Precision	Base
FixedPointDecimals.jl	no	configurable	10
DecFP.jl (IEEE 754)	yes	configurable	10
FixedPointNumbers.jl	no	configurable	2
ArbFloats.jl	yes	arbitrary	2
BigFloats (Base)	yes	arbitrary	2

BigInt is an infinite-precision integer
Rational{BigInt} can represent any rational number (limited by memory)
Use this to get exact answers if you're curious

Bonus: Rational{BigInt}

Number Types at Invenia

Pick a type which matches what you are trying to represent
- represent integers with integer types
- represent decimal numbers (e.g., dollar values) with decimal types
- consider the range and precision of possible values
Test Float64s for "close enough" instead of equality, and don't expect them to be exactly equal to decimal literal inputs

General Guidelines

We use this for bid prices during financial calculations
Represents a number using an integer value, and the number of decimal places stored in the type

FixedPointDecimals.jl

x \times 10^p

x \times 10^p

We can use integers for a lot of operations where we used to use floating point numbers in MATLAB
Compilers like integers
Some operations are much faster (e.g., squaring)

Floating Point Pitfalls

by Eric Davies

Visually Surprising Examples

The Classic

The Nick Special

IEEE 754

aka the IEEE Standard for Floating-Point Arithmetic

IEEE 754

Formats

Interchange Formats

IEEE binary64

Introduction and Examples

No Such Thing As 0.1

Julia

Python

R

Octave/MATLAB

IEEE binary64

Representation

Bit Layout

Values

IEEE binary64

Properties

Arithmetic

Arithmetic

Error

Error

Avoiding Error

Evaluating Error

Evaluating Error

Validated Numerics

Case Study: Summation

Alternatives

to binary floating point

Alternatives

Alternatives in Julia

Alternatives in Julia

Bonus: Rational{BigInt}

Number Types at Invenia

General Guidelines

FixedPointDecimals.jl

Regular Old Integers

Floating Point Pitfalls

More from Eric Davies