Floating Point Pitfalls
by Eric Davies
Visually Surprising Examples
The Classic
User: does 0.1 plus 0.2 equal 0.3?
Computer: nope
User: what's the difference?
Computer: 5.551115123125783e-17
User: holy **** why
The Nick Special
User: truncate 2.51 to two decimal places please
Computer: okay, here you go: 2.50
User: what the **** is wrong with you?
IEEE 754
aka the IEEE Standard for Floating-Point Arithmetic
aka "computer math"
- Originally published in 1985, updated in 2008
- Implemented on every computer you use
- Defines formats, rules/properties, arithmetic operations AND MORE!
IEEE 754
- Represent a number in memory
- Determine what number a block of memory represents
- Define arithmetic and ordering for those numbers
- Most common is binary64 (Python's float, C's double, Julia's Float64)
Formats
- Don't need to define arithmetic
- Intended only for transport or storage of small numbers
- binary16 (Julia's Float16) is one of these
Interchange Formats
IEEE binary64
Introduction and Examples
- No way to store 1/10 in binary64
- 2.51 is actually recorded as 2.5099999999999997868371792719699442386627197265625 which is the closest binary number
- We think and write in decimal (base 10) but the computer thinks in binary (base 2) and shows us decimal
- Sometimes programs show us a rounded version to be nice to our eyes and brains
No Such Thing As 0.1
Julia
julia> 0.1 + 0.2
0.30000000000000004
Python
>>> 0.1 + 0.2
0.30000000000000004
R
> 0.1 + 0.2
[1] 0.3
> format(0.1 + 0.2, nsmall = 17)
[1] "0.30000000000000004"
Octave/MATLAB
>>> 0.1 + 0.2
ans = 0.30000
>>> format long
>>> 0.1 + 0.2
ans = 0.300000000000000
>>> output_precision
ans = 15
>>> output_precision(17)
>>> 0.1 + 0.2
ans = 0.30000000000000004
IEEE binary64
Representation
- Sign s
- Exponent x
- Coefficient (significand) c
- Base is 2 (could be different, more later)
Bit Layout
n = (-1)^s * c * 2^x
n=(−1)s∗c∗2x
- Exponent is offset with a bias (1023 for binary64)
- Coefficient represents the fractional part of a decimal number as an integer, where the whole part is always 1
- All binary64 values have one standard representation (smallest exponent)
- There are -0.0 and +0.0
- Special values:
- -Inf and Inf
- infinities
- NaN
- Not a Number
- has optional extra info that no one uses
- -Inf and Inf
Values
IEEE binary64
Properties
- Addition and multiplication are commutative
- 0.1 + 0.2 == 0.2 + 0.1
- but not associative
- (-1e20 + 1e20) + 3.14 == 3.14
- -1e20 + (1e20 + 3.14) == 0.0
- Multiplication doesn't distribute over addition for the same reasons
- Difficult for compilers to optimize operations, so some have a "fast math" option that sacrifices accuracy for speed
Arithmetic
- One consequence of these properties is that reordering operations can produce different levels of error
- Difficult for compilers to optimize operations without reordering, so some have a "fast math" option that sacrifices accuracy for speed
Arithmetic
- Fixed number of bits ⟹ some numbers can't be expressed exactly
- Thus there is some (quantifiable) error inherent in calculations using this format
- Error is measured in multiples of "machine epsilon", which is determined by:
- the number of bits in use
- the base
- the rounding method used to convert higher-precision calculation results to the storage precision
Error
- Absolute error depends on the value of operands
- Difference in magnitude increases error
- 1e20 + 3.14 == 0
- This is one of the reasons it's good to normalize and scale inputs to neural nets, and why layers have activation functions which scale the output to a particular range
Error
- In modern computers, arithmetic operations are performed using higher precision, then reduced to storage precision (transparent)
- Special machine instructions like "fma" (fused multiply-add) that combine several operations in one
- Other functions like log1p and sinpi that use IEEE properties to implement more accurate and efficient versions of combinations of operations
Avoiding Error
- Method 1: Compare to result of BigFloat calculations
- absolute error
- relative error
- Method 2: calculate the bounds due to known rounding error
- IEEE 1788
- ValidatedNumerics.jl
- interval arithmetic
- IEEE 1788
- Method 3: calculate error analytically
Evaluating Error
- interval arithmetic
- given two intervals A and B and a function taking two values
- applying the function to the intervals generates an interval C which contains the results of the function applied to any pair of values selected from A and B
- IEEE 1788: interval arithmetic for IEEE 754
Evaluating Error
Validated Numerics
- Julia and numpy both use "pairwise summation" instead of a simple iterative sum, which has less error but is just as fast
- Another method, Kahan summation, keeps track of small errors and compensates at the end
- Read more below
Case Study: Summation
David Goldberg. "What Every Computer Scientist Should Know About Floating-Point Arithmetic,'' ACM Computing Surveys, Vol. 23, #1, March 1991, pp. 5-48.
https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html
Alternatives
to binary floating point
- You can have trunc(2.51, 2) == 2.51!
- decimal64 is fixed-precision floating point decimal
- You can have:
- arbitrary precision (set to any value) or "configurable" precision (pick from a small set of values) and rarely infinite precision
- fixed point
- decimal
- in many combinations
- Speed and memory tradeoffs
Alternatives
- generic number code is first class in Julia
- easy to write a new number type
- easy to make it fast
- combine all sorts of types
- many packages for many number types
- C library wrappers
- pure Julia
Alternatives in Julia
Alternatives in Julia
Package | Floating Point | Precision | Base |
---|---|---|---|
FixedPointDecimals.jl | no | configurable | 10 |
DecFP.jl (IEEE 754) | yes | configurable | 10 |
FixedPointNumbers.jl | no | configurable | 2 |
ArbFloats.jl | yes | arbitrary | 2 |
BigFloats (Base) | yes | arbitrary | 2 |
- BigInt is an infinite-precision integer
- Rational{BigInt} can represent any rational number (limited by memory)
- Use this to get exact answers if you're curious
Bonus: Rational{BigInt}
Number Types at Invenia
- Pick a type which matches what you are trying to represent
- represent integers with integer types
- represent decimal numbers (e.g., dollar values) with decimal types
- consider the range and precision of possible values
- Test Float64s for "close enough" instead of equality, and don't expect them to be exactly equal to decimal literal inputs
General Guidelines
- We use this for bid prices during financial calculations
- Represents a number using an integer value, and the number of decimal places stored in the type
FixedPointDecimals.jl
x \times 10^p
x×10p
- We can use integers for a lot of operations where we used to use floating point numbers in MATLAB
- Compilers like integers
- Some operations are much faster (e.g., squaring)
Regular Old Integers
Floating Point Pitfalls
By Eric Davies
Floating Point Pitfalls
- 876