Floating Point Pitfalls

by Eric Davies

Visually Surprising Examples

The Classic

User: does 0.1 plus 0.2 equal 0.3?
Computer: nope
User: what's the difference?
Computer: 5.551115123125783e-17
User: holy **** why

The Nick Special

User: truncate 2.51 to two decimal places please
Computer: okay, here you go: 2.50
User: what the **** is wrong with you?

IEEE 754

aka the IEEE Standard for Floating-Point Arithmetic

aka "computer math"

  • Originally published in 1985, updated in 2008
  • Implemented on every computer you use
  • Defines formats, rules/properties, arithmetic operations AND MORE!

IEEE 754

  • Represent a number in memory
  • Determine what number a block of memory represents
  • Define arithmetic and ordering for those numbers
  • Most common is binary64 (Python's float, C's double, Julia's Float64)

Formats

  • Don't need to define arithmetic
  • Intended only for transport or storage of small numbers
  • binary16 (Julia's Float16) is one of these

Interchange Formats

IEEE binary64

Introduction and Examples

  • No way to store 1/10 in binary64
  • 2.51 is actually recorded as 2.5099999999999997868371792719699442386627197265625 which is the closest binary number
  • We think and write in decimal (base 10) but the computer thinks in binary (base 2) and shows us decimal
  • Sometimes programs show us a rounded version to be nice to our eyes and brains

No Such Thing As 0.1

Julia

julia> 0.1 + 0.2
0.30000000000000004

Python

>>> 0.1 + 0.2
0.30000000000000004

R

> 0.1 + 0.2
[1] 0.3
> format(0.1 + 0.2, nsmall = 17)
[1] "0.30000000000000004"

Octave/MATLAB

>>> 0.1 + 0.2
ans =  0.30000
>>> format long
>>> 0.1 + 0.2
ans =  0.300000000000000
>>> output_precision
ans =  15
>>> output_precision(17)
>>> 0.1 + 0.2
ans =  0.30000000000000004

IEEE binary64

Representation

  • Sign s
  • Exponent x
  • Coefficient (significand) c
  • Base is 2 (could be different, more later)

Bit Layout

n = (-1)^s * c * 2^x
n=(1)sc2xn = (-1)^s * c * 2^x
  • Exponent is offset with a bias (1023 for binary64)
  • Coefficient represents the fractional part of a decimal number as an integer, where the whole part is always 1
  • All binary64 values have one standard representation (smallest exponent)
  • There are -0.0 and +0.0
  • Special values:
    • -Inf and Inf
      • infinities
    • NaN
      • Not a Number
      • has optional extra info that no one uses

Values

IEEE binary64

Properties

  • Addition and multiplication are commutative
    • 0.1 + 0.2 == 0.2 + 0.1
  • but not associative
    • (-1e20 + 1e20) + 3.14 == 3.14
    • -1e20 + (1e20 + 3.14) == 0.0
  • Multiplication doesn't distribute over addition for the same reasons
  • Difficult for compilers to optimize operations, so some have a "fast math" option that sacrifices accuracy for speed

Arithmetic

  • One consequence of these properties is that reordering operations can produce different levels of error
  • Difficult for compilers to optimize operations without reordering, so some have a "fast math" option that sacrifices accuracy for speed

Arithmetic

  • Fixed number of bits ⟹ some numbers can't be expressed exactly
  • Thus there is some (quantifiable) error inherent in calculations using this format
  • Error is measured in multiples of "machine epsilon", which is determined by:
    • the number of bits in use
    • the base
    • the rounding method used to convert higher-precision calculation results to the storage precision

Error

  • Absolute error depends on the value of operands
  • Difference in magnitude increases error
  • 1e20 + 3.14 == 0
  • This is one of the reasons it's good to normalize and scale inputs to neural nets, and why layers have activation functions which scale the output to a particular range

Error

  • In modern computers, arithmetic operations are performed using higher precision, then reduced to storage precision (transparent)
  • Special machine instructions like "fma" (fused multiply-add) that combine several operations in one
  • Other functions like log1p and sinpi that use IEEE properties to implement more accurate and efficient versions of combinations of operations

Avoiding Error

  • Method 1: Compare to result of BigFloat calculations
    • absolute error
    • relative error
  • Method 2: calculate the bounds due to known rounding error
    • IEEE 1788
      • ValidatedNumerics.jl
      • interval arithmetic
  • Method 3: calculate error analytically

Evaluating Error

  • interval arithmetic
    • given two intervals A and B and a function taking two values
    • applying the function to the intervals generates an interval C which contains the results of the function applied to any pair of values selected from A and B
  • IEEE 1788: interval arithmetic for IEEE 754

Evaluating Error

Validated Numerics

  • Julia and numpy both use "pairwise summation" instead of a simple iterative sum, which has less error but is just as fast
  • Another method, Kahan summation, keeps track of small errors and compensates at the end
  • Read more below

Case Study: Summation

David Goldberg. "What Every Computer Scientist Should Know About Floating-Point Arithmetic,'' ACM Computing Surveys, Vol. 23, #1, March 1991, pp. 5-48.

 

https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html

Alternatives

to binary floating point

  • You can have trunc(2.51, 2) == 2.51!
  • decimal64 is fixed-precision floating point decimal
  • You can have:
    • arbitrary precision (set to any value) or "configurable" precision (pick from a small set of values) and rarely infinite precision
    • fixed point
    • decimal
    • in many combinations
  • Speed and memory tradeoffs

Alternatives

  • generic number code is first class in Julia
    • easy to write a new number type
    • easy to make it fast
    • combine all sorts of types
  • many packages for many number types
    • C library wrappers
    • pure Julia

Alternatives in Julia

Alternatives in Julia

Package Floating Point Precision Base
FixedPointDecimals.jl no configurable 10
DecFP.jl (IEEE 754) yes configurable 10
FixedPointNumbers.jl no configurable 2
ArbFloats.jl yes arbitrary 2
BigFloats (Base) yes arbitrary 2
  • BigInt is an infinite-precision integer
  • Rational{BigInt} can represent any rational number (limited by memory)
  • Use this to get exact answers if you're curious

Bonus: Rational{BigInt}

Number Types at Invenia

  • Pick a type which matches what you are trying to represent
    • represent integers with integer types
    • represent decimal numbers (e.g., dollar values) with decimal types
    • consider the range and precision of possible values
  • Test Float64s for "close enough" instead of equality, and don't expect them to be exactly equal to decimal literal inputs

General Guidelines

  • We use this for bid prices during financial calculations
  • Represents a number using an integer value, and the number of decimal places stored in the type

FixedPointDecimals.jl

x \times 10^p
x×10px \times 10^p
  • We can use integers for a lot of operations where we used to use floating point numbers in MATLAB
  • Compilers like integers
  • Some operations are much faster (e.g., squaring)

Regular Old Integers