# Visually Surprising Examples

## The Classic

User: does 0.1 plus 0.2 equal 0.3?
Computer: nope
User: what's the difference?
Computer: 5.551115123125783e-17
User: holy **** why

## The Nick Special

User: truncate 2.51 to two decimal places please
Computer: okay, here you go: 2.50
User: what the **** is wrong with you?

# IEEE 754

### aka the IEEE Standard for Floating-Point Arithmetic

aka "computer math"

• Originally published in 1985, updated in 2008
• Implemented on every computer you use
• Defines formats, rules/properties, arithmetic operations AND MORE!

## IEEE 754

• Represent a number in memory
• Determine what number a block of memory represents
• Define arithmetic and ordering for those numbers
• Most common is binary64 (Python's float, C's double, Julia's Float64)

## Formats

• Don't need to define arithmetic
• Intended only for compact transport or storage of numbers, not computation
• binary16 (Julia's Float16) is one of these

# IEEE binary64

## Introduction and Examples

• No way to store 1/10 exactly in binary64
• 2.51 is actually recorded as 2.5099999999999997868371792719699442386627197265625, the closest representable binary value
• We think and write in decimal (base 10), but the computer thinks in binary (base 2) and shows us decimal
• Sometimes programs show us a rounded version to be nice to our eyes and brains
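You can see the exact stored value by widening the literal to a BigFloat, for example in Julia (the digits below are the exact value; the display formatting can vary slightly by version):

julia> big(2.51)
2.5099999999999997868371792719699442386627197265625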

## Julia

julia> 0.1 + 0.2
0.30000000000000004

## Python

>>> 0.1 + 0.2
0.30000000000000004

## R

> 0.1 + 0.2
[1] 0.3
> format(0.1 + 0.2, nsmall = 17)
[1] "0.30000000000000004"

## Octave/MATLAB

>> 0.1 + 0.2
ans =  0.30000
>> format long
>> 0.1 + 0.2
ans =  0.300000000000000
>> output_precision
ans =  15
>> output_precision(17)
>> 0.1 + 0.2
ans =  0.30000000000000004

# IEEE binary64

## Representation

• Sign s
• Exponent x
• Coefficient (significand) c
• Base is 2 (could be different, more later)

## Bit Layout

$n = (-1)^s \times c \times 2^x$
• Exponent is offset with a bias (1023 for binary64)
• Coefficient stores the fractional part of the binary significand as an integer; the whole part is always (implicitly) 1 for normal numbers
• All binary64 values have one standard representation (smallest exponent)
• There are -0.0 and +0.0
• Special values:
  • -Inf and Inf
    • infinities
  • NaN
    • Not a Number
    • has optional extra info that no one uses
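You can poke at these pieces from Julia with a few Base functions:

julia> bitstring(1.0)   # 1 sign bit, 11 exponent bits (1023, the bias), 52 coefficient bits
"0011111111110000000000000000000000000000000000000000000000000000"

julia> sign(2.51), exponent(2.51), significand(2.51)   # 2.51 == +1.255 * 2^1
(1.0, 1, 1.255)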

# IEEE binary64

## Properties

• Addition and multiplication are commutative
  • 0.1 + 0.2 == 0.2 + 0.1
• but not associative
  • (-1e20 + 1e20) + 3.14 == 3.14
  • -1e20 + (1e20 + 3.14) == 0.0
• Multiplication doesn't distribute over addition for the same reasons
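The non-associativity above is easy to reproduce in the REPL:

julia> (-1e20 + 1e20) + 3.14
3.14

julia> -1e20 + (1e20 + 3.14)
0.0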

## Arithmetic

• One consequence of these properties is that reordering operations can produce different levels of error
• Difficult for compilers to optimize operations without reordering, so some have a "fast math" option that sacrifices accuracy for speed

## Arithmetic

• Fixed number of bits ⟹ some numbers can't be expressed exactly
• Thus there is some (quantifiable) error inherent in calculations using this format
• Error is measured in multiples of "machine epsilon", which is determined by:
  • the number of bits in use
  • the base
  • the rounding method used to convert higher-precision calculation results to the storage precision
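In Julia, eps gives machine epsilon for a type and the spacing between adjacent values at a given magnitude:

julia> eps(Float64)   # machine epsilon: spacing of Float64 values just above 1.0
2.220446049250313e-16

julia> eps(1e20)      # the spacing grows with the magnitude of the value
16384.0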

## Error

• Absolute error depends on the value of the operands
  • A large difference in magnitude between operands increases the error
  • 1e20 + 3.14 == 1e20
• This is one of the reasons it's good to normalize and scale inputs to neural nets, and why layers have activation functions which scale the output to a particular range
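For instance, 3.14 is far smaller than the spacing between Float64 values near 1e20 (about 1.6e4), so adding it changes nothing:

julia> 1e20 + 3.14 == 1e20
true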

## Error

• In modern computers, arithmetic operations are carried out internally at higher precision (or as if exactly) and then rounded to the storage precision; this is transparent to the user
• Special machine instructions like "fma" (fused multiply-add) combine several operations with a single rounding step
• Library functions like log1p and sinpi use IEEE properties to implement more accurate and efficient versions of common combinations of operations
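A small Julia illustration: fma rounds only once, so it can differ from the separately rounded multiply then add, and log1p stays accurate where log(1 + x) loses everything to rounding:

julia> x = 1.0 + 2.0^-30;

julia> x * x - 1.0 == fma(x, x, -1.0)   # fma keeps low-order bits the separate operations discard
false

julia> log(1.0 + 1e-17)   # 1.0 + 1e-17 rounds to 1.0 before log ever sees it
0.0

julia> log1p(1e-17)
1.0e-17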

## Avoiding Error

• Method 1: compare to the result of BigFloat calculations
  • absolute error
  • relative error
• Method 2: calculate the bounds due to known rounding error
  • IEEE 1788
  • ValidatedNumerics.jl
  • interval arithmetic
• Method 3: calculate error analytically
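A minimal sketch of Method 1 in Julia, using BigFloat (256 bits by default) results as the reference value:

julia> x = 0.1 + 0.2;

julia> x == big"0.3"                    # the Float64 result is not the true value...
false

julia> abs(x - big"0.3") < eps(0.3)     # ...but the absolute error is smaller than one ulp at 0.3
true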

## Evaluating Error

### Validated Numerics

• interval arithmetic
  • given two intervals A and B and a function taking two values
  • applying the function to the intervals generates an interval C which contains the results of the function applied to any pair of values selected from A and B
• IEEE 1788: interval arithmetic for IEEE 754
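A hand-rolled sketch of the idea (not the actual API of ValidatedNumerics.jl, which does this rigorously): represent an interval as a (lo, hi) tuple and round the bounds outward by one ulp after each operation, so the result is guaranteed to enclose the exact sum of the endpoints:

julia> iadd(a, b) = (prevfloat(a[1] + b[1]), nextfloat(a[2] + b[2]));

julia> s = iadd((0.1, 0.1), (0.2, 0.2));

julia> s[1] <= 0.3 <= s[2]   # 0.3 lies inside the computed enclosure
true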

## Case Study: Summation

• Julia and numpy both use "pairwise summation" instead of a simple left-to-right sum; it accumulates less error but is just as fast
• Another method, Kahan (compensated) summation, keeps a running estimate of the rounding error and corrects each incoming term with it
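A short Julia sketch of Kahan summation (kahan_sum is a name introduced here for illustration), compared with a plain left-to-right sum:

julia> function kahan_sum(xs)
           s = 0.0
           c = 0.0              # running compensation for lost low-order bits
           for x in xs
               y = x - c        # correct the incoming term
               t = s + y        # low-order bits of y may be lost here...
               c = (t - s) - y  # ...so recover them for the next round
               s = t
           end
           return s
       end;

julia> foldl(+, fill(0.1, 10))   # naive left-to-right sum
0.9999999999999999

julia> kahan_sum(fill(0.1, 10))  # compensated sum recovers the correctly rounded result
1.0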

## Case Study: Summation

David Goldberg. "What Every Computer Scientist Should Know About Floating-Point Arithmetic," ACM Computing Surveys, Vol. 23, No. 1, March 1991, pp. 5-48.

https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html

# Alternatives

## to binary floating point

• With a decimal type you can have trunc(2.51, digits=2) == 2.51!
• decimal64 is a fixed-precision decimal floating-point format
• You can have:
  • arbitrary precision (set to any value), "configurable" precision (pick from a small set of values), or (rarely) infinite precision
  • fixed point
  • decimal
  • in many combinations

## Alternatives

• generic number code is first class in Julia
  • easy to write a new number type
  • easy to make it fast
  • combine all sorts of types
• many packages for many number types
  • C library wrappers
  • pure Julia
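For instance, a toy mean3 function (a name introduced here for illustration) works unchanged across number types:

julia> mean3(a, b, c) = (a + b + c) / 3;

julia> mean3(1, 2, 3)              # Int arguments
2.0

julia> mean3(1//1, 2//1, 3//1)     # exact Rational arithmetic
2//1

julia> mean3(1.0f0, 2.0f0, 3.0f0)  # Float32 works too
2.0f0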

## Alternatives in Julia

| Package | Floating Point | Precision | Base |
|---------|----------------|-----------|------|
| FixedPointDecimals.jl | no | configurable | 10 |
| DecFP.jl (IEEE 754) | yes | configurable | 10 |
| FixedPointNumbers.jl | no | configurable | 2 |
| ArbFloats.jl | yes | arbitrary | 2 |
| BigFloat (Base) | yes | arbitrary | 2 |
• BigInt is an infinite-precision integer
• Rational{BigInt} can represent any rational number (limited by memory)
• Use this to get exact answers if you're curious
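For example, in base Julia:

julia> 1//10 + 2//10 == 3//10     # exact rational arithmetic
true

julia> big(1)//10 + big(2)//10    # Rational{BigInt}: exact, limited only by memory
3//10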

# Number Types at Invenia

## General Guidelines

• Pick a type which matches what you are trying to represent
  • represent integers with integer types
  • represent decimal numbers (e.g., dollar values) with decimal types
  • consider the range and precision of possible values
• Test Float64s for "close enough" instead of equality, and don't expect them to be exactly equal to decimal literal inputs
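In Julia, isapprox (the ≈ operator) is the usual way to test for "close enough":

julia> 0.1 + 0.2 == 0.3
false

julia> 0.1 + 0.2 ≈ 0.3    # isapprox with the default relative tolerance
true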

## FixedPointDecimals.jl

• We use this for bid prices during financial calculations
• Represents a number using an integer value, and the number of decimal places stored in the type

## FixedPointDecimals.jl

$x \times 10^{-p}$, where $x$ is an integer and $p$ is the number of decimal places stored in the type
• We can use integers for a lot of operations where we used to use floating point numbers in MATLAB
• Compilers like integers
• Some operations are much faster (e.g., squaring)
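A rough sketch of the idea using plain integers (standing in for the stored value; this is not FixedPointDecimals.jl's actual API): a value like 2.51 is held as an integer count of hundredths, so the arithmetic stays exact and fast:

julia> x = 251;        # 2.51 stored as an integer number of hundredths (p = 2)

julia> (x + x) / 100   # add in integer space, rescale only when displaying the decimal value
5.02

julia> x * x           # squaring is a single fast integer multiply (result has p = 4)
63001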

By Eric Davies
