Floating Point Arithmetic for Data Analysts
Elizabeth Byerly
Bálint Pető
2016-07-28
Press "s" for presenter's notes
Floating point refers to the system our software uses to store real numbers as a series of 0s and 1s. The purpose of this presentation is to draw out valid real number operations that fail with floating point, identify the coding errors that result, and discuss what we can do to anticipate and avoid these situations.
This presentation is language agnostic; it applies to any software that adheres to the IEEE 754 floating point standard (R, Stata, Excel, Python, SAS, ...). Actual code chunks are a mix of R and Stata, but should be readable as pseudo-code.
Purpose
We're going to provide the barest description of our subject, floating point numbers and arithmetic, and then list what you will leave our presentation knowing.
Computers don't think in real numbers
Numbers are stored as a finite set of 0s and 1s.
Most real numbers are necessarily stored inaccurately.
We think in real numbers and base-10. We write our analytical code.
Our computers think in fixed-width numbers and base-2. They run our analytical code.
If we assume our computer is thinking about numbers the same way we do, our code will do something other than what we expect.
Floating point is a standard
It allows us to reason consistently about how our programs will store numbers and perform arithmetic operations.
If computers can't store a "true" real number, what do they store? It used to be that every low-level software engineer provided their own answer to this question.
Defined for software portability: analytical code written on a Dell will work on a Lenovo.
Software portability was NOT standard. A lot of work has gone into ensuring our high level programs can be moved across machines without us having to spend thought and money to refactor them for consistent performance.
You will leave with:
Sufficient understanding of floating point arithmetic to avoid common errors
Jargon and references to allow for subsequent investigation if you are interested
Agenda
Why we're here: a story of floating point error
Representation and arithmetic
Sources of error
Avoiding error
We begin by explaining the QC experience that led to this presentation. It's a great example of bad assumptions, perfectly obvious to us, leading to ridiculously wrong results.
Next, we'll provide a breakneck guide through how numbers are represented internally and how basic arithmetic is performed (addition and multiplication).
Once we've established how floating point works, we'll explain where floating point doesn't work (its most common failure modes).
Finally, we'll provide our current best tips for avoiding error when performing arithmetic on a computer.
At the end, if we have time, we will have sections on the IEEE standard that defines floating point (its history, what it defines) and alternative numerical representation options (rational numbers, unums).
Scenario
We want to convert a percentage into a categorical "bucket" variable.
..., (0.7, 0.8], (0.8, 0.9], ...
We define our cutoff points programmatically:
>> x <- 0
>> for (i in 1:10) { x <- c(x, x[i] + .1) }
>> x
[1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
When we use those cutoffs on our continuous variable...
Don't be scared by the code. It starts at 0 and increments by 0.1, ten times, using addition.
0.8 is bucketed into (0.8, 0.9] instead of (0.7, 0.8]!
> 0.7 + 0.1 == 0.8
[1] FALSE
. assert 0.7 + 0.1 == 0.8
assertion is false
This is not fixed by "double" and it is not dependent on your preferred software package.
> print(0.8, digits = 20)
[1] 0.80000000000000004
> print(0.7 + 0.1, digits = 20)
[1] 0.79999999999999993
None of these three numbers can be stored exactly in floating point. If we manually enter 0.8, it will resolve to 0.80000000000000004 every time. However, the computer has no way of guessing we *mean* 0.80000000000000004 when we enter 0.7 + 0.1.
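The deck's chunks are R and Stata, but since any IEEE 754 language behaves identically, here is the same check sketched in Python (which also uses 64-bit doubles):

```python
# Python doubles are IEEE 754, so the same comparison fails.
print(0.7 + 0.1 == 0.8)     # False
print(f"{0.8:.17g}")        # 0.80000000000000004
print(f"{0.7 + 0.1:.17g}")  # 0.79999999999999993
```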
We will walk through how this exact calculation is performed in a few minutes.
It Could Happen to You
The moral of our story!
We have to understand floating point arithmetic to avoid making bad assumptions about how our code will actually execute.
Representation and arithmetic
Understanding how a floating point number is stored and, to start, how a few arithmetic operations are performed helps us enormously in developing an intuition about where errors can occur.
A base-2 normalized number encoded in 32 bits with a defined structure:
1 sign bit
8 exponent bits
23 mantissa bits
0 100 0011 0 000 0000 0000 1001 1000 0010
+ 1.00000000000100110000010 * 10^(10000110 - 1111111)
A 32-bit float (128.03714) in all its glory!
We're going to limit our description to singles (known as floats in Stata). Doubles are equivalent to singles except that they are stored in 64 bits (not 32), allowing smaller and larger magnitudes of numbers to be represented with greater precision. They otherwise operate exactly the same.
Floating point errors may be ameliorated by higher precision, but they are not avoided! `set type double` does not cure all wounds.
If you're confused
and afraid, so are
we. It gets easier!
Don't worry, we're breaking it down.
First we're going to describe what we mean by "sign," "exponent," and "mantissa." Then we'll provide examples of converting from base-10 to floating point.
Sign
0 100 0011 0 000 0000 0000 1001 1000 0010
+ 1.00000000000100110000010 * 10^(10000110 - 1111111)
The sign bit is set to 0 for positive numbers and 1 for negative numbers.
Exponent
0 100 0011 0 000 0000 0000 1001 1000 0010
+ 1.00000000000100110000010 * 10^(10000110 - 1111111)
The exponent ranges from -126 to +127 using a bias: the 8 bits encode values from 0 to 255, and the final exponent is determined by subtracting 127.
You'll note two possible encodings of the exponent are omitted (-126 to +127 is 254 values, vs. the 256 we can store in 8 bits!).
If you want to know how many values can be stored in a given number of bits, it's 2^n, so 8 bits store 2^8 = 256 values. This works in any base: three digits in base-10 store 10^3 = 1,000 values.
Mantissa
0 100 0011 0 000 0000 0000 1001 1000 0010
+ 1.00000000000100110000010 * 10^(10000110 - 1111111)
The mantissa stores up to 24 significant digits of a normalized base-2 number.
The mantissa only has 23 bits, but it takes advantage of the fact that every normalized base-2 number (except zero) starts with 1: the leading 1 is implicit and does not need to be stored.
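To make the sign/exponent/mantissa split concrete, here is a sketch in Python (not the deck's R/Stata; the standard `struct` module just gives us access to the raw 32 bits):

```python
import struct

def float_bits(x):
    """Split a number's 32-bit float encoding into sign, biased exponent, and mantissa."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31               # 1 bit
    exponent = (bits >> 23) & 0xFF  # 8 bits, bias of 127 still included
    mantissa = bits & 0x7FFFFF      # 23 bits, implicit leading 1 not stored
    return sign, exponent, mantissa

# 2.25 = +1.001 * 2^1: sign 0, biased exponent 1 + 127 = 128, mantissa 001000...0
print(float_bits(2.25))  # (0, 128, 1048576)
```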
+/- Zero
* 000 0000 0 000 0000 0000 0000 0000 0000
+/- Infinity
* 111 1111 1 000 0000 0000 0000 0000 0000
NaN
1 111 1111 1 100 0000 0000 0000 0000 0000
Special values have reserved space.
Zero and infinity have both positive and negative versions; the sign bit is not part of their special encoding.
NaN is technically QNaN (or a quiet, propagating NaN), a valid NaN generated by an indeterminate mathematical operation. There is also special notation for SNaN (signalling NaN), but that's an error value that gets handled behind the scenes and wouldn't be presented to the user.
Let's talk through converting a base-10 number into floating point.
How are base-10 inputs converted to floating point?
Initial base-10 representation
+2.25
-1.10
This is our comfort zone.
Convert to base-2
+10.01
-1.0001100110011...
There are many resources for learning how to convert between bases! Mississippi College has a readable explanation.
The important thing to note here is that "1.1" cannot be finitely represented in base-2. The "0011-0011" pattern repeats infinitely.
Normalize with scientific notation
+1.001 * 10^1
-1.0001100110011... * 10^0
Identify floating point elements
+ 1.001 * 10^1
- 1.000110011... * 10^0
We can now point to the sign, mantissa, and exponent.
Account for exponent bias
+ 1.001 * 10^(10000000 -1111111)
- 1.000110011... * 10^(1111111 -1111111)
Remember that floating point subtracts 127 from the stored exponent, so the exponents 1 and 0 are stored as 1 + 127 = 128 (10000000) and 0 + 127 = 127 (1111111) respectively.
Structure as floating point
0 100 0000 0 001 0000 0000 0000 0000 0000
1 011 1111 1 000 1100 1100 1100 1100 11...
The first value converts neatly to floating point, appending trailing zeroes to the right.
We still have that awkward infinite mantissa.
Round
0 100 0000 0 001 0000 0000 0000 0000 0000
1 011 1111 1 000 1100 1100 1100 1100 1101
The IEEE standard recommends "round to nearest, ties to even" rounding (see the Wikipedia article). If you were raised on the split rule, 01234|56789, it will throw you off!
Resulting base-10 value
+2.25
-1.1000000238418579
Converting back to base ten, we see that the second value (which cannot be exactly represented by floating point) has been slightly corrupted.
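We can check this result by round-tripping 1.1 through a 32-bit float, sketched here in Python with the standard `struct` module:

```python
import struct

def as_single(x):
    """Return the nearest 32-bit float value of x, widened back to a Python double."""
    (y,) = struct.unpack(">f", struct.pack(">f", x))
    return y

print(f"{as_single(1.1):.17g}")  # 1.1000000238418579, matching the slide
```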
Decimal input
0.1 + 0.7
Our old nemesis...
Let's understand exactly what went wrong.
Floating point equation after conversion
0 011 1101 1 100 1100 1100 1100 1100 1101
+
0 011 1111 0 011 0011 0011 0011 0011 0011
Neither of these numbers is finitely representable in base-2. Note that 0.7's binary representation is rounded down (the stored value is slightly too small) and 0.1's is rounded up (slightly too large).
Critically, the rounding occurs at a different binary place in each number, so the errors differ in both size and direction: 0.7's absolute error is larger than 0.1's.
Base-2
+ 1.10011001100110011001101 * 10^-100
+
+ 1.01100110011001100110011 * 10^-1
We've skipped accounting for exponent bias.
Match the exponents
+ .00110011001100110011001101 * 10^-1
+
+ 1.01100110011001100110011 * 10^-1
Result
+1.10011001100110011001100101 * 10^-1
Our result!
0.7999999895691872
Identify floating point elements
+ 1.10011001100110011001100 101 * 10^-1
We have three bits that don't fit into the 23-bit mantissa.
Round
+ 1.10011001100110011001101 * 10^-1
The value is rounded. The three leftover bits (101) are worth more than half a unit in the last place, so the mantissa rounds up.
There are multiple methods of rounding. The recommended method is "round to nearest, ties to even", different from what is taught in schools, where .5 always goes up; breaking ties toward the even digit prevents accumulating bias.
Resulting floating point
0 011 1111 0 100 1100 1100 1100 1100 1101
Finally, it is stored as a floating-point value. Base-10 before/after rounding: 0.7999999895691872 → 0.80000001192092896.
In single precision, then, the rounded sum is exactly the stored value of 0.8, so float(0.7) + float(0.1) == float(0.8) here. In double precision the same calculation rounds the other way: 0.7 + 0.1 resolves to 0.79999999999999993 while 0.8 entered directly resolves to 0.80000000000000004, which is why the comparison failed earlier.
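Python's built-in round() happens to use the same "ties to even" rule for halfway cases, which makes the behavior easy to see:

```python
# Halfway cases go to the nearest even integer, not always up.
print(round(0.5))  # 0
print(round(1.5))  # 2
print(round(2.5))  # 2
```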
Also known as: Common Stumbling Points
False precision
print(mean(c(1.1, 2.74, 3, 4.9, 5.2)), digits = 20)
[1] 3.3880000000000003
Do you believe your model is measuring to the 16th significant digit?
"Little in this world is measured to a relative accuracy of ±2-24, the accuracy provided by float precision." - Gould
"Epsilon" refers to the smallest value such that 1 + epsilon != 1. This concept exists because a float's resolution is limited.
Exact comparisons
. assert float(1.1) == 1.1
assertion is false
. assert 0.1 + 0.2 > 0.3
Our second example is in double. Again, precision is not the cure!
This issue only meaningfully arises when we deal with "nice" decimal numbers (within the range of realistic precision). We need to think about our assertions, though, and be aware of when it is relevant.
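Instead of exact comparisons, compare within a tolerance. R has all.equal() and Stata has reldif(); the Python equivalent, shown here as an illustration, is math.isclose():

```python
import math

print(0.7 + 0.1 == 0.8)              # False: exact comparison
print(math.isclose(0.7 + 0.1, 0.8))  # True: comparison within a relative tolerance
```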
Overflow
> 10^309
[1] Inf
Overflow occurs when the system is requested to store a number that is larger than the maximum value that fits into floating point.
Single bound: 10^38.53
Double bound: 10^308.3
Underflow
> 10^-324
[1] 0
Single bound: 10^−44.85
Double bound: 10^−323.3
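The same bounds can be probed from Python, where doubles overflow to inf and underflow to 0:

```python
import math

print(float("1e309"))   # inf: beyond the double maximum, about 1.8 * 10^308
print(float("1e-324"))  # 0.0: below the smallest subnormal, about 4.9 * 10^-324
print(math.isinf(float("1e309")))  # True
```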
Drift
> x <- 0.7
> for (i in 1:3) {
+ print(x <- x + 0.1, digits = 22)
+ }
[1] 0.79999999999999993
[1] 0.89999999999999991
[1] 0.99999999999999989
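The same drift appears in Python, since the error of each +0.1 accumulates across iterations:

```python
x = 0.7
for _ in range(3):
    x += 0.1
    print(f"{x:.17g}")
# Final x is 0.99999999999999989, not 1.0
```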
Be aware of context
When is the exact value of a variable relevant to our analysis?
Counts
Financial data
Comparisons
Finance is the classic example of when "every penny counts."
Rational numbers
If you need exact calculations and have access to the data before it has been converted to floating point, you can store the numerator and denominator as integers for uncorrupted fractions.
data.frame(numerator = c(1, 1, 2, 2),
denominator = c(3, 3, 6, 6))
Two limitations: overflow is still relevant, and the vast majority of your software's built-in mathematical operations are no longer available to you. You will be doing more manual coding.
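Python ships a rational type in its standard library, fractions.Fraction, which illustrates the trade-off: exact results, at the cost of leaving most numeric tooling behind:

```python
from fractions import Fraction

# Exact: one tenth, three times, is exactly three tenths.
print(Fraction(1, 10) * 3 == Fraction(3, 10))  # True
# Floating point: the same sum misses.
print(0.1 + 0.1 + 0.1 == 0.3)                  # False
```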
Avoid unnecessary operations
> print((0.7 + 0.1 + 0.1 + 0.1), digits = 22)
[1] 0.99999999999999989
> print((0.7 + 0.1 * 3), digits = 22)
[1] 1
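The same contrast holds in Python: one rounding step (0.1 * 3) happens to land the sum exactly on 1.0, while three separate additions accumulate error:

```python
print(0.7 + 0.1 + 0.1 + 0.1 == 1.0)  # False: three additions, three roundings
print(0.7 + 0.1 * 3 == 1.0)          # True: the errors happen to cancel exactly
```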
IEEE 754
Institute of Electrical and Electronics Engineers
IEEE Standards Association
History
Originally released in 1985, updated in 2008.
Draft developed by Intel.
Originally planned decimal storage, but too complicated to implement.
Standards defined
16, 32, 64, 128, & 256 bit binary numbers
32, 64, & 128 bit decimal numbers
Subnormal numbers
Representation of +/- zero, +/- infinity, SNaN, QNaN
Minimum precision calculations (add, subtract, multiply, divide, square root)
Method of rounding to nearest integer
Comparison results for non-obvious values (NaN, +/- infinity)