IEEE 754

Systems in Rust

Author

Prof. Calvin

Announcements

  • Welcome to Systems in Rust
  • Action Items:
    • “SHA-512” is due next Friday
    • Lecture on numbers, inspired by the lab, while we wait.

Today

  • Understand precision through the lens of floats.
    • Understand the basics of floats
    • Understand the language of floats
    • Get a feel for when further investigation is required

Citation

  • Yoinked
  • url
  • Author John Farrier, Booz Allen Hamilton

Mission Statement

  • I’m a normal person.
    • I think 5/2 is 2.5 or perhaps 2½.
  • I’m a scientist.
    • I’m solving “Black-Scholes”
    • Does anyone know what this is doing? \[ d_2 = d_1 - \sigma\sqrt{T - t} = \frac{1}{\sigma\sqrt{T - t}}\left[\ln\left(\frac{S_t}{K}\right) + \left(r - q - \frac{1}{2}\sigma^2\right)(T - t)\right] \]

Common Fallacies

  • “Floating point numbers are numbers”
    • max(count()) Pinocchios
  • “It’s floating point error”
    • “All floating point involves magical rounding errors”
  • “Linux and Windows handle floats differently”
  • “Floating point represents an interval value near the actual value”

Common Fallacies 2.0

  • “A double (an f64) holds 15 decimal places and I only need 3, so I have nothing to worry about”1
  • “My programming language does better math than your programming language”2
  • “Why can’t computers just store whatever number I use”3

Anatomy of IEEE Floats

IEEE Float Specification

  • IEEE 754-1985, IEEE 854-1987, IEEE 754-2008
    • These are paywalled.
    • Those are years.
  • Provide for portable, provably consistent math
    • Consistent, not correct.

Assurances

  • Ensure some significant mathematical identities hold true:
    • \(x + y = y + x\)
      • Symmetry of addition
    • \(x + 0 = x\)
      • Additive identity
    • \(x = y \implies x - y = 0\)
      • Identity under subtraction

Assurances 2.0

\[ \frac{x}{\sqrt{x^2+y^2}} \leq 1 \]

  • What is missing?

IEEE Float Specification

  • Ensure every floating point number is unique
  • Ensure every floating point number has an opposite
    • Zero is a special case because of this
  • Specifies algorithms for addition, subtraction, multiplication, division, and square-root
    • Not really operations/relations! They are algorithms.

Aside: Scientific Notation

Scientific notation is a way of expressing numbers that are too large or too small to be conveniently written in decimal form, since to do so would require writing out an inconveniently long string of digits.

  • In scientific notation, nonzero numbers are written in the form

\[a \times 10^b\]

Aside: Explanation

  • In scientific notation, nonzero numbers are written in the form

\[a \times 10^b\]

  • \(a\) (the coefficient or mantissa) is a number greater than or equal to 1 and less than 10 (\(1 \le |a| < 10\)).
  • \(10\) is the base.
  • \(b\) (the exponent) is an integer.
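  • As a quick aside-within-the-aside, Rust can print floats in scientific notation directly with the {:e} formatter; a minimal sketch:
fn main() {
    // Scientific notation output via the LowerExp formatter.
    println!("{:e}", 300_000_000.0_f64); // 3e8
    println!("{:e}", 9.1093837e-28_f64); // the electron mass from the next slide
}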

Aside: Physical Examples

  1. Speed of light: The speed of light in a vacuum is approximately \(300,000,000 \text{ m/s}\) \[ 3 \times 10^8 \text{ m/s} \]

  2. Mass of an electron: The mass of an electron is approximately \(0.00000000000000000000000000091093837 \text{ g}\). \[ 9.1093837 \times 10^{-28} \text{ g} \]

Aside: Economic Examples

IEEE Layout

  • An approximation using scientific notation
    • \(x = (-1)^s \times 2^e \times 1.m\)
    • \(x = (-1)^{\text{sign bit}} \times 2^{\text{exponent}} \times 1.\text{mantissa}\), with base 2
      • Where the mantissa is the technical term for the digits after the radix point.

With Binary

  • \(x = -0b1^{\text{0b0}} \times 0b10^{\text{0b10000000}} \times 0b1.10010010000111111011011\)
  • Express in memory as the concatenation:
    • 0b0 to 0b10000000 to 0b10010010000111111011011
    • 01000000010010010000111111011011
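  • A minimal sketch (using the bit pattern above) to confirm the concatenation really is \(\pi\) as an f32:
fn main() {
    // Sign (1 bit), exponent (8 bits), mantissa (23 bits), written with 0b and _.
    let bits: u32 = 0b0_10000000_10010010000111111011011;
    println!("{}", f32::from_bits(bits));                         // 3.1415927
    println!("{}", f32::from_bits(bits) == std::f32::consts::PI); // true
}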

Singles and Doubles

  • 32 bits = 1 sign bit + 8 exponent bits + 23 mantissa bits
    • 0b0 to 0b10000000 to 0b10010010000111111011011
  • 64 bits = 1 sign bit + 11 exponent bits + 52 mantissa bits

Understanding check

  • What is the probability a real number \(a \in \mathbb{R}\) has an exact float representation?
  • What is the probability an integer \(n \in \mathbb{Z}\) has an exact float representation?
  • What is the probability a course number \(n < 600\) has an exact float representation?

Special Floats - NaN

  • Divide by Zero
    • 1 / 0 gives ±inf; 0 / 0 gives NaN
  • Not a Number (NaN)
>>> from numpy import float32 as f32
>>> f32(1)/f32(0)
<stdin>:1: RuntimeWarning: divide by zero encountered in scalar divide
np.float32(inf)
>>> 1/0
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ZeroDivisionError: division by zero
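  • The same operations in Rust, as a sketch: float division never panics, it just produces the IEEE special values.
fn main() {
    let zero = 0.0_f32;
    println!("{}", 1.0_f32 / zero);       // inf
    println!("{}", zero / zero);          // NaN
    // NaN is not equal to anything, including itself.
    println!("{}", f32::NAN == f32::NAN); // false
}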

Special Floats - inf

  • Signed Infinity
    • Overflow protection
    >>> f32(10) ** f32(100)
    <stdin>:1: RuntimeWarning: overflow encountered in scalar power
    np.float32(inf)
    >>> 10.0 ** 10000
    Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    OverflowError: (34, 'Numerical result out of range')
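  • Rust behaves like the NumPy case here, as a sketch: overflow quietly saturates to infinity rather than raising an error.
fn main() {
    let x = 10.0_f32.powf(100.0); // far beyond f32::MAX (~3.4e38)
    println!("{}", x);                  // inf
    println!("{}", x == f32::INFINITY); // true
}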

Infinity is signed

  • We can get negative infinity through a variety of means.
>>> -(f32(10) ** f32(100))
np.float32(-inf)
  • Note the (): we compute the overflowing power first and then negate it. The overflow warning does not reappear because Python’s default warning filter only prints a given warning once per location.
  • In some languages, like Julia, this is never regarded as an overflow error (we need larger powers there because Julia defaults to f64).
julia> 10.0^10000
Inf

julia> -(10.0^10000)
-Inf

Signed Zero

  • Signed Zero
    • Underflow protection, preserves sign
      • \(0 = -0\)
    >>> -1 / 10 ** 1000
    -0.0
    >>> 0.0 == (-1 / 10 ** 1000)
    True
    >>> -0.0
    -0.0
    >>> 0.0 - 0.0
    0.0
  • Works in base Python (i.e. no NumPy needed).
  • Couldn’t get it to work in Julia actually (went to -inf or NaN)
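  • A small Rust sketch of the same behavior: -0.0 compares equal to 0.0, but the sign bit is set and division can observe it.
fn main() {
    let neg = -0.0_f32;
    println!("{}", neg == 0.0);             // true
    println!("{}", neg.is_sign_negative()); // true
    println!("{}", 1.0_f32 / neg);          // -inf
    println!("{:032b}", neg.to_bits());     // a 1 followed by 31 zeros
}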

Simple Example

  • Floating point is great because it will work exactly how you expect.
>>> 0.1 + 0.2 == 0.3
False
  • More or less.
>>> 0.1 + 0.2
0.30000000000000004

Simple Example

  • We shouldn’t be surprised by this!
>>> _ = [print(f"{f32(i/10):1.100}") for i in range(1,4)]
0.100000001490116119384765625
0.20000000298023223876953125
0.300000011920928955078125
>>> _ = [print(f"{f64(i/10):1.100}") for i in range(1,4)]
0.1000000000000000055511151231257827021181583404541015625
0.200000000000000011102230246251565404236316680908203125
0.299999999999999988897769753748434595763683319091796875

Storage of 1.

  • \(x = (-1)^s \times 2^e \times 1.m\)
  • \(x = (-1)^{\text{sign bit}} \times 2^{\text{exponent}} \times 1.\text{mantissa}\)
  • \(x = (-1)^{\text{sign bit}} \times 2^{\text{exponent}} \times \left(1 + \text{mantissa} \times \frac{1}{2^n}\right)\)
  • \(1 = (-1)^0 \times 2^0 \times \left(1 + 0 \times \frac{1}{2^n}\right)\)
  • So the sign bit, the exponent, and the mantissa are all zero.
    • I should note I was unable to reproduce this in Rust or C.
    • I’m not really a floating pointer.

Except…

  • This is actually a little bit fake.
$ cat src/main.rs
fn main() {
    let x:f32 = 1.0;
    println!("{:032b}", x.to_bits());
}
$ cargo run
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.00s
     Running `target/debug/bits`
00111111100000000000000000000000
  • What is that?

Shift 127

While the exponent can be positive or negative, in binary formats it is stored as an unsigned number that has a fixed “bias” added to it.

  • The IEEE 754 binary formats specify that the exponent, whether positive or negative, is stored as an unsigned number with a fixed “bias” added to it.
  • In the case of 32 bit floats, the bias is 127.

Breakdown

  • Take the following:
00111111100000000000000000000000
  • That is:
    • Sign bit of 0
    • Exponent of 01111111
      • Exponent fields of all zeroes (zero and subnormals) and all ones (inf and NaN) are reserved.
    • Mantissa of 23 zeroes.
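  • A short sketch extracting the three fields from 1.0_f32, assuming the layout above, to confirm the biased exponent:
fn main() {
    let bits = 1.0_f32.to_bits();
    let sign     = bits >> 31;          // 1 bit
    let exponent = (bits >> 23) & 0xFF; // 8 bits, stored with the bias
    let mantissa = bits & 0x7F_FFFF;    // 23 bits
    println!("sign = {}, exponent = {}, mantissa = {}", sign, exponent, mantissa);
    println!("unbiased exponent = {}", exponent as i32 - 127); // 0
}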

How precise?

  • I have referred to f32 as having “32 bits of precision”.
  • How much precision is that?
  • Well, let’s set the least significant bit of the mantissa to one.
main.rs
fn main() {
    let x:f32 = 1.0;
    let mut b:u32 = x.to_bits();
    b |= 1;
    println!("{:032b}", b);
    println!("{:1.32}", f32::from_bits(b));
}

“Epsilon”

  • What do we get?
$ cargo run
    Finished `dev` profile [unoptimized + debuginfo] target(s) in 0.00s
     Running `target/debug/bits`
00111111100000000000000000000001
1.00000011920928955078125000000000
  • We refer to this value (less one) as machine epsilon, or perhaps \(\varepsilon\).
  • The difference between 1.0 and the next available floating point number.
  • Much lower, of course, with double and quad precision.
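  • Rust exposes this gap directly as a constant; a quick sketch checking it against the bit-flip experiment above:
fn main() {
    println!("{}", f32::EPSILON); // 2^-23, about 1.19e-7
    println!("{}", f64::EPSILON); // 2^-52, about 2.22e-16
    println!("{}", f32::from_bits(1.0_f32.to_bits() | 1) - 1.0 == f32::EPSILON); // true
}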

Easier: “Sig figs”

Format   Sign  Exp.  Mant.  Total bits  Bias   Precision (bits)  Decimal digits
Half     1     5     10     16          15     11                3-4
Single   1     8     23     32          127    24                6-9
Double   1     11    52     64          1023   53                15-17
Quad     1     15    112    128         16383  113               33-36

Sig figs

  • IEEE 754 was intended for scientific computing
  • That is, not systems computing, software etc.
  • Useful to us to think about information and how bits work.

Significant figures, also referred to as significant digits, are specific digits within a number that is written in positional notation that carry both reliability and necessity in conveying a particular quantity.

Small Values

  • For 32-bit floats, the smallest normal value is roughly \(1.18 \times 10^{-38}\).
  • How is \(1.0 \times 10^{-38}\) represented?
// In Rust, we may use '0b' prefix for binary representation.
// In Rust, we may use underscores within numerical values.
0b0_00000000_11011001110001111101110

Test it

  • Write a simple test…
src/main.rs
fn main() {
    let b:u32 = 0b0_00000000_11011001110001111101110;
    println!("{:032b}", b);
    println!("{:1.100}", f32::from_bits(b));
    let x:f32 = 10.0_f32.powf(-38.0_f32);
    println!("{:032b}", x.to_bits());
    println!("{:1.100}", x);
}
  • Results are as expected.
00000000011011001110001111101110
0.0000000000000000000000000000000000000099999993504564039245746141539976645128551939195729831580121175
00000000011011001110001111101110
0.0000000000000000000000000000000000000099999993504564039245746141539976645128551939195729831580121175

“Denormalized Number”

  • Numbers whose stored exponent field is all zeros
  • Required when the exponent is below the minimum exponent
  • Helps prevent underflow
  • Generally speaking, if you are using one of these, the math is about to get wronger than you’d think.
  • The good (???) language C++ provides a std::nextafter for which Rust crates exist.
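  • Rust can tell you when you have wandered into this territory; a minimal sketch:
fn main() {
    let tiny: f32 = 1.0e-38;             // below the smallest normal f32 (~1.18e-38)
    println!("{}", tiny.is_subnormal()); // true
    println!("{:032b}", tiny.to_bits()); // exponent field is all zeros
    println!("{:e}", f32::from_bits(1)); // smallest positive subnormal, ~1.4e-45
}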

Floating Point Precision

  • Representable values are not uniformly spaced
  • They are densest near zero: roughly half of all positive f32 values lie between 0.0 and 0.1
  • Precision falls away as magnitude grows

Visually

Floating Point Precision

The number of f32 values in each interval:

  • 0.0 to 0.1 = 1,036,831,949
  • 0.1 to 0.2 = 8,388,608
  • 0.2 to 0.4 = 8,388,608
  • 0.4 to 0.8 = 8,388,608
  • 0.8 to 1.6 = 8,388,608
  • 1.6 to 3.2 = 8,388,608
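  • These counts fall straight out of the bit patterns: for positive f32 values, to_bits() is monotone, so differencing it counts the floats in between (a sketch):
fn main() {
    println!("{}", 0.1_f32.to_bits());                     // 1036831949 values below 0.1
    println!("{}", 0.2_f32.to_bits() - 0.1_f32.to_bits()); // 8388608
    println!("{}", 0.4_f32.to_bits() - 0.2_f32.to_bits()); // 8388608
    println!("{}", 3.2_f32.to_bits() - 1.6_f32.to_bits()); // 8388608
}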

Errors in Floating Point

  • Storage of \(\pi\)
>>> from numpy import pi as pi
>>> # copy paste Wolfram|Alpha
>>> '3.14159265358979323846264338327950288419716939937510582097494459230781640628'
'3.14159265358979323846264338327950288419716939937510582097494459230781640628'
>>> f'{pi:1.100}'
'3.141592653589793115997963468544185161590576171875'
  • They differ fairly early on!

Measuring Error: “Ulps”

  • Units in Last Place

In computer science and numerical analysis, unit in the last place or unit of least precision (ulp) is the spacing between two consecutive floating-point numbers, i.e., the value the least significant digit (rightmost digit) represents if it is 1. It is used as a measure of accuracy in numeric calculations.

Haskell

  • Usually ULP is taught using Haskell.
> until (\x -> x == x+1) (+1) 0 :: Float
1.6777216e7
> it-1
1.6777215e7
> it+1
1.6777216e7
  • I’ll translate to .py/.rs

Generators

  • By convention, we check number-theoretic results with lazily evaluated generator expressions in Python.
  • The following generator yields every exponent for which adding one to that power of two does not change the value of a Python float; we take the first.
from itertools import count

gen = (i for i in count() if 2.0 ** float(i) + 1.0 == 2.0 ** float(i))

x = next(gen)

print('2.0 ** x + 0.0 = ', 2.0 ** x + 0.0)
print('2.0 ** x + 1.0 = ', 2.0 ** x + 1.0)

Iterators

  • Rust doesn’t need a generator: ranges like (0..) are already lazy, potentially infinite iterators; find returns an Option, hence the unwrap.
fn main() {
    // Find the first exponent where adding 1.0 no longer changes the value.
    let x = (0..).find(|&i| f64::powi(2.0, i) + 1.0 == f64::powi(2.0, i)).unwrap();

    let base = f64::powi(2.0, x);

    // Prints the result (2^53 = 9007199254740992.0)
    println!("2.0 ** x + 0.0 = {:.1}", base);
    println!("2.0 ** x + 1.0 = {:.1}", base + 1.0);
}
  • I hadn’t used find before; this is an LLM donation to our knowledge base.
  • By the way, that is 2 ** 53: there are 52 explicit mantissa bits plus the implicit leading 1, giving 53 bits of precision.

On “Ulps”

  • We can measure error using ULPs.

The gap between the two floating-point numbers nearest to x, even if x is one of them.

>>> from numpy import float32 as f32, pi as pi, float64 as f64
>>> f64(pi) - f64(f32(pi))
np.float64(-8.742278012618954e-08)
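  • The same measurement in Rust, as a sketch: nudge the bit pattern to find the ulp at \(\pi\), and note that the f32 error sits inside that gap.
fn main() {
    let pi = std::f32::consts::PI;
    let next = f32::from_bits(pi.to_bits() + 1);        // the next f32 after pi
    println!("{:e}", next - pi);                        // one ulp at pi, 2^-22 ~ 2.4e-7
    println!("{:e}", std::f64::consts::PI - pi as f64); // ~ -8.7e-8, matching NumPy above
}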

Rounding

  • We have some guarantees:
    • IEEE 754 requires that elementary arithmetic operations are correctly rounded to within 0.5 ulp
    • Transcendental functions are generally rounded to between 0.5 and 1.0 ulps
      • Transcendental as in \(\pi\) or \(e\)
  • ~9 ulps

Relative Error

  • The difference between the “real” number and the approximated number, divided by the “real” number.
>>> f64(f32(pi))/f64(pi)
np.float64(1.0000000278275352)
  • So for \(\pi\) in f32, the relative error (the ratio above, minus one) is about \(2.8 \times 10^{-8}\)
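  • The same calculation in Rust, treating the f64 value as the “real” number (a sketch):
fn main() {
    let approx = std::f32::consts::PI as f64; // pi rounded to f32, then widened back
    let real = std::f64::consts::PI;
    println!("{:e}", ((approx - real) / real).abs()); // about 2.8e-8
}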

Rounding Error

  • Induced by approximating an infinite range of numbers into a finite number of bits
  • The IEEE rounding modes are:
    • Towards the nearest
    • Towards zero
    • Towards positive infinity (round up)
    • Towards negative infinity (round down)

Rounding Error

  • What about rounding the half-way case? (i.e. 0.5)
    • Round Up vs. Round Even
  • Correct Rounding:
    • Basic operations (add, subtract, multiply, divide, sqrt) should return the number nearest the mathematical result.
    • If there is a tie, round to the number with an even mantissa
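  • A concrete halfway case, as a sketch: at \(2^{24}\) the gap between neighbouring f32 values is 2.0, so adding 1.0 always lands exactly in the middle, and the tie goes to the even mantissa.
fn main() {
    let even: f32 = 16_777_216.0; // mantissa ends in 0
    let odd: f32 = 16_777_218.0;  // mantissa ends in 1
    println!("{}", even + 1.0);   // 16777216: the tie rounds down, to the even mantissa
    println!("{}", odd + 1.0);    // 16777220: the tie rounds up, to the even mantissa
}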

Implementation

  • “Guard Bit”, “Round Bit”, “Sticky Bit”
  • Only used while doing calculations
    • Not stored in the float itself
    • The mantissa is shifted during calculations to align the radix points
  • The guard and round bits hold extra precision
  • The sticky bit is an OR of anything that shifts through it
0_00000000_00000000000000000000000_G_R_S..

Special bits

  • “Guard Bit”, “Round Bit”, “Sticky Bit”
  • [G][R][S]
  • [0][-][-] - Round Down (do nothing)
  • [1][0][0] - Round Up if the mantissa LSB is 1
  • [1][0][1] - Round Up
  • [1][1][0] - Round Up
  • [1][1][1] - Round Up
0_00000000_00000000000000000000000_G_R_S..

Significance Error

  • Compute the Area of a Triangle given side lengths
  • Hero(n)’s Formula:
    • Take \(s = \frac{1}{2}(a + b + c)\)
    • Then \(\sqrt{s(s-a)(s-b)(s-c)}\)
  • Kahan’s Algorithm:
    • Sort the sides so that \(x \leq y \leq z\)
    • Then \(\frac{1}{4}\sqrt{(x + (y + z)) \times (z - (x - y)) \times (z + (x - y)) \times (x + (y - z))}\)

Write them

src/lib.rs
fn heron(a:f32, b:f32, c:f32) -> f32 {
    let s = (a + b + c)/2.0_f32;
    return (s * (s - a) * (s - b) * (s - c)).sqrt();
}

fn kahan(a:f32, b:f32, c:f32) -> f32 {
    let mut s = vec![a,b,c];
    s.sort_by(|a, b| a.partial_cmp(b).unwrap()); // "partial"? Why not `.sort`?
    let mut prod = s[0] + s[1] + s[2];
    prod *= s[2] - (s[0] - s[1]);
    prod *= s[2] + (s[0] - s[1]);
    prod *= s[0] + (s[1] - s[2]);
    return prod.sqrt() / 4.0;
}

Test them

src/main.rs
fn main(){
    dbg!(heron(3.0,4.0,5.0));
    dbg!(kahan(3.0,4.0,5.0));
    let a = 100_000.0_f32;
    let c = 0.01_f32;
    dbg!(heron(a,a,c));
    dbg!(kahan(a,a,c));
}
  • Oh what’s that?
[src/main.rs:21:5] heron(3.0,4.0,5.0) = 6.0
[src/main.rs:22:5] kahan(3.0,4.0,5.0) = 6.0
[src/main.rs:25:5] heron(a,a,c) = 781.25006
[src/main.rs:26:5] kahan(a,a,c) = 500.0
  • Using f64 (vs f32) we get 499.9999999999994

Significance Error

  • Use Stable Algorithms
    • Keep big numbers with big numbers, little numbers with little numbers.
  • Parentheses can help
  • The compiler won’t re-arrange your math if it cannot prove it would yield the same!
    • E.g. you cannot .sort() a Vec<f32> in Rust: floats are only PartialOrd, because comparisons involving NaN have no consistent ordering (see the sketch below).
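  • A sketch of the workaround we dodged with partial_cmp in kahan above: total_cmp imposes the IEEE 754 total order (a positive NaN sorts after +inf), so sorting always succeeds.
fn main() {
    let mut v: Vec<f32> = vec![3.0, f32::NAN, 1.0, 2.0];
    // v.sort() would not compile: f32 is only PartialOrd because of NaN.
    v.sort_by(f32::total_cmp);
    println!("{:?}", v); // [1.0, 2.0, 3.0, NaN]
}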

Significance Error – Don’t floats?

  • Use integers
    • Very fast
    • Trade range for exactness: every value is exact, but the dynamic range is smaller
    • Only input/output may be impacted by floating point conversions
    • Financial applications represent dollars as whole cents or tenths of cents
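  • A minimal sketch of the cents approach: keep money as an integer count of cents and only format at the output boundary.
fn main() {
    let price_cents: i64 = 19_99; // $19.99
    let qty: i64 = 3;
    let total = price_cents * qty; // exactly 5997 cents, no rounding anywhere
    println!("${}.{:02}", total / 100, total % 100); // $59.97
}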

Significance Error – Don’t floats?

  • Use a math library
    • Slower
    • Define your own level of accuracy
    • Usually GMP or hardware specific.

Significance Error – Don’t floats?

  • Execute symbolically
    • Slowest
    • Arbitrary accuracy
    • Round only at output
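  • A toy sketch of the idea (a real program would reach for an arbitrary-precision rational or symbolic library): keep exact numerators and denominators so intermediate arithmetic never rounds.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Rational { num: i64, den: i64 }

fn gcd(a: i64, b: i64) -> i64 { if b == 0 { a.abs() } else { gcd(b, a % b) } }

impl Rational {
    fn new(num: i64, den: i64) -> Self {
        let g = gcd(num, den);
        Rational { num: num / g, den: den / g }
    }
    fn add(self, other: Self) -> Self {
        Rational::new(self.num * other.den + other.num * self.den, self.den * other.den)
    }
}

fn main() {
    let a = Rational::new(1, 10);
    let b = Rational::new(2, 10);
    // 1/10 + 2/10 is exactly 3/10; no 0.30000000000000004 in sight.
    println!("{}", a.add(b) == Rational::new(3, 10)); // true
}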

Significance Error – Don’t floats?

  • Write your own library?
    • Use vectors of bits or bytes.
    • Write your own arithmetic operations.
    • Test and debug your own implementation.
  • Coming soon to a homework near you!

Algebraic Assumption Error

  • Distributive Rule does not apply:
    • \(a = x \times y - x \times z \nRightarrow a = x (y - z)\)
  • Associative Rule does not apply:
    • \(a = x + (y + z) \nRightarrow a = (x + y) + z\)
  • Cannot interchange division and multiplication:
    • \(a = \frac{x}{10.0} \nRightarrow a = x \times 0.1\)

Distribution

>>> x, y, z
(np.float32(1000000000.0), np.float32(1.0000005), np.float32(1.0))
>>> x * y - x * z ; x * (y - z)
np.float32(448.0)
np.float32(476.83716)

Association

>>> x, y, z
(np.float32(16777216.0), np.float32(1.0), np.float32(1.0))
>>> x + (y + z) ; (x + y) + z
np.float32(16777218.0)
np.float32(16777216.0)

Interchange

  • Check last digit
>>> 1 / 7 / 100000000
1.4285714285714284e-09
>>> 1 / 100000000 / 7
1.4285714285714286e-09

Floating Point Exceptions

IEEE 754 Exception   Result when traps disabled                          Argument to trap handler
overflow             \(\pm \infty\) or \(\pm x_{\text{max}}\)            \(\text{round}(x \, 2^{-a})\)
underflow            \(0\), \(\pm 2^{e_{\text{min}}}\), or denormalized  \(\text{round}(x \, 2^{a})\)
divide by zero       \(\pm \infty\)                                      operands
invalid              \(\text{NaN}\)                                      operands
inexact              \(\text{round}(x)\)                                 \(\text{round}(x)\)

But Wait! There’s More!

  • Binary to Decimal Conversion Error
  • Summation Error
  • Propagation Error
  • Underflow, Overflow
  • Type Narrowing/Widening Rules

Testing

  • Design for Numerical Stability
  • Perform Meaningful Testing
  • Document assumptions
  • Track sources of approximation
  • Quantify goodness
    • Well conditioned algorithms
  • Backward error analysis
    • Is the computed output the exact result for slightly perturbed inputs?
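  • One way to make such testing concrete, as a sketch: compare against a documented relative-error budget instead of exact equality. The approx_eq helper and the 1e-12 tolerance here are illustrative assumptions, not a prescribed standard.
fn approx_eq(a: f64, b: f64, rel_tol: f64) -> bool {
    (a - b).abs() <= rel_tol * a.abs().max(b.abs())
}

fn main() {
    let computed = 0.1 + 0.2;
    assert!(approx_eq(computed, 0.3, 1e-12)); // passes
    assert!(computed != 0.3);                 // exact equality would not
    println!("both assertions held");
}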

Today

  • Understand precision through the lens of floats.
    • Understand the basics of floats
    • Understand the language of floats
    • Get a feel for when further investigation is required

Footnotes

  1. Grammatically, this statement should be “so I have nothing about which to worry.”↩︎

  2. This is true in the special case that your programming language is Haskell.↩︎

  3. This is due to how many numbers there are.↩︎