Numbers in Digital Systems

Numerical Methods

David Mayerich

Scalable Tissue Imaging and Modeling (STIM) Laboratory

Department of Electrical and Computer Engineering

Cullen College of Engineering

University of Houston

David Mayerich

STIM Laboratory, University of Houston

Radix

Representing and Converting Base

Binary Numbers and Arithmetic

Bases in Digital Systems

Numerical Bases

The radix or base is the number of unique digits used to represent a number:

Decimal (base \(10\)):

2473_{10} = 2 \times 10^3 + 4 \times 10^2 + 7 \times 10^1 + 3 \times 10^0 = 2473

Octal (base \(8\)):

4651_{8} = 4 \times 8^3 + 6 \times 8^2 + 5 \times 8^1 + 1 \times 8^0 = 2473

Binary (base \(2\)):

\begin{split} 1001 1010 1001_{2} = & 1 \times 2^{11} + 0 \times 2^{10} + 0 \times 2^{9} + 1\times 2^{8} +\\ & 1 \times 2^{7} + 0 \times 2^{6} + 1 \times 2^{5} + 0 \times 2^{4} +\\ & 1 \times 2^{3} + 0 \times 2^{2} + 0 \times 2^{1} + 1 \times 2^{0} = 2473 \end{split}

Hexadecimal (base \(16\)), where the letters A to F stand for the digits 10 to 15:

9A9_{16} = 9 \times 16^2 + 10 \times 16^1 + 9 \times 16^0 = 2473

10 = A \quad 11 = B \quad 12 = C \quad 13 = D \quad 14 = E \quad 15 = F
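
The expansions above go from digits to a value. The reverse direction can be sketched with repeated division by the base. This is a minimal illustration, not part of the original slides; the helper name `to_base` and its buffer-based interface are assumptions made for the example.

```c
#include <stddef.h>

/* Write the digits of value in the given base (2 to 16) into out.
   Returns the number of digits written, or 0 if the base is unsupported
   or the buffer is too small. to_base is a hypothetical helper name. */
size_t to_base(unsigned long value, unsigned base, char *out, size_t out_size)
{
    static const char digits[] = "0123456789ABCDEF";
    char tmp[8 * sizeof(unsigned long)];   /* enough digits for base 2 */
    size_t n = 0;

    if (base < 2 || base > 16 || out_size == 0)
        return 0;

    /* Each remainder is the next digit, least significant first. */
    do {
        tmp[n++] = digits[value % base];
        value /= base;
    } while (value != 0);

    if (n + 1 > out_size)
        return 0;

    /* Reverse so the most significant digit comes first. */
    for (size_t i = 0; i < n; i++)
        out[i] = tmp[n - 1 - i];
    out[n] = '\0';
    return n;
}
```

Calling `to_base(2473, 8, buf, sizeof buf)` with a sufficiently large `buf` should fill it with the octal digits 4651, and base 16 should give 9A9.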

Radix Points

Separates whole numbers from fractions in any base

Decimal (base \(10\)):

182.5_{10} = 1 \times 10^2 + 8 \times 10^1 + 2 \times 10^0 + 5 \times 10^{-1} = 182.5

Octal (base \(8\)):

266.4_{8} = 2 \times 8^2 + 6 \times 8^1 + 6 \times 8^0 + 4 \times 8^{-1} = 182.5

Binary (base \(2\)):

\begin{split} 1011 0110.1_{2} = & 1 \times 2^{7} + 0 \times 2^{6} + 1 \times 2^{5} + 1\times 2^{4} +\\ & 0 \times 2^{3} + 1 \times 2^{2} + 1 \times 2^{1} + 0 \times 2^{0} +\\ & 1 \times 2^{-1} = 182.5 \end{split}

Hexadecimal (base \(16\)):

B6.8_{16} = 11 \times 16^1 + 6 \times 16^0 + 8 \times 16^{-1} = 182.5

10 = A \quad 11 = B \quad 12 = C \quad 13 = D \quad 14 = E \quad 15 = F

Binary Numbers

Registers

Converting Binary Numbers

Binary Arithmetic

Binary

Modern computers represent numbers using memory cells
Individual cells can occupy two distinct states: high and low voltage
Each cell represents one binary digit: high = 1, low = 0

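0100 \ \ 1101 \ \ 0110 \ \ 1010 \ \ 1111 \ \ 0011

Each group of 4 bits is a nibble, and each pair of nibbles (8 bits) is a byte

Numbers are represented as sequences of digits
The number of digits determines how many different values can be represented, from a minimum to a maximum:
- \(4\) decimal digits \(\rightarrow 10^4\) values, \(4\) binary digits \(\rightarrow 2^4\) values
- \(6\) decimal digits \(\rightarrow 10^6\) values, \(6\) binary digits \(\rightarrow 2^6\) values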

Reading Binary Numbers

Initialize a decimal register \(x_{0} = 0\)
For each binary digit, double \(x\) and add the associated digit as a decimal value

1101 \rightarrow 1 \quad 1 \quad 0 \quad 1

Used to be known as "double dabble"

x_0 = 0

\begin{split} x_1 = 2(0)&+1\\ &=1 \end{split}

\begin{split} x_2 = 2(1) &+1\\ &=3 \end{split}

\begin{split} x_3 = 2(3) &+0\\ &=6 \end{split}

\begin{split} x_4 = 2(6) &+1\\ &=13 \end{split}
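
The doubling procedure above maps directly onto a short loop. The following is a minimal sketch, assuming the binary number arrives as a character string; the function name `read_binary` is hypothetical and not from the slides.

```c
/* Read a string of '0' and '1' characters as an unsigned integer by
   doubling the running value and adding each digit, left to right.
   Any other character (such as a space between nibbles) is skipped. */
unsigned long read_binary(const char *bits)
{
    unsigned long x = 0;                 /* x_0 = 0 */
    for (; *bits != '\0'; bits++) {
        if (*bits == '0' || *bits == '1')
            x = 2 * x + (unsigned long)(*bits - '0');
    }
    return x;
}
```

With this sketch, `read_binary("1101")` follows the same steps as the worked example (1, 3, 6, 13) and returns 13.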

Reading Binary Numbers

0110 \ 1010 \rightarrow 0 \quad\quad 1 \quad\quad 1 \quad\quad 0 \quad\quad 1 \quad\quad 0 \quad\quad 1 \quad\quad 0

x_0 = 0

\begin{split} x_1 = 2(0)&+0\\ &=0 \end{split}

\begin{split} x_2 = 2(0)&+1\\ &=1 \end{split}

\begin{split} x_3 = 2(1)&+1\\ &=3 \end{split}

\begin{split} x_4 = 2(3)&+0\\ &=6 \end{split}

\begin{split} x_5 = 2(6)&+1\\ &=13 \end{split}

\begin{split} x_6 = 2(13)&+0\\ &=26 \end{split}

\begin{split} x_7 = 2(26)&+1\\ &=53 \end{split}

\begin{split} x_8 = 2(53)&+0\\ &=106 \end{split}

Fractional Binary Numbers

1101.1011 \rightarrow 1 \quad\quad 1 \quad\quad 0 \quad\quad 1 \quad.\quad 1 \quad\quad 0 \quad\quad 1 \quad\quad 1

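Read the digits on each side of the radix point separately. Whole part \(1101\):

x_0 = 0

\begin{split} x_1 = 2(0)&+1\\ &=1 \end{split}

\begin{split} x_2 = 2(1)&+1\\ &=3 \end{split}

\begin{split} x_3 = 2(3)&+0\\ &=6 \end{split}

\begin{split} x_4 = 2(6)&+1\\ &=13 \end{split}

Fractional part \(1011\), read as an integer:

x_0 = 0

\begin{split} x_1 = 2(0)&+1\\ &=1 \end{split}

\begin{split} x_2 = 2(1)&+0\\ &=2 \end{split}

\begin{split} x_3 = 2(2)&+1\\ &=5 \end{split}

\begin{split} x_4 = 2(5)&+1\\ &=11 \end{split}

The fractional part has \(4\) digits, so it is divided by \(2^4\):

1101.1011_2 = 13 + \frac{11}{2^4} = 13.6875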
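
The same doubling loop can handle a radix point by reading the fractional digits as an integer and dividing by a power of two at the end. This is a minimal sketch under that assumption; `read_binary_fixed` is a hypothetical name, and the result is returned as a `double` for convenience.

```c
/* Read a binary string with an optional radix point, e.g. "1101.1011".
   Digits on each side are read by doubling. The fractional digits are
   then divided by 2^k, where k is the number of digits after the point.
   Characters other than '0', '1' and '.' are skipped. */
double read_binary_fixed(const char *s)
{
    unsigned long whole = 0, frac = 0;
    double scale = 1.0;          /* becomes 2^k */
    int after_point = 0;

    for (; *s != '\0'; s++) {
        if (*s == '.') {
            after_point = 1;
        } else if (*s == '0' || *s == '1') {
            unsigned long d = (unsigned long)(*s - '0');
            if (!after_point) {
                whole = 2 * whole + d;
            } else {
                frac = 2 * frac + d;
                scale *= 2.0;
            }
        }
    }
    return (double)whole + (double)frac / scale;
}
```

For the string "1101.1011" the whole part reads as 13, the fractional digits read as 11, and the scale is 16, so the sketch returns 13 + 11/16.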

Binary Arithmetic

Arithmetic works the same way in any base

\begin{array}{rr} 0101.010 & 5.250\\ +\ 0011.101 & +\ 3.625\\ \hline 1000.111 & 8.875 \end{array}

\begin{array}{rr} 0101.010 & 5.250\\ \times\ 0011.101 & \times\ 3.625\\ \hline 0101010 & \\ 0000000\phantom{0} & \\ 0101010\phantom{00} & \\ 0101010\phantom{000} & \\ 0101010\phantom{0000} & \\ \hline 10011.000010 & 19.03125 \end{array}
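
One way to reproduce this arithmetic in code is to store each value as an integer scaled by a power of two (fixed point). The sketch below assumes three fractional bits, matching the operands above; the names `FRAC_BITS`, `fx_add`, `fx_mul_raw`, and `fx_to_double` are hypothetical.

```c
#include <stdint.h>

#define FRAC_BITS 3   /* three binary digits after the radix point */

/* Fixed-point values are integers scaled by 2^FRAC_BITS, so addition
   is plain integer addition and keeps FRAC_BITS fractional bits. */
uint32_t fx_add(uint32_t a, uint32_t b) { return a + b; }

/* The raw integer product carries 2 * FRAC_BITS fractional bits. */
uint32_t fx_mul_raw(uint32_t a, uint32_t b) { return a * b; }

/* Interpret a raw integer that has frac_bits fractional bits. */
double fx_to_double(uint32_t raw, unsigned frac_bits)
{
    return (double)raw / (double)(1u << frac_bits);
}
```

Here 0101.010 is stored as the integer 0x2A and 0011.101 as 0x1D. Their sum 0x47 reads as 1000.111, and their raw product, interpreted with six fractional bits, reads as 10011.000010.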

Representing Integers

Integers

Signed/Unsigned Integers

Overflow

Signed and Unsigned Integers

Registers have no minus sign, so negative integer values such as \(-17_{10}=-0001\ 0001_2\) need an encoding

Sign and magnitude - leading bit represents the sign \(\bm{+}\rightarrow 0\) and \(\bm{-}\rightarrow 1\)

0001\ 0001_2 = 17_{10} \quad\rightarrow\quad 1001\ 0001_2 = -17_{10}

0111\ 1011_2 = 123_{10} \quad\rightarrow\quad 1111\ 1011_2 = -123_{10}

0101\ 0110_2 = 86_{10} \quad\rightarrow\quad 1101\ 0110_2 = -86_{10}

One's complement - negative values are the bitwise NOT of positive values

0001\ 0001_2 = 17_{10} \quad\rightarrow\quad 1110\ 1110_2 = -17_{10}

0111\ 1011_2 = 123_{10} \quad\rightarrow\quad 1000\ 0100_2 = -123_{10}

0101\ 0110_2 = 86_{10} \quad\rightarrow\quad 1010\ 1001_2 = -86_{10}

Two's complement - negative values are the bitwise NOT \(+1\)

0001\ 0001_2 = 17_{10} \quad\rightarrow\quad 1110\ 1110_2 +1 = 1110\ 1111_2 = -17_{10}

0111\ 1011_2 = 123_{10} \quad\rightarrow\quad 1000\ 0100_2 +1 = 1000\ 0101_2 = -123_{10}

0101\ 0110_2 = 86_{10} \quad\rightarrow\quad 1010\ 1001_2 +1 = 1010\ 1010_2 = -86_{10}
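
The three encodings can be compared side by side with a few bitwise operations. This is a minimal sketch for 8-bit registers, assuming the input is a magnitude between 0 and 127; the function names are hypothetical.

```c
#include <stdint.h>

/* Each function returns the 8-bit pattern that encodes -v,
   for a magnitude v in the range 0 to 127. */

/* Sign and magnitude: set the leading bit. */
uint8_t neg_sign_magnitude(uint8_t v)  { return (uint8_t)(0x80u | v); }

/* One's complement: bitwise NOT. */
uint8_t neg_ones_complement(uint8_t v) { return (uint8_t)~v; }

/* Two's complement: bitwise NOT, then add 1. */
uint8_t neg_twos_complement(uint8_t v) { return (uint8_t)(~v + 1u); }
```

For a magnitude of 17 (0001 0001) the three functions give 1001 0001, 1110 1110, and 1110 1111. A practical property of two's complement is that ordinary 8-bit addition of a value and its negation wraps to zero, so the same adder works for signed and unsigned values.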

Integer Overflow

What happens when the result exceeds the available register size?
Consider a \(6\) bit addition:

Overflow is defined for unsigned integers:
1. Assume an \(n\)-bit operation
2. Perform the operation
3. Keep the \(n\) least significant bits
Keeping \(n\) bits is equivalent to the operation \(x\ \text{mod}\ 2^n\). For \(n=6\):

97\ \text{mod}\ 2^6=33 \quad (1\ 100001_2 \rightarrow 100001_2)

C standard, §6.2.5/9:
A computation involving unsigned operands can never overflow, because a result that cannot be represented by the resulting unsigned integer type is reduced modulo the number that is one greater than the largest value that can be represented by the resulting type.

Overflow is undefined behavior for signed integers
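
The three steps for unsigned overflow can be sketched for an arbitrary register width using a bit mask. This is an illustration only: the function names are hypothetical, and the operands 60 and 37 mentioned afterwards are chosen for the example rather than taken from the slide.

```c
#include <stdint.h>

/* Keep the n least significant bits of x (1 <= n <= 31).
   This gives the same result as x mod 2^n. */
uint32_t keep_low_bits(uint32_t x, unsigned n)
{
    uint32_t mask = ((uint32_t)1 << n) - 1u;
    return x & mask;
}

/* n-bit unsigned addition: perform the operation, then keep n bits. */
uint32_t add_n_bits(uint32_t a, uint32_t b, unsigned n)
{
    return keep_low_bits(a + b, n);
}
```

With \(n = 6\), `add_n_bits(60, 37, 6)` forms the full sum 97 and keeps the low 6 bits, which gives 33, the same as 97 mod 64.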

What Does This Mean?

Every register is limited - \(n\) bits let you represent \(2^n\) different values
Signed integers use one bit as a sign bit to flag \(+\)/\(-\) values, so the largest magnitude is halved:

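unsigned char a;	// [0, 255]
signed char b;		// [-128, 127] (plain char may be signed or unsigned)

unsigned int x;		// [0, 2^32 - 1], assuming a 32-bit int
int y;				// [-2^31, 2^31 - 1], assuming a 32-bit int

Signed integer overflow is undefined
Unsigned integers overflow by "wrapping around" to the minimum value:
- This is the same result as a modulo operation:

#include <limits.h>	// UINT_MAX
unsigned int a = UINT_MAX + 1;	// a = 0

// a and b are unsigned int; the sum is formed in a wider type before the modulo
unsigned int c = a + b;
unsigned int z = (unsigned int)(((unsigned long long)a + b) % (1ULL << 32));
// c == z when unsigned int is 32 bits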

Implementing Floating Point

Floating Point Numbers

Floating Point Arithmetic

Digital Representations

Scientific Notation

Compressed format using a mantissa \(m\) and an exponent \(n\):

a = m \times 10^n \quad \text{where}\quad m\in \mathbb{Q},\ n\in\mathbb{Z}

\(m\) is the mantissa (or significand) and \(n\) is the exponent

Represent large and small numbers to simplify calculations:
- gravitational constant: \(G=6.6743 \times 10^{-11}\)
- speed of light: \(c=2.9979 \times 10^{8}\)
- elementary charge: \(e=1.6022 \times 10^{-19}\)
- electric permittivity: \(\epsilon_0=8.8542 \times 10^{-12}\)

Computers and calculators often use "E" to denote the exponent: 6.6743E-11
The mantissa is normalized: \(1 \leq m < b\) where \(b\) is the base
This ensures one representation for any value, so only the last form below is normalized:

G=667.43 \times 10^{-13}

G=0.0066743 \times 10^{-8}

G=6.6743 \times 10^{-11}

Floating Point

Scientific notation can be used in any base \(b\)
Specify a precision: number of digits for \(m\) and \(n\)
Mantissa and exponent can be negative or positive

a = \pm \ m \times b^n

G= + 6\ .\ 6\ 7\ 4\ 3 \times 10^{-\ 1\ 1} =+ 1\ .\ 1\ 4\ 6\ 6 \times 2^{-\ 3\ 4}

c=+ 2\ .\ 9\ 9\ 7\ 9 \times 10^{+\ 0\ 8} =+ 1\ .\ 1\ 1\ 6\ 8 \times 2^{+\ 2\ 8}

Each value stores a sign, a 5 digit mantissa (the precision), and a signed 2 digit exponent
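
Normalizing a value in base 2 amounts to repeatedly halving or doubling it until the mantissa lands in the range from 1 up to (but not including) 2. The following is a minimal sketch of that idea for positive finite values; the function name `normalize_base2` is hypothetical, and the standard library function `frexp` offers a similar decomposition with a different mantissa range.

```c
#include <float.h>

/* Split a positive finite value a into m * 2^n with 1 <= m < 2.
   Returns m and stores n through the pointer. For any other input
   (zero, negative, infinite, NaN) it returns 0 and stores n = 0. */
double normalize_base2(double a, int *n)
{
    double m = a;
    *n = 0;
    if (!(a > 0.0) || a > DBL_MAX)
        return 0.0;

    while (m >= 2.0) { m /= 2.0; (*n)++; }   /* too large: halve */
    while (m < 1.0)  { m *= 2.0; (*n)--; }   /* too small: double */
    return m;
}
```

For the speed of light, \(2.9979 \times 10^{8}\), this sketch yields an exponent of 28 and a mantissa of about 1.1168. Scaling by powers of two is exact in binary floating point, so the loops do not introduce rounding error for normal values.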

Floating Point Arithmetic

Fixed precision means that a floating point value \(\text{fl}(x)\) may not match the target value \(x\)

For example, adding two values with 5 digit precision gives a result that needs 6 digits:

\begin{split} (+\ 7\ .\ 3\ 1\ 2\ 4 \times 10^{-\ 1\ 1}) + (+\ 5\ .\ 8\ 3\ 1\ 7 \times 10^{-\ 1\ 1}) &= +\ 1\ 3\ .\ 1\ 4\ 4\ 1 \times 10^{-\ 1\ 1}\\ &= +\ 1\ .\ 3\ 1\ 4\ 4\ 1 \times 10^{-\ 1\ 0}\\ &\rightarrow +\ 1\ .\ 3\ 1\ 4\ 4 \times 10^{-\ 1\ 0} \end{split}

A non-representable number \(x\) is surrounded by two representable values: \(x_-\) and \(x_+\):

1.3144 < 1.31441 < 1.3145

Rounding options:
- round-by-chopping: the extra digits are discarded, which rounds toward zero: \(\text{fl}(x) = \begin{cases} x_+ & x \leq 0\\ x_- & x > 0\\ \end{cases}\)
- round-to-nearest: \(\text{fl}(x)\) is the closest representable value to \(x\) (ties resolve to the closest even value)
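
The two rounding options can be sketched exactly with integers by holding the mantissa as a whole number that has one digit too many, so that 131441 stands for 1.31441. This representation and the function names are assumptions made for the illustration.

```c
/* The mantissa is held as an integer with one extra digit,
   e.g. 131441 stands for 1.31441. Both functions drop the last digit. */

/* Round-by-chopping: C integer division truncates toward zero. */
long chop_last_digit(long m)
{
    return m / 10;
}

/* Round-to-nearest, with ties resolved to the even neighbor. */
long round_last_digit(long m)
{
    long q = m / 10;                 /* chopped result */
    long r = m % 10;                 /* dropped digit, same sign as m */
    long mag = r < 0 ? -r : r;
    long step = m < 0 ? -1 : 1;

    /* Move away from zero when past the midpoint, or when exactly at
       the midpoint and the chopped result is odd. */
    if (mag > 5 || (mag == 5 && q % 2 != 0))
        q += step;
    return q;
}
```

For 131441 both functions give 13144. They differ for a value such as 131447, where chopping gives 13144 and round-to-nearest gives 13145. A tie such as 131445 stays at the even value 13144.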

Implications

Operations on floating point numbers are not necessarily associative or distributive:

\text{fl}(\text{fl}(x + y) + z) \neq \text{fl}(x + \text{fl}(y + z))

\text{fl}(z \times \text{fl}(x + y)) \neq \text{fl}(\text{fl}(z \times x) + \text{fl}(z \times y))

Cumulative operations can fail:

With 4 digit precision:

9.993 \times 10^1 + 4.000 \times 10^{-2} = 9.997 \times 10^1 \quad (99.93 + 00.04 = 99.97)

9.997\times 10^1 + 4.000\times 10^{-2} = 1.0001\times 10^2 \rightarrow 1.000\times 10^2 \quad (99.97 + 00.04 = 100.01)

1.000\times 10^2 + 4.000\times 10^{-2} = 1.0004\times 10^2 \rightarrow 1.000\times 10^2 \quad (100.0 + 00.04 = 100.04)

Once the sum reaches \(1.000\times 10^2\), adding \(4.000\times 10^{-2}\) no longer changes it

The order of operations changes the result (4 digit precision):

\begin{split} (0.03842 +1.273) -1.221 &= \text{fl}(1.31142) - 1.221 = 1.311 - 1.221 = 0.090\\ 0.03842 +(1.273 -1.221) &= 0.03842 + 0.052 = 0.09042 \end{split}
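
The same effects appear in binary floating point. The sketch below assumes that `float` is IEEE 754 binary32 with the default round-to-nearest mode; the function names are hypothetical, and `volatile` is used so that each intermediate result is actually rounded to `float`.

```c
/* fl(fl(x + y) + z) */
float sum_left(float x, float y, float z)
{
    volatile float t = x + y;
    volatile float r = t + z;
    return r;
}

/* fl(x + fl(y + z)) */
float sum_right(float x, float y, float z)
{
    volatile float t = y + z;
    volatile float r = x + t;
    return r;
}

/* Add inc to total count times, rounding after every step. */
float accumulate(float total, float inc, int count)
{
    for (int i = 0; i < count; i++) {
        volatile float t = total + inc;
        total = t;
    }
    return total;
}
```

Under these assumptions, `sum_left(1e8f, -1e8f, 1.0f)` returns 1 while `sum_right(1e8f, -1e8f, 1.0f)` returns 0, because 1 is lost when added to a value as large as \(10^8\). Likewise, `accumulate(16777216.0f, 1.0f, 100)` returns its starting value, since \(2^{24} + 1\) is not representable in binary32.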

Implementation Quirks

Digital systems implement floating point using binary values
Sign-and-magnitude is used for negative/positive values
Normalization: \(1\) is always the leading bit, so it doesn't have to be stored (implied \(1\))
Exponent bias
- signed exponents are required, but two's complement makes comparisons slower
- instead, the exponent is stored as an unsigned value \(x\) and a static bias \(B\) is subtracted from it

1.m \times 2^{x - B}

where \(m\in \mathbb{Q}\) and \(x\in\mathbb{Z}\) are both binary values

Floating Point Standards

The IEEE 754 standard is the most common for floating point in computing:

standard	C/C++	m bits (stored)	x bits	bias
binary16	N/A	10	5	15
binary32	`float`	23	8	127
binary64	`double`	52	11	1023
binary128	N/A	112	15	16383
binary256	N/A	236	19	262143

Floating point storage within a register:

\underbrace{1}_{\text{sign}}\ \ \underbrace{1\ \ 0\ \ 1\ \ 1\ \ 0}_{\text{exponent}}\ \ \underbrace{0\ \ 1\ \ 1\ \ 0\ \ 1\ \ 0\ \ 0\ \ 0\ \ 1\ \ 0}_{\text{mantissa}}

-1.0110100010_{2}\times 2^{10110_2 - 15}

2^{22 - 15} = 2^7

-10110100.010_{2} = -180.25
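
The decoding steps for the 16-bit example can be sketched with shifts and masks. This is a minimal illustration for normalized values only; the function name `decode_binary16` is hypothetical, and special cases (zero, subnormals, infinity, NaN) are deliberately left out.

```c
#include <stdint.h>

/* Decode a normalized IEEE 754 binary16 bit pattern into a double.
   Zero, subnormals, infinity and NaN are not handled in this sketch. */
double decode_binary16(uint16_t bits)
{
    int sign     = (bits >> 15) & 0x1;    /* 1 bit */
    int exponent = (bits >> 10) & 0x1F;   /* 5 bits, biased by 15 */
    int mantissa = bits & 0x3FF;          /* 10 stored bits */

    double value = 1.0 + mantissa / 1024.0;   /* implied leading 1 */
    int e = exponent - 15;

    /* Scale by 2^e with repeated doubling or halving. */
    for (; e > 0; e--) value *= 2.0;
    for (; e < 0; e++) value /= 2.0;

    return sign ? -value : value;
}
```

The register contents shown above, 1101 1001 1010 0010, correspond to the hexadecimal word 0xD9A2, and `decode_binary16(0xD9A2)` should reproduce the value -180.25.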

Floating Point Standards (32-bit)

Word Size and Endianness

A word is a single unit of data that is stored, handled by an operation, or transmitted; the word size is its number of bits
The smallest addressable data size is usually a byte (8 bits)

Consider a 32-bit floating point value representing \(-\pi\):

\underbrace{1}_{\text{sign}}\ \underbrace{1\ 0\ 0\ 0\ 0\ 0\ 0\ 0}_{\text{exponent (8 bits)}}\ \underbrace{1\ 0\ 0\ 1\ 0\ 0\ 1\ 0\ 0\ 0\ 0\ 1\ 1\ 1\ 1\ 1\ 1\ 0\ 1\ 1\ 0\ 1\ 1}_{\text{mantissa (23 bits)}}

Each 4-bit nibble has \(2^4=16\) possible values, often represented using hexadecimal, and two nibbles form 1 byte:

C0\quad 49\quad 0F\quad DB

Bytes in a sequence can be stored in two orders
- Big-Endian: C0 49 0F DB
- Little-Endian: DB 0F 49 C0 (more common)
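
Byte order matters when a value is rebuilt from individual bytes. The sketch below assembles a 32-bit word from bytes given in either order and reinterprets it as a `float`, assuming `float` is IEEE 754 binary32. All function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Assemble four bytes given in big-endian order into a 32-bit word. */
uint32_t word_from_big_endian(const uint8_t b[4])
{
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
}

/* Assemble four bytes given in little-endian order into a 32-bit word. */
uint32_t word_from_little_endian(const uint8_t b[4])
{
    return ((uint32_t)b[3] << 24) | ((uint32_t)b[2] << 16) |
           ((uint32_t)b[1] << 8)  |  (uint32_t)b[0];
}

/* Reinterpret a 32-bit word as a float (assumes IEEE 754 binary32). */
float float_from_word(uint32_t w)
{
    float f;
    memcpy(&f, &w, sizeof f);
    return f;
}

/* Returns 1 if this machine stores the least significant byte first. */
int host_is_little_endian(void)
{
    uint32_t w = 1;
    uint8_t first;
    memcpy(&first, &w, 1);
    return first == 1;
}
```

Both byte sequences listed above produce the same word, 0xC0490FDB, once the matching function is used, and reinterpreting that word gives a value close to \(-\pi\). Building the word with shifts rather than copying bytes directly keeps the result independent of the host's own byte order.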

Data Dumps and Memory

Discussion

What is stored in this IEEE 754 binary32 register:

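A1\quad 1E\quad EF\quad 3B

The bytes are stored in little-endian order, so reverse them (endianness) to read the word from its most significant byte:

3B\quad EF\quad 1E\quad A1

0011\ 1011\quad 1110\ 1111\quad 0001\ 1110\quad 1010\ 0001

The sign bit is \(0\), the exponent is \(01110111_2 = 119\), and the remaining 23 bits are the mantissa:

+1.11011110001111010100001_2 \times 2^{01110111_2 - 127}

119 - 127 = -8

+1.86812222003936767578 \times 2^{-8}

7.2973524220287799835205078125 \times 10^{-3}

\text{actual}\quad 7.2973525643 \times 10^{-3}

\epsilon= -1.422712200164794921875 \times 10^{-10}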
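
The steps of this exercise can be checked with a short field decoder. This is a minimal sketch for normalized binary32 values, with hypothetical names (`binary32_fields`, `split_binary32`, `value_binary32`); special cases are not handled.

```c
#include <stdint.h>

/* Fields of an IEEE 754 binary32 word. */
struct binary32_fields {
    unsigned sign;       /* 1 bit */
    unsigned exponent;   /* 8 bits, biased by 127 */
    uint32_t mantissa;   /* 23 stored bits */
};

/* Split a 32-bit word into sign, exponent and mantissa. */
struct binary32_fields split_binary32(uint32_t w)
{
    struct binary32_fields f;
    f.sign     = (unsigned)(w >> 31);
    f.exponent = (unsigned)((w >> 23) & 0xFFu);
    f.mantissa = w & 0x7FFFFFu;
    return f;
}

/* Value of a normalized binary32 word. Zero, subnormals,
   infinity and NaN are not handled in this sketch. */
double value_binary32(uint32_t w)
{
    struct binary32_fields f = split_binary32(w);
    double value = 1.0 + (double)f.mantissa / 8388608.0;   /* 2^23 */
    int e = (int)f.exponent - 127;

    for (; e > 0; e--) value *= 2.0;
    for (; e < 0; e++) value /= 2.0;
    return f.sign ? -value : value;
}
```

After reversing the stored bytes, the word is 0x3BEF1EA1. Splitting it should give a sign of 0 and a stored exponent of 119, and `value_binary32(0x3BEF1EA1)` should come out near \(7.29735242 \times 10^{-3}\).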