Numbers in Digital Systems
Numerical Methods
David Mayerich
Scalable Tissue Imaging and Modeling (STIM) Laboratory
Department of Electrical and Computer Engineering
Cullen College of Engineering
University of Houston
David Mayerich
STIM Laboratory, University of Houston

Radix
Representing and Converting Base
Binary Numbers and Arithmetic
Bases in Digital Systems
Numerical Bases
- The radix or base is the number of unique digits used to represent a number:
Hexadecimal (base \(16\)): digits \(0\)-\(9\) and A-F
Binary (base \(2\)): digits \(0\) and \(1\)
Octal (base \(8\)): digits \(0\)-\(7\)
Decimal (base \(10\)): digits \(0\)-\(9\)
Radix Points
- The radix point separates the whole part of a number from its fractional part in any base:
Hexadecimal (base \(16\))
Binary (base \(2\))
Octal (base \(8\))
Decimal (base \(10\))
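As a worked illustration of the radix point (the symbols \(d_k\), \(p\), \(q\) and the example value are chosen here and are not taken from the slides), a number with digits \(d_k\) in base \(b\) has the value

\[
x = \sum_{k=-q}^{p} d_k\, b^{k}
\]

where digits with \(k \geq 0\) sit to the left of the radix point and digits with \(k < 0\) sit to the right. For example:

\[
101.101_2 = 1\cdot 2^{2} + 0\cdot 2^{1} + 1\cdot 2^{0} + 1\cdot 2^{-1} + 0\cdot 2^{-2} + 1\cdot 2^{-3} = 5.625_{10}
\]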
Binary Numbers
Registers
Converting Binary Numbers
Binary Arithmetic
Binary
- Modern computers represent numbers using memory cells
- Individual cells can occupy two distinct states: high and low voltage
- Each cell represents one binary digit: high = 1, low = 0
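- Numbers are represented as sequences of digits
- The number of digits defines the number of different values that can be represented:
4 decimal digits \(\rightarrow 10^4\) values
4 binary digits (a nibble) \(\rightarrow 2^4\) values
6 decimal digits \(\rightarrow 10^6\) values
6 binary digits \(\rightarrow 2^6\) values
- Each sequence ranges from a min (every digit \(0\)) to a max (every digit at its largest value)
- A nibble is \(4\) bits and a byte is \(8\) bits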
Reading Binary Numbers
- Initialize a decimal register \(x_{0} = 0\)
- For each binary digit, from most significant to least significant, double \(x\) and add the digit as a decimal value
- Used to be known as "double dabble"
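The doubling procedure above can be sketched in C. This is a minimal illustration rather than code from the slides: the function name `read_binary` and the choice of a text string of `'0'`/`'1'` characters as input are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Convert a string of '0'/'1' characters (most significant digit first)
   into an unsigned value by repeated doubling. */
unsigned long read_binary(const char *digits)
{
    assert(digits != NULL);
    unsigned long x = 0;                      /* x_0 = 0 */
    for (size_t i = 0; digits[i] != '\0'; ++i) {
        assert(digits[i] == '0' || digits[i] == '1');
        /* double the running value, then add the current digit */
        x = 2 * x + (unsigned long)(digits[i] - '0');
    }
    return x;
}
```

For the digits `10001` the running value goes 1, 2, 4, 8, 17, so `read_binary("10001")` returns 17.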
Fractional Binary Numbers
Binary Arithmetic
Arithmetic works the same way in any base
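A small worked example (the operands are chosen here for illustration) shows column addition in base \(2\). A column that sums to \(2\) writes \(0\) and carries \(1\), just as a column that sums to \(10\) does in decimal:

\[
\begin{array}{rl}
  0101_2 & (5_{10}) \\
+\ 0011_2 & (3_{10}) \\ \hline
  1000_2 & (8_{10})
\end{array}
\]

From right to left: \(1+1=10_2\) (write \(0\), carry \(1\)), \(0+1+1=10_2\) (write \(0\), carry \(1\)), \(1+0+1=10_2\) (write \(0\), carry \(1\)), \(0+0+1=1\).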
Representing Integers
Integers
Signed/Unsigned Integers
Overflow
Signed and Unsigned Integers
- Represent negative integer values in registers: \(-17_{10}=-0001\ 0001_2\)
- Sign and magnitude: the leading bit represents the sign, \(\bm{+}\rightarrow 0\) and \(\bm{-}\rightarrow 1\), so in an 8-bit register \(-17_{10}\rightarrow 1001\ 0001_2\)
- One's complement: negative values are the bitwise NOT of positive values, so \(-17_{10}\rightarrow 1110\ 1110_2\)
- Two's complement: negative values are the bitwise NOT \(+1\), so \(-17_{10}\rightarrow 1110\ 1111_2\)
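The three encodings can be compared with a short C sketch. The function names and the fixed 8-bit register width are assumptions made for this example. Each function takes the magnitude of the value and returns the bit pattern that encodes its negative.

```c
#include <assert.h>
#include <stdint.h>

/* 8-bit encodings of -magnitude, valid for 0 < magnitude <= 127. */

/* Sign and magnitude: set the leading bit, keep the magnitude bits. */
uint8_t sign_magnitude(uint8_t magnitude)
{
    assert(magnitude > 0 && magnitude <= 127);
    return (uint8_t)(0x80u | magnitude);
}

/* One's complement: bitwise NOT of the positive value. */
uint8_t ones_complement(uint8_t magnitude)
{
    assert(magnitude > 0 && magnitude <= 127);
    return (uint8_t)~magnitude;
}

/* Two's complement: bitwise NOT of the positive value, plus one. */
uint8_t twos_complement(uint8_t magnitude)
{
    assert(magnitude > 0 && magnitude <= 127);
    return (uint8_t)(~magnitude + 1u);
}
```

For a magnitude of 17 (`0x11`) these return `0x91`, `0xEE`, and `0xEF` respectively.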
Integer Overflow
- What happens when the result exceeds the available register size?
- Consider a \(6\)-bit addition such as \(110010_2 + 011001_2 = 1\ 001011_2\): the sum needs \(7\) bits
- Overflow is defined for unsigned integers:
- Assume an \(n\)-bit operation
- Perform the operation
- Keep the \(n\) least significant bits
- Keeping \(n\) bits is equivalent to the operation \(x\ \text{mod}\ 2^n\):
C standard, §6.2.5/9: "A computation involving unsigned operands can never overflow, because a result that cannot be represented by the resulting unsigned integer type is reduced modulo the number that is one greater than the largest value that can be represented by the resulting type."
- Overflow is undefined for signed integers
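The unsigned overflow rule above can be sketched for an arbitrary register width. The function names and the restriction to widths of 1 to 31 bits are assumptions made for this example. The first function keeps the \(n\) least significant bits with a mask, and the second applies \(\text{mod}\ 2^n\) directly.

```c
#include <assert.h>
#include <stdint.h>

/* Add in an n-bit unsigned register: perform the addition, then keep
   only the n least significant bits. */
uint32_t add_nbit(uint32_t a, uint32_t b, unsigned n)
{
    assert(n >= 1 && n <= 31);
    uint32_t mask = (UINT32_C(1) << n) - 1;   /* n one-bits */
    return (a + b) & mask;
}

/* The same result written as a modulo operation: (a + b) mod 2^n. */
uint32_t add_mod(uint32_t a, uint32_t b, unsigned n)
{
    assert(n >= 1 && n <= 31);
    uint64_t modulus = UINT64_C(1) << n;      /* 2^n */
    return (uint32_t)(((uint64_t)a + b) % modulus);
}
```

With 6 bits, 50 + 25 = 75 does not fit, and both functions return 11, which is 75 mod 64.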
What Does This Mean?
- Every register is limited: \(n\) bits let you represent \(2^n\) different values
- Signed integers require a sign bit to flag \(+\)/\(-\) values, so the maximum magnitude is halved:
unsigned char a; // [0, 255]
signed char b; // [-128, 127]
unsigned int x; // [0, 2^32 - 1], assuming a 32-bit int
int y; // [-2^31, 2^31 - 1], assuming a 32-bit int
- Most integer overflows are undefined: in C/C++ every signed overflow is undefined behavior
- Unsigned integers overflow by "wrapping around" to the minimum value:
unsigned int u = UINT_MAX + 1u; // u = 0 (UINT_MAX is defined in <limits.h>)
- This is the same result as a modulo operation:
unsigned int c = a + b; // a and b are unsigned int here
unsigned int z = (unsigned int)(((unsigned long long)a + b) % (1ULL << 32)); // assumes a 32-bit unsigned int
// c == z
Implementing Floating Point
Floating Point Numbers
Floating Point Arithmetic
Digital Representations
Scientific Notation
- Represent large and small numbers to simplify calculations:
gravitational constant: \(G \approx 6.6743\times 10^{-11}\ \text{m}^3\,\text{kg}^{-1}\,\text{s}^{-2}\)
speed of light: \(c \approx 2.9979\times 10^{8}\ \text{m/s}\)
elementary charge: \(e \approx 1.6022\times 10^{-19}\ \text{C}\)
electric permittivity (of free space): \(\varepsilon_0 \approx 8.8542\times 10^{-12}\ \text{F/m}\)
- Compressed format using a mantissa (or significand) \(m\) and an exponent \(n\): \(m \times b^n\)
- The mantissa is normalized: \(1 \leq |m| < b\) where \(b\) is the base, which ensures one representation for any value
- Computers and calculators often use "E" to denote the exponent: 6.6743E-11
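Normalization can be sketched as a loop that shifts the radix point until the mantissa falls in \([1, 10)\). The function name `to_scientific` and the restriction to base \(10\) are assumptions made for this example. Repeated multiplication and division by 10 introduce small rounding errors, so the mantissa is only approximate in the last digits.

```c
#include <assert.h>

/* Split a finite, nonzero x into a mantissa m and exponent n such that
   x is approximately m * 10^n with 1 <= |m| < 10. */
void to_scientific(double x, double *m, int *n)
{
    assert(m != 0 && n != 0);
    assert(x != 0.0 && x - x == 0.0);   /* x - x is NaN for infinity and NaN */
    double mag = x < 0.0 ? -x : x;
    int e = 0;
    while (mag >= 10.0) { mag /= 10.0; ++e; }   /* move the radix point left */
    while (mag < 1.0)   { mag *= 10.0; --e; }   /* move the radix point right */
    *m = x < 0.0 ? -mag : mag;                  /* restore the sign */
    *n = e;
}
```

For `6.6743e-11` this yields a mantissa close to 6.6743 and an exponent of -11. The standard `printf` conversion `%E` prints values in the same notation.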
Floating Point
- Scientific notation can be used in any base \(b\)
- Specify a precision: number of digits for \(m\) and \(n\)
- Mantissa and exponent can be negative or positive
Example format: a sign with \(5\)-digit precision for the mantissa, and a sign with a \(2\)-digit exponent
Floating Point Arithmetic
- Fixed precision means that a floating point value \(\text{fl}(x)\) may not match the target value \(x\)
- A non-representable number \(x\) is surrounded by two representable values, \(x_-\) and \(x_+\): \(x_- < x < x_+\)
- Rounding options:
- round-by-chopping: \(\text{fl}(x)\) drops the digits beyond the available precision, which rounds toward zero
- round-to-nearest: \(\text{fl}(x)\) is the closest representable value to \(x\) (ties resolve to the neighbor whose last digit is even)
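The two rounding options can be compared exactly by treating the mantissa digits as a decimal integer and discarding its last few digits. This is a sketch under that assumption, and the function names and the `drop` parameter (how many trailing decimal digits to discard) are hypothetical.

```c
#include <assert.h>

/* 10^k for small k (k <= 9 fits in any unsigned long). */
static unsigned long pow10u(unsigned k)
{
    assert(k <= 9);
    unsigned long p = 1;
    while (k--) p *= 10;
    return p;
}

/* Round-by-chopping: discard the last `drop` digits. */
unsigned long round_by_chopping(unsigned long digits, unsigned drop)
{
    return digits / pow10u(drop);
}

/* Round-to-nearest, with ties resolved to the even neighbor. */
unsigned long round_to_nearest(unsigned long digits, unsigned drop)
{
    if (drop == 0) return digits;              /* nothing to discard */
    unsigned long scale = pow10u(drop);
    unsigned long q = digits / scale;          /* lower neighbor */
    unsigned long r = digits % scale;          /* discarded part */
    unsigned long half = scale / 2;
    if (r > half || (r == half && q % 2 == 1)) ++q;
    return q;
}
```

Keeping 5 digits of `314159265` gives `31415` by chopping and `31416` by rounding to nearest. The ties `123455` and `123465` (one digit dropped) both round to the even value `12346`.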
Implications
- Operations on floating point numbers are not necessarily associative or distributive: \((a+b)+c\) may differ from \(a+(b+c)\), and \(a(b+c)\) may differ from \(ab+ac\)
- Cumulative operations can fail: the rounding error of each step carries into the next
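Both effects can be observed with a short C sketch. The function names are hypothetical, and the example assumes that `double` is an IEEE 754 binary64 value. The `volatile` qualifiers are there to force each intermediate result to be rounded to `double`.

```c
#include <assert.h>

/* Returns 1 when (a + b) + c and a + (b + c) round to different doubles. */
int addition_order_matters(double a, double b, double c)
{
    assert(a == a && b == b && c == c);   /* reject NaN inputs */
    volatile double left = a + b;         /* (a + b) + c */
    left = left + c;
    volatile double right = b + c;        /* a + (b + c) */
    right = a + right;
    return left != right;
}

/* Adds `term` to a running sum `count` times. */
double accumulate(double term, unsigned count)
{
    volatile double sum = 0.0;
    for (unsigned i = 0; i < count; ++i)
        sum = sum + term;                 /* each step rounds the result */
    return sum;
}
```

With `a = 1e20`, `b = -1e20`, `c = 1.0` the two groupings give 1 and 0, because the 1 is lost when it is added to `-1e20` first. Likewise, `accumulate(0.1, 10)` does not compare equal to `1.0`, since 0.1 has no exact binary representation.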
Implementation Quirks
- Digital systems implement floating point using binary values
- Sign-and-magnitude is used for negative/positive values
- Normalization: \(1\) is always the leading bit, so it doesn't have to be stored (implied \(1\))
- Exponent bias: signed exponents are required, but two's complement makes comparisons slower, so a static bias \(B\) is introduced
- The represented value is \((-1)^{s}\times m \times 2^{x-B}\), where \(s\) is the sign bit, \(m\) includes the implied leading \(1\), and \(m\in \mathbb{Q}\) and \(x\in\mathbb{Z}\) are both binary values
Floating Point Standards
- The IEEE 754 standard is the most common for floating point in computing:
| standard | C/C++ | m bits (stored) | x bits | bias |
|---|---|---|---|---|
| binary16 | N/A | 10 | 5 | 15 |
| binary32 | float | 23 | 8 | 127 |
| binary64 | double | 52 | 11 | 1023 |
| binary128 | N/A | 112 | 15 | 16383 |
| binary256 | N/A | 236 | 19 | 262143 |
- Floating point storage within a register, from most to least significant bit: sign, exponent, mantissa
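The register layout can be inspected by copying the bits of a `float` into an integer and masking out each field. This is a sketch that assumes `float` is an IEEE 754 binary32 value, which is common but not guaranteed by the C standard. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    unsigned sign;       /* 1 bit */
    unsigned exponent;   /* 8 bits, stored with a bias of 127 */
    uint32_t mantissa;   /* 23 stored bits, implied leading 1 not included */
} binary32_fields;

/* Split a float into its sign, biased exponent, and stored mantissa bits. */
binary32_fields decode_binary32(float value)
{
    assert(sizeof(float) == sizeof(uint32_t));
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);       /* reinterpret without aliasing issues */
    binary32_fields f;
    f.sign     = (unsigned)(bits >> 31);
    f.exponent = (unsigned)((bits >> 23) & 0xFFu);
    f.mantissa = bits & 0x7FFFFFu;
    return f;
}
```

For `1.0f` the fields are sign 0, exponent 127 (an unbiased exponent of 0), and mantissa 0, because the leading 1 is implied.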
Floating Point Standards (32-bit)
Word Size and Endianness
- The word size is the size of a single unit of data stored, handled by an operation, or transmitted
- The smallest addressable data size is usually a byte (8 bits)
Consider a 32-bit floating point value representing \(-\pi\): C0 49 0F DB (1 sign bit, 8 exponent bits, 23 mantissa bits)
- Each 4-bit nibble has \(2^4=16\) possible values, often represented using hexadecimal, so 1 byte is written as two hexadecimal digits
- Bytes in a sequence can be stored in two orders
- Big-Endian: C0 49 0F DB
- Little-Endian (more common): DB 0F 49 C0
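The two byte orders can be produced explicitly with shifts, independent of the machine running the code. The function names below are hypothetical. `host_is_little_endian` inspects how the current machine lays out a 32-bit word in memory.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Store a 32-bit word most significant byte first. */
void store_big_endian(uint32_t word, uint8_t out[4])
{
    assert(out != NULL);
    for (int i = 0; i < 4; ++i)
        out[i] = (uint8_t)(word >> (24 - 8 * i));
}

/* Store a 32-bit word least significant byte first. */
void store_little_endian(uint32_t word, uint8_t out[4])
{
    assert(out != NULL);
    for (int i = 0; i < 4; ++i)
        out[i] = (uint8_t)(word >> (8 * i));
}

/* Returns 1 if the lowest address holds the least significant byte. */
int host_is_little_endian(void)
{
    uint32_t probe = 1;
    uint8_t first;
    memcpy(&first, &probe, 1);
    return first == 1;
}
```

Applied to the word `0xC0490FDB`, the first function writes C0 49 0F DB and the second writes DB 0F 49 C0.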
Data Dumps and Memory
Discussion
- What is stored in this IEEE 754 binary32 register? Take the endianness of the bytes into account.