Advanced Data representations

User-defined data types

Recall what you have learnt before:

List the data types you have learnt

What is the best way to store each of the following?

To store a direction (e.g. North)
To store a day of the week
To store someone's telephone number
To store someone's telephone number, including the country code and area code
To store a mini phone book, in which each name is associated with a telephone number
To store a number of "tags" related to a post
To store the coordinates of a point (x, y, z)

Common data structures

Array: a container that stores a fixed number of items of the same type
List: similar to an array, but its size can change; items can be inserted and removed
Both arrays and lists use an "index" to access the data
Dictionary: stores data as key-value pairs; a value is accessed using its key. Note that keys must be unique. The data are not ordered.
Set: stores unique items in no particular order

Note: apart from the array, all of these are called "abstract" data structures

Non-Composite user-defined types

User-defined types

The "user" here refers to the programmer
Sometimes primitive data types are not intuitive enough to represent certain kinds of data
Programming languages allow user-defined types
- => more readable code
- => usually shorter code
- => more intuitive for humans to handle
- => fewer programming errors
There are two kinds of user-defined type:
- Composite
- Non-Composite

Enumerated data type

Defines a list of possible "named" values
Under the hood, each value is usually mapped to an integer (e.g. in C)
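
A minimal Python sketch of an enumerated type, using the direction example from earlier. The name Direction and the integer values are illustrative assumptions, not part of the original slides:

```python
from enum import Enum


class Direction(Enum):
    # Each named value is mapped to an integer, as in C-style enums.
    NORTH = 0
    EAST = 1
    SOUTH = 2
    WEST = 3


heading = Direction.NORTH
print(heading.name, heading.value)   # the name and its underlying integer
print(Direction(2))                  # look a member up by its integer value
```

Using named values means a typo such as Direction.NROTH fails immediately, whereas a mistyped string or a wrong integer would go unnoticed.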

Pointer Data type

Stores the memory location (address) of a variable
"Pointing" at a certain location
In Pseudocode:
- ^ means pointing to the location of a variable
- @ means accessing the content of memory location (at that location)

More about pointer

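Pointers usually work with direct memory access
- Found mainly in lower-level languages
- e.g. C, C++
Pointers are very powerful, yet very difficult to debug
Python doesn't have pointers, but it has variable/object references
- Arguments are passed as references to objects, so a function that modifies a mutable object (e.g. a list) changes the caller's object; in that sense object variables always behave like "pointers"
- Immutable values such as integers and strings cannot be changed in place, so they behave as if passed by value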
- (demo)
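
One possible version of the demo, as a small Python sketch. The function and variable names are made up for illustration:

```python
def add_score(scores):
    # Mutates the list object that the caller's variable refers to.
    scores.append(99)


def add_one(number):
    # Rebinds the local name only; the caller's integer is untouched.
    number = number + 1
    return number


marks = [1, 2, 3]
alias = marks            # a second reference to the same list, not a copy
add_score(marks)

count = 10
result = add_one(count)

print(alias)             # the change made through marks is visible here
print(alias is marks)    # both names refer to one object
print(count, result)     # count is unchanged, result holds the new value
```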

Composite user-defined types

Defined with reference to at least one other type
Examples include:
- Record
- Set
- Objects and classes

Record

Contains a fixed number of components (fields)
The components can be of different types

e.g. Employee Record

TYPE TEmployeeRecord
    DECLARE FirstName: STRING
    DECLARE LastName: STRING
    DECLARE DateEmployed: DATE
    DECLARE Salary: CURRENCY
ENDTYPE

To Access:

DECLARE Sam : TEmployeeRecord
Sam.FirstName <- "Sam"
Sam.DateEmployed <- #12/04/1995#
Sam.Salary <- 4100
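
A rough Python equivalent of the record above, sketched with a dataclass. The snake_case field names, the default values, and reading #12/04/1995# as day/month/year are all assumptions made for this example:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class EmployeeRecord:
    # A fixed set of named fields, each with its own type.
    first_name: str = ""
    last_name: str = ""
    date_employed: Optional[date] = None
    salary: float = 0.0


sam = EmployeeRecord()
sam.first_name = "Sam"
sam.date_employed = date(1995, 4, 12)   # assumes the date is day/month/year
sam.salary = 4100

print(sam)
```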

Set

Set (data type)

Similar to a list or array, a set is a data structure that holds multiple items of the same type (primitive or user-defined)
Items in a set must be unique
Supports union, difference and intersection operations, as in mathematics
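
A short Python sketch of these operations, using post tags as the data. The tag values are invented for illustration:

```python
post_a = {"python", "data", "tutorial"}
post_b = {"data", "files"}

all_tags = post_a | post_b       # union: tags on either post
shared = post_a & post_b         # intersection: tags on both posts
only_a = post_a - post_b         # difference: tags on post A only

# Duplicates are dropped automatically when a set is built.
unique_tags = set(["python", "data", "python"])

print(all_tags, shared, only_a, unique_tags)
```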

Think Pair Share

Can you name one scenario that requires a set to store the information?

Dictionary

Contains key-value pairs
Keys and values can be of any data type
The key is needed to store and retrieve the value

TYPE DictionaryEntry
    DECLARE Key: STRING
    DECLARE Value: STRING
ENDTYPE

DECLARE EnglishFrench[0:9999] : DictionaryEntry

To insert:
EnglishFrench["Hello"] <- "Bonjour"

To look up:
OUTPUT EnglishFrench["Hello"]
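
The same insert and lookup can be sketched with Python's built-in dict. The variable name english_french and the fallback text are assumptions for this example:

```python
english_french = {}

# Insert: the key "Hello" is mapped to the value "Bonjour".
english_french["Hello"] = "Bonjour"

# Look up by key.
greeting = english_french["Hello"]

# A missing key raises KeyError with [], so get() is used to supply a fallback.
missing = english_french.get("Goodbye", "(not found)")

print(greeting, missing)
```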

File organisation

Data files on a computer are usually stored in one of two ways:
- Text-based files
- Binary Files

Consider that we need to store the following record on a server:

TYPE TGameRecord
  PlayerID: STRING (20)
  GameTime: DATETIME
  RoundsFired: INTEGER
  Accuracy: FLOAT
ENDTYPE

Text Files

Data is stored as plain text (human-readable)
- E.g. the integer value 1234 is stored as "1234", i.e. four human-readable characters
The format must be predefined
- e.g. the number of data items per line must be known
- In order to identify the data in each field
  - the number of characters per item must be known
  - or a delimiter must be used (as in CSV)
Advantages?
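
A small sketch of the delimiter approach, using Python's csv module and an in-memory buffer in place of a real file. The record values are invented, and storing GameTime as an ISO-style string is an assumption:

```python
import csv
import io

# (PlayerID, GameTime, RoundsFired, Accuracy): illustrative values only.
records = [
    ("player01", "2024-01-15T20:30:00", 42, 0.75),
    ("player02", "2024-01-15T21:05:00", 17, 0.5),
]

buffer = io.StringIO()               # stands in for a text file
csv.writer(buffer).writerows(records)
text = buffer.getvalue()
print(text)

# Reading back: every field arrives as a string and must be converted.
rows = list(csv.reader(io.StringIO(text)))
rounds_fired = int(rows[0][2])
accuracy = float(rows[0][3])
```

Note that the reader returns text for every field, so the program has to know the predefined format in order to convert each item back to its proper type.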

Question time

In CSV and other delimiter-based designs, there is always a problem when the delimiter itself appears in the data. What is the solution?

Group activity

Define how three game records should be stored in a plain text file

Research

For plain-text data storage, there are more advanced standards that allow more flexible schemes. Some examples are:
- XML
- JSON
- YAML
Refer to Classroom for instructions

Binary File

Data is stored in its internal representation
- e.g. the integer 1234 may be stored as a two-byte binary value
Data is stored as records: each record contains a fixed number of fields, and each field has a fixed structure
The type of each field must be well defined, so the size of each field is fixed
For string fields there are two methods:
- Fixed-length string (more efficient)
- Variable-length string
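
A sketch of a fixed-size binary record using Python's struct module. The layout is an assumption chosen for illustration: a 20-byte fixed-length PlayerID, GameTime as an 8-byte Unix timestamp, RoundsFired as a 4-byte integer and Accuracy as a 4-byte float, all little-endian:

```python
import struct

# 20-byte string, unsigned 8-byte int, 4-byte int, 4-byte float.
RECORD_FORMAT = "<20sQif"
RECORD_SIZE = struct.calcsize(RECORD_FORMAT)   # every record has this size


def pack_record(player_id, game_time, rounds_fired, accuracy):
    # Short strings are padded with zero bytes up to the fixed length.
    return struct.pack(RECORD_FORMAT, player_id.encode("ascii"),
                       game_time, rounds_fired, accuracy)


def unpack_record(data):
    raw_id, game_time, rounds_fired, accuracy = struct.unpack(RECORD_FORMAT, data)
    return raw_id.rstrip(b"\x00").decode("ascii"), game_time, rounds_fired, accuracy


packed = pack_record("player01", 1705350600, 42, 0.75)
print(RECORD_SIZE, len(packed))
print(unpack_record(packed))

# The integer 1234 takes two bytes in binary but four characters as text.
print(len(struct.pack("<h", 1234)), len("1234"))
```

Because every record has the same size, record number n starts at byte n * RECORD_SIZE, which is what makes jumping straight to a record possible.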

Discussion Time

What are the advantages and disadvantages of a binary file compared with a text file?

File organisation methods

Serial
Sequential
Direct-access

Serial Files

Records are not organised in any defined order
A new record is simply appended to the end of the file
- E.g. transaction records of customer account activity
- E.g. the event log of a computer system
- As a result, such records end up ordered by time

Sequential Files

Records are ordered by a defined key field
When a new record arrives, it is first appended to the end (as in a serial file), but the file is sorted later
Why?

Direct-access Files

Records are not ordered within the file; instead, there is a method for finding the location (address) of each record.
- Key file
  - A separate sequential file stores the key values, each mapped to the address of the record in the main file
  - Multiple key files are possible
- Hashing algorithm
  - A hashing algorithm generates the address from a given key
  - A collision may occur when hashing, where two different keys share the same address
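
A toy sketch of hashing and a collision in Python. The hash function (key modulo the number of slots), the table size and the keys are all assumptions chosen to keep the example small; a real system would also need a way to resolve the collision:

```python
SLOTS = 10
table = [None] * SLOTS      # stands in for the record slots in the file


def record_address(key):
    # Hypothetical hashing algorithm: remainder after dividing by SLOTS.
    return key % SLOTS


def store(key, record):
    """Store the record; return False if another key already uses the slot."""
    address = record_address(key)
    occupant = table[address]
    if occupant is not None and occupant[0] != key:
        return False        # collision: two different keys, same address
    table[address] = (key, record)
    return True


first = store(1234, "record A")     # 1234 hashes to address 4
second = store(5674, "record B")    # 5674 also hashes to address 4
print(record_address(1234), record_address(5674), first, second)
```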

Discussion Time

What are the advantages and disadvantages of direct access?

Direct-access

Good:
- Access time (the time from request to read) is (virtually) the same for any record
- With sequential access you need to read through the preceding records, so the time varies
Not good:
- Storage space is wasted, as records are not packed together (more on this when we discuss hash tables)

Real Number Representations

Computers have two major methods of representing real numbers:
- Fixed point, and
- Floating point

Fixed Point representations

The location of the "point" (here, the binary point) is fixed
- A predetermined pattern is used, e.g. an 8-bit real number can have
  - bit 1 as the sign, bits 2-6 as the integer part and bits 7-8 as the fractional part

Sign	2^4	2^3	2^2	2^1	2^0	.	2^-1	2^-2

The point lies between the 2^0 and 2^-1 columns

Questions:

Express the following in the above system:
1. 1.5
2. -7.75
The largest binary code in the above system is 0-11111-11. What is its denary value?
What is the smallest (most negative) binary code, and what is its denary value?
Try to express 0.125 in the above system. Can you do it? If not, what should the computer do?
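
To check answers, the 8-bit layout above can be sketched in Python. This assumes the first bit is a plain sign flag (sign and magnitude) rather than two's complement, which the slides do not state explicitly; the function names and the value 5.25 are illustrative:

```python
INT_BITS = 5     # bits 2-6
FRAC_BITS = 2    # bits 7-8


def encode_fixed(value):
    """Return the 8-bit pattern for value, assuming a sign-and-magnitude sign bit."""
    scaled = abs(value) * 2 ** FRAC_BITS        # move the point two places right
    if scaled != int(scaled):
        raise ValueError("value needs more fractional bits than are available")
    if scaled >= 2 ** (INT_BITS + FRAC_BITS):
        raise OverflowError("value is too large for this layout")
    sign = "1" if value < 0 else "0"
    return sign + format(int(scaled), "07b")


def decode_fixed(bits):
    magnitude = int(bits[1:], 2) / 2 ** FRAC_BITS
    return -magnitude if bits[0] == "1" else magnitude


pattern = encode_fixed(5.25)
print(pattern, decode_fixed(pattern))
print(decode_fixed("10010101"))
```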

Discussion Time

Modern computers usually use 64-bit numbers (integer/real)

Fixed point representation is straightforward and easy to understand, but what are its limitations?

Floating point representations

All real numbers can be expressed as:

±M x R^E

Where
- M: Mantissa ( 0 ≤ M < 1 )
- R: Radix
- E: Exponent (integer)

In Denary numbers

R is always 10
e.g. we can express 2.54 as: 0.254 x 10^1
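
A sketch of this conversion in Python, using the decimal module so that the digits stay exact. The function name is made up, and the sign is kept on the mantissa here rather than written separately:

```python
from decimal import Decimal


def to_denary_float(text):
    """Return (M, E) such that the value equals M x 10^E with 0.1 <= |M| < 1."""
    value = Decimal(text)
    if value == 0:
        return Decimal(0), 0
    # adjusted() gives the power of ten of the leading digit.
    exponent = value.adjusted() + 1
    mantissa = value.scaleb(-exponent)    # shift the point by `exponent` places
    return mantissa, exponent


mantissa, exponent = to_denary_float("2.54")
print(mantissa, exponent)                 # the 0.254 x 10^1 example above
print(to_denary_float("1234.5"))
```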

Denary	Floating point notation
25.3
13475
-12.9
0.123
0.00254
-0.00195

Denary	Normalised floating point notation
A
A
A
	B
	B
	B

Your turn

Exchange notebooks. Write three denary values in A and three floating point notation values in B, then return the notebook to its owner, who must convert them

Floating point conversion

Convert the denary real number to binary (M)
1. Put the binary point between the integer and fractional parts
2. If the number is negative, apply 2's complement
Write down the exponent (E) in denary, starting at zero
Normalise M
1. Shift the binary point until the first two bits are different (i.e. the mantissa starts with 0.1 or 1.0)
2. Each shift of the point to the left adds 1 to E; each shift to the right subtracts 1 from E
Convert E into binary using 2's complement

Example: 12-bit floating point

with 8-bit M, 4-bit E
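
The conversion can be sketched in Python for this 12-bit format. It works on the numeric value instead of shifting bit strings by hand, and it assumes both M and E are stored in 2's complement with the binary point after the first mantissa bit; the function names are illustrative, and bits that do not fit in the mantissa are simply dropped:

```python
import math

MANTISSA_BITS = 8
EXPONENT_BITS = 4


def to_float12(value):
    """Return (mantissa_bits, exponent_bits) for a finite denary value."""
    if value == 0:
        return "0" * MANTISSA_BITS, "0" * EXPONENT_BITS
    mantissa, exponent = value, 0
    # Normalised: 0.1... (0.5 <= M < 1) or 1.0... (-1 <= M < -0.5).
    while not (0.5 <= mantissa < 1 or -1 <= mantissa < -0.5):
        if abs(mantissa) >= 1:
            mantissa /= 2          # point moves left, so E goes up
            exponent += 1
        else:
            mantissa *= 2          # point moves right, so E goes down
            exponent -= 1
    limit = 2 ** (EXPONENT_BITS - 1)
    if not -limit <= exponent < limit:
        raise OverflowError("exponent does not fit in the exponent bits")
    scaled = math.floor(mantissa * 2 ** (MANTISSA_BITS - 1))
    m_bits = format(scaled & (2 ** MANTISSA_BITS - 1), f"0{MANTISSA_BITS}b")
    e_bits = format(exponent & (2 ** EXPONENT_BITS - 1), f"0{EXPONENT_BITS}b")
    return m_bits, e_bits


def from_float12(m_bits, e_bits):
    m = int(m_bits, 2) - (2 ** MANTISSA_BITS if m_bits[0] == "1" else 0)
    e = int(e_bits, 2) - (2 ** EXPONENT_BITS if e_bits[0] == "1" else 0)
    return m / 2 ** (MANTISSA_BITS - 1) * 2 ** e


print(to_float12(5.5), to_float12(-5.5))
print(to_float12(0.1), from_float12(*to_float12(0.1)))
```

The last line shows a value that cannot be stored exactly in 8 mantissa bits: converting 0.1 and back gives a slightly smaller number.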

Why Normalisation?

Normalisation moves the binary point so that the first digit after the point is a significant digit
This maximises the precision of the number
The first bit of a positive number is always 0, so the second bit must be 1 to leave room for as many significant digits as possible
Likewise, a negative number must start with 1, so the second bit must be 0 for the maximum number of digits to be stored

Overflow, underflow and precision

Overflow
- The number is too large to be represented
- Results in an overflow exception / error
Underflow
- The number is too small in magnitude (i.e. too close to zero) to be represented
- So it is stored as zero
Losing precision
- There are not enough bits to store the full precision, so some detail is eventually lost
- Precision is determined by the number of bits in the mantissa
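
All three effects can be observed with Python's built-in float, sketched below. This assumes the usual case where a Python float is a 64-bit IEEE 754 double:

```python
import sys

# Overflow: the result is too large, and Python raises an error here.
try:
    too_big = 2.0 ** 1024
except OverflowError:
    too_big = None

# Some operations overflow to infinity instead of raising an error.
silent = 1e308 * 10

# Underflow: halving the smallest positive float gives zero.
underflowed = 5e-324 / 2

# Losing precision: 0.1 and 0.2 cannot be stored exactly in binary.
inexact = 0.1 + 0.2

print(too_big, silent, underflowed, inexact)
print(sys.float_info.mant_dig)     # number of mantissa bits available
```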

[CSAL] Data Representation

By Andy tsui
