Advanced Data representations
User-defined data types
Recall what you learnt before:
- List data types you've learnt
What is the best way to store each of the following?
- To store direction (e.g. North)
- To store day of the week
- To store telephone number of someone
- To store telephone number of someone, with country code, area code as well
- To store a mini phone book, with name and a telephone number associated with the name
- To store a number of "tags" related to a post
- To store coordinates of a point (x, y, z)
Common data structures
- Array: Container that stores a fixed number of items of the same type
- List: Similar to an array, but its size can change; items can be inserted and removed
- Both Array and List use an "index" to access the data
- Dictionary: Stores data as Key-Value pairs; data is accessed using the "Key". Note that each key must be unique. Data are not ordered.
- Set: Stores unique items in no particular order
Note: Except for the array, all of these are called "Abstract" data structures
Non-Composite user-defined types
User-defined types
- The User here refers to programmer
- Sometimes primitive data types are not intuitive enough to represent some kinds of data
- Programming languages allow user-defined types
- => more readable code
- => usually shorter code
- => more intuitive for humans to handle
- => fewer programming errors
- Two types of user-defined types:
- Composite
- Non-Composite
Enumerated data type
- Defines a list of possible "named" values
- Under the hood, each value is usually mapped to an integer (e.g. in C)
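As a rough illustration of an enumerated type, here is a minimal Python sketch using the standard `enum` module. The `Direction` type and its integer values are made up for this example; they simply mirror the "direction" question from the start of the lesson.

```python
from enum import Enum


class Direction(Enum):
    # Each name is mapped to an integer value under the hood.
    NORTH = 0
    EAST = 1
    SOUTH = 2
    WEST = 3


heading = Direction.NORTH
print(heading.name)     # NORTH
print(heading.value)    # 0
print(Direction(2))     # Direction.SOUTH
```

Using named values means a typo such as `Direction.NROTH` fails immediately, whereas a mistyped string or a wrong integer would go unnoticed.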
Pointer Data type
- Store Memory Location of variables
- "Pointing" at a certain location
- In Pseudocode:
- ^ means pointing to the location of a variable
- @ means accessing the content of memory location (at that location)
More about pointer
- Pointers usually work with direct memory access
- Mainly exposed directly in lower-level languages
- e.g. C, C++
- Pointers are very powerful, but can be very difficult to debug
- Python doesn't have pointers, but its variables hold references to objects
- Arguments are passed as references to objects, so a function that changes a mutable object (e.g. a list) changes the caller's object too
- Immutable values such as integers and strings cannot be changed in place, so they behave as if passed by value
- (demo)
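One possible version of this demo is sketched below. The function and variable names are invented for illustration; the point is only to show that two names can refer to the same list object, while rebinding an integer inside a function leaves the caller's variable alone.

```python
def add_tag(tags, tag):
    # Mutates the list object that the caller passed in.
    tags.append(tag)


def add_one(number):
    # Rebinds the local name only; the caller's variable is unaffected.
    number = number + 1
    return number


post_tags = ["python"]
alias = post_tags               # both names refer to the same list object
add_tag(post_tags, "pointers")
print(alias)                    # ['python', 'pointers']
print(alias is post_tags)       # True

count = 5
add_one(count)
print(count)                    # 5
```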
Composite user defined types
- Defined with reference to at least one other type
- Examples include:
- Record
- Set
- Objects and classes
Record
- Contains fixed number of components
- Can be of different types
e.g. Employee Record
    TYPE TEmployeeRecord
        DECLARE FirstName : STRING
        DECLARE LastName : STRING
        DECLARE DateEmployed : DATE
        DECLARE Salary : CURRENCY
    ENDTYPE
To Access:
    DECLARE Sam : TEmployeeRecord
    Sam.FirstName <- "Sam"
    Sam.DateEmployed <- #12/04/1995#
    Sam.Salary <- 4100
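For comparison, a record like this could be sketched in Python with a dataclass. This is only one possible mapping: the snake_case field names, the use of `Decimal` for currency, the surname "Tsang", and reading 12/04/1995 as day/month/year are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal


@dataclass
class EmployeeRecord:
    # A fixed number of components, each with its own type.
    first_name: str
    last_name: str
    date_employed: date
    salary: Decimal


# "Tsang" is a made-up surname; the date is read as day/month/year.
sam = EmployeeRecord("Sam", "Tsang", date(1995, 4, 12), Decimal("4100"))
sam.salary = Decimal("4300")        # fields are accessed with dot notation
print(sam.first_name, sam.salary)   # Sam 4300
```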
Set
Set (data type)
- Similar to list / array, a set is a data structure for multiple items of the same type (primitive/user-defined)
- Items in set must be unique
- Support Union, difference, intersection operations, similar to Math
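The set operations can be tried out with Python's built-in `set` type. The tag values below are invented for the example.

```python
post_a_tags = {"python", "data", "beginner"}
post_b_tags = {"data", "files", "python"}

union = post_a_tags | post_b_tags          # items in either set
intersection = post_a_tags & post_b_tags   # items in both sets
difference = post_a_tags - post_b_tags     # items only in the first set

print(sorted(union))          # ['beginner', 'data', 'files', 'python']
print(sorted(intersection))   # ['data', 'python']
print(sorted(difference))     # ['beginner']

post_a_tags.add("python")     # already present, so the set is unchanged
print(len(post_a_tags))       # 3
```

The results are sorted before printing only because a set has no defined order of its own.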
Think Pair Share
- Can you name one scenario that requires Set to store the information?
Dictionary
- Contains Key-Value pairs
- Key and value can be any data type
- The Key is needed to retrieve and store the value
    TYPE DictionaryEntry
        DECLARE Key : STRING
        DECLARE Value : STRING
    ENDTYPE
    DECLARE EnglishFrench : ARRAY[0:9999] OF DictionaryEntry
To insert:
    EnglishFrench["Hello"] <- "Bonjour"
To look up:
    OUTPUT EnglishFrench["Hello"]
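In Python the same idea is available directly as the built-in `dict` type. A minimal sketch follows; the extra entries and the "(not found)" fallback text are made up for illustration.

```python
english_french = {}

# Insert: the key is used to store the value.
english_french["Hello"] = "Bonjour"
english_french["Thank you"] = "Merci"

# Look up: the key is used to retrieve the value.
print(english_french["Hello"])                       # Bonjour
print(english_french.get("Goodbye", "(not found)"))  # (not found)

# Keys are unique: storing under an existing key replaces the old value.
english_french["Hello"] = "Salut"
print(len(english_french))                           # 2
```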
File organisation
- Data files on a computer are usually stored in one of two ways:
- Text-based files
- Binary Files
Suppose we need to store the following record on a server:
    TYPE TGameRecord
        DECLARE PlayerID : STRING (20)
        DECLARE GameTime : DATETIME
        DECLARE RoundsFired : INTEGER
        DECLARE Accuracy : FLOAT
    ENDTYPE
Text Files
- Data stored as plain text (human readable)
- E.g. the integer value 1234 is stored as "1234", which is four human-readable characters
- The format must be predefined
- e.g. the number of data items per line must be known
- In order to identify the data in each field:
- the number of characters per item must be known
- or a delimiter must be used (as in CSV)
- Advantages?
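To make the delimiter idea concrete, here is a small sketch that writes two game records as comma-separated text and reads them back, using Python's standard `csv` module. The player IDs, times and scores are invented, and an in-memory `io.StringIO` buffer stands in for a real file.

```python
import csv
import io

FIELDS = ["player_id", "game_time", "rounds_fired", "accuracy"]

# Made-up sample records: one record per line, one field per column.
rows = [
    ["player_one", "2024-01-15T18:30:00", 120, 0.62],
    ["player_two", "2024-01-15T18:45:00", 95, 0.71],
]

buffer = io.StringIO()          # stands in for a text file on disk
writer = csv.writer(buffer)
writer.writerow(FIELDS)         # header line documents the predefined format
writer.writerows(rows)

text = buffer.getvalue()
print(text)

# Reading back: every field arrives as text and must be converted.
records = list(csv.DictReader(io.StringIO(text)))
first = records[0]
rounds = int(first["rounds_fired"])
accuracy = float(first["accuracy"])
print(first["player_id"], rounds, accuracy)   # player_one 120 0.62
```

Note that the numbers have to be converted from characters on every read, which is part of the cost of a text format.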
Question time
- In CSV or other delimiter based design, there is always an issue with having the delimiter itself in the data. What is the solution?
Group activity
Define how the data for 3 game records should be stored in a plain text file.
Research
- For plain text data storage, there are more advanced standards that allow more flexible schemes. Some examples are:
- XML
- JSON
- YAML
- Refer to Classroom for instructions
Binary File
- Data stored in their internal representation
- e.g. the integer 1234 may be stored as a two-byte binary value
- Data is stored as records: each record contains a fixed number of fields, and each field has a fixed structure
- The type of each field must be well defined, so the size of each field is fixed
- For String Field there are two methods:
- Fixed-length string - more efficient
- Variable-length string
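A fixed-size binary record can be sketched with Python's standard `struct` module. The layout below (20-byte ID, 8-byte timestamp, 4-byte integer, 4-byte float, little-endian) is an assumption chosen for this example, not a required format, and the sample values are made up.

```python
import struct

# "<" = little-endian with no padding; 20s = fixed-length 20-byte string;
# q = 8-byte integer (Unix timestamp); i = 4-byte integer; f = 4-byte float.
RECORD_FORMAT = "<20sqif"
RECORD_SIZE = struct.calcsize(RECORD_FORMAT)    # every record is 36 bytes


def pack_record(player_id, game_time, rounds_fired, accuracy):
    # struct pads the 20s field with zero bytes if the ID is shorter.
    return struct.pack(RECORD_FORMAT, player_id.encode("ascii"),
                       game_time, rounds_fired, accuracy)


def unpack_record(data):
    raw_id, game_time, rounds_fired, accuracy = struct.unpack(RECORD_FORMAT, data)
    return raw_id.rstrip(b"\x00").decode("ascii"), game_time, rounds_fired, accuracy


print(len(struct.pack("<h", 1234)))     # 2: the integer 1234 in two bytes

record = pack_record("player_one", 1705343400, 120, 0.5)
print(RECORD_SIZE, len(record))         # 36 36
print(unpack_record(record))            # ('player_one', 1705343400, 120, 0.5)
```

Because every record has the same size, record number n starts at byte n * RECORD_SIZE, which is what makes fixed-length fields efficient to locate.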
Discussion Time
What are the advantages and disadvantages of a binary file compared with a text file?
File organisation methods
- Serial
- Sequential
- Direct-access
Serial Files
- Records are not organised in any defined order
- New record is simply appended to the end of the file
- E.g. Transaction record of customer account activities
- Event log of a computer system
- Thus, the records above end up ordered by time
Sequential Files
- Records are ordered by a defined key field
- When a new record arrives, it is first appended to the end (as in a serial file), but
- the file is sorted again later
- Why?
Direct-access Files
- Records are not ordered within the file; instead, there is a method for finding the location (address) of any record:
- Key file
- A separate sequential file stores the key values, each mapped to the address of its record in the main file
- Multiple key files are possible
- Hashing algorithm
- A hashing algorithm is a method of generating the address from a given key
- A collision may occur when hashing, where two different keys are given the same address
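The hashing idea can be illustrated with a deliberately simple toy function. The algorithm below (sum of character codes modulo the number of slots) and the player keys are invented for this sketch; real systems use stronger hash functions.

```python
NUMBER_OF_SLOTS = 10


def hash_address(key):
    # Toy hashing algorithm: add up the character codes of the key,
    # then take the remainder so the address fits the available slots.
    return sum(ord(ch) for ch in key) % NUMBER_OF_SLOTS


for key in ["P01", "P02", "P10"]:
    print(key, "->", hash_address(key))
# P01 -> 7
# P02 -> 8
# P10 -> 7   (collision: same address as P01)
```

"P01" and "P10" contain the same characters, so this toy function gives them the same address, which is exactly the collision case described above.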
Discussion Time
What are the advantages and disadvantages of direct access?
Direct-access
- Good:
- Access time (time from request to read) to any given record is (virtually) the same
- In sequential access you need to go through records so the time may vary
- Not good:
- Waste of storage space, as records are not packed together (will discuss more on hash table)
Real Number Representations
- Computers have two main methods of representing real numbers:
- Fixed point and
- Floating point
Fixed Point representations
- The location of the (decimal) "point" is fixed
- A predetermined pattern is used, e.g. an 8-bit real number can have
- bit 1 as the sign, bits 2-6 as the integer part and bits 7-8 as the fractional part
Sign | 2^4 | 2^3 | 2^2 | 2^1 | 2^0 | 2^-1 | 2^-2 |
---|---|---|---|---|---|---|---|
The binary point sits between the 2^0 and 2^-1 columns.
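The layout above can be explored with a short Python sketch. It assumes the sign bit works as sign-and-magnitude (1 means negative), which is a guess at the intended scheme, and it writes codes in a sign-integer-fraction form such as 0-00101-01. The function names are made up.

```python
def to_fixed_point(value):
    """Encode value as sign (1 bit), integer part (5 bits), fraction (2 bits)."""
    sign = "1" if value < 0 else "0"
    scaled = abs(value) * 4             # two fractional bits: steps of 0.25
    if scaled != int(scaled):
        raise ValueError("needs more than 2 fractional bits")
    if scaled > 0b1111111:
        raise ValueError("too large for 5 integer bits")
    bits = format(int(scaled), "07b")
    return sign + "-" + bits[:5] + "-" + bits[5:]


def from_fixed_point(code):
    sign, integer_bits, fraction_bits = code.split("-")
    magnitude = int(integer_bits + fraction_bits, 2) / 4
    return -magnitude if sign == "1" else magnitude


print(to_fixed_point(5.25))             # 0-00101-01
print(from_fixed_point("1-00010-10"))   # -2.5
```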
Questions:
- Express the following in the above system:
- 1.5
- -7.75
- The largest binary code is 0-11111-11 in the above system, what is the denary value?
- What is the smallest (most negative) binary code? what is the denary value?
- Try to express 0.125 in the above system. Can you do it? If not, what should the computer do?
Discussion Time
Modern computers usually work with 64-bit numbers (integer or real).
Fixed point representation is straightforward and easy to understand, but what are its limitations?
Floating point representations
- All real numbers can be expressed as: ±M x R^E
- Where
- M: Mantissa ( 0 ≤ M < 1 )
- R: Radix
- E: Exponent (integer)
In Denary numbers
- R is always 10
- e.g. we can express 2.54 as: 0.254 x 10^1
Denary | Floating point notation |
---|---|
25.3 | |
13475 | |
-12.9 | |
0.123 | |
0.00254 | |
-0.00195 | |
Denary | Normalised floating point notation |
---|---|
A | |
A | |
A | |
B | |
B | |
B | |
Your turn
Exchange notebooks with a partner. Put 3 denary values in the A rows and 3 floating point notation values in the B rows, then return the notebook to its owner to fill in the answers.
Floating point conversion
- Convert the denary real number to binary; this is the mantissa (M)
- Put the binary point between the integer and fractional parts
- If the number is negative, apply 2's complement
- Start with the exponent (E) at zero, in denary
- Normalise M
- Shift the binary point until the first two bits are different (i.e. 0.1 or 1.0)
- Each shift of the point to the left adds 1 to E; each shift to the right subtracts 1 from E
- Convert E into binary using 2's complement
Example: 12-bit floating point
with 8-bit M, 4-bit E
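The conversion steps can be sketched in Python for this 12-bit format. The sketch assumes both the mantissa and the exponent are stored in 2's complement with the binary point after the first mantissa bit; the function names are made up, and bits that do not fit in the mantissa are simply dropped.

```python
import math
from fractions import Fraction

MANTISSA_BITS = 8
EXPONENT_BITS = 4


def twos_complement(value, bits):
    # Bit pattern of a (possibly negative) integer in the given width.
    return format(value & ((1 << bits) - 1), "0{}b".format(bits))


def encode(number):
    """Return (mantissa, exponent) bit strings for a non-zero number."""
    value = Fraction(number)
    if value == 0:
        raise ValueError("zero cannot be normalised")
    exponent = 0
    # Normalise: positive mantissa 0.1xxxxxx, negative mantissa 1.0xxxxxx.
    while not (Fraction(1, 2) <= value < 1 or -1 <= value < Fraction(-1, 2)):
        if abs(value) >= 1:
            value /= 2          # point moves left, so E goes up by 1
            exponent += 1
        else:
            value *= 2          # point moves right, so E goes down by 1
            exponent -= 1
    if not -(2 ** (EXPONENT_BITS - 1)) <= exponent < 2 ** (EXPONENT_BITS - 1):
        raise OverflowError("exponent does not fit in 4 bits")
    mantissa = math.floor(value * 2 ** (MANTISSA_BITS - 1))
    return (twos_complement(mantissa, MANTISSA_BITS),
            twos_complement(exponent, EXPONENT_BITS))


def decode(mantissa_bits, exponent_bits):
    def signed(bits):
        raw = int(bits, 2)
        return raw - (1 << len(bits)) if bits[0] == "1" else raw
    return signed(mantissa_bits) / 2 ** (MANTISSA_BITS - 1) * 2 ** signed(exponent_bits)


print(encode(2.5))                  # ('01010000', '0010')
print(encode(-2.5))                 # ('10110000', '0010')
print(encode(0.1875))               # ('01100000', '1110')
print(decode("10110000", "0010"))   # -2.5
```

For 2.5 the binary form is 10.1; moving the point two places left gives 0.1010000 with E = 2, which matches the first result.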
Why Normalisation?
- Normalisation moves the binary point so that the first digit after the point is a significant digit
- This maximises the precision of the number
- A positive mantissa always starts with 0, so the second bit must be 1 to leave room for as many significant digits as possible
- A negative mantissa always starts with 1, so the second bit must be 0 for the same reason
Overflow, underflow and precision
- Overflow
- The number is too large to be represented
- Overflow exception / error
- Underflow
- The number is too small in magnitude (i.e. close to zero) to be represented
- So it is stored as zero
- Losing precision
- There are not enough bits to store every digit, so some detail is lost
- Precision is defined by number of bits in Mantissa
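All three effects can be observed with Python's built-in `float`, which is a 64-bit floating point number on typical platforms. This is a small sketch, not an exhaustive test; the variable names are made up.

```python
import sys

largest = sys.float_info.max
print(largest * 10)             # inf: too large, so the result overflows

try:
    10.0 ** 400                 # some operations raise an error instead
except OverflowError as error:
    print("OverflowError:", error)

smallest = 5e-324               # smallest positive value a 64-bit float holds
print(smallest / 2)             # 0.0: underflow, stored as zero

print(0.1 + 0.2 == 0.3)         # False: the mantissa cannot hold these exactly
print(0.1 + 0.2)                # 0.30000000000000004
```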
[CSAL] Data Representation
By Andy tsui