Advanced Data representations

User-defined data types

Recall what you learnt before:

  • List data types you've learnt

What is the best way to store each of the following?

  1. To store direction (e.g. North)
  2. To store day of the week
  3. To store someone's telephone number
  4. To store someone's telephone number, including the country code and area code
  5. To store a mini phone book, with each name associated with a telephone number
  6. To store a number of "tags" related to a post
  7. To store coordinates of a point (x, y, z)

Common data structures

  1. Array: a container that stores a fixed number of items of the same type
  2. List: similar to an array, but its size can change; items can be inserted and removed
  3. Both arrays and lists use an "index" to access the data
  4. Dictionary: stores data as key-value pairs; data is accessed using the key. Keys must be unique, and the data is not ordered.
  5. Set: stores unique items in no particular order

Note: Except for the array, all of these are called "abstract" data structures

Non-Composite user-defined types

User-defined types

  • The "user" here refers to the programmer
  • Sometimes primitive data types are not intuitive enough to represent some kinds of data
  • Programming languages allow user-defined types
    • => more readable code
    • => usually shorter code
    • => more intuitive for humans to handle
    • => fewer programming errors
  • Two types of user-defined types:
    • Composite
    • Non-Composite

Enumerated data type

  • Defines a list of possible "named" values
  • Under the hood, each value is usually mapped to an integer (e.g. in C)
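
As an illustration of the two points above, here is a minimal Python sketch using the standard `enum` module. The `Direction` type and its integer values are assumptions chosen for this example (they echo the "direction" question earlier); they are not a fixed convention.

```python
from enum import Enum


class Direction(Enum):
    # Each named value is mapped to an integer chosen by the programmer.
    NORTH = 0
    EAST = 1
    SOUTH = 2
    WEST = 3


heading = Direction.NORTH
print(heading.name)   # the readable name
print(heading.value)  # the integer stored "under the hood"
```

Only the four named values are valid, so a typo such as `Direction.NROTH` fails immediately instead of silently storing a wrong string.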

Pointer Data type

  • Stores the memory location (address) of a variable
  • "Pointing" at a certain location
  • In Pseudocode:
    • ^ means pointing to the location of a variable
    • @ means accessing the content of memory location (at that location)

More about pointer

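  • Pointers usually work with direct memory access
    • Found mainly in lower-level languages
    • e.g. C, C++
  • Pointers are very powerful, yet very difficult to debug
  • Python doesn't have pointers, but it has variable/object references
    • Every variable holds a reference to an object, so arguments are passed by object reference; in other words, variables always behave like "pointers"
    • Mutable objects (e.g. lists) can be changed through that reference, and the caller sees the change
    • Immutable values (e.g. integers, strings) cannot be changed in place, so they behave as if passed by value
    • (demo)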
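
A possible version of the demo, as a minimal sketch: the function and variable names are made up for illustration. It shows that a list changed inside a function is also changed for the caller, while an integer is not.

```python
def add_tag(tags, tag):
    # "tags" refers to the same list object as the caller's variable,
    # so appending here is visible outside the function.
    tags.append(tag)


def increment(n):
    # Integers cannot be changed in place; this only rebinds the local name "n".
    n = n + 1
    return n


post_tags = ["python"]
add_tag(post_tags, "pointers")

count = 5
increment(count)  # return value ignored on purpose

alias = post_tags  # a second name for the same list, not a copy
print(post_tags, count, alias is post_tags)
```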

Composite user-defined types

  • Defined with reference to at least one other type
  • Examples include:
    • Record
    • Set
    • Objects and classes

Record

  • Contains a fixed number of components (fields)
  • The components can be of different types

e.g. Employee Record

 

TYPE TEmployeeRecord
    DECLARE FirstName: STRING
    DECLARE LastName: STRING
    DECLARE DateEmployed: DATE
    DECLARE Salary: CURRENCY
ENDTYPE

 

To Access:

DECLARE Sam : TEmployeeRecord
Sam.FirstName <- "Sam"
Sam.DateEmployed <- #12/04/1995#
Sam.Salary <- 4100
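
For comparison, a rough Python equivalent of the record above can be sketched with a dataclass. This is only one possible mapping: `Decimal` stands in for CURRENCY, the date is read as day/month/year, and the last name "Lee" is a placeholder because the pseudocode never assigns one.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal


@dataclass
class EmployeeRecord:
    # A fixed set of fields, each with its own type.
    first_name: str
    last_name: str
    date_employed: date
    salary: Decimal


# "Lee" is a placeholder value, not taken from the pseudocode.
sam = EmployeeRecord("Sam", "Lee", date(1995, 4, 12), Decimal("4100"))

# Fields are accessed with the same dot notation as in the pseudocode.
sam.salary = Decimal("4200")
print(sam.first_name, sam.date_employed.year, sam.salary)
```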

Set

Set (data type)

  • Similar to a list or an array, a set is a data structure that holds multiple items of the same type (primitive or user-defined)
  • Items in a set must be unique
  • Supports union, difference and intersection operations, as in mathematics
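
A short Python sketch of these properties, using made-up tag names as sample data:

```python
# The duplicate "python" is dropped automatically: items must be unique.
post_tags = {"python", "data", "python"}
other_tags = {"data", "files"}

union = post_tags | other_tags         # items in either set
intersection = post_tags & other_tags  # items in both sets
difference = post_tags - other_tags    # items only in post_tags

print(len(post_tags), union, intersection, difference)
```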

Think Pair Share

  1. Can you name one scenario that requires a set to store the information?

Dictionary

  • Contains key-value pairs
  • The key and the value can be of any data type
  • The key is needed to store and to retrieve the value
TYPE DictionaryEntry
    DECLARE Key: STRING
    DECLARE Value: STRING
ENDTYPE

DECLARE EnglishFrench : ARRAY[0:9999] OF DictionaryEntry

TO INSERT:
EnglishFrench["Hello"] <- "Bonjour"

TO LOOKUP:
OUTPUT EnglishFrench["Hello"]
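
In Python the same idea is available as the built-in `dict` type. The sketch below mirrors the insert and lookup steps; the second entry and the "(not found)" default are illustrative additions.

```python
english_french = {}

# Insert: the key is used to store the value.
english_french["Hello"] = "Bonjour"
english_french["Thanks"] = "Merci"

# Lookup: the key is used to retrieve the value.
print(english_french["Hello"])

# .get() returns a default instead of raising an error for a missing key.
missing = english_french.get("Goodbye", "(not found)")
print(missing)
```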

File organisation

  • Data files on a computer are usually stored in one of two ways:
    • Text-based files
    • Binary files

Suppose we need to store the following record on a server:

TYPE TGameRecord
  PlayerID: STRING (20)
  GameTime: DATETIME
  RoundsFired: INTEGER
  Accuracy: FLOAT
ENDTYPE

Text Files

  • Data is stored as plain text (human readable)
    • E.g. the integer value 1234 is stored as "1234", i.e. 4 human-readable characters
  • The format must be predefined
    • e.g. the number of data items per line must be known
    • To identify the data in each field, either
      • the number of characters per item must be known, or
      • a delimiter is used (as in CSV)
  • Advantages?
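
To make the delimiter idea concrete, here is a small sketch using Python's standard `csv` module. The two game records are invented sample values, and the field order (player, time, rounds fired, accuracy) is an assumption that the writer and the reader must agree on in advance.

```python
import csv
import io

# Hypothetical sample records: player ID, game time, rounds fired, accuracy.
records = [
    ("player01", "2024-01-05 18:30:00", 1234, 0.75),
    ("player02", "2024-01-05 19:10:00", 87, 0.5),
]

buffer = io.StringIO()           # stands in for a text file on disk
csv.writer(buffer).writerows(records)
text = buffer.getvalue()
print(text)

# Reading back: every field comes back as characters, so numbers must be
# converted again using the agreed format.
rows = list(csv.reader(text.splitlines()))
rounds_fired = int(rows[0][2])
print(rows[0], rounds_fired)
```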

Question time

  • In CSV or other delimiter-based designs, there is always an issue when the delimiter itself appears in the data. What is the solution?

Group activity

Define how the data for 3 game records should be stored in a plain text file

Research

  • For plain text data storage, there are more advanced standards that allow more flexible schemes. Some examples are:
    • XML
    • JSON
    • YAML
  • Refer to Classroom for instructions

Binary File

  • Data is stored in its internal representation
    • e.g. the integer 1234 may be stored as a two-byte binary value
  • Data is stored as records: each record contains a fixed number of fields, and each field has a fixed structure
  • The type of each field must be well defined, so the size of each field is fixed
  • For string fields there are two methods:
    • Fixed-length string (more efficient)
    • Variable-length string
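
The sketch below uses Python's standard `struct` module to illustrate the points above. The record layout is an assumption made for this example: a 20-byte player ID, the game time as an 8-byte integer timestamp, a 4-byte integer and a 4-byte float, all little-endian.

```python
import struct

# The integer 1234 stored in two bytes instead of the four characters "1234".
two_bytes = struct.pack("<H", 1234)
print(two_bytes, len(two_bytes))

# Assumed layout: 20-byte string, 8-byte unsigned int, 4-byte int, 4-byte float.
RECORD_FORMAT = "<20sQif"
RECORD_SIZE = struct.calcsize(RECORD_FORMAT)  # every record has the same size


def pack_record(player_id, game_time, rounds_fired, accuracy):
    # Short strings are padded with zero bytes up to the fixed length.
    return struct.pack(RECORD_FORMAT, player_id.encode("ascii"),
                       game_time, rounds_fired, accuracy)


def unpack_record(data):
    raw_id, game_time, rounds_fired, accuracy = struct.unpack(RECORD_FORMAT, data)
    return raw_id.rstrip(b"\x00").decode("ascii"), game_time, rounds_fired, accuracy


packed = pack_record("player01", 1704479400, 1234, 0.75)
print(RECORD_SIZE, unpack_record(packed))
```

Because every record has the same size, the n-th record starts at byte `n * RECORD_SIZE`, which is what makes jumping straight to a record possible.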

Discussion Time

What are the advantages and disadvantages of binary files compared with text files?

File organisation methods

  • Serial
  • Sequential
  • Direct-access

Serial Files

  • Records are not organised in any defined order
  • A new record is simply appended to the end of the file
    • E.g. transaction records of customer account activities
    • E.g. the event log of a computer system
    • Thus, the records above end up ordered by time
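
A minimal sketch of a serial file in Python, assuming a plain text event log with one record per line. The file name and the log entries are made up for the example.

```python
import os
import tempfile


def append_record(path, record):
    # Mode "a" always writes at the end of the file.
    with open(path, "a", encoding="utf-8") as log_file:
        log_file.write(record + "\n")


def read_records(path):
    # Reading returns the records in the order they were written.
    with open(path, encoding="utf-8") as log_file:
        return [line.rstrip("\n") for line in log_file]


log_path = os.path.join(tempfile.mkdtemp(), "events.log")
append_record(log_path, "09:00 login")
append_record(log_path, "09:05 file opened")
append_record(log_path, "09:20 logout")
print(read_records(log_path))
```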

Sequential Files

  • Records are ordered by a defined key field
  • When a new record arrives, it is first appended to the end (as in a serial file), but
  • the file is sorted later
  • Why?

Direct-access Files

  • Records are not ordered within the file; instead, there are methods to find where a record is (its address).
    • Key file
      • A separate sequential file stores each key value, mapped to the address of the record in the main file
      • Multiple key files are possible
    • Hashing algorithm
      • A hashing algorithm is a method to generate the address from a given key
      • A collision may occur when hashing, where two different keys share the same address
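
The hashing approach can be sketched as follows. Everything here is an illustrative assumption: a "file" of 10 record slots held in a list, a toy hash function that sums character codes, and linear probing (trying the next slot) as one possible way to handle a collision.

```python
TABLE_SIZE = 10
slots = [None] * TABLE_SIZE  # stands in for fixed record positions in a file


def hash_address(key):
    # Toy hashing algorithm: sum of character codes, wrapped to the table size.
    return sum(ord(ch) for ch in key) % TABLE_SIZE


def store(key, record):
    address = hash_address(key)
    for step in range(TABLE_SIZE):
        probe = (address + step) % TABLE_SIZE
        # Use the slot if it is free or already holds this key.
        if slots[probe] is None or slots[probe][0] == key:
            slots[probe] = (key, record)
            return probe
    raise RuntimeError("no free slot left")


def fetch(key):
    address = hash_address(key)
    for step in range(TABLE_SIZE):
        probe = (address + step) % TABLE_SIZE
        if slots[probe] is None:
            return None  # an empty slot means the key was never stored
        if slots[probe][0] == key:
            return slots[probe][1]
    return None


# "ab" and "ba" contain the same characters, so they hash to the same address.
first = store("ab", "record for ab")
second = store("ba", "record for ba")
print(hash_address("ab"), hash_address("ba"), first, second, fetch("ba"))
```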

Discussion Time

Advantages and disadvantages of direct access?

Direct-access

  • Good:
    • Access time (the time from request to read) is (virtually) the same for any given record
    • With sequential access you need to go through the records in order, so the time may vary
  • Not good:
    • Storage space is wasted, as records are not packed together (more on this when we discuss hash tables)

Real Number Representations

  • In computers we have two major methods to represent real numbers
    • Fixed point and
    • Floating point

Fixed Point representations

  • The location of the (decimal) "point" is fixed
    • A predetermined pattern is used, e.g. an 8-bit real number can have
      • the first bit as the sign, bits 2-6 as the integer part and bits 7-8 as the fractional part

Sign  2^4  2^3  2^2  2^1  2^0  .  2^-1  2^-2
(the point sits between the 2^0 and 2^-1 columns)
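
To check answers in this 8-bit system, a small decoder can be sketched in Python. It assumes the sign bit works as sign-and-magnitude (1 means negative) and accepts codes written with dashes between the three parts; the function name is made up for this example.

```python
def fixed_to_denary(code):
    """Decode an 8-bit code such as '0-00101-01' (sign, integer part, fraction)."""
    bits = code.replace("-", "")
    if len(bits) != 8 or set(bits) - {"0", "1"}:
        raise ValueError("expected exactly 8 binary digits")
    # The last two bits are worth 2^-1 and 2^-2, so the 7-bit magnitude
    # counts quarters: divide by 4 to place the point.
    magnitude = int(bits[1:], 2) / 4
    return -magnitude if bits[0] == "1" else magnitude


print(fixed_to_denary("0-00101-01"))  # 4 + 1 + 0.25
print(fixed_to_denary("1-00010-10"))  # -(2 + 0.5)
```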

Questions:

  1. Express the following in the above system:
    1. 1.5
    2. -7.75
  2. The largest binary code in the above system is 0-11111-11. What is its denary value?
  3. What is the smallest (most negative) binary code? What is its denary value?
  4. Try to express 0.125 in the above system. Can you do it? If not, what should the computer do?

Discussion Time

In modern computers we usually have 64-bit numbers (integer/real)

Fixed point representation is straightforward and easy to understand, but what are its limitations?

Floating point representations

  • All real numbers can be expressed as: ±M x R^E
  • Where
    • M: Mantissa ( 0 ≤ M < 1 )
    • R: Radix
    • E: Exponent (integer)

In Denary numbers

  • R is always 10
  • e.g. we can express 2.54 as: 0.254 x 10^1
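
The conversion to this denary form can be sketched in Python with the standard `decimal` module, which keeps the digits exact. The function name is an assumption for this example, and the sign is kept on the mantissa.

```python
from decimal import Decimal


def normalise(text):
    """Return (mantissa, exponent) with value = mantissa x 10^exponent."""
    value = Decimal(text)
    if value == 0:
        return Decimal(0), 0
    # adjusted() is the power of ten of the first significant digit;
    # adding 1 moves the point to just in front of that digit.
    exponent = value.adjusted() + 1
    mantissa = value.scaleb(-exponent)
    return mantissa, exponent


print(normalise("2.54"))   # 0.254 x 10^1, as in the example above
print(normalise("-0.05"))  # -0.5 x 10^-1
```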
Denary      | Floating point notation
25.3        |
13475       |
-12.9       |
0.123       |
0.00254     |
-0.00195    |

Denary      | Normalised floating point notation
A           |
A           |
A           |
            | B
            | B
            | B

Your turn

Exchange notebooks and write 3 denary values in A and 3 floating point notation values in B, then return the notebook to its owner to answer them

Floating point conversion

  1. Convert the denary real number to binary (M)
    1. Put the binary point between the integer and fractional parts
    2. If the number is negative, apply 2's complement
  2. Write down an exponent (E) of zero in denary
  3. Normalise M
    1. Shift the binary point until the first two bits are different (i.e. 0.1 or 1.0)
    2. Each left shift adds 1 to E; each right shift subtracts 1 from E
  4. Convert E into binary using 2's complement

Example: 12-bit floating point

with 8-bit M, 4-bit E
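
The conversion steps can be sketched in Python for this 12-bit format. This is a sketch under stated assumptions rather than a reference implementation: both M and E are two's complement, the binary point sits after the first mantissa bit, and bits that do not fit in the mantissa are simply dropped. All function names are made up for the example.

```python
import math

M_BITS = 8  # mantissa width
E_BITS = 4  # exponent width


def to_twos_complement(n, bits):
    # Masking maps a negative n to its two's complement bit pattern.
    return format(n & ((1 << bits) - 1), f"0{bits}b")


def from_twos_complement(bit_string):
    n = int(bit_string, 2)
    if bit_string[0] == "1":  # leading 1 means the value is negative
        n -= 1 << len(bit_string)
    return n


def encode(x):
    if not math.isfinite(x):
        raise ValueError("only finite numbers can be encoded")
    m, e = x, 0
    # Point moves left: the mantissa halves and E goes up by 1.
    while m >= 1 or m < -1:
        m /= 2
        e += 1
    # Point moves right: the mantissa doubles and E goes down by 1.
    while 0 < m < 0.5 or -0.5 <= m < 0:
        m *= 2
        e -= 1
    if not -(1 << (E_BITS - 1)) <= e < (1 << (E_BITS - 1)):
        raise ValueError("exponent does not fit in E_BITS")
    # Keep only the bits that fit in the mantissa.
    m_int = math.floor(m * (1 << (M_BITS - 1)))
    return to_twos_complement(m_int, M_BITS), to_twos_complement(e, E_BITS)


def decode(mantissa, exponent):
    m = from_twos_complement(mantissa) / (1 << (M_BITS - 1))
    return m * 2 ** from_twos_complement(exponent)


print(encode(2.5), encode(-2.5))
print(decode(*encode(0.1)))  # close to 0.1, but not exact
```

Positive results start with 01 and negative results start with 10, which matches the normalisation rule in step 3.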

Why Normalisation?

  • Normalisation moves the binary point so that the first digit after the point is a significant digit
  • This maximises the precision of the number
  • The first bit of a positive number is always 0, so the second bit must be 1 to leave room for as many significant digits as possible afterwards
  • A negative number, however, must start with 1, so the second bit must be 0 for the maximum number of digits to be stored

Overflow, underflow and precision

  • Overflow
    • The number is too large to be represented
    • Overflow exception / error
  • Underflow
    • The number is too small in magnitude (i.e. close to zero) to be represented
    • So it is stored as zero
  • Losing precision
    • There are not enough bits to store the full precision, so some detail is eventually lost
    • Precision is determined by the number of bits in the mantissa
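
These three effects can be observed with Python's built-in `float`, which is a 64-bit floating point number on common platforms. Note that Python reports overflow in two ways: some operations give the special value `inf`, while others raise an `OverflowError`.

```python
import sys

# Overflow: the result is too large to be represented.
overflowed = sys.float_info.max * 2  # becomes the special value "inf"
try:
    2.0 ** 2000
    raised = False
except OverflowError:
    raised = True

# Underflow: the result is too close to zero, so it is stored as zero.
underflowed = 1e-320 / 1e10

# Losing precision: the mantissa runs out of bits, so adding 1 changes nothing.
big = float(2 ** 53)
lost = (big + 1.0 == big)

print(overflowed, raised, underflowed, lost)
```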

[CSAL] Data Representation

By Andy tsui