ASCII and Unicode

H446 Computer Science

Objectives

  • Use names, symbols and corresponding powers of 2 for binary prefixes, e.g., Ki, Mi
  • Differentiate between the character code of a decimal digit and its pure binary representation
  • Describe how character sets (ASCII and Unicode) are used to represent text 

Large Values

  • Computers process and store large numbers of bytes, often in the order of millions or billions
    • When dealing with large quantities it is more convenient to summarise them using number prefixes
    • A common example is the kilogram, which is equivalent to 1,000 g

Prefixes for Bytes

  • The same number prefixes for decimal values can be used to summarise large quantities of bytes
  • This includes: 
Prefix   Symbol   Multiple
kilo     kB       10^3  = 1,000
mega     MB       10^6  = 1,000,000
giga     GB       10^9  = 1,000,000,000
tera     TB       10^12 = 1,000,000,000,000
peta     PB       10^15 = 1,000,000,000,000,000
exa      EB       10^18 = 1,000,000,000,000,000,000
zetta    ZB       10^21 = 1,000,000,000,000,000,000,000

Incorrect Prefixes

New Prefixes for CS

  • To eliminate the confusion, in 1998 the International Electrotechnical Commission (IEC) established different prefixes to represent multiples of base 2:
Prefix   Symbol   Multiple
kibi     KiB      2^10 = 1,024
mebi     MiB      2^20 = 1,048,576
gibi     GiB      2^30 = 1,073,741,824
tebi     TiB      2^40 = 1,099,511,627,776
pebi     PiB      2^50 = 1,125,899,906,842,624
exbi     EiB      2^60 = 1,152,921,504,606,846,976
zebi     ZiB      2^70 = 1,180,591,620,717,411,303,424
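
The gap between a decimal prefix and its binary counterpart grows with each step. A minimal Python sketch of this, where the dictionary names and the `gap_percent` helper are made up for illustration:

```python
# Decimal (SI) prefixes are powers of 10; binary (IEC) prefixes are powers of 2.
DECIMAL = {"kB": 10**3, "MB": 10**6, "GB": 10**9, "TB": 10**12}
BINARY = {"KiB": 2**10, "MiB": 2**20, "GiB": 2**30, "TiB": 2**40}


def gap_percent(decimal_symbol, binary_symbol):
    """Return how much larger the binary multiple is, as a percentage."""
    d = DECIMAL[decimal_symbol]
    b = BINARY[binary_symbol]
    return (b - d) / d * 100


# Pair each decimal prefix with the binary prefix of the same rank.
for dec, bin_ in zip(DECIMAL, BINARY):
    print(f"{bin_} = {BINARY[bin_]:,} bytes, "
          f"{gap_percent(dec, bin_):.1f}% larger than {dec}")
```

Running it shows the difference rising from a few percent at kilo/kibi to nearly ten percent at tera/tebi, which is why mixing the two sets of prefixes causes confusion.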

Representing Text Characters

If a computer only understands 1s and 0s, what happens when the 'M' key is pressed on the keyboard? 

ASCII Code

In 1963, the American Standard Code for Information Interchange (ASCII) was established to encode the letters, digits and symbols used in English text.

It is a 7-bit character set, giving just 128 (2^7) possible binary codes.

What are the limitations of having only a 7-bit character set? 

Representing Characters

  • Every character on the keyboard is represented by a binary value
  • Uppercase letters (capitals) have different values from lowercase letters
  • Punctuation symbols have their own codes
  • How many characters are there on a standard keyboard?
  • How many bits would be required to represent this many combinations?
  • What character is represented by 0100000 (32)?

95 unique characters

10 digits 0-9

26 lowercase letters

26 uppercase letters

33 special characters
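
The counts above can be checked with a short Python sketch. The variable names are made up for illustration; `chr()` is the built-in that turns a code into its character:

```python
import math

# Printable characters: digits + lowercase + uppercase + special characters.
printable = 10 + 26 + 26 + 33

# Smallest number of bits whose combinations cover every printable character.
bits_needed = math.ceil(math.log2(printable))

# Number of distinct codes that many bits can represent.
codes_available = 2 ** bits_needed

# The character whose code is 0100000 in binary (32 in decimal).
char_32 = chr(0b0100000)

print(printable, bits_needed, codes_available, repr(char_32))
```

The codes left over after the printable characters are assigned are used for non-printing control characters.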

Character form of decimal digits

  • Numeric characters are also encoded
    • The code 0111001 represents the character '9' in ASCII
    • The pure binary byte representing the number 9 would be 00001001₂
  • What are the implications of this difference?
  • What will the following code output?

Hint: ord() will return the Unicode code point of a character
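
The difference between the character '9' and the number 9 can be seen in a short Python sketch. This is an illustrative example, not the code from the original slide, and the variable names are assumptions:

```python
digit_char = "9"

# Character code of '9' (its ASCII / Unicode code point).
code = ord(digit_char)

# The 7-bit pattern stored for the character '9'.
as_char_bits = format(code, "07b")

# The 8-bit pure binary pattern for the number 9.
as_number_bits = format(9, "08b")

# Subtracting the code of '0' converts a digit character to its numeric value.
value = code - ord("0")

print(code, as_char_bits, as_number_bits, value)
```

Because the two bit patterns differ, text read from the keyboard has to be converted before arithmetic can be carried out on it.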

Unicode

  • Unicode was then introduced to standardise the encoding of characters from every language
    • Unicode encodings differ in length: UTF-16 uses 16 or 32 bits per character, UTF-32 always uses 32 bits, and UTF-8 uses between 8 and 32 bits
    • To stay compatible with existing ASCII text, the first 128 Unicode characters were set to be the same as the 128 in ASCII
  • What could be a disadvantage of using 4 bytes per character? 
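
The storage cost of each encoding can be compared with a short Python sketch. The `encoded_sizes` helper is a made-up name; `str.encode()` and the codec names are standard Python. The little-endian codecs are used so that no byte-order mark is added to the count:

```python
def encoded_sizes(ch):
    """Return the number of bytes needed to store ch in each encoding."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")
    return {enc: len(ch.encode(enc)) for enc in encodings}


# An ASCII letter, a Korean syllable and an emoji.
for ch in ("M", "한", "😀"):
    print(ch, ord(ch), encoded_sizes(ch))
```

For plain English text, a fixed 4 bytes per character takes four times the space of a 1-byte encoding, with no extra information stored.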

Advantage of Unicode

  • In Unicode, every character in every language in the world, every mathematical and scientific symbol, etc. can be represented:

 

Español

한국어

Македонски

ਪੰਜਾਬੀ ਦੇ

ελληνικά

 

Have a go at the ASCII/Unicode worksheet on Moodle

Extension: Complete the ASCII Exam Questions on Moodle

By CJackson
