COMP3010: Algorithm Theory and Design

Daniel Sutantyo, Department of Computing, Macquarie University

7.2 - Huffman Coding

Motivation

Each character in a text file takes 1 byte (its 8-bit ASCII value)
So, if we have a text file with 1000 characters, then it should be 1000 bytes in size.
- can we do better?
- great explanation here: https://www.youtube.com/watch?v=JsTptu56GM8

test1: 3,103 bytes to 262 bytes

test2: 1,042 bytes to 665 bytes

How can we do this?

Each character in a text file takes 1 byte (its 8-bit ASCII value)
However, 1 byte can represent 256 different characters, and in English we simply don't have that many characters,
- even after accounting for capitalisation, numbers, and other symbols (?, #, @, etc.), see the ASCII table
- plus, do we really use all the characters when we write a text document (~, <, >)?
How about using a variable-length code instead?

We really don't have that many characters here, so why don't we use a shorter representation for 'e' (because it occurs a lot)?
For example, we can use '1' to represent 'e', just one bit!
- of course we have to be careful, because if we use '11' to represent 's', then how can we differentiate between "ss" and "eeee"?

"The assignment is due next week. Reeeeeeeeeeeeeeeeeeeeeeeeeeeee"

- anonymous student
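
The ambiguity mentioned above can be checked directly. The following is a minimal sketch, assuming a hypothetical two-character code table (`BAD_CODE`) in which '1' stands for 'e' and '11' stands for 's'; the helper names `encode` and `decodings` are made up for illustration.

```python
# Hypothetical, deliberately broken code table: '1' is a prefix of '11'.
BAD_CODE = {"e": "1", "s": "11"}

def encode(text, table):
    """Concatenate the code word of every character in text."""
    return "".join(table[ch] for ch in text)

def decodings(bits, table):
    """Return every string whose encoding under table equals bits."""
    if not bits:
        return [""]
    results = []
    for ch, word in table.items():
        if bits.startswith(word):
            # Consume this code word, then decode the rest in every possible way.
            results += [ch + rest for rest in decodings(bits[len(word):], table)]
    return results

print(encode("ss", BAD_CODE))      # 1111
print(encode("eeee", BAD_CODE))    # 1111
print(sorted(decodings("1111", BAD_CODE)))
```

Both "ss" and "eeee" produce the same bit string, and the last line lists every text that the decoder could not tell apart from them.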

Prefix codes

We need to use what is known as a prefix code:
- a prefix code is a code in which no code word is a prefix of another code word
- for example, if we choose to represent 'e' with 1, then we cannot use 10, 11, 100, etc., because those start with 1
- as an example, consider the following code:
  - 01 : a, 001 : i, 0001 : d, 0000 : k, 11 : e, 10 : s
- we use shorter representations for characters that occur many times in the document
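
The prefix property of the example code can be verified mechanically. This is a small sketch, assuming the six code words from the slide; the names `CODE`, `is_prefix_free` and `encode` are chosen here for illustration only.

```python
from itertools import permutations

# Code table from the slide.
CODE = {"a": "01", "i": "001", "d": "0001", "k": "0000", "e": "11", "s": "10"}

def is_prefix_free(table):
    """True if no code word is a prefix of a different code word."""
    return not any(b.startswith(a) for a, b in permutations(table.values(), 2))

def encode(text, table):
    """Concatenate the code word of every character in text."""
    return "".join(table[ch] for ch in text)

print(is_prefix_free(CODE))                     # True
print(is_prefix_free({"e": "1", "s": "11"}))    # False
bits = encode("seaside", CODE)
print(len(bits), "bits instead of", 8 * len("seaside"))  # 17 bits instead of 56
```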

Binary prefix codes can be represented nicely using a binary tree
- 01 : a, 001 : i, 0001 : d, 0000 : k, 11 : e, 10 : s
Decoding is easy, starting from the root:
- 0: go left
- 1: go right
- stop at a leaf, output its character, and start again from the root
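
The decoding walk can be sketched as follows. The tree representation (nested dictionaries keyed by '0' for left and '1' for right, with a character stored at each leaf) is an assumption made for this example, not something the slides prescribe.

```python
# Code table from the slide.
CODE = {"a": "01", "i": "001", "d": "0001", "k": "0000", "e": "11", "s": "10"}

def build_tree(table):
    """Build the code tree as nested dicts: key '0' is the left child, '1' the right."""
    root = {}
    for ch, word in table.items():
        node = root
        for bit in word[:-1]:
            node = node.setdefault(bit, {})
        node[word[-1]] = ch  # the last bit leads to a leaf holding the character
    return root

def decode(bits, root):
    """Walk the tree bit by bit; emit a character at each leaf and restart at the root."""
    out = []
    node = root
    for bit in bits:
        node = node[bit]
        if isinstance(node, str):  # reached a leaf
            out.append(node)
            node = root
    return "".join(out)

tree = build_tree(CODE)
print(decode("10110110001000111", tree))  # seaside
```

Because no code word is a prefix of another, the walk never has to look ahead or backtrack.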

Huffman Coding

In 1952, Huffman published a greedy algorithm that produces an optimal prefix code for data compression
The algorithm is quite simple:
- create a single-node tree (a leaf) for each character in the document and use its frequency as the weight of the node
- put all the nodes in a priority queue, least weight on top
- pop two trees from the priority queue, join them under a new root, and put the joined tree back in the priority queue, using their combined weight as its weight
- proceed until there is only one tree in the priority queue
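
The steps above can be sketched in Python using `heapq` as the priority queue. This is one possible implementation under stated assumptions, not the course's reference solution: a leaf is represented by its character, an internal node by a `(left, right)` tuple, the input is assumed to be non-empty, and the sample text "abracadabra" is made up for illustration.

```python
import heapq
from collections import Counter
from itertools import count

def huffman_tree(freq):
    """Return a Huffman tree for a non-empty {char: frequency} mapping."""
    tie = count()  # tie-breaker so the heap never has to compare two trees
    heap = [(w, next(tie), ch) for ch, w in freq.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)  # the two lightest trees
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next(tie), (t1, t2)))  # joined under a new root
    return heap[0][2]

def code_table(tree, prefix=""):
    """Read the code words off the tree: '0' for left, '1' for right."""
    if isinstance(tree, str):
        return {tree: prefix or "0"}  # a one-character document still needs one bit
    left, right = tree
    table = code_table(left, prefix + "0")
    table.update(code_table(right, prefix + "1"))
    return table

text = "abracadabra"
table = code_table(huffman_tree(Counter(text)))
encoded = "".join(table[ch] for ch in text)
print(table)
print(len(encoded), "bits instead of", 8 * len(text))
```

The exact code words depend on how ties between equal weights are broken, but the total encoded length does not.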

Example:
- a : 4, c : 2, e : 9, f : 1, h : 2, j : 1, m : 2, s : 6, w : 1
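
To follow the construction on this example, the sketch below tracks only the weights in the priority queue and records each merge. The function name `merge_trace` is made up for illustration. Which of several equal-weight trees gets merged depends on tie-breaking, so the shape of the tree may vary, but the sequence of weights does not.

```python
import heapq

# Frequencies from the example above.
FREQ = {"a": 4, "c": 2, "e": 9, "f": 1, "h": 2, "j": 1, "m": 2, "s": 6, "w": 1}

def merge_trace(freq):
    """Return the (w1, w2, w1 + w2) merges performed by the greedy algorithm."""
    heap = list(freq.values())
    heapq.heapify(heap)
    steps = []
    while len(heap) > 1:
        w1 = heapq.heappop(heap)  # lightest tree
        w2 = heapq.heappop(heap)  # second lightest tree
        steps.append((w1, w2, w1 + w2))
        heapq.heappush(heap, w1 + w2)
    return steps

for w1, w2, total in merge_trace(FREQ):
    print(f"{w1} + {w2} -> {total}")
```

Nine leaves need eight merges, and the last merge produces a root whose weight is the total number of characters, 28.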

Can we show that this tree is optimal?
- How do we measure it?
  - we know the frequency of each character
  - the depth of a leaf (its distance from the root) is the length of the code for its character
  - so given a tree we know the 'cost' of the tree:
    - multiply the frequency of each character by the depth of the leaf containing that character, then add up these products
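
As a worked instance of this cost measure, the sketch below computes the leaf depths of a Huffman tree for the example frequencies (a : 4, c : 2, e : 9, f : 1, h : 2, j : 1, m : 2, s : 6, w : 1) and compares the cost with a fixed-length code. The helper names are made up for illustration, and the 4-bit fixed-length code is an assumption chosen because nine characters need at least 4 bits each.

```python
import heapq
from itertools import count

FREQ = {"a": 4, "c": 2, "e": 9, "f": 1, "h": 2, "j": 1, "m": 2, "s": 6, "w": 1}

def huffman_depths(freq):
    """Return {char: depth of its leaf} in a Huffman tree for freq."""
    tie = count()  # tie-breaker so the heap never compares the character lists
    heap = [(w, next(tie), [ch]) for ch, w in freq.items()]
    heapq.heapify(heap)
    depth = dict.fromkeys(freq, 0)
    while len(heap) > 1:
        w1, _, chars1 = heapq.heappop(heap)
        w2, _, chars2 = heapq.heappop(heap)
        for ch in chars1 + chars2:
            depth[ch] += 1  # every leaf under the new root moves one level down
        heapq.heappush(heap, (w1 + w2, next(tie), chars1 + chars2))
    return depth

def cost(freq, depth):
    """Total number of bits: frequency times code length, summed over all characters."""
    return sum(freq[ch] * depth[ch] for ch in freq)

print(cost(FREQ, huffman_depths(FREQ)))       # 78
print(cost(FREQ, dict.fromkeys(FREQ, 4)))     # 112
```

The individual depths can differ with tie-breaking, but every Huffman tree for these frequencies has the same cost of 78 bits.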

We can show that the greedy choice is safe to make
- use proof by contradiction (the standard defense)
- in our construction, we always put the characters with the lowest frequencies at the bottom of the tree (can you show this?)
- let \(x\) and \(y\) be the two characters with the lowest frequencies (they must be siblings)

if the resulting tree is not optimal, then there is an optimal tree where \(x\) and \(y\) are NOT at the lowest level
suppose we swap \(x\) with another character \(a\) that sits higher up in the tree
- let's suppose \(a\) occurs just one more time than \(x\) and \(a\) is only one level above \(x\) (if both have the same frequency, then the swap changes nothing and our tree is just as good)

let \(f(x)\) and \(f(a)\) be the frequencies of \(x\) and \(a\) respectively, and let \(h(x)\) and \(h(a)\) be the heights of \(x\) and \(a\), measured as the distance from the root (i.e. the code lengths)
if we swap \(x\) and \(a\), we will need fewer bits to represent \(x\) but more bits to represent \(a\)
- before swap:
  - cost of representing \(x\) : \(h(x) * f(x)\)
  - cost of representing \(a\) : \(h(a) * f(a) = (h(x) - 1) * (f(x)+1)\)
- after swap (\(h(x)\) refers to the original height of \(x\))
  - cost of representing \(x\) : \((h(x)-1) * f(x)\)
  - cost of representing \(a\) : \(h(x) * (f(x)+1)\)

cost before swap:
- \(h(x) * f(x) + (h(x)-1)(f(x)+1) = 2h(x)f(x) - f(x) + h(x)-1 \)
cost after swap:
- \((h(x)-1)*f(x) + h(x)\left(f(x)+1\right) = 2h(x)f(x) - f(x) + h(x) \)
difference (after - before) is \(1\)

if \(f(a) = f(x) + m\) for some \(m \ge 1\) and \(h(a) = h(x) - n\) for some \(n \ge 1\) then
- cost before swap:
  - \(h(x) * f(x) + (h(x)-n)(f(x)+m) = 2h(x)f(x) - nf(x) + mh(x)-nm \)
- cost after swap:
  - \((h(x)-n)*f(x) + h(x)\left(f(x)+m\right) = 2h(x)f(x) - nf(x) + mh(x) \)
difference (after - before) is \(mn\), which is always positive
so the alternative tree cannot be optimal, since it always costs more than ours
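
The algebra above can be double-checked numerically. This sketch uses made-up helper names (`cost_pair`, `swap_penalty`) and evaluates the before and after costs for a range of values of \(h(x)\), \(f(x)\), \(m\) and \(n\), confirming that the difference is always \(mn\).

```python
def cost_pair(height_x, freq_x, height_a, freq_a):
    """Bits spent on x and a together: height times frequency for each."""
    return height_x * freq_x + height_a * freq_a

def swap_penalty(h, f, m, n):
    """x has frequency f at height h; a has frequency f + m at height h - n.
    Return (cost after swapping x and a) - (cost before)."""
    before = cost_pair(h, f, h - n, f + m)
    after = cost_pair(h - n, f, h, f + m)
    return after - before

# Check the identity for many combinations with m >= 1 and 1 <= n < h.
assert all(
    swap_penalty(h, f, m, n) == m * n
    for h in range(2, 12)
    for f in range(1, 12)
    for m in range(1, 6)
    for n in range(1, h)
)
print(swap_penalty(h=5, f=3, m=2, n=1))  # 2
```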

optimal substructure can also be shown using the usual cut-and-paste argument
we can discuss this during the workshop
it may also be good practice for you to write an implementation
