Text Compression Using Huffman Coding

Consider the string:


NO LEMON NO MELON


A common text encoding is UTF-8, which stores each character in at least 1 byte. Every character in this string is ASCII, so each one takes exactly 1 byte.

The string contains 17 characters (including whitespace characters), so the memory required is 17 bytes, or 136 bits.
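
The size figure above can be checked with a short Python sketch. The variable names are illustrative only and are not part of the original slides.

```python
text = "NO LEMON NO MELON"

# Every character here is ASCII, so UTF-8 uses exactly 1 byte per character.
encoded = text.encode("utf-8")
size_bytes = len(encoded)
size_bits = size_bytes * 8

print(size_bytes, "bytes =", size_bits, "bits")
```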

 

Idea

Assign short codes to high-frequency symbols and longer codes to low-frequency symbols.

Huffman Coding

Symbol  Frequency
E       2
M       2
L       2
\s      3
O       4
N       4

(\s denotes the space character.)
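
As a minimal sketch, the symbol frequencies can be counted with Python's collections.Counter. The \s label for the space character follows the table; the sort order used for printing is an arbitrary choice made for this example.

```python
from collections import Counter

text = "NO LEMON NO MELON"
frequencies = Counter(text)  # maps each symbol to its number of occurrences

# Print from least to most frequent; ties are ordered by symbol.
for symbol, freq in sorted(frequencies.items(), key=lambda item: (item[1], item[0])):
    label = r"\s" if symbol == " " else symbol
    print(label, freq)
```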

Huffman Tree Construction

Symbol  Frequency  Code
E       2          010
M       2          011
L       2          110
\s      3          111
O       4          00
N       4          10
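
A code table like this one can be decoded unambiguously because no code is a prefix of another. The sketch below checks that property and decodes a bit string one bit at a time. The function names is_prefix_free and decode are hypothetical helpers written for this example, not part of the slides.

```python
# Code table copied from the slide; " " stands for the \s (space) row.
codes = {"E": "010", "M": "011", "L": "110", " ": "111", "O": "00", "N": "10"}

def is_prefix_free(table):
    """Return True if no code word is a prefix of a different code word."""
    words = list(table.values())
    return not any(a != b and b.startswith(a) for a in words for b in words)

def decode(bits, table):
    """Decode a string of '0'/'1' characters using a prefix-free table."""
    reverse = {code: symbol for symbol, code in table.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in reverse:  # a complete code word has been read
            out.append(reverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a complete code word")
    return "".join(out)

print(is_prefix_free(codes))
print(repr(decode("1000111", codes)))
```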

The resulting code for the string is:

1000111110010011001011110001110110101100010

which contains 43 bits instead of the original 136 bits, a reduction of about 68%.
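
The encoding and the size comparison can be reproduced with a few lines of Python. This is a sketch that hard-codes the code table from the slides; the variable names are assumptions made for the example.

```python
# Code table from the slides; " " stands for the \s (space) row.
codes = {"E": "010", "M": "011", "L": "110", " ": "111", "O": "00", "N": "10"}
text = "NO LEMON NO MELON"

# Concatenate the code word of every character in order.
encoded = "".join(codes[ch] for ch in text)

original_bits = len(text.encode("utf-8")) * 8
saving = 1 - len(encoded) / original_bits  # fraction of bits saved

print(encoded)
print(len(encoded), "bits vs", original_bits, "bits:", f"{saving:.0%} smaller")
```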

Huffman Tree Algorithm

1. Create a leaf node for each symbol and add it to the priority queue.

2. While there is more than one node in the queue:

      1. Remove the two nodes of highest priority (lowest frequency) from the priority queue.

      2. Create a new node with these two nodes as children and with frequency equal to the sum of the two nodes' frequencies.

      3. Add the new node to the queue.

3. The remaining node is the root node, and the Huffman tree is complete.
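
The steps above can be sketched in Python using heapq as the priority queue. This is one possible implementation, not necessarily the one behind the slides: the tuple-based node representation, the tie-breaking counter, and the function names build_huffman_tree and assign_codes are assumptions made for this example, and the text is assumed to be non-empty. Because ties between equal frequencies can be broken in different ways, the individual codes may differ from the table shown earlier, although the total encoded length is the same.

```python
import heapq
from collections import Counter
from itertools import count

def build_huffman_tree(text):
    """Return the root of a Huffman tree for a non-empty string.

    A leaf is a one-character string; an internal node is a (left, right) tuple.
    """
    tie = count()  # tie-breaker so entries with equal frequency never compare nodes
    heap = [(freq, next(tie), symbol) for symbol, freq in Counter(text).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Remove the two lowest-frequency nodes ...
        freq_a, _, node_a = heapq.heappop(heap)
        freq_b, _, node_b = heapq.heappop(heap)
        # ... and add a parent whose frequency is their sum.
        heapq.heappush(heap, (freq_a + freq_b, next(tie), (node_a, node_b)))
    return heap[0][2]

def assign_codes(node, prefix=""):
    """Walk the tree, appending 0 for a left branch and 1 for a right branch."""
    if isinstance(node, str):
        return {node: prefix or "0"}  # a single-symbol text still gets a 1-bit code
    left, right = node
    table = assign_codes(left, prefix + "0")
    table.update(assign_codes(right, prefix + "1"))
    return table

text = "NO LEMON NO MELON"
codes = assign_codes(build_huffman_tree(text))
total_bits = sum(len(codes[ch]) for ch in text)
print(codes)
print(total_bits, "bits")
```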

Time Complexity

Since a heap-based priority queue takes O(log n) time per insertion or removal, and a tree with n leaves has 2n−1 nodes, this algorithm runs in O(n log n) time, where n is the number of unique symbols. The loop runs n−1 times, and each iteration performs two removals and one insertion.

Huffman Encoding

By nxxcxx
