Consider the string:
NO LEMON NO MELON
A common text encoding is UTF-8, which encodes each character in at least 1 byte.
The string contains 17 characters (including whitespace), all of them ASCII, so it takes 17 bytes, or 136 bits.
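As a quick check of the size above, here is a minimal sketch that measures the UTF-8 length of the string. The variable names are only illustrative.

```python
text = "NO LEMON NO MELON"

# Every character here is ASCII, so UTF-8 uses exactly one byte per character.
num_bytes = len(text.encode("utf-8"))
num_bits = num_bytes * 8

print(len(text), num_bytes, num_bits)  # 17 17 136
```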
Idea
Assign short codes to high-frequency symbols and longer codes to low-frequency symbols.
Symbol | Frequency |
---|---|
E | 2 |
M | 2 |
L | 2 |
\s | 3 |
O | 4 |
N | 4 |
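The frequency table can be reproduced with a few lines of Python. This is a small sketch using `collections.Counter`; the name `freq` is an assumption made for illustration.

```python
from collections import Counter

text = "NO LEMON NO MELON"

# Count how often each symbol (including the space) occurs.
freq = Counter(text)

# Print the symbols from least to most frequent, showing the space as \s.
for symbol, count in sorted(freq.items(), key=lambda item: item[1]):
    label = r"\s" if symbol == " " else symbol
    print(f"{label} | {count} |")
```

The order among symbols with equal frequency may differ from the table, but the counts are the same.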
Symbol | Frequency | Code |
---|---|---|
E | 2 | 010 |
M | 2 | 011 |
L | 2 | 110 |
\s | 3 | 111 |
O | 4 | 00 |
N | 4 | 10 |
The resulting code for the string is:
1000111110010011001011110001110110101100010
which contains 43 bits, about 68% smaller than the original 136 bits.
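The encoding can be checked mechanically. The sketch below hard-codes the code table from above and defines two hypothetical helpers, `encode` and `decode`, purely for illustration. Decoding works bit by bit because no code in the table is a prefix of another.

```python
# Code table copied from the table above (" " stands for \s).
CODES = {"E": "010", "M": "011", "L": "110", " ": "111", "O": "00", "N": "10"}

def encode(text, codes):
    """Concatenate the code of every symbol in the text."""
    return "".join(codes[ch] for ch in text)

def decode(bits, codes):
    """Read bits left to right, emitting a symbol whenever a full code is seen."""
    inverse = {code: symbol for symbol, code in codes.items()}
    out, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in inverse:
            out.append(inverse[buffer])
            buffer = ""
    if buffer:
        raise ValueError("trailing bits do not form a complete code")
    return "".join(out)

text = "NO LEMON NO MELON"
bits = encode(text, CODES)
saving = 1 - len(bits) / (8 * len(text))

print(bits)
print(len(bits), f"{saving:.0%}")  # 43 68%
```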
1. Create a leaf node for each symbol and add it to a priority queue.
2. While there is more than one node in the queue:
   1. Remove the two nodes of highest priority (lowest frequency) from the priority queue.
   2. Create a new node with these two nodes as children and with frequency equal to the sum of the two nodes' frequencies.
   3. Add the new node to the queue.
3. The remaining node is the root node and the Huffman tree is complete.
Since priority queue data structures require O(log n) time per insertion, and a tree with n leaves has 2n−1 nodes, this algorithm operates in O(n log n) time, where n is the number of unique symbols.
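The steps above can be sketched in Python with `heapq` as the priority queue. This is one possible implementation under stated assumptions, not a reference one: the function name `huffman_codes`, the tuple-based tree representation, and the tie-breaking counter are all choices made for this example. Ties between equal frequencies may be broken differently than in the table above, so individual codes can differ while the total encoded length stays the same.

```python
import heapq
from collections import Counter
from itertools import count

def huffman_codes(text):
    """Return a dict mapping each symbol in text to its Huffman code."""
    freq = Counter(text)
    tie = count()  # tie-breaker so the heap never has to compare tree nodes

    # Step 1: one leaf per symbol. Entries are (frequency, tie_breaker, node),
    # where a node is either a symbol or a (left, right) tuple.
    heap = [(f, next(tie), symbol) for symbol, f in freq.items()]
    heapq.heapify(heap)

    if not heap:
        return {}
    if len(heap) == 1:
        # A single distinct symbol still needs a one-bit code.
        return {heap[0][2]: "0"}

    # Step 2: repeatedly merge the two lowest-frequency nodes.
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))

    # Step 3: the remaining node is the root. Walk it to assign codes.
    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix

    walk(heap[0][2], "")
    return codes

text = "NO LEMON NO MELON"
codes = huffman_codes(text)
total_bits = sum(len(codes[ch]) for ch in text)
print(codes)
print(total_bits)  # 43
```

Each `heappush` and `heappop` costs O(log n), and the loop runs n−1 times for n distinct symbols, which matches the O(n log n) bound.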