huffman

command module
v0.0.0-...-8f8e5a0 Latest
Published: Feb 2, 2023 License: BSD-3-Clause Imports: 8 Imported by: 0

README

Huffman

Check Go Report Card

Huffman encoding is an optimal, greedy algorithm for encoding a sequence of tokens. It guarantees that the average length of the encoded output is at most 1 bit per token above the entropy of the original sequence, all with minimal runtime. It is also a lossless compression method. It works by transforming the original sequence so that tokens that appear more often take up less space, while tokens that appear relatively rarely take up more.

For example, in the sequence 'Huffffman, An Efffficient Encoding Method.', we already know that f appears far more often than the other characters. Suppose that originally every character takes 4 bits to store (there are 16 kinds of characters). If the above text sequence is the only thing we care about, what we should do is let the f's take less space (potentially 3 bits or even 2 bits), while allowing h, for example, to take up more bits. Since there are 8 f's and only 1 h, we save 7 bits if we shorten f by 1 bit while extending h by 1 bit. That's basically the idea of Huffman.
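If you want to check those counts yourself, here is a minimal Go sketch (not part of this package's API) that tallies character frequencies in the example string:

```go
package main

import "fmt"

func main() {
	text := "Huffffman, An Efffficient Encoding Method."

	// Tally how often each character appears in the example sequence.
	counts := map[rune]int{}
	for _, r := range text {
		counts[r]++
	}

	fmt.Println("f:", counts['f']) // 8
	fmt.Println("h:", counts['h']) // 1
}
```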

After all that has been said, how do we actually build the Huffman tree? It's actually pretty simple. Imagine a list of tokens sorted by their counts, so the head of the list contains the token that appears the least in the document. What we do next is take the first two tokens, the ones that appear the least and second least, combine them to create a new token, use that new token as a binary node and set it as the parent of the two tokens we just took, and reinsert it into the list so that the list stays sorted. It's a lot to take in, but basically it's four things:

1. Take the first 2 tokens.
2. Store them into nodes.
3. Create a new node whose count is the 2 tokens' counts combined.
4. Insert the new node back into the list such that the list is still sorted.

The process repeats until there is only one node left, and its count is the total count of all tokens in the document. But wait. If we think carefully, we'll find that the only node left in the list is a tree! Think about it. Every time, we take two nodes, create a small binary tree, and insert it back into the list. And we know there is only one node left. So it must be the root of the tree! Voila! We have the Huffman tree.
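Here is a minimal Go sketch of that list-based construction. The `node` type and `build` function are illustrative names, not part of this package's API, and the counts in `main` are made up for the example:

```go
package main

import (
	"fmt"
	"sort"
)

// node is a Huffman tree node: leaves hold a token, internal
// nodes hold the combined count of their two children.
type node struct {
	token       rune
	count       int
	left, right *node
}

// build follows the four steps above: repeatedly take the two
// least-frequent nodes, merge them under a new parent, and
// reinsert the parent so the list stays sorted by count.
func build(counts map[rune]int) *node {
	var list []*node
	for tok, c := range counts {
		list = append(list, &node{token: tok, count: c})
	}
	sort.Slice(list, func(i, j int) bool { return list[i].count < list[j].count })

	for len(list) > 1 {
		// 1 & 2: take the two least-frequent nodes off the front.
		a, b := list[0], list[1]
		list = list[2:]

		// 3: create a parent whose count is the two counts combined.
		parent := &node{count: a.count + b.count, left: a, right: b}

		// 4: reinsert the parent so the list stays sorted by count.
		i := sort.Search(len(list), func(i int) bool { return list[i].count >= parent.count })
		list = append(list[:i], append([]*node{parent}, list[i:]...)...)
	}
	return list[0] // the single remaining node is the root
}

func main() {
	root := build(map[rune]int{'f': 8, 'h': 1, 'u': 2, 'n': 3})
	fmt.Println("root count:", root.count) // 14, the total of all leaf counts
}
```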

And why are we building the tree this way? It turns out that Huffman trees have a very interesting property: every token in the document ends up in a leaf (never an internal node). Remember in the last paragraph I said that the root node's count is the total of all the leaf nodes' counts combined? That is just a property that holds when every parent is the sum of its two children. You can draw it out on paper to prove me wrong! Spoiler alert: You can't. Haha!

We can speed up the above 4 steps even further. After all, it seems we only ever care about taking the minimum out of the list, so we can use a heap to do that. Using a binary heap speeds up the whole process from O(n^2) to O(n lg n). I won't prove it here as it's a little bit involved.
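Here is a sketch of the same construction using Go's container/heap, so each take-two/reinsert step costs O(lg n) instead of O(n). Again, the names (`nodeHeap`, `buildWithHeap`) are illustrative, not this package's API:

```go
package main

import (
	"container/heap"
	"fmt"
)

// node is a Huffman tree node, as in the list-based sketch above.
type node struct {
	token       rune
	count       int
	left, right *node
}

// nodeHeap is a min-heap of nodes ordered by count, so Pop always
// returns the least-frequent node in O(lg n).
type nodeHeap []*node

func (h nodeHeap) Len() int            { return len(h) }
func (h nodeHeap) Less(i, j int) bool  { return h[i].count < h[j].count }
func (h nodeHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *nodeHeap) Push(x interface{}) { *h = append(*h, x.(*node)) }
func (h *nodeHeap) Pop() interface{} {
	old := *h
	n := old[len(old)-1]
	*h = old[:len(old)-1]
	return n
}

// buildWithHeap does the same merging as before, but pulls the two
// minimums from the heap instead of the front of a sorted list.
func buildWithHeap(counts map[rune]int) *node {
	h := &nodeHeap{}
	for tok, c := range counts {
		*h = append(*h, &node{token: tok, count: c})
	}
	heap.Init(h)

	for h.Len() > 1 {
		a := heap.Pop(h).(*node)
		b := heap.Pop(h).(*node)
		heap.Push(h, &node{count: a.count + b.count, left: a, right: b})
	}
	return heap.Pop(h).(*node)
}

func main() {
	root := buildWithHeap(map[rune]int{'f': 8, 'h': 1, 'u': 2, 'n': 3})
	fmt.Println("root count:", root.count) // 14
}
```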

But wait a minute, where are the Huffman codes? Actually, the tree is the code. A token's Huffman code is simply the path from the root of the tree to the leaf that holds it: whenever we go to a left child, a 0 is appended to the end of the code, and whenever we go to a right child, a 1 is appended. That's a mouthful! In short, when traveling from the root down to the leaf where a token lives, append 0 when going to node.left and 1 when going to node.right. This DFS over the tree takes O(n) steps. And that's it! This is how the Huffman code is made!
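Here is a sketch of that DFS, assuming the same illustrative `node` type as above; the hypothetical `codes` function walks the tree and records the 0/1 path for every leaf:

```go
package main

import "fmt"

// node is the same illustrative Huffman tree node used above.
type node struct {
	token       rune
	count       int
	left, right *node
}

// codes walks the tree depth-first, appending "0" when descending to
// the left child and "1" when descending to the right, and records
// the accumulated path once it reaches a leaf.
func codes(n *node, prefix string, out map[rune]string) {
	if n == nil {
		return
	}
	if n.left == nil && n.right == nil { // tokens only live in leaves
		out[n.token] = prefix
		return
	}
	codes(n.left, prefix+"0", out)
	codes(n.right, prefix+"1", out)
}

func main() {
	// A tiny hand-built tree: 'f' is frequent so it sits near the root,
	// 'h' is rare so it sits deeper.
	root := &node{
		left: &node{token: 'f', count: 8},
		right: &node{
			left:  &node{token: 'h', count: 1},
			right: &node{token: 'u', count: 2},
		},
	}

	table := map[rune]string{}
	codes(root, "", table)
	for _, r := range []rune{'f', 'h', 'u'} {
		fmt.Printf("%c -> %s\n", r, table[r]) // f -> 0, h -> 10, u -> 11
	}
}
```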

Documentation


There is no documentation for this package.
