Zap

zap is a Huffman encoder/decoder for plain text: it can either compress files or take compressed files and decompress them back into legible text. Both input and output are files; the program reads the file to be compressed or decompressed, then writes either the binary or the ASCII text to an output file.

This was the third project for my data structures course at Tufts University and, in my opinion, the most interesting one, largely because of how widely the Huffman compression algorithm is used for text.

The Huffman algorithm relies on storing the frequency of each character, along with a legend that maps each encoded bit sequence back to its original character.

What makes this algorithm so special to me, however, is the way the legend is built: a bottom-up algorithm in which the lowest-frequency, least important characters end up deeper in the tree and are harder to access than higher-frequency ones. To think of a system like this was profound to me, and such a discovery really enticed me to learn more about the field of compression.

It interested me most to learn what ADTs could be used to devise such a legend as seen in the Huffman algorithm: one that emphasizes the quickest access to the most important elements. The data structure that accomplishes this most efficiently is the priority queue, implemented as a min-heap.

The min-heap was most useful in building the Huffman tree, as the top of every min-heap is its smallest element. This invariant provides constant-time access to the minimum, which makes the process faster than if the character nodes were stored in an unsorted list: there, finding the minimum is an \(O(n)\) scan, and repeating that scan for each of the \(n\) merges costs \(O(n) \cdot O(n) = O(n^2)\). With the heap, peeking at the minimum is \(O(1)\), and each removal or insertion is \(O(\log n)\), so building the tree takes only \(O(n \log n)\).
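As a sketch of this step (the node type, field names, and function names here are hypothetical stand-ins, not the course's actual `HuffmanTreeNode`), building the tree with a min-heap might look like:

```cpp
#include <cassert>
#include <queue>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the assignment's node type.
struct HuffmanTreeNode {
    char ch;
    int freq;
    HuffmanTreeNode *left;
    HuffmanTreeNode *right;
    HuffmanTreeNode(char c, int f, HuffmanTreeNode *l = nullptr,
                    HuffmanTreeNode *r = nullptr)
        : ch(c), freq(f), left(l), right(r) {}
};

// Orders the priority queue as a min-heap on frequency.
struct CompareFreq {
    bool operator()(const HuffmanTreeNode *a, const HuffmanTreeNode *b) const {
        return a->freq > b->freq;
    }
};

// Repeatedly join the two lowest-frequency nodes until one tree remains.
HuffmanTreeNode *buildTree(const std::unordered_map<char, int> &freqs) {
    std::priority_queue<HuffmanTreeNode *, std::vector<HuffmanTreeNode *>,
                        CompareFreq> heap;
    for (const auto &p : freqs)
        heap.push(new HuffmanTreeNode(p.first, p.second));
    while (heap.size() > 1) {
        HuffmanTreeNode *a = heap.top(); heap.pop();  // smallest
        HuffmanTreeNode *b = heap.top(); heap.pop();  // second smallest
        // Internal node: frequency is the sum; '\0' marks "no character".
        heap.push(new HuffmanTreeNode('\0', a->freq + b->freq, a, b));
    }
    return heap.empty() ? nullptr : heap.top();
}
```

Each iteration of the loop removes the two cheapest trees and pushes one merged tree, so the heap shrinks by one per pass until only the root remains.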

Another data structure used in this assignment was a list that saved the path to each character in the Huffman tree, so that each code could be found quickly. I stored each path in a pair with its character, so that once the matching HuffmanTreeNode was found, its path was immediately at hand.
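A minimal sketch of that idea, assuming a hypothetical node type and '0'/'1' characters for left/right moves (zap's actual representation may differ):

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the assignment's node type.
struct HuffmanTreeNode {
    char ch;
    HuffmanTreeNode *left;
    HuffmanTreeNode *right;
    HuffmanTreeNode(char c, HuffmanTreeNode *l = nullptr,
                    HuffmanTreeNode *r = nullptr)
        : ch(c), left(l), right(r) {}
};

// Walk the tree, recording (character, path) pairs: '0' = left, '1' = right.
void collectPaths(const HuffmanTreeNode *node, const std::string &path,
                  std::vector<std::pair<char, std::string>> &out) {
    if (node == nullptr) return;
    if (node->left == nullptr && node->right == nullptr) {
        out.push_back({node->ch, path});  // leaf: path is the full code
        return;
    }
    collectPaths(node->left, path + '0', out);
    collectPaths(node->right, path + '1', out);
}
```

Because the traversal carries the path down with it, each leaf receives exactly the sequence of turns taken to reach it.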

An interesting algorithm in this project was the one that serialises the Huffman tree. It is essentially a preorder traversal of the tree (visit the parent, then the left child, then the right child), recursing until a leaf is reached and then appending that leaf's character to the output string.
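A sketch of that traversal, with an assumed format ('I' marking an internal node, 'L' plus the character marking a leaf) that is only illustrative, not the course's actual specification:

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the assignment's node type.
struct HuffmanTreeNode {
    char ch;
    HuffmanTreeNode *left;
    HuffmanTreeNode *right;
    HuffmanTreeNode(char c, HuffmanTreeNode *l = nullptr,
                    HuffmanTreeNode *r = nullptr)
        : ch(c), left(l), right(r) {}
};

// Preorder serialisation: parent first, then left subtree, then right subtree.
std::string serialize(const HuffmanTreeNode *node) {
    if (node == nullptr) return "";
    if (node->left == nullptr && node->right == nullptr)
        return std::string("L") + node->ch;  // leaf: mark it and store its char
    return "I" + serialize(node->left) + serialize(node->right);
}
```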

The most interesting algorithm was to deserialise the tree. This used the wrapper/recursive-helper paradigm and started by making the root. The helper's base cases were running out of characters to read in the serial tree, and reaching a leaf. At a leaf, the program simply makes a new HuffmanTreeNode holding the stored character. If the node is not a leaf, the Huffman tree guidelines guarantee it is an internal node, so the program creates one. This is interesting because the smaller trees below an internal node can simply be treated as trees in their own right. Naturally, that calls for recursion: when the internal node is being allocated, the program recursively builds its left and right subtrees. At the end, the function returns the node, whether it is a null pointer, an internal node with subtrees, or a leaf with a stored character.
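The wrapper/recursive-helper shape described above can be sketched as follows, using an assumed format in which 'I' marks an internal node and 'L' plus a character marks a leaf (not the course's actual format, and a hypothetical node type):

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for the assignment's node type.
struct HuffmanTreeNode {
    char ch;
    HuffmanTreeNode *left;
    HuffmanTreeNode *right;
    HuffmanTreeNode(char c, HuffmanTreeNode *l = nullptr,
                    HuffmanTreeNode *r = nullptr)
        : ch(c), left(l), right(r) {}
};

// Recursive helper: consumes the serial string from position i onward.
HuffmanTreeNode *deserializeHelper(const std::string &s, std::size_t &i) {
    if (i >= s.size()) return nullptr;   // base case: nothing left to read
    if (s[i] == 'L') {                   // base case: a leaf stores its char
        char c = s[i + 1];
        i += 2;
        return new HuffmanTreeNode(c);
    }
    ++i;                                 // consume the 'I' marker
    HuffmanTreeNode *node = new HuffmanTreeNode('\0');
    node->left = deserializeHelper(s, i);   // subtrees are just smaller trees
    node->right = deserializeHelper(s, i);
    return node;
}

// Wrapper: starts reading at the root.
HuffmanTreeNode *deserialize(const std::string &s) {
    std::size_t i = 0;
    return deserializeHelper(s, i);
}
```

The shared index `i` is what lets the left subtree's recursion tell the right subtree's recursion where to resume reading.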

Another interesting algorithm, and one that gave me particular trouble, was decoding the text from the bit string and the deserialised tree. The most effective approach I found was to read each character of the string, move through the tree accordingly, and whenever a leaf was reached, record its character and reset the search to the top of the tree before continuing with the next bit. After the final bit, I deliberately did not reset to the top of the tree: if the last visited node was not a leaf, the input must have been a bad encoding.
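A sketch of that loop, assuming a hypothetical node type and '0'/'1' bits stored as characters (zap's real bit handling will differ):

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the assignment's node type.
struct HuffmanTreeNode {
    char ch;
    HuffmanTreeNode *left;
    HuffmanTreeNode *right;
    HuffmanTreeNode(char c, HuffmanTreeNode *l = nullptr,
                    HuffmanTreeNode *r = nullptr)
        : ch(c), left(l), right(r) {}
};

// Walk the tree bit by bit; emit a character and reset at every leaf.
std::string decode(const std::string &bits, const HuffmanTreeNode *root) {
    std::string out;
    const HuffmanTreeNode *cur = root;
    for (char bit : bits) {
        cur = (bit == '0') ? cur->left : cur->right;
        if (cur->left == nullptr && cur->right == nullptr) {
            out += cur->ch;   // reached a leaf: record its character
            cur = root;       // restart the search at the top of the tree
        }
    }
    if (cur != root)          // input ended mid-tree: the encoding was bad
        throw std::runtime_error("bad encoding");
    return out;
}
```

The final check works because a well-formed input always ends exactly at a leaf, which resets `cur` to the root; anything else means leftover bits.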

A program like zap solidified my confidence in designing recursive methods, something I had struggled with even before Splendor. This was valuable to me because recursive methods are often easier to read and can lead to more comprehensible code. I also got my first peek into compression, which I intend to pursue further by compressing photos and videos too.

While I wish I could share the code, the course agreement from Tufts does not allow me to, but I would be very happy to show it to you.
