CIT 594 Huffman Hints
Spring 2011, David Matuszek

“Get your data structures correct first, and the rest of the program will write itself.”

                                                                                                —David S. Johnson

Huffman encoding/decoding is a difficult assignment. You have a little more than a week to do this, and I’m sure some of you (maybe many of you) will not get it done in time. Please take the above quote seriously. Think about what data structure to use for each operation you have to do, because the choice of data structure makes a huge difference in how difficult the program is to write and debug.

Input/output

This part is routine, but it still involves a chunk of code.

One thing to consider is that you want to read in the text and take two characters at a time from it. But if the text contains an odd number of characters, you will have to append something (a space?) at the end. Write a method to do this work for you, so you can keep the more interesting code cleaner and easier to understand.
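For instance, such a helper might look something like this. This is only a minimal sketch: the class and method names are made up, and it assumes the whole text has already been read into a String.

    import java.util.ArrayList;
    import java.util.List;

    class BigramSplitter {

        /** Splits the text into two-character strings, appending a space
         *  first if the text has an odd number of characters. */
        static List<String> toBigrams(String text) {
            if (text.length() % 2 != 0) {
                text = text + " ";                 // pad so every bigram is complete
            }
            List<String> bigrams = new ArrayList<String>();
            for (int i = 0; i < text.length(); i += 2) {
                bigrams.add(text.substring(i, i + 2));
            }
            return bigrams;
        }
    }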

Encoding text

You’ve seen how to create a Huffman tree, and you can probably do it on paper. You start by counting how many times each character (or in this assignment, each bigram) occurs. Then you use those numbers to build a binary tree. Then you can use the tree to make a table that tells how to encode each bigram.
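As a rough illustration of that last step, here is one way the tree-building might look in Java. This is a sketch under my own assumptions, not the required design: the Node class and all the names in it are hypothetical, and it assumes you already have a map from each bigram to its count.

    import java.util.Map;
    import java.util.PriorityQueue;

    class HuffmanBuilder {

        /** A tree node: leaves hold a bigram; internal nodes hold two children. */
        static class Node implements Comparable<Node> {
            final String bigram;   // null for internal nodes
            final int count;
            final Node left, right;

            Node(String bigram, int count) {       // leaf constructor
                this.bigram = bigram;
                this.count = count;
                this.left = null;
                this.right = null;
            }

            Node(Node left, Node right) {          // internal-node constructor
                this.bigram = null;
                this.count = left.count + right.count;
                this.left = left;
                this.right = right;
            }

            public int compareTo(Node other) {     // order by count, smallest first
                return this.count - other.count;
            }
        }

        /** Builds a Huffman tree from bigram counts (assumes at least one bigram). */
        static Node buildTree(Map<String, Integer> counts) {
            PriorityQueue<Node> queue = new PriorityQueue<Node>();
            for (Map.Entry<String, Integer> entry : counts.entrySet()) {
                queue.add(new Node(entry.getKey(), entry.getValue()));
            }
            while (queue.size() > 1) {
                Node a = queue.remove();           // the two smallest counts...
                Node b = queue.remove();
                queue.add(new Node(a, b));         // ...merge into one node
            }
            return queue.remove();                 // the root of the finished tree
        }
    }

The priority queue does the interesting work here: at every step the Huffman algorithm needs the two nodes with the smallest counts, and that is exactly what a priority queue hands you.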

[Encoder flowchart]

Decoding text

This part will probably use the same kinds of data structures as the encoder. Probably.

[Decoder flowchart]

General approach

Representing the data

There are as many as three “code tables” in this assignment: the one the encoder uses, the one stored in the binary (compressed) file, and the one the decoder uses.

The encoder and the decoder use the code table for very different purposes, so the code table probably needs to be represented differently in each case. Bear in mind that a well-chosen data structure will make the programming easy; a poorly chosen one will make the programming a nightmare.

You will have to decide how to represent the code table in the binary (compressed) file. I said in class that I didn’t know whether bigrams would result in more compression than single characters; however, a little arithmetic shows that there could be anywhere from 10201 bigrams (101² = 10201, if you don’t use the “control codes”) to 16384 bigrams (128² = 16384, if you use all 128 ASCII codes). So your code table is likely to be bigger than your text. (Conclusion: this will be a terrible encoding technique for any text shorter than a few million characters. Sorry about that!)

In a Huffman encoding, the length of the longest bit sequence for any encoded character (or bigram, in our case) is the depth of the Huffman tree. It would be helpful to have a bound on that length. I think I’ve been able to prove that, if you don’t create encodings for bigrams that never occur in the data, the maximum number of bits required is 22; anyone care to check this? (If you do encode for bigrams that don’t occur, you might create an extremely unbalanced Huffman tree.)

Counting bigrams is easy if you use the right data structure to store the counts. Using the counts to build the Huffman tree is easy if you use the right data structure to store the counts. I do not think the same data structure is appropriate in both cases.
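For what it’s worth, a map from bigram to count is one natural fit for the counting step. A minimal sketch, again with hypothetical names; the bigram list could come from a splitter like the one sketched earlier:

    import java.util.HashMap;
    import java.util.Map;

    class BigramCounter {

        /** Counts how many times each bigram occurs. */
        static Map<String, Integer> countBigrams(Iterable<String> bigrams) {
            Map<String, Integer> counts = new HashMap<String, Integer>();
            for (String bigram : bigrams) {
                Integer old = counts.get(bigram);
                counts.put(bigram, old == null ? 1 : old + 1);
            }
            return counts;
        }
    }

For the tree-building step, a priority queue (as in the earlier sketch) is a more natural fit, since that algorithm repeatedly needs the two smallest counts rather than lookup by bigram.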

You need a Huffman tree in order to determine how to encode bigrams as bit sequences. You need a data structure of some sort (I called it a “decoding tree”) to turn bit sequences back into bigrams. These probably are not the same data structure.
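To make that distinction concrete, here is a sketch of both directions. It uses strings of '0' and '1' characters as a stand-in for real bit-level I/O, and the Node class mirrors the hypothetical one above; neither is meant as the required design.

    import java.util.Map;

    class CodeTables {

        /** Minimal tree node, mirroring the hypothetical Huffman tree above. */
        static class Node {
            String bigram;     // non-null only at leaves
            Node left, right;  // a 0 bit goes left, a 1 bit goes right
        }

        /** Encoder side: walk the tree, recording the path to each leaf as
         *  that leaf's code. Call as buildCodeTable(root, "", table).
         *  Assumes the tree has at least two leaves. */
        static void buildCodeTable(Node node, String path, Map<String, String> table) {
            if (node.bigram != null) {
                table.put(node.bigram, path);          // leaf: the path is the code
            } else {
                buildCodeTable(node.left, path + "0", table);
                buildCodeTable(node.right, path + "1", table);
            }
        }

        /** Decoder side: walk the tree one bit at a time, restarting at the
         *  root whenever a leaf is reached. */
        static String decode(Node root, String bits) {
            StringBuilder text = new StringBuilder();
            Node node = root;
            for (int i = 0; i < bits.length(); i++) {
                node = (bits.charAt(i) == '0') ? node.left : node.right;
                if (node.bigram != null) {             // reached a leaf
                    text.append(node.bigram);
                    node = root;                       // start over for the next bigram
                }
            }
            return text.toString();
        }
    }

Notice the asymmetry: the encoder wants a table keyed by bigram, while the decoder here never looks anything up at all; it just follows bits down the tree. That is one reason the two sides probably want different representations.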

Modularity

Besides Huffman trees, this assignment might use several different data structures. A lot of the program consists of converting from one data structure to another. In the above flowcharts, the code is in boxes, and the data is on arrows between the boxes. This is a first attempt at planning out the program, and should not be taken too seriously.

The important point is that the data will be represented in several different ways, and you have to be able to convert from one representation to another. If these representations are all mixed together in one big program, and you decide to change one of the representations, you will have big problems. Conversely, if you modularize your program, so that the output of one part becomes the input of another part, changing a representation (say, of a “code table”) will be considerably easier. Remember, it’s better to have a lot of small methods than a few big methods. Small, single-purpose methods are easier to test, which means they are easier to get correct.

Working with Mercurial

Mercurial and TDD (Test-Driven Development) go well together: both encourage small, well-tested changes.

Whenever you commit your changes, Mercurial asks you to describe what you have changed. If you have changed only one thing, it’s easy to say what it was. If you have made many changes, it’s difficult to write a meaningful description.

Also, Mercurial lets you back up to any previously committed version. It’s one thing to back up to a version with fewer features; it’s quite another thing to back up to a buggy version, and have to find and fix the same bugs all over again.

TDD, of course, consists of writing a simple test, then writing a method to satisfy the test, then (if necessary) making the test stronger, then fixing the method to satisfy the test, and so on until you have a working, well-tested method that you can feel good about committing.
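For example, one turn of that loop might begin with a small test like this (hypothetical JUnit 4 code, testing the bigram splitter sketched earlier):

    import static org.junit.Assert.assertEquals;

    import java.util.Arrays;
    import org.junit.Test;

    public class BigramSplitterTest {

        @Test
        public void padsOddLengthTextWithASpace() {
            // "abc" has odd length, so the last bigram should be padded
            assertEquals(Arrays.asList("ab", "c "),
                         BigramSplitter.toBigrams("abc"));
        }
    }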

So here’s my recommendation for working with Mercurial: commit every time you complete a small, well-tested change, while the change is still easy to describe.

Finally (this is a direct quote from http://hginit.com/02.html):

When you’re working on a team, your workflow is going to look a lot like this:
  1. If you haven’t done so in a while, get the latest version that everyone else is working off of:
    • hg pull
    • hg up
  2. Make some changes
  3. Commit them (locally)
  4. Repeat steps 2-3 until you’ve got some nice code that you’re willing to inflict on everyone else
  5. When you’re ready to share:
    • hg pull to get everyone else’s changes (if there are any)
    • hg merge to merge them into yours
    • test! to make sure the merge didn’t screw anything up
    • hg commit (the merge)
    • hg push