CIT 594 Assignment 9: Huffman Encoding
Spring 2011, David Matuszek

Purposes of this assignment

General idea of the assignment

The programming project itself will be in two parts: (1) Compress a large amount of text (a complete book) and save it as a binary file, and (2) Read the compressed file and decode it into the exact same text. Since you are to work with a partner, there is a fairly natural division of labor, but you will also need to share quite a bit of common code.

In addition, either you or your partner needs to sign on to Bitbucket and create a repository there. (You can both sign on if you want, but only one of you should create a repository for this particular project.) Whoever creates this repository, you should (1) make it private, (2) invite your partner as either a writer or an administrator, and (3) invite me, at my cis address, as a reader. Alternatively, you can (1) make it public with an unguessable name, (2) invite your partner as a writer or an administrator, and (3) tell me where to find it. This latter approach has the advantage that I can tell my TAs where to get your repository, and you don't have to worry about the 5-person limit on public repositories.

Details

Do test-driven programming. This requires that you proceed in fairly small, steady steps. Also, when you code this way, it makes a lot of sense to commit each time you get a test to pass. I will be able to review your log and tell exactly who did what and when they did it. I won't have the time to look at every log (unless I write a log scraper--maybe I'll do that), but I can easily do some spot checking.

The examples in class of Huffman encoding used one code for each character. For this assignment, though, you are to encode bigrams, groups of two characters. For example,

This is the way the world ends--not with a bang, but a whimper

If I have counted correctly,

Th occurs 1 time
is occurs 1 time
 i occurs 1 time
occurs 1 time
th occurs 3 times
occurs 2 times
wa occurs 1 time
occurs 1 time
wo occurs 1 time
rl occurs 1 time
occurs 1 time
en occurs 1 time
ds occurs 1 time
-- occurs 1 time
no occurs 1 time
occurs 1 time
wi occurs 1 time
 a occurs 2 times
 b occurs 2 times
an occurs 1 time
g, occurs 1 time
ut occurs 1 time
 w occurs 1 time
hi occurs 1 time
mp occurs 1 time
er occurs 1 time
occurs 1 time

Your encoding program should

Your decoding program should

More Details

Write the program to deal with “normal” ASCII text. This uses only a few “control” characters--TAB, CR, and LF--and the characters “space” (0x20) through “tilde” (0x7E). The meaning of the so-called “extended ASCII” codes is highly variable; extended ASCII should never be used in this Internet age.

Create a repository (or go to your partner's repository) on Bitbucket. Clone it to your own computer:
     hg clone https://bitbucket.org/yourRepository directoryName
Get a text file from my Bitbucket repository:
     hg clone https://bitbucket.org/matuszek/cit594 directoryName
This will create a Mercurial repository in whatever directory you named. You only want one text file from it, so you should copy this text file to your own repository, and throw away the rest of the directory.

When you write the encoded file, you will have to put the Huffman encoding table as the first part of the file. Remember that the whole point is to compress (shrink) the file, so think about how small you can make this table. We will look at the final size of the text!

Remember that you need to produce the exact same text. What will you do if the text contains an odd number of characters? Don't overlook this case, and don't assume that "one space more or less doesn't matter." It does.

Two of your classes should be named HuffmanEncoder and HuffmanDecoder. Each of these should have a main method that takes two parameters, the file to read followed by the file to write. Create two jar files in your directory, named encode.jar and decode.jar, respectively. These files should have identical contents (be careful that one jar file does not include the other), but encode.jar should run HuffmanEncoder and decode.jar should run HuffmanDecoder.

Grading

Tuesday April 5, 6AM.

Due date

Zip your project, including the two jar files, and turn in to Blackboard by 6AM Tuesday, April 5 Friday, April 8. Only one zip file per team, please.