CIT 594, Ninth Assignment: Huffman encoding/decoding
Spring 2002, David Matuszek

Purposes of this assignment:

Your task:

Write an application (not an applet) to encode and decode text files, using a Huffman encoding.

The program should produce encoded files that are as small as possible. It should decode files exactly, so that the decoded file is identical to the original. It should perform these operations reasonably quickly.

Some approaches and data structures are suggested below. If you find that another data structure or another approach works better, that's fine. You are not required to use any specific data structures.

The main part of the program must be your own code, not taken from elsewhere. If you wish to use code from the book, or if you wish to use some code you find in other books or on the Web (such as, for example, a priority queue ADT), that's fine, so long as you identify in your comments exactly what code was taken from elsewhere, and where you got it from.

To encode files:

  1. Use the AWT's FileDialog or Swing's JFileChooser to locate the file to be encoded. Use another FileDialog or JFileChooser to determine where to put the encoded file.

  2. Read in the file and count the number of times each character occurs in it.
    Although Java stores strings internally as Unicode (16 bits per character), text files--those with a .txt extension--use ASCII (8 bits per character). Java automatically performs the conversion, so you don't have to worry about it.
  3. Compute the Huffman encoding for your text file. This will require the use of multiple data structures:

    1. You need a table of frequencies of each character. Since ASCII characters are only seven or eight bits long (assume eight), you will need a table with 256 entries. As each character is encountered, increment the corresponding frequency in your frequency table.

    2. After all the characters are read in, you need to construct the Huffman tree. Building the Huffman tree involves (1) removing the two smallest values from the frequency table, (2) adding them, and (3) putting the sum back into the frequency table. A priority queue is probably the best way to do this.

    3. You need a 256-location code table: For each possible ASCII value, you need to store the bit representation you computed for it. I recommend Strings containing '0' and '1' characters as the best representation.

    4. Finally, you need a bit vector (which I provide) for holding the encoded text. After all, it's not helpful to turn an eight-bit character into eight '0' and '1' characters.

  4. Read in the file a second time, this time using your code table to turn each character into a bit representation and putting it in the bit vector. (For better speed, you might want to keep the entire file in memory, so you don't have to read it again. I recommend against doing so, because we will test it with some pretty large files.)

  5. Write out a file containing both your coding scheme and the encoded bit vector.

To decode a file:

  1. Use a FileDialog or JFileChooser to locate the file to be decoded. Use another FileDialog or JFileChooser to determine where to put the decoded file.

  2. Read in the coding scheme and the encoded data.

  3. Decode the data into a StringBuffer (why not a String?).

  4. Write out the StringBuffer.

You can probably speed up your program (and certainly cut down on memory requirements) if you write out results as you go, rather than waiting until the entire file is decoded before you start writing it out.

Good advice:

Any program is easier to write if you spend some time planning it out before you start writing code. That goes double for this program. There are a lot of data structures involved, and if you don't think about what you are doing before you get started, you will make a tremendous amount of unnecessary work for yourself. This is definitely an assignment where poor programmers will have to do a lot more work than good programmers!

For example: You have to compute a table of frequencies. You need to put those frequencies (and their associated characters) into a priority queue. Can you make the frequency table an array of Objects, and then turn the table into a heap, instead of having a separate data structure? Would this simplify the program, or make it harder?

For example: You need to create a binary tree. Building the Huffman tree involves building a lot of little binary trees and putting them together. Do you have the tools you need to do this? Might it be easier to just throw away your BinaryTree class and do something program specific, such as adding leftChild and rightChild instance variables to the objects in your frequency table?

For example: You have to create a code table for encoding characters. The code table should be such that you can get the code for a single character in O(1) time. (Basically, this means your table should be an array, and you should use the numeric value of the character as an index into the table.) You cannot easily use the same table for decoding the bit string (think about it). Should you reorganize the table? Maybe you should read in the Huffman tree instead?

Draw pictures! You have a lot of data structures to contend with. Before you start programming, take a very small example (say, four or five characters) and draw every data structure along the way, from the frequency table all the way to the bit vector. Make sure you know how to get from one picture to the next. (One of my favorite examples for things like this is the word "abracadabra"--five different letters, eleven in all.) Don't assume, for example, that because you know how to make binary trees, you can easily get from a frequency table to a binary tree--doing it on paper is easy, but writing a method to do it is considerably more complicated. Be flexible: Don't pick your data structures and commit yourself to them, because sometimes changing your mind and using a slightly different data structure can save you literally hours of programming.

Remember: The whole point is to make the compressed file as small as possible. That means the whole compressed file, including the code table at the front. You should try to make the code table as small as you reasonably can (but you don't want to spend hours over a few bits). Just be sure that everything you write out is something you will use later, when decoding the file again.

Remember: The program should run quickly enough to be pleasant to use, even on very large files. The time you spend constructing your tables is almost certainly inconsequential--a few milliseconds, maybe. However, the time you spend processing each character is critical--if you have to do a table search for every character of a 50MB file, that will be a problem. Once everything is set up, encoding a single character should be an O(1) operation. Decoding will necessarily be slower (because you have to examine every bit), but should still be pretty fast.

The bit vector code:

I haven't posted it yet, because I haven't written the methods to read and write the bit strings (it's not just serializing my data structure). However, here's the API so far:

    public class BitVector implements Serializable {
        BitVector();                                   // constructor
        public void add(boolean booleanBit);  // append a 0 (true) or 1 (false) to this BitVector
        public void add(char ch);                 // append a 0 ('0') or a 1 ('1') to this BitVector
        public void add(String string);           // append bits given by a String of '0's and '1's
        public long computeSize();               // returns the number of bits in this BitVector
        public void writeBitVector(ObjectOutputStream out)   // writes out this BitVector
        public static BitVector readBitVector(ObjectInputStream in)  // reads and returns a BitVector

    }
    public class BitIterator {
        public BitIterator(BitVector bitVector) // constructor
        public boolean hasNext();                 // tests whether any bits remain
        public boolean next();                      // returns the next bit
    }

Notes:

Requirements:

Due date:

Midnight Tuesday, April 16. This program is worth 150 points. Both parts (encryption and decryption) must be present; we can't test whether your encryption works unless we can also decrypt your file.