CIT 594, Notes on Huffman encoding/decoding
Spring 2002, David Matuszek

Newline problems

When you decode your file, instead of (or as well as) your output being separated into lines, you may see an additional character at the end of each line. That extra character does not belong there. Where did it come from?

Line separators are implemented differently on different platforms. There are two ASCII characters, CR (carriage return) and LF (line feed), which may be used.

The newline character, '\n', represents the UNIX end of line, that is, the LF character. If you say System.out.print('\n') you get what you asked for, that is, an LF character; if you say System.out.println(), the actual character or characters that are written depends on your operating system. This is why you should prefer System.out.println() over System.out.print('\n').

In the Huffman program, the simplest and best solution is: Don't read a line at a time. Instead, read a character at a time. That way, the CR and LF characters are treated just the same as every other character, and you don't have to do anything special about them. As long as you are using a buffered reader, a large block of data is read in (to a "buffer), and "reading" a character just means getting the next character from the buffer, or refilling the buffer as necessary. Reading one character at a time will not significantly affect the execution time (that's the whole point of buffering).

If you really want to read whole lines, using LineReader or something similar, it might help to know that System.getProperty("line.separator") should return the line separator for the system you are running on.

Computing sizes

When Java writes a Object to an ObjectStream, it also writes out information that identifies the object ("This is a BitVector with fields numberOfBits..."). I'm not sure just how much information it writes out, but in any case it varies with the complexity of the class definition. That means you cannot (easily) compute exactly how many bytes are going to be on the file. For the BitVector, you can approximate the size (in bytes) as computeSize()/8.

When you compute the size of your code table, you should be able to figure out approximately how much data is going to be written out (An int is 4 bytes, a long is 8 bytes, etc.) You don't need to get an exact figure, and you certainly don't need to figure out how much additional identifying information that Java will write out.

It is possible to ask for the size of a file, but I don't know if it is possible when the file is open and being written. (If it is possible, be sure to use flush() before getting the file size.)

If someone knows a great way to compute file sizes beforehand, I would be glad to hear about it (but it will be too late to put in the current assignment).

Measuring execution time

Do not include in your measure the amount of time the user takes in selecting files!

The best way of doing this problem is:

  1. Ask for the input file
  2. Ask for the output file
  3. Read the input (possibly twice) and write the output.

When you do it this way, step 3 is the part whose execution time should be measured. You don't want to measure the time the user spends mousing around and choosing files.

If you do things in some more mixed-up order, you may have to measure the time for two or more parts of the program and add them together to get the final "translation" time.

If you have forgotten how to measure execution time, refer back to the Fibonacci assignment.