CIT 591 Assignment 4: HTML Tidy
Fall 2014, David Matuszek

Purposes of this assignment

General idea of the assignment

HTML is the language of the web. When you open a page in your browser, your browser gets a plain (unformatted) text file, with an .html (or sometimes .htm) extension. This file contains both text and commands to format that text. Sometimes it isn't very neat. Your job is to "tidy" it up and produce the equivalent .html file, but which is properly indented and does not contain excessively long lines.

Put your program on a file named tidyHTML.py. Your program should have a method named main. When executed, main should ask for the name of an HTML file, tidy it up, and (after some other file manipulations) put the result back on the same file. The extra file manipulations, described below, should prevent a bad program from accidentally destroying its input file.

What you need to know about HTML

HTML consists of text, most of which you will see on the screen, and tags, which you won't see, but which affect how things are formatted. There are two kinds of tags: Grouping tags, which have both a start tag and an end tag with enclosed text, and empty content tags.

Grouping tags have the syntax <tagname>enclosed text</tagname>, for example, <strong>word</strong> will make the word bold. After the tagname there may be whitespace followed by other information, for example, <span class="head">enclosed text</span>. Tags are case-insensitive, so for example, <Strong>enclosed text</sTrOnG> is legal. Tagnames consist only of letters--no digits or other characters.

If a < occurs in an HTML document, and the next character is a letter, it's a tag. But if the next character is not a letter, the < is just another text character, not part of a tag.

An example of an empty content tag is <hr>, which just draws a horizontal line. Standalone tags are sometimes written with a following slash, as <hr/> or <hr />.

Grouping tags may be nested. For example, to make word both bold and italic, the HTML might be <strong><em>word</em></strong>. Grouping tags may also be incorrectly nested, as in <strong><em>word</strong></em>. I will give you some rules about what to do in the simplest cases of bad nesting, but most of the time there isn't much you can reasonably do.

The empty content tags are: area, base, basefont, br, col, frame, hr, img, input, isindex, link, meta, and param; any other tag is a grouping tag. In general, you don't need to know what any of these tags do. One exception is <pre>enclosed text</pre>, which says: Don't change the enclosed text in any way whatsoever; leave it exactly as is.

Details

Overall process

  1. Ask the user what .html (or .htm) file to read in.
  2. Copy the input file to a file of the same name, but with the additional extension .bak.
  3. Generate a name for the output file. The name should consist of a large random integer (use random.randint(1, sys.maxint)), with the .html extension.
  4. Read and process the input file, writing to the output file. You will probably find it simplest to read and process one line at a time, but it's also okay to read in the entire file.
  5. If the output file appears to have been written successfully, delete the original input file, and rename your output file to have the same name. Leave the .bak file in place (just in case).

Indentation

For each line:

Case

All tag names should be lowercase. If they aren't, lowercase them. If there is any information in the tag after the tag name, leave it as is.

Lines and line length

There should be exactly one blank line before each of the following start tags: head, body, h1, h2, h3, h4, h5, h6. Other blank lines (unless they are inside a pre tag) should be deleted.

There should be no whitespace (space or tab) at the end of any line, unless the whitespace is within a pre tag.

Lines should be no longer than 80 characters (counting indentation). If a line is longer than 80 characters, find the last blank before the 80th character position, insert a newline character, and indent the new line the same amount as the original line. Since lines might be 200 or more characters in length, repeat this procedure as necessary until no line is longer than 80 characters (unless it is within a pre tag!).

Keeping track of nesting

If tags is a list,

Lists provide an excellent way of keeping track of nesting. Start with an empty list; whenever you encounter a start tag, append it to the list. Whenever you encounter an end tag, compare it to the last thing in the list, and if it matches, pop it from the list. (If it doesn't match, the nesting is incorrect.) Empty content tags should not be put on the list.

Dealing with incorrect nesting

There's no good way. Seriously. Every browser does something different.

The following approach should work well enough for our purposes. It is based on the assumption that the most common type of error is a missing end tag.

If an end tag (call it "E") doesn't match the last thing on the list of start tags, then insert an end tag for that start tag, and pop the start tag from the list. If E still doesn't match, do it again. If, after inserting two end tags, E still doesn't match the last thing in the list, discard E.

Style

All the usual--meaningful variable names, spaces around binary operators (including **, which some of you haven't been putting spaces around), etc. Write a doc string for each function, and be sure to put your name, and your partner's name, in comments at the very top of the program.

Your program should consist of a lot of little functions, not a few big functions. Functions should be of two types: Those that do input or output, and those that do computation. You should have good unit tests for every function that does computation. (If you have just a few big functions, you will find unit testing very, very difficult!)

Grading

We won't have unit tests for your assignment, for the simple reason that I haven't told you exactly what your functions should be. Instead, I will try to write the program myself, and compare my output against yours. Hopefully, the results will be character-by-character identical!

We will also be looking at your unit tests, to see whether they seem adequate.

Due date

Turn your assignment in to Canvas before 6am Friday, September 26. Zip your program file and your test file together. This is a pair programming assignment, so you should decide which of you will turn it in (we don't want to grade two copies of the same program). Be sure to put both of your names in a comment right at the top of the program.