CIT 594 Assignment 4: Text Extractor
Spring 2011, David Matuszek

Purpose of this assignment

General idea of the assignment

Write a Java application to read in an HTML page, remove all the HTML tags from it, and display the result in a large scrollable JTextArea.

The Java classes you will need are java.util.regex.Matcher and java.util.regex.Pattern.

About HTML

HTML, like XML, is derived from SGML (Standard Generalized Markup Language). An important difference is that XML is designed for computers to read, and must be well-formed (free of syntax errors); HTML is designed for browsers to display to humans, and is often sloppy and full of errors. Another difference is that XML is case-sensitive, whereas HTML tag and attribute names are case-insensitive.

Most HTML tags are containers: They start with a "start tag" <tagname> and end with an "end tag" </tagname>. In HTML, the end tags are often omitted. Start tags, but not end tags, may contain attributes, of the form attributeName="value", where the value may be in double quotes, in single quotes, or (if it doesn't contain whitespace) not quoted at all. For example, <table border="0" cellpadding=4 cellspacing='0' class="data"> is a valid start tag.

A few tags, like <br> (line break; go to the next line), are not containers. Since they are their own "end tag," they are sometimes written as <br/> or <br />.
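As a rough illustration of how one pattern can cover all of these spellings, the sketch below matches <br>, <BR>, <br/>, and <br /> (with or without attributes) in a case-insensitive way. The class name LineBreakTags, the method name replaceBreaks, and the choice to replace each match with a newline are assumptions made for this example; they are not part of the assignment text.

```java
import java.util.regex.Pattern;

class LineBreakTags {
    // Matches <br>, <BR>, <br/>, <br />, and <br clear="all">.
    // The \b keeps the pattern from matching longer names such as <broken>.
    private static final Pattern BR =
            Pattern.compile("<\\s*br\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    /** Replaces every line-break tag with a newline (an assumed choice). */
    static String replaceBreaks(String html) {
        return BR.matcher(html).replaceAll("\n");
    }

    public static void main(String[] args) {
        System.out.println(replaceBreaks("one<br>two<BR/>three<br />four"));
    }
}
```

Because [^>] also matches line terminators, the same pattern still works when the tag is split across lines.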

Details

Use a Swing GUI dialog box to enter a URL into your program. I recommend typing or pasting the URL into a JOptionPane.showInputDialog. As usual, you can run SwingExamples.jar to see how to do this.
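A minimal sketch of the input step might look like the following. JOptionPane.showInputDialog returns null when the user cancels, so the result is worth checking before use. The class name UrlPrompt, the helper names, and the prompt text are hypothetical, and this is not the code from SwingExamples.jar.

```java
import javax.swing.JOptionPane;

class UrlPrompt {
    /** Returns the trimmed input, or null if the user cancelled or typed nothing. */
    static String clean(String input) {
        if (input == null) {
            return null; // showInputDialog returns null on Cancel
        }
        String trimmed = input.trim();
        return trimmed.isEmpty() ? null : trimmed;
    }

    /** Shows the dialog and returns the cleaned-up URL string, or null. */
    static String askForUrl() {
        return clean(JOptionPane.showInputDialog("Enter the URL of an HTML page:"));
    }

    public static void main(String[] args) {
        System.out.println("You entered: " + askForUrl());
    }
}
```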

Go to that URL and get the HTML file that is there; read it as text. See the example at TryURL.java to get the necessary combination of I/O calls. (Note: A 403 response code means that the server does not allow you to connect this way; http://news.google.com/ is an example. If this happens, just find a different site.)
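One possible combination of I/O calls is sketched below. It is not the TryURL.java example, just an illustration under stated assumptions: the class name PageReader and its method names are made up, and the page is assumed to be encoded as UTF-8, which will not be true of every site.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URI;
import java.nio.charset.StandardCharsets;

class PageReader {
    /** Reads everything from the source, ending each line with '\n'. */
    static String readAll(Reader source) throws IOException {
        StringBuilder text = new StringBuilder();
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                text.append(line).append('\n');
            }
        }
        return text.toString();
    }

    /** Opens the address and reads the page as text (UTF-8 is an assumption). */
    static String readUrl(String address) throws IOException {
        return readAll(new InputStreamReader(
                URI.create(address).toURL().openStream(), StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readUrl(args[0]));
    }
}
```

Keeping the whole page in one String, rather than processing it line by line, makes it easier to deal later with tags that span more than one line.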

You can ignore everything up to the <body> start tag.
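Skipping the part before the <body> start tag could be done along these lines. The class name BodyFinder and the method name afterBodyTag are hypothetical, and returning the whole input when no <body> tag is found is an assumption made here to cope with sloppy pages.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BodyFinder {
    // Case-insensitive, and allows attributes: <body>, <BODY bgcolor="white">, ...
    private static final Pattern BODY_START =
            Pattern.compile("<\\s*body\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    /** Returns the text after the first body start tag, or all of it if there is none. */
    static String afterBodyTag(String html) {
        Matcher m = BODY_START.matcher(html);
        return m.find() ? html.substring(m.end()) : html;
    }

    public static void main(String[] args) {
        System.out.println(afterBodyTag("<html><head></head><body>Hello</body></html>"));
    }
}
```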

Most tags should just be removed, but with the following exceptions:

Use regular expressions to do most of the work of finding and removing or replacing tags.

You can find various regular expression testers on the Internet. Here's one that I wrote; it has the advantage that it helps you with the Java version.

You can assume that every "<" in the HTML indicates a tag, either a start tag or an end tag. You can also assume that everything after the "<" and before the next ">" is part of the tag. You cannot, however, assume that the "<" and ">" are on the same line.
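Under these assumptions, the basic removal step can be sketched with a single pattern. In Java a negated character class such as [^>] matches line terminators too, so a tag split across lines is matched without any special flag, provided the page is held in one String. The class name TagStripper and the method name stripTags are hypothetical, and the sketch removes every tag without handling any exceptions.

```java
import java.util.regex.Pattern;

class TagStripper {
    // A "<", then any characters except ">" (newlines included), then ">".
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    /** Removes everything from each "<" through the next ">". */
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<p>Hello, <a\n href='x.html'>world</a>!</p>"));
    }
}
```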

After taking care of the tags in the HTML, please make the following substitutions:

Replace    With
&lt;       <
&gt;       >
&amp;      &
&quot;     "
&apos;     '
&nbsp;     a space
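The substitutions in this table can be sketched as a chain of literal replacements. The class name EntityDecoder and the method name decodeEntities are hypothetical. Handling &amp; last is a design choice made for this sketch, not something the assignment states: it keeps text such as "&amp;lt;" from being decoded twice.

```java
class EntityDecoder {
    /** Applies the six substitutions from the table, with &amp; handled last. */
    static String decodeEntities(String text) {
        return text.replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&apos;", "'")
                   .replace("&nbsp;", " ")
                   // Last, so that "&amp;lt;" becomes "&lt;" rather than "<".
                   .replace("&amp;", "&");
    }

    public static void main(String[] args) {
        System.out.println(decodeEntities("1 &lt; 2 &amp;&amp; 3 &gt; 2"));
    }
}
```

String.replace treats both arguments as literal text, so no regular-expression escaping is needed for this step.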

Remember:

We will test your program with both simple tags and with tags containing extra whitespace, tags that are split across lines, etc.

Finally:

Due date

Please turn in your program via Blackboard before 6 AM Friday, February 11.