CIT 594 Assignment 4: Text Extractor
Spring 2011, David Matuszek

Purpose of this assignment

General idea of the assignment

Write a Java application to read in an HTML page, remove all the HTML tags from it, and display the result in a large scrollable JTextArea.
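As a rough illustration of the display step, the sketch below wraps a JTextArea in a JScrollPane and puts it in a frame. The class name TextDisplay, the method names, the 40x80 size, and the window title are all assumptions made for this example; nothing in the assignment requires them.

```java
import javax.swing.JFrame;
import javax.swing.JScrollPane;
import javax.swing.JTextArea;
import javax.swing.SwingUtilities;

// Hypothetical helper class; the name and layout are illustrative only.
class TextDisplay {

    // Builds a read-only, word-wrapped text area inside a scroll pane.
    static JScrollPane buildScrollableArea(String text) {
        JTextArea area = new JTextArea(text, 40, 80); // rows and columns are arbitrary
        area.setEditable(false);
        area.setLineWrap(true);
        area.setWrapStyleWord(true);
        area.setCaretPosition(0); // start scrolled to the top
        return new JScrollPane(area);
    }

    // Shows the text in a window; Swing components are created on the event thread.
    static void show(final String text) {
        SwingUtilities.invokeLater(() -> {
            JFrame frame = new JFrame("Extracted text"); // title is an assumption
            frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
            frame.add(buildScrollableArea(text));
            frame.pack();
            frame.setVisible(true);
        });
    }
}
```

Calling TextDisplay.show(someText) from main would open the window; the scroll bars appear automatically once the text is longer than the visible area.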

The Java classes you will need are java.util.regex.Matcher and java.util.regex.Pattern.

About HTML

HTML, like XML, is a specialization of SGML (Standard Generalized Markup Language). An important difference is that XML is designed for computers to read, and must be error-free; HTML is designed for browsers to display to humans, and is often sloppy and full of errors. Another difference is that XML is case-sensitive, whereas HTML is case-insensitive.

Most HTML tags are containers: They start with a "start tag" <tagname> and end with an "end tag" </tagname>. In HTML, the end tags are often omitted. Start tags, but not end tags, may contain attributes, of the form attributeName="value", where the value may be in double quotes, in single quotes, or (if it doesn't contain whitespace) not quoted at all. For example, <table border="0" cellpadding=4 cellspacing='0' class="data"> is a valid start tag.
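To make the three quoting styles concrete, here is a small sketch that uses Pattern and Matcher to pull the attributes out of the start tag shown above. The class name AttributeParser and the regular expression itself are assumptions for illustration; the assignment does not require you to parse attributes this way.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; not part of the assignment.
class AttributeParser {

    // name = "double quoted" | 'single quoted' | unquoted (no whitespace or '>')
    private static final Pattern ATTRIBUTE = Pattern.compile(
            "(\\w[\\w-]*)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))");

    // Returns the attributes of a start tag in the order they appear.
    static Map<String, String> parse(String startTag) {
        Map<String, String> attributes = new LinkedHashMap<String, String>();
        Matcher m = ATTRIBUTE.matcher(startTag);
        while (m.find()) {
            // Exactly one of groups 2, 3, 4 is non-null, depending on the quoting style.
            String value = m.group(2) != null ? m.group(2)
                         : m.group(3) != null ? m.group(3)
                         : m.group(4);
            attributes.put(m.group(1), value);
        }
        return attributes;
    }
}
```

For the example tag, parse should yield border, cellpadding, cellspacing, and class mapped to 0, 4, 0, and data respectively.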

A few tags, like <br> (line break; go to the next line), are not containers. Since they are their own "end tag," they are sometimes written as <br/> or <br />.


Use a Swing GUI dialog box to enter a URL into your program. I recommend typing or pasting the URL into the dialog displayed by JOptionPane.showInputDialog. As usual, you can run SwingExamples.jar to see how to do this.
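One possible way to ask for the URL is sketched below. JOptionPane.showInputDialog returns null if the user cancels the dialog, so the result should be checked before use. The class name UrlPrompt, the helper clean, and the prompt text are assumptions made for this example.

```java
import javax.swing.JOptionPane;

// Hypothetical helper class; names and prompt text are illustrative only.
class UrlPrompt {

    // Trims the input; returns null if the dialog was cancelled or left blank.
    static String clean(String raw) {
        if (raw == null) {
            return null;
        }
        String trimmed = raw.trim();
        return trimmed.isEmpty() ? null : trimmed;
    }

    // Shows the input dialog and returns the cleaned-up entry, or null.
    static String askForUrl() {
        return clean(JOptionPane.showInputDialog("Enter the URL of an HTML page:"));
    }
}
```

A caller would typically test the result of askForUrl() for null and either ask again or exit.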

Go to that URL and get the HTML file that is there; read it as text. See the example at to get the necessary combination of I/O calls. (Note: A 403 return code means that they don't allow you to connect this way-- is an example--just find a different site.)
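The sketch below shows one combination of I/O calls that can read a page as text; it is not necessarily the combination used in the example referred to above. It assumes an http or https URL, the platform default character encoding, and a Java version with try-with-resources; the class name PageFetcher and its method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper class; assumes the address is an http or https URL.
class PageFetcher {

    // Reads everything from the reader into one String, line breaks included.
    static String readAll(Reader in) throws IOException {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[4096];
        int count;
        while ((count = in.read(buffer)) != -1) {
            text.append(buffer, 0, count);
        }
        return text.toString();
    }

    // Fetches the page at the given address and returns its contents as text.
    static String fetch(String address) throws IOException {
        HttpURLConnection connection =
                (HttpURLConnection) new URL(address).openConnection();
        if (connection.getResponseCode() == HttpURLConnection.HTTP_FORBIDDEN) {
            // 403: the site refuses this kind of connection; try a different site.
            throw new IOException("403 Forbidden: " + address);
        }
        try (Reader in = new BufferedReader(
                new InputStreamReader(connection.getInputStream()))) {
            return readAll(in);
        }
    }
}
```

Reading the whole page into a single String, rather than line by line, makes it easier to match tags that span more than one line.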

You can ignore everything up to the <body> start tag.
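A minimal sketch of this step, assuming the whole page is already in one String: find the <body> start tag without regard to case and keep only what follows it. The class name BodyFinder and the choice to return the input unchanged when no <body> tag is present are assumptions of this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; not part of the assignment.
class BodyFinder {

    // "<body", a word boundary, any attributes (possibly across lines), then ">".
    private static final Pattern BODY_START =
            Pattern.compile("<body\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Returns the text after the <body> start tag, or the input itself if none is found.
    static String afterBodyTag(String html) {
        Matcher m = BODY_START.matcher(html);
        return m.find() ? html.substring(m.end()) : html;
    }
}
```

Because [^>] also matches line breaks, the pattern works even when the <body> tag's attributes continue onto another line.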

Most tags should just be removed, but with the following exceptions:

Use regular expressions to do most of the work of finding and removing or replacing tags.

You can find various regular expression testers on the Internet. Here's one that I wrote; it has the advantage that it helps you with the Java version.

You can assume that every "<" in the HTML indicates a tag, either a start tag or an end tag. You can also assume that everything after the "<" and before the next ">" is part of the tag. You cannot, however, assume that the "<" and ">" are on the same line.
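Under these assumptions, plain tag removal can be sketched with a single pattern. The example below removes every tag and does not handle the tags that need special treatment; the class name TagStripper is hypothetical, and the sketch assumes the page has been read into one String.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; removes all tags, with no special cases.
class TagStripper {

    // "<", then any characters except ">" (line breaks included), then ">".
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    // Returns the input with every tag deleted.
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }
}
```

The character class [^>] matches newline characters, so a tag split across lines is removed without needing the Pattern.DOTALL flag. For example, stripTags("a<b\n class='x'>c</b >d") returns "acd".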

After taking care of the tags in the HTML, please make the following substitutions:

Replace   With
&lt;      <
&gt;      >
&amp;     &
&quot;    "
&apos;    '
&nbsp;    a space
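The substitutions can be sketched as a chain of literal replacements. Handling &amp; last is a design choice of this example, not something the assignment states: it keeps text such as &amp;lt; from being decoded twice. The class name EntityDecoder is hypothetical, and String.replace is used here because the entities are fixed strings rather than patterns.

```java
// Hypothetical helper class; applies the six substitutions from the table.
class EntityDecoder {

    // Replaces each entity with its character; &amp; is handled last on purpose.
    static String decode(String text) {
        return text.replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&quot;", "\"")
                   .replace("&apos;", "'")
                   .replace("&nbsp;", " ")
                   .replace("&amp;", "&");
    }
}
```

With this ordering, decode("&amp;lt;") gives "&lt;" rather than "<", which matches what a browser would display for that text.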


We will test your program with both simple tags and more awkward ones: tags containing extra whitespace, tags that are split across lines, etc.


Due date

Please turn in your program via Blackboard before 6 AM Friday, February 11.