CIT 597 Assignment 3: Text Extractor
Fall 2007, David Matuszek

Purpose of this assignment:

General idea of the assignment:

Write a Java application to read in an HTML page, remove all the HTML tags from it, and display the result in a large scrollable JTextArea.
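One possible way to set up the display is sketched below. The class name TextDisplay, the method name buildScrollableText, the window title, and the 800 by 600 size are illustrative assumptions, not requirements of the assignment.

```java
import java.awt.Dimension;
import javax.swing.JFrame;
import javax.swing.JScrollPane;
import javax.swing.JTextArea;
import javax.swing.SwingUtilities;

class TextDisplay {

    /** Builds a read-only, word-wrapping text area inside a scroll pane. */
    static JScrollPane buildScrollableText(String text) {
        JTextArea area = new JTextArea(text);
        area.setEditable(false);        // display only
        area.setLineWrap(true);         // wrap long lines...
        area.setWrapStyleWord(true);    // ...at word boundaries
        area.setCaretPosition(0);       // start scrolled to the top
        JScrollPane scroll = new JScrollPane(area);
        scroll.setPreferredSize(new Dimension(800, 600)); // assumed size
        return scroll;
    }

    public static void main(String[] args) {
        // Swing components should be created on the event dispatch thread.
        SwingUtilities.invokeLater(() -> {
            JFrame frame = new JFrame("Text Extractor");
            frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
            frame.add(buildScrollableText("Extracted text goes here."));
            frame.pack();
            frame.setVisible(true);
        });
    }
}
```

Wrapping the JTextArea in a JScrollPane is what makes it scrollable; a JTextArea on its own has no scroll bars.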

Details:

Use a Swing GUI dialog box to enter a URL into your program. I recommend a JOptionPane as the quickest and easiest way to get the URL (see http://java.sun.com/docs/books/tutorial/uiswing/components/dialog.html for a tutorial). Note that you need to type a complete URL (typically starting with http://) into the dialog box.
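A minimal sketch of asking for the URL with a JOptionPane follows. The class name UrlPrompt and the helper parseCompleteUrl are hypothetical names chosen for illustration; the check for an http or https prefix is an assumption about what counts as a "complete" URL.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import javax.swing.JOptionPane;

class UrlPrompt {

    /** Returns a URL if the text is a complete http or https URL, otherwise null. */
    static URL parseCompleteUrl(String input) {
        if (input == null) {
            return null; // the user pressed Cancel
        }
        try {
            URL url = new URI(input.trim()).toURL();
            String protocol = url.getProtocol();
            return (protocol.equals("http") || protocol.equals("https")) ? url : null;
        } catch (URISyntaxException | MalformedURLException | IllegalArgumentException e) {
            return null; // not a well-formed absolute URL
        }
    }

    public static void main(String[] args) {
        // The third argument pre-fills the input field.
        String input = JOptionPane.showInputDialog(null, "Enter a complete URL:", "http://");
        URL url = parseCompleteUrl(input);
        if (url == null) {
            JOptionPane.showMessageDialog(null, "That is not a complete URL.");
        } else {
            System.out.println("Will read from " + url);
        }
    }
}
```

Keeping the validation in its own method means it can be checked without opening a dialog.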

Go to that URL and get the HTML file that is there; read it as text. See the example at TryURL.java to get the necessary combination of I/O calls. (Note: A 403 return code means that the site does not allow you to connect this way; http://news.google.com/ is an example. Just find a different site.)
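The following is a rough sketch of reading a page as text. It is not the TryURL.java example, which is not reproduced here; the class name PageFetcher, the method names, and the choice of UTF-8 as the character encoding are all assumptions.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

class PageFetcher {

    /** Reads everything from the given reader, one line at a time. */
    static String readAll(Reader source) throws IOException {
        StringBuilder text = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                text.append(line).append('\n'); // readLine drops the line ending
            }
        }
        return text.toString();
    }

    /** Fetches an http or https page; fails with a message on 403 and other errors. */
    static String fetch(URL url) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        int status = connection.getResponseCode();
        if (status != HttpURLConnection.HTTP_OK) {
            throw new IOException("Server returned HTTP " + status + " for " + url);
        }
        // Assumption: the page is encoded as UTF-8.
        return readAll(new InputStreamReader(connection.getInputStream(),
                                             StandardCharsets.UTF_8));
    }
}
```

Checking the response code before reading gives a clear error message for a 403, rather than an exception from deep inside the stream code.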

Most tags should just be removed, but with the following exceptions:

Use regular expressions to do most of the work of finding and removing or replacing tags.
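As a starting point, the simplest kind of tag removal might look like the sketch below. The class name TagStripper and the method stripTags are hypothetical, and the pattern is deliberately naive: it treats everything from a "<" to the next ">" as a tag.

```java
import java.util.regex.Pattern;

class TagStripper {

    // "<", then any run of characters other than ">", then ">".
    // [^>] also matches line breaks, so a tag split across lines is one match.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    /** Removes everything that looks like a tag and keeps the text between tags. */
    static String stripTags(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(stripTags("<p>Hello, <b>world</b>!</p>"));
    }
}
```

This pattern does not understand comments, script content, or attribute values that contain a ">" character, so it is only a first approximation.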

Remember:

We will test your program with both simple tags and with tags containing extra whitespace, tags that are split across lines, etc.
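To handle one particular tag in all of these forms, a pattern can be made tolerant of case, extra whitespace, attributes, and line breaks. The sketch below is one way to do that; the names TagVariants, tagPattern, and replaceTag are hypothetical, and the use of br as the sample tag is purely for illustration, not a statement about which tags the assignment treats specially.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagVariants {

    /**
     * Pattern for an opening or closing tag with the given name.
     * Allows whitespace after "<", any attributes, and line breaks inside the tag.
     * The \b keeps "b" from also matching "br" or "body".
     */
    static Pattern tagPattern(String name) {
        return Pattern.compile("<\\s*/?\\s*" + Pattern.quote(name) + "\\b[^>]*>",
                               Pattern.CASE_INSENSITIVE);
    }

    /** Replaces every occurrence of the named tag with the given literal text. */
    static String replaceTag(String html, String name, String replacement) {
        return tagPattern(name).matcher(html)
                               .replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        String messy = "one<BR  />two<br\n clear=\"all\"\n>three";
        System.out.println(replaceTag(messy, "br", "\n"));
    }
}
```

Matcher.quoteReplacement is used so that a replacement containing "$" or "\" is inserted literally instead of being read as a group reference.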

Finally:

Due date:

Please turn in your program via Blackboard before midnight, Friday October 12 (extended from Wednesday October 10).