CIT 597 Assignment 3: Text Extractor
Fall 2007, David Matuszek
Write a Java application to read in an HTML page, remove all the HTML
tags from it, and display the result in a large scrollable JTextArea.
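As a rough illustration of the display step, here is one way a scrollable text area might be set up. The class name TextWindow, the 40-by-80 size, and the read-only and word-wrap settings are assumptions made for this sketch; the assignment does not require them.

```java
import javax.swing.JFrame;
import javax.swing.JScrollPane;
import javax.swing.JTextArea;
import javax.swing.SwingUtilities;

// Hypothetical helper class; the name and settings are illustrative only.
class TextWindow {

    /** Puts the text in a read-only, word-wrapped JTextArea inside a JScrollPane. */
    public static JScrollPane buildPane(String text) {
        JTextArea area = new JTextArea(text, 40, 80); // 40 rows, 80 columns (assumed size)
        area.setEditable(false);
        area.setLineWrap(true);
        area.setWrapStyleWord(true);
        return new JScrollPane(area); // the scroll pane supplies the scroll bars
    }

    /** Shows the text in its own window, on the Swing event thread. */
    public static void show(String title, String text) {
        SwingUtilities.invokeLater(() -> {
            JFrame frame = new JFrame(title);
            frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
            frame.add(buildPane(text));
            frame.pack();
            frame.setLocationRelativeTo(null); // center on screen
            frame.setVisible(true);
        });
    }

    public static void main(String[] args) {
        show("Text Extractor", "Extracted text would go here.\n");
    }
}
```

A JTextArea does not scroll by itself, which is why it is wrapped in a JScrollPane before being added to the frame.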
Use a Swing GUI dialog box to enter a URL into your program. I recommend
a JOptionPane as the quickest and easiest way to get the URL (see
http://java.sun.com/docs/books/tutorial/uiswing/components/dialog.html
for a tutorial). Note that you need to type a complete URL (typically
starting with http://) into the dialog box.
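A minimal sketch of the dialog step might look like the following. The class name UrlPrompt and the helper cleanUrl are made up for this example, and treating "complete" as "starts with a scheme such as http://" is an assumption, not a rule stated in the assignment.

```java
import javax.swing.JOptionPane;

// Hypothetical helper class; names are illustrative only.
class UrlPrompt {

    /**
     * Returns the trimmed input if it looks like a complete URL
     * (scheme, then "://", then something), otherwise null.
     * A null input means the user cancelled the dialog.
     */
    public static String cleanUrl(String input) {
        if (input == null) {
            return null;
        }
        String s = input.trim();
        return s.matches("[A-Za-z][A-Za-z0-9+.-]*://.+") ? s : null;
    }

    public static void main(String[] args) {
        // showInputDialog returns the typed text, or null on Cancel.
        String url = cleanUrl(JOptionPane.showInputDialog("Enter a complete URL:"));
        if (url == null) {
            JOptionPane.showMessageDialog(null, "That does not look like a complete URL.");
        } else {
            System.out.println("Will fetch: " + url);
        }
    }
}
```

Keeping the check in a separate method means it can be exercised without opening a dialog.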
Go to that URL and get the HTML file that is there; read it as text. See the
example at TryURL.java to get the necessary combination of I/O calls.
(Note: A 403 return code means that the site does not allow you to connect
this way; http://news.google.com/ is an example. Just find a different site.)
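The following is not the contents of TryURL.java, just one plausible combination of I/O calls for the step described above. The class name PageFetcher, the UTF-8 decoding, and the wording of the error message are assumptions made for this sketch.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;

// Hypothetical helper class; names are illustrative only.
class PageFetcher {

    /** Reads every line from the reader and joins the lines with '\n'. */
    public static String readAll(Reader in) {
        return new BufferedReader(in).lines().collect(Collectors.joining("\n"));
    }

    /**
     * Fetches the page at the given http or https address as text.
     * Assumes the page is UTF-8; a real page may declare another charset.
     */
    public static String fetch(String address) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(address).toURL().openConnection();
        if (conn.getResponseCode() == HttpURLConnection.HTTP_FORBIDDEN) {
            // 403: the site refuses this kind of connection.
            throw new IOException("403 Forbidden, try a different site: " + address);
        }
        try (Reader in = new InputStreamReader(conn.getInputStream(),
                                               StandardCharsets.UTF_8)) {
            return readAll(in);
        }
    }
}
```

Reading the whole page into one string, rather than processing it line by line, makes it easier to handle tags that are split across lines later on.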
Most tags should just be removed, but with the following exceptions:
- Anything enclosed in <title>, <div>, <h1>, <h2>, <h3>, <h4>, <h5>, or <h6> should be put on a line by itself, with a single blank line before and after it.
- Each <p> tag results in a blank line. (Do not assume that there will be a matching </p> end tag; few people actually use them.) Try to avoid multiple consecutive blank lines.
- Each <br> tag should be replaced by a newline.
- Within a <ul>, every <li> should start a new line, beginning with " * ".
- Within an <ol>, every <li> should start a new, numbered line, with the first <li> starting at 1. Counting ends when the matching </ol> is encountered.
- An <img> tag with an alt="text" attribute should be replaced by [image: text]. If it has no alt attribute, it should be replaced by just [image].
Use regular expressions to do most of the work of finding and removing or replacing tags.
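The list and image rules need a little more than a plain replaceAll, because the replacement depends on context. Here is one possible sketch using a Matcher loop. The class name ListAndImageTags is invented, the "1. " numbering format is an assumption (the assignment only says the line is numbered), and only double-quoted alt values are handled.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; names and output format are illustrative only.
class ListAndImageTags {
    private static final int CI = Pattern.CASE_INSENSITIVE;
    private static final Pattern IMG = Pattern.compile("<\\s*img\\b[^>]*>", CI);
    private static final Pattern ALT = Pattern.compile("\\balt\\s*=\\s*\"([^\"]*)\"", CI);
    private static final Pattern LIST_TAG =
            Pattern.compile("<\\s*(/?)\\s*(ol|ul|li)\\b[^>]*>", CI);

    /** Replaces each <img> tag by [image: text] or, without an alt, by [image]. */
    public static String replaceImages(String html) {
        Matcher m = IMG.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            Matcher alt = ALT.matcher(m.group());
            String text = alt.find() ? "[image: " + alt.group(1) + "]" : "[image]";
            m.appendReplacement(out, Matcher.quoteReplacement(text));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Turns <li> tags into " * " or numbered lines and drops <ol>, <ul>, </li>. */
    public static String replaceLists(String html) {
        Deque<Integer> counters = new ArrayDeque<>(); // one entry per open list; -1 marks <ul>
        Matcher m = LIST_TAG.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            boolean closing = !m.group(1).isEmpty();
            String name = m.group(2).toLowerCase();
            String replacement = "";
            if (name.equals("li")) {
                if (!closing) {
                    if (counters.isEmpty() || counters.peek() < 0) {
                        replacement = "\n * ";           // bulleted (or stray) item
                    } else {
                        int n = counters.pop() + 1;      // next number in this <ol>
                        counters.push(n);
                        replacement = "\n" + n + ". ";
                    }
                }
            } else if (closing) {
                if (!counters.isEmpty()) {
                    counters.pop();                      // </ol> or </ul> ends counting
                }
            } else {
                counters.push(name.equals("ol") ? 0 : -1);
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

The stack of counters is only there so that a list nested inside another list does not disturb the outer numbering; a single int would do if nesting is ignored.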
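Remember:
- Case does not matter in tag names: ImG is the same as img.
- Tags may contain extra whitespace, as in <img src = "foo.gif" >.
- Tags may contain attributes you do not care about, as in <p stuff="nonsense">; the tag runs up to the closing ">" of the start tag.
We will test your program with both simple tags and with tags containing extra whitespace, tags that are split across lines, etc.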
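To make the points above concrete, here is a small sketch of patterns that tolerate mixed case, extra whitespace, unknown attributes, and tags split across lines. The class name TagStripper and the order of the steps are assumptions; it covers only the heading, <p>, and <br> rules plus generic tag removal.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; names and step order are illustrative only.
class TagStripper {
    private static final int CI = Pattern.CASE_INSENSITIVE;

    // [^>]* also matches newlines, so a tag split across lines is still one match.
    private static final Pattern BLOCK =
            Pattern.compile("<\\s*/?\\s*(title|div|h[1-6])\\b[^>]*>", CI);
    private static final Pattern PARA = Pattern.compile("<\\s*p\\b[^>]*>", CI);
    private static final Pattern BREAK = Pattern.compile("<\\s*br\\b[^>]*>", CI);
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]*>");

    public static String strip(String html) {
        String s = BLOCK.matcher(html).replaceAll("\n\n"); // blank line before and after
        s = PARA.matcher(s).replaceAll("\n\n");            // <p> gives a blank line
        s = BREAK.matcher(s).replaceAll("\n");             // <br> gives a newline
        s = ANY_TAG.matcher(s).replaceAll("");             // everything else is removed
        s = s.replaceAll("\\n\\s*\\n", "\n\n");            // collapse runs of blank lines
        return s.trim();
    }
}
```

For example, strip applied to `<TITLE>Hi</TITLE><P stuff="nonsense">one<BR >two` yields "Hi", a blank line, then "one" and "two" on consecutive lines. The \b after each tag name keeps the <p> pattern from also matching <pre> or <param>.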
Finally:
Please turn in your program via Blackboard before midnight, Friday October 12.