597 Assignment 3: Text
Fall 2007, David Matuszek
Write a Java application to read in an HTML page, remove all the HTML
tags from it, and display the result in a large scrollable
Use a Swing GUI dialog box to enter a URL into your program. I recommend
as the quickest and easiest way to get the URL (see http://java.sun.com/docs/books/tutorial/uiswing/components/dialog.html
for a tutorial). Note that you need to type a complete URL (typically
http://) into the dialog box.
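A minimal sketch of the input step, assuming JOptionPane.showInputDialog is an acceptable way to ask for the URL. The class name UrlPrompt and the helper clean are hypothetical and are not part of the assignment. Accepting https:// as well as http:// is also an assumption.

```java
import javax.swing.JOptionPane;

class UrlPrompt {
    // Hypothetical helper: returns the trimmed input if it looks like a
    // complete URL, or null if the user cancelled or left off the scheme.
    static String clean(String input) {
        if (input == null) {
            return null; // the dialog returns null when the user cancels
        }
        String trimmed = input.trim();
        boolean complete = trimmed.startsWith("http://")
                || trimmed.startsWith("https://"); // https is an assumption
        return complete ? trimmed : null;
    }

    public static void main(String[] args) {
        String input = JOptionPane.showInputDialog(
                "Enter a complete URL (starting with http://):");
        String url = clean(input);
        if (url == null) {
            JOptionPane.showMessageDialog(null, "No complete URL was entered.");
        } else {
            System.out.println("URL: " + url);
        }
    }
}
```

Keeping the check in a separate method means it can be tried without opening a dialog.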
Go to that URL and get the HTML file that is there; read it as text. See the
example at TryURL.java to get
the necessary combination of I/O calls. (Note: A 403 return code means that the site
does not allow you to connect this way; http://news.google.com/ is an example.
Just find a different site.)
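One possible way to read the page as text is sketched below. This is not the TryURL.java example, which is not reproduced here. The class and method names (PageReader, readAll, fetch) are hypothetical, and the sketch assumes the platform default character encoding is good enough.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;

class PageReader {
    // Reads everything from the reader into one string, ending every
    // line with '\n' regardless of the line terminator in the source.
    static String readAll(Reader source) throws IOException {
        BufferedReader in = new BufferedReader(source);
        StringBuilder text = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
            text.append(line).append('\n');
        }
        return text.toString();
    }

    // Opens the URL and reads the response body as text.
    // An IOException here may mean the site refused the connection.
    static String fetch(String address) throws IOException {
        URL url = new URL(address);
        Reader reader = new InputStreamReader(url.openStream());
        try {
            return readAll(reader);
        } finally {
            reader.close();
        }
    }
}
```

Reading the whole page into a single string, rather than handling it line by line, makes it easier to deal later with tags that are split across lines.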
Most tags should just be removed, but with the following exceptions:
<h6> should be put on a line by itself, with a single blank line before and after it.
A <p> tag results in a blank line. (Do not assume that there will be a matching
</p> end tag; few people actually use them.) Try to avoid multiple consecutive blank lines.
A <br> tag should be replaced by a newline.
<li> should start a new line, beginning with " * ".
<li> should start a new, numbered line, with the first
1. Counting ends when the matching
An <img> tag with an alt="text" attribute should be replaced by
[image: text]. If it has no alt attribute, it should be replaced by just
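A few of the rules above can be sketched with java.util.regex. The sketch covers only <br>, <p>, <img> with a double-quoted alt attribute, removal of all other tags, and collapsing of extra blank lines. It does not cover headings or list items. An <img> without alt is simply removed here, which is a placeholder and not the required behavior. The class name TagRules and the method apply are hypothetical.

```java
import java.util.regex.Pattern;

class TagRules {
    private static final int FLAGS = Pattern.CASE_INSENSITIVE;

    // <img ...> with a double-quoted alt attribute; group 1 is the alt text.
    private static final Pattern IMG_ALT = Pattern.compile(
            "<img\\b[^>]*?\\balt\\s*=\\s*\"([^\"]*)\"[^>]*>", FLAGS);
    // \b keeps <br> from matching longer names, and <p> from matching <pre>.
    private static final Pattern BR = Pattern.compile("<br\\b[^>]*>", FLAGS);
    private static final Pattern P = Pattern.compile("<p\\b[^>]*>", FLAGS);
    // Any remaining tag, start or end.
    private static final Pattern ANY_TAG = Pattern.compile("<[^>]*>");
    // Three or more newlines in a row means more than one blank line.
    private static final Pattern BLANK_RUN = Pattern.compile("\\n{3,}");

    static String apply(String html) {
        // Special cases go first, before the catch-all removes what is left.
        String text = IMG_ALT.matcher(html).replaceAll("[image: $1]");
        text = BR.matcher(text).replaceAll("\n");
        text = P.matcher(text).replaceAll("\n\n");
        text = ANY_TAG.matcher(text).replaceAll("");
        return BLANK_RUN.matcher(text).replaceAll("\n\n");
    }
}
```

The order matters: if the catch-all pattern ran first, there would be no special tags left to replace.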
Use regular expressions to do most of the work of finding and removing or replacing tags. Remember:
ImG is the same as
<img src = "foo.gif" >
>" of the start tag.
We will test your program with both simple tags and with tags containing extra whitespace, tags that are split across lines, etc.
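To illustrate the kinds of variation mentioned above, here is a small sketch of a pattern that is meant to tolerate mixed case, extra whitespace, and line breaks inside a tag. The class name TagVariants and the method countImgTags are hypothetical and are used only to try the pattern out.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagVariants {
    // CASE_INSENSITIVE lets "img" match "ImG" or "IMG".
    // [^>]* matches any characters up to the closing '>', including
    // spaces and newlines, so a tag split across lines still matches.
    // \b prevents a match on longer names that merely start with "img".
    private static final Pattern IMG =
            Pattern.compile("<img\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Counts the <img> start tags found in the given text.
    static int countImgTags(String html) {
        Matcher m = IMG.matcher(html);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }
}
```

Because a negated character class such as [^>] also matches line terminators, this pattern does not need the DOTALL flag. A pattern that used "." in its place would.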
Please turn in your program via Blackboard before midnight,
10 Friday October 12.