597 Link Extractor--Revised!
Fall 2004, David Matuszek
ArrayLists and regular expressions, in case you aren't already.
General idea of the assignment:
Write an application to read in an HTML page, find all the
<a> tags in it, and produce a new HTML page containing
a subset of those tags. This new HTML page will be saved to a file.
Most important changes:
JOptionPane as the quickest and easiest way to get the URL (see http://java.sun.com/docs/books/tutorial/uiswing/components/dialog.html for a tutorial). Note that you need to type a complete URL (typically starting with
http://) into the dialog box. Note also that a File dialog is not what you need!
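As a rough illustration of the dialog step, here is a minimal sketch. The class name UrlPrompt and the helper looksComplete are hypothetical, and accepting https:// as well as http:// is an assumption, not a requirement of the assignment.

```java
import javax.swing.JOptionPane;

class UrlPrompt {
    // Hypothetical check: the text counts as a complete URL if it names a scheme.
    static boolean looksComplete(String text) {
        if (text == null) return false;
        String t = text.trim().toLowerCase();
        return t.startsWith("http://") || t.startsWith("https://");
    }

    // Keep asking until the user types a complete URL; returns null on Cancel.
    static String askForUrl() {
        while (true) {
            String text = JOptionPane.showInputDialog("Enter a complete URL (http://...):");
            if (text == null) return null;              // user pressed Cancel
            if (looksComplete(text)) return text.trim();
            JOptionPane.showMessageDialog(null, "Please include http:// at the start.");
        }
    }
}
```

Calling UrlPrompt.askForUrl() opens an ordinary input dialog, not a file chooser, which matches the note above.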
403 return code means that they don't allow you to connect this way--http://news.google.com/ is an example--just find a different site.)
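One possible way to read the page and notice a refusal such as a 403 is sketched below. PageFetcher, readAll, and fetch are hypothetical names, and the sketch assumes an http or https address (other schemes would fail the cast) and the platform's default character encoding.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.HttpURLConnection;
import java.net.URL;

class PageFetcher {
    // Read everything the Reader supplies into a single String.
    static String readAll(Reader in) throws IOException {
        StringBuffer text = new StringBuffer();
        char[] buffer = new char[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            text.append(buffer, 0, n);
        }
        return text.toString();
    }

    // Fetch a page; any status other than 200 (for example 403) becomes an IOException.
    static String fetch(String address) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(address).openConnection();
        int code = conn.getResponseCode();
        if (code != HttpURLConnection.HTTP_OK) {
            throw new IOException("Server returned " + code + " for " + address);
        }
        Reader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
        try {
            return readAll(in);
        } finally {
            in.close();
        }
    }
}
```

Reading the whole page into one String, rather than line by line, makes it easier to match tags that are split across lines.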
<img> tags and all the
<a> tags in the HTML page.
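Finding the start tags with a regular expression might look like the sketch below. TagFinder and findStartTags are hypothetical names, the raw ArrayList follows the no-generics rule of this assignment (a current compiler will print an unchecked warning), and the pattern assumes that no attribute value contains a ">" character.

```java
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagFinder {
    // Collect every start tag with the given element name, e.g. "img" or "a".
    static ArrayList findStartTags(String html, String name) {
        // \b keeps "a" from matching "abbr"; [^>]* also matches newlines,
        // so a tag split across lines is still found.
        Pattern p = Pattern.compile("<" + name + "\\b[^>]*>", Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(html);
        ArrayList tags = new ArrayList();
        while (m.find()) {
            tags.add(m.group());
        }
        return tags;
    }
}
```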
<img> tag, find all the attribute-value pairs (one of them should be
src=URL), put them in a Java 1.4
HashMap. (You can use a Java 1.5 SDK, but don't use the Java 1.5-specific generics.) Give this
HashMap to a filter method to determine whether to keep this tag or to discard it. The filter method should return
true for tags that are to be kept, and
false for those that are to be discarded.
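Collecting the attribute-value pairs of one tag into a HashMap could be sketched as follows. AttributeParser, ATTR, and parseAttributes are hypothetical names, storing attribute names in lower case is an assumption made so that SRC and src look the same to a filter, and the raw HashMap is deliberate (no generics), so expect an unchecked warning from a current compiler.

```java
import java.util.HashMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AttributeParser {
    // name = "value", name = 'value', or name = value; whitespace
    // (including line breaks) is allowed around the '=' sign.
    static final Pattern ATTR = Pattern.compile(
        "([A-Za-z][\\w:-]*)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s\"'>]+))");

    // Return a map from lower-cased attribute name to attribute value.
    static HashMap parseAttributes(String tag) {
        HashMap attrs = new HashMap();
        Matcher m = ATTR.matcher(tag);
        while (m.find()) {
            String value = m.group(2);                 // double-quoted value
            if (value == null) value = m.group(3);     // single-quoted value
            if (value == null) value = m.group(4);     // unquoted value
            attrs.put(m.group(1).toLowerCase(), value);
        }
        return attrs;
    }
}
```

A filter method can then look up entries with calls such as (String) attrs.get("src").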
<a> tag, assume that you have link text between the start and end tags. Find all the attribute-value pairs (one of them should be
href=URL), put them in a Java 1.4
HashMap. (Again, don't use generics.) Give this
HashMap and the link text to a filter method to determine whether to keep this tag or to discard it. This should be a different filter method than the one used for images.
<img> tag, and possibly some whitespace--between the start and end tags. In this case, use the image's URL as the link text. Don't try to handle cases more complicated than this!
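The link-text rule, including the image-only case, can be sketched roughly as below. LinkText, ANCHOR, ONLY_IMAGE, and linkTextOf are hypothetical names, and the patterns assume simple tags: no ">" inside attribute values and no whitespace inside the image URL.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkText {
    // An <a ...>...</a> element: group 1 is the attribute text, group 2 the link text.
    // DOTALL lets the link text span several lines.
    static final Pattern ANCHOR = Pattern.compile(
        "<a\\b([^>]*)>(.*?)</a\\s*>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Link text consisting of exactly one <img> tag plus optional whitespace;
    // group 1 is the value of its src attribute.
    static final Pattern ONLY_IMAGE = Pattern.compile(
        "\\s*<img\\b[^>]*\\bsrc\\s*=\\s*[\"']?([^\"'\\s>]+)[^>]*>\\s*",
        Pattern.CASE_INSENSITIVE);

    // Use the image's URL when the link text is just an image; otherwise trim the text.
    static String linkTextOf(String inner) {
        Matcher m = ONLY_IMAGE.matcher(inner);
        if (m.matches()) return m.group(1);
        return inner.trim();
    }
}
```

Each match of ANCHOR supplies the attribute text in group(1) for the HashMap and the raw link text in group(2) for linkTextOf.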
ImG is the same as
<img src = "foo.gif" >
>" of the start tag.
"http://www...". In this case, you can use the URL exactly as you found it.
"foo/bar.html". In this case, you need to prepend the rest of the URL. For example, if you find this URL on page
http://www.xyz.com/something/index.html, you need to construct the absolute URL
.html, it's probably a directory, and the file within the directory is probably
index.html. For example, both http://www.cis.upenn.edu/~matuszek and http://www.cis.upenn.edu/~matuszek/ are abbreviations for http://www.cis.upenn.edu/~matuszek/index.html.
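A minimal sketch of turning a URL found on a page into an absolute one is given below, using java.net.URI. UrlResolver, withDefaultFile, and toAbsolute are hypothetical names, and the ".html" test is only the rough heuristic described above, so it will guess wrong for pages such as .htm or .php files.

```java
import java.net.URI;

class UrlResolver {
    // Heuristic: a page URL that does not end in ".html" is taken to be
    // a directory whose file is index.html.
    static String withDefaultFile(String pageUrl) {
        if (pageUrl.toLowerCase().endsWith(".html")) return pageUrl;
        if (!pageUrl.endsWith("/")) pageUrl = pageUrl + "/";
        return pageUrl + "index.html";
    }

    // Absolute URLs come back unchanged; relative ones are resolved
    // against the page they were found on.
    static String toAbsolute(String pageUrl, String found) {
        URI base = URI.create(withDefaultFile(pageUrl));
        return base.resolve(found.trim()).toString();
    }
}
```

URI.create throws IllegalArgumentException for text that is not a legal URI (for example, one containing spaces), so a real program may want to catch that and skip the tag.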
java.util), one containing the filtered
<img> tags in the input page, and the other containing the filtered
<a> tags in the input page. Each entry in an
ArrayList should be a complete tag (represented as a
String), ready to add to the HTML page you will construct.
<body> tags, and two suitably-labeled lists of
<a href=...> links. One list will contain the filtered
<a> tags, and the other list will contain links to the filtered images.
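Building and saving the output page from the two lists might be sketched as follows. PageBuilder and its methods are hypothetical names, the page title and the headings "Links" and "Images" are placeholders, and the lists are raw ArrayLists of Strings because generics are off limits here.

```java
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;

class PageBuilder {
    // A complete tag, ready to store in one of the ArrayLists.
    static String makeLink(String url, String text) {
        return "<a href=\"" + url + "\">" + text + "</a>";
    }

    // Append one labeled list; each entry of tags is a complete tag String.
    static void appendList(StringBuffer out, String heading, ArrayList tags) {
        out.append("<h2>").append(heading).append("</h2>\n<ul>\n");
        for (int i = 0; i < tags.size(); i++) {
            out.append("  <li>").append((String) tags.get(i)).append("</li>\n");
        }
        out.append("</ul>\n");
    }

    static String buildPage(ArrayList linkTags, ArrayList imageLinks) {
        StringBuffer out = new StringBuffer();
        out.append("<html>\n<head><title>Extracted links</title></head>\n<body>\n");
        appendList(out, "Links", linkTags);
        appendList(out, "Images", imageLinks);
        out.append("</body>\n</html>\n");
        return out.toString();
    }

    // Write the finished page to a file, closing the file even if writing fails.
    static void save(String page, String fileName) throws IOException {
        FileWriter writer = new FileWriter(fileName);
        try {
            writer.write(page);
        } finally {
            writer.close();
        }
    }
}
```

Because the lists already hold complete tags, buildPage only has to wrap each entry in a list item.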
So that we can more easily test your program,
.gif files, and also reject any image whose
height is specified and is less than
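An image filter along these lines could be sketched as below. ImageFilter and keepImage are hypothetical names, and the height limit is passed in as the hypothetical parameter minHeight so that the sketch does not commit to a particular value. Rejecting tags with no src and keeping non-numeric heights are both assumptions.

```java
import java.util.HashMap;

class ImageFilter {
    // attrs maps lower-cased attribute names to values (a raw HashMap, no generics).
    // Returns true to keep the image, false to discard it.
    static boolean keepImage(HashMap attrs, int minHeight) {
        String src = (String) attrs.get("src");
        if (src == null) return false;                  // assumption: no src, nothing to link to
        if (src.trim().toLowerCase().endsWith(".gif")) return false;
        String height = (String) attrs.get("height");
        if (height != null) {
            try {
                if (Integer.parseInt(height.trim()) < minHeight) return false;
            } catch (NumberFormatException e) {
                // Not a plain integer (for example "50%"): treated as unspecified (assumption).
            }
        }
        return true;
    }
}
```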
You are welcome to do this by yourself, or with a partner. If you work with a partner, turn in only one copy of your program, with both your names clearly indicated (preferably on the resultant HTML page).
We will test your program with both simple tags and with tags containing extra whitespace, tags that are split across lines, etc.
Please turn in your program via Blackboard before midnight, Monday September 27.