CIT 597 Link Extractor--Revised! Fall 2004, David Matuszek
ArrayLists and regular expressions,
in case you aren't already.
General idea of the assignment:
Write an application to read in an HTML page, find all the <img>
and <a> tags in it, and produce a new HTML page containing
a subset of those tags. This new HTML page will be saved to a file.
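As a rough illustration of the tag-finding step, the sketch below uses java.util.regex to pull <img> tags and <a>...</a> elements out of a string of HTML. The class name TagFinder, the pattern names, and the findAll method are invented for this example and are not part of the assignment. The patterns assume that no attribute value contains a '>' character, and the list is a raw ArrayList (no generics) to match the Java 1.4 requirement given in the details.

```java
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class TagFinder {
    // A whole <img ...> start tag. [^>]* also matches newlines, so a tag
    // split across lines is still found. Assumes no '>' inside attribute values.
    static final Pattern IMG_TAG =
        Pattern.compile("<img\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // <a ...>link text</a>. Group 1 is the attribute text; group 2 is
    // whatever lies between the start and end tags. DOTALL lets the
    // link text span several lines. Note that this also matches <a>
    // tags that have no href attribute.
    static final Pattern A_TAG =
        Pattern.compile("<a\\b([^>]*)>(.*?)</a\\s*>",
                        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns every match of the pattern in html, in document order.
    static ArrayList findAll(Pattern pattern, String html) {
        ArrayList found = new ArrayList();   // raw type: no generics
        Matcher m = pattern.matcher(html);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }
}
```

Calling TagFinder.findAll(TagFinder.IMG_TAG, html) returns the matched tags as Strings. A real page can contain constructs that simple patterns like these miss, so treat this as a starting point rather than a complete parser.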
Most important changes:
Details:
Use JOptionPane
as the quickest and easiest way to get the URL (see http://java.sun.com/docs/books/tutorial/uiswing/components/dialog.html
for a tutorial). Note that you need to type a complete URL (typically
starting with http://) into the dialog box. Note also that a
File dialog is not what you need!

(A 403 return code means that
they don't allow you to connect this way--http://news.google.com/
is an example--just find a different site.)

Find all the <img>
tags and all the <a href=...> tags in
the HTML page.

For each <img> tag, find all the attribute-value
pairs (one of them should be src=URL), and put them in
a Java 1.4 HashMap. (You can use a Java 1.5 SDK, but
don't use the Java 1.5-specific generics.) Give this HashMap
to a filter method to determine whether to keep this tag or to discard
it. The filter method should return true for tags that are
to be kept, and false for those that are to be discarded.

For each <a href=...> tag, assume
that you have link text between the start and end tags. Find all
the attribute-value pairs (one of them should be href=URL), and
put them in a Java 1.4 HashMap. (Again, don't use
generics.) Give this HashMap and the link text to
a filter method to determine whether to keep this tag or to discard it.
This should be a different filter method than the one used for
images.

Instead of link text, there may be an image--an <img>
tag, and possibly some whitespace--between the start and end tags. In
this case, use the image's URL as the link text. Don't try to handle
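cases more complicated than this!

Case does not matter: ImG is the same as img.

Tags may contain extra whitespace, as in <img src = "foo.gif" >.

>" of the start
tag.

A URL may be absolute, such as "http://www...". In
this case, you can use the URL exactly as you found it.

A URL may be relative, such as "foo/bar.html". In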
this case, you need to prepend the rest of the URL. For example, if
you find this URL on page http://www.xyz.com/something/index.html,
you need to construct the absolute URL http://www.xyz.com/something/foo/bar.html.

If a URL does not end in .htm or .html,
it's probably a directory, and the file within the directory is probably
index.html. For example, both http://www.cis.upenn.edu/~matuszek
and http://www.cis.upenn.edu/~matuszek/
are abbreviations for http://www.cis.upenn.edu/~matuszek/index.html.

Create two ArrayLists (ArrayList is in java.util),
one containing the filtered <img> tags in the input page,
and the other containing the filtered <a> tags in the input
page. Each entry in an ArrayList should be a complete tag (represented
as a String), ready to add to the HTML page you will construct.

Create an HTML page containing <head>, <title>,
and <body> tags, and two suitably-labeled lists of <a href=...>
links. One list will contain the filtered <a> tags, and
the other list will contain links to the filtered images.

So that we can more easily test your program,
the image filter should reject .gif files, and also reject
any image whose height is specified and is less than 50.

#'
character.

You are welcome to do this by yourself, or with a partner. If you work with a partner, turn in only one copy of your program, with both your names clearly indicated (preferably on the resultant HTML page).
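The rule in the details for turning a URL found on a page into an absolute URL can be sketched with java.net.URI, which has been in the standard library since Java 1.4. This is only one possible approach, not necessarily the intended one; the class name UrlResolver and the method resolve are invented for this example.

```java
import java.net.URI;

// Hypothetical helper, for illustration only.
class UrlResolver {
    // Resolves a URL found on a page against the URL of that page.
    // An absolute URL comes back unchanged; a relative one is completed
    // using the page's URL. Throws IllegalArgumentException if either
    // string is not a legal URI (for example, if it contains a space).
    static String resolve(String pageUrl, String foundUrl) {
        return URI.create(pageUrl).resolve(foundUrl.trim()).toString();
    }
}
```

For example, resolving "foo/bar.html" against http://www.xyz.com/something/index.html gives http://www.xyz.com/something/foo/bar.html, the same result as the example in the details.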
We will test your program with both simple tags and with tags containing extra whitespace, tags that are split across lines, etc.
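Since tags with extra whitespace and tags split across lines will be tested, here is a minimal sketch of pulling the attribute-value pairs out of one tag and handing them to an image filter. The names ImgTagFilter, ATTRIBUTE, parseAttributes, and keepImage are invented for this example. The filter assumes that .gif images and images with a specified height under 50 are the ones to discard, and that a non-numeric height counts as unspecified; both are assumptions. The HashMap is raw (no generics), as the details require.

```java
import java.util.HashMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class ImgTagFilter {
    // name = "value", name = 'value', or name = bareword. \s also matches
    // newlines, so spaces and line breaks around '=' are tolerated.
    static final Pattern ATTRIBUTE = Pattern.compile(
        "([a-zA-Z][-\\w]*)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s\"'>]+))");

    // Returns a map from lower-cased attribute name to attribute value.
    static HashMap parseAttributes(String tag) {
        HashMap attrs = new HashMap();   // raw type: no generics
        Matcher m = ATTRIBUTE.matcher(tag);
        while (m.find()) {
            // Exactly one of groups 2, 3, 4 is non-null for each match.
            String value = m.group(2);
            if (value == null) value = m.group(3);
            if (value == null) value = m.group(4);
            attrs.put(m.group(1).toLowerCase(), value);
        }
        return attrs;
    }

    // Returns true to keep the image, false to discard it.
    static boolean keepImage(HashMap attrs) {
        String src = (String) attrs.get("src");
        if (src == null || src.toLowerCase().endsWith(".gif")) {
            return false;   // assumption: .gif images are discarded
        }
        String height = (String) attrs.get("height");
        if (height != null) {
            try {
                if (Integer.parseInt(height.trim()) < 50) return false;
            } catch (NumberFormatException e) {
                // Assumption: a height such as "50%" counts as unspecified.
            }
        }
        return true;
    }
}
```

Lower-casing the attribute names on the way into the map means the filter can look up "src" and "height" without caring how the page author capitalized them.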
Due date:
Please turn in your program via Blackboard before midnight, Monday September 27.