CIT 597 Link Extractor, Fall 2004, David Matuszek
Purpose: To get you familiar with ArrayLists and regular expressions,
in case you aren't already.
General idea of the assignment:
Read in an HTML page, find all the <img> and <a>
tags in it, and produce a new HTML page containing a subset of those tags.
Details:
Read the HTML into your program. For now, you can assume that input comes from a file, but we will shortly be reading pages directly from the Web--so you should isolate the input in a single method that you can change easily.
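One way to isolate the input, as a minimal sketch: the class name PageReader, the method name readPage, and the choice of UTF-8 are assumptions made for illustration, not part of the assignment. The rest of the program would call only this method, so switching from files to the Web later should mean changing only its body.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class PageReader {
    /**
     * Returns the full text of the HTML page at the given location.
     * For now the location is a file path. Assumption: the file is UTF-8.
     */
    static String readPage(String location) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(location));
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

A caller would write something like String html = PageReader.readPage("input.html"); where input.html is a placeholder file name.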
Create two ArrayLists (ArrayList is in java.util),
one containing some of the <img> tags in the input page,
and the other containing some of the <a> tags in the input
page. Use regular expressions to find these tags, but remember:
Tag names are not case sensitive: ImG is the same as img.
Tags may contain extra whitespace, as in < img src = "foo.gif" >.
As you find each tag, call a filter method to decide whether to keep
it. Use one filter method for the <a> tags, and a different
(but similar) method for the <img> tags. The methods should
return true if the tag is to be kept, and false otherwise.
For example, you may want to keep .jpg files but not .gif
or .png files; or you may want to eliminate hyperlinks that refer
to a specific part of a page (that contain a # in the URL). The
filter should be given all the attributes of the tag and, for <a>
links, it should also be given the link text (the text between the <a>
and the </a>).
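The finding and filtering steps above could be sketched as follows. The class name TagFinder, the method names, the patterns, and the two filter policies (drop links containing #, keep only .jpg images) are illustrative assumptions, not the required design. The patterns are deliberately simple: they tolerate mixed case, extra whitespace, and tags split across lines, but they are not a full HTML parser (for instance, a > inside an attribute value would confuse them).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagFinder {
    // Group 1: the attributes of the <a> tag. Group 2: the link text.
    // CASE_INSENSITIVE accepts <A>; DOTALL lets the link text span lines.
    private static final Pattern ANCHOR = Pattern.compile(
        "<\\s*a\\b([^>]*)>(.*?)<\\s*/\\s*a\\s*>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Group 1: the attributes of the <img> tag.
    private static final Pattern IMAGE = Pattern.compile(
        "<\\s*img\\b([^>]*)>", Pattern.CASE_INSENSITIVE);

    // Matches name = "value", name = 'value', or name = value.
    private static final Pattern ATTRIBUTE = Pattern.compile(
        "([\\w-]+)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s\"'>]+))");

    /** Collects a tag's attributes, with names folded to lower case. */
    static Map<String, String> parseAttributes(String text) {
        Map<String, String> attributes = new LinkedHashMap<>();
        Matcher m = ATTRIBUTE.matcher(text);
        while (m.find()) {
            String value = m.group(2) != null ? m.group(2)
                         : m.group(3) != null ? m.group(3) : m.group(4);
            attributes.put(m.group(1).toLowerCase(), value);
        }
        return attributes;
    }

    /** Example policy only: drop links to a specific part of a page. */
    static boolean keepAnchor(Map<String, String> attributes, String linkText) {
        String href = attributes.get("href");
        return href != null && !href.contains("#");
    }

    /** Example policy only: keep .jpg images, drop everything else. */
    static boolean keepImage(Map<String, String> attributes) {
        String src = attributes.get("src");
        return src != null && src.toLowerCase().endsWith(".jpg");
    }

    /** Returns the text of every <a>...</a> element the filter accepts. */
    static List<String> findAnchors(String html) {
        List<String> kept = new ArrayList<>();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            if (keepAnchor(parseAttributes(m.group(1)), m.group(2).trim())) {
                kept.add(m.group());
            }
        }
        return kept;
    }

    /** Returns the text of every <img> tag the filter accepts. */
    static List<String> findImages(String html) {
        List<String> kept = new ArrayList<>();
        Matcher m = IMAGE.matcher(html);
        while (m.find()) {
            if (keepImage(parseAttributes(m.group(1)))) {
                kept.add(m.group());
            }
        }
        return kept;
    }
}
```

Storing the whole matched tag as a String is only one option; a small class holding the attributes and link text would make the later output step easier.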
Finally, create a new HTML page with your results. Your page should have fairly
simple <head> and <body> parts, and the
<body> part should contain two unordered lists:
For the <a> tags, provide an unordered list (<ul>...</ul>)
of <a> tags with the original link text, and
For the <img> tags, provide an unordered list of <a>
tags that have the URL of the corresponding image as the link text, and that
link to the image. (For example, if you find <img src="images/foo.gif">
on page http://www.foobar.com/fiddle, your unordered list should
contain <a href="http://www.foobar.com/fiddle/images/foo.gif">images/foo.gif</a>.)
You are welcome to do this by yourself, or with a partner. If you work with a partner, turn in only one copy of your program, with both your names clearly indicated (preferably on the resultant HTML page).
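The result page with its two unordered lists might be assembled along these lines. This is a sketch under stated assumptions: the class name ResultPage, its method names, and the page title are invented for illustration, and resolving the href of <a> links against the page URL is an added assumption (the assignment only asks for this with images). The sketch uses java.net.URI for resolution, which follows the standard rules for relative URLs, so the page URL needs a trailing slash (http://www.foobar.com/fiddle/) to produce the result given in the example. It does not escape special characters in link text, and URI.create rejects URLs containing characters such as spaces.

```java
import java.net.URI;
import java.util.List;

class ResultPage {
    /** Resolves a possibly relative URL against the URL of its page. */
    static String absoluteUrl(String pageUrl, String link) {
        return URI.create(pageUrl).resolve(link).toString();
    }

    /** List item for an <a> tag: keeps the original link text. */
    static String anchorItem(String pageUrl, String href, String linkText) {
        return "<li><a href=\"" + absoluteUrl(pageUrl, href) + "\">"
             + linkText + "</a></li>";
    }

    /** List item for an <img> tag: links to the image, with its URL as text. */
    static String imageItem(String pageUrl, String src) {
        return "<li><a href=\"" + absoluteUrl(pageUrl, src) + "\">"
             + src + "</a></li>";
    }

    /** Builds a simple page holding the two unordered lists. */
    static String buildPage(List<String> anchorItems, List<String> imageItems) {
        StringBuilder page = new StringBuilder();
        page.append("<html>\n<head><title>Extracted links</title></head>\n<body>\n");
        appendList(page, anchorItems);
        appendList(page, imageItems);
        page.append("</body>\n</html>\n");
        return page.toString();
    }

    private static void appendList(StringBuilder page, List<String> items) {
        page.append("<ul>\n");
        for (String item : items) {
            page.append("  ").append(item).append("\n");
        }
        page.append("</ul>\n");
    }
}
```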
We will test your program with both "normal" tags and with "abnormal" tags (tags containing extra whitespace, split across lines, etc.).
Due date:
Please turn in your program via Blackboard before midnight, Monday September 20.