597 Link Extractor
Fall 2004, David Matuszek
This assignment should make you familiar with ArrayLists and regular expressions, in case you aren't already.
General idea of the assignment:
Read in an HTML page, find all the <a> and <img> tags in it, and produce a new HTML page containing a subset of those tags.
Read the HTML into your program. For now, you can assume that input comes from a file, but we will shortly be reading pages directly from the Web--so you should isolate the input in a single method that you can change easily.
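One way to keep the input isolated is sketched below. The class name PageReader and the method name readPage are made up for illustration, and the sketch assumes the file is UTF-8 text; nothing in the assignment requires these choices.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

final class PageReader {
    /**
     * Returns the full text of the HTML page at the given location.
     * All input is confined to this one method, so switching from a
     * file to a Web page later means changing only this body.
     */
    static String readPage(String location) throws IOException {
        // Assumption: the page is stored as UTF-8 text.
        byte[] bytes = Files.readAllBytes(Paths.get(location));
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

The rest of the program would then call PageReader.readPage("input.html") (a hypothetical file name) and work only with the returned String.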
Create two ArrayLists (ArrayList is in java.util), one containing some of the
<img> tags in the input page,
and the other containing some of the
<a> tags in the input
page. Use regular expressions to find these tags, but remember:
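Tag names are not case-sensitive: <ImG> is the same as <img>.
Extra whitespace is allowed: <img src="foo.gif"> is the same as
< img src = "foo.gif" >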
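A minimal sketch of tag-finding patterns that tolerate mixed case and extra whitespace follows. The class TagFinder and its method names are hypothetical, and the patterns assume that attribute values never contain a ">" character and that <a> elements are not nested.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TagFinder {
    // "<", optional whitespace, "img" as a whole word, then anything up to ">".
    // [^>] also matches line breaks, so tags split across lines are found.
    private static final Pattern IMG_TAG =
        Pattern.compile("<\\s*img\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Group 1: the attribute text of the <a> tag; group 2: the link text.
    // DOTALL lets the link text span several lines.
    private static final Pattern A_TAG = Pattern.compile(
        "<\\s*a\\b([^>]*)>(.*?)<\\s*/\\s*a\\s*>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns every <img> tag exactly as it appears in the page. */
    static List<String> findImgTags(String html) {
        List<String> tags = new ArrayList<>();
        Matcher m = IMG_TAG.matcher(html);
        while (m.find()) {
            tags.add(m.group());
        }
        return tags;
    }

    /** Returns one {attribute text, link text} pair per <a> element. */
    static List<String[]> findLinks(String html) {
        List<String[]> links = new ArrayList<>();
        Matcher m = A_TAG.matcher(html);
        while (m.find()) {
            links.add(new String[] { m.group(1).trim(), m.group(2) });
        }
        return links;
    }
}
```

The word boundary \b after the tag name keeps the patterns from matching other tags that merely start with the same letters, such as <abbr>.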
As you find each tag, call a filter method to decide whether to keep
it. Use one filter method for the
<a> tags, and a different
(but similar) method for the
<img> tags. The methods should
return true if the tag is to be kept, and false otherwise.
For example, you may want to keep
.jpg files but not
.png files; or you may want to eliminate hyperlinks that refer
to a specific part of a page (that contain a
# in the URL). The
filter should be given all the attributes of the tag and, for
links, it should also be given the link text (the text between the <a> and </a> tags).
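The filters described above might look roughly like this. Everything named here (TagFilters, parseAttributes, keepLink, keepImage) is hypothetical, and the two policies are just the examples mentioned in the text: drop links whose URL contains a #, and keep only .jpg images. Passing the attributes as a Map is one possible design, not a requirement.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TagFilters {
    // Matches name="value", name='value', or name=value (unquoted),
    // with optional whitespace around the equals sign.
    private static final Pattern ATTRIBUTE = Pattern.compile(
        "([\\w-]+)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))");

    /** Parses the attribute part of a tag into a map keyed by lower-cased name. */
    static Map<String, String> parseAttributes(String attributeText) {
        Map<String, String> attributes = new HashMap<>();
        Matcher m = ATTRIBUTE.matcher(attributeText);
        while (m.find()) {
            // Exactly one of groups 2-4 is non-null, depending on the quoting used.
            String value = m.group(2) != null ? m.group(2)
                         : m.group(3) != null ? m.group(3)
                         : m.group(4);
            attributes.put(m.group(1).toLowerCase(Locale.ROOT), value);
        }
        return attributes;
    }

    /** Example policy: keep a link unless its URL refers to part of a page. */
    static boolean keepLink(Map<String, String> attributes, String linkText) {
        // linkText is unused by this policy but is available to others.
        String href = attributes.get("href");
        return href != null && !href.contains("#");
    }

    /** Example policy: keep .jpg images and discard everything else. */
    static boolean keepImage(Map<String, String> attributes) {
        String src = attributes.get("src");
        return src != null && src.toLowerCase(Locale.ROOT).endsWith(".jpg");
    }
}
```

Because each policy lives in its own method, changing what is kept means editing one method body and nothing else.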
Finally, create a new HTML page with your results. Your page should have fairly
<body> parts, and the
<body> part should contain two unordered lists:
For the <a> tags, provide an unordered list (<ul>) of
<a> tags with the original link text, and
for the <img> tags, provide an unordered list of
<a> tags that have the URL of the corresponding image as the link text, and that link to the image. (For example, if you find
http://www.foobar.com/fiddle, your unordered list should contain
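The two lists could be assembled along the following lines. This is only a sketch: the class ResultPage, the method build, and the page title are invented for illustration, and the URLs are written out exactly as given, with no HTML escaping and no other processing.

```java
import java.util.List;

final class ResultPage {
    /**
     * Builds the result page as a String.
     * links:     one {URL, link text} pair per kept <a> tag
     * imageUrls: the URL of each kept <img> tag
     */
    static String build(List<String[]> links, List<String> imageUrls) {
        StringBuilder html = new StringBuilder();
        // The title text is a placeholder, not part of the assignment.
        html.append("<html>\n<head><title>Extracted links</title></head>\n<body>\n");

        // First list: the kept links, each with its original link text.
        html.append("<ul>\n");
        for (String[] link : links) {
            html.append("  <li><a href=\"").append(link[0]).append("\">")
                .append(link[1]).append("</a></li>\n");
        }
        html.append("</ul>\n");

        // Second list: links to the kept images, with the URL as the link text.
        html.append("<ul>\n");
        for (String url : imageUrls) {
            html.append("  <li><a href=\"").append(url).append("\">")
                .append(url).append("</a></li>\n");
        }
        html.append("</ul>\n</body>\n</html>\n");
        return html.toString();
    }
}
```

The returned String can then be written to a file and opened in a browser to check the result.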
You are welcome to do this by yourself, or with a partner. If you work with a partner, turn in only one copy of your program, with both your names clearly indicated (preferably on the resultant HTML page).
We will test your program with both "normal" tags and with "abnormal" tags (tags containing extra whitespace, split across lines, etc.).
Please turn in your program via Blackboard before midnight, Monday September 20.