597 Link Extractor Addendum
Fall 2004, David Matuszek
<a> tags without an href attribute mean something else, and should be ignored by your program; and <img> tags have a
<a> tag doesn't contain any link text?
empty link text or similar.
<a> tag encloses an image rather than link text?
<img> tag in the usual way, and use its URL (only) as the link text. Thus, for
<a href=X><img src=Y></a>
in the list of images and
in the list of links.
<E> in the Java API for
String and, based on some tests on that String, decide whether to keep or discard the URL. If this were a "real" program, with a "real" API, then your filters would accept everything, and your "real" users would override your filters to accept only URLs they might be interested in. Your filter should also get the attributes and values in some form (a HashMap is ideal), in case your hypothetical users want to use some of that information in making their filtering decisions. But since this is not a real API, just an assignment, you should write example filters that accept some things and reject others (tell us which). To make it easier for us to grade, I suggest you reject .gif images and hyperlinks that contain a
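As a rough illustration of the filter idea, here is a minimal sketch in Java. Everything named in it (UrlFilter, accept, AcceptAllFilter, NoGifFilter, RejectContainingFilter, FilterDemo) is hypothetical and is not taken from the assignment; the assignment may use different names and signatures. The sketch assumes a filter receives the URL as a String plus a map of the tag's attributes, and that the text to reject in hyperlinks is supplied by the caller.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// All names here are hypothetical; they only illustrate the filter idea.
class FilterDemo {

    // A filter sees the URL plus the tag's attributes and decides keep/discard.
    interface UrlFilter {
        boolean accept(String url, Map<String, String> attributes);
    }

    // What a "real" API would ship by default: keep every URL.
    static class AcceptAllFilter implements UrlFilter {
        public boolean accept(String url, Map<String, String> attributes) {
            return true;
        }
    }

    // Example image filter: discard URLs ending in .gif, in any letter case.
    static class NoGifFilter implements UrlFilter {
        public boolean accept(String url, Map<String, String> attributes) {
            return !url.toLowerCase(Locale.ROOT).endsWith(".gif");
        }
    }

    // Example link filter: discard URLs containing a caller-chosen fragment.
    static class RejectContainingFilter implements UrlFilter {
        private final String fragment;

        RejectContainingFilter(String fragment) {
            this.fragment = fragment;
        }

        public boolean accept(String url, Map<String, String> attributes) {
            return !url.contains(fragment);
        }
    }

    public static void main(String[] args) {
        // Attributes are passed along even though these filters ignore them.
        Map<String, String> attrs = new HashMap<>();
        attrs.put("src", "pics/logo.gif");
        attrs.put("alt", "Logo");

        UrlFilter images = new NoGifFilter();
        System.out.println(images.accept("pics/logo.gif", attrs));   // prints false
        System.out.println(images.accept("pics/photo.jpg", attrs));  // prints true
    }
}
```

A user of such an API could then write their own class implementing the same interface, for example one that inspects the attribute map, and substitute it for the default filter.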
/" (slash); a relative pathname does not. For example, if page
, that means that the directory that contains index.html also contains a subdirectory
bar.html. This is a path relative to the location
XXX, which is not necessarily the same place your output goes. Your output should contain absolute paths. On most systems this would probably be
file://AAA/BBB/...XXX/foo/bar.html, but for some reason, on Windows we seem to need
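
One possible way to turn a relative link into an absolute one is to resolve it against the location of the page it was found on. The sketch below uses java.net.URI from the standard library. The class name AbsoluteLinks, the method toAbsolute, and the example.com address are hypothetical, chosen only for illustration. The sketch does not address how file: URLs should be written on any particular operating system.

```java
import java.net.URI;

// Hypothetical helper; not part of the assignment's API.
class AbsoluteLinks {

    // Resolves a link against the location of the page that contains it.
    // A relative link is interpreted relative to the page's directory;
    // a link that is already absolute comes back unchanged.
    // URI.resolve throws IllegalArgumentException if the link is not a
    // syntactically valid URI (for example, if it contains a space).
    static URI toAbsolute(URI pageLocation, String link) {
        return pageLocation.resolve(link);
    }

    public static void main(String[] args) {
        // Assumed page location, for illustration only.
        URI page = URI.create("http://example.com/XXX/index.html");

        // prints http://example.com/XXX/foo/bar.html
        System.out.println(toAbsolute(page, "foo/bar.html"));

        // prints http://example.com/top.html
        System.out.println(toAbsolute(page, "/top.html"));
    }
}
```

The point of resolving against the page's own location, rather than against the program's working directory, is that the result stays correct no matter where the output is written.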