CIT 597 Link Extractor Addendum
Fall 2004, David Matuszek
How should you read in the file?
I left this unspecified because I don't really care; in a future assignment, I may ask you to get pages from the Web. I do want you to give me a simple way (without editing and recompiling your program) to try your program out on my test file. The best way to do this is with an open file dialog (java.awt.FileDialog or javax.swing.JFileChooser).
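A minimal sketch of the file-dialog approach, using javax.swing.JFileChooser. The class name FileOpener and the methods chooseFile and readAll are made up for illustration; nothing in the assignment requires them. The code avoids Java 1.5 features.

```java
import java.awt.GraphicsEnvironment;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import javax.swing.JFileChooser;

// Hypothetical helper class; the names here are illustrative only.
public class FileOpener {

    /** Shows an open-file dialog. Returns null if the user cancels
     *  or if no display is available. */
    public static File chooseFile() {
        if (GraphicsEnvironment.isHeadless()) {
            return null; // no display, so no dialog can be shown
        }
        JFileChooser chooser = new JFileChooser();
        int result = chooser.showOpenDialog(null);
        if (result == JFileChooser.APPROVE_OPTION) {
            return chooser.getSelectedFile();
        }
        return null;
    }

    /** Reads the whole file into one String, ending every line with '\n'. */
    public static String readAll(File file) throws IOException {
        StringBuffer text = new StringBuffer();
        BufferedReader in = new BufferedReader(new FileReader(file));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                text.append(line).append('\n');
            }
        } finally {
            in.close();
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        File file = chooseFile();
        if (file == null) {
            System.out.println("No file selected.");
            return;
        }
        System.out.println(readAll(file));
    }
}
```

Keeping the dialog separate from the reading code means the rest of the program only ever sees a String, so a later switch to reading pages from the Web should touch only the part that produces that String.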
What should you do about invalid tags?
Assume all tags are valid. If you get an invalid tag, your program should not crash, but it doesn't have to do anything sensible. It turns out, by the way, that one of my examples was unintentionally invalid--you apparently can't have whitespace between the opening "<" and the tag name.
What are the rules for tags?
Roughly speaking,
What if an <a> tag doesn't contain any link text?
You can use the words "empty link text" or similar.
What if an <a> tag encloses an image rather than link text?
Treat the <img> tag in the usual way, and list the link with its own URL (and nothing else) as the link text. Thus, for <a href=X><img src=Y></a> you would have <a href=Y>Y</a> in the list of images and <a href=X>X</a> in the list of links.
What if an <a> tag encloses several other tags, or text and tags?
This is getting way more complex than I had intended! If you want to try to do something reasonable, go ahead, but don't expect us to test your program on complex cases. I do want you to use regular expressions to parse tags with whitespace, and single, double, or no quotes, but for the rest of it, stick to handling the simpler cases correctly.
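One possible way to handle whitespace and the three quoting styles with a single regular expression is sketched below. The class name TagAttributes, the method parse, and the exact pattern are assumptions for illustration, not the required solution. Note that the sketch uses generics (Map<String, String>), which are a Java 1.5 feature; for Java 1.4, drop the type parameters and cast the values you get back.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the pattern covers only the simple cases.
public class TagAttributes {

    // name = "value" | 'value' | barevalue, with optional whitespace around '='.
    // Group 1 is the attribute name; exactly one of groups 2-4 holds the value.
    private static final Pattern ATTRIBUTE = Pattern.compile(
        "([A-Za-z][A-Za-z0-9_:.-]*)\\s*=\\s*"
        + "(?:\"([^\"]*)\"|'([^']*)'|([^\\s\"'>]+))");

    /** Returns the attributes of one tag, with names folded to lower case. */
    public static Map<String, String> parse(String tag) {
        Map<String, String> attributes = new HashMap<String, String>();
        Matcher m = ATTRIBUTE.matcher(tag);
        while (m.find()) {
            String value = m.group(2);                 // double-quoted
            if (value == null) value = m.group(3);     // single-quoted
            if (value == null) value = m.group(4);     // unquoted
            attributes.put(m.group(1).toLowerCase(), value);
        }
        return attributes;
    }

    public static void main(String[] args) {
        Map<String, String> a =
            parse("<a href = \"a b.html\" class='x' target=_top>");
        System.out.println(a.get("href"));    // a b.html
        System.out.println(a.get("class"));   // x
        System.out.println(a.get("target"));  // _top
    }
}
```

The tag name itself is skipped because it is never followed by "=". A tag with no attributes simply yields an empty map, which fits the rule that invalid or odd input should not crash the program.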
What's that <E> in the Java API for java.util.ArrayList?
You are looking at the documentation for Java 1.5 (also known as Java 5). Please do not use Java 1.5 features, because we don't have 1.5 installed in the labs. Eclipse doesn't yet handle Java 1.5, either. You can use a Java 1.5 SDK if you like, and your Java 1.4 program will work fine, except that it may give you some warning messages--if it does, just ignore them.
What tags should be rejected by my filters?
The filter should look at the URL as a String and, based on some tests on that String, decide whether to keep or discard the URL. If this were a "real" program, with a "real" API, then your filters would accept everything, and your "real" users would override your filters to accept only URLs they might be interested in. Your filter should also get the attributes and values in some form (a HashMap is ideal), in case your hypothetical users want to use some of that information in making their filtering decisions. But since this is not a real API, just an assignment, you should write example filters that accept some things and reject others (tell us which). To make it easier for us to grade, I suggest you reject .gif images and hyperlinks that contain a '#' character.
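The two suggested example filters might look roughly like this. The interface UrlFilter and the class names are hypothetical, since the assignment does not prescribe an API shape. The attribute map is passed in but ignored here; it is there only so that a user's own filter could consult it. As written the sketch uses generics (Java 1.5); for Java 1.4, use a plain Map.

```java
import java.util.Map;

// Hypothetical filter types; the assignment leaves the actual design open.
public class ExampleFilters {

    /** Decides whether a URL is kept (true) or discarded (false). */
    public interface UrlFilter {
        boolean accept(String url, Map<String, String> attributes);
    }

    /** Example image filter: rejects .gif images, accepts everything else. */
    public static class ImageFilter implements UrlFilter {
        public boolean accept(String url, Map<String, String> attributes) {
            return !url.toLowerCase().endsWith(".gif");
        }
    }

    /** Example link filter: rejects hyperlinks that contain a '#'. */
    public static class LinkFilter implements UrlFilter {
        public boolean accept(String url, Map<String, String> attributes) {
            return url.indexOf('#') < 0;
        }
    }
}
```

A "real" default filter would just return true, and users would supply their own UrlFilter to narrow the results.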
How do I start from a file rather than from a URL?
An absolute pathname starts with a "/" (slash); a relative pathname does not. For example, if page AAA/BBB/...XXX/index.html contains <a href="foo/bar.html">, that means that the directory that contains index.html also contains a subdirectory foo, and foo contains bar.html. This is a path relative to the location XXX, which is not necessarily the place where your output goes. Your output should contain absolute paths. On most systems this would probably be file://AAA/BBB/...XXX/foo/bar.html, but for some reason, on Windows we seem to need file://localhost/C:/AAA/BBB/...XXX/foo/bar.html.
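One way to turn a relative path into an absolute one is to let java.net.URI (available since Java 1.4) do the resolving. This is only a sketch: the class name PathResolver, the method toAbsolute, and the shortened example path C:/AAA/BBB/XXX are made up for illustration. It assumes you already have the page's own location as a file: URL.

```java
import java.net.URI;

// Hypothetical helper; names and the example path are illustrative only.
public class PathResolver {

    /** Resolves href against the URL of the page it appeared in.
     *  Throws IllegalArgumentException if either string is not a valid URI. */
    public static String toAbsolute(String pageUrl, String href) {
        URI base = URI.create(pageUrl);
        return base.resolve(URI.create(href)).toString();
    }

    public static void main(String[] args) {
        String page = "file://localhost/C:/AAA/BBB/XXX/index.html";

        // Relative path: resolved against the directory holding index.html.
        System.out.println(toAbsolute(page, "foo/bar.html"));
        // prints file://localhost/C:/AAA/BBB/XXX/foo/bar.html

        // Path starting with "/": replaces the page's whole path.
        System.out.println(toAbsolute(page, "/other/page.html"));
        // prints file://localhost/other/page.html
    }
}
```

With a base URL that has no host part, the resolved string may come back in a different but equivalent form, so check what your browser accepts on your own system.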