CIT 597 Link Extractor--Revised!
Fall 2004, David Matuszek

Purposes of this assignment:

General idea of the assignment:

Write an application to read in an HTML page, find all the <img> and <a> tags in it, and produce a new HTML page containing a subset of those tags. This new HTML page will be saved to a file.

Most important changes:

Details:

  1. Use a GUI dialog box to enter a URL into your program. You can use AWT or Swing--your choice. If you use Swing, I recommend a JOptionPane as the quickest and easiest way to get the URL (see http://java.sun.com/docs/books/tutorial/uiswing/components/dialog.html for a tutorial). Note that you need to type a complete URL (typically starting with http://) into the dialog box. Note also that a File dialog is not what you need!
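One possible shape for this step, as a minimal sketch rather than a required design: the class name UrlPrompt, the method names, and the "looks complete" check are all illustrative assumptions. The only library calls used are JOptionPane.showInputDialog and JOptionPane.showMessageDialog.

```java
import javax.swing.JOptionPane;

final class UrlPrompt {

    /** True if the text starts with a scheme such as http:// followed by something. */
    static boolean looksComplete(String text) {
        return text != null
            && text.trim().matches("(?i)[a-z][a-z0-9+.-]*://\\S+");
    }

    /** Asks until the user types a complete URL; returns null if the user cancels. */
    static String askForUrl() {
        while (true) {
            String text = JOptionPane.showInputDialog(
                null, "Enter a complete URL (for example, http://www.upenn.edu/):");
            if (text == null) {
                return null;                 // Cancel button or closed dialog
            }
            if (looksComplete(text)) {
                return text.trim();
            }
            JOptionPane.showMessageDialog(
                null, "That URL is incomplete; it should start with http://");
        }
    }
}
```

The check is deliberately loose: it only confirms that a protocol is present, and leaves real validation to the code that opens the connection.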

  2. Go to that URL and get the HTML file that is there. (Assume that you get valid HTML; don't bother with any unnecessary error checking.) See the example at TryURL.java to get the necessary combination of I/O calls. (Note: A 403 return code means that they don't allow you to connect this way--http://news.google.com/ is an example--just find a different site.)
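The following is a sketch of one way to read a whole page into a String; it is not the TryURL.java example, and the names PageFetcher, readAll, and fetch are made up for illustration. It assumes the platform's default character encoding is good enough for the pages you test with, and it reads the page all at once so that tags split across lines stay intact.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URI;

final class PageFetcher {

    /** Reads everything from the reader, line breaks included, into one String. */
    static String readAll(Reader in) throws IOException {
        StringBuffer page = new StringBuffer();
        char[] buffer = new char[4096];
        int n;
        while ((n = in.read(buffer)) != -1) {
            page.append(buffer, 0, n);
        }
        return page.toString();
    }

    /** Opens the URL and returns the text found there (assumption: default encoding). */
    static String fetch(String address) throws IOException {
        Reader in = new BufferedReader(
            new InputStreamReader(URI.create(address).toURL().openStream()));
        try {
            return readAll(in);
        } finally {
            in.close();
        }
    }
}
```

A call such as PageFetcher.fetch("http://www.cis.upenn.edu/~matuszek/") needs a network connection; a site that refuses the connection will cause an IOException.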

  3. Use regular expressions to find all the <img> tags and all the <a href=...> tags in the HTML page.

    1. For each <img> tag, find all the attribute-value pairs (one of them should be src=URL) and put them in a Java 1.4 HashMap. (You can use a Java 1.5 SDK, but don't use the Java 1.5-specific generics.) Give this HashMap to a filter method to determine whether to keep this tag or to discard it. The filter method should return true for tags that are to be kept, and false for those that are to be discarded.

    2. For each <a href=...> tag, assume that you have link text between the start and end tags. Find all the attribute-value pairs (one of them should be href=URL) and put them in a Java 1.4 HashMap. (Again, don't use generics.) Give this HashMap and the link text to a filter method to determine whether to keep this tag or to discard it. This should be a different filter method from the one used for images.

      • If you wish, you can also allow for a link image--an <img> tag, and possibly some whitespace--between the start and end tags. In this case, use the image's URL as the link text. Don't try to handle cases more complicated than this!
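To make the tag-finding step concrete, here is a rough sketch using java.util.regex. The class name TagFinder and the two patterns are assumptions, not the required solution. The patterns are simplified: they assume no ">" appears inside a quoted attribute value, and the anchor pattern also matches <a> tags that have no href, which a filter method could discard later.

```java
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TagFinder {

    // CASE_INSENSITIVE makes ImG match img; DOTALL lets '.' cross line breaks.
    private static final int FLAGS = Pattern.CASE_INSENSITIVE | Pattern.DOTALL;

    // "<img", one whitespace character, then everything up to the closing ">".
    static final Pattern IMG = Pattern.compile("<img\\s[^>]*>", FLAGS);

    // Group 1 is the <a ...> start tag; group 2 is whatever precedes </a>.
    static final Pattern ANCHOR =
        Pattern.compile("(<a\\s[^>]*>)(.*?)</a\\s*>", FLAGS);

    /** Returns an ArrayList of Strings, one per <img> start tag. */
    static ArrayList findImages(String html) {
        ArrayList tags = new ArrayList();
        Matcher m = IMG.matcher(html);
        while (m.find()) {
            tags.add(m.group());
        }
        return tags;
    }

    /** Returns an ArrayList of String[2] entries: {start tag, link text}. */
    static ArrayList findAnchors(String html) {
        ArrayList anchors = new ArrayList();
        Matcher m = ANCHOR.matcher(html);
        while (m.find()) {
            anchors.add(new String[] { m.group(1), m.group(2).trim() });
        }
        return anchors;
    }
}
```

Because the classes in [^>] and \s already match line breaks, a tag that is split across lines is found as long as the whole page is searched as a single String.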

    3. Remember:

      • Tags are not case sensitive: ImG is the same as img.
      • Tags may contain whitespace, for example <img src = "foo.gif"  >.
      • Tags may extend over more than a single line. This may affect the way you read in the HTML (one line at a time, or all at once).
      • The value of an attribute may be (1) enclosed in double-quote marks, (2) enclosed in single-quote marks, (3) terminated by whitespace, or (4) terminated by the ">" of the start tag.
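The points above can be folded into a single attribute pattern. This is a sketch under stated assumptions: AttributeParser is a hypothetical name, attribute names are stored in lower case (a choice made here so that lookups are simple), and attributes written without a value are skipped.

```java
import java.util.HashMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AttributeParser {

    // name = value, where value is "double-quoted", 'single-quoted',
    // or a bare run of characters ended by whitespace or by ">".
    static final Pattern ATTRIBUTE = Pattern.compile(
        "([a-z][-\\w:]*)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)'|([^\\s>]+))",
        Pattern.CASE_INSENSITIVE);

    /** Returns a HashMap from lower-case attribute name to value (both Strings). */
    static HashMap parse(String startTag) {
        HashMap attributes = new HashMap();
        Matcher m = ATTRIBUTE.matcher(startTag);
        while (m.find()) {
            String value = m.group(2);               // double-quoted form
            if (value == null) value = m.group(3);   // single-quoted form
            if (value == null) value = m.group(4);   // unquoted form
            attributes.put(m.group(1).toLowerCase(), value);
        }
        return attributes;
    }
}
```

Since there are no generics, values come back as Object and need a cast, for example (String) AttributeParser.parse(tag).get("src").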

    4. Also remember:

      • URLs may be absolute: "http://www...". In this case, you can use the URL exactly as you found it.

      • URLs may be relative: "foo/bar.html". In this case, you need to prepend the base of the URL of the page you found it on (everything up to and including the last slash). For example, if you find this URL on page http://www.xyz.com/something/index.html, you need to construct the absolute URL http://www.xyz.com/something/foo/bar.html.

      • If a path doesn't end in .htm or .html, it's probably a directory, and the file within the directory is probably index.html. For example, both http://www.cis.upenn.edu/~matuszek and http://www.cis.upenn.edu/~matuszek/ are abbreviations for http://www.cis.upenn.edu/~matuszek/index.html.
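Rather than gluing strings together by hand, one option is to let java.net.URI (available since Java 1.4) do the resolution. This is only a sketch; UrlResolver is a made-up name, and URI is strict, so a URL containing illegal characters such as spaces will cause an IllegalArgumentException that you would have to handle.

```java
import java.net.URI;

final class UrlResolver {

    /**
     * Resolves a URL found on a page against that page's own URL.
     * Absolute URLs come back unchanged; relative ones are completed.
     */
    static String resolve(String pageUrl, String found) {
        return URI.create(pageUrl).resolve(found.trim()).toString();
    }
}
```

For example, UrlResolver.resolve("http://www.xyz.com/something/index.html", "foo/bar.html") gives "http://www.xyz.com/something/foo/bar.html". Note that the resolution treats everything after the last slash of the page URL as a file name, so the directory case in the last bullet matters: make sure a directory URL ends in a slash before resolving against it.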

  4. Create two ArrayLists (ArrayList is in java.util), one containing the filtered <img> tags in the input page, and the other containing the filtered <a> tags in the input page. Each entry in an ArrayList should be a complete tag (represented as a String), ready to add to the HTML page you will construct.
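As an illustration of how a filter method and an ArrayList of tag Strings fit together, here is a sketch for the image side only. Everything in it is an assumption: the class name ImageCollector, the filter rule (keep any image that has a non-empty src), and the decision to rebuild each tag from its src alone, dropping the other attributes. It expects each HashMap to hold an already-absolute src.

```java
import java.util.ArrayList;
import java.util.HashMap;

final class ImageCollector {

    /** Hypothetical filter: true means keep the tag, false means discard it. */
    static boolean keepImage(HashMap attributes) {
        String src = (String) attributes.get("src");
        return src != null && src.length() > 0;
    }

    /** Rebuilds a complete <img> tag from the src attribute only (a simplification). */
    static String toTag(HashMap attributes) {
        return "<img src=\"" + attributes.get("src") + "\">";
    }

    /** Takes an ArrayList of HashMaps; returns an ArrayList of tag Strings. */
    static ArrayList collect(ArrayList attributeMaps) {
        ArrayList kept = new ArrayList();
        for (int i = 0; i < attributeMaps.size(); i++) {
            HashMap attributes = (HashMap) attributeMaps.get(i);
            if (keepImage(attributes)) {
                kept.add(toTag(attributes));
            }
        }
        return kept;
    }
}
```

The <a> side would look much the same, except that its filter method also receives the link text.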

  5. Construct a complete HTML page, with <head>, <title>, and <body> tags, and two suitably-labeled lists of <a href=...> links. One list will contain the filtered <a> tags, and the other list will contain links to the filtered images.
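A page like this can be assembled with a StringBuffer. The sketch below is one possible layout, not a required one: the class name PageBuilder, the headings "Links" and "Images", and the use of <ul> lists are all choices made for illustration. It assumes you pass in complete <a> tags for the first list and absolute image URLs for the second.

```java
import java.util.ArrayList;

final class PageBuilder {

    /** Appends a heading followed by a bulleted list of the given Strings. */
    private static void appendList(StringBuffer page, String heading, ArrayList items) {
        page.append("<h2>").append(heading).append("</h2>\n<ul>\n");
        for (int i = 0; i < items.size(); i++) {
            page.append("  <li>").append((String) items.get(i)).append("</li>\n");
        }
        page.append("</ul>\n");
    }

    /** Builds the whole page from complete <a> tags and absolute image URLs. */
    static String build(String title, ArrayList anchorTags, ArrayList imageUrls) {
        ArrayList imageLinks = new ArrayList();
        for (int i = 0; i < imageUrls.size(); i++) {
            String url = (String) imageUrls.get(i);
            imageLinks.add("<a href=\"" + url + "\">" + url + "</a>");
        }
        StringBuffer page = new StringBuffer();
        page.append("<html>\n<head><title>").append(title)
            .append("</title></head>\n<body>\n");
        appendList(page, "Links", anchorTags);
        appendList(page, "Images", imageLinks);
        page.append("</body>\n</html>\n");
        return page.toString();
    }
}
```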

  6. Use a dialog box to allow the user to save the newly-constructed HTML page to a file.
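For the save step, a file dialog is the right tool. The following is a minimal Swing sketch with hypothetical names (PageSaver, write, saveWithDialog); it uses JFileChooser.showSaveDialog and writes with the platform's default encoding, which is an assumption you may want to revisit.

```java
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;
import javax.swing.JFileChooser;

final class PageSaver {

    /** Writes the HTML text to the given file, replacing any existing content. */
    static void write(File file, String html) throws IOException {
        Writer out = new FileWriter(file);
        try {
            out.write(html);
        } finally {
            out.close();
        }
    }

    /** Lets the user pick a file; returns it, or null if the user cancels. */
    static File saveWithDialog(String html) throws IOException {
        JFileChooser chooser = new JFileChooser();
        if (chooser.showSaveDialog(null) != JFileChooser.APPROVE_OPTION) {
            return null;
        }
        File file = chooser.getSelectedFile();
        write(file, html);
        return file;
    }
}
```

Keeping write separate from the dialog makes it possible to test the file output without a GUI.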

So that we can more easily test your program,


You are welcome to do this by yourself, or with a partner. If you work with a partner, turn in only one copy of your program, with both your names clearly indicated (preferably on the resultant HTML page).

We will test your program with both simple tags and with tags containing extra whitespace, tags that are split across lines, etc.

Due date:

Please turn in your program via Blackboard before midnight, Monday, September 27.