CIT 597 Link Extractor
Fall 2004, David Matuszek

Purposes of this assignment:

General idea of the assignment:

Read in an HTML page, find all the <img> and <a> tags in it, and produce a new HTML page containing a subset of those tags.


Read the HTML into your program. For now, you can assume that input comes from a file, but we will shortly be reading pages directly from the Web, so you should isolate the input in a single method that you can change easily.
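One way to isolate the input might look like the sketch below. The class name PageReader and the method name readPage are made up for illustration, and the sketch assumes Java 7 or later and UTF-8 input; neither assumption comes from the assignment.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Hypothetical helper: the only place in the program that knows where pages come from. */
class PageReader {

    /**
     * Returns the full text of the page named by source.
     * For now, source is treated as a local file path (assumption: UTF-8 encoded).
     * Reading from the Web later should only require changing the body of this method.
     */
    static String readPage(String source) throws IOException {
        byte[] bytes = Files.readAllBytes(Paths.get(source));
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

The rest of the program would call PageReader.readPage(...) and never open files itself, so a later switch to URLs stays a one-method change.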

Create two ArrayLists (ArrayList is in java.util), one containing some of the <img> tags in the input page, and the other containing some of the <a> tags in the input page. Use regular expressions to find these tags, but remember:

As you find each tag, call a filter method to decide whether to keep it. Use one filter method for the <a> tags, and a different (but similar) method for the <img> tags. Each method should return true if the tag is to be kept, and false otherwise. For example, you may want to keep .jpg files but not .gif or .png files; or you may want to eliminate hyperlinks that refer to a specific part of a page (that is, links whose URL contains a #). The filter should be given all the attributes of the tag and, for <a> links, it should also be given the link text (the text between the <a> and the </a>).
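The two filters might be sketched as follows. Everything here is an assumption made for illustration: the names TagFilters, keepImage, and keepLink, the choice to pass attributes as a Map with lowercase attribute names as keys, and the particular keep/discard policies (which just mirror the examples above).

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical filters; the policies shown are examples, not requirements. */
class TagFilters {

    /** Keeps an <img> tag only if its src attribute names a .jpg file. */
    static boolean keepImage(Map<String, String> attributes) {
        String src = attributes.get("src");
        return src != null && src.toLowerCase(Locale.ROOT).endsWith(".jpg");
    }

    /** Keeps an <a> tag unless it has no href, its href contains '#', or its link text is blank. */
    static boolean keepLink(Map<String, String> attributes, String linkText) {
        String href = attributes.get("href");
        if (href == null || href.contains("#")) {
            return false;
        }
        return !linkText.trim().isEmpty();
    }
}
```

Because both methods only return true or false, the code that finds tags does not need to know what the policy is; changing what is kept means editing one method.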

Finally, create a new HTML page with your results. Your page should have fairly simple <head> and <body> parts, and the <body> part should contain two unordered lists:
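As a rough sketch, the result page could be assembled with a StringBuilder. The class name ResultPage, the page title, and the choice of one list for the kept <img> tags followed by one for the kept <a> tags are all assumptions made for illustration.

```java
import java.util.List;

/** Hypothetical page builder: a simple head, then one <ul> per list of kept tags. */
class ResultPage {

    /** Returns a complete HTML page listing the given tags, each inside its own <li>. */
    static String build(List<String> imgTags, List<String> linkTags) {
        StringBuilder page = new StringBuilder();
        page.append("<html>\n<head><title>Extracted tags</title></head>\n<body>\n");
        appendList(page, imgTags);   // assumption: images first
        appendList(page, linkTags);  // assumption: links second
        page.append("</body>\n</html>\n");
        return page.toString();
    }

    /** Appends one unordered list with one item per tag. */
    private static void appendList(StringBuilder page, List<String> tags) {
        page.append("<ul>\n");
        for (String tag : tags) {
            page.append("  <li>").append(tag).append("</li>\n");
        }
        page.append("</ul>\n");
    }
}
```

The tags are written out unchanged, so relative URLs in them will only resolve if the new page is viewed from the same location as the original.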

You are welcome to do this by yourself, or with a partner. If you work with a partner, turn in only one copy of your program, with both your names clearly indicated (preferably on the resultant HTML page).

We will test your program with both "normal" tags and with "abnormal" tags (tags containing extra whitespace, split across lines, etc.).
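To illustrate what tolerating "abnormal" tags can mean in practice, here is one possible pair of patterns. The class and method names (TagFinder, findImages, findAnchors) are hypothetical, and the patterns are only a sketch: they ignore case and accept extra whitespace and line breaks inside a tag, but they would be confused by a > inside a quoted attribute value.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical tag finder using java.util.regex. */
class TagFinder {

    // <img ...> : [^>]* also matches line breaks, so a tag split across lines is still found.
    private static final Pattern IMG =
            Pattern.compile("<img\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // <a ...>text</a> : DOTALL lets the link text span lines; .*? stops at the first </a>.
    // \b keeps the pattern from matching other tags that start with "a", such as <abbr>.
    private static final Pattern ANCHOR =
            Pattern.compile("<a\\b[^>]*>.*?</a\\s*>",
                    Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns every <img> tag in html, exactly as written in the input. */
    static List<String> findImages(String html) {
        return findAll(IMG, html);
    }

    /** Returns every <a>...</a> element in html, including its link text. */
    static List<String> findAnchors(String html) {
        return findAll(ANCHOR, html);
    }

    private static List<String> findAll(Pattern pattern, String html) {
        List<String> found = new ArrayList<>();
        Matcher m = pattern.matcher(html);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }
}
```

Each match could then be passed to the appropriate filter before being added to its ArrayList.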

Due date:

Please turn in your program via Blackboard before midnight on Monday, September 20.