CIS 554 Site Downloader
Fall 2013, David Matuszek

Purposes of this assignment

General idea of the assignment

Given the URL of a page that contains zero or more images and references zero or more subpages, download all web pages (.htm and .html) and images (.gif, .jpg, and .png) from that site. Recreate the site directory structure on your computer, so that links work. Do not download files of any other type, or any offsite files. Provide a report of files downloaded, offsite files, and broken links.
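
As a rough illustration of the filtering described above, here is a minimal sketch. The names FileFilter, hasWantedExtension, and isOnsite are made up for this example, and treating "onsite" as "same host as the starting URL" is an assumption, not something the assignment states. The two-argument java.net.URL constructor resolves a link relative to the page it appeared on.

```scala
import java.net.URL

object FileFilter {
  // Extensions the assignment says to download.
  val wantedExtensions = Set(".htm", ".html", ".gif", ".jpg", ".png")

  /** True if the URL's path ends in one of the wanted extensions (case-insensitive). */
  def hasWantedExtension(url: URL): Boolean = {
    val path = url.getPath.toLowerCase
    wantedExtensions.exists(ext => path.endsWith(ext))
  }

  /** True if `url` has the same host as `start`.
    * ASSUMPTION: "onsite" means "same host"; adjust if your definition differs. */
  def isOnsite(start: URL, url: URL): Boolean =
    url.getHost.equalsIgnoreCase(start.getHost)
}
```

For example, with val start = new URL("http://example.com/site/index.html"), the link "images/pic.gif" resolves via new URL(start, "images/pic.gif") to http://example.com/site/images/pic.gif, which passes both checks.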

In more detail

Here's the recommended program structure. Variations are okay, as long as the behavior is essentially the same.

A very brief introduction to HTML

HTML is what web pages are written in.

HTML contains elements; each element consists of a tag with zero or more attributes, enclosed in angle brackets. Tags and attributes are not case sensitive. Attributes may occur in any order. An attribute consists of a name, an equals sign, and a value. The value is supposed to be in quotes, either single or double, but often isn't. When a value isn't in quotes, it is terminated by a blank or by the closing angle bracket.
      Example: <TABLE summary="" class='data' Border = 1 cellSpacing="0">
where "table" is the tag, and "summary", "class", "border", and "cellspacing" are the names of the attributes.
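
To make the attribute rules above concrete, here is one possible sketch of pulling the tag and attributes out of a single element without regular expressions. TagParser and parse are hypothetical names, and this is only one way to do it. It lowercases tag and attribute names, accepts single-quoted, double-quoted, or unquoted values, and tolerates blanks around the equals sign.

```scala
object TagParser {
  /** Parses one element such as `<TABLE class='data' Border = 1>` into
    * (tag name, attribute map). Names are lowercased; values are kept as written. */
  def parse(element: String): (String, Map[String, String]) = {
    val s = element.trim.stripPrefix("<").stripSuffix(">").trim
    val n = s.length
    var i = 0

    def skipBlanks(): Unit = while (i < n && s(i).isWhitespace) i += 1
    def readWhile(p: Char => Boolean): String = {
      val start = i
      while (i < n && p(s(i))) i += 1
      s.substring(start, i)
    }

    val tag = readWhile(c => !c.isWhitespace).toLowerCase
    var attrs = Map.empty[String, String]
    skipBlanks()
    while (i < n) {
      val name = readWhile(c => !c.isWhitespace && c != '=').toLowerCase
      skipBlanks()
      if (i < n && s(i) == '=') {
        i += 1 // step over the equals sign
        skipBlanks()
        val value =
          if (i < n && (s(i) == '"' || s(i) == '\'')) {
            val quote = s(i)
            i += 1 // step over the opening quote
            val v = readWhile(_ != quote)
            if (i < n) i += 1 // step over the closing quote, if present
            v
          } else readWhile(c => !c.isWhitespace) // unquoted: ends at a blank
        attrs += (name -> value)
      } else {
        attrs += (name -> "") // attribute given without a value
      }
      skipBlanks()
    }
    (tag, attrs)
  }
}
```

Applied to the example element above, this yields the tag "table" and the attributes summary, class, border, and cellspacing with the values "", "data", "1", and "0".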

The relevant tags and attributes are:

There are several kinds of URLs:

If you need to know more, visit w3schools.

You may make the following simplifying (though incorrect) assumptions:

These assumptions are intended to minimize the amount of string manipulation required, especially if you don't know regular expressions. Feel free to do a better job handling HTML elements, but you will not get any extra points for doing so.

If you do know regular expressions, be aware that they can take time that is exponential in the length of the string being processed. If your program exhibits exponential time behavior (takes unreasonably long to run), it will be penalized severely. Regular expressions are powerful, but remember the adage: "With great power comes great responsibility."
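
If you would rather avoid regular expressions altogether, a single left-to-right pass with indexOf is enough to pull the elements out of a page, and it takes time linear in the length of the page. The sketch below is only an illustration; TagScanner and elements are made-up names, and it naively assumes that a '>' never appears inside an attribute value or a comment.

```scala
object TagScanner {
  /** Returns every `<...>` element in `html`, in order of appearance.
    * Each indexOf call starts where the previous one stopped, so the
    * whole scan looks at each character a bounded number of times. */
  def elements(html: String): List[String] = {
    val found = List.newBuilder[String]
    var open = html.indexOf('<')
    while (open >= 0) {
      val close = html.indexOf('>', open + 1)
      if (close < 0) {
        open = -1 // unterminated element: stop scanning
      } else {
        found += html.substring(open, close + 1)
        open = html.indexOf('<', close + 1)
      }
    }
    found.result()
  }
}
```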

Useful code bits

Reading from a URL
      val url = new
      val inStream = url.openStream
      var byte = inStream.read()

Writing to a file
      val file = new
      val outStream = new FileOutputStream(file)

Choosing a directory
      val chooser = new FileChooser()
      chooser.fileSelectionMode = FileChooser.SelectionMode.DirectoriesOnly
      val theFileResult = chooser.showSaveDialog(null)

Creating a directory
      val newDirectory = new File(fullPathName)
      val success = newDirectory.mkdir()
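
Putting the reading, writing, and directory bits together, here is a minimal sketch of saving one file. Downloader, save, and download are hypothetical names, and the 8192-byte buffer size is an arbitrary choice. It reads a block of bytes at a time rather than one byte per call, and it closes both streams even if something goes wrong.

```scala
import java.io.{File, FileOutputStream, InputStream}
import java.net.URL

object Downloader {
  /** Copies everything from `in` to `target`, creating any missing
    * parent directories first. Closes both streams when done. */
  def save(in: InputStream, target: File): Unit =
    try {
      Option(target.getParentFile).foreach(_.mkdirs())
      val out = new FileOutputStream(target)
      try {
        val buffer = new Array[Byte](8192)
        var count = in.read(buffer)
        while (count != -1) { // read returns -1 at end of stream
          out.write(buffer, 0, count)
          count = in.read(buffer)
        }
      } finally out.close()
    } finally in.close()

  /** Downloads the contents of `url` into `target`. */
  def download(url: URL, target: File): Unit = save(url.openStream(), target)
}
```

Note that mkdir() creates only the last directory in a path and fails if its parent is missing, while mkdirs() also creates any missing parent directories, which may be handier when recreating a nested site structure.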


The program should work as specified.

For up to 20 points extra credit, skip the Console I/O and make a Scala GUI for the program.

No mechanical deductions or bonuses for vars, etc. However, we reserve the right to give bonuses or deductions according to our subjective estimates of the quality of your code, especially the extent to which it embraces or violates "the Scala way." Please note that, in this context, "subjective" means we liked it or didn't like it, and as such is not open to arguments about points.

Due date

6am Monday, December 2. Zip up the entire project and submit via Canvas. Once again, remember that I do not accept email submissions!