CIS 700 Downloader
David Matuszek, Summer II, 2010

Purposes of this assignment

General idea of the assignment

Given a URL, download all web pages (.htm and .html) and all images (.gif, .jpg, and .png) from that site. Recreate the site directory structure on your computer. Do not download files of any other type, or any offsite files.
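As a rough illustration of the rules above, here is a minimal sketch of how a URL might be accepted or rejected, and how it might be mapped to a local file. The names SiteFilter, isWanted, and localFileFor are hypothetical, and the sketch assumes that "the same site" simply means "the same host name"; you may want a stricter or looser definition.

```scala
import java.io.File
import java.net.URL

object SiteFilter {
  // The file types the assignment asks for; everything else is skipped.
  private val wantedExtensions = Set("htm", "html", "gif", "jpg", "png")

  /** True if `url` is on the starting host and has a wanted extension.
    * Assumption: "same site" means "same host name". */
  def isWanted(url: URL, siteHost: String): Boolean = {
    val path = url.getPath.toLowerCase
    val dot = path.lastIndexOf('.')
    val ext = if (dot >= 0) path.substring(dot + 1) else ""
    url.getHost.equalsIgnoreCase(siteHost) && wantedExtensions.contains(ext)
  }

  /** Maps a URL path such as /a/b/page.html to root/a/b/page.html,
    * so the local directory tree mirrors the site. */
  def localFileFor(root: File, url: URL): File =
    new File(root, url.getPath.stripPrefix("/"))
}
```

For example, SiteFilter.isWanted(new URL("http://example.com/a/b.pdf"), "example.com") is false because .pdf is not one of the wanted types.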

In more detail

Here's a program structure that I think will work fairly well. Feel free to deviate from it, as long as you are using Actors in a reasonably appropriate fashion.

Main program

DownloadManager

Downloader

For each URL gotten (from the queue of URLs to be processed):

Parser

For each page received (from the Downloader):

A very brief introduction to HTML

HTML is what web pages are written in.

HTML contains elements; each element consists of a tag with zero or more attributes, enclosed in angle brackets. Tags and attributes are not case sensitive. Attributes may occur in any order. An attribute consists of a name, an equals sign, and a value. The value is supposed to be in quotes, either single or double, but often isn't. When a value isn't in quotes, it is terminated by a blank or by the closing angle bracket.
      Example: <TABLE summary="" class='data' Border = 1 cellSpacing="0">
where "table" is the tag, and "summary", "class", "border", and "cellspacing" are the names of the attributes.
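To make the quoting rules concrete, here is a minimal sketch that pulls the attributes out of a single tag. It is only one possible approach, and the names TagAttributes and parse are hypothetical. It accepts double-quoted, single-quoted, and unquoted values, and lower-cases the attribute names because they are not case sensitive.

```scala
object TagAttributes {
  // name = "double-quoted" | 'single-quoted' | unquoted (ends at a blank or '>')
  private val Attr =
    """([A-Za-z][\w-]*)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))""".r

  /** Returns the attribute names (lower-cased) mapped to their values. */
  def parse(tag: String): Map[String, String] =
    Attr.findAllMatchIn(tag).map { m =>
      // Exactly one of groups 2-4 is non-null, depending on the quoting style.
      val value = Seq(m.group(2), m.group(3), m.group(4)).find(_ != null).getOrElse("")
      m.group(1).toLowerCase -> value
    }.toMap
}
```

Applied to the example tag above, parse yields the four attributes summary, class, border, and cellspacing, with the values "", "data", "1", and "0".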

The relevant tags and attributes are:

There are several kinds of URLs:

If you need to know more, visit w3schools.
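Whatever kind of URL a link contains, java.net.URL can resolve it against the URL of the page on which it was found. The sketch below shows one way to do that; LinkResolver and resolve are hypothetical names, and the example.com addresses in the usage note are made up for illustration.

```scala
import java.net.{MalformedURLException, URL}

object LinkResolver {
  /** Resolves `link` as found on the page at `pageUrl`.
    * Returns None if the result is not a URL that Java can handle. */
  def resolve(pageUrl: URL, link: String): Option[URL] =
    try Some(new URL(pageUrl, link))
    catch { case _: MalformedURLException => None }
}
```

With a page at http://example.com/courses/index.html, the link notes/page1.html resolves to http://example.com/courses/notes/page1.html, the link /images/logo.png resolves to http://example.com/images/logo.png, and a link that is already a full URL is returned unchanged.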

You may make the following simplifying (though incorrect) assumptions:

These assumptions are intended to minimize the amount of string manipulation required, especially for students who don't know regular expressions. Feel free to do a better job handling HTML elements, but you will not get any extra points for doing so.

Warning

You should read and understand "About /robots.txt" before trying out your program. Following that, you should look at the robots.txt file for the university. If you don't obey the rules there, you may get Penn IP addresses banned, which in turn may get you into trouble with the university.

I will not require your program to read the robots.txt file, but you should read it and not break the rules.

A relatively safe place to try out your robot is any one of my old course pages at Penn. You can find a list of these at http://www.cis.upenn.edu/~matuszek/index.html. Any one of these courses is probably a good size to use for testing (except CIS700, which uses a wiki instead), but you probably don't want to download material from all 32 courses.

Useful code bits

Reading from a URL
    val url = new java.net.URL(from)   // from: the URL, as a String
    val inStream = url.openStream()
    var byte = inStream.read()         // one byte, as an Int; -1 at end of stream
Writing to a file
    val file = new java.io.File(to)    // to: the local path, as a String
    val outStream = new java.io.FileOutputStream(file)
    outStream.write(byte)              // writes the low-order 8 bits of byte
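Choosing a directory
    import scala.swing.FileChooser
    val chooser = new FileChooser()
    chooser.fileSelectionMode = FileChooser.SelectionMode.DirectoriesOnly
    val theFileResult = chooser.showSaveDialog(null)
    // If theFileResult is FileChooser.Result.Approve, chooser.selectedFile is the directory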
Creating a directory
    val newDirectory = new java.io.File(fullPathName)
    val success = newDirectory.mkdir()   // true only if the directory was created
Regular expression to match a link to an HTML page
    val HtmlPattern =
      """(?i).*<\s*(a)\s+.*href\s*=\s*['"]?([^'" ]+\.html?)['" >].*""".r
Regular expression to match a link to an image
    val ImgPattern =
      """(?i).*<\s*(img)\s+.*src\s*=\s*['"]?([^'" ]+\.(gif|jpg|png))['" >].*""".r
Using the regular expressions to get the tag and URL
    // string: one line of the page being scanned
    string match {
      case HtmlPattern(tag, url) =>        // two groups: tag, url
        println("tag: " + tag + ", url: " + url)
      case ImgPattern(tag, url, ext) =>    // three groups: tag, url, extension
        println("tag: " + tag + ", url: " + url)
      case x => println("*** No match: " + x + "\n")
    }
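The reading and writing bits above handle a single byte. As a sketch of how they might be combined to save a whole file, the hypothetical Saver.save below copies a stream to a file in chunks. It assumes you want missing parent directories created along the way, so it uses mkdirs rather than mkdir.

```scala
import java.io.{File, FileOutputStream, InputStream}

object Saver {
  /** Copies everything from `in` to `target` and returns the number of bytes copied.
    * Closes both streams when done. */
  def save(in: InputStream, target: File): Long = {
    // Unlike mkdir, mkdirs also creates any missing ancestor directories.
    Option(target.getParentFile).foreach(_.mkdirs())
    val out = new FileOutputStream(target)
    try {
      val buffer = new Array[Byte](8192)
      var total = 0L
      var n = in.read(buffer)
      while (n != -1) {              // read returns -1 at end of stream
        out.write(buffer, 0, n)
        total += n
        n = in.read(buffer)
      }
      total
    } finally {
      out.close()
      in.close()
    }
  }
}
```

A call might look like Saver.save(new java.net.URL(from).openStream(), new java.io.File(to)), where from and to are the same strings used in the bits above.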

Grading

The program should work as specified.

For up to 15 points extra credit, give the user the ability to stop all the actors (by sending them an appropriate message) and quit the program cleanly, if the user types q or Q.
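One possible shape for the keyboard side of the extra credit is sketched below. It says nothing about the actors themselves: onQuit is a hypothetical callback in which your program would send its own stop message to each actor. QuitWatcher and watch are also hypothetical names. Since readLine blocks, you would probably run this in its own thread or actor.

```scala
import java.io.BufferedReader

object QuitWatcher {
  /** Reads lines until one is "q" or "Q", or until input ends.
    * Runs `onQuit` and returns true only if the user asked to quit. */
  def watch(in: BufferedReader)(onQuit: () => Unit): Boolean = {
    val asked = Iterator.continually(in.readLine())
      .takeWhile(_ != null)                  // readLine returns null at end of input
      .exists(_.trim.equalsIgnoreCase("q"))  // stops reading at the first q or Q
    if (asked) onQuit()
    asked
  }
}
```

To watch the keyboard, pass new BufferedReader(new java.io.InputStreamReader(System.in)) as the reader.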

No mechanical deductions or bonuses for vars, etc. However, we reserve the right to give bonuses or deductions according to our subjective estimates of the quality of your code, especially the extent to which it embraces or violates "the Scala way." Please note that, in this context, "subjective" means we liked it or didn't like it, and as such is not open to arguments about points.

Due date

Midnight Saturday, August 14 (changed from Wednesday, August 11), via Blackboard. Late submissions will be accepted for only a short time after that.