CIS 554 Scala Assignment 3: Index Dave's website
Fall 2012, David Matuszek

Purposes of this assignment

General idea of the assignment

Crawl my website (starting from http://www.cis.upenn.edu/~matuszek) and produce an "index" of the pages, lectures, and images that you find there. Web pages have the extension .htm or .html, images have one of the extensions .gif, .jpg, or .png, and lectures have the extension .ppt or .pptx.
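The extension rules above can be sketched as a small classifier. This is only an illustration: the names UrlKind, Kind, and classify are made up here, and treating extensions as case-insensitive is an assumption, not something the assignment states.

```scala
object UrlKind {
  sealed trait Kind
  case object Page extends Kind
  case object Image extends Kind
  case object Lecture extends Kind
  case object Other extends Kind

  // Classify a URL by the extension of its last path segment.
  // Assumption: extensions are compared case-insensitively.
  def classify(url: String): Kind = {
    val name = url.substring(url.lastIndexOf('/') + 1).toLowerCase
    val ext = name.lastIndexOf('.') match {
      case -1 => ""                      // no dot, so no extension
      case i  => name.substring(i + 1)
    }
    ext match {
      case "htm" | "html"        => Page
      case "gif" | "jpg" | "png" => Image
      case "ppt" | "pptx"        => Lecture
      case _                     => Other
    }
  }
}
```

Anything classified as Other (for example a .zip or .pdf link) would simply not be reported to any collector.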

In more detail

When I refer to the "name" of a file, I mean everything after the last slash in the URL. When I refer to "the URL", I mean everything after the http:// -- in the case of a relative URL, this means reconstructing the absolute URL.

The absolute URL can be formed by taking the URL of the page you are at, removing everything after the last slash (if there is anything after the last slash), and appending the relative URL to it.
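A minimal sketch of these definitions follows. The object and method names (Urls, nameOf, absolute, withoutScheme) are hypothetical, and the sketch assumes that any link starting with http:// or https:// is already absolute. It applies the rule above literally, so it does not collapse .. segments.

```scala
object Urls {
  // The "name" of a file: everything after the last slash in the URL.
  def nameOf(url: String): String =
    url.substring(url.lastIndexOf('/') + 1)

  // Resolve a link found on the page at pageUrl.
  // Assumption: links that start with a scheme are already absolute.
  def absolute(pageUrl: String, link: String): String =
    if (link.startsWith("http://") || link.startsWith("https://")) link
    else pageUrl.substring(0, pageUrl.lastIndexOf('/') + 1) + link

  // "The URL" in the sense used above: everything after the http://
  def withoutScheme(url: String): String = url.stripPrefix("http://")
}
```

Note that under this rule a directory URL written without a trailing slash loses its last segment, so the starting URL probably needs a trailing slash added before relative links are resolved against it.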

Here's a suggested program structure that I think will work fairly well. Feel free to deviate from it, as long as you are using Actors in a reasonably appropriate fashion.

Main program

Actors

Image collector

One actor should collect URLs of all the images that are reported to it. Since images are often reused, only one image with a given name should be kept. You do not need to check if images are the "same."
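The bookkeeping for this collector might look like the sketch below. Only the state update is shown; the actor that would own the map and call add for each message is left out. The name ImageIndex is hypothetical, and keeping the first URL seen for a name is an assumption, since the text does not say which duplicate to keep.

```scala
object ImageIndex {
  private def nameOf(url: String): String =
    url.substring(url.lastIndexOf('/') + 1)

  // index maps image name -> URL.
  // Assumption: the first URL reported for a name wins; later ones are dropped.
  def add(index: Map[String, String], url: String): Map[String, String] = {
    val name = nameOf(url)
    if (index.contains(name)) index else index + (name -> url)
  }
}
```

Because add takes the old map and returns the new one, a batch of URLs can be folded in with urls.foldLeft(Map.empty[String, String])(ImageIndex.add).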

Lecture collector

One actor should collect URLs of all the lectures that are reported to it. I usually prefix each lecture with one or two digits and a hyphen. Since I rearrange lectures each year, the prefix should be ignored when comparing the names of lectures to see whether they are the same. Since I reuse lectures but also frequently update them, the most recent lecture with a given name should be the one whose URL is kept. (Since most of my URLs contain a year number, you can use this.)
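One way to sketch the comparison rules for lectures is shown below. The names LectureIndex, key, year, and add are hypothetical. The sketch assumes that the year is a four-digit number starting with 19 or 20 somewhere in the URL, that a URL with no year counts as oldest, and that the file extension is part of the name (so x.ppt and x.pptx are different lectures).

```scala
object LectureIndex {
  private val Prefix = """^\d{1,2}-""".r      // one or two digits and a hyphen
  private val Year   = """(?:19|20)\d\d""".r  // assumption: years look like 19xx or 20xx

  // Lecture name with the numeric prefix removed.
  def key(url: String): String = {
    val name = url.substring(url.lastIndexOf('/') + 1)
    Prefix.replaceFirstIn(name, "")
  }

  // Largest year found anywhere in the URL, or 0 if there is none.
  def year(url: String): Int =
    Year.findAllIn(url).map(_.toInt).foldLeft(0)(_ max _)

  // index maps lecture name -> URL; the URL with the later year is kept.
  def add(index: Map[String, String], url: String): Map[String, String] = {
    val k = key(url)
    index.get(k) match {
      case Some(old) if year(old) >= year(url) => index
      case _                                   => index + (k -> url)
    }
  }
}
```

As with the image collector, only the state update is shown; the actor would hold the map and apply add to each reported URL.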

Web page collector

One actor should collect URLs of the web pages that are reported to it. The following URLs should be discarded:

Web page processors

Every time a new web page is encountered, a new actor should be created to process it.

Processing consists of

You probably need to check for cycles in the links, as these could cause your program to run forever. Discarding URLs that begin with .. will prevent many, but not all, cycles.
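The usual guard against cycles is a set of URLs already seen. The sketch below shows the idea sequentially, without actors. Crawl and reachable are hypothetical names, and linksOf stands in for "fetch the page and extract its links", which is not implemented here.

```scala
object Crawl {
  // Returns every URL reachable from start, visiting each one only once.
  // linksOf is a hypothetical stand-in for fetching a page and listing its links.
  def reachable(start: String, linksOf: String => List[String]): Set[String] = {
    @annotation.tailrec
    def loop(pending: List[String], seen: Set[String]): Set[String] =
      pending match {
        case Nil                      => seen
        case url :: rest if seen(url) => loop(rest, seen)   // already visited: skip
        case url :: rest              => loop(linksOf(url) ::: rest, seen + url)
      }
    loop(List(start), Set.empty)
  }
}
```

In an actor-based version, one plausible design is for the web page collector to own the seen set and to create a new processor only for URLs it has not recorded before.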

In less detail

The point of this assignment is to learn to use Actors (and to use them appropriately). The above structure is basically a Map-Reduce style structure, and I think it should work out okay; but if you think there's a better way to use lots of Actors, go ahead and do things differently.

However you structure your program, the output should be one file, containing three clearly labeled sections, for pages, lectures, and images. Each section should be alphabetical by name (not by URL), with no duplicate names.
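A sketch of producing that output from three name-to-URL maps follows. The names Report, section, render, and save are hypothetical, as are the section titles and the line layout. Sorting case-insensitively is an assumption. Keying each map by name is what guarantees there are no duplicate names.

```scala
object Report {
  // One labeled section, sorted by name (assumption: ignoring case).
  def section(title: String, index: Map[String, String]): String = {
    val lines = index.toList
      .sortBy { case (name, _) => name.toLowerCase }
      .map { case (name, url) => name + "  " + url }
    (title :: lines).mkString("\n")
  }

  // The whole report: three sections separated by blank lines.
  def render(pages: Map[String, String],
             lectures: Map[String, String],
             images: Map[String, String]): String =
    List(section("PAGES", pages),
         section("LECTURES", lectures),
         section("IMAGES", images)).mkString("\n\n") + "\n"

  // Write the report to a single file.
  def save(path: String, text: String): Unit = {
    val out = new java.io.PrintWriter(path)
    try out.write(text) finally out.close()
  }
}
```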

A very brief introduction to HTML

HTML is what web pages are written in.

HTML contains elements; each element consists of a tag with zero or more attributes, enclosed in angle brackets. Tags and attributes are not case sensitive. Attributes may occur in any order. An attribute consists of a name, an equals sign, and a value. The value is supposed to be in quotes, either single or double, but often isn't. When a value isn't in quotes, it is terminated by a blank or by the closing angle bracket.
      Example: <TABLE summary="" class='data' Border = 1 cellSpacing="0">
where "table" is the tag, and "summary", "class", "border", and "cellspacing" are the names of the attributes.

The relevant tags and attributes are:

  • a, usually denoting a web page, with a possible href attribute.
    • If the a tag has no href attribute, it should be ignored.
    • If the a tag has an href attribute, the value of the href attribute is some kind of URL.
  • img, denoting an image, with a "required" src attribute
    • The value of the src attribute is some kind of URL.
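The attribute rules described above (any order, any case, values in double quotes, single quotes, or no quotes) can be sketched with one regular expression. Attrs and parse are hypothetical names, and this is an approximation that handles only a single tag at a time, not a full HTML parser.

```scala
object Attrs {
  // name = "value"  |  name = 'value'  |  name = value (ends at a blank or '>')
  private val Attr =
    """(?i)([a-z][a-z0-9-]*)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))""".r

  // Attribute names are lowercased, since they are not case sensitive.
  def parse(tag: String): Map[String, String] =
    Attr.findAllMatchIn(tag).map { m =>
      // Exactly one of groups 2 to 4 took part in the match; the others are null.
      val value = List(m.group(2), m.group(3), m.group(4)).find(_ != null).getOrElse("")
      m.group(1).toLowerCase -> value
    }.toMap
}
```

With this, Attrs.parse(tag).get("href") is None for an a tag that has no href attribute, which is the case that should be ignored.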

If you need to know more, visit w3schools.

Possibly useful code bits

I believe the following code bits to be correct, as far as they go; but use at your own risk.

Reading from a URL

    val url = new java.net.URL(from)   // from: the URL, as a String
    val inStream = url.openStream
    var byte = inStream.read()         // next byte as an Int, or -1 at end of stream

Writing to a file

    val file = new java.io.File(to)    // to: the path of the output file
    val outStream = new java.io.FileOutputStream(file)
    outStream.write(byte)

Choosing a directory

    import scala.swing.FileChooser

    val chooser = new FileChooser()
    chooser.fileSelectionMode = FileChooser.SelectionMode.DirectoriesOnly
    val theFileResult = chooser.showSaveDialog(null)

Creating a directory

    val newDirectory = new java.io.File(fullPathName)
    val success = newDirectory.mkdir() // false if the directory could not be created

Regular expression to match a link to an HTML page

    val HtmlPattern =
      """(?i).*<\s*(a)\s+.*href\s*=\s*['"]?([^'" ]+\.html?)['" >].*""".r

Regular expression to match a link to an image

    val ImgPattern =
      """(?i).*<\s*(img)\s+.*src\s*=\s*['"]?([^'" ]+\.(gif|jpg|png))['" >].*""".r

Using the regular expressions to get the tag and URL

    // HtmlPattern has two capture groups; ImgPattern has three.
    string match {
      case HtmlPattern(tag, url) =>
        println("tag: " + tag + ", url: " + url)
      case ImgPattern(tag, url, ext) =>
        println("tag: " + tag + ", url: " + url)
      case x => println("*** No match: " + x + "\n")
    }
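Used in a match, each pattern above has to match the entire string, and it yields at most one link per string even when a line contains several. A possible alternative is an unanchored pattern applied with findAllMatchIn, sketched below. Links, Link, and extract are hypothetical names, and this pattern is a looser approximation than the ones above: it does not check the file extension, and it would also accept attribute names that merely end in href or src.

```scala
object Links {
  // An a or img tag, then the value of its href or src attribute.
  // [^>]*? keeps the search for the attribute inside the same tag.
  private val Link =
    """(?i)<\s*(a|img)\s[^>]*?(?:href|src)\s*=\s*['"]?([^'"\s>]+)""".r

  // Every (tag, url) pair found in the text, in order of appearance.
  def extract(html: String): List[(String, String)] =
    Link.findAllMatchIn(html)
      .map(m => (m.group(1).toLowerCase, m.group(2)))
      .toList
}
```

The extension can then be checked separately on each returned URL to decide which collector, if any, should receive it.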

Grading

The program should work as specified.

The first Scala assignment encouraged you to do things "the Scala way" by avoiding vars, not using a loop where a recursion would do, not using a recursion where a call to a higher-order function would do, using pattern matching, etc.

You should continue to try to do things the Scala way. You may use vars, while loops, etc., when they are appropriate, but also use the Scala features. I reserve the right to take off significant points if it looks like you made no effort whatsoever to write in Scala, rather than writing a Java program using Scala syntax. I will probably not ever do this, so there's no need to be paranoid, but I reserve that right.

Due date

Zip up your program and your result file, and submit by 6am Friday, December 7, via Blackboard. Late submissions will be accepted for only a short time after that.