Given the URL of a page that contains zero or more images and references zero or more subpages, download all web pages (.html) and images (.png) from that site. Recreate the site directory structure on your computer, so that links work. Do not download files of any other type, or any offsite files. Provide a report of files downloaded, offsite files, and broken links.
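As a rough illustration of recreating the directory structure, the sketch below maps a URL on the site to a path relative to the starting page's directory, and rejects offsite URLs and unwanted file types. The names `SitePaths`, `relativePath`, and `localFile` are hypothetical, and the code assumes the starting URL names a file inside a directory (for example `.../site/index.html`).

```scala
import java.io.File

object SitePaths {
  // Only these file types are downloaded, per the assignment text.
  private val wanted = List(".html", ".png")

  /** Path of `url` relative to the directory of `startUrl`, or None if
    * the URL is outside that directory or is not a wanted file type. */
  def relativePath(startUrl: String, url: String): Option[String] = {
    // Everything up to and including the last slash of the start URL.
    val siteDir = startUrl.substring(0, startUrl.lastIndexOf('/') + 1)
    val onSite = url.startsWith(siteDir) && url.length > siteDir.length
    val typeOk = wanted.exists(ext => url.toLowerCase.endsWith(ext))
    if (onSite && typeOk) Some(url.substring(siteDir.length)) else None
  }

  /** Local file under `root` that mirrors the URL's position on the site. */
  def localFile(root: File, startUrl: String, url: String): Option[File] =
    relativePath(startUrl, url).map(rel => new File(root, rel))
}
```

Because the local path mirrors the path on the site, relative links inside the downloaded pages keep working without rewriting them.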
Here's the recommended program structure. Variations are okay, as long as the behavior is essentially the same.
0= just this page,
1= this page and its "child pages",
2= this page and its children and "grandchildren",
self function to make a proxy actor for "yourself" (this thread).
Thread with special features, such as a "mailbox." Your initial Thread is a plain old Java Thread that doesn't have these features. Think of your self as a "power suit" that you put on to give you these extra features.
depthLimit-1) to deal with it.
0. This widely-used technique lets you pass around one number instead of two.
main) the URL of this page, whether the download was successful (remember, the URL may be invalid or refer to a missing page), and how many URLs were discarded as not useful.
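The depth-limit idea (handle this page, then handle each child with depthLimit-1, stopping at 0) might be sketched as follows. This is a plain sequential sketch, not the actor-based structure described above: the object `DepthLimited`, the function `crawl`, and the in-memory `links` map standing in for real pages are all assumptions made for illustration.

```scala
object DepthLimited {
  /** Pages reached from `page` within `depthLimit` levels.
    * `links` is a hypothetical stand-in for "the pages this page refers to". */
  def crawl(page: String, depthLimit: Int, links: Map[String, List[String]]): List[String] = {
    // At depth 0 we handle just this page and follow nothing.
    val children = if (depthLimit > 0) links.getOrElse(page, Nil) else Nil
    // Each child gets one less level to work with, so a single
    // number carries both "how deep am I" and "how deep may I go".
    page :: children.flatMap(child => crawl(child, depthLimit - 1, links))
  }
}
```

A page that is linked from several places appears more than once in this result; a real downloader would probably want to remember which URLs it has already handled.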
HTML is what web pages are written in.
HTML contains elements; each element consists of a tag with zero or more attributes, enclosed in angle brackets. Tags and attributes are not case sensitive. Attributes may occur in any order. An attribute consists of a name, an equals sign, and a value. The value is supposed to be in quotes, either single or double, but often isn't. When a value isn't in quotes, it is terminated by a blank or by the closing angle bracket.
<TABLE summary="" class='data' Border = 1 cellSpacing="0">
where "table" is the tag, and "summary", "class", "border", and "cellspacing" are the names of the attributes.
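One possible way to pull the tag and attributes out of a single element, following the rules above (case-insensitive names, double-quoted, single-quoted, or unquoted values, optional spaces around the equals sign), is sketched below. The names `TagParser` and `parse` are made up for this example, and the pattern is only one of several that would work.

```scala
object TagParser {
  // name, optional spaces, '=', optional spaces, then a value that is
  // double-quoted, single-quoted, or bare (ended by a blank or '>').
  private val Attr =
    """([A-Za-z][\w-]*)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))""".r

  /** Returns the lower-cased tag name and a map from lower-cased
    * attribute names to their (unquoted) values. */
  def parse(element: String): (String, Map[String, String]) = {
    val inner = element.trim.stripPrefix("<").stripSuffix(">").trim
    val (tag, rest) = inner.span(c => !c.isWhitespace)
    val attrs = Attr.findAllMatchIn(rest).map { m =>
      // Exactly one of groups 2, 3, 4 took part in the match.
      val value = Option(m.group(2)).orElse(Option(m.group(3))).getOrElse(m.group(4))
      m.group(1).toLowerCase -> value
    }.toMap
    (tag.toLowerCase, attrs)
  }
}
```

Applied to the TABLE example above, this yields the tag `table` with attributes `summary`, `class`, `border`, and `cellspacing`.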
The relevant tags and attributes are:
a, usually denoting a web page, with a possible href attribute.
If an a tag has no href attribute, it should be ignored.
If an a tag has an href attribute, the value of the href attribute is some kind of URL.
img, denoting an image, with a "required" src attribute. The value of the src attribute is some kind of URL.
There are several kinds of URLs:
http:, it is (probably) an offsite link, and should be ignored.
# character, it is a link to the middle of some page, and should be ignored.
/), it is a link to somewhere on the same site, but not in this subdirectory; ignore it.
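The cases above could be expressed as a small classifier like the one below. It is only a sketch under assumptions: the names `UrlKind`, `Kind`, and `classify` are invented here, the # test is taken to mean "contains a # anywhere", only the `http:` prefix named in the text is treated as offsite, and anything that matches none of the ignored cases is treated as a relative link worth following.

```scala
object UrlKind {
  sealed trait Kind
  case object Offsite extends Kind      // starts with http:
  case object Fragment extends Kind     // points into the middle of a page
  case object SiteAbsolute extends Kind // starts with a slash
  case object Relative extends Kind     // assumed: everything else

  def classify(url: String): Kind = {
    val u = url.trim
    if (u.toLowerCase.startsWith("http:")) Offsite
    else if (u.contains("#")) Fragment
    else if (u.startsWith("/")) SiteAbsolute
    else Relative
  }
}
```

Only the `Relative` case would be downloaded; the other three would be counted as discarded or reported as offsite.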
If you need to know more, visit w3schools.
You may make the following simplifying (though incorrect) assumptions:
href, but not
a tag or one
These assumptions are intended to minimize the amount of string manipulation required, especially if you don't know regular expressions. Feel free to do a better job handling HTML elements, but you will not get any extra points for doing so.
If you do know regular expressions, be aware that they can take time that is exponential in the length of the string being processed. If your program exhibits exponential time behavior (takes unreasonably long to run), it will be penalized severely. Regular expressions are powerful, but remember the adage: "With great power comes great responsibility."
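If you would rather avoid regular expressions altogether, plain index-based scanning visits each character a bounded number of times, so its running time grows linearly with the page size. The sketch below collects every element (from a `<` to the next `>`); the names `ElementScanner` and `elements` are hypothetical, and it deliberately ignores complications such as a `>` inside a quoted attribute value.

```scala
import scala.annotation.tailrec

object ElementScanner {
  /** Every substring from a '<' to the next '>', in document order. */
  def elements(html: String): List[String] = {
    @tailrec
    def loop(from: Int, acc: List[String]): List[String] = {
      val open = html.indexOf('<', from)
      val close = if (open < 0) -1 else html.indexOf('>', open)
      if (open < 0 || close < 0) acc.reverse // no further complete element
      else loop(close + 1, html.substring(open, close + 1) :: acc)
    }
    loop(0, Nil)
  }
}
```

Each returned string can then be examined on its own, which keeps any later pattern matching confined to short inputs.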
|Reading from a URL|val url = new java.net.URL(from)|
|Writing to a file|val file = new java.io.File(to)|
|Choosing a directory|val chooser = new FileChooser()|
|Creating a directory|val newDirectory = new File(fullPathName)|
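The first lines listed above can be combined into a small download helper. The following is a sketch rather than the intended solution: `Download` and `save` are hypothetical names, and copying the stream with `java.nio.file.Files.copy` is a choice made here, not something the assignment prescribes.

```scala
import java.io.File
import java.net.URL
import java.nio.file.{Files, StandardCopyOption}

object Download {
  /** Copies the bytes at `from` into the file `to`, creating any missing
    * parent directories first. Returns the number of bytes written.
    * Throws an exception if the URL is malformed or cannot be read. */
  def save(from: String, to: String): Long = {
    val file = new File(to)
    Option(file.getParentFile).foreach(_.mkdirs())
    val in = new URL(from).openStream()
    try Files.copy(in, file.toPath, StandardCopyOption.REPLACE_EXISTING)
    finally in.close()
  }
}
```

Since an invalid URL or a missing page raises an exception, a caller would likely wrap `save` in `scala.util.Try` and record the failure as a broken link. Copying raw bytes works for both .html and .png files.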
The program should work as specified.
For up to 20 points extra credit, skip the Console I/O and make a Scala GUI for the program.
No mechanical deductions or bonuses for vars, etc. However, we reserve the right to give bonuses or deductions according to our subjective estimates of the quality of your code, especially the extent to which it embraces or violates "the Scala way." Please note that, in this context, "subjective" means we liked it or didn't like it, and as such is not open to arguments about points.