CIT 594 Assignment 10: MapReduce
Spring 2015, David Matuszek
Given a large (book-length) text, count how many time each letter of the alphabet occurs (1) as the first letter of a word, (2) as the last letter of a word, and (3) anywhere in a word.
Pair.java. This is a “baby” version of Google’s MapReduce framework. The primary difference is that Google’s version runs a large number of processes distributed across a network, whereas my version runs a small number of threads on a single computer.
MyMapReduce.zip, containing sample code
WordCounter.java. This code counts the number of occurrences of each “word” in the input text, where “word” is very poorly defined as any nonspace characters bounded by spaces.
The canonical first example (the “Hello World” of MapReduce) is counting the number of times each word occurs in a sample of text. It consists of two functions,
The package name
The arguments to
To keep the code simple, this version of
The package name
The arguments to
Since I know every value in the list
All data in MapReduce is represented by key-value pairs.
In Google’s MapReduce, the key and value arguments to
map may be any type of key, and any type of value. In my version, they must be strings. In Google’s MapReduce, the arguments to
Mapper.emit may be any type of key and any type of value. In my version, they must be strings.
In Google’s MapReduce, the key argument to
reduce must be the same type as the key emitted by
map. In my version, it must be a string. In Google’s MapReduce, the value argument must be a list of the values emitted by
map. In my version it must be a list of strings. In Google’s MapReduce, the arguments to
Reducer.emit may be any type of key and a list of any type of value. In my version, they must be a string and a list of strings..
In my framework, I had to hard-wire in the names of the package
myMapReduce and the classes
Your task is to read in a file of text and count (1) how many times each letter occurs, (2) how many times each letter occurs as the initial letter of a word, and (3) how many times each letter occurs as a final letter of a word. For example, given the text
"I'm writing a program," you would find the following values:
Ignore the case of the letters; uppercase and lowercase letters are counted together.
"Word" is a poorly defined concept. For our purposes, we will make the following simple definition:
') is not a letter, but may occur within a word. Hence, a word such as
"they'll"is treated as if it were
You can probably come up with examples that don't fit this definition, but this is the definition of "word" that you are to use.
Here's how to write the program:
WordCounterclass with your own
LetterCounterclass. This class should contain a
mainmethod, and it should:
JFileChooserto ask the user for an input file.
StringBuilder, because you will be working with book-length text.
MapReducerand call its
executemethod, giving it as parameters (1) the name of the file you read in, and (2) the text of the file. This method returns a
List<Pair<String, String>>of results.
mapmethod in class
MyMapperwith your own version. It should:
emit. In the word-counting example, all the information was carried by the first parameter; the second parameter was always
"1". You have to figure out how to package up a bit more information.
reducemethod in class
MyReducerwith your own version. It should:
emitwith the combined information.
readme.txt file and include it in the zip file you turn in. It should contains a short (perhaps half a page) description of how information is packaged up and how it flows through your program. This should include what information is "emitted" at each step.
This is an easy program. (At least, it’s easy if you have a basic understanding of MapReduce.) All the nasty thread-safety and synchronization business has been done for you. You just have to write a bit of conventional sequential code, and your program will magically scale to any number of processors (or threads).
That’s the point. MapReduce (and its open source version, Hadoop) make it easy to write concurrent programs that process petabytes of data distributed across thousands of processors.
Of course, it’s just silly to do something like this on a single computer. Except maybe for instructional purposes.
Your program should get the right answers, and should be clean, simple, and well-documented.
readme.txt file should display your understanding of how to use MapReduce. (This may be the hardest part of the assignment, but it will be graded generously.)