CIT 594 Assignment 11: MapReduce
Spring 2011, David Matuszek

Purposes of this assignment

General idea of the assignment

Count how many words begin with each letter of the alphabet in some input text (say, War of the Worlds). Present this as a table of counts for each of the 26 letters. Also report the total number of words.

Provided software

MapReduce.zip, containing MapReduce.java, Splitter.java, Mapper.java, Reducer.java, and Pair.java. This is a “baby” version of Google’s MapReduce framework. The primary difference is that Google’s version runs a large number of processes distributed across a network, whereas my version runs a small number of threads on a single computer.

MyMapReduce.zip, containing sample code MyMapper.java, MyReducer.java, and WordCounter.java. This code counts the number of occurrences of each “word” in the input text, where “word” is very poorly defined as any nonspace characters bounded by spaces.

Hello World

Google’s version

The canonical first example (the “Hello World” of MapReduce) is counting the number of times each word occurs in a sample of text. It consists of two functions, map and reduce:

class Mapper
  method map(documentID, document)
    for each word w in document d do
      emit(word w, count 1)

Input to map: A key, documentID, and a value, document.

map finds each word in the document (the documentID is ignored) and calls emit with the word w as the key and the count 1 as the value.

class Reducer
  method reduce(word w, counts [c1, c2, ...])
    sum = 0
    for each count c in counts [c1, c2, ...] do
      sum = sum + c
    emit(word w, count sum)

Input to reduce: A key, w, and a value, a list of counts (each of which happens to be 1)

reduce adds all the counts together (which, in this example, could be done by just getting the length of the list), and calls emit with the word w as the key and the sum as the value.

My version

package myMapReduce;

import mapReduce.Mapper;

public class MyMapper extends Mapper {
    
    @Override
    public void map(String documentID, String document) {
        String[] words = document.split(" ");
        for (String word : words) {
                emit(word, "1");
        }
    }
}

The package name myMapReduce, the class name MyMapper, and the function name map cannot be changed. The parameters to map must be Strings.

The arguments to emit (which is inherited from Mapper) must be Strings.

To keep the code simple, this version of map does a terrible job of finding words--it considers anything between spaces to be a word. You should do better than this.

package myMapReduce;

import java.util.List;
import mapReduce.Reducer;

public class MyReducer extends Reducer {

    @Override
    public void reduce(String word, List<String> counts) {
        emit(word, counts.size() + "");
    }
}

The package name myMapReduce, the class name MyReducer, and the function name reduce cannot be changed. The parameters to reduce must be a String and a List of Strings.

The arguments to emit (which is inherited from Reducer) must be Strings.

Since I know every value in the list counts is the string "1", rather than converting them to integers and adding them up, I just use size() to find out how many of them there are. Then I turn the result back into a string by concatenating the result with the empty string.

Compromises

All data in MapReduce is represented by key-value pairs.

In Google’s MapReduce, the key and value arguments to map may be any type of key, and any type of value. In my version, they must be strings. In Google’s MapReduce, the arguments to Mapper.emit may be any type of key and any type of value. In my version, they must be strings.

In Google’s MapReduce, the key argument to reduce must be the same type as the key emitted by map. In my version, it must be a string. In Google’s MapReduce, the value argument must be a list of the values emitted by map. In my version it must be a list of strings. In Google’s MapReduce, the arguments to Reducer.emit may be any type of key and a list of any type of value. In my version, they must be a string and a list of strings..

In my framework, I had to hard-wire in the names of the package myMapReduce and the classes MyMapper and MyReducer.

Your assignment

Your task is to read in a block of text, count how many words in it begin with each letter of the alphabet (2067 words begin with A, 172 words begin with B, ...), and print out the results. Also print the total number of words in the text.

Also...

Examine the supplied framework code, paying close attention to how thread safety is achieved (or not achieved, in case you discover any errors). Write up your observations in a readme.txt file and include it in the zip file you turn in.

Note

This is an easy program. (At least, it’s easy if you have a basic understanding of MapReduce.) All the nasty thread-safety and synchronization business has been done for you. You just have to write a bit of conventional sequential code, and your program will magically scale to any number of processors (or threads).

That’s the point. MapReduce (and its open source version, Hadoop) make it easy to write concurrent programs that process petabytes of data distributed across thousands of processors.

Of course, it’s just silly to do something like this on a single computer. Except maybe for instructional purposes.

Grading

Your program should get the right answers, and should be clean, simple, and well-documented. (I'll try to post what I think are the right answers for War of the Worlds in a couple of days.)

Your readme.txt file should look as if you made an honest effort to understand how MapReduce has been made thread-safe. (This may be the hardest part of the assignment, but it will be graded generously.)

Due date

Turn your zipped assignment in to Blackboard before 6AM Wednesday, April 27.