CIT 594 Assignment 10: MapReduce
Spring 2015, David Matuszek

Purposes of this assignment

General idea of the assignment

Given a large (book-length) text, count how many time each letter of the alphabet occurs (1) as the first letter of a word, (2) as the last letter of a word, and (3) anywhere in a word.

Provided software

MapReduce.zip, containing MapReduce.java, Splitter.java, Mapper.java, Reducer.java, and Pair.java. This is a “baby” version of Google’s MapReduce framework. The primary difference is that Google’s version runs a large number of processes distributed across a network, whereas my version runs a small number of threads on a single computer.

MyMapReduce.zip, containing sample code MyMapper.java, MyReducer.java, and WordCounter.java. This code counts the number of occurrences of each “word” in the input text, where “word” is very poorly defined as any nonspace characters bounded by spaces.

Hello World

Google’s version

The canonical first example (the “Hello World” of MapReduce) is counting the number of times each word occurs in a sample of text. It consists of two functions, map and reduce:

class Mapper
  method map(documentID, document)
    for each word w in document d do
      emit(word w, count 1)

Input to map: A key, documentID, and a value, document.

map finds each word in the document (the documentID is ignored) and calls emit with the word w as the key and the count 1 as the value.

class Reducer
  method reduce(word w, counts [c1, c2, ...])
    sum = 0
    for each count c in counts [c1, c2, ...] do
      sum = sum + c
    emit(word w, count sum)

Input to reduce: A key, w, and a value, a list of counts (each of which happens to be 1)

reduce adds all the counts together (which, in this example, could be done by just getting the length of the list), and calls emit with the word w as the key and the sum as the value.

My version

package myMapReduce;

import mapReduce.Mapper;

public class MyMapper extends Mapper {
    
    @Override
    public void map(String documentID, String document) {
        String[] words = document.split(" ");
        for (String word : words) {
                emit(word, "1");
        }
    }
}

The package name myMapReduce, the class name MyMapper, and the function name map cannot be changed. The parameters to map must be Strings.

The arguments to emit (which is inherited from Mapper) must be Strings.

To keep the code simple, this version of map does a terrible job of finding words--it considers anything between spaces to be a word. You should do better than this.

package myMapReduce;

import java.util.List;
import mapReduce.Reducer;

public class MyReducer extends Reducer {

    @Override
    public void reduce(String word, List<String> counts) {
        emit(word, counts.size() + "");
    }
}

The package name myMapReduce, the class name MyReducer, and the function name reduce cannot be changed. The parameters to reduce must be a String and a List of Strings.

The arguments to emit (which is inherited from Reducer) must be Strings.

Since I know every value in the list counts is the string "1", rather than converting them to integers and adding them up, I just use size() to find out how many of them there are. Then I turn the result back into a string by concatenating the result with the empty string.

Compromises

All data in MapReduce is represented by key-value pairs.

In Google’s MapReduce, the key and value arguments to map may be any type of key, and any type of value. In my version, they must be strings. In Google’s MapReduce, the arguments to Mapper.emit may be any type of key and any type of value. In my version, they must be strings.

In Google’s MapReduce, the key argument to reduce must be the same type as the key emitted by map. In my version, it must be a string. In Google’s MapReduce, the value argument must be a list of the values emitted by map. In my version it must be a list of strings. In Google’s MapReduce, the arguments to Reducer.emit may be any type of key and a list of any type of value. In my version, they must be a string and a list of strings..

In my framework, I had to hard-wire in the names of the package myMapReduce and the classes MyMapper and MyReducer.

Your assignment

Your task is to read in a file of text and count (1) how many times each letter occurs, (2) how many times each letter occurs as the initial letter of a word, and (3) how many times each letter occurs as a final letter of a word. For example, given the text "I'm writing a program," you would find the following values:

  A B C D E F G H I J K L M N O P Q R S T U V W X Y Z

Initial

1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0
Final 1 0 0 0 0 0 1 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0
Total 2 0 0 0 0 0 2 0 3 0 0 0 2 1 1 1 0 3 0 1 0 0 1 0 0 0

Ignore the case of the letters; uppercase and lowercase letters are counted together.

"Word" is a poorly defined concept. For our purposes, we will make the following simple definition:

You can probably come up with examples that don't fit this definition, but this is the definition of "word" that you are to use.

Here's how to write the program:

Also...

Create a readme.txt file and include it in the zip file you turn in. It should contains a short (perhaps half a page) description of how information is packaged up and how it flows through your program. This should include what information is "emitted" at each step.

Note

This is an easy program. (At least, it’s easy if you have a basic understanding of MapReduce.) All the nasty thread-safety and synchronization business has been done for you. You just have to write a bit of conventional sequential code, and your program will magically scale to any number of processors (or threads).

That’s the point. MapReduce (and its open source version, Hadoop) make it easy to write concurrent programs that process petabytes of data distributed across thousands of processors.

Of course, it’s just silly to do something like this on a single computer. Except maybe for instructional purposes.

Grading

Your program should get the right answers, and should be clean, simple, and well-documented.

Your readme.txt file should display your understanding of how to use MapReduce. (This may be the hardest part of the assignment, but it will be graded generously.)

Due date

Turn your assignment in to Canvas before 6 a.m. Thursday, April 9.