CIS 554 n-Grams
Fall 2016, David Matuszek

# General idea

Working with n-grams is a classic computer science problem, so you may have met it before.

In this assignment you are to read in a book-length text, break it into n-grams, and use those n-grams to create a random text.

Suppose, for example, that you are working with 2-grams, and you have found that 80% of the time "th" is followed by "e ", 10% by "is", 7% by "at", and 3% by "es" (made-up numbers!). Then, when you are generating text, after you have generated "th" you should randomly choose "e " with probability 0.8, "is" with probability 0.1,  "at" with probability 0.07, and "es" with probability 0.03.

# How to do it

In the `def main(args: Array[String]) `method, the `args` are arguments that are passed in via the command line. If there is only one argument, it is in `args(0)`.

In Eclipse, you can put the file in the top level of your project directory. Then go to `Run → Run Configurations...`, select your application, open the `Arguments` tab, and put the name of your book file in the `Program arguments:` area.

To run the program from the command line, you can go into the project's `bin` directory and enter: ```scala ProgramName ../BookFileName ```(because the file will be one level up from the executable program).

## Break the text into n-grams

An n-gram is a sequence of n characters. When you do this, be sure to get all the n-grams. For example, if the complete text is "woodchucks", the 3-grams are not just "woo", "dch", and "uck", but also "ood", "chu", "chks" and "odc", "huc".

## Generate a map of n-grams to lists of n-grams

In my code I imported and used a `scala.collection.mutable.HashMap`. Then I created a map of Strings to Lists of Strings. For each n-gram in the text that was followed by another n-gram, I used the earlier n-gram as a key, and the later n-gram as part of the value.

For example, using the text ```How much wood would a woodchuck chuck if a woodchuck could chuck wood? ```we can get the following maps:

1-grams 2-grams 3-grams
```" " --> List("w", "c", "c", "w", "a", "i",              "c", "w", "a", "w", "w", "m") "a" --> List(" ", " ") "c" --> List("k", "h", "o", "k", "h", "k",              "h", "k", "h", "h") "d" --> List("?", " ", "c", "c", " ", " ") "f" --> List(" ") "H" --> List("o") "h" --> List("u", "u", "u", "u", " ") "i" --> List("f") "k" --> List(" ", " ", " ", " ") "l" --> List("d", "d") "m" --> List("u") "o" --> List("d", "o", "u", "d", "o", "d",              "o", "u", "d", "o", "w") "u" --> List("c", "l", "c", "c", "c", "l", "c") "w" --> List("o", "o", "o", "o", "o", " ")``` ```" a" --> List(" w", " w") " c" --> List("hu", "hu", "ou") " i" --> List("f ") " m" --> List("uc") " w" --> List("oo", "ou", "oo",               "oo", "oo") "a " --> List("wo", "wo") "ch" --> List("uc", "uc", "uc",               "uc", " w") "ck" --> List(" i", " c", " w", " c") "co" --> List("ul") "d " --> List("ch", "a ", "wo") "dc" --> List("hu", "hu") "f " --> List("a ") "h " --> List("wo") "Ho" --> List("w ") "hu" --> List("ck", "ck", "ck", "ck") "if" --> List(" a") "k " --> List("wo", "co", "if", "ch") "ld" --> List(" a", " c") "mu" --> List("ch") "od" --> List("ch", " w", "ch") "oo" --> List("dc", "d?", "dc", "d ") "ou" --> List("ld", "ld") "ow" --> List(" m") "uc" --> List("k ", "k ", "h ", "k ",               "k ") "ul" --> List("d ", "d ") "w " --> List("mu") "wo" --> List("od", "od", "od", "od", "ul")``` ```" a " --> List("woo", "woo") " ch" --> List("uck", "uck") " co" --> List("uld") " if" --> List(" a ") " mu" --> List("ch ") " wo" --> List("od ", "od?", "uld", "odc", "odc") "a w" --> List("ood", "ood") "ch " --> List("woo") "chu" --> List("ck ", "ck ", "ck ", "ck ") "ck " --> List("woo", "cou", "if ", "chu") "cou" --> List("ld ") "d a" --> List(" wo") "d c" --> List("huc") "d w" --> List("oul") "dch" --> List("uck", "uck") "f a" --> List(" wo") "h w" --> List("ood") "How" --> List(" mu") "huc" --> List("k w", "k c", "k i", "k c") "if " --> List("a w") "k c" --> List("oul", "huc") "k i" --> List("f a") "k w" --> List("ood") "ld " --> List("chu", "a w") "muc" --> List("h w") "od " --> List("wou") "odc" --> List("huc", "huc") "ood" --> List("chu", "chu", " wo") "oul" --> List("d c", "d a") "ow " --> List("muc") "uch" --> List(" wo") "uck" --> List(" wo", " co", " if", " ch") "uld" --> List(" ch", " a ") "w m" --> List("uch") "woo" --> List("dch", "dch", "d w") "wou" --> List("ld ")```

For convenience, I've alphabetized the above maps according to their keys.

Notice that the lists contain repetitions. This is deliberate. For example, in the table of 1-grams above, `"h"` is followed by `"u"` four times and by `" "` just once. If you choose an element randomly from ```List("u", "u", "u", "u", " ")```, you will get `"u"` four times as often as `" "`, which is just what is desired.

## The algorithm

• Choose a random n-gram.
• While more text is required, look up an n-gram in the map, and randomly choose one of its values.
• If the map does not contain the n-gram as a key, or the associated value is the empty list, just choose any random n-gram to continue.

# The assignment

Read in a full-length book (plain text only, no fancy formatting!). Project Gutenberg is a good source.

Generate approximately 500 characters of text. Print the text in reasonable length lines, breaking only at spaces (not in the middle of a word).
Do this for 1-grams, 2-grams, 3-grams, 4-grams, and 5-grams.

Here's what you should expect:
• Text generated from 1-grams will be complete gibberish, but letters, spaces, and punctuation will appear with approximately the right frequency.
• Text generated from 2-grams will begin to have occasional words appear in it.
• Text generated from 3-grams will be mostly words, with some sensible phrases starting to appear.
Programming note: I tried to do this in under 50 lines of code (formatted normally),  but then my printNicely method to break the text into reasonably sized lines pushed the line count up to 58. Maybe you can do better? Comparison: I just translated this program as accurately as I could into Java. This gives:
• Number of lines: Scala 58, Java 93.
• Number of characters: Scala 1556, Java 2915.
• Average line length (omitting indentation): Scala 20.36, Java 24.93.
While shorter programs are not necessarily more readable programs, it usually helps. I believe that replacing loops with list operations helps even more.

## Acknowledgement

This assignment was inspired by Joe Zachary's Random Writer assignment.

# Due date

Zip your Scala project and turn the zip file in to Canvas. Due by 11:59pm Tuesday, November 15.