CIS 554 n-Grams
Fall 2016, David Matuszek

Purposes of this assignment

General idea

Working with n-grams is a classic computer science problem, so you may have met it before.

In this assignment you are to read in a book-length text, break it into n-grams, and use those n-grams to create a random text.

Suppose, for example, that you are working with 2-grams, and you have found that 80% of the time "th" is followed by "e ", 10% by "is", 7% by "at", and 3% by "es" (made-up numbers!). Then, when you are generating text, after you have generated "th" you should randomly choose "e " with probability 0.8, "is" with probability 0.1,  "at" with probability 0.07, and "es" with probability 0.03.

How to do it

Read in a book

In the def main(args: Array[String]) method, the args are arguments that are passed in via the command line. If there is only one argument, it is in args(0).

In Eclipse, you can put the file in the top level of your project directory. Then go to Run → Run Configurations..., select your application, open the Arguments tab, and put the name of your book file in the Program arguments: area.

To run the program from the command line, you can go into the project's bin directory and enter: scala ProgramName ../BookFileName (because the file will be one level up from the executable program).

Break the text into n-grams

An n-gram is a sequence of n characters. When you do this, be sure to get all the n-grams. For example, if the complete text is "woodchucks", the 3-grams are not just "woo", "dch", and "uck", but also "ood", "chu", "chks" and "odc", "huc".

Generate a map of n-grams to lists of n-grams

In my code I imported and used a scala.collection.mutable.HashMap. Then I created a map of Strings to Lists of Strings. For each n-gram in the text that was followed by another n-gram, I used the earlier n-gram as a key, and the later n-gram as part of the value.

For example, using the text How much wood would a woodchuck chuck if a woodchuck could chuck wood?
we can get the following maps:

1-grams 2-grams 3-grams
" " --> List("w", "c", "c", "w", "a", "i",
             "c", "w", "a", "w", "w", "m")
"a" --> List(" ", " ")
"c" --> List("k", "h", "o", "k", "h", "k",
             "h", "k", "h", "h")
"d" --> List("?", " ", "c", "c", " ", " ")
"f" --> List(" ")
"H" --> List("o")
"h" --> List("u", "u", "u", "u", " ")
"i" --> List("f")
"k" --> List(" ", " ", " ", " ")
"l" --> List("d", "d")
"m" --> List("u")
"o" --> List("d", "o", "u", "d", "o", "d",
             "o", "u", "d", "o", "w")
"u" --> List("c", "l", "c", "c", "c", "l", "c")
"w" --> List("o", "o", "o", "o", "o", " ")
" a" --> List(" w", " w")
" c" --> List("hu", "hu", "ou")
" i" --> List("f ")
" m" --> List("uc")
" w" --> List("oo", "ou", "oo",
              "oo", "oo")
"a " --> List("wo", "wo")
"ch" --> List("uc", "uc", "uc",
              "uc", " w")
"ck" --> List(" i", " c", " w", " c")
"co" --> List("ul")
"d " --> List("ch", "a ", "wo")
"dc" --> List("hu", "hu")
"f " --> List("a ")
"h " --> List("wo")
"Ho" --> List("w ")
"hu" --> List("ck", "ck", "ck", "ck")
"if" --> List(" a")
"k " --> List("wo", "co", "if", "ch")
"ld" --> List(" a", " c")
"mu" --> List("ch")
"od" --> List("ch", " w", "ch")
"oo" --> List("dc", "d?", "dc", "d ")
"ou" --> List("ld", "ld")
"ow" --> List(" m")
"uc" --> List("k ", "k ", "h ", "k ",
              "k ")
"ul" --> List("d ", "d ")
"w " --> List("mu")
"wo" --> List("od", "od", "od", "od", "ul")
" a " --> List("woo", "woo")
" ch" --> List("uck", "uck")
" co" --> List("uld")
" if" --> List(" a ")
" mu" --> List("ch ")
" wo" --> List("od ", "od?", "uld", "odc", "odc")
"a w" --> List("ood", "ood")
"ch " --> List("woo")
"chu" --> List("ck ", "ck ", "ck ", "ck ")
"ck " --> List("woo", "cou", "if ", "chu")
"cou" --> List("ld ")
"d a" --> List(" wo")
"d c" --> List("huc")
"d w" --> List("oul")
"dch" --> List("uck", "uck")
"f a" --> List(" wo")
"h w" --> List("ood")
"How" --> List(" mu")
"huc" --> List("k w", "k c", "k i", "k c")
"if " --> List("a w")
"k c" --> List("oul", "huc")
"k i" --> List("f a")
"k w" --> List("ood")
"ld " --> List("chu", "a w")
"muc" --> List("h w")
"od " --> List("wou")
"odc" --> List("huc", "huc")
"ood" --> List("chu", "chu", " wo")
"oul" --> List("d c", "d a")
"ow " --> List("muc")
"uch" --> List(" wo")
"uck" --> List(" wo", " co", " if", " ch")
"uld" --> List(" ch", " a ")
"w m" --> List("uch")
"woo" --> List("dch", "dch", "d w")
"wou" --> List("ld ")

For convenience, I've alphabetized the above maps according to their keys.

Notice that the lists contain repetitions. This is deliberate. For example, in the table of 1-grams above, "h" is followed by "u" four times and by " " just once. If you choose an element randomly from List("u", "u", "u", "u", " "), you will get "u" four times as often as " ", which is just what is desired.

The algorithm

The assignment

Read in a full-length book (plain text only, no fancy formatting!). Project Gutenberg is a good source.

Generate approximately 500 characters of text. Print the text in reasonable length lines, breaking only at spaces (not in the middle of a word).
Do this for 1-grams, 2-grams, 3-grams, 4-grams, and 5-grams.

Here's what you should expect:
Programming note: I tried to do this in under 50 lines of code (formatted normally),  but then my printNicely method to break the text into reasonably sized lines pushed the line count up to 58. Maybe you can do better? Smiley

Comparison: I just translated this program as accurately as I could into Java. This gives:
While shorter programs are not necessarily more readable programs, it usually helps. I believe that replacing loops with list operations helps even more.

Acknowledgement

This assignment was inspired by Joe Zachary's Random Writer assignment.

Due date

Zip your Scala project and turn the zip file in to Canvas. Due by 11:59pm Tuesday, November 15.