Working with n-grams is a classic computer science problem, so you may have met it before.
In this assignment you are to read in a book-length text, break it into n-grams, and use those n-grams to create a random text.
Suppose, for example, that you are working with 2-grams, and you have found that 80% of the time "th" is followed by "e ", 10% by "is", 7% by "at", and 3% by "es" (made-up numbers!). Then, when you are generating text, after you have generated "th" you should randomly choose "e " with probability 0.8, "is" with probability 0.1, "at" with probability 0.07, and "es" with probability 0.03.
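The probabilities in this example can be turned into a random choice in more than one way. The following is only a minimal sketch, assuming the weights are already known as numbers; the object name WeightedChoice and the helper weightedChoice are hypothetical and not part of the assignment. The lists-with-repetitions approach described later in this assignment gets the same effect without storing any probabilities.

```scala
import scala.util.Random

// Hypothetical helper object, for illustration only.
object WeightedChoice {
  // Picks one string, with probability proportional to its weight.
  // Assumes the list is non-empty and all weights are positive.
  def weightedChoice(weights: List[(String, Double)], rng: Random): String = {
    val total = weights.map(_._2).sum
    val target = rng.nextDouble() * total
    // Running totals: the first entry whose cumulative weight exceeds the target wins.
    val cumulative = weights.scanLeft(0.0)(_ + _._2).tail
    weights.zip(cumulative)
      .collectFirst { case ((s, _), c) if target < c => s }
      .getOrElse(weights.last._1) // guards against floating-point rounding
  }
}
```

For instance, WeightedChoice.weightedChoice(List("e " -> 0.8, "is" -> 0.1, "at" -> 0.07, "es" -> 0.03), new Random()) should return "e " about 80% of the time.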
In the def main(args: Array[String]) method, args holds the
arguments that are passed in via the command line. If there is only
one argument, it is in args(0).
In Eclipse, you can put the file in the top level of your project directory.
Then go to Run → Run Configurations..., select your application, open the
Arguments tab, and put the name of your book file in the Program arguments: area.
To run the program from the command line, you can go into the project's
directory and enter:
scala ProgramName ../BookFileName
(because the file will be one level up from the executable).
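As a rough sketch of how the command-line argument might be used, the following reads the whole file named by the first argument into one String. The object name NGramDemo and the helper readBook are hypothetical names chosen for illustration, and scala.io.Source is only one of several ways to read a file.

```scala
import scala.io.Source

// Hypothetical program name, for illustration only.
object NGramDemo {
  // Reads the entire file into a single String, closing the file afterwards.
  def readBook(fileName: String): String = {
    val source = Source.fromFile(fileName)
    try source.mkString finally source.close()
  }

  def main(args: Array[String]): Unit = {
    if (args.length != 1) {
      println("Usage: scala NGramDemo BookFileName")
    } else {
      val text = readBook(args(0)) // the single argument is the book file name
      println(s"Read ${text.length} characters")
    }
  }
}
```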
An n-gram is a sequence of n characters. When you break the text into n-grams, be sure to get all of them, including the ones that overlap. For example, if the complete text is "woodchucks", the 3-grams are not just "woo", "dch", and "uck", but also "ood", "chu", "cks" and "odc", "huc".
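One way to collect every overlapping n-gram is to slide a window of width n across the text, one character at a time. This is a minimal sketch, assuming n is at least 1; the names NGrams and ngrams are hypothetical.

```scala
// Hypothetical helper object, for illustration only.
object NGrams {
  // Returns every n-gram in the text, in order, including overlapping ones.
  // The filter drops the single short window that sliding yields when the
  // text has fewer than n characters.
  def ngrams(text: String, n: Int): List[String] =
    text.sliding(n).filter(_.length == n).toList
}
```

With this sketch, NGrams.ngrams("woodchucks", 3) gives List("woo", "ood", "odc", "dch", "chu", "huc", "uck", "cks").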
In my code I imported and used a
Then I created a map of Strings to Lists of Strings. For each n-gram in
the text that was followed by another n-gram, I used the earlier n-gram as
a key, and the later n-gram as part of the value.
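The map described above could be built along the following lines. This is only a sketch under stated assumptions, not the code referred to in the text: it assumes that "followed by" means the n-gram that starts immediately after the earlier one ends, and the names Successors and successors are hypothetical.

```scala
// Hypothetical helper object, for illustration only.
object Successors {
  // Maps each n-gram to the list of n-grams that immediately follow it.
  // Repetitions are kept on purpose, so frequent successors appear more often.
  def successors(text: String, n: Int): Map[String, List[String]] = {
    // Position i holds the earlier n-gram; the later one starts at i + n.
    val pairs =
      for (i <- 0 to text.length - 2 * n)
        yield (text.substring(i, i + n), text.substring(i + n, i + 2 * n))
    pairs.groupBy(_._1).map { case (key, group) => key -> group.map(_._2).toList }
  }
}
```

For the text "woodchucks" and n = 3, this yields the entries "woo" -> List("dch"), "ood" -> List("chu"), "odc" -> List("huc"), "dch" -> List("uck"), and "chu" -> List("cks"). The order of the elements inside each list depends on how the map is built and does not matter for random selection.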
For example, using the text
How much wood would a woodchuck chuck
if a woodchuck could chuck wood?
we can get the following maps:
For convenience, I've alphabetized the above maps according to their keys.
Notice that the lists contain repetitions. This is deliberate. For
example, in the table of 1-grams above, "h" is followed by "u"
four times and by " " just once. If you choose an element randomly from
List("u", "u", "u", "u", " "), you will get "u" four times as often as " ",
which is just what is desired.
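Choosing uniformly from a list with repetitions, and then chaining such choices to produce text, might look like the sketch below. The names Generator, pickRandom, and generate are hypothetical, and stopping when the current n-gram has no recorded successor is an assumption made here for simplicity.

```scala
import scala.util.Random

// Hypothetical helper object, for illustration only.
object Generator {
  // Uniform choice from a non-empty list; repeated elements are
  // proportionally more likely to be chosen.
  def pickRandom(choices: List[String], rng: Random): String =
    choices(rng.nextInt(choices.length))

  // Starts from the given n-gram and appends up to `count` randomly chosen
  // successors. Stops early if the current n-gram is not a key in the table.
  def generate(table: Map[String, List[String]], start: String,
               count: Int, rng: Random): String = {
    val out = new StringBuilder(start)
    var current = start
    var remaining = count
    while (remaining > 0 && table.contains(current)) {
      current = pickRandom(table(current), rng)
      out.append(current)
      remaining -= 1
    }
    out.toString
  }
}
```

Passing a seeded Random, such as new Random(42), makes a run repeatable, which can help when debugging.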