CIS 554 Scala 2: KWIC Index
Fall 2014, David Matuszek

Purpose of this assignment

General idea of the assignment

A KWIC (Key Word In Context) index is an old, pre-digital way of looking things up, somewhat similar to a biblical concordance.

The basic idea is that there are two kinds of words in English: stop words, which do not convey any information about the content of an article (for example, "the", "and", "of"), and keywords, which are basically everything else.

Your program will read in an arbitrary number of text files, and write out (to a file) a KWIC index of all the keywords that it finds.

Details

You should provide two (or more) files, kwic.scala and kwic_test.scala.

Your program should start by reading in a list of stop words from a file named stop_words.txt (provided), in the same directory as your program.
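
As a rough illustration of this step, here is a minimal sketch of loading the stop words. It assumes that stop_words.txt holds one word per line and that matching should be case-insensitive; neither is stated above. The names StopWords, parse, and load are made up for this example.

```scala
import scala.io.Source

object StopWords {
  // Assumption: one stop word per line. Blank lines are ignored and
  // words are lowercased so that later matching can ignore case.
  def parse(lines: Iterator[String]): Set[String] =
    lines.map(_.trim.toLowerCase).filter(_.nonEmpty).toSet

  // Reads the stop words from a file and closes the file afterwards.
  def load(path: String = "stop_words.txt"): Set[String] = {
    val source = Source.fromFile(path)
    try parse(source.getLines()) finally source.close()
  }
}
```

Keeping parse separate from load means the parsing logic can be tested without touching the file system.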

Next, your program should ask the user for some text files. The user may enter any number of file names (or paths); stop reading in file names when the user enters an empty string. (Here are some I/O Examples that may be helpful.)
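
One possible way to collect the file names is sketched below. It is only a sketch: the prompt text and the names FileNames, readUntilEmpty, and fromConsole are invented for this example, and it assumes that a line containing only whitespace counts as empty.

```scala
import scala.io.StdIn

object FileNames {
  // Keeps calling `nextLine` until it yields an empty string (or null,
  // which StdIn.readLine returns at end of input).
  // Assumption: a whitespace-only line is treated as empty.
  def readUntilEmpty(nextLine: () => String): List[String] =
    Iterator.continually(nextLine())
      .takeWhile(line => line != null && line.trim.nonEmpty)
      .map(_.trim)
      .toList

  // Prompts on the console for each file name.
  def fromConsole(): List[String] =
    readUntilEmpty(() => StdIn.readLine("File name (blank to finish): "))
}
```

Passing the line source in as a function makes it possible to test the loop with canned input instead of the console.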

Next, for each file,

  1. Create a map; the keys will be keywords, and each associated value will be a list of the line numbers on which that keyword occurs.
  2. Read in the file as a list of lines.
  3. For each line, find every keyword, and add the keyword + line number to the map.
  4. Write (or append) to a local file named kwic_index.txt:
    1. The name of the input file (just once), and
    2. For each keyword found in the file, for each line on which the keyword occurs, print the line number and the line. Lines are to be printed in a special way:
      1. Print, right justified, up to 30 characters preceding the keyword,
      2. Print the keyword, with an extra space before it,
      3. Print, left justified, up to 30 characters following the keyword.
    3. Lines should be printed in alphabetical order by keyword, and for each keyword group, in order by line number. (Example: All lines containing the keyword "apple" should be printed before the lines containing "banana." If the word "apple" occurs on lines 7, 42, 12, and 91, they should be printed in the order 7, 12, 42, 91.)
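
Steps 1 through 3 and the required ordering could be sketched as follows. This is one possible approach, not the required one. It assumes that a word is a run of ASCII letters, that keywords are compared in lowercase, and that line numbers start at 1; the names KwicIndex, keywordsIn, and build are made up for this example.

```scala
import scala.collection.immutable.SortedMap

object KwicIndex {
  // Assumption: a "word" is a maximal run of ASCII letters.
  private val Word = "[A-Za-z]+".r

  // Keywords on one line: every word that is not a stop word, lowercased.
  def keywordsIn(line: String, stopWords: Set[String]): List[String] =
    Word.findAllIn(line).map(_.toLowerCase).filterNot(stopWords).toList

  // Maps each keyword to the ascending list of 1-based line numbers on
  // which it occurs. A SortedMap iterates over its keys in alphabetical
  // order, which is the order the index has to be printed in.
  def build(lines: Seq[String], stopWords: Set[String]): SortedMap[String, List[Int]] = {
    val pairs = for {
      (line, i) <- lines.zipWithIndex
      keyword   <- keywordsIn(line, stopWords).distinct // one entry per line
    } yield keyword -> (i + 1)
    SortedMap.empty[String, List[Int]] ++
      pairs.groupBy(_._1).map { case (k, ps) => k -> ps.map(_._2).sorted.toList }
  }
}
```

Using a sorted map means no separate sorting pass is needed when writing the index; an ordinary Map plus a sort of the keys at output time would work equally well.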

Use Scalatest to test some or all of your functions. You don't need to thoroughly test everything, but I'd like to see evidence that you could do thorough testing if you wanted to.

Partial example output ("recur", "recursion")

  461       If you ever, even once,  recur with the same (or harder) pro
  623 ler array. So we will plan to  recur only with smaller arrays, and
  406 o the question of when to use  recursion is simply, when
  415  good rule of thumb is to use  recursion when you're processing
  621                   We will use  recursion to find the maximum value
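
A line in this layout might be produced along the following lines. This is a sketch under assumptions: the width of 5 for the line number is a guess based on the sample above, and the names KwicLine, Context, and render are invented for this example.

```scala
object KwicLine {
  val Context = 30 // characters of context on each side of the keyword

  // Formats one index line. `start` is the 0-based position of the
  // keyword within `line`.
  // Assumption: the line number is right-justified in 5 columns.
  def render(lineNumber: Int, line: String, start: Int, keyword: String): String = {
    val before = line.substring(0, start).takeRight(Context)
    val after  = line.substring(start + keyword.length).take(Context)
    // %Ns right-justifies in N columns; %-Ns left-justifies.
    s"%5d %${Context}s %s%-${Context}s".format(lineNumber, before, keyword, after)
  }
}
```

For example, KwicLine.render(621, line, line.indexOf("recursion"), "recursion") formats the line containing "We will use recursion to find the maximum value". Note that the left-justified field pads short trailing text with spaces, so you may or may not want to trim the result before writing it.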

Notes

Data Structures

The above description refers to the "map" and "list" data structures. These are meant to be general terms, not specific data structures. Use whichever Scala data structures you feel are most appropriate. Do, however, use the Scala versions, not the Java versions.

Strings in Scala are exactly the same as Java Strings, and all the usual Java methods apply. The Scala class StringOps contains a large number of additional methods, some of which you may find useful. In particular, the Scala format method uses java.util.Formatter.format (similar to printf in C).
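
As a small illustration of format and of one StringOps method, consider the sketch below. FormatDemo, row, and lastChars are names made up for this example.

```scala
object FormatDemo {
  // %5d right-justifies an integer in 5 columns, %-8s left-justifies a
  // string in 8 columns, and %8s right-justifies it.
  def row(n: Int, word: String): String =
    "%5d|%-8s|%8s|".format(n, word, word)

  // takeRight comes from StringOps, not from java.lang.String.
  def lastChars(s: String, n: Int): String = s.takeRight(n)
}
```

Here FormatDemo.row(42, "kwic") yields "   42|kwic    |    kwic|".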

Scalatest

The latest version of Scala appears not to contain some jars necessary for using Scalatest. I use the Scala IDE based on Eclipse, and added the following external jars:

I think these are the most recent versions of everything, and in any case they seem to be mutually compatible. You may need a slightly different set of jars. Because this is essentially a configuration issue rather than a language issue, the use of Scalatest will be only 10% of the grade on this assignment.

Here is some sample code: Fraction.scala and ExampleTests.scala.

Style

There are some things you should know about Scala style. While we will not be grading on Scala-specific style, you will find that your program is easier to write and debug if you make some attempt to follow these suggestions.

Due date

Turn your assignment in to Canvas before 6 am Monday, December 8. Important note: Canvas will be set to disallow submissions after 12:01 am, December 10. This should give us enough time to finish grading and submit final grades before our deadline.