CIT 590 Assignment 5: KWIC Index
Spring 2013, David Matuszek

Purposes of this assignment

General idea of the assignment

A KWIC (Key Word In Context) index is an old, pre-digital way of looking things up, somewhat similar to a biblical concordance.

The basic idea is that there are two kinds of words in English: stop words, which do not convey any information about the content of an article (for example, "the", "and", "of"), and keywords, which are basically everything else.

Your program will read in text files and write out (to a file) a KWIC index of all the keywords that it finds.

Details

You should provide two (or more) files, kwic.py and kwic_test.py.

Your program should start by reading in a list of stop words from a file named stop_words.txt (provided), in the same directory as your program.
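One way this step might look is sketched below. The function names are hypothetical, and the sketch assumes that stop_words.txt contains words separated by whitespace (one per line also works) and that stop words should be compared in lowercase; check the provided file before relying on either assumption.

```python
import os


def parse_stop_words(text):
    """Return the set of lowercase stop words found in text (no I/O)."""
    # Assumption: words are separated by whitespace, e.g. one per line.
    return {word.lower() for word in text.split()}


def read_stop_words(filename="stop_words.txt"):
    """Read the stop-word file from the directory containing this program."""
    directory = os.path.dirname(os.path.abspath(__file__))
    with open(os.path.join(directory, filename)) as stop_file:
        return parse_stop_words(stop_file.read())
```

Keeping the parsing in its own function means it can be unit tested without touching the file system.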

Next, your program should ask the user for some text files. The user may enter any number of file names (or paths); stop reading in file names when the user enters an empty string.
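A minimal sketch of this input loop follows. The function name, the prompt text, and the read_line parameter are all hypothetical; passing the reading function in as a parameter is just one way to make the loop easy to try out without typing at the keyboard. Stripping surrounding whitespace from each entry is also an assumption.

```python
def collect_file_names(read_line=input):
    """Collect file names (or paths) until the user enters an empty string."""
    names = []
    while True:
        # Assumption: surrounding whitespace is not part of the file name.
        name = read_line("Enter a text file name (blank to finish): ").strip()
        if name == "":
            break
        names.append(name)
    return names
```

Called with no arguments, the function reads from the keyboard using the built-in input.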

Next, for each file,

  1. Create a dictionary; the keys will be keywords, and each associated value will be a list of the line numbers on which that keyword occurs.
  2. Read in the file as a list of lines.
  3. For each line, find every keyword, and add the keyword + line number to the dictionary.
  4. Write (or append) to a local file named kwic_index.txt:
    1. The name of the input file (just once), and
    2. For each keyword found in the file, for each line on which the keyword occurs, print the line number and the line. Lines are to be printed in a special way:
      1. Print, right justified, up to 30 characters preceding the keyword (Hint: use string.rjust),
      2. Print the keyword, with an extra space before it,
      3. Print, left justified, up to 30 characters following the keyword.
    3. Lines should be printed in alphabetical order by keyword (Hint: get and sort the list of keys), and for each keyword group, in order by line number. (Example: All lines containing the keyword "apple" should be printed before the lines containing "banana." If the word "apple" occurs on lines 7, 42, 12, and 91, they should be printed in the order 7, 12, 42, 91.)
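The dictionary-building and ordering steps above might be sketched as follows. All function names are hypothetical. The sketch assumes that a "word" is a run of letters, that keywords are compared in lowercase, that line numbers start at 1, and that a keyword occurring twice on one line is recorded once; the assignment does not settle these points, so adjust as needed.

```python
import re


def find_keywords(line, stop_words):
    """Return the lowercase keywords on one line, in order of appearance."""
    # Assumption: a word is a run of ASCII letters.
    words = re.findall(r"[A-Za-z]+", line)
    return [word.lower() for word in words if word.lower() not in stop_words]


def build_index(lines, stop_words):
    """Map each keyword to the list of line numbers on which it occurs."""
    index = {}
    for line_number, line in enumerate(lines, start=1):
        for keyword in find_keywords(line, stop_words):
            numbers = index.setdefault(keyword, [])
            if line_number not in numbers:
                numbers.append(line_number)
    return index


def ordered_entries(index):
    """Return (keyword, line number) pairs by keyword, then by line number."""
    return [(keyword, number)
            for keyword in sorted(index)
            for number in sorted(index[keyword])]
```

Sorting the keys gives the alphabetical order, and sorting each list of line numbers gives the order 7, 12, 42, 91 from the "apple" example.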

Partial example output ("recur", "recursion")

  461       If you ever, even once,  recur with the same (or harder) pro
  623 ler array. So we will plan to  recur only with smaller arrays, and
  406 o the question of when to use  recursion is simply, when
  415  good rule of thumb is to use  recursion when you're processing
  621                   We will use  recursion to find the maximum value in
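The formatting of one output line might be sketched as below. The function name and parameters are hypothetical, and the sketch assumes Python 3, where rjust and ljust are string methods. Right-justifying the line number in 5 columns is a guess based on the sample output, not a stated requirement.

```python
def format_kwic_line(line_number, line, start, keyword_length, width=30):
    """Return one index line as a string; start is the keyword's offset in line."""
    end = start + keyword_length
    # Up to `width` characters before the keyword, padded on the left.
    before = line[max(0, start - width):start].rjust(width)
    keyword = line[start:end]
    # Up to `width` characters after the keyword, padded on the right.
    after = line[end:end + width].ljust(width)
    # Assumption: line numbers are right-justified in 5 columns.
    return str(line_number).rjust(5) + " " + before + " " + keyword + after
```

Because the function returns a string instead of printing it, it can be unit tested directly, and the keyword always starts in the same column.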

Grading

You are required to provide unit tests for all functions that don't do input or output.

You will be graded pretty strongly on style. One important style rule is that any function that does input or output should not also do significant computation. For example, output strings have to be formatted in a certain way, as described above; this formatting should be done in a separate function (or functions) that returns a string rather than outputting it.
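As a rough illustration of this style rule, the sketch below splits the work into a function that only computes a string and a function that only writes, with a unit test for the first. The names and the "File: " header format are hypothetical; the assignment only says that the input file's name should be written once.

```python
import unittest


def format_file_header(filename):
    """Return the header text for one input file (computation only, no I/O)."""
    # Assumption: the header format is made up for this illustration.
    return "File: " + filename + "\n"


def append_text(path, text):
    """Append already-formatted text to a file (I/O only, no computation)."""
    with open(path, "a") as out_file:
        out_file.write(text)


class TestFormatFileHeader(unittest.TestCase):
    def test_header_has_name_and_newline(self):
        self.assertEqual(format_file_header("notes.txt"), "File: notes.txt\n")
```

In a real submission the test class would live in kwic_test.py and import the function from kwic.py; it could then be run with python -m unittest kwic_test.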

Due date

Before 6am Friday, February 15. Zip together your kwic.py and kwic_test.py files (no need to include the stop_words.txt file) and submit the zip file to Canvas. No other form of submission will be accepted. There should be one submission per team, with both your names prominently displayed in comments at the top of each Python file.