CIT 591 Assignment 4: Readability
Fall 2009, David Matuszek

Purposes of this assignment

General idea of the assignment

There are various measures of "readability"--how good a reader needs to be in order to understand a passage of English text. These measures are based on the average length of words (usually measured in syllables) and the average length of sentences (measured in words). The result is usually given as the number of years a child has to have attended school (her grade level) in order to understand the text. These measures are crude, but better than nothing. Your assignment is to let the user read in a passage of text (from a file), apply the formulae, and print out the results.

Also, just to give you some experience with Python dictionaries, I've added a measure of my own: How "rich" the vocabulary is, that is, how many different words are used.

Technical details

Here are the formulae you should apply. These have been taken from the corresponding Wikipedia articles and from http://www.readability.info/info.shtml; where these disagree, I've used the formula from Wikipedia. In addition, I've made some simplifications of my own (such as, use the entire text, not selected sentences). I'm using a simple syllable-counting algorithm from Stack Overflow. Since we will be using our own unit tests to test your program, follow this assignment exactly as written. Provide exactly the functions listed below, with exactly the same names and parameter lists, and save it in a file named exactly readability.py.

Kincaid
    Formula: Kincaid = 11.8*syllables/wds + 0.39*wds/sentences - 15.59
    Notes: Also known as "Flesch-Kincaid."
Automated Readability Index
    Formula: ARI = 4.71*chars/wds + 0.5*wds/sentences - 21.43
    Notes: "Chars" is a count of letters, not all characters. Variables are total letters, words, and sentences.
Coleman-Liau
    Formula: Coleman-Liau = 5.89*chars/wds - 29.5*sentences/wds - 15.8
    Notes: "Chars" is a count of letters, not all characters. Variables are total letters, words, and sentences.
Flesch
    Formula: Flesch Index = 206.835 - 84.6*syll/wds - 1.015*wds/sent
    Notes: Not comparable to the others: higher scores (up to about 100) indicate easier text.
Fog (Gunning)
    Formula: Fog Index = 0.4*(wds/sent + 100*(wds >= 3 syll)/wds)
    Notes: "Complex words" are words with three or more syllables. Variables are total words, complex words, and sentences.
Lix
    Formula: Lix = wds/sent + 100*(wds >= 6 char)/wds
    Notes: Variables are total words, sentences, and long words (words of six or more letters).
SMOG
    Formula: SMOG-Grading = 1.043*sqrt(30*(wds >= 3 syll)/sent) + 3.1291
    Notes: "Polysyllables" are words with three or more syllables; the same as "complex words." SMOG is an acronym for "Simple Measure Of Gobbledygook."
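As a sketch of how the formulas translate into code (the variable names and the sample counts below are mine, purely for illustration; your program will use the running totals it accumulates):

```python
import math

# Hypothetical counts for a small text sample (illustrative numbers only).
syllables, words, sentences = 130, 100, 6
letters, complex_words, long_words = 450, 12, 20

kincaid = 11.8 * syllables / words + 0.39 * words / sentences - 15.59
ari = 4.71 * letters / words + 0.5 * words / sentences - 21.43
coleman_liau = 5.89 * letters / words - 29.5 * sentences / words - 15.8
flesch = 206.835 - 84.6 * syllables / words - 1.015 * words / sentences
fog = 0.4 * (words / sentences + 100 * complex_words / words)
lix = words / sentences + 100 * long_words / words
smog = 1.043 * math.sqrt(30 * complex_words / sentences) + 3.1291
```

Note that in Python the divisions must be floating-point; with these sample counts, Kincaid comes out to 6.25 (roughly a sixth-grade level).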

Syllable-counting algorithm

Each vowel (a, e, i, o, u, y) in a word counts as one syllable subject to the following sub-rules:
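As a hedged sketch, here is one common simple heuristic of this kind; the exact sub-rules (consecutive vowels count as a single syllable, a trailing silent "e" is discounted, and every word has at least one syllable) are my assumptions, not necessarily the precise rules intended:

```python
VOWELS = "aeiouy"

def count_syllables(word):
    """Rough syllable count: each run of consecutive vowels is one
    syllable, a trailing 'e' is discounted when the word already has
    more than one syllable, and every word counts at least one."""
    word = word.lower()
    count = 0
    previous_was_vowel = False
    for ch in word:
        is_vowel = ch in VOWELS
        if is_vowel and not previous_was_vowel:
            count += 1
        previous_was_vowel = is_vowel
    if word.endswith("e") and count > 1:
        count -= 1
    return max(count, 1)
```

For example, "readability" yields 5 and "make" yields 1 under these rules.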

Required methods

All the methods listed are required. Be careful to get the spelling and capitalization exactly as shown. You may have additional methods, and all methods (except main() and those devoted to doing I/O) must be thoroughly unit tested.

In the following, text means a list of lines (strings); string means a single string, which might be an individual word or a complete sentence (with or without punctuation).

def initialize()
You will find it easiest to keep assorted counts (number of sentences, etc.) in global variables. This function should "clear out" any old counts, and anything else that may be left over, in order to begin analyzing a new reading sample.
def readFile(fileName)
Reads in the contents of the named file, as a list of lines (each line is a string). Since this is an I/O function, you do not need to unit test it.
def extractNextSentence(text)
Returns, as a single string, the next "sentence" from the text (a list of lines, as returned from readFile) and, as side effects, (1) the sentence is removed from the text, and (2) a global count of sentences is updated.

We will consider a "sentence" to be any sequence of one or more words, ending with a period (.), exclamation point (!), or question mark (?). As in a book, there may be many sentences on a line, or many lines in a sentence.
def getWordCount(string)
Returns a count of the number of words in the string and, as side effects, (1) updates a global count of words and (2) keeps a global count of how many times each particular word (case insensitive) has occurred.

A "word" is any sequence of consecutive letters, ignoring apostrophes (hence, "don't" is a single word). Numbers don't count as words.
def getComplexWordCount(string)
Returns the number of words of three or more syllables in the string and, as a side effect, updates a global count of polysyllables.
def getLetterCount(string)
Returns the number of letters in the string (all letters are assumed to occur within words) and, as a side effect, updates a global count of letters.
def getSyllableCount(string)
Returns the number of syllables in the string, using the algorithm specified above, and, as a side effect, updates a global count of syllables.

The following functions should be trivial (probably, each just returns the value of some global variable). Since the work should already have been done, none of them needs or uses a parameter.

def getTotalSentenceCount()
Returns a count of the number of sentences processed so far.
def getTotalWordCount()
Returns a count of the number of words processed so far.
def getTotalComplexWordCount()
Returns a count of the number of polysyllables (words of three or more syllables) processed so far.
def getTotalLetterCount()
Returns a count of the number of letters processed so far.

For Kincaid and Flesch, you need to keep a count of the total number of syllables. For the Lix measure, you also need to keep a count of the number of words having six or more letters. I did not notice this when I wrote the assignment, so how you do this is up to you. I suggest using the required methods (such as keeping track of the total word count) as a model.


def getKincaidMeasure()
def getARImeasure()
def getColemanLiauMeasure()
def getFleschMeasure()
def getGunningFogMeasure()
def getLixMeasure()
def getSmogMeasure()
Each of these returns, as a float, the named measure.
def getRichness()
This is my own (half-baked) scheme to measure how "rich" the vocabulary is. Using a dictionary, keep a count of how many times each individual word occurs in the text ("the" will probably be the most common word, followed by "and", "is", etc.). Let c be 1/2 the total word count. Add the word frequencies (93 times for "the", 32 for "and", ....) until you reach or exceed c. Return, as an integer result, the number of words you had to add to reach or exceed c.

Note: You will find it helpful to sort the dictionary by frequency. I'll tell you in the next lecture a simple way to do that.
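As a hedged sketch of the computation (one simple way to sort a dictionary by frequency is `sorted` with a `key` function; passing the table and total as parameters here is just to keep the example self-contained, whereas your version will read the globals):

```python
def getRichness(word_frequencies, total_words):
    """Return how many of the most frequent distinct words are needed
    for their combined counts to reach half of all word occurrences."""
    target = total_words / 2
    covered = 0
    distinct = 0
    # (word, count) pairs in descending order of count.
    for word, count in sorted(word_frequencies.items(),
                              key=lambda pair: pair[1], reverse=True):
        covered += count
        distinct += 1
        if covered >= target:
            break
    return distinct
```

For example, with frequencies `{"the": 93, "and": 32, "is": 20, "cat": 5}` out of 150 total words, "the" alone reaches the target of 75, so the richness is 1.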

def main()
Asks the user to enter the name of a file to read in, analyzes the text, and prints out all the above measures for the complete text. Also print out the ten most frequent words, along with how many times each occurred in the text. Then asks the user if they want to enter the name of another file. The program should be able to process as many files as the user cares to give it. Since this method handles I/O with the user (and should not itself do any computation!), you do not need to unit test it.
 

Grading:

A significant part of the grade will be based on how good your unit tests are, and whether it looks like you used TDD. In addition, you are expected to get the same answers as I do (so use the same formulae!).

Due date:

Before 6 AM, Friday October 10, 2009, via Blackboard. Turn in only one copy of the assignment for your team, making sure that both your names are on it.