CIT 591 Assignment 6: Readability
Fall 2011, David Matuszek

Purposes of this assignment

General idea of the assignment

There are various measures of "readability": how good a reader needs to be in order to understand a passage of English text. These measures are based on the average length of words (usually measured in syllables) and the average length of sentences (measured in words). The result is usually given as the number of years a child has to have attended school (her grade level) in order to understand the text. These measures are crude, but better than nothing. Your assignment is to let the user read in a passage of text (from a file), apply the formulae, and print out the results.

Also, just to give you some experience with Python dictionaries, I've added a measure of my own: How "rich" the vocabulary is, that is, how many different words are used.

Technical details

Here are the formulae you should apply. These have been taken from the corresponding Wikipedia articles and from http://www.readability.info/info.shtml; where these disagree, I've used the formula from the latter. In addition, I've made some simplifications of my own (such as using the entire text, not selected sentences). I'm using a simple syllable-counting algorithm based on an article in Stack Overflow.

Since we will be using our own unit tests to test your program, follow this assignment exactly as written. Provide exactly the functions listed below, with exactly the same names and parameter lists, and save them in a file named exactly readability.py. Save your unit tests in a second file named readability_test.py.

The following formulae use the total number of letters (not characters), syllables, words, and sentences in the document. Some of the formulae also depend on how many "big words" there are.

Name | Formula | Notes
---- | ------- | -----
Kincaid | 11.8*syllables/words + 0.39*words/sentences - 15.59 | Also known as "Flesch-Kincaid."
Automated Readability Index | 4.71*letters/words + 0.5*words/sentences - 21.43 | Also known as "ARI."
Coleman-Liau | 5.89*letters/words - 0.3*sentences/(100*words) - 15.8 |
Flesch | 206.835 - 84.6*syllables/words - 1.015*words/sentences | Not comparable to others: high scores (up to about 100) are easier.
Fog (Gunning) | 0.4*(words/sentences + 100*((words >= 3 syllables)/words)) | The words count includes the count of words with three or more syllables.
Lix | words/sentences + 100*(words >= 6 characters)/words | The words count includes the count of words with six or more letters.
SMOG | 3 + square root of (30*(words >= 3 syllables)/words) | SMOG is an acronym for "Simple Measure Of Gobbledygook." There are various versions; use this one. The words count includes the count of words with three or more syllables.
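As a rough illustration of how the formulae translate into Python, here is a minimal sketch of four of the measures. The function names and parameter lists are the ones required later in this assignment; the bodies are only one possible transcription of the formulae above, and they assume Python 3, where / is always floating-point division.

```python
import math

def getKincaidMeasure(sentenceCount, wordCount, syllableCount):
    # 11.8*syllables/words + 0.39*words/sentences - 15.59
    return 11.8 * syllableCount / wordCount + 0.39 * wordCount / sentenceCount - 15.59

def getARImeasure(sentenceCount, wordCount, letterCount):
    # 4.71*letters/words + 0.5*words/sentences - 21.43
    return 4.71 * letterCount / wordCount + 0.5 * wordCount / sentenceCount - 21.43

def getFleschMeasure(sentenceCount, wordCount, syllableCount):
    # 206.835 - 84.6*syllables/words - 1.015*words/sentences
    return 206.835 - 84.6 * syllableCount / wordCount - 1.015 * wordCount / sentenceCount

def getSmogMeasure(wordCount, polysyllabicWordCount):
    # 3 + square root of (30 * polysyllabic words / words)
    return 3 + math.sqrt(30 * polysyllabicWordCount / wordCount)
```

For a text of 2 sentences, 10 words, and 15 syllables, the Kincaid sketch gives 17.7 + 1.95 - 15.59 = 4.06. None of these sketches guards against a word count or sentence count of zero.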

Counting sentences

A sentence is any sequence of one or more words (there are no zero-length sentences), ending with a period (.), a question mark (?), or an exclamation point (!). For example, "Dr. Dave teaches CIS591." is two sentences.

If the input text ends with words, but not a period, question mark, or exclamation point, it also counts as a sentence. (Hint: Instead of making a special case for this, just put a period at the end of the input when you read it in.)
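The sentence rule and the hint can be sketched as follows. The helper name countSentences is hypothetical (it is not one of the required functions), and the sketch assumes the text has already been joined into a single string of plain ASCII English.

```python
import re

def countSentences(text):
    # Follow the hint: append a period so trailing words form a sentence.
    chunks = re.split(r"[.?!]", text + ".")
    # A chunk is a sentence only if it contains at least one word,
    # that is, at least one letter or apostrophe.
    return sum(1 for chunk in chunks if re.search(r"[A-Za-z']", chunk))

print(countSentences("Dr. Dave teaches CIS591."))  # prints 2
```

Because empty chunks are skipped, the extra period does no harm when the text already ends with punctuation, and runs such as "?!" do not create zero-length sentences.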

Counting words

A word is any consecutive sequence of letters, where an apostrophe (') is considered to be a letter. (There are cases where an apostrophe is used as a quotation mark, but they are rare, and we will ignore them.)

Anything that is not a letter (whitespace, punctuation, digits) just separates words. Hyphenated words, such as ex-wife, should be counted as two words.

Case doesn't matter. Words that differ only in case ("Pat" and "pat") are considered to be the same word.
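One way to apply these word rules is a single regular expression. The helper name splitIntoWords is hypothetical, and the sketch assumes only the 26 unaccented English letters occur in words.

```python
import re

def splitIntoWords(text):
    # Lowercase first so that "Pat" and "pat" become the same word,
    # then take every run of letters and apostrophes as one word.
    return re.findall(r"[a-z']+", text.lower())

print(splitIntoWords("Pat met pat's ex-wife 2 times"))
# prints ['pat', 'met', "pat's", 'ex', 'wife', 'times']
```

The hyphen and the digit simply act as separators, so "ex-wife" comes out as two words and "2" disappears.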

Counting syllables

The vowels are a, e, i, o, u, y. Each sequence of consecutive vowels counts as one syllable (so "delicious" has three syllables). However, there are some special cases:
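Setting the special cases aside, the basic vowel-group rule on its own can be sketched as follows. The helper name countVowelGroups is hypothetical, and this sketch deliberately implements only the basic rule, so it is not a complete syllable counter for this assignment.

```python
import re

def countVowelGroups(word):
    # Each run of consecutive vowels (a, e, i, o, u, y) counts once.
    return len(re.findall(r"[aeiouy]+", word.lower()))

print(countVowelGroups("delicious"))  # prints 3 (e, i, iou)
```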

Required methods

All the methods listed are required. Be careful to get the spelling and capitalization exactly as shown. You may have additional methods, and all methods (except main() and those devoted to doing I/O) must be thoroughly unit tested.

def readFile(fileName)
Reads in the contents of the named file, as a list of lines (each line is a string). Since this is an I/O function, you do not need to unit test it.
def getSentences(text)
Takes the text as a list of strings (as returned by readFile) and returns a list of "sentences". Each "sentence" is a string that consists only of lowercased words separated by single blanks. For example, She said, "Don't you dare!" would be returned as the string "she said don't you dare".
def getSentenceCount(sentences)
Returns a count of the number of sentences (as returned by getSentences). This is a trivial single-line function; I include it for consistency with the following functions.
def getWordCount(sentences, dictionary)
Returns a count of the number of words in the sentences. As a side effect, for each word it either (1) adds the word to the dictionary with a count of 1, or (2) if the word is already in the dictionary, increases its count by 1.
def getLongWordCount(sentences)
Returns the number of words of six or more letters in the sentences.
def getPolysyllableWordCount(sentences)
Returns the number of words of three or more syllables in the sentences.
def getLetterCount(sentences)
Returns the number of letters in the sentences (apostrophes do not count as letters).
def getSyllableCount(sentences)
Returns the number of syllables in the sentences, using the algorithm specified above.

For reasons of efficiency, you should perform the above counts only once for each input text. Efficiency is not often an issue, but if I give you a text like Moby Dick, efficiency will become an issue.
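As an illustration of the counting functions, here is a minimal sketch of getWordCount that makes one pass over the sentences, returning the count and updating the dictionary at the same time. It assumes the sentences are in the form getSentences returns (lowercased words separated by single blanks); the body is one possible approach, not the reference solution.

```python
def getWordCount(sentences, dictionary):
    count = 0
    for sentence in sentences:
        for word in sentence.split():
            # Side effect: start a new word at 1, or add 1 to an existing count.
            dictionary[word] = dictionary.get(word, 0) + 1
            count += 1
    return count

wordCounts = {}
total = getWordCount(["she said don't you dare", "she did"], wordCounts)
print(total, wordCounts["she"])  # prints 7 2
```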

def getKincaidMeasure(sentenceCount, wordCount, syllableCount)
def getARImeasure(sentenceCount, wordCount, letterCount)
def getColemanLiauMeasure(sentenceCount, wordCount, letterCount)
def getFleschMeasure(sentenceCount, wordCount, syllableCount)
def getGunningFogMeasure(sentenceCount, wordCount, polysyllabicWordCount)
def getLixMeasure(sentenceCount, wordCount, longWordCount)
def getSmogMeasure(wordCount, polysyllabicWordCount)
Each of these returns, as a floating point number, the named measure.
def getRichness(dictionary)
This is my own (half-baked) scheme to measure how "rich" the vocabulary is. Using a dictionary, keep a count of how many times each individual word occurs in the text ("the" will probably be the most common word, followed by "and", "is", etc.). Convert your dictionary into a list of (word, count) tuples, sorted so that the most frequent words are first. Add the word counts, starting with the most frequent word and proceeding until the sum equals or exceeds 1/2 the total word count. Return as a result the number of distinct words whose counts were added to reach that sum.
def main()
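The getRichness procedure described above can be sketched like this. It assumes the dictionary maps each word to its count, as filled in by getWordCount, and it takes the total word count to be the sum of those counts; treat it as one reading of the description, not the reference implementation.

```python
def getRichness(dictionary):
    total = sum(dictionary.values())
    # (word, count) tuples, most frequent words first.
    pairs = sorted(dictionary.items(), key=lambda pair: pair[1], reverse=True)
    runningSum = 0
    wordsUsed = 0
    for word, count in pairs:
        if runningSum >= total / 2:
            break  # already reached half of the total word count
        runningSum += count
        wordsUsed += 1
    return wordsUsed

print(getRichness({"the": 5, "cat": 3, "sat": 2}))  # prints 1
```

In the example, "the" alone accounts for 5 of the 10 words, so only one word is needed to reach half the total. Words with tied counts may sort in either order without changing the result, since tied words contribute equal amounts to the sum.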

Functional programming

I have tried to give you an assignment where the functional programming features of Python (map, filter, reduce, list comprehensions, and lambda expressions) can be helpful, so please try to use them.

Any program can be written without these features, and it would be pretty artificial to try to force you to use them ("use map at least three times" kind of thing). And to be honest, I don't know how useful they might be in this particular program. So nothing is required, but I do encourage you to try to use functional programming features where you can.
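To give a flavor of what that might look like, here is a sketch of getLetterCount written with map, reduce, a lambda expression, and a list comprehension. It assumes Python 3 (where reduce must be imported from functools) and sentences in the form returned by getSentences; a plain loop would work just as well.

```python
from functools import reduce

def getLetterCount(sentences):
    # map: turn each sentence into its number of letters
    # (blanks and apostrophes are not letters, so isalpha() skips them).
    lettersPerSentence = map(
        lambda sentence: len([ch for ch in sentence if ch.isalpha()]),
        sentences)
    # reduce: add the per-sentence counts together, starting from 0.
    return reduce(lambda total, n: total + n, lettersPerSentence, 0)

print(getLetterCount(["she said don't you dare"]))  # prints 18
```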

Grading:

A significant part of the grade will be based on how good your unit tests are, and whether it looks like you used TDD. In addition, you are expected to get the same answers as I do (so use the same formulae!).

Due date:

Before 6 AM, Friday October 21, 2011, via Sakai. Zip and turn in only one copy of the assignment for your team, making sure that both your names are on it.