CIT 595 - Computer Systems II

Spring 2011

C Refresher Lab #1

This lab is intended for students who need a refresher in C programming, particularly with arrays and pointers. Choose the assignment that seems most appropriate for you. Be sure to ask the instructor if you need help!

Write a program that looks through a text file and reports the ten most common letters in the text, along with the number of occurrences (you can assume the text file is all lower case letters and contains no punctuation other than whitespace). Your program should read in a text file, determine the most frequent letters, and report the ten most common letters and the number of occurrences, in sorted order.

The most challenging part of this assignment is: how will you determine the most frequent letters? Do you need to sort the entire list? If so, how will you do that? If not, how will you find the ten most common?

In order to read from a text file, you can use this library, which you should not change. There is also a sample program which uses that code to read from this file. Download all three, compile the test program, and make sure it runs correctly.

To use the library, first call the "initialize" function, and specify the name of the file that you want to read. Then, when you call the "readNext" file, it will return the next string of characters from the text file. Keep calling "readNext" until it returns NULL, indicating the end of the file.

While developing your program, you should create your own small, simple text files to use so that you can be sure you are correctly reading the words, getting the bigrams, and keeping track of them. Once you think you have your program in a good state, you can try it on a larger file. Here is a link to the text of the Declaration of Independence. When you run your program on that file, you should have output that is similar to the following:

e: 860
t: 638
o: 514
n: 483
s: 476
a: 474
i: 449
r: 424
h: 349
d: 254

Modify the "easy" program above so that it reports the most common "bigrams" instead of the most common letters. Bigrams are groups of two written letters (or sometimes two syllables, or two words), and are very commonly used as the basis for simple statistical analysis of text. They are also used in many of the most successful language models for speech recognition. The most common bigrams in written English are "th", "er", "on", "an", and "re".

The most challenging part of this assignment is: how will you keep track of the number of occurrences of the bigrams? There are many ways to do this, including using arrays (either one- or two-dimensional) of ints.

You can use the file reader code, sample program, and Declaration of Independence text described above. Assume that bigrams do not span across words. When you run the program on the Declaration of Independence file, you should get the following output:

th: 189
he: 176
in: 118
es: 107
an: 105
nd: 103
re: 97
en: 92
on: 92
ti: 82

A "hashtable" is a data structure that can be thought of as an array of linked lists. It uses a hash function for each element to determine an integer representation (the hash code) for that element, and then uses the hash code to determine which "bucket" it belongs in, i.e. which element of the array holds the linked list it should be added to or found in. To figure out which bucket the element is in, we take the result of the hash code mod the size of the array.

As an example, consider a hashtable of strings of size 20, and a hash function that adds up the ASCII values of the characters in the string. For the string "dog", the hash code would be 100 + 111 + 103 = 314. Since 314 % 20 = 14, the string would be added to the linked list that is in the 14th spot (0-based, of course) of the array.

In this assignment, you will implement a hashtable using this skeleton code, which has a linked list implementation already defined. You simply need to implement the following functions:

There is a "main" function in the skeleton code that you can use to test your functions.