CIT 595 - Computer Systems II
C Refresher Lab #1
This lab is intended for students who need a refresher in C programming, particularly with arrays and pointers. Choose the assignment that seems most appropriate for you. Be sure to ask the instructor if you need help!
Write a program that looks through a text file and reports the ten most common letters in the text, along with the number of occurrences (you can assume the text file is all lower case letters and contains no punctuation other than whitespace). Your program should read in a text file, determine the most frequent letters, and report the ten most common letters and the number of occurrences, in sorted order.
The most challenging part of this assignment is: how will you determine the most frequent letters? Do you need to sort the entire list? If so, how will you do that? If not, how will you find the ten most common?
In order to read from a text file, you can use this library, which you should not change. There is also a sample program which uses that code to read from this file. Download all three, compile the test program, and make sure it runs correctly.
To use the library, first call the "initialize" function, and specify the name of the file that you want to read. Then, when you call the "readNext" file, it will return the next string of characters from the text file. Keep calling "readNext" until it returns NULL, indicating the end of the file.
While developing your program, you should create your own small, simple text files to use so that you can be sure you are correctly reading the words, getting the bigrams, and keeping track of them. Once you think you have your program in a good state, you can try it on a larger file. Here is a link to the text of the Declaration of Independence. When you run your program on that file, you should have output that is similar to the following:
Modify the "easy" program above so that it reports the most common "bigrams" instead of the most common letters. Bigrams are groups of two written letters (or sometimes two syllables, or two words), and are very commonly used as the basis for simple statistical analysis of text. They are also used in many of the most successful language models for speech recognition. The most common bigrams in written English are "th", "er", "on", "an", and "re".
The most challenging part of this assignment is: how will you keep track of the number of occurrences of the bigrams? There are many ways to do this, including using arrays (either one- or two-dimensional) of ints.
You can use the file reader code, sample program, and Declaration of Independence text described above. Assume that bigrams do not span across words. When you run the program on the Declaration of Independence file, you should get the following output:
A "hashtable" is a data structure that can be thought of as an array of linked lists. It uses a hash
function for each element to determine an integer representation (the hash code) for that
element, and then uses the hash code to determine which "bucket" it belongs in, i.e. which element of the
array holds the linked list it should be added to or found in. To figure out which bucket the element is
in, we take the result of the hash code mod the size of the array.
As an example, consider a hashtable of strings of size 20, and a hash function that adds up the ASCII values
of the characters in the string. For the string "dog", the hash code would be 100 + 111 + 103 = 314. Since 314 % 20 = 14, the string would be added to the linked list that is in the 14th spot (0-based, of course) of
In this assignment, you will implement a hashtable using this skeleton code, which has a linked list implementation already defined. You simply need to implement the following functions:
There is a "main" function in the skeleton code that you can use to test your functions.
- hash: For the given string refered to by the pointer "string", calculate the hashcode and update the parameter "value", which is a pointer to an int. You can implement the hash function any way you like, but it should try to come up with distinct hash codes for different strings. Return 0 if an error occurs.
- put: Add the string to the hashtable in the appropriate "bucket". To implement this function, you need to get the string's hash code, mod it by the size of the array, and then add it to the linked list at that spot in the array. You do not have to handle the case where the string is already in the hashtable, though you can if you'd like. Return 0 if an error occurs.
- get: Determine whether the specified string is in the hashtable. To implement this function, you need to get the string's hash code, mod it by the size of the array, and then look for it in the corresponding linked list. Return 1 if it is found, and 0 if it is not, or if an error occurs.