Spell Checker

Extended Deadline: The deadline for this assignment has been extended to Monday, 25 October 9:00 PM. You may still use a three hour grace period (until 11:59 PM), but you may not use any late days. Since many of you have been working very hard on this and are finished or close to finishing, we will also give an additional point of extra credit to submissions received prior to the original deadline (Thursday, 21 October 9:00 PM). Any assignment submitted by the extended deadline (Monday) will be eligible for the extra credit options at the end of the assignment.

Goals



Submission

Submit LinkedList.java, MRUList.java, Dictionary.java, SpellChecker.java and a completed readme_spellchecker.txt using the submission link on the left. Extra credit: submit a zip file named extra.zip that contains a complete version of your program (all .java files) based on Java generics where LinkedList is based on the GenericList.java interface (described below).



Introduction

A spell checker is a common application used in conjunction with document preparation. Essentially, the spell checker will maintain a dictionary of words that are spelled correctly. Then, it will go through every word in a given text document and check if that word is found in the dictionary. If it is not found, then the word is assumed to be spelled incorrectly.

Since documents typically focus on one subject, many words in a document are often repeated. Hence, it makes sense that checking the spelling of repeated words should be made as fast as possible. In this project you will implement a rudimentary spell checker that provides for faster access to frequently used words. This fast access can be provided in the manner that we organize the dictionary.

The dictionary will be implemented as an array of 26 Lists; each list will contain the words that start with a particular letter of the alphabet. To spell check a particular word (e.g., "computer"), the program will locate the appropriate list (e.g., the list of words beginning with "C") and then check to see whether the word is contained in the list. As the spell checker runs, it will dynamically reorganize the dictionary so that frequently used words are at the front of their respective lists, enabling them to be found quickly the next time those words appear.

In this assignment, you will implement two variations of the list Abstract Data Type (ADT):
  1. a linked list, exactly as we discussed in class.  
  2. a new data structure known as a most-recently used (MRU) list that is based on a linked list.  Essentially, the MRU list is a linked list where every time an element is "found" that element is moved to the head of the list. In this way, frequently accessed data is always close to the front of the list, so subsequent "finds" for that data are quicker.  For our spell-checker, this means that repeated words in a document are checked more quickly.

You will also be implementing two additional classes, as described below:
  1. the SpellChecker class, which represents the main program
  2. the Dictionary class for representing the dictionary of words and facilitating lookups 

We have already provided you with skeletons of all the classes you need to implement, which you can download here:
Additionally, you will also need to download the following files for this assignment:



Spelling Rules

These spelling rules are already implemented for you in the removePunctuationAndSplitIntoWords() function in SpellChecker.java. However, to ensure a thorough understanding of how the program works and to help with debugging, you should read them over several times.
  1. A word in the text file is defined to be a sequence of characters surrounded by whitespace. Any sequence of characters containing something other than the letters A-Z and a-z should be considered misspelled, but see below concerning punctuation characters. For example, in the string "This.example shows many_things." there are 3 words: "This.example", "shows", and "many_things.".
  2. All words should be considered to be case-sensitive.  E.g. if you find the word "monday" in the text file and that word is not found in the dictionary, then you have a misspelled word since "Monday" would be in the dictionary.
  3. Punctuation is removed from words using the following rules:



LinkedList Abstract Data Type

You should start by implementing the LinkedList class, which should represent a singly linked list as discussed in class.  To get you started, we've provided a skeleton LinkedList.java.

The List Interface

public Interface List
-----------------------------------------------------------------------------------------
boolean add(String x)             // add the object x to the end of the list
boolean add(int index, String x)  // add the object x at the specified index
    int size()                    // return the number of elements in this list
 String get(int index)            // return the element at the specified index
 String set(int index, String x)  // replace the object at the specified index, returning the replaced object
 String remove(int index)         // remove the object at the specified index, returning the replaced object
boolean isEmpty()                 // test whether this list has no elements
boolean contains(String element)  // test whether the list contains the given element
    int indexOf(String element)   // return the first index of the specified element or -1 if it is not found in the list


Your LinkedList class must implement the provided List interface (List.java) exactly as it is given.  You are not permitted to modify List.java.  In Java, an interface is a special type of construct that defines a public API.    Essentially it defines a set of public methods that a class which implements that interface must have, but it doesn't provide any implementation. 

We have provided the file List.java that specifies the (minimal) set of public methods that your LinkedList class must contain.  In order to have Java enforce this automatically, the LinkedList class must be defined as follows:
     public class LinkedList implements List {
        ...
     }
The implements keyword in Java allows us to specify the name of an interface that the class must use, in this case, List.  The standard LinkedList provides a number of other methods that you are not required to implement in this assignment.  For a full list of these methods for a standard LinkedList, you can refer to the javadocs: http://docs.oracle.com/javase/6/docs/api/java/util/LinkedList.html

Implementing LinkedList

You must complete the provided LinkedList.java skeleton. You must also implement the inner class: Node. (Hint: Your node will need two fields: a next reference and a String). Skeleton methods and explanations for LinkedList are provided in the file.

Notes:
Hint: To get started implementing the LinkedList, start by implementing the Node inner class, the constructor, add(), and get(). Then, use the unit testing code described below to ensure that you can add items and print out the list. Once these methods are working, move on to implement the others.

Unit Testing LinkedList

Test your LinkedList implementation before continuing to the next part of the assignment. Copy the following code into the main function of LinkedList to test add() and get().

      LinkedList l = new LinkedList();
    l.add("this");
    l.add("is");
    l.add("a");
    l.add("list");


    for (int i = 3; i >= 0; i--) {
        System.out.println(l.get(i));
    }

You should see the following output when you run LinkedList.

    % java LinkedList
    list
    a
    is
    this

Add additional code to your main method to test the other methods of LinkedList. If you do not have tests for your LinkedList implementation, we cannot guarantee we can help you on later parts of the assignment.

The submission script runs extensive testing on your LinkedList implementation, so you should also try submitting your LinkedList.java file before moving on to the next part of this assignment. You can always re-submit it later, and the submission script testing may prove very useful for helping to identify problems in your LinkedList implementation. Note that the online submission testing is not a substitute for doing your own unit testing in main(); the online tests only identify that you have a problem in a particular method, but do not help you debug that problem like your unit tests will.




MRUList Abstract Data Type

Once you've completely finished and tested your LinkedList class, move on to your MRUList class. 

The Most Recently Used (MRU) List is a linked list with a special property:  whenever a data element is accessed using contains(), the MRU list assumes that it is likely to be accessed again soon, and so moves it to the head of the list to accelerate subsequent accesses. 

Adding elements works exactly the same as in LinkedList -- new elements are added to the tail of the MRUList.


Inheritance from LinkedList

While we could completely re-implement this class from scratch, Java provides an easy way for us to build off of our LinkedList class through inheritance.

The MRUList class extends LinkedList via inheritance. By definition, this means that MRUList will automatically inherit all fields and methods from LinkedList.  Recall that this means that you can call any LinkedList method directly by using the super keyword.  For example,
    super.contains(element)
You may also add methods that LinkedList does not have, or you may override methods provided by LinkedList.

Implementing MRUList

Start by implementing a new contains() method in MRUList that overrides LinkedList.contains().  Your new contains method should handle the functionality of the MRUList:  each time an element is accessed, it should be moved to the head of the list.  Do not simply copy/paste what you have from LinkedList.contains() and change it; use the functionality provided by the LinkedList class to move the node. The contains() method is the only one you need to override.

You should also add one additional (new) method to your MRUList class:

Hint:  your MRUList class should be much shorter than your LinkedList class.  However, it automatically inherits all of the functionality of the LinkedList class.


Unit Testing MRUList

To test your MRUList, copy the following code into your main method:

    MRUList mru = new MRUList();
    mru.add("i");
    mru.add("like");
    mru.add("apples");
                        
    mru.printWords(3);
        
    mru.contains("apples");
    mru.printWords(3);

Which should output:
    i like apples
    apples i like

Write a few other tests to make certain that your MRUList works.


Dictionary ADT

The dictionary represents the complete set of correctly spelled words. The Dictionary is essentially an array of 26 MRULists, each of which holds words that start with a single letter. By using MRULists, the dictionary facilitates fast lookup of repeated words. 

Your Dictionary class must conform to the following API:

public class Dictionary
-----------------------------------------------------------------------------------------
        Dictionary()          // default constructor that creates an empty dictionary
   void insert(String word)   // add the specified word to the dictionary if it does not already contain that word;
                              //   throws an IllegalArgumentException if the word is invalid
boolean contains(String word) // tests whether the dictionary contains the specified word
   void printMRU(int n)       // outputs the n most recently used words in the dictionary for each letter of the alphabet

The printMRU() method outputs the n most recently used words in the dictionary for each letter of the alphabet.  If the list contains less than n words, then it outputs the entire list.  The format of this output must be exactly as specified here.  This method should output 26 lines of text, each of which starts with the letter of the alphabet followed by a colon (:) followed by a list of the n words separated by single spaces.  Here is an example output where n = 5 in the required format:

          A: apple ant Amy A any
          B: bat Bats bunny biscuit bow
          C: canary cute cup cups candle
          D: dog Dad dot duffel dumpster
          ...

        If a particular MRUList is empty, then the corresponding letter's list should not be printed. Your output may contain trailing spaces at the end of each line.


If you are unable to get your MRUList working, you can substitute LinkedList in its place in the Dictionary class for partial credit.  It will still be able to spell check, but the MRU functionality will not be enabled.

Test your dictionary class in the same way you tested your LinkedList and MRUList classes.  In this case, it is up to you to come up with the unit tests to place in the Dictionary's main function.



The SpellChecker Program

The spell checking program will be executed by the following command:
    % java SpellChecker sample-article.txt sample-dictionary.txt
where the first command line argument is the filename for the text file to spell check and the second command line argument is the filename for the dictionary file.  Sample article and dictionary files are provided in the files you downloaded above.  However, your program must be able to work with other articles and dictionaries in the same format, so you are encouraged to create other articles and dictionary files to test your program.

The dictionary file is simply a long list of words separated by spaces and newlines.  The words in the dictionary are case-sensitive.

If the user specifies an incorrect number of command line arguments or invalid arguments, the program should print an appropriate message describing how to use the program and then gracefully exit.

SpellChecker Class

The SpellChecker class has only the main method and one or more helper methods.

The main method should first read in the dictionary file and construct a Dictionary object, populating it with words. Note that you do not need to remove punctuation from any of the dictionary words; your Dictionary object should contain the words exactly as they are given in the file. You may also assume that the Dictionary file will only contain valid words (i.e., Dictionary.insert() should never throw an IllegalArgumentException when using one of our provided dictionary files).

Next, it should read in the text file, and separate it into words for spell checking. We have already provided the function
   String[] removePunctuationAndSplitIntoWords(String s)
in the SpellChecker skeleton for separating a long string s into an array of words based on the spelling rules described above. You should then spell-check each of the resulting words in the text file. 

If a word is misspelled (i.e., it is not in the Dictionary), your program should print out the word in the following format on a single line:
    1.) someone spelled *htis* word wrong
Notice how the output shows two words before the misspelled word, surrounds the misspelled words with asterisks (*), and prints two words after the misspelled word to identify the context of the spelling mistake.  In the case where the words is not preceded or succeeded by two words, the program outputs as many words before/after as are available. Each spelling mistake is numbered consecutively. Your program should only identify one misspelled word in each output line. Consecutive misspelled words require multiple output lines. For example
    1.) someone spelled *htis* word wrogn
    2.) htis word *wrogn*
would be two output lines.

Your program should ignore numbers, considering them valid words. To test whether a string is a valid number, we have provided the function isNumber(String s) in SpellChecker.java.

For example, consider the text file that contains "The dog deos tricks for darling dasy" And the dictionary file that contains "for days darling dog does the The days cat hate love tricks".  The output of this program should be:                                                                                

     % java SpellChecker article.txt dict.txt
       Spell checking article.txt using dict.txt
      
       Misspelled Words:
       1.) The dog *deos* tricks for
       2.) for darling *dasy*


SpellChecker should also support an optional argument: the -d flag (d for debug). If the String “-d” is passed in as a third argument, it should also print the first 10 words in the MRUList for each letter in the Dictionary (or all the words if there are fewer than 10 in the MRUList).

For the above example, running SpellChecker with the -d flag should yield:  

     % java SpellChecker article.txt dict.txt -d
      
Spell checking article.txt using dict.txt                  

       Misspelled Words:
       1.) The dog *deos* tricks for  

       2.) for darling *dasy*

       Most Recently Used Words:  
       C: cat
       D: darling dog days does
       F: for
       H: hate
       L: love
       T: tricks The the

Please note that your program must match this output format exactly, including all spaces, blank lines, and descriptive lines (e.g., "Misspelled Words:", etc.).

In the case where there are no misspelled words, your program should output the following:

     % java SpellChecker article.txt dict.txt
       Spell checking article.txt using dict.txt                  

       No Misspelled Words

For the provided sample dictionary and article, your output should match the following:
     % java SpellChecker sample-article.txt sample-dictionary.txt -d
     Spell checking sample-article.txt using sample-dictionary.txt
    
     Misspelled Words:
     1.) The *feld* of mobile
     2.) presented an *accelerometr* based custom
     3.) could control *streamin* audoi Geiger
     4.) control streamin *audoi* Geiger desinged
     5.) audoi Geiger *desinged* a touch
     6.) based interaction *pardaigm* with integrated
     7.) Pure Data *PD* for Linux
     8.) devices like *iPaqs* Using mobile
     9.) Greg Schiemer *ni* his PocketGamelan
     10.) ni his *PocketGamelan* instrument At
     11.) instrument At *teh* smae time
     12.) At teh *smae* time there
     13.) there has *ben* an effort
     14.) CaMus and *CaMus2* introduced systems
     15.) introduced systems *thta* use onboard
     16.) of mobile *pohnes* for tracking
     17.) for tracking *viusal* references for
     18.) for musical *intteraction*

     Most Recently Used Words:
     A: and allow an At a augmented audio at accelerometer
     B: build by been based bodies
     C: cameras CaMus commodity control could custom
     D: devices Data device designed
     E: effort enabled explored
     F: for field
     G: Greg Geiger
     H: has his
     I: introduced interactive instrument integrated interaction inspired informed in iPads
     L: live like Linux
     M: musical mobile made much music
     O: of onboard on
     P: phones performance pioneered portal Pure port PDA presented paradigm
     R: references research
     S: systems Schiemer synthesis sound screen several streaming same
     T: tracking to there time the touch that Tanaka this The
     U: use up Using using
     V: visual
     W: ways with work which



Extra Credit: Java Generics

The LinkedList class we wrote above can contain only strings; however, it is often useful to implement data structures in a more generic way such that they can hold any type of data. Java provides Generics specifically for this purpose.

For extra credit, modify your LinkedList class to be based upon the interface GenericList.java instead of List.java. The methods in GenericList are exactly the same as in List, but they are based on a generic type T that is specified when the LinkedList is declared and initialized. This will enable your LinkedList to contain any type of object, not just strings. For example, you could then have implemented the Dictionary as a LinkedList of MRULists, instead of an array of MRULists. You will likely need to modify your other classes as well to enable them to use the generic form of the LinkedList.

If you do the extra credit, you will submit two versions of the assignment: the original (non-generic) version of all files as normal, and an extra.zip file containing new versions of all four .java files (LinkedList.java, MRUList.java, Dictionary.java, SpellChecker.java) modified to work with the generic LinkedList. Even if one or more of these files are unchanged from your non-generic version, you should still include it in extra.zip.