Spell Checker
Extended Deadline: The deadline for this assignment has
been extended to Monday, 25 October 9:00 PM. You may
still use a three hour grace period (until 11:59 PM), but you
may not use any late days. Since many of you have been
working very hard on this and are finished or close to finishing,
we will also give an additional point of extra credit to
submissions received prior to the original deadline (Thursday, 21
October 9:00 PM). Any assignment submitted by the extended
deadline (Monday) will be eligible for the extra credit options
at the end of the assignment.
- Learn how to use classes to implement linked lists and other abstract data types in Java
- Learn about using inheritance in Java
Submit LinkedList.java, MRUList.java, Dictionary.java, SpellChecker.java
and a completed readme_spellchecker.txt
using the submission link on the left. Extra credit: submit a zip file named extra.zip that contains a complete version of your program (all .java files) based on Java generics where LinkedList is based on the GenericList.java interface (described
below).
A spell checker is a common application used in conjunction with
document preparation. Essentially, the spell checker will maintain a dictionary of words that are spelled correctly. Then, it will go through every word in a given text document and check if that word is found in the dictionary. If it is not found, then the word is assumed to be spelled incorrectly.
Since documents typically focus on one subject,
many words in a document are often repeated. Hence, it makes sense
that checking the spelling of repeated words should be made as fast
as possible. In this project you will implement a rudimentary spell
checker that provides for faster access to frequently used words. This fast access can be provided in the manner that we organize the dictionary.
The dictionary will be implemented as an array of 26 Lists; each list will contain the words that start with a particular letter of the alphabet. To spell check a particular word (e.g., "computer"), the program will locate the appropriate list (e.g., the list of words beginning with "C") and then check to see whether the word is contained in the list. As the spell checker runs, it will dynamically reorganize the dictionary so that frequently used words are at the front of their respective lists, enabling them to be found quickly the next time those words appear.
In this assignment, you will implement two variations of the list
Abstract Data Type (ADT):
- a linked list, exactly as we discussed in
class.
- a new data structure known as a most-recently used (MRU)
list that is based on a linked list. Essentially,
the MRU list is a linked list where every time an element is
"found" that element is moved to the head of the list. In this
way, frequently accessed data is always close to the front of
the list, so subsequent "finds" for that data are quicker.
For our spell-checker, this means that repeated words in a
document are checked more quickly.
You will also be implementing two additional classes, as described
below:
- the SpellChecker class, which represents the main
program
- the Dictionary class for representing the dictionary
of words and facilitating lookups
We have already provided you with skeletons of all the classes you
need to implement, which you can download here:
Additionally, you will also need to download the following files for this assignment:
These spelling rules are already implemented for you in the removePunctuationAndSplitIntoWords()
function in SpellChecker.java. However, to ensure a thorough understanding of how the program works and to help with debugging, you should read them over several times.
- A word in the text file is defined to be a sequence of
characters surrounded by whitespace. Any sequence of characters
containing something other than the letters A-Z and a-z should
be considered misspelled, but see below concerning punctuation
characters. For example, in the string "This.example shows
many_things." there are 3 words: "This.example", "shows", and
"many_things.".
- All words should be considered to be case-sensitive.
E.g. if you find the word "monday" in the text file and that
word is not found in the dictionary, then you have a misspelled
word since "Monday" would be in the dictionary.
- Punctuation is removed from words using the following rules:
- All punctuation characters tailing a word should be removed.
Punctuation characters are defined as periods, exclamation
points, question marks, semi-colons, and commas.
- Hyphens are the only punctuation that can be removed from
the middle of a word (i.e. ball-park).
- Quotation marks and parenthesis must be removed from the
beginning and end of words.
- For example, in the sentence "Hi, you're looking nice
today?", you would remove the quotes, comma, and the question
mark and search the dictionary for the words “Hi”, “you're”,
“looking”, “nice” and “today”.
You should start by implementing the LinkedList class, which should
represent a singly linked list as discussed in class. To get
you started, we've provided a skeleton LinkedList.java.
The List Interface
public Interface List
-----------------------------------------------------------------------------------------
boolean add(String x) // add the object x to the end of the list
boolean add(int index, String x) // add the object x at the specified index
int size() // return the number of elements in this list
String get(int index) // return the element at the specified index
String set(int index, String x) // replace the object at the specified index, returning the replaced object
String remove(int index) // remove the object at the specified index, returning the replaced object
boolean isEmpty() // test whether this list has no elements
boolean contains(String element) // test whether the list contains the given element
int indexOf(String element) // return the first index of the specified element or -1 if it is not found in the list
Your LinkedList class must implement the provided List interface
(List.java) exactly as it is given. You are not permitted to modify List.java. In Java, an interface
is a special type of construct that defines a public
API. Essentially it defines a set of public
methods that a class which implements that interface must have, but
it doesn't provide any implementation.
- For example, a TV interface might specify methods:
togglePower()
,
volumeUp()
, volumeDown()
, and setChannel(int
n)
. Then, any class like PanasonicTelevision
or SamsungTelevision
that implements this
interface must provide these public methods.
We have provided the file List.java that specifies the (minimal) set
of public methods that your LinkedList class must contain. In
order to have Java enforce this automatically, the LinkedList class
must be defined as follows:
public class LinkedList implements
List {
...
}
The implements
keyword in Java allows us to specify
the name of an interface that the class must use, in this case,
List. The standard LinkedList provides a number of other
methods that you are not required to implement in this
assignment. For a full list of these methods for a standard
LinkedList, you can refer to the javadocs:
http://docs.oracle.com/javase/6/docs/api/java/util/LinkedList.html
Implementing LinkedList
You must complete the provided LinkedList.java skeleton. You must
also implement the inner class: Node. (Hint: Your node will need two
fields: a next reference and a String). Skeleton methods and
explanations for LinkedList are provided in the file.
Notes:
- The
add()
method is overloaded in the List. One
version, add(element)
, adds an element to the end of
the list. The other version, add(index, element)
, adds
an element to a specific index of the list.
- Several methods, such as
get(index)
, take in
an index and return a String. In situations where the index is invalid,
it doesn't make sense for the method to signify this by returning null
,
since null
is a valid value for a String object (and could
certainly be contained in the list). Instead, your implementation
should throw an IllegalArgumentException
(with an appropriately detailed
message) when index
is invalid.
add(index, element)
should work correctly if 0 <= index
<= size()
. For example, this would allow you to call add(0,"Hello")
on an empty list and have it insert "Hello"
as the new head (and tail) element.
- In contrast to the above note, a node must already exist at the specified index for
set(index, element)
to work. Therefore, set(index, element)
can't be used to add new nodes to a list.
Hint: To get started implementing the LinkedList, start by implementing the Node inner class, the constructor, add(), and get(). Then, use the unit testing code described below to ensure that you can add items and print out the list. Once these methods are working, move on to implement the others.
Unit Testing LinkedList
Test your LinkedList implementation before continuing to the next
part of the assignment. Copy the following code into the main
function of LinkedList to test add()
and get
()
.
LinkedList l = new LinkedList();
l.add("this");
l.add("is");
l.add("a");
l.add("list");
for (int i = 3; i >= 0; i--) {
System.out.println(l.get(i));
}
You should see the following output when you run LinkedList.
% java LinkedList
list
a
is
this
Add additional code to your main method to test the other methods of
LinkedList. If you do not have tests for your LinkedList
implementation, we cannot guarantee we can help you on later parts
of the assignment.
The submission script runs extensive testing on your LinkedList implementation, so you should also try submitting your LinkedList.java file before moving on to the next part of this assignment. You can always re-submit it later, and the submission script testing may prove very useful for helping to identify problems in your LinkedList implementation. Note that the online submission testing is not a substitute for doing your own unit testing in main()
; the online tests only identify that you have a problem in a particular method, but do not help you debug that problem like your unit tests will.
Once you've completely finished and tested your LinkedList class,
move on to your MRUList class.
The Most Recently Used (MRU) List is a linked list with a special
property: whenever a data element is accessed using contains()
, the MRU list assumes that it is likely to
be accessed again soon, and so moves it to the head of the list to
accelerate subsequent accesses.
- Recall that finding an element in a linked list costs O(n)
time. So, if a particular element was accessed three times
in a row, the first access would cost O(n) while the next two
accesses would be O(1).
Adding elements works exactly the same as in LinkedList -- new
elements are added to the tail of the MRUList.
Inheritance from LinkedList
While we could completely re-implement this class from scratch, Java
provides an easy way for us to build off of our LinkedList class
through inheritance.
The MRUList class extends LinkedList via inheritance. By definition,
this means that MRUList will automatically inherit all fields and
methods from LinkedList. Recall that this means that you can
call any LinkedList method directly by using the super
keyword. For example,
super.contains(element)
You may also add methods that LinkedList does not have, or you may
override methods provided by LinkedList.
Implementing MRUList
Start by implementing a new contains()
method in MRUList that overrides LinkedList.contains()
.
Your new contains method should handle the functionality of the
MRUList: each time an element is accessed, it should be moved
to the head of the list. Do not simply copy/paste what you
have from LinkedList.contains()
and change it; use the
functionality provided by the LinkedList class to move the node. The contains()
method is the only one you need to override.
You should also add one additional (new) method to your MRUList class:
void printWords(int n)
: Prints the n most
recently used words in the list on a single line separated by
spaces. If there are less than n words in the MRUList, it prints all
elements.
Hint: your MRUList class should be much shorter than your
LinkedList class. However, it automatically inherits all of
the functionality of the LinkedList class.
Unit Testing MRUList
To test your MRUList, copy the following code into your main method:
MRUList mru = new MRUList();
mru.add("i
"
);
mru.add(
"
like
"
);
mru.add(
"
apples
"
);
mru.printWords(3);
mru.contains(
"
apples
"
);
mru.printWords(3);
Which should output:
i like apples
apples i like
Write a few other tests to make certain that your MRUList works.
The dictionary represents the complete set of correctly spelled
words. The Dictionary is essentially an array of 26 MRULists, each
of which holds words that start with a single letter. By using
MRULists, the dictionary facilitates fast lookup of repeated
words.
- Note that it does not matter whether the first letter of each
word is upper or lowercase. You may find
String.toLowerCase()
and String.toUpperCase()
useful.
- Strings that are null or start with anything other than a letter are considered invalid and so should trigger an
IllegalArgumentException
if passed into the insert()
method.
Your Dictionary class must conform to the following API:
public class Dictionary
-----------------------------------------------------------------------------------------
Dictionary() // default constructor that creates an empty dictionary
void insert(String word) // add the specified word to the dictionary if it does not already contain that word;
// throws an IllegalArgumentException if the word is invalid
boolean contains(String word) // tests whether the dictionary contains the specified word
void printMRU(int n) // outputs the n most recently used words in the dictionary for each letter of the alphabet
The printMRU()
method outputs the n most
recently used words in the dictionary for each letter of the
alphabet. If the list contains less than n words, then it outputs the entire list. The format of this output must be exactly as
specified here. This method should output 26 lines of
text, each of which starts with the letter of the alphabet
followed by a colon (:) followed by a list of the n words
separated by single spaces. Here is an example output
where n = 5 in the required format:
A:
apple ant Amy A
any
B: bat Bats bunny biscuit bow
C: canary cute cup cups can
dle
D: dog Dad dot duffel dumpster
...
If a particular
MRUList is empty, then the corresponding letter's list should not
be printed. Your output may contain trailing spaces at the end of each line.
If you are unable to get your MRUList working, you can substitute
LinkedList in its place in the Dictionary class for partial
credit. It will still be able to spell check, but the MRU
functionality will not be enabled.
Test your dictionary class in the same way you tested your
LinkedList and MRUList classes. In this case, it is up to you
to come up with the unit tests to place in the Dictionary's main
function.
The spell checking program will be executed by the following
command:
% java SpellChecker sample-article.txt
sample-dictionary.txt
where the first command line argument is the filename for the text
file to spell check and the second command line argument is the
filename for the dictionary file. Sample article and
dictionary files are provided in the files you downloaded
above. However, your program must be able to work with other
articles and dictionaries in the same format, so you are encouraged
to create other articles and dictionary files to test your program.
The dictionary file is simply a long list of words separated by
spaces and newlines. The words in the dictionary are
case-sensitive.
If the user specifies an incorrect number of command line arguments or
invalid arguments, the program should print an appropriate message
describing how to use the program and then gracefully exit.
SpellChecker Class
The SpellChecker class has only the main method and one or more
helper methods.
The main method should first read in the dictionary
file and construct a Dictionary object, populating it with words. Note that you do not need to remove punctuation from any of the dictionary words; your Dictionary object should contain the words exactly as they are given in the file. You may also assume that the Dictionary file will only contain valid words (i.e., Dictionary.insert()
should never throw an IllegalArgumentException
when using one of our provided dictionary files).
Next, it should read in the text file, and separate it into words for spell checking. We have already provided the function
String[] removePunctuationAndSplitIntoWords(String s)
in the SpellChecker skeleton for separating a long string s
into an array of words based on the spelling rules described above. You should then spell-check each of the resulting words in the text file.
If a
word is misspelled (i.e., it is not in the Dictionary), your program
should print out the word in the following format on a single line:
1.) someone spelled *htis* word wrong
Notice how the output shows two words before the misspelled word,
surrounds the misspelled words with asterisks (*), and prints two
words after the misspelled word to identify the context of the spelling mistake. In the case where the words
is not preceded or succeeded by two words, the program outputs as
many words before/after as are available. Each spelling mistake is numbered consecutively. Your program should only identify one misspelled word in each output line. Consecutive misspelled words require multiple output lines. For example
1.) someone spelled *htis* word wrogn
2.) htis word *wrogn*
would be two output lines.
Your program should ignore numbers, considering them valid words. To test whether a string is a valid number, we have provided the function isNumber(String s)
in SpellChecker.java.
For example, consider the text file that contains "The dog deos
tricks for darling dasy" And the dictionary file that contains "for
days darling dog does the The days cat hate love tricks". The
output of this program should be:
% java SpellChecker
article.txt dict.txt
Spell checking article.txt
using dict.txt
Misspelled Words:
1.) The
dog *deos* tricks for
2.) for darling
*dasy*
SpellChecker should also support an optional argument: the -d flag
(d for debug). If the String “-d” is passed in as a third argument,
it should also print the first 10 words in the MRUList for each
letter in the Dictionary (or all the words if there are fewer than
10 in the MRUList).
For the above example, running SpellChecker with the -d flag should
yield:
% java SpellChecker
article.txt dict.txt -d
Spell checking
article.txt using dict.txt
Misspelled
Words:
1.)
The dog *deos* tricks for
2.)
for darling *dasy*
Most
Recently Used Words:
C:
cat
D:
darling
dog days does
F:
for
H:
hate
L:
love
T:
tricks The the
Please note that your program must match this output format exactly,
including all spaces, blank lines, and descriptive lines (e.g.,
"Misspelled Words:", etc.).
In the case where there are no misspelled words, your program should output the following:
% java SpellChecker
article.txt dict.txt
Spell checking
article.txt using dict.txt
No Misspelled
Words
For the provided sample dictionary and article, your output should match the following:
% java SpellChecker sample-article.txt sample-dictionary.txt -d
Spell checking sample-article.txt using sample-dictionary.txt
Misspelled Words:
1.) The *feld* of mobile
2.) presented an *accelerometr* based custom
3.) could control *streamin* audoi Geiger
4.) control streamin *audoi* Geiger desinged
5.) audoi Geiger *desinged* a touch
6.) based interaction *pardaigm* with integrated
7.) Pure Data *PD* for Linux
8.) devices like *iPaqs* Using mobile
9.) Greg Schiemer *ni* his PocketGamelan
10.) ni his *PocketGamelan* instrument At
11.) instrument At *teh* smae time
12.) At teh *smae* time there
13.) there has *ben* an effort
14.) CaMus and *CaMus2* introduced systems
15.) introduced systems *thta* use onboard
16.) of mobile *pohnes* for tracking
17.) for tracking *viusal* references for
18.) for musical *intteraction*
Most Recently Used Words:
A: and allow an At a augmented audio at accelerometer
B: build by been based bodies
C: cameras CaMus commodity control could custom
D: devices Data device designed
E: effort enabled explored
F: for field
G: Greg Geiger
H: has his
I: introduced interactive instrument integrated interaction inspired informed in iPads
L: live like Linux
M: musical mobile made much music
O: of onboard on
P: phones performance pioneered portal Pure port PDA presented paradigm
R: references research
S: systems Schiemer synthesis sound screen several streaming same
T: tracking to there time the touch that Tanaka this The
U: use up Using using
V: visual
W: ways with work which
The LinkedList class we wrote above can contain only strings; however, it is often useful to implement data structures in a more generic way such that they can hold any type of data. Java provides Generics specifically for this purpose.
For extra credit, modify your LinkedList class to be based upon the interface GenericList.java instead of List.java. The methods in GenericList are exactly the same as in List, but they are based on a generic type T that is specified when the LinkedList is declared and initialized. This will enable your LinkedList to contain any type of object, not just strings. For example, you could then have implemented the Dictionary as a LinkedList of MRULists, instead of an array of MRULists. You will likely need to modify your other classes as well to enable them to use the generic form of the LinkedList.
If you do the extra credit, you will submit two versions of the assignment: the original (non-generic) version of all files as normal, and an extra.zip file containing new versions of all four .java files (LinkedList.java, MRUList.java, Dictionary.java, SpellChecker.java) modified to work with the generic LinkedList. Even if one or more of these files are unchanged from your non-generic version, you should still include it in extra.zip.