Some of you have some questions about hashtable basics. I've decided to make a quick page explaining hashtables to hopefully address your questions. I'll put FAQ at the end.

Overview: What is a hashtable and why use it?

One of the most common things we do in computer science is map some key to some data associated with it. There are various data structures that can be used for such functionality, each with different performance behaviors. For example, you could just use a linked list, but then the search time is O(n)-- linear in the number of items in the list. This may sound nice, but if you have a million items to search, you will on average need to search half a million to find something.

Another option is a binary search tree, which would give you O(lg(n)) search time if the tree is balanced. If not, it could end up being linear in the worst case. With lg(n) search time, finding something in the million element structure still takes maybe 10 or 20 comparisons. Much better, but can we improve more?

If you can store your data in an array, then you can find the data in 1 operation- simply index the array and you have what you want. Unfortunately, this works only if your keys are numbers, within some range that would make the array a reasonable size.

A hashtable tries to get the best of both worlds. In general, we should be able to find the element we want in 1 or 2 operations, but we can use most any sort of keys. Basically a hashtable is an array of linked lists. The basic principle operation is this:
  1. Hash the key. Hashing is some function which turns a key into a number. There is no one right or wrong way to hash a given type of key- as long as two "equal" keys produces two equal hashes, and in general different keys produce different hashes.
  2. Take the number computed by hashing the key and "mod" it by the size of the array.
  3. Index the hashtable's array at the element computed in step 2. (These are generally refered to as "buckets").
  4. Perform whatever linkedlist style operation you need on the list you find there.

Some examples

Suppose we want to have a 4 element (for the sake of example, hashtables are generally much bigger than that) hashtable in which we map names of Futurama characters to their jobs and species. We start with an empty hashtable:

There are 4 buckets. Remember the hashtable is an array of pointers to hashtable nodes- which are basically like linked lists nodes. Starting out, each of these pointers is NULL. Now we add our first item. Suppose that the string "Bender" hashes to 84. Computing 84 % 4 gives us 0, so Bender goes in bucket 0. With nothing already there, the pointer in bucket 0 points to the node with information about Bender, and that nodes next is NULL:

Now suppose we add information about Leela, and that "Leela" hashes to 403. Computing 403 % 4 tells us that Leela belongs in Bucket 3, resulting in the following:

Ok great so far, now we go to add Fry. Suppose that "Fry" hashes to 107. Well 107 % 4 = 3. Fry belongs in bucket 3, but Leela is already there! This is what is called a collision- two different items need to go into the same bucket. Collisions are why we have an array of linked lists instead of just an array, we can chain the colliding elements in the same bucket-- making a list of all the items that should go there:

Looking up an item in the hashtable follows the same prinicple. If we want to lookup the information about Leela, we hash the key again (resulting in 403), mod it by 4 (telling us bucket 3), and then search the linked list that is that bucket. At each node, we do a string comparison of the node's key to the key we are looking for, and when we find a match, we return the corresponding data.

What is rehashing, and how do I do it?

Collisions in a hashtable are a bad thing. A few are tollerable, but you dont want to have a lot. The idea of a hashtable is not to have long linked lists that are a fourth or an eighth as big, but to has linked lists of length one, or maybe two. You generally want your hashtable to have about twice as many buckets as it has elements. Or put in technical terms, you want the load factor, the ratio of elements to buckets, to be less than 0.5. What do you do when the load factor gets too high? You rehash the table: make the array of buckets larger, and put each element in its new bucket. Rehashing is an expensive operation, so you want to do it infrequently (if you double the size of the table when you rehash, the amortized cost is relatively low). To see how this works, let us consider the above example, and what occurs if we rehash the table to have 8 buckets. As you can see, in this larger table, Leela and Fry no longer collide. If we add one more element, chances are good (5/8) that we still won't have a collision.

AQ (Asked Questions, maybe frequently, maybe not)