Some of you have questions about hashtable basics, so I've put together a quick page explaining hashtables that will hopefully address them. I'll put a FAQ at the end.

# Overview: What is a hashtable and why use it?

One of the most common things we do in computer science is map some key to some data associated with it. There are various data structures that can be used for such functionality, each with different performance behaviors. For example, you could just use a linked list, but then the search time is O(n)-- linear in the number of items in the list. That may not sound bad, but if you have a million items, you will on average need to examine half a million of them to find something.

Another option is a binary search tree, which gives you O(lg(n)) search time if the tree is balanced (if not, it can degrade to linear in the worst case). With lg(n) search time, finding something in the million-element structure takes about 20 comparisons, since lg(1,000,000) is roughly 20. Much better, but can we improve more?

If you can store your data in an array, then you can find the data in 1 operation: simply index the array and you have what you want. Unfortunately, this works only if your keys are numbers within some range that keeps the array a reasonable size.

A hashtable tries to get the best of both worlds. In general, we should be able to find the element we want in 1 or 2 operations, yet we can use almost any sort of key. Basically, a hashtable is an array of linked lists. The basic operation is this:
1. Hash the key. Hashing is some function which turns a key into a number. There is no one right way to hash a given type of key-- as long as two "equal" keys produce two equal hashes, and in general different keys produce different hashes.
2. Take the number computed by hashing the key and "mod" it by the size of the array.
3. Index the hashtable's array at the position computed in step 2. (These positions are generally referred to as "buckets".)
4. Perform whatever linked-list operation you need on the list you find there.

## Some examples

Suppose we want a 4-bucket hashtable (just for the sake of example-- real hashtables are generally much bigger than that) in which we map names of Futurama characters to their jobs and species. We start with an empty hashtable:

There are 4 buckets. Remember the hashtable is an array of pointers to hashtable nodes-- which are basically like linked list nodes. Starting out, each of these pointers is NULL. Now we add our first item. Suppose that the string "Bender" hashes to 84. Computing 84 % 4 gives us 0, so Bender goes in bucket 0. With nothing already there, the pointer in bucket 0 points to the node with information about Bender, and that node's next pointer is NULL:

Now suppose we add information about Leela, and that "Leela" hashes to 403. Computing 403 % 4 tells us that Leela belongs in Bucket 3, resulting in the following:

Ok, great so far. Now we go to add Fry. Suppose that "Fry" hashes to 107. Well, 107 % 4 = 3. Fry belongs in bucket 3, but Leela is already there! This is what is called a collision: two different items need to go into the same bucket. Collisions are why we have an array of linked lists instead of just an array-- we can chain the colliding elements in the same bucket, making a list of all the items that should go there:

Looking up an item in the hashtable follows the same principle. If we want to look up the information about Leela, we hash the key again (resulting in 403), mod it by 4 (telling us bucket 3), and then search the linked list that is that bucket. At each node, we do a string comparison of the node's key to the key we are looking for, and when we find a match, we return the corresponding data.

### What is rehashing, and how do I do it?

Collisions in a hashtable are a bad thing. A few are tolerable, but you don't want to have a lot. The point of a hashtable is not to have a handful of long linked lists, each a fourth or an eighth the size of the whole data set, but to have linked lists of length one, or maybe two. You generally want your hashtable to have about twice as many buckets as it has elements. Put in technical terms, you want the load factor, the ratio of elements to buckets, to be less than 0.5. What do you do when the load factor gets too high? You rehash the table: make the array of buckets larger, and put each element in its new bucket. Rehashing is an expensive operation, so you want to do it infrequently (if you double the size of the table each time you rehash, the amortized cost stays low). To see how this works, let us consider the above example, and what occurs if we rehash the table to have 8 buckets.
• Bender hashed to 84. 84 mod 8 = 4. Bender goes into bucket 4 of the new table.
• Leela hashed to 403. 403 mod 8 = 3. Leela goes into bucket 3 of the new table.
• Fry hashed to 107. 107 mod 8 = 3. Fry still shares bucket 3 with Leela.
As it happens, this particular rehash doesn't separate Leela and Fry-- 403 and 107 are congruent mod 8 as well as mod 4, which is just bad luck with these hash values. Bender does get a bucket to himself, though, and with only 2 of the 8 buckets occupied, if we add one more element, chances are good (6/8) that it won't collide with anything.

## AQ (Asked Questions, maybe frequently, maybe not)

• Q: So a hash function maps keys to numbers?
A: Yes, you then need to mod that result by the table size to get the bucket.
• Q: Don't unique keys imply no collisions?
A: No, there are two ways you can have collisions with unique keys. (1) The keys can hash to the same number (consider "abc" and "cba" where our hash function is a simple addition of the numerical values in the string. They result in the same hash, no table size will avoid this collision). (2) The keys hash to different numbers that are congruent modulo the table size. This is the case in the example above: 403 % 4 = 107 % 4 = 3. A larger table can avoid this type of collision.
• Q: Will table->buckets[index] take me to a unique hash_node or to a list of hash_nodes?
A: Well, it will take you to one node, but that node is the head of the list. I.e. its next field may point to another node, and so on. We don't have the list structure encapsulated in a "List" class as you might in Java-- we just deal directly with the nodes here.
• Q: What if table->buckets[index] is uninitialized?
A: When you create the table, you should make sure each bucket is initialized to NULL-- you don't ever want uninitialized pointers sitting around. In the other methods, you can check for NULL appropriately (i.e. iterate until you get to NULL or whatever).
• Q: Why isn't realloc the easiest way to do the rehashing?
A: Remember that when you rehash, you have to move items to new indices. While you could accomplish this with realloc, it's probably easier to malloc a new buckets array, move everything to the right place, and then free the old one.
• Q: Should my add function add to the front or the back of the list?
A: You could do either-- but why add to the back? The common case should be that there is nothing there anyway, and adding to the front is an O(1) operation (and much simpler to write!).