Some of you have some questions about hashtable basics. I've decided to make
a quick page explaining hashtables to hopefully address your questions. I'll
put FAQ at the end.
Overview: What is a hashtable and why use it?
One of the most common things we do in computer science is map some key to some
data associated with it. There are various data structures that can be used
for such functionality, each with different performance behaviors. For
example, you could just use a linked list, but then the search time is O(n)--
linear in the number of items in the list. This may sound nice, but if you
have a million items to search, you will on average need to search half a
million to find something.
Another option is a binary search tree, which would give you O(lg(n)) search
time if the tree is balanced. If not, it could end up being linear in the
worst case. With lg(n) search time, finding something in the million element
structure still takes maybe 10 or 20 comparisons. Much better, but can we
improve more?
If you can store your data in an array, then you can find the data in 1
operation- simply index the array and you have what you want. Unfortunately,
this works only if your keys are numbers, within some range that would make the
array a reasonable size.
A hashtable tries to get the best of both worlds. In general, we should
be able to find the element we want in 1 or 2 operations, but we can use most
any sort of keys. Basically a hashtable is an array of linked lists.
The basic principle operation is this:
- Hash the key. Hashing is some function which turns a key into a number.
There is no one right or wrong way to hash a given type of key- as long as two
"equal" keys produces two equal hashes, and in general different keys produce
different hashes.
- Take the number computed by hashing the key and "mod" it by the size of
the array.
- Index the hashtable's array at the element computed in step 2. (These are
generally refered to as "buckets").
- Perform whatever linkedlist style operation you need on the list you find
there.
Some examples
Suppose we want to have a 4 element (for the sake of example, hashtables are
generally much bigger than that) hashtable in which we map names of Futurama
characters to their jobs and species. We start with an empty hashtable:
There are 4 buckets. Remember the hashtable is an array of pointers to
hashtable nodes- which are basically like linked lists nodes. Starting out,
each of these pointers is NULL.
Now we add our first item. Suppose that the string "Bender" hashes
to 84. Computing 84 % 4 gives us 0, so Bender goes in bucket 0. With nothing
already there, the pointer in bucket 0 points to the node with information
about Bender, and that nodes next is NULL:
Now suppose we add information about Leela, and that "Leela" hashes
to 403. Computing 403 % 4 tells us that Leela belongs in Bucket 3, resulting
in the following:
Ok great so far, now we go to add Fry. Suppose that "Fry" hashes to
107. Well 107 % 4 = 3. Fry belongs in bucket 3, but Leela is already there!
This is what is called a collision- two different items need to go into
the same bucket. Collisions are why we have an array of linked lists instead
of just an array, we can chain the colliding elements in the same
bucket-- making a list of all the items that should go there:
Looking up an item in the hashtable follows the same prinicple. If we want to
lookup the information about Leela, we hash the key again (resulting in 403),
mod it by 4 (telling us bucket 3), and then search the linked list that is that
bucket. At each node, we do a string comparison of the node's key to the key
we are looking for, and when we find a match, we return the corresponding data.
What is rehashing, and how do I do it?
Collisions in a hashtable are a bad thing. A few are tollerable, but you dont
want to have a lot. The idea of a hashtable is not to have long linked lists
that are a fourth or an eighth as big, but to has linked lists of length one,
or maybe two. You generally want your hashtable to have about twice as many
buckets as it has elements. Or put in technical terms, you want the load
factor, the ratio of elements to buckets, to be less than 0.5. What do you
do when the load factor gets too high? You rehash the table: make the array
of buckets larger, and put each element in its new bucket. Rehashing is an
expensive operation, so you want to do it infrequently (if you double the size
of the table when you rehash, the amortized cost is relatively low).
To see how this works, let us consider the above example, and what occurs if we
rehash the table to have 8 buckets.
- Bender hashed to 84. 84 mod 8 = 4. Bender goes into bucket 4 of the
new table.
- Leela hashed to 403. 403 mod 8 = 3. Leela goes into bucket 3 of the new
table.
- Fry hashed to 107. 107 mod 8 = 7. Fry goes into bucket 7 of the new table.
As you can see, in this larger table, Leela and Fry no longer collide. If we
add one more element, chances are good (5/8) that we still won't have a
collision.
AQ (Asked Questions, maybe frequently, maybe not)
- Q: So a hash function maps keys to numbers?
A: Yes, you then need to mod that result by the table size to get
the bucket.
- Q: Don't unique keys imply no collisions?
A: No, there are two ways you can have collisions with unique
keys. (1) The keys can hash to the same number (consider "abc"
and "cba" where our hash function is a simple addition of the
numerical values in the string. They result in the same hash, no table
size will avoid this collision). (2) The keys hash to different numbers
that are congruent modulo the table size. This is the case in the example
above: 403 % 4 = 107 % 4 = 3. A larger table can avoid this type of collision.
- Q: Will table->buckets[index] take me to a unique hash_node or
to a list of hash_nodes.
A: Well it will take you to one node, but that node is the head of
the list. I.e. its next field may point to another node and so on. We
don't have the list structures encapsulated in a "List" class as
you might in Java- we just deal directly with the nodes here.
- Q: What if table->buckets[index] is unitialized?
A: When you create the table, you should make sure each bucket is
initalized to NULL- you dont ever want unitialized pointers sitting
around. In the other methods, you can check for NULL appropriately
(i.e. iterate until you get to NULL or whatever).
- Q: Why isn't realloc the easiest way to do the rehashing?
A: Remember when you rehash, you have to move some items to new
indicies. While you can accomplish this with realloc, its probably easier
to malloc a new buckets array, move everything to the right place, then
free the old one.
- Q: Should my add function add to the front or the back of the
list?
A: You could do either- but why add to the back? The commmon case
should be that there is nothing there anyways, and adding to the front is
an O(1) operation (and much simpler to write!).