With little doubt each and every one of you arose this fine spring morning just as excited about hashing as myself. Like all interesting algorithms, the proud tradition of identifying effective hashing functions requires a mathematician's razor mind and an artist's subtle touch. If you know of anyone who has either of these attributes don't hesitate to hold an image of them in your mind as you scan the rest of this article.
Quick Hash Function
Within the hallowed pages of the Algorithm Design Manual, Steve Skiena* describes a straightforward Hashing function between strings S of length |S|, with alphabet A, and a length m hash table.
- First identify a set of integers which contains all possible allowed key values. Let's assume your alphabet is two characters { "a", "b" } and your strings all have length 4. Example keys would be "aaaa", "aaab", "bbbb" or any combination of letters. The total number of possibilities are A^|S|, 2^4 or 16.
- Next construct integer values from these strings by applying a base-A numeric conversion. The combinations yield corresponding integer values between 0-15, if we replace the letters with binary values "a"=0, "b"=1 and convert to integers. Note how the keys form a headless binary tree with left and right child nodes taking on the values "a" or "b". This method may be extended to as large an alphabet as is desired, ie "c"=2 (base-3), "d"=3 (base-4),...,"j"=9 (base 10), etc.
- We're restricted to a total number of slots m in our table, so each of these integers must map back into values 0 to m-1. The mod function takes any integer and yields its remainder in the allowed range. The final index is I mod m.
Table Collisions
The worst hashing function will generate the same index for all keys. Even for the best functions the locations that result from the hash are not guaranteed to avoid collisions. Collisions are common even for uniform distributions, see the Birthday Problem for details.
Recommended solutions to collisions include doubly linked lists, or open addressing where the former is more popular but also consumes more memory for pointers.
For hash buckets with doubly linked lists when writing we append the new value to the list, when reading scan through the list for the desires key.
Open addressing writes to the next available memory location at a hashed index. Reading requires scanning through adjacent memory locations, and deletions require reinserting any broken runs of nodes (ouch).
I haven't read mention of it but busy hash tables could have trees implemented at each bucket. If a hash table becomes congested (which data structure), you're likely better off using a tree implementation. When n is small you're better off with a list.
References:
- Brushing Up on Comp Science, Data Structures
- The Algorithm Design Manual, a valued resource. I highly recommend buying a few thousand digital copies for your university or corporation via this link :)
- The Birthday Problem
- For more complex hashing functions, see the end of Brushing up on Computer Science, Algorithms for presentations which discuss MD5 and Sha1 hash functions.
Notes:
*= I'm reminded of a conversation between Professor Skiena and a student over a decade ago. I took an algorithm overview course a few years after I graduated when taking additional grad courses was a good excuse to slip out of the office for a couple of hours. Now I recommend reading and working with other peers on the cheap.
Myself and two colleagues who also took the course were waiting outside Steve's office to discuss our final project. A foreign student was ahead of us in a meeting, and we overhead Steve going into detail as he described the Thanksgiving Holiday ritual: "A huge turkey...stuffing...yams..." From the conversation, I could imagine the student salivating over the description of the bountiful meal, and we were inclined to believe Steve may extend an invite to the student to introduce him to the holiday. We were surprised as the meeting wrapped up with Steve mentioning "it's really quite a fun event, you should.. you should get one of your friends to invite you over". Steve's generous with his dynamic programming knowledge and algorithmic mastery, but bring your own Turkey dinner ;).