home *** CD-ROM | disk | FTP | other *** search
-
- This text is from an article and program from The C Gazette, March 1988,
- Volume 2, No.4. Both were written by Howard Lowndes.
- ---------------------------------------------------------------------------
-
- USING HASH TABLES
-
- Hashing and hash tables need not be mysterious tools used only by the pro-
- fessional programmer. This article will discuss the construction and use of
- two different types of hash tables that share common algorithms. Hash tables
- are useful when an unknown number of data items needs to be stored and then
- repeatedly accessed. For example, compiler symbol tables are almost always
- implemented as hash tables.
- Hashing begins with the concept of a hash table, which is simply a regular
- in-memory table. However, entries are made into the table on the basis of a
- hashing function rather than in the more traditional sequential manner. A good
- hashing function attempts to analyze a table entry and find a way of assigning
- it a unique slot in the table. The algorithm must generate the same hashing
- key when applied to the same data, so that an item inserted into the hash
- table can subsequently be located. The function must be fast and must generate
- a reasonably good distribution of elements across the table, so that entries
- are made fairly evenly across the table.
- One simple method of hashing adds the ASCII values of the characters in
- the data item and divides the sum by the number of slots in the table. In a
- string n of length s for insertion into a table of t slots, this hashing
- function, h(n), is:
-
- Figure 1
- h(n) = ( sum s[n-1] mod t(1->n))
-
- In C:
-
- Figure 2
- int hash (string, slots)
- char *s;
- int slots;
- {
- int i, key;
- for (i = 0, key = 0; s[i]; i++)
- key += s[i];
- return (key % slots);
- }
-
- This function will work acceptably well as long as there are few strings
- with the same series of letters. For example, lora and oral would generate the
- same hashing key. The hashing of different data to the same key is termed a
- collision. It is clear from the preceding example that knowledge of the data
- to be hashed is required to design a function that will minimize collisions.
- Equally important is the construction of the hash table: it should be
- large (with many extra slots) and should have a prime number of slots. The
- latter requirement permits a greater scattering of hash values [Knuth, 1973]
- when the modulo function is applied.
- Further examples of hash functions can be analyzed, but without knowing the
- context in which they will be used, assesment of their effectiveness is
- difficult. Suffice it to say the algorithm must be fast and create a wide
- scattering of values. Empirical methods are best in the final selection of the
- hashing function.
- Once a key has been computed, the problem of storing the data arises. In
- the accompanying code, two different storage techniques have been implemented:
- a simple closed hash table with a fixed number of slots and a more complex
- open hash table, where each slot represents the head of a linked list.
- In the fixed-size hash table, an array of pointers to data items is allo-
- cated. If the FIXED model is selected in the accompanying code, the data item
- is struct entry. This structure contains only a pointer to a string although
- other data could, of course, be added. The hash table is then defined as an
- array of pointers to structures of type entry. To look up or store data in the
- array one first computes the hash value of the data and then reduces it modulo
- table size.
- Suppose you allocate a table with three slots choosing the hash function
- key = strlen(s); Initially the table looks like Figure III:
-
- (Figure III)
-
- key = 0: [empty]
- key = 1: [empty]
- key = 2: [empty]
-
- Now add the string "ab" to the table. Its key is 2, so the result is Figure
- IV:
-
- (Figure IV)
-
- key = 0: [empty]
- key = 1: [empty]
- key = 2: ["ab"]
-
- Now add the string "abcd". Its key is 4, but reduce this modulo 3 to get
- a final key of 1 (i.e., key computed as key_final = key_initial %
- table_size";). (The modulo operator % computes the remainder after the integer
- division of key_initial by table_size). Thus the table becomes Figure V:
-
- (Figure V)
-
- key = 0: [empty]
- key = 1: ["abcd"]
- key = 2: ["ab"]
-
- Now add the string "z". A problem appears immediately. The key for "z" is
- 1 and colides with the key for "abcd". All implementations of hash tables must
- deal with collision resolution. An easy solution is simply to start a forward
- search until an empty slot is found. Slot 2 is full; a loop back to the top
- reveals that slot 0 is free. Thus the table becomes Figure VI:
-
- (Figure VI)
-
- key = 0: ["z"]
- key = 1: ["abcd"]
- key = 2: ["ab"]
-
- This is a fairly typical resolution scheme for closed (i.e. fixed-size
- hash) tables. When searching the table, if the element at the hashed location
- is not the desired element, one need only read sequentially through the table
- until 1) the key is found ; 2) an empty slot is found; or 3) the emtire table
- has been read. The last two cases indicate an unsuccessful search. This
- technique has two drawbacks: it does not permit deletion of table entries
- (since the vacant slots cannot easily be disambiguated from unused slots in
- the sequential search through the table) and searching the table is costly if
- the table is close to full.
- For some applications, a fixed capacity table might work well but for
- occasions when an essentially unlimited table is needed, a slightly different
- technique can be used: the open hashing table. This is done by converting each
- slot from a pointer to a single data item to a pointer to a list of data
- items. As Figure VII shows, the data structure goes from
-
- (Figure VII)
-
- struct entry {
- char *data;
- };
-
- to
-
- struct entry {
- char *data;
- struct entry *next;
- };
-
- With this new model in mind, the original example can be reworked. First
- insert the string "ab" (which has a key of 2) as in Figure VIII:
-
- Figure VIII:
-
- key = 0: [empty]
- key = 1: [empty]
- key = 2: ["ab"]
-
- Next the string "abcd" (which has a key of 1) is added as in Figure IX:
-
- (Figure IX)
-
- key = 0: [empty]
- key = 1: ["abcd"]
- key = 2: ["ab"]
-
- Now insert "z" again. It's key is the same as the "abcd" key, but instead
- of searching for an empty slot, they can both be put into the same slot as in
- Figure X:
-
- (Figure X)
-
- key = 0: [empty]
- key = 1: ["z"] -> ["abcd"]
- key = 2: ["ab"]
-
- Thus, it should be clear that this type of table has no fixed limit to
- it's capacity. However, there are some potential drawbacks. First, the code
- required to handle the linked lists is slightly more complex than that needed
- to search a simple array and thus might take longer to execute. Second, the
- linked list associated with each slot could become quite a jumble unless even
- more code was added to keep it in some sort of order. The search of a slot's
- list of entries is effectively a random search unless, for example, the
- entries are kept in alphabetical order. Even then, the search for an entry
- beginning with "z" is often slower than for an entry starting with "a". An
- alternative approach (and the one that is implemented in the accompanying
- code) is to let the list become a jumble but always to move the most recently
- used item to the top of the list. This might make sense for a compiler, as
- variable names do tend to be used in clusters. For example, in parsing the
- code a[i] = b[i]; the variable i is used twice.
- How efficient is all of this? The efficiency of the program depends on a
- variety of factors. A bad hash function (or a set of data that is bad for a
- particular hash function) can certainly slow things. A table that is too small
- is also a problem. If 50 items are stored in a 51-slot fixed-size table, there
- are bound to be more collisions than if the same data were stored in a 100-
- slot table. A similar consideration is true for the linked list model. Even
- though collisions are not a problem, if the data is distributed throughout a
- larger table, the linked lists associated with each slot should, in general,
- be shorter. The typical measure of efficiency is to count the number of
- incorrect items to be traversed before finding the target item. The
- accompanying code has a sample driver that also includes a search counter. It
- is interesting to try different ranges of data and table sizes while keeping
- an eye on the search counter.
-
- END OF TEXT
- -----------------------------------------------------------------------------
-
-