Programmer 7500

home *** CD-ROM | disk | FTP | other *** search

/ Programmer 7500 / MAX_PROGRAMMERS.iso / INFO / C / HASH.ZIP / HASH.DOC < prev

Wrap

Text File | 1988-10-07 | 8.5 KB | 196 lines

This text is from an article and program from The C Gazette, March 1988, Volume 2, No.4. Both were written by Howard Lowndes. --------------------------------------------------------------------------- USING HASH TABLES Hashing and hash tables need not be mysterious tools used only by the pro- fessional programmer. This article will discuss the construction and use of two different types of hash tables that share common algorithms. Hash tables are useful when an unknown number of data items needs to be stored and then repeatedly accessed. For example, compiler symbol tables are almost always implemented as hash tables. Hashing begins with the concept of a hash table, which is simply a regular in-memory table. However, entries are made into the table on the basis of a hashing function rather than in the more traditional sequential manner. A good hashing function attempts to analyze a table entry and find a way of assigning it a unique slot in the table. The algorithm must generate the same hashing key when applied to the same data, so that an item inserted into the hash table can subsequently be located. The function must be fast and must generate a reasonably good distribution of elements across the table, so that entries are made fairly evenly across the table. One simple method of hashing adds the ASCII values of the characters in the data item and divides the sum by the number of slots in the table. In a string n of length s for insertion into a table of t slots, this hashing function, h(n), is: Figure 1 h(n) = ( sum s[n-1] mod t(1->n)) In C: Figure 2 int hash (string, slots) char *s; int slots; { int i, key; for (i = 0, key = 0; s[i]; i++) key += s[i]; return (key % slots); } This function will work acceptably well as long as there are few strings with the same series of letters. For example, lora and oral would generate the same hashing key. The hashing of different data to the same key is termed a collision. It is clear from the preceding example that knowledge of the data to be hashed is required to design a function that will minimize collisions. Equally important is the construction of the hash table: it should be large (with many extra slots) and should have a prime number of slots. The latter requirement permits a greater scattering of hash values [Knuth, 1973] when the modulo function is applied. Further examples of hash functions can be analyzed, but without knowing the context in which they will be used, assesment of their effectiveness is difficult. Suffice it to say the algorithm must be fast and create a wide scattering of values. Empirical methods are best in the final selection of the hashing function. Once a key has been computed, the problem of storing the data arises. In the accompanying code, two different storage techniques have been implemented: a simple closed hash table with a fixed number of slots and a more complex open hash table, where each slot represents the head of a linked list. In the fixed-size hash table, an array of pointers to data items is allo- cated. If the FIXED model is selected in the accompanying code, the data item is struct entry. This structure contains only a pointer to a string although other data could, of course, be added. The hash table is then defined as an array of pointers to structures of type entry. To look up or store data in the array one first computes the hash value of the data and then reduces it modulo table size. Suppose you allocate a table with three slots choosing the hash function key = strlen(s); Initially the table looks like Figure III: (Figure III) key = 0: [empty] key = 1: [empty] key = 2: [empty] Now add the string "ab" to the table. Its key is 2, so the result is Figure IV: (Figure IV) key = 0: [empty] key = 1: [empty] key = 2: ["ab"] Now add the string "abcd". Its key is 4, but reduce this modulo 3 to get a final key of 1 (i.e., key computed as key_final = key_initial % table_size";). (The modulo operator % computes the remainder after the integer division of key_initial by table_size). Thus the table becomes Figure V: (Figure V) key = 0: [empty] key = 1: ["abcd"] key = 2: ["ab"] Now add the string "z". A problem appears immediately. The key for "z" is 1 and colides with the key for "abcd". All implementations of hash tables must deal with collision resolution. An easy solution is simply to start a forward search until an empty slot is found. Slot 2 is full; a loop back to the top reveals that slot 0 is free. Thus the table becomes Figure VI: (Figure VI) key = 0: ["z"] key = 1: ["abcd"] key = 2: ["ab"] This is a fairly typical resolution scheme for closed (i.e. fixed-size hash) tables. When searching the table, if the element at the hashed location is not the desired element, one need only read sequentially through the table until 1) the key is found ; 2) an empty slot is found; or 3) the emtire table has been read. The last two cases indicate an unsuccessful search. This technique has two drawbacks: it does not permit deletion of table entries (since the vacant slots cannot easily be disambiguated from unused slots in the sequential search through the table) and searching the table is costly if the table is close to full. For some applications, a fixed capacity table might work well but for occasions when an essentially unlimited table is needed, a slightly different technique can be used: the open hashing table. This is done by converting each slot from a pointer to a single data item to a pointer to a list of data items. As Figure VII shows, the data structure goes from (Figure VII) struct entry { char *data; }; to struct entry { char *data; struct entry *next; }; With this new model in mind, the original example can be reworked. First insert the string "ab" (which has a key of 2) as in Figure VIII: Figure VIII: key = 0: [empty] key = 1: [empty] key = 2: ["ab"] Next the string "abcd" (which has a key of 1) is added as in Figure IX: (Figure IX) key = 0: [empty] key = 1: ["abcd"] key = 2: ["ab"] Now insert "z" again. It's key is the same as the "abcd" key, but instead of searching for an empty slot, they can both be put into the same slot as in Figure X: (Figure X) key = 0: [empty] key = 1: ["z"] -> ["abcd"] key = 2: ["ab"] Thus, it should be clear that this type of table has no fixed limit to it's capacity. However, there are some potential drawbacks. First, the code required to handle the linked lists is slightly more complex than that needed to search a simple array and thus might take longer to execute. Second, the linked list associated with each slot could become quite a jumble unless even more code was added to keep it in some sort of order. The search of a slot's list of entries is effectively a random search unless, for example, the entries are kept in alphabetical order. Even then, the search for an entry beginning with "z" is often slower than for an entry starting with "a". An alternative approach (and the one that is implemented in the accompanying code) is to let the list become a jumble but always to move the most recently used item to the top of the list. This might make sense for a compiler, as variable names do tend to be used in clusters. For example, in parsing the code a[i] = b[i]; the variable i is used twice. How efficient is all of this? The efficiency of the program depends on a variety of factors. A bad hash function (or a set of data that is bad for a particular hash function) can certainly slow things. A table that is too small is also a problem. If 50 items are stored in a 51-slot fixed-size table, there are bound to be more collisions than if the same data were stored in a 100- slot table. A similar consideration is true for the linked list model. Even though collisions are not a problem, if the data is distributed throughout a larger table, the linked lists associated with each slot should, in general, be shorter. The typical measure of efficiency is to count the number of incorrect items to be traversed before finding the target item. The accompanying code has a sample driver that also includes a search counter. It is interesting to try different ranges of data and table sizes while keeping an eye on the search counter. END OF TEXT -----------------------------------------------------------------------------