Programmer 7500

home *** CD-ROM | disk | FTP | other *** search

/ Programmer 7500 / MAX_PROGRAMMERS.iso / PASCAL / MADTRB40.ZIP / MSTREES.DOC < prev next >

Wrap

Text File | 1986-05-05 | 12.8 KB | 310 lines

Implementation of a Generalized Median Split Tree by Chris Maeder CS 577 Introduction An important task in many computer applications today is the retrieval of information associated with an identifier (or "key") from a table of identifier-information pairs. Usually these applications involve sets of keys whose frequency occurrence have a highly skewed distribution, sometimes so skewed that only a small fraction of the keys are used in most of the information retrievals. Typical applications in which such key-information pairs arise include dictionary word look-up, thesaurus word look-up, and spelling checking in text processing applications (Sheil [1] cites that in the English language, for example, 127 of the most frequent words account for approximately 50 percent of an average text), telephone directory information, the recognition of operating system commands, and the compiler's symbol look-up table (where language keywords and other commonly used variables names will appear more frequently than other words). Most look-up techniques ignore the frequencies of these key-information pairs, thereby tending to scatter the more frequently occurring keys uniformly throughout the look-up table. This usually results in an unacceptably high rate of external memory references, especially if such table is quite large. Thus it should be apparent that it is preferable to separate out the more frequently occurring keys into a separate look-up table that can be quickly searched without external memory references. We present here an implementation of a median split tree that was first presented in Communications of the ACM [2], November, 1978, that optimizes the look-up of frequently occurring keys. Generalized Median Split Trees Median split tree (MST) algorithm provides us with a means for searching sets of identifiers that have highly skewed frequency distributions. An MST is essentially a binary search tree with two key entries per node--a node key entry and a split key entry. The node key entry is the most frequently accessed key in the subtree. The split key entry partitions the remaining keys in the subtree into two equal subsets, each of which are organized as an MST. The MST uses this lexical median of a node's descendants as its split value in order to force the search tree to be balanced, thereby achieving both a space efficient representation of the tree and high search speed. A query of the MST begins at the root of the MST, then following the tree's nodes in a binary search fashion, first comparing the most frequent key and then the split key to determine which subtree to proceed downward with. The MST frequently encounters the proper keys early in a query, and therefore has a lower cost per successful search than a typical balanced binary search. We can further generalize the above algorithm, as presented in ACM Transactions on Programming Languages and Systems [3], January, 1980, to allow each node to handle more than two key entries per node. A query into the MST then examines each entry in a node sequentially, stopping once the proper key has been found. If a match has not occurred after comparing the key entries in a node, the query then proceeds down the left or right subtree depending upon the comparison with the last key in the node. This last key in the node acts as the median split key, partitioning the remaining look-up keys in the subtree into equal left and right subtrees. We can thereby define an m-median split tree to be an MST with m entries per node (where m>2). Observe that the root of each subtree in an m-MST contains the m-1 most frequently accessed keys for that particular subtree, and the median split key which partitions the remainder of the look-up keys in the subtree. One should note that when m=1 we have essentially a balanced binary search tree, when m=2 we have a conventional median split tree, and when m=n we have essentially a sequential search. Average and worse case complexity for the construction and query algorithms for m-MST can be found in the referenced material. Construction of Median Split Trees We have developed two construction algorithms for the construction of the m-MST, but note that only the first construction algorithm was fully implemented. Both construction algorithms assume that the stored set of keys have been previously sorted according to their lexical value. In both construction algorithms, the file containing the look-up keys is read and the keys are placed into a doubly linked list, maintaining the keys in their sorted lexical order. The first m-MST construction algorithm builds the m-MST using only the lexical ordering of the given set of keys. The construction algorithm is applied recursively as it builds the tree, continually splitting the doubly linked list in half at the median split key linked list element. Each m-MST subtree's root node is filled with the m-1 most frequently occurring keys in that particular subtree. Note that the elements that belong to that particular subtree are the elements that remain on the passed linked list. The most frequent elements are selected and deleted from the linked list using a sequential search. Next, the median split key is selected from the remaining keys in the linked list. The split key selection is such that it fills the current m-MST subtree's child nodes as completely as is possible. This split key splits (or partitions) the remaining keys into the node's left and right subtrees, hence the linked list is broken in half at this split key. As previously stated, the process described above is applied recursively until the entire m-MST is built. To somewhat optimize the above construction algorithm it was decided to construct the m-MST using both orderings of the given set of keys. Again, one of the orderings is the actual lexical ordering of the keys as is supplied in the file of look-up keys. These keys were then ordered according to their frequency of occurrence using a quicksort on a duplicated linked list, placing the entries into a priority queue. The priority queue is then used to determine the most frequently occurring keys in the current subtree. Note that we could have also have implemented a heap instead of a priority queue. The linked list of look-up keys and the priority queue keys are linked together using a double pointers. This was done so that when the next most frequent key was removed from the priority queue, it is possible to immediately delete that same key from the doubly linked list. Problems occurred with this second m-MST construction algorithm when attempting to devise a means of splitting the doubly linked list at the split key. Observe that the pointers into the priority queue from the linked list are very intermixed. We did not come up with a clean, straight forward method of splitting the priority queue about the split key element. We therefore stopped with this second construction implementation after it was discovered that the linked list and priority queue splitting technique was not going to add much to the construction speed of the m-MST due to its inherit complexity. To optimize the look-up algorithm it was decided that the node data record structure should be that of an array of size [1..MAX_KEYS_PER_NODE] not including the split key. The split key is separate element in the node's data structure. With this node data structure we were then able to sort the node's array of look-up keys based upon their lexical value. By sorting the keys in the node array according to their lexical value allowed us to include in the search algorithm a binary search for searching within each node, rather than the sequential search as was proposed in []. Using the binary search within each node effectively increases the look-up speed of keys, especially when implemented with large node key arrays. Observe that a interpolation search could be implemented in place of the binary search if so desired, but that more expensive numeric calculations are involved. We report that it only took a couple of seconds to construct a m-MST with 256 keys on a IBM micro computer using the first construction algorithm. Since we believe that there will not much improvement in speed with the second construction algorithm, in actual practice it's probably better to implement only the first (and far more simpler) construction algorithm. Also note that the m-MST need only be built once, then being stored in some form externally when not needed. We included in our implementation a display representation of the m-MST once it has been constructed. This display demonstrates how the more frequent keys are positioned further up in the tree, and how the split keys partition the left and right subtrees. Selection of Median Split Keys Median split trees use a node's split key value as the median for partitioning (with respect to the lexical ordering) the remaining nodes into the left and right subtrees. The choice of split value is very important since it can result in a perfectly balanced tree, thereby allowing optimal information yield for each node comparison. Observe that it would be preferable to have each node filled to its capacity in order to keep the tree as completely full as is possible. Therefore it was necessary to develop a split key function that provided a lexical order index that allowed us to select a key closest to the actual median of the keys, and yet completely fill as many of the subsequent m-MST nodes as is possible. The median split key function originally proposed in [2] was altered to take in account the modification in the search algorithm we have implemented. The following function is used in order to determine a split key that gives the key closest to the median (based upon the lexical ordering of the remaining keys), and yet fills the child nodes completely. SplitValue=((MAX_KEYS_PER_NODE+1)*IntegerMultiplier where SplitValue is chosen to be closest to the actual median of the remaining keys and IntegerMultiplier is of the form 1,2,3,... One should observe from our sample run that only one node is not completely full in the entire m-MST. This is because of the way the above split key function splits the keys at the m-MST root node so that all remaining nodes, except possibly the right-most node, will be filled completely. Queries into the m-MST Queries into the m-MST are very straight forward. First the node array is scanned using a binary search. As was stated previously, this binary search is an optimization on the given search algorithm. If after the node array search a match has not been found, the split key is then checked to determine if it possibly matches the query. Failing that test a determination is made which of the node's two subtrees the query should continue to search. A look-up key comparison counter was included in the search implementation so that there could be some indication of the number of comparisons being performed for each m-MST query. Further Enhancements Considered for the m-MST Algorithm One possible method that was considered but not implemented was to map the m-MST keys into an array instead of storing the keys in a actual tree. This type of storage technique has previously been implemented to store binary search tree representations of data in languages that do not support pointers, such as Fortran and Cobol. Note that an already constructed m-MST can be very quickly read into an array. A problem that must be overcome in other array implementations of search trees is tree balance and completeness. This is especially difficult when dealing with dynamic data where the tree is continually growing and shrinking. An array implementation of a m-MST makes a great deal of sense since it is remains stable (i.e. is not dynamic), is complete, and is very well balanced. To access a particular node in the array a node access function similar to what is used in binary search tree array implementation could be used for the m-MST. References 1. Sheil, B. A. Median Split Trees: A Fast Look-up Technique for Frequently Occurring Keys. Comm. ACM 21, 11 (Nov. 1978), 948. 2. Sheil, B. A. Median Split Trees: A Fast Look-up Technique for Frequently Occurring Keys. Comm. ACM 21, 11 (Nov. 1978), 947-958. 3. Comer, D. A Note on Median Split Trees. ACM Trans on Prog. Lang. and Sys. 2, 1 (Jan. 1980), 129-133. 4. Comer, D. A Note on Median Split Trees. ACM Trans on Prog. Lang. and Sys. 2, 1 (Jan. 1980), 130. 5. Sheil, B. A. Median Split Trees: A Fast Look-up Technique for Frequently Occurring Keys. Comm. ACM 21, 11 (Nov. 1978), 950.