home *** CD-ROM | disk | FTP | other *** search
- Implementation of a Generalized
- Median Split Tree
-
- by Chris Maeder
- CS 577
-
-
-
-
- Introduction
-
- An important task in many computer applications today is the
- retrieval of information associated with an identifier (or "key")
- from a table of identifier-information pairs. Usually these
- applications involve sets of keys whose frequency occurrence have a
- highly skewed distribution, sometimes so skewed that only a small
- fraction of the keys are used in most of the information retrievals.
-
- Typical applications in which such key-information pairs arise
- include dictionary word look-up, thesaurus word look-up, and spelling
- checking in text processing applications (Sheil [1] cites that in the
- English language, for example, 127 of the most frequent words account
- for approximately 50 percent of an average text), telephone directory
- information, the recognition of operating system commands, and the
- compiler's symbol look-up table (where language keywords and other
- commonly used variables names will appear more frequently than other
- words).
-
- Most look-up techniques ignore the frequencies of these
- key-information pairs, thereby tending to scatter the more frequently
- occurring keys uniformly throughout the look-up table. This usually
- results in an unacceptably high rate of external memory references,
- especially if such table is quite large. Thus it should be apparent
- that it is preferable to separate out the more frequently occurring
- keys into a separate look-up table that can be quickly searched
- without external memory references.
-
- We present here an implementation of a median split tree that was
- first presented in Communications of the ACM [2], November, 1978,
- that optimizes the look-up of frequently occurring keys.
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- Generalized Median Split Trees
-
- Median split tree (MST) algorithm provides us with a means for
- searching sets of identifiers that have highly skewed frequency
- distributions.
-
- An MST is essentially a binary search tree with two key entries per
- node--a node key entry and a split key entry. The node key entry is
- the most frequently accessed key in the subtree. The split key entry
- partitions the remaining keys in the subtree into two equal subsets,
- each of which are organized as an MST. The MST uses this lexical
- median of a node's descendants as its split value in order to force
- the search tree to be balanced, thereby achieving both a space
- efficient representation of the tree and high search speed.
-
- A query of the MST begins at the root of the MST, then following the
- tree's nodes in a binary search fashion, first comparing the most
- frequent key and then the split key to determine which subtree to
- proceed downward with. The MST frequently encounters the proper keys
- early in a query, and therefore has a lower cost per successful
- search than a typical balanced binary search.
-
- We can further generalize the above algorithm, as presented in ACM
- Transactions on Programming Languages and Systems [3], January, 1980,
- to allow each node to handle more than two key entries per node. A
- query into the MST then examines each entry in a node sequentially,
- stopping once the proper key has been found. If a match has not
- occurred after comparing the key entries in a node, the query then
- proceeds down the left or right subtree depending upon the comparison
- with the last key in the node. This last key in the node acts as the
- median split key, partitioning the remaining look-up keys in the
- subtree into equal left and right subtrees.
-
- We can thereby define an m-median split tree to be an MST with m
- entries per node (where m>2). Observe that the root of each subtree
- in an m-MST contains the m-1 most frequently accessed keys for that
- particular subtree, and the median split key which partitions the
- remainder of the look-up keys in the subtree.
-
- One should note that when m=1 we have essentially a balanced binary
- search tree, when m=2 we have a conventional median split tree, and
- when m=n we have essentially a sequential search.
-
- Average and worse case complexity for the construction and query
- algorithms for m-MST can be found in the referenced material.
-
-
-
-
-
-
-
-
-
- Construction of Median Split Trees
-
- We have developed two construction algorithms for the construction of
- the m-MST, but note that only the first construction algorithm was
- fully implemented. Both construction algorithms assume that the
- stored set of keys have been previously sorted according to their
- lexical value.
-
- In both construction algorithms, the file containing the look-up keys
- is read and the keys are placed into a doubly linked list,
- maintaining the keys in their sorted lexical order.
-
- The first m-MST construction algorithm builds the m-MST using only
- the lexical ordering of the given set of keys. The construction
- algorithm is applied recursively as it builds the tree, continually
- splitting the doubly linked list in half at the median split key
- linked list element.
-
- Each m-MST subtree's root node is filled with the m-1 most frequently
- occurring keys in that particular subtree. Note that the elements
- that belong to that particular subtree are the elements that remain
- on the passed linked list. The most frequent elements are selected
- and deleted from the linked list using a sequential search.
-
- Next, the median split key is selected from the remaining keys in the
- linked list. The split key selection is such that it fills the
- current m-MST subtree's child nodes as completely as is possible.
- This split key splits (or partitions) the remaining keys into the
- node's left and right subtrees, hence the linked list is broken in
- half at this split key.
-
- As previously stated, the process described above is applied
- recursively until the entire m-MST is built.
-
- To somewhat optimize the above construction algorithm it was decided
- to construct the m-MST using both orderings of the given set of keys.
- Again, one of the orderings is the actual lexical ordering of the
- keys as is supplied in the file of look-up keys. These keys were
- then ordered according to their frequency of occurrence using a
- quicksort on a duplicated linked list, placing the entries into a
- priority queue.
-
- The priority queue is then used to determine the most frequently
- occurring keys in the current subtree. Note that we could have also
- have implemented a heap instead of a priority queue.
-
- The linked list of look-up keys and the priority queue keys are
- linked together using a double pointers. This was done so that when
- the next most frequent key was removed from the priority queue, it is
- possible to immediately delete that same key from the doubly linked
- list.
-
- Problems occurred with this second m-MST construction algorithm when
- attempting to devise a means of splitting the doubly linked list at
- the split key. Observe that the pointers into the priority queue
- from the linked list are very intermixed. We did not come up with a
- clean, straight forward method of splitting the priority queue about
- the split key element. We therefore stopped with this second
- construction implementation after it was discovered that the linked
- list and priority queue splitting technique was not going to add much
- to the construction speed of the m-MST due to its inherit complexity.
-
- To optimize the look-up algorithm it was decided that the node data
- record structure should be that of an array of size
- [1..MAX_KEYS_PER_NODE] not including the split key. The split key is
- separate element in the node's data structure. With this node data
- structure we were then able to sort the node's array of look-up keys
- based upon their lexical value.
-
- By sorting the keys in the node array according to their lexical
- value allowed us to include in the search algorithm a binary search
- for searching within each node, rather than the sequential search as
- was proposed in []. Using the binary search within each node
- effectively increases the look-up speed of keys, especially when
- implemented with large node key arrays. Observe that a interpolation
- search could be implemented in place of the binary search if so
- desired, but that more expensive numeric calculations are involved.
-
- We report that it only took a couple of seconds to construct a m-MST
- with 256 keys on a IBM micro computer using the first construction
- algorithm. Since we believe that there will not much improvement in
- speed with the second construction algorithm, in actual practice it's
- probably better to implement only the first (and far more simpler)
- construction algorithm. Also note that the m-MST need only be built
- once, then being stored in some form externally when not needed.
-
- We included in our implementation a display representation of the
- m-MST once it has been constructed. This display demonstrates how
- the more frequent keys are positioned further up in the tree, and how
- the split keys partition the left and right subtrees.
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- Selection of Median Split Keys
-
- Median split trees use a node's split key value as the median for
- partitioning (with respect to the lexical ordering) the remaining
- nodes into the left and right subtrees. The choice of split value is
- very important since it can result in a perfectly balanced tree,
- thereby allowing optimal information yield for each node comparison.
-
- Observe that it would be preferable to have each node filled to its
- capacity in order to keep the tree as completely full as is possible.
- Therefore it was necessary to develop a split key function that
- provided a lexical order index that allowed us to select a key
- closest to the actual median of the keys, and yet completely fill as
- many of the subsequent m-MST nodes as is possible.
-
- The median split key function originally proposed in [2] was altered
- to take in account the modification in the search algorithm we have
- implemented. The following function is used in order to determine a
- split key that gives the key closest to the median (based upon the
- lexical ordering of the remaining keys), and yet fills the child
- nodes completely.
-
- SplitValue=((MAX_KEYS_PER_NODE+1)*IntegerMultiplier
-
- where SplitValue is chosen to be closest to the actual median of the
- remaining keys and IntegerMultiplier is of the form 1,2,3,...
-
- One should observe from our sample run that only one node is not
- completely full in the entire m-MST. This is because of the way the
- above split key function splits the keys at the m-MST root node so
- that all remaining nodes, except possibly the right-most node, will
- be filled completely.
-
-
-
-
- Queries into the m-MST
-
- Queries into the m-MST are very straight forward. First the node
- array is scanned using a binary search. As was stated previously,
- this binary search is an optimization on the given search algorithm.
- If after the node array search a match has not been found, the split
- key is then checked to determine if it possibly matches the query.
- Failing that test a determination is made which of the node's two
- subtrees the query should continue to search.
-
- A look-up key comparison counter was included in the search
- implementation so that there could be some indication of the number
- of comparisons being performed for each m-MST query.
-
-
-
-
-
- Further Enhancements Considered for the m-MST Algorithm
-
- One possible method that was considered but not implemented was to
- map the m-MST keys into an array instead of storing the keys in a
- actual tree. This type of storage technique has previously been
- implemented to store binary search tree representations of data in
- languages that do not support pointers, such as Fortran and Cobol.
- Note that an already constructed m-MST can be very quickly read into
- an array.
-
- A problem that must be overcome in other array implementations of
- search trees is tree balance and completeness. This is especially
- difficult when dealing with dynamic data where the tree is
- continually growing and shrinking. An array implementation of a
- m-MST makes a great deal of sense since it is remains stable (i.e. is
- not dynamic), is complete, and is very well balanced.
-
- To access a particular node in the array a node access function
- similar to what is used in binary search tree array implementation
- could be used for the m-MST.
-
-
- References
-
- 1. Sheil, B. A. Median Split Trees: A Fast Look-up Technique for
- Frequently Occurring Keys. Comm. ACM 21, 11 (Nov. 1978), 948.
-
- 2. Sheil, B. A. Median Split Trees: A Fast Look-up Technique for
- Frequently Occurring Keys. Comm. ACM 21, 11 (Nov. 1978), 947-958.
-
- 3. Comer, D. A Note on Median Split Trees. ACM Trans on Prog. Lang.
- and Sys. 2, 1 (Jan. 1980), 129-133.
-
- 4. Comer, D. A Note on Median Split Trees. ACM Trans on Prog. Lang.
- and Sys. 2, 1 (Jan. 1980), 130.
-
- 5. Sheil, B. A. Median Split Trees: A Fast Look-up Technique for
- Frequently Occurring Keys. Comm. ACM 21, 11 (Nov. 1978), 950.
-