The Collection-Extensions library

Copyright (c) 1994  Carnegie Mellon University
All rights reserved.
The various modules in this library contain a few new types and operations which are compatible with the collection types specified in the Dylan Interim Reference Manual, but which are not part of that specification.

Users of Mindy 1.1 should be aware that the string-searching functions of the string-search module have been moved to the String-extensions library. The collection-extensions module "string-search" has for that reason been renamed "vector-search". The semantics of the moved functions have been changed, so string-search users are advised to read the String-extensions documentation.

It is to be expected that more collection types will appear in time, and they will likely be added to this library. This may also result in reorganizations which could force incompatible changes to the existing modules. We hope to minimize such imcompatibilities and, when forced to them, will include sufficient information to facilitate conversion of existing code.

The particular modules provided at present are:

heap
Provides <heap>, a popular data structure for priority queues. The semantics are basically those of a sorted sequence, with particularly efficient implementations of add!, first, and pop (i.e. "remove-first").
self-organizing-list
Provides "self-organizing lists". These explicit key collections provide roughly the semantics of hash tables, but use a probabilistic implementation which provides O(n) worst case performance but can provide very fast constant time access in the best case.
subseq
Provides "subsequences", which represent an aliased reference to some part of an existing sequence. These are analogous to slices (in Ada or Perl) or displaced arrays (in Common Lisp). Subsequences are themselves subclasses of <sequence>, and can therefore be passed any <collection> or <sequence> operation.
vector-search
Provides a small assortment of specialized operations for searching and modifying <vector>s. These operations are analogous to existing collection operations but provide keywords and efficiency improvements which are meaningful only within the more limited domain.

Heap

A heap is an implementation of the abstract data type "sorted list". A heap is a sorted sequence of items. Most likely the user will end up writing something like
define class <heap-item> (<object>);
  slot priority;
  slot data;
end class <heap-item>;
with appropriate methods defined for < and =. The user could, however, have simply a sorted list of integers, or have some item where the priority is an integral part of the item itself.

make on heaps supports the less-than: keyword, which supply the heap's comparison and defaults to <.

Heaps support all the usual sequence operations. The most useful ones:

     push(heap, item) => updated-heap
     pop(heap)        => smallest-element
     first(heap)      => smallest-element
     second(heap)     => second-smallest-element
     add!(heap, item) => updated-heap
     sort, sort!      => sorted-sequence
These are all "efficient" operations (defined below). As with <deque>, push is another name for add!, and does exactly the same thing except that push doesn't accept any keywords. sort and sort! return a sequence that's not a heap. Not necessarily efficient but useful anyhow:
     add-new!(heap, item, #key test:, efficient:) => updated-heap
     remove!(heap, item, #key test:, efficient:) => updated-heap
     member?(heap, item, #key test:, efficient:) => <boolean>
The efficient: keyword defaults to #f. If #t, it uses the random-iteration-protocol (which is considerably more efficient, but isn't really standard behavior, so it had to be optional). Conceivably most sequence methods could support such a keyword, but they don't yet.

The user can use element-setter or the iteration protocol to change an item in the heap, but changing the priority of an item is an error and Bad Things(tm) will happen. No error will be signaled. Both of these operations are very inefficient.

Heaps are not <stretchy-collection>s, although add! and pop can magically change the size of the heap.

Efficiency: Approximate running times of different operations are given below: (N is the size of the heap)

    first, first-setter                             O(1)
    second (but not second-setter)                  O(1)
    size                                            O(1)
    add!                                            O(lg N)
    push                                            O(lg N)
    pop(heap)                                       O(lg N)
    sort, sort!                                     O(N * lg N)
    forward-iteration-protocol          
                            setup:                  O(N)
                            next-state:             O(lg N)
                            current-element:        O(1)
                            current-element-setter: O(N)
    backwards-iteration-protocol
                            setup:                  O(N * lg N)
                            next-state:             O(1)
                            current-element:        O(1)
                            current-element-setter: O(N)
    random-iteration-protocol           
                            setup:                  O(1)
                            next-state:             O(1)
                            current-element:        O(1)
                            current-element-setter: O(1)
    element(heap, M)                                O(M*lg N + N)
    element-setter(value, heap, M)                  O(N + M*lg N + M)
element, element-setter on arbitrary keys use the forward-iteration-protocol (via the inherited methods), and have accordingly bad performance.

Self-Organizing-List

The "Self Organizing List" is a "poor man's hash table". More precisely, <self-organizing-list> is a subclass of <dictionary> for which addition and retrieval are both linear in the worst case, but which use a probabilistic strategy which yields nearly constant time in the best case. (A <dictionary> is a subclass of <mutable-explicit-key-collection> and <stretchy-collection>, and is described in mindy.doc)

Because they have a very low overhead, self-organizing lists may provide better peformance than hash tables in cases where references have a high degree of temporal locality. They may also be useful in situations where it is difficult to create a proper hash function.

Define new self-organizing-lists with

  make(<self-organizing-list>, test: test)
Test is expected to be an equality function. In particular, it is expected to satisfy the identity and transitivity requirements described in chapter 5. If not specified, test defaults to \==.

Subseq

<Subsequence> is a new subclass of <sequence>. A subsequence represents an aliased reference to some part of an existing sequence. Although they may be created by make (with required keywords source:, start: and end:) on one of the instantiable subclasses, they are more often created by calls of the form
  subsequence(sequence, start: 0, end: 3)
where start: and end: are optional keywords which default to the beginning and end, respectively, of the source sequence. No other new operations are defined for subsequences, since all necessary operations are inherited from <sequence>.

Because subsequences are aliased references into other sequences, several properties must be remembered:

  1. The contents of a subsequence are undefined after any destructive operation upon the source sequence.
  2. Destructive operations upon subsequences may be reflected in the source. The results of reverse! and sort! should be expected to affect the source sequence for vector subsequences.
If the source sequences are instances of <vector> or <string>, then the implementation will use subclasses of <subsequence> which are also subclasses of <vector> or <string>.

Efficiency notes:

  1. The implementation tries to insure that subsequences of subsequences can be accessed as efficiently as the original subsequence. (For example, the result of
      subsequence(subsequence(source, start: 1), start: 2)
    
    would produce a subsequence identical to the one produced by
      subsequence(source, start: 3)
    
  2. Vector subsequences, like all other vectors, implement constant time element access.
  3. Non-vector subsequences may take non-constant time to create, but will provide constant-time access to the first element. This should produce the best performance provided some element of the subsequence is accessed at least once.

Vector-Search

The "vector-search" module provides basic search and replace capabilities upon restricted subsets of <sequence> -- primarily <vector>. Exploiting the known properties of these types yields substantially better performance than can be achieved for sequences in general.

The following functions are supplied:

find-first-key vector predicate? #key start end failure => key
Find the index of first element (after start but before end) of a vector which satisfies the given predicate. If no matching element is found, return failure. The defaults for start, end and failure are, respectively, 0, size(vector), and #f. This function is like find-key, but accepts start: and end: rather than skip:.)

find-last-key vector predicate? #key start end failure => key
This is like find-first-key, but goes backward from end.