home *** CD-ROM | disk | FTP | other *** search
- MSORTP - Merge Sort Unit for Protected and Real Mode
- Kim Kokkonen
- TurboPower Software
- January 1993
-
- Introduction
- ---------------------------------------------------------------------
- MSORTP is a unit for sorting an unlimited number of items in real or
- protected mode applications. It can be used with TPW 1.0, TPW 1.5, or
- BP7 for DOS real mode, DOS protected mode, or Windows targets. Because
- it takes advantage of PChar strings, MSORTP cannot be used with
- earlier Turbo Pascal compilers.
-
- Although MSORTP isn't source-compatible with Borland's old LSORT unit
- from the Database Toolbox, its interface is quite similar. Hence, it
- is a good candidate to replace LSORT in programs being upgraded to new
- platforms.
-
- MSORTP takes advantage of all available heap space, including the
- extended memory heap when running in protected mode. When it runs out
- of heap space, it uses a merge sort algorithm that stores temporary
- intermediate files on disk.
-
- MSORTP is an excerpt from TurboPower Software's commercial product
- B-Tree Filer.
-
-
- Using MSORTP
- ---------------------------------------------------------------------
- The following small program shows how to implement a text file sort
- filter using MSORTP.
-
- {$R-,S-,X+,I+}
- program TextSort;
- {-Protected mode text file sort filter}
- uses
- Strings, MSortP;
- var
- LineBuf : array[0..1024] of Char;
- MaxHeap : LongInt;
- MinHeap : LongInt;
- Status : Word;
-
- procedure SendToSortEngine; far;
- var
- P : PChar;
- begin
- while not Eof do begin
- {Read the next line from the input file as a PChar}
- ReadLn(LineBuf);
- {Don't sort empty lines to avoid StrNew quirk}
- if StrLen(LineBuf) <> 0 then begin
- {Copy string to heap}
- P := StrNew(LineBuf);
- {Store the pointer in the sort engine}
- if not PutElement(P) then
- Exit;
- end;
- end;
- end;
-
- function Less(var X, Y) : Boolean; far;
- begin
- Less := (StrComp(PChar(X), PChar(Y)) < 0);
- end;
-
- procedure GetFromSortEngine; far;
- var
- P : PChar;
- begin
- {Return each pointer and write the associated string to output}
- while GetElement(P) do
- WriteLn(P);
- end;
-
- begin
- {Determine how much heap space MergeSort can use}
- MaxHeap := MemAvail div 10;
- MinHeap := 4*MinimumHeapToUse(SizeOf(PChar));
- if MaxHeap < MinHeap then
- MaxHeap := MinHeap;
-
- {Perform the sort}
- Status := MergeSort(MaxHeap, SizeOf(PChar),
- SendToSortEngine, Less, GetFromSortEngine,
- DefaultMergeName);
- if Status <> 0 then
- WriteLn('Sort failed, Status ', Status);
- end.
-
- This program provides three routines for the use of MSORTP.
- SendToSortEngine reads lines of text from the standard input device,
- allocates space for each line on the heap, and passes the string
- pointer to MSORTP by calling PutElement. Note that only the pointer is
- passed to MSORTP, since there is no need for MSORTP to keep a second
- copy of a string that is already stored in memory.
-
- The Less function compares pairs of elements for MSORTP. This routine
- simply uses the StrComp function from Borland's STRINGS unit to
- perform a case-sensitive comparison of two PChars. Note the typecast
- from the untyped VAR parameters Less receives to the PChar data type
- stored in the sort engine.
-
- Finally, GetFromSortEngine retrieves the elements in sorted order. In
- this program, GetElement return a pointer to the actual string data.
- When the $X+ compiler directive is enabled, as it is here, the WriteLn
- statement writes the string pointed to by each pointer P.
-
- The main block of the program calls the MergeSort function to perform
- the sort. First it must decide how to partition the heap between the
- needs of string storage and the needs of MSORTP. MSORTP is given about
- 10% of the heap and the rest is reserved for text strings. Since the
- sort engine is storing 4 byte elements, this is a balanced split if
- the average string length is 40 bytes. Also note that the
- MinimumHeapToUse function is called to guarantee that the sort engine
- gets at least 4 times as much heap space as it requires at a minimum.
- In this way merging is minimized and the sort will always succeed
- unless there is insufficient disk space or insufficient heap space to
- hold the input strings.
-
- The second parameter passed to MergeSort is the size of the elements
- to be sorted, in this case 4 bytes, the size of a PChar. The next
- three parameters are the routines used to pass elements to the sort
- engine, to compare elements, and to get sorted elements back from the
- engine. The final parameter is a routine used to name each merge file.
- In this program we use the default routine DefaultMergeName provided
- by MSORTP, which stores the merge files in the current directory.
-
- MergeSort returns a Word result indicating the status of the sort.
- The result is zero to indicate success, or a non-zero result to
- explain the reason for a failure.
-
- TMSORTP.PAS, also in this archive, is a more extensive program for
- testing the accuracy and speed of the MSORTP unit.
-
-
- Interfaced Identifiers of MSORTP
- ---------------------------------------------------------------------
- The following sections describe the constants, types, procedures, and
- functions interfaced by the MSORTP unit.
-
- Constants
- =========
- MaxSelectors = 256;
- Maximum number of selectors allocated. MergeSort allocates memory
- buffers using segments of up to 65535 bytes, but the size is always
- an even power of two times the sort record length. As a result, the
- default value of MaxSelectors allows MergeSort to directly access
- 8-16 megabytes of extended memory. If 8 megabytes is insufficient,
- you may want to increase MaxSelectors to 512. If 8 megabytes is more
- than enough, you may want to decrease MaxSelectors. The data segment
- space MSORTP uses for selectors (2*MaxSelectors bytes) dominates the
- data segment usage of the unit.
-
- MedianThreshold = 16;
- The partition length below which the in-memory quick sort simply
- uses the middle element of the partition for the pivot element. For
- partition lengths at least this size, MergeSort uses the median of
- the left, middle, and right elements for the pivot. The
- median-of-three algorithm significantly improves the speed of large
- in-memory sorts for input that is already sorted, reverse sorted, or
- nearly sorted.
-
- MergeOrder = 5;
- Input files used at a time during merge. You can set MergeOrder to
- any value in the range from 2 to 10 inclusive. However,
- experimentation indicates that the default value of 5 is optimal
- under a wide range of conditions.
-
- MinRecsPerRun = 4;
- Minimum number of records that must fit in memory during a sort. If
- fewer records fit in memory, MergeInfo and MergeSort return an error
- code. If even MinRecsPerRun records fit in memory, MergeSort
- performs merging to complete the sort. Increase this constant if you
- prefer that the sort fail instead of doing an excessive amount of
- merging.
-
- SwapThreshold = 64;
- The record size below which MergeSort swaps complete data records.
- For records SwapThreshold bytes or larger, MergeSort swaps pointers
- to records instead of the records themselves. Swapping pointers is
- the faster approach for large records sorted in memory, but this
- approach has a memory overhead of 4 bytes per record plus a buffer
- segment that must be used for a run output buffer. The default of 64
- was chosen to keep the typical overhead below 10%. Reducing the
- default also provides no significant improvement in performance.
-
- Types
- =====
- ElementIOProc = procedure;
- Specifies the type of the routine passed as the SendToSortEngine
- and GetFromSortEngine parameters to MergeSort. These routines must
- be declared FAR and must have no parameters.
-
- ElementCompareFunc = function (var X, Y) : Boolean;
- Specifies the type of the routine passed as the Less parameter to
- MergeSort. MergeSort calls this function to compare pairs of
- elements as needed. It must be declared FAR and must have the form
- shown here. It should return True if and only if element X is
- strictly less than element Y. You should typecast the untyped
- parameters to treat them as elements of the type you are sorting.
-
- MergeNameFunc = function (Dest : PChar; MergeNum : Word) : PChar;
- Specifies the type of the routine passed as the MergeName
- parameter to MergeSort. MergeSort calls this function to obtain
- the name of each merge file when needed. MSORTP interfaces a
- default MergeNameFunc that can be used in many circumstances. This
- routine, DefaultMergeName, returns a name of the form
- 'SORnnnnn.TMP', where nnnnn is the merge file number given by
- MergeNum. A MergeNameFunc must return a unique file name for each
- value of MergeNum. Typically you specify a non-default MergeNameFunc
- if you want the merge files to be located on a different drive or
- directory than the current one.
-
- MergeInfoRec =
- record
- SortStatus : Word; {Predicted status of sort, assuming disk ok}
- MergeFiles : Word; {Total number of merge files created}
- MergeHandles : Word; {Maximum file handles used}
- MergePhases : Word; {Number of merge phases}
- MaxDiskSpace : LongInt; {Maximum peak disk space used}
- HeapUsed : LongInt; {Heap space actually used}
- SelectorCount: Word; {Number of selectors allocated}
- RecsPerSel : Word; {Records stored in each selector}
- end;
- Describes the structure returned by the MergeInfo function. This
- function predicts the status of a sort and its resource usage
- given certain information about it. See MergeInfo for more
- information.
-
- Procedures and Functions
- ========================
- Declaration
- procedure AbortSort;
- Purpose
- Halts a sort prematurely.
- Description
- Call this routine from your Less, SendToSortEngine, or
- GetFromSortEngine routines to abort a sort. The Less function must
- always return False if it calls AbortSort. When you call
- AbortSort, MergeSort always returns a status of 1.
- See Also
- MergeSort
-
- Declaration
- function DefaultMergeName(Dest : PChar; MergeNum : Word) : PChar;
- Purpose
- Returns a default name for each merge file.
- Description
- The default merge name is SORnnnnn.TMP, where nnnnn corresponds to
- the MergeNum. For example, for MergeNum = 1, DefaultMergeName
- returns 'SOR1.TMP'. For MergeNum = 999, DefaultMergeName returns
- 'SOR999.TMP'. For MergeNum = 65535 (the largest possible value),
- DefaultMergeName returns 'SOR65535.TMP'.
-
- DefaultMergeName stores the null-terminated string in the buffer
- pointed to by Dest, and it returns Dest as the function result.
- MergeSort provides an 80 character buffer each time it calls the
- merge name function.
-
- You can write your own merge name function that puts the merge files
- somewhere besides the current directory. Follow the example of
- DefaultMergeName and pass your function to MergeSort.
- See Also
- MergeSort
-
- Declaration
- function GetElement(var X) : Boolean;
- Purpose
- Returns the next element in sorted order.
- Description
- Call this routine repeatedly in your GetFromSortEngine routine to
- retrieve the sorted elements. GetElement returns True until there
- are no more sorted elements to retrieve. GetElement copies the next
- element into the variable you pass as the parameter X. Be sure that
- this variable is large enough to hold an entire record; otherwise
- GetElement will overwrite memory. When GetElement returns False, the
- parameter X is not initialized.
- See Also
- PutElement
-
- Declaration
- procedure MergeInfo(MaxHeapToUse : LongInt;
- RecLen : Word;
- NumRecs : LongInt;
- var MI : MergeInfoRec);
- Purpose
- Predicts status and resource usage of a merge sort.
- Description
- MaxHeapToUse is the maximum number of bytes of heap space the sort
- should use. MergeInfo actually allocates heap space up to this
- amount; if there is less heap space available, the MergeInfo results
- apply only to the available heap space.
-
- RecLen is the size in bytes of each record to be sorted. NumRecs is
- the total number of records to be sorted.
-
- MI returns information about the sort. MI.SortStatus is zero if the
- sort is predicted to succeed. MergeInfo assumes that there is
- sufficient disk space and that no disk errors will occur.
-
- MI.MergeFiles is the total number of merge files that will be
- created.
-
- MI.MergeHandles is the total number of file handles used. This will
- always be in the range of 0 to MergeOrder+1 inclusive.
-
- MI.MergePhases is the number of merge phases. A value of 0 indicates
- that the sort can be done completely in memory. 1 indicates that
- MergeOrder or fewer merge files are created and will be merged in
- one pass. Higher values mean that multiple merge passes are
- required, with the output from earlier passes feeding the input of
- later passes.
-
- MI.MaxDiskSpace is the peak disk space required. Since merge files
- are deleted as soon as they are used, the disk space used in a merge
- sort grows and shrinks. All merge files are deleted when the sort is
- complete. MaxDiskSpace is always smaller than 2*RecLen*NumRecs. The
- analysis that MergeInfo performs to determine MaxDiskSpace requires
- that MI.MergeFiles be smaller than 16384, and that 4*MI.MergeFiles
- bytes of heap space be free when MergeInfo is called. If these
- requirements aren't met, MergeInfo returns -1 for MI.MaxDiskSpace.
-
- MI.HeapUsed is the number of bytes of heap space the sort will
- actually use. This is always less than or equal to MaxHeapToUse.
-
- MI.SelectorCount is the number of segments of heap space the sort
- will allocate.
-
- MI.RecsPerSel is the number of records stored in each segment. This
- is always a power of two.
- See Also
- MinimumHeapToUse OptimumHeapToUse
-
- Declaration
- function MergeSort(MaxHeapToUse : LongInt;
- RecLen : Word;
- SendToSortEngine : ElementIOProc;
- Less : ElementCompareFunc;
- GetFromSortEngine : ElementIOProc;
- MergeName : MergeNameFunc) : Word;
- Purpose
- Sorts a group of elements.
- Description
- MaxHeapToUse specifies the maximum number of bytes of heap space the
- sort will use. It is not an error for MaxHeapToUse to exceed
- MemAvail; MergeSort will use whatever is available. If you know in
- advance how many records will be sorted, it is a good idea to pass
- the result returned by OptimumHeapToUse for this parameter.
-
- RecLen is the number of bytes in each record to be sorted.
-
- SendToSortEngine is a procedure you write that passes the sort
- elements into the sort engine. Your procedure calls PutElement for
- each element to be sorted. If PutElement returns False, an error has
- occurred (usually out of disk space) and you should exit your
- SendToSortEngine procedure.
-
- Less is a function you write that compares pairs of elements. This
- function must return True if and only if element "X" (the first
- parameter) is strictly less than element "Y" (the second parameter).
-
- GetFromSortEngine is a procedure you write that retrieves the sorted
- elements from the sort engine. Your procedure calls GetElement to
- retrieve each element in sorted order. When GetElement returns
- False, all elements have been retrieved or an error has occurred.
-
- MergeName is a function that provides a name for each merge file.
- You can often pass DefaultMergeName for this parameter.
- DefaultMergeName puts all of the merge files in the current
- directory at the time of the merge sort.
-
- MergeSort returns a status code in its function result. It can
- return the following values:
-
- 0 success
- 1 user abort (AbortSort was called)
- 8 insufficient memory to sort
- 106 invalid input parameter
- (RecLen zero, MaxHeapToUse too small)
- 204 invalid pointer returned by GlobalLock, or
- SelectorInc <> 8
- 213 no elements available to sort
- 214 more than 65535 merge files
- else DOS or Turbo Pascal error code
-
- The most common Turbo Pascal error code returned by MergeSort is
- 101, which means that there was insufficient disk space to store the
- merge files.
-
- If you want to predict whether a sort can succeed before actually
- attempting it, use the MergeInfo procedure.
- See Also
- DefaultMergeName GetElement MergeInfo PutElement
-
- Declaration
- function MinimumHeapToUse(RecLen : Word) : LongInt;
- Purpose
- Returns the minimum heap space that allows MergeSort to succeed.
- Description
- Given the size of each record (RecLen), MinimumHeapToUse returns the
- smallest amount of heap space that will allow a sort to succeed. You
- can pass this value to MergeSort to sort a group of elements using
- the smallest amount of memory. Note that the value returned by
- MinimumHeapToUse is often very small and can cause a significant
- amount of merging, so it's generally better to multiply the result
- by a reasonable factor (say 2-4) even if you want to minimize heap
- usage of a sort.
-
- Because of a quirk in the BP7 DPMI heap manager, MinimumHeapToUse
- adds 2048 bytes to the actual computed heap requirement. This is
- important because sometimes the DPMI heap manager consumes about
- 2000 bytes of heap space for its internal use.
- See Also
- OptimumHeapToUse
-
- Declaration
- function OptimumHeapToUse(RecLen : Word;
- NumRecs : LongInt) : LongInt;
- Purpose
- Returns the smallest heap space for a sort with no merging.
- Description
- Given the size of each record (RecLen) and the number of records to
- be sorted (NumRecs), OptimumHeapToUse returns the amount of heap
- space needed to perform the sort entirely in memory. Additional heap
- space does not help the sort. Less heap space causes merging.
-
- Because of a quirk in the BP7 DPMI heap manager, OptimumHeapToUse
- adds 2048 bytes to the actual computed heap requirement. This is
- important because sometimes the DPMI heap manager consumes about
- 2000 bytes of heap space for its internal use.
- See Also
- MinimumHeapToUse
-
- Declaration
- function PutElement(var X) : Boolean;
- Purpose
- Submits an element to the sort system.
- Description
- Call this function within your SendToSortEngine routine for each
- element to be sorted. Pass the element to be sorted in the untyped
- parameter X. PutElement returns True if the element is successfully
- processed by MergeSort. It returns False if an error occurred; do
- not continue to call PutElement in this case.
- See Also
- GetElement
-
-
- How MSORTP Works
- ---------------------------------------------------------------------
- The MergeSort function allocates a group of identical segments by
- calling GlobalAlloc repeatedly. (When MSORTP is compiled for DOS real
- mode, it provides a substitute routine for GlobalAlloc that is just a
- shell around the normal Turbo Pascal GetMem routine.) Each segment
- holds an even power-of-two number of records. The segments
- collectively use no more than MaxHeapToUse bytes. The last record
- element of the group of segments is used to store a pivot element. The
- next to last record element is usually used for an element swap buffer
- (see below). The remaining elements are collectively called the run
- buffer and are used for an in-memory sort of each "run" of elements.
- The "run" must number at least MinRecsPerRun or the sort will not
- proceed. MergeSort also chooses the segment size so that there are at
- least MergeOrder+1 segments, since these segments are reused as
- buffers later during the merge process.
-
- To avoid the use of complicated code for writing and reading buffers
- and an excessive number of LongInt operations, each segment must be
- smaller than 65536 bytes. Subject to the restrictions already
- mentioned, however, the segment size is maximized. Hence, typical
- segment sizes are 32768-65535 bytes. If the segment size is 32768, the
- maximum memory MergeSort can access is 8MB for the default value of
- MaxSelectors (256). If this situation adversely effects the
- performance of a sort, MaxSelectors can be increased to 512 to provide
- access to the full 16MB of RAM that the DPMI manager can use. The cost
- is 512 additional bytes of data segment space. Conversely, if a sort
- will never need to access 8MB of memory, MaxSelectors can be reduced
- and data segment usage will decrease proportionately.
-
- Elements are obtained when SendToSortEngine calls PutElement, which
- adds elements to the run buffer. When the run buffer fills, or when
- there are no more elements, the run buffer is sorted in memory using a
- non-recursive quicksort which calls the Less function to compare
- elements. If all elements fit into a single run buffer, no merging is
- required and GetFromSortEngine retrieves the elements directly from
- the sorted run buffer by calling GetElement.
-
- If there are more elements than fit in one run buffer, the sorted run
- buffer is written to a merge file assigned a sequential number. Merge
- files are created until all elements have been obtained and all run
- buffers sorted.
-
- When all merge files have been created, the merge process starts. The
- first MergeOrder segments of memory are reused as input buffers for
- reading from up to MergeOrder input files opened simultaneously. The
- next segment is used as an output buffer for storing the merged
- result. Any other allocated segments are not used during merging. Up
- to MergeOrder merge files are opened for input and a bufferful is read
- from each merge file into its buffer. The first unused element in each
- buffer is compared and the smallest is sent to the output. When the
- elements in an input file are exhausted, the file is closed and
- deleted. When all input files are gone, the output file is closed. The
- output file then becomes part of the input merge sequence for the next
- merge phase.
-
- When fewer than MergeOrder input files remain, each merged output
- element is passed directly back to the caller via the GetElement
- routine instead of writing it to another merge file. In this way,
- MergeSort avoids a final pass of writing to disk and reading back the
- records.
-
- Due to a quirk of the BP7 DPMI heap manager, MergeInfo or MergeSort
- may appear to leak heap space. Both routines deallocate all of the
- segments that they initially allocate. The quirk is that, after the
- DPMI heap manager makes a small number of allocations, it allocates
- about 2000 bytes of memory for itself and does not release this memory
- even if all of the user allocations are later released. Hence,
- MergeInfo may return a value of HeapUsed about 2000 bytes larger than
- needed, and MemAvail may be 2000 bytes smaller after you call
- MergeInfo or MergeSort. OptimumHeapToUse and MinimumHeapToUse also
- return a value 2048 bytes bigger than may be needed because of this
- quirk.
-
- When MergeSort performs an in-memory quick sort of a run, the size of
- the data record determines whether MergeSort swaps pointers to records
- or the complete records. When record size is greater than or equal to
- SwapThreshold (default 64), MergeSort swaps pointers. Otherwise, it
- swaps entire records. When it swaps pointers, MergeSort must allocate
- an additional 4 bytes of space for each record to hold the swap
- pointer. It must also reserve one segment for writing the sorted run
- to a merge file. Given the default SwapThreshold value of 64, the
- memory overhead of swapping pointers is typically less than 10%.
- Swapping pointers gives large gains in sort speed (2x or more) for
- large records that can be sorted completely in memory.
-
-
- Version History
- ---------------------------------------------------------------------
- Version 5.40 1/93
- Initial release. Version number synchronized with that of B-Tree
- Filer.
-
- Copyright and License Information
- ---------------------------------------------------------------------
- MSORTP is copyright (c) 1993 by TurboPower Software. All Rights
- Reserved.
-
- Although the MSORTP files are copyrighted, you may use and distribute
- them freely as long as you do not sell them or include them with other
- software that you sell.
-
- The latest version of MSORTP is always available in LIB 6 of the
- PCVENB forum of CompuServe. It's also available on the TurboPower BBS
- at 719-260-9726. Look for MSORTP.ZIP.
-
- MSORTP is an excerpt from TurboPower Software's product B-Tree Filer.
- B-Tree Filer is a collection of routines for managing single user and
- network databases, as well as for accessing common network functions
- on Novell and NetBIOS networks. Download FILER.BRO from LIB 6 of
- PCVENB for more information, or give TurboPower a call.
-
- You can reach TurboPower at:
-
- TurboPower Software
- P.O. Box 49009
- Colorado Springs, CO 80949-9009
-
- 719-260-6641 (voice, Monday-Friday 9AM-5PM)
- 719-260-7151 (fax)
- 719-260-9136 (sales)
- 719-260-9726 (BBS)
- Compuserve: 76004,2611