Carousel Volume 2 #1

home *** CD-ROM | disk | FTP | other *** search

/ Carousel Volume 2 #1 / carousel.iso / mactosh / appl / browser.sit / Browser tech note < prev next >

Wrap

Text File | 1987-05-14 | 8.2 KB | 173 lines | [TEXT/M4PL]

Technical note on the Browser - 1987 April 12 - Updated 1987 Apr 29 & May 14 by Mark^Zimmermann 9511 Gwyndale Drive Silver Spring, MD 20910 301-565-2166 science@@nems.arpa - 75066,2044 on CompuServe This note is meant to describe briefly the design philosophy and algorithms used in Browser (a.k.a. Tiny Browser, Recall, Boxcars, etc.) as of today. For additional details, contact me at the above addresses, or consult the source code of the program itself. OVERVIEW The Browser goal is to provide a user with an exceedingly simple, exceeding fast environment in which to be exceedingly productive in working with very large (many megabytes) but very disorganized files of free-text information. In order to be useful in the real world, the program must be up and running (not just a research project) this year. It must run on real hardware, available to real people, not on systems too expensive, experimental, or unreliable to use. It must not require its users to go through gyrations in order to get data into or out of the program. It must not assume that data comes in neat, tidy, well-formatted bundles. Ideally, the Browser should run on any piece of hardware that is available. Practically, the first system that it is being developed on is the Apple Macintosh, because I have one and because it has an excellent user interface and programming environment for developing the Browser. A near-term goal is to transport some version of the program to the IBM-PC and compatibles, and to the Sun Workstation. Longer term goals are to make all or part of it run on mainframe IBM computers, Cray supercomputers, etc. The program at present is written in MacForth Plus and runs on any Macintosh system with 512KB of memory or more, including the Mac II. It consists of about 2,000 lines of Forth code, heavily commented. Run the program to get an idea of what features have been implemented and what the user interface looks like ... they change from week to week as I put more things in. DATA STRUCTURES In keeping with the spirit of the program, as well as the simple-mindedness of the programmer, there is really only one data structure used in the Browser, and it is highly primitive. Above all, locality of reference is the Holy Grail -- with the idea of enhancing the program's speed when running with devices such as optical disks, which currently have rather slow rates for moving their heads around on the media. Side-effects of the simple approach are good execution speed, simplicity in sorting and accessing the data, and ease of debugging. The main penalty is wasted disk storage space -- a penalty that is becoming increasingly irrelevant as big, fast mass media become cheaper every day. The inputs to the Browser are two types of files: text files and index files. Text files consist of ASCII characters, as usual. An index file consists of index records of 16 bytes each: - 4 bytes of pointer into the text file being indexed, and - 12 bytes of index key The index key is extracted from the text file beginning at the pointer value, turned into all capital letters, and filled out with blanks from the end of the word indexed to the end of the 12 bytes. These 16-byte index records are then simply arranged in alphabetical order, sorted by their index key field, in a linear array -- the index file. For a specific example, if the input file consisted of the words "Now is the time now" then the index file would look like: 4IS 0NOW 16NOW 7THE 11TIME where there is obviously a lot of wasted blank space. The recent addition of a Subindex search capability, described below, arguably adds another data structure to the Browser; see the discussion in that section for details. ALGORITHMS Similar to the discussion of data structures above, there is only one algorithm that deserves the name in the current Browser, and that algorithm is many decades old and very simple. In order to sort the index of words into alphabetical order, the program executes a classical radix sort. This is the same sort used on old punched-card sorters. Starting with the rightmost of the 12 columns of the index key field, we take the index records and throw them into bins corresponding to the letter in that column. We take the contents of those bins and stack them together, and repeat the process, this time using the letter in column 11 to determine which bin each card goes into. Etc., etc., until the final pass on column 1 ends with all of the cards, I mean index records, sorted. Try it out by hand to get an idea of how it works, or read about it in any standard textbook. The radix sort has many advantages besides its simplicity. It takes time that grows linearly with the number of index records to be sorted -- thus, there is no logarithmically-increasing speed penalty. It is a stable sort, in that identical words will end up in the index appearing in the same order in which they appeared in the original file. (That is convenient for the user who wants to browse sequentially through the data.) The radix sort vectorizes well, an advantage when sorting on supercomputers or other vector-oriented machines. Additionally, the radix sort can be used with a very simple data buffering structure, since information is read in strictly sequentially from mass storage, and is moved out in simple, repetitive patterns. The simple radix sort does suffer from one disadvantage: it requires free space on the disk equal in size to the file being sorted. That space is used for the storage of the "bins" into which items are thrown during the 12 sorting passes, and it is free again at the end of the sort. I have not found that requirement for space to be a significant problem. (In most cases, a user would like to keep some space free for later data accumulation, note-taking, etc., and so it seems unlikely that a file that occupied all of free disk space would ever need to be sorted.) Other than the radix sort, the only other computational technique that comes to mind as being used in the Browser is the binary search, in order to respond to a user's keystrokes in the Index Window display. Does that count as an "algorithm"? SUBINDEX IMPLEMENTATION Versions of Browser after v.240 have a new capability: Subindex searching. If a user wants to find, for example, all occurrences of the words "computer" or "computers" near the word "Apple", it is now easy to do so without being forced to visually scan through a large number of irrelevant context items. The way I have implemented that is to use a simple linear array of flag bits, one for every 32 characters (bytes) in the original text file. If a flag bit is set to 1, then that region of the database is "valid"; if 0, it is "invalid". The Index Window now counts and displays occurrences of indexed words in valid regions of the data file. That information is shown along with the total number of occurrences of items in the database (including valid and invalid regions). A user can clear ("Empty") out the working subindex, can set it to equal the entire database ("Fill"), and can take the complement of the subindex ("Invert", a boolean NOT operation). The bit array for the new Index window starts out all 0's. As a user selects items by shift-clicking on them, flag bits around the terms selected are set to 1's. The distance on each side of each selected term to set bits to 1's is a user option; for close proximity, only a few bits are be set; for more remotely linked terms, many bits are set. The flag array can be held completely in memory, since it is a factor of 256 smaller than the original database file. So, operations on it are very fast. CONCLUSIONS Ultimately, the Browser should be a natural extension of a user's memory. It thus must be both fast and simple. In order to be used, it must also be cheap, both in acquisition cost and in the cost of getting data into and out of it. With a simple enough programming philosophy, and modern hardware and systems software, that goal should achievable. "Make everything as simple as possible, but no simpler." - A. Einstein