home *** CD-ROM | disk | FTP | other *** search
Text File | 1987-05-14 | 8.2 KB | 173 lines | [TEXT/M4PL] |
- Technical note on the Browser - 1987 April 12 - Updated 1987 Apr 29 & May 14
-
- by Mark^Zimmermann
- 9511 Gwyndale Drive
- Silver Spring, MD 20910
- 301-565-2166
- science@@nems.arpa - 75066,2044 on CompuServe
-
- This note is meant to describe briefly the design philosophy and
- algorithms used in Browser (a.k.a. Tiny Browser, Recall, Boxcars, etc.)
- as of today. For additional details, contact me at the above
- addresses, or consult the source code of the program itself.
-
-
- OVERVIEW
-
- The Browser goal is to provide a user with an exceedingly
- simple, exceeding fast environment in which to be exceedingly
- productive in working with very large (many megabytes) but very
- disorganized files of free-text information. In order to be useful
- in the real world, the program must be up and running (not just a
- research project) this year. It must run on real hardware,
- available to real people, not on systems too expensive,
- experimental, or unreliable to use. It must not require its users
- to go through gyrations in order to get data into or out of the
- program. It must not assume that data comes in neat, tidy,
- well-formatted bundles.
-
- Ideally, the Browser should run on any piece of hardware
- that is available. Practically, the first system that it is being
- developed on is the Apple Macintosh, because I have one and because
- it has an excellent user interface and programming environment for
- developing the Browser. A near-term goal is to transport some
- version of the program to the IBM-PC and compatibles, and to the
- Sun Workstation. Longer term goals are to make all or part of it
- run on mainframe IBM computers, Cray supercomputers, etc.
-
- The program at present is written in MacForth Plus and runs on
- any Macintosh system with 512KB of memory or more, including the
- Mac II. It consists of about 2,000 lines of Forth code, heavily
- commented. Run the program to get an idea of what features have
- been implemented and what the user interface looks like ... they
- change from week to week as I put more things in.
-
-
- DATA STRUCTURES
-
- In keeping with the spirit of the program, as well as the
- simple-mindedness of the programmer, there is really only one data
- structure used in the Browser, and it is highly primitive.
- Above all, locality of reference is the Holy Grail -- with the idea
- of enhancing the program's speed when running with devices such as
- optical disks, which currently have rather slow rates for moving
- their heads around on the media. Side-effects of the simple
- approach are good execution speed, simplicity in sorting and
- accessing the data, and ease of debugging. The main penalty is
- wasted disk storage space -- a penalty that is becoming
- increasingly irrelevant as big, fast mass media become cheaper
- every day.
-
- The inputs to the Browser are two types of files: text
- files and index files. Text files consist of ASCII characters, as
- usual. An index file consists of index records of 16 bytes each:
- - 4 bytes of pointer into the text file being indexed, and
- - 12 bytes of index key
- The index key is extracted from the text file beginning at the
- pointer value, turned into all capital letters, and filled out with
- blanks from the end of the word indexed to the end of the 12 bytes.
- These 16-byte index records are then simply arranged in
- alphabetical order, sorted by their index key field, in a linear
- array -- the index file.
-
- For a specific example, if the input file consisted of the
- words "Now is the time now" then the index file would look like:
- 4IS 0NOW 16NOW 7THE 11TIME
- where there is obviously a lot of wasted blank space.
-
- The recent addition of a Subindex search capability, described
- below, arguably adds another data structure to the Browser; see the
- discussion in that section for details.
-
-
- ALGORITHMS
-
- Similar to the discussion of data structures above, there
- is only one algorithm that deserves the name in the current
- Browser, and that algorithm is many decades old and very simple.
- In order to sort the index of words into alphabetical order, the
- program executes a classical radix sort.
-
- This is the same sort used on old punched-card sorters.
- Starting with the rightmost of the 12 columns of the index key
- field, we take the index records and throw them into bins
- corresponding to the letter in that column. We take the contents
- of those bins and stack them together, and repeat the process, this
- time using the letter in column 11 to determine which bin each card
- goes into. Etc., etc., until the final pass on column 1 ends with
- all of the cards, I mean index records, sorted. Try it out by
- hand to get an idea of how it works, or read about it in any
- standard textbook.
-
- The radix sort has many advantages besides its simplicity. It
- takes time that grows linearly with the number of index records to
- be sorted -- thus, there is no logarithmically-increasing speed
- penalty. It is a stable sort, in that identical words will end up
- in the index appearing in the same order in which they appeared in
- the original file. (That is convenient for the user who wants to
- browse sequentially through the data.) The radix sort vectorizes
- well, an advantage when sorting on supercomputers or other
- vector-oriented machines. Additionally, the radix sort can be used
- with a very simple data buffering structure, since information is
- read in strictly sequentially from mass storage, and is moved out
- in simple, repetitive patterns.
-
- The simple radix sort does suffer from one disadvantage: it
- requires free space on the disk equal in size to the file being
- sorted. That space is used for the storage of the "bins" into
- which items are thrown during the 12 sorting passes, and it is free
- again at the end of the sort. I have not found that requirement
- for space to be a significant problem. (In most cases, a user
- would like to keep some space free for later data accumulation,
- note-taking, etc., and so it seems unlikely that a file that
- occupied all of free disk space would ever need to be sorted.)
-
- Other than the radix sort, the only other computational
- technique that comes to mind as being used in the Browser is
- the binary search, in order to respond to a user's keystrokes in
- the Index Window display. Does that count as an "algorithm"?
-
-
- SUBINDEX IMPLEMENTATION
-
- Versions of Browser after v.240 have a new capability: Subindex
- searching. If a user wants to find, for example, all occurrences of
- the words "computer" or "computers" near the word "Apple", it is now
- easy to do so without being forced to visually scan through a large
- number of irrelevant context items.
-
- The way I have implemented that is to use a simple linear
- array of flag bits, one for every 32 characters (bytes) in
- the original text file. If a flag bit is set to 1, then that
- region of the database is "valid"; if 0, it is "invalid". The
- Index Window now counts and displays occurrences of indexed words
- in valid regions of the data file. That information is shown along
- with the total number of occurrences of items in the database (including
- valid and invalid regions). A user can clear ("Empty") out the working
- subindex, can set it to equal the entire database ("Fill"), and can
- take the complement of the subindex ("Invert", a boolean NOT operation).
-
- The bit array for the new Index window starts out all 0's. As a
- user selects items by shift-clicking on them, flag bits around the
- terms selected are set to 1's. The distance on each side of each
- selected term to set bits to 1's is a user option; for close
- proximity, only a few bits are be set; for more remotely
- linked terms, many bits are set. The flag array can be
- held completely in memory, since it is a factor of 256 smaller than
- the original database file. So, operations on it are very fast.
-
-
- CONCLUSIONS
-
- Ultimately, the Browser should be a natural extension of
- a user's memory. It thus must be both fast and simple. In order
- to be used, it must also be cheap, both in acquisition cost and in
- the cost of getting data into and out of it. With a simple enough
- programming philosophy, and modern hardware and systems software,
- that goal should achievable.
-
- "Make everything as simple as possible, but no simpler."
- - A. Einstein
-
-
-