- This is your introduction to the MIR Tutorial series.
- It attempts to answer the questions: What? For whom? Why? How?
- The "How" takes the form of interactive publishing in which you are
- invited to contribute. Part of our aim is to make the MIR computer
- indexing and retrieval techniques widely available, so we include
- in full the Free Software Foundation's GNU General Public License.
- This license provides the legal means to ensure there is the
- maximum freedom (and minimum restriction) for all who wish to
- understand, use, and further develop techniques of computerized
- indexing.
-
- ════════════════════════════════════
-
- 1. COMPUTER INDEXING
- AND RETRIEVAL TECHNIQUES
-
- ════════════════════════════════════
-
-
- These tutorials are about people and information.
- People need information. The MIR (Mass Indexing and Retrieval)
- project has one objective: to make available leading edge
- technology which may be used to enable people to find information
- quickly and easily within large quantities of computerized data.
- The technology is being shared through this introduction plus five
- sets of tutorials, each accompanied by software with source code.
-
- The tutorial series subtitle is "Finding Information in
- a Gigabyte World". A gigabyte is 1,073,741,824 characters of data.
- Visualize a stack of computer paper 140 feet high, or a library of
- 500 books, or 10,000 hours of reading. More and more, it is
- becoming commonplace for people to search through quantities of
- data of that magnitude. The one certainty is that no-one ever wants
- to read through a pile like that, even at computer speeds, in order
- to find an item of information. So our focus in this project is on
- computerized indexing and retrieval techniques. Well designed
- index structures and logic can reduce time for a complex search
- down to seconds or a fraction of a second.
-
- The Mass Indexing and Retrieval project got under way
- in March, 1991. A freeware introduction was published late in May
- 1992. The first of five sets of "tutorials" based on the research
- was released as shareware in July 1992. Twenty-five software tools for data
- analysis, complete with source code, were placed on CompuServe and
- Canada Remote Systems BBS. We plan to release each of the
- remaining four tutorials with related programs according to demand.
- That is, Tutorial TWO will be released when there have been 1,000
- shareware registration fees paid for Tutorial ONE, Tutorial THREE
- will be released when there have been 1,000 registrations for
- Tutorial TWO, etc. When all five tutorials have been released, we
- hope to publish a reference text based on the series. Each
- tutorial has eight or more sections, and invites inputs from
- readers.
-
- All materials are copyright, but permission is given to
- copy and further distribute any of them. The freeware introduction
- and the shareware tutorial text may not be changed in any way. The
- software may be freely used, revised, and further distributed
- within the terms of Free Software Foundation's GNU General Public
- License.
-
- What is meant by "interactive" tutorials? I believe
- that many minds are better than one, and that everybody gains
- through "open architecture" sharing. The quality of the final
- software and the final published version of the tutorials will be
- improved by your questions and suggestions. I encourage you to
- share technical insights, ideas, clearer wording, source code
- amendments and even whole new programs. I look to you in
- particular to expand the range of worked examples; send in real
- world data that may be included. (While we have worked on hundreds
- of different databases, you may be able to come up with other
- interesting challenges.) Tutorials are meant to be a dialogue.
- This to me is the exciting part of a learning situation... the more
- people pitch in with their ideas, and the more enthusiasm they
- show, the more everybody learns (including the teacher!).
-
- Watch for sections like this in the interactive
- tutorials:
-
- ═════>> QUESTION:
- Are you with me so far? I may be too close to this
- stuff, and assume that you should know what is in my
- mind. What parts need clarification? Send in your
- comments. Make a copy of the RESPONSE file which comes
- with the software. Fill in the relevant sections, and
- identify any other files that you are sending. The
- RESPONSE file contains the FAX and e-mail numbers and
- the mail address. If sending anything lengthy by
- normal post, please put it on a PC-compatible diskette.
- <<═════
-
- We continue with an overview of each of the five
- tutorials and of the final cumulative publication.
-
-
-
- ═════════════════════════════
- 1.1 Tutorial ONE...
- Database Analysis
- ═════════════════════════════
-
- ═════>> QUESTION:
- Contest!! "Database Analysis" is a humdrum title. We
- could use snappy headings for everything... for the
- tutorials, the topics within each, and even individual
- sections. Maybe our Table of Contents could be as neat
- as Jerry Weinberg's The Secrets of Consulting... "The
- Law of the Jiggle", "The Edsel Edict", "The Bigness is
- Not the Horse", and so on like that. Make notes as you
- read, and send in a batch of headings.
- <<═════
-
-
- [This section is copied from topic 1.2 in the first
- tutorial.]
-
- The purpose of MIR Tutorial ONE is to enable you to
- analyze computerized data from an indexing perspective.
-
- The first topic, source code guidelines, explains the
- perspectives that have been built into the software that is
- provided with the tutorials. People who wish to improve on the
- technology are shown how to share their insights and C language
- source code.
-
- Methods of data gathering affect the cost, the quality
- and the complexity of the task of indexing. An index adds value to
- data, so we pay attention to some marketing considerations.
-
- Data analysis has to do with recognizing various forms
- in which data is accumulated, and detecting the inconsistencies
- (common in large sets of data) that make indexing more challenging.
- Data format offers possibilities and imposes limitations that will
- face searchers who wish to extract information. How might the data
- be structured in a way that better suits the needs of searchers?
- The reader is provided with a variety of software tools for this
- critical data analysis function.
-
- The ability to identify patterns in byte sequences
- quickly is critical to keeping indexing costs low. We examine a
- series of software tools for this purpose.
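-
- As a minimal sketch of what a tool of this kind might do (an
- illustration only, not one of the MIR programs; the function names
- byte_histogram and count_nontext are hypothetical), the C fragment
- below tallies how often each byte value occurs in a buffer. A high
- count outside printable ASCII suggests binary or packed fields.
-
```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Tally how often each of the 256 possible byte values occurs in buf.
   counts is zeroed first, so one call describes one buffer. */
void byte_histogram(const unsigned char *buf, size_t len,
                    unsigned long counts[256])
{
    assert(buf != NULL || len == 0);
    memset(counts, 0, 256 * sizeof counts[0]);
    for (size_t i = 0; i < len; i++)
        counts[buf[i]]++;
}

/* Total the bytes that are neither printable ASCII nor tab, line feed
   or carriage return. */
unsigned long count_nontext(const unsigned long counts[256])
{
    unsigned long n = 0;
    for (int b = 0; b < 256; b++) {
        int is_text = (b >= 0x20 && b <= 0x7E) ||
                      b == '\t' || b == '\n' || b == '\r';
        if (!is_text)
            n += counts[b];
    }
    return n;
}
```
-
- A caller would read a sample of the file into a buffer, call
- byte_histogram, and then inspect the counts to decide which of the
- data types described in the next paragraph it is likely facing.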
-
- Worked examples are provided of the analysis stage.
- These topics are at a "nuts and bolts" level... use such and such
- a program, here is the input, here is the output, and here is what
- the results mean. The sequence is from simplest to most complex...
- simple ASCII text, ASCII with markup, fielded text, fixed length
- records, the addition of packed numbers, then various forms of
- binary data.
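-
- To give a feel for one of these data types: assuming that "packed
- numbers" refers to packed decimal (two digits per byte, with the
- sign in the last half-byte, a convention common on mainframes), a
- field might be decoded roughly as follows. The function name
- unpack_decimal is hypothetical, and accepting only the sign values
- C, D and F is a simplification made for this sketch.
-
```c
#include <assert.h>
#include <stddef.h>

/* Decode a packed-decimal field into *out.
   Every half-byte holds one digit, except the low half of the last
   byte, which holds the sign: 0xD negative, 0xC or 0xF positive.
   Returns 0 on success, -1 if the field is empty or malformed.
   Fields too long for a long will overflow; a sketch only. */
int unpack_decimal(const unsigned char *field, size_t len, long *out)
{
    long value = 0;

    assert(out != NULL);
    if (len == 0)
        return -1;
    for (size_t i = 0; i < len; i++) {
        unsigned hi = field[i] >> 4;
        unsigned lo = field[i] & 0x0F;

        if (hi > 9)
            return -1;
        value = value * 10 + (long)hi;
        if (i + 1 < len) {              /* not the last byte: a digit */
            if (lo > 9)
                return -1;
            value = value * 10 + (long)lo;
        } else if (lo == 0x0D) {        /* last byte: sign half-byte */
            value = -value;
        } else if (lo != 0x0C && lo != 0x0F) {
            return -1;
        }
    }
    *out = value;
    return 0;
}
```
-
- Under these assumptions the three bytes 12 34 5C (hexadecimal)
- decode to 12345, and the two bytes 01 2D decode to -12.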
-
- Data deblocking is explained at this stage since it may
- be required in order to finish analysis of the data.
-
- At the end of TUTORIAL ONE, the participant has
- detailed exposure to the techniques of data analysis, and is able
- to use a selection of analysis tools (source code provided) to
- recognize and interpret a wide range of data types.
-
-
- ═══════════════════════════════════════
- 1.2 Tutorial TWO...
- Secrets of Data Preparation
- ═══════════════════════════════════════
-
- The first topic sets out a simple ASCII text format
- which makes data suitable for automated indexing. Careful planning
- of data sequence and layout can speed up response to search
- requests. What the searcher sees later depends on a series of
- decisions made during data preparation.
-
- Example: What is to be the unit of search (article,
- paragraph, computer record, a fixed length record, etc.)? A second
- topic delves into other issues in data organization: the use of
- invisible fields, pointers, parameter controls, data that must
- remain accessible to other software, and the handling of multimedia
- data.
-
- Standard Generalized Markup Language enhances the end
- user's ability to control layouts of records found during search.
- It may be embedded in data without hindering automated indexing.
- We look at how to distinguish flexible versus fixed display, how to
- handle oversize tables, etc.
-
- Data preprocessing describes the task of converting
- data to a standardized production format. In some cases, it's
- easy. If the analysis has been thorough, there should be few
- surprises. Yet experience shows that setting up the preprocessing
- sequence can still be the most expensive aspect of all. We look at
- a series of standardized tools to make the job easier and more
- efficient.
-
- Worked examples show how to use combinations of
- standardized tools and custom software. We look at how to extract
- data from several kinds of typesetting codes. This section is
- intended to be as practical as possible, so readers are invited to
- submit sample real world data.
-
- One of the surprises for you in this tutorial is a
- detailed analysis of why compression before indexing makes more
- sense than would at first appear. The standardized ASCII layout
- can be used as an intermediate step toward a compressed version
- which greatly increases the indexing capacity of a personal
- computer. We examine some integerizing techniques and software.
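-
- One plausible reading of "integerizing" is that each distinct word
- is replaced by a small integer, so that repeated words cost only a
- few bytes each. The sketch below assumes that reading; it is not
- the MIR format. The names vocab, vocab_id and integerize are
- hypothetical, and the linear lookup is chosen for brevity, not
- speed.
-
```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_WORDS    1024
#define MAX_WORD_LEN 32

/* The list of distinct words seen so far; a word's id is its slot. */
struct vocab {
    char words[MAX_WORDS][MAX_WORD_LEN];
    int  count;
};

/* Return the id of word, adding it to the list if it is new.
   Returns -1 if the list is full or the word is too long. */
int vocab_id(struct vocab *v, const char *word)
{
    assert(v != NULL);
    for (int i = 0; i < v->count; i++)
        if (strcmp(v->words[i], word) == 0)
            return i;
    if (v->count == MAX_WORDS || strlen(word) >= MAX_WORD_LEN)
        return -1;
    strcpy(v->words[v->count], word);
    return v->count++;
}

/* Split text on spaces and store one id per word in out.
   Returns the number of ids stored, or -1 on error. */
int integerize(struct vocab *v, const char *text, int *out, int max_out)
{
    char word[MAX_WORD_LEN];
    int n = 0;

    while (*text != '\0' && n < max_out) {
        while (*text == ' ')
            text++;
        size_t len = strcspn(text, " ");
        if (len == 0)
            break;                      /* only trailing spaces left */
        if (len >= MAX_WORD_LEN)
            return -1;
        memcpy(word, text, len);
        word[len] = '\0';
        int id = vocab_id(v, word);
        if (id < 0)
            return -1;
        out[n++] = id;
        text += len;
    }
    return n;
}
```
-
- With an empty word list, the text "the cat saw the dog" becomes the
- ids 0 1 2 0 3: the second "the" reuses id 0 instead of storing the
- word again.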
-
- At the end of TUTORIAL TWO, the user can make decisions
- about the layout of data, and implement those decisions using a
- variety of data conversion tools. Source code for these tools is
- provided with TUTORIAL TWO. The user will also have been exposed to
- issues in writing custom data conversion tools, and is able to
- compress large databases into integerized format, in order to make
- it practical to index them on a personal computer.
-
-
- ══════════════════════════════════════
- 1.3 Tutorial THREE...
- Keys to Automated Indexing
- ══════════════════════════════════════
-
- Indexing basics start with an explanation of index
- formats, and how they may be combined through Boolean logic. We
- look at grouping indexes within separate field lists, and also at
- how to tag index items within a global index list.
-
- The topics on search term selection show how to go
- beyond simple word indexing to enable search on word fragments,
- phrases, topics and numeric or date ranges. Files are created for
- each "field" in a database. We look at means to upgrade these
- field files and to ensure strict quality control over the indexes.
-
- Specialized index preparation leads us into "fuzzy
- search" of alternate verb forms (search on "is" calls up "were",
- "shall be", "was", "isn't", "to be", etc.) and nouns (possessives,
- plurals, etc.). Search on synonyms and correlates is related; the
- power depends on how much context is taken into account to
- distinguish homonyms... words of one spelling with radically
- different meanings. Pattern indexing provides extra speed where
- the searcher may specify extended word sequences. The issue of
- "relevance" of found records carries the discussion further into
- automated subject recognition.
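-
- The verb-form case can be pictured as a lookup that expands one
- search term into a group of equivalent forms. The fragment below is
- a minimal sketch under that assumption, using only the forms quoted
- above; the names be_forms, form_groups and alternate_forms are
- hypothetical, and a real table would be built during index
- preparation, not written into the program.
-
```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One group of forms treated as equivalent for search; NULL ends it. */
static const char *const be_forms[] = {
    "is", "was", "were", "shall be", "isn't", "to be", NULL
};

/* Every group known to this sketch; NULL ends the list. */
static const char *const *const form_groups[] = { be_forms, NULL };

/* Return the group that contains term, or NULL if no group does. */
const char *const *alternate_forms(const char *term)
{
    assert(term != NULL);
    for (size_t g = 0; form_groups[g] != NULL; g++)
        for (size_t f = 0; form_groups[g][f] != NULL; f++)
            if (strcmp(form_groups[g][f], term) == 0)
                return form_groups[g];
    return NULL;
}
```
-
- The retrieval software would then search on every form in the
- returned group and combine the results with a Boolean OR.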
-
- Automated indexing is critical to limiting costs; one
- efficient set of software programs (called an "inversion engine")
- can be used to build the indexes for virtually any data originally
- expressed in alphabetic letters, digits, and other keyboard
- characters. The structure of the index is critical to how quickly
- the retrieval software can perform Boolean combinations... ((this-
- word OR that-phrase) AND something else AND NOT another term). The
- automated indexing software creates indexes in a format geared to
- high speed Boolean operations when used for search.
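-
- To show why index structure matters for Boolean speed, suppose each
- index entry is an ascending list of record numbers with no
- duplicates. That layout is an assumption made for this sketch, not
- a description of the MIR index format, and the names list_and,
- list_or and list_and_not are hypothetical. Each operation is then a
- single pass over the two lists.
-
```c
#include <assert.h>
#include <stddef.h>

/* Every list is an ascending array of record numbers, no duplicates.
   out must have room for the largest possible result.
   Each function returns the number of entries written to out. */

/* Records present in both a and b. */
size_t list_and(const int *a, size_t na, const int *b, size_t nb, int *out)
{
    size_t i = 0, j = 0, n = 0;

    assert(out != NULL);
    while (i < na && j < nb) {
        if (a[i] < b[j])
            i++;
        else if (a[i] > b[j])
            j++;
        else {
            out[n++] = a[i];
            i++;
            j++;
        }
    }
    return n;
}

/* Records present in a, in b, or in both. */
size_t list_or(const int *a, size_t na, const int *b, size_t nb, int *out)
{
    size_t i = 0, j = 0, n = 0;

    assert(out != NULL);
    while (i < na || j < nb) {
        if (j == nb || (i < na && a[i] < b[j]))
            out[n++] = a[i++];
        else if (i == na || b[j] < a[i])
            out[n++] = b[j++];
        else {                          /* same record in both lists */
            out[n++] = a[i];
            i++;
            j++;
        }
    }
    return n;
}

/* Records present in a but not in b. */
size_t list_and_not(const int *a, size_t na, const int *b, size_t nb,
                    int *out)
{
    size_t j = 0, n = 0;

    assert(out != NULL);
    for (size_t i = 0; i < na; i++) {
        while (j < nb && b[j] < a[i])
            j++;
        if (j == nb || b[j] != a[i])
            out[n++] = a[i];
    }
    return n;
}
```
-
- The search quoted above is then three calls in sequence: list_or on
- the first two terms, list_and with the third, and list_and_not with
- the last. No record is examined more than once per operation.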
-
- We look at software (source code provided) for two
- "inversion engines", one using strings, the other working from
- integerized data.
-
- At the end of TUTORIAL THREE, the user is familiar with
- the tools necessary to set up and create computer indexes,
- tailoring the index types according to the needs of searchers in
- the target database.
-
-
- ═════════════════════════════════
- 1.4 Tutorial FOUR...
- Search Engines and
- Information Retrieval
- ═════════════════════════════════
-
- This is the most technical of the five tutorials.
- Everything up to this point has been the concern of the indexer.
- Now we turn to the "run time" or retrieval software. Retrieval
- describes the search process... specifying a search, performing
- Boolean logic on combinations of terms, identifying data that meets
- the search criteria, and making the selected data available to the
- searcher.
-
- Under the topic dealing with Search Engine Servers, we
- review an SFQL (Structured Full Text Query Language) server which
- is provided with TUTORIAL FOUR. Alternate server options (CD-RDx
- et al.) will be reviewed.
-
- Search Engine Client (interface) software is
- deliberately left outside the "copyleft" software set; no single
- interface can encompass the range of features desirable for all
- data types and search situations. We comment on current issues in
- standardization.
-
- Search extensions include:
-
- » optimization of index structures;
-
- » search across multiple databases at a time; and
-
- » dynamic definition of search objects.
-
- By the end of TUTORIAL FOUR, the user has available the
- know-how and software to analyze, prepare, index, and provide
- search capability for a diverse range of data types and search
- requirements. Any engine-independent interface built to SFQL
- specifications may be used to implement search at high speed across
- large quantities of data.
-
-
- ═══════════════════════════════════════════
- 1.5 Tutorial FIVE...
- Related Topics and Applications
- ═══════════════════════════════════════════
-
- The list of related topics and applications will
- continue to grow, based on reader comments on earlier tutorials.
- Our experience in CD-ROM preparation has already led us to include
- the following areas of interest:
-
- » Very often desired text or records are not found,
- because the words and phrases used to describe the
- target are not present. Automated concept recognition
- gets around the problem. Automated key word selection
- is a related method that reduces costs in preparing an
- index, and increases the power of search.
-
- » Encryption: We believe that encryption merely dissuades
- the idle browser and raises costs to the determined
- criminal. We discuss straightforward methods that
- serve these purposes admirably. Even where the
- technique is known, it takes an inordinate amount of
- computer time for the thief to identify the seed
- values.
-
- » Data cleaning combines the benefits of indexing with
- spell checking to enable low cost cleanup of massive
- databases.
-
- » Records and Information Management (RIM) is a full
- discipline in its own right. The technology and
- plummeting costs of full text archiving are bringing
- about a revolution in RIM philosophy and methods of
- records retention. There are some simple tricks that
- can be applied to archiving with spectacular results.
-
- » Correlation studies using indexed retrieval and high
- speed Booleans can change the nature of research. A
- cell in a correlation table turns out to be a search
- count. Mainframe, move aside. The PC is here.
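-
- The remark that a correlation table cell is a search count can be
- sketched as follows. The fragment assumes each search term is held
- as a bitmap in which bit r is set when record r contains the term;
- that representation, and the names both_count and
- correlation_table, are assumptions made for illustration only.
-
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count the records whose bit is set in both bitmaps, that is, the
   number of hits for "x AND y". */
unsigned long both_count(const uint32_t *x, const uint32_t *y,
                         size_t nwords)
{
    unsigned long total = 0;

    for (size_t w = 0; w < nwords; w++) {
        uint32_t both = x[w] & y[w];
        while (both != 0) {
            both &= both - 1;           /* clear the lowest set bit */
            total++;
        }
    }
    return total;
}

/* Fill a table of nrows by ncols cells, stored row by row. Each cell
   is the search count for one row term AND one column term. */
void correlation_table(const uint32_t *const *rows, size_t nrows,
                       const uint32_t *const *cols, size_t ncols,
                       size_t nwords, unsigned long *table)
{
    assert(table != NULL);
    for (size_t r = 0; r < nrows; r++)
        for (size_t c = 0; c < ncols; c++)
            table[r * ncols + c] = both_count(rows[r], cols[c], nwords);
}
```
-
- Each cell costs one pass over two bitmaps, which is why a table of
- this kind is within reach of a personal computer once the indexes
- exist.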
-
-
- ═══════════════════════════════
- 1.6 The MIR Tutorials:
- The Book and CD-ROM
- ═══════════════════════════════
-
- As the five interactive tutorials are released, there
- will be an ongoing revision and updating process. This will
- reflect your responses and improvements on the content, and
- encompass many of the samples and suggestions that you have made.
- The first four reworked tutorials will be put together with
- Tutorial FIVE and published as an ongoing reference work. We
- will decide closer to the final publication date whether the final
- version will be
-
- » loose-leaf, or
-
- » bound as a reference or text book, and/or
-
- » electronic (ASCII, WordPerfect, and PageMaker
- files) on a CD-ROM.
-
- Whatever the form of the tutorial text, all programs, source code
- and worked examples will be supplied on a CD-ROM.
-
-
- ═════════════════════════════════════════
- 1.7 Timing of successive releases
- ═════════════════════════════════════════
-
- The major unknown in the Mass Indexing and Retrieval
- project is the readiness of the marketplace to deal with copyleft
- and the notion that, through sharing, the benefits of $800,000 in
- development can be picked up for less than $500. This is the old
- marketing problem of perception of value. We are taking the risk
- of volume shareware pricing; we are betting that there are enough
- people in the field who can recognize value based on the
- introduction and the first tutorial. Marpex Inc. reserves the
- right to discontinue the project if there is insufficient demand.
-
- As mentioned earlier, we plan to release each tutorial
- according to demand for the previous tutorial. Tutorial TWO will
- be released when there have been 1,000 shareware registrations of
- Tutorial ONE, Tutorial THREE will be released when there have been
- 1,000 registrations of Tutorial TWO, etc. At the same time as a
- Tutorial is released, the related software will be placed on BBS
- (bulletin board systems) under "copyleft" redistribution rules.
-
- What about organizations with a burning need to proceed
- faster than the general market? Software may be made available
- prior to the release dates for alpha site testing by registered
- users who get actively involved and contribute their improvements.
- For others, we do offer consulting services.
-
-
- ═══════════════════
- 1.8 Summary
- ═══════════════════
-
- This completes our introduction to the series of five
- tutorials on how to enable people to retrieve information from
- large accumulations of data. Related high speed indexing and
- retrieval software is being distributed under the "copyleft" rules
- of the Free Software Foundation. Interactive publishing enables
- you to:
-
- » study the techniques in the tutorials and examples;
-
- » put the source code to use, personally or commercially,
- without payment of license fees;
-
- » further develop the computer source code; and
-
- » contribute your insights.