Text File | 1992-06-29 | 19.3 KB | 417 lines

This is your introduction to the MIR Tutorial series. It attempts to answer the questions: What? For whom? Why? How? The "How" takes the form of interactive publishing in which you are invited to contribute.

Part of our aim is to make the MIR computer indexing and retrieval techniques widely available, so we include in full the Free Software Foundation's GNU General Public License. This license provides the legal means to ensure there is the maximum freedom (and minimum restriction) for all who wish to understand, use, and further develop techniques of computerized indexing.

════════════════════════════════════
1. COMPUTER INDEXING AND RETRIEVAL TECHNIQUES
════════════════════════════════════

These tutorials are about people and information. People need information. The MIR (Mass Indexing and Retrieval) project has one objective: to make available leading edge technology which may be used to enable people to find information quickly and easily within large quantities of computerized data. The technology is being shared through this introduction plus five sets of tutorials, each accompanied by software with source code.

The tutorial series subtitle is "Finding Information in a Gigabyte World". A gigabyte is 1,073,741,824 characters of data. Visualize a stack of computer paper 140 feet high, or a library of 500 books, or 10,000 hours of reading. More and more, it is becoming commonplace for people to search through quantities of data of that magnitude. The one certainty is that no-one ever wants to read through a pile like that, even at computer speeds, in order to find an item of information.

So our focus in this project is on computerized indexing and retrieval techniques. Well designed index structures and logic can reduce time for a complex search down to seconds or a fraction of a second.

The Mass Indexing and Retrieval project got under way in March, 1991. A freeware introduction was published late in May 1992.
The first of five sets of "tutorials" based on the research was released as shareware in July 1992. Twenty-five software tools for data analysis, complete with source code, were placed on CompuServe and Canada Remote Systems BBS.

We plan to release each of the remaining four tutorials with related programs according to demand. That is, Tutorial TWO will be released when there have been 1,000 shareware registration fees paid for Tutorial ONE, Tutorial THREE will be released when there have been 1,000 registrations for Tutorial TWO, etc. When all five tutorials have been released, we hope to publish a reference text based on the series.

Each tutorial has eight or more sections, and invites input from readers. All materials are copyright, but permission is given to copy and further distribute any of them. The freeware introduction and the shareware tutorial text may not be changed in any way. The software may be freely used, revised, and further distributed within the terms of the Free Software Foundation's GNU General Public License.

What is meant by "interactive" tutorials? I believe that many minds are better than one, and that everybody gains through "open architecture" sharing. The quality of the final software and the final published version of the tutorials will be improved by your questions and suggestions. I encourage you to share technical insights, ideas, clearer wording, source code amendments and even whole new programs. I look to you in particular to expand the range of worked examples; send in real world data that may be included. (While we have worked on hundreds of different databases, you may be able to come up with other interesting challenges.)

Tutorials are meant to be a dialogue. This to me is the exciting part of a learning situation... the more people pitch in with their ideas, and the more enthusiasm they show, the more everybody learns (including the teacher!).

Watch for sections like this in the interactive tutorials:

═════>> QUESTION: Are you with me so far? I may be too close to this stuff, and assume that you should know what is in my mind. What parts need clarification? Send in your comments.

Make a copy of the RESPONSE file which comes with the software. Fill in the relevant sections, and identify any other files that you are sending. The RESPONSE file contains the FAX and e-mail numbers and the mail address. If sending anything lengthy by normal post, please put it on a PC-compatible diskette. <<═════

We continue with an overview of each of the five tutorials and of the final cumulative publication.

═════════════════════════════
1.1 Tutorial ONE... Database Analysis
═════════════════════════════

═════>> QUESTION: Contest!! "Database Analysis" is a humdrum title. We could use snappy headings for everything... for the tutorials, the topics within each, and even individual sections. Maybe our Table of Contents could be as neat as Jerry Weinberg's The Secrets of Consulting... "The Law of the Jiggle", "The Edsel Edict", "The Bigness is Not the Horse", and so on. Make notes as you read, and send in a batch of headings. <<═════

[This section is copied from topic 1.2 in the first tutorial.]

The purpose of MIR Tutorial ONE is to enable you to analyze computerized data from an indexing perspective.

The first topic, source code guidelines, explains the perspectives that have been built into the software that is provided with the tutorials. People who wish to improve on the technology are shown how to share their insights and C language source code.

Methods of data gathering affect the cost, the quality and the complexity of the task of indexing. An index adds value to data, so we pay attention to some marketing considerations.

Data analysis has to do with recognizing various forms in which data is accumulated, and detecting the inconsistencies (common in large sets of data) that make indexing more challenging.
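
As a small taste of what recognizing the form of data can involve, here is a minimal sketch in C. It is not one of the MIR tools; the function names count_bytes, count_nontext and print_histogram are hypothetical, chosen only for this illustration. The sketch assumes that tallying how often each byte value occurs is a reasonable first look at a file, since bytes outside the printable ASCII range hint at binary fields, packed numbers or embedded control codes.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper (not a MIR program): tally how often each
   byte value 0..255 occurs in a buffer. */
void count_bytes(const unsigned char *buf, size_t len,
                 unsigned long counts[256])
{
    size_t i;

    for (i = 0; i < 256; i++)
        counts[i] = 0;
    for (i = 0; i < len; i++)
        counts[buf[i]]++;
}

/* Total the bytes that are neither printable ASCII (32..126)
   nor tab, line feed or carriage return. */
unsigned long count_nontext(const unsigned long counts[256])
{
    unsigned long n = 0;
    int b;

    for (b = 0; b < 256; b++) {
        int is_text = (b >= 32 && b <= 126) ||
                      b == '\t' || b == '\n' || b == '\r';
        if (!is_text)
            n += counts[b];
    }
    return n;
}

/* Print one line for each byte value that actually occurs. */
void print_histogram(const unsigned long counts[256], FILE *out)
{
    int b;

    for (b = 0; b < 256; b++)
        if (counts[b] > 0)
            fprintf(out, "%3d (0x%02X): %lu\n",
                    b, (unsigned)b, counts[b]);
}
```

Under these assumptions, a count_nontext result of zero suggests simple ASCII text, while a large result suggests content that needs closer inspection before indexing.
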
Data format offers possibilities and imposes limitations that will face searchers who wish to extract information. How might the data be structured in a way that better suits the needs of searchers? The reader is provided with a variety of software tools for this critical data analysis function.

The ability to identify patterns in byte sequences quickly is critical to keeping indexing costs low. We examine a series of software tools for this purpose.

Worked examples are provided of the analysis stage. These topics are at a "nuts and bolts" level... use such and such a program, here is the input, here is the output, and here is what the results mean. The sequence is from simplest to most complex... simple ASCII text, ASCII with markup, fielded text, fixed length records, the addition of packed numbers, then various forms of binary data. Data deblocking is explained at this stage since it may be required in order to finish analysis of the data.

At the end of TUTORIAL ONE, the participant has detailed exposure to the techniques of data analysis, and is able to use a selection of analysis tools (source code provided) to recognize and interpret a wide range of data types.

═══════════════════════════════════════
1.2 Tutorial TWO... Secrets of Data Preparation
═══════════════════════════════════════

The first topic sets out a simple ASCII text format which makes data suitable for automated indexing. Careful planning of data sequence and layout can speed up response to search requests. What the searcher sees later depends on a series of decisions made during data preparation. Example: What is to be the unit of search (article, paragraph, computer record, a fixed length record, etc.)?

A second topic delves into other issues in data organization: the use of invisible fields, pointers, parameter controls, data that must remain accessible to other software and the handling of multimedia data.

Standard Generalized Markup Language enhances the end user's ability to control layouts of records found during search. It may be embedded in data without hindering automated indexing. We look at how to distinguish flexible versus fixed display, how to handle oversize tables, etc.

Data preprocessing describes the task of converting data to a standardized production format. In some cases, it's easy. If the analysis has been thorough, there should be few surprises. Yet experience shows that setting up the preprocessing sequence can still be the most expensive aspect of all. We look at a series of standardized tools to make the job easier and more efficient.

Worked examples show how to use combinations of standardized tools and custom software. We look at how to extract data from several kinds of typesetting codes. This section is intended to be as practical as possible, so readers are invited to submit sample real world data.

One of the surprises for you in this tutorial is a detailed analysis of why compression before indexing makes more sense than would at first appear. The standardized ASCII layout can be used as an intermediate step toward a compressed version which greatly increases the indexing capacity of a personal computer. We examine some integerizing techniques and software.

At the end of TUTORIAL TWO, the user can make decisions about the layout of data, and implement those decisions using a variety of data conversion tools. Source code for these tools is provided with TUTORIAL TWO. You will have been exposed to issues in writing custom data conversion tools. The user is able to compress large databases into integerized format, in order to make it practical to index them on a personal computer.

══════════════════════════════════════
1.3 Tutorial THREE... Keys to Automated Indexing
══════════════════════════════════════

Indexing basics start with an explanation of index formats, and how they may be combined through Boolean logic.
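
To make the idea of combining indexes through Boolean logic concrete, here is a minimal sketch in C. It assumes, purely for illustration, that each index entry is an ascending list of unique record numbers; this introduction does not specify the actual MIR index format, and the function names list_and, list_or and list_and_not are hypothetical. Under that assumption each Boolean operation becomes a single pass over two sorted lists.

```c
#include <stddef.h>

/* Assumption for this sketch: an index entry is an ascending list
   of unique record numbers. Each function writes its result to
   `out` and returns the number of entries written. */

/* a AND b: records present in both lists.
   `out` needs room for the shorter of the two lists. */
size_t list_and(const unsigned long *a, size_t na,
                const unsigned long *b, size_t nb,
                unsigned long *out)
{
    size_t i = 0, j = 0, n = 0;

    while (i < na && j < nb) {
        if (a[i] < b[j])
            i++;
        else if (b[j] < a[i])
            j++;
        else {
            out[n++] = a[i];
            i++;
            j++;
        }
    }
    return n;
}

/* a OR b: records present in either list, without duplicates.
   `out` needs room for na + nb entries. */
size_t list_or(const unsigned long *a, size_t na,
               const unsigned long *b, size_t nb,
               unsigned long *out)
{
    size_t i = 0, j = 0, n = 0;

    while (i < na && j < nb) {
        if (a[i] < b[j])
            out[n++] = a[i++];
        else if (b[j] < a[i])
            out[n++] = b[j++];
        else {
            out[n++] = a[i];
            i++;
            j++;
        }
    }
    while (i < na)
        out[n++] = a[i++];
    while (j < nb)
        out[n++] = b[j++];
    return n;
}

/* a AND NOT b: records in a that do not appear in b.
   `out` needs room for na entries. */
size_t list_and_not(const unsigned long *a, size_t na,
                    const unsigned long *b, size_t nb,
                    unsigned long *out)
{
    size_t i, j = 0, n = 0;

    for (i = 0; i < na; i++) {
        while (j < nb && b[j] < a[i])
            j++;
        if (j == nb || b[j] != a[i])
            out[n++] = a[i];
    }
    return n;
}
```

A compound search can then be evaluated by chaining the calls, feeding the output of one operation in as an input to the next. Because every list stays sorted, the cost of each step grows only with the combined length of the two lists involved.
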
We look at grouping indexes within separate field lists, and also at how to tag index items within a global index list.

The topics on search term selection show how to go beyond simple word indexing to enable search on word fragments, phrases, topics and numeric or date ranges. Files are created for each "field" in a database. We look at means to upgrade these field files and to ensure strict quality control over the indexes.

Specialized index preparation leads us into "fuzzy search" of alternate verb forms (search on "is" calls up "were", "shall be", "was", "isn't", "to be", etc.) and nouns (possessives, plurals, etc.). Search on synonyms and correlates is related; the power depends on how much context is taken into account to distinguish homonyms... words of one spelling with radically different meanings. Pattern indexing provides extra speed where the searcher may specify extended word sequences. The issue of "relevance" of found records carries the discussion further into automated subject recognition.

Automated indexing is critical to limiting costs; one efficient set of software programs (called an "inversion engine") can be used to build the indexes for virtually any data originally expressed in alphabetic letters, digits, and other keyboard characters. The structure of the index is critical to how quickly the retrieval software can perform Boolean combinations... ((this-word OR that-phrase) AND something else AND NOT another term). The automated indexing software creates indexes in a format geared to high speed Boolean operations when used for search. We look at software (source code provided) for two "inversion engines", one using strings, the other working from integerized data.

At the end of TUTORIAL THREE, the user is familiar with the tools necessary to set up and create computer indexes, tailoring the index types according to the needs of searchers in the target database.

═════════════════════════════════
1.4 Tutorial FOUR... Search Engines and Information Retrieval
═════════════════════════════════

This is the most technical of the five tutorials. Everything up to this point has been the concern of the indexer. Now we turn to the "run time" or retrieval software.

Retrieval describes the search process... specifying a search, performing Boolean logic on combinations of terms, identifying data that meets the search criteria, and making the selected data available to the searcher.

Under the topic dealing with Search Engine Servers, we review an SFQL (Structured Full Text Query Language) server which is provided with TUTORIAL FOUR. Alternate server options (CD-RDx et al) will be reviewed.

Search Engine Client (interface) software is deliberately left outside the "copyleft" software set; no single interface can encompass the range of features desirable for all data types and search situations. We comment on current issues in standardization.

Search extensions include:

» optimization of index structures;
» search across multiple databases at a time; and
» dynamic definition of search objects.

By the end of TUTORIAL FOUR, the user has available the know-how and software to analyze, prepare, index, and provide search capability for a diverse range of data types and search requirements. Any engine-independent interface built to SFQL specifications may be used to implement search at high speed across large quantities of data.

═══════════════════════════════════════════
1.5 Tutorial FIVE... Related Topics and Applications
═══════════════════════════════════════════

The list of related topics and applications will continue to grow, based on reader comments on earlier tutorials. Our experience in CD-ROM preparation has already led us to include the following areas of interest:

» Very often desired text or records are not found, because the words and phrases used to describe the target are not present. Automated concept recognition gets around the problem.
  Automated key word selection is a related method that reduces costs in preparing an index, and increases the power of search.

» Encryption: We believe that encryption merely dissuades the idle browser and raises costs to the determined criminal. We discuss straightforward methods that serve these purposes admirably. Even where the technique is known, it takes an inordinate amount of computer time for the thief to identify the seed values.

» Data cleaning combines the benefits of indexing with spell checking to enable low cost cleanup of massive databases.

» Records and Information Management (RIM) is a full discipline in its own right. The technology and plummeting costs of full text archiving are bringing about a revolution in RIM philosophy and methods of records retention. There are some simple tricks that can be applied to archiving with spectacular results.

» Correlation studies using indexed retrieval and high speed Booleans can change the nature of research. A cell in a correlation table turns out to be a search count. Mainframe, move aside. The PC is here.

═══════════════════════════════
1.6 The MIR Tutorials: The Book and CD-ROM
═══════════════════════════════

As the five interactive tutorials are released there will be an ongoing revision and updating process. This will reflect your responses and improvements on the content, and encompass many of the samples and suggestions that you have made.

The first four reworked tutorials will be put together with Tutorial FIVE and be published as an ongoing reference work. We will decide closer to the final publication date whether the final version will be

» loose-leaf, or
» bound as a reference or text book, and/or
» electronic (ASCII, WordPerfect, and PageMaker files) on a CD-ROM.

Whatever the form of the tutorial text, all programs, source code and worked examples will be supplied on a CD-ROM.

═════════════════════════════════════════
1.7 Timing of successive releases
═════════════════════════════════════════

The major unknown in the Mass Indexing and Retrieval project is the readiness of the marketplace to deal with copyleft and the notion that, through sharing, the benefits of $800,000 in development can be picked up for less than $500. This is the old marketing problem of perception of value. We are taking the risk of volume shareware pricing; we are betting that there are enough people in the field who can recognize value based on the introduction and the first tutorial. Marpex Inc. reserves the right to discontinue the project if there is insufficient demand.

As mentioned earlier, we plan to release each tutorial according to demand for the previous tutorial. Tutorial TWO will be released when there have been 1,000 shareware registrations of Tutorial ONE, Tutorial THREE will be released when there have been 1,000 registrations of Tutorial TWO, etc. At the same time as a Tutorial is released, the related software will be placed on BBS (bulletin board systems) under "copyleft" redistribution rules.

What about organizations with a burning need to proceed faster than the general market? Software may be made available prior to the release dates for alpha site testing by registered users who get actively involved and contribute their improvements. For others, we do offer consulting services.

═══════════════════
1.8 Summary
═══════════════════

This completes our introduction to the series of five tutorials on how to enable people to retrieve information from large accumulations of data. Related high speed indexing and retrieval software is being distributed under the "copyleft" rules of the Free Software Foundation.

Interactive publishing enables you to:

» study the techniques in the tutorials and examples;
» put the source code to use, personally or commercially, without payment of license fees;
» further develop the computer source code; and
» contribute your insights.