-
-
- ══════════════════════════
-
- 3. DATA GATHERING
-
- ══════════════════════════
-
-
- ════════════════════════════
- 3.1 Some definitions
- ════════════════════════════
-
- Terms like "data" and "information" are often used
- interchangeably. It would be helpful to distinguish among the
- following terms and propose working relationships among them:
-
- datum ══>
- data ══>
- record ══>
- information ══>
- knowledge
-
- A datum is a single fact or historical observation or
- calculated value. In itself, a datum has little meaning; it
- doesn't "inform". The digit '5' is a datum, a statistic without
- context.
-
- The term data is the plural of datum. "Data" is a term
- used in a very general way for any collection of individual facts,
- observations, or calculated values. (The same word is also used as
- a collective singular. One can say, "The data is in the report" OR
- "The data are in the report".) How many kinds of data are there?
- As many as there are phenomena in the universe that can be observed
- or derived by humans. If we limit the focus to computerized data,
- we find even that can take ever so many forms... numbers, readable
- text (words, phrases, sentences, etc.), sounds, pictures or
- graphics, animation, video sequences, and so forth.
-
- A record is a set of related data sufficient to reconstruct
- an event. Each datum provides context for other data within the
- record, so that the combined total takes on meaning. Example: The
- datum "5" out of context tells us virtually nothing. Look what
- happens when we put it within the "record" of a business
- transaction: listing five pairs of black Oxford shoes, style
- D-438, size 10-D, sold to ABC Company on October 20 at $59 per
- pair. This says something useful, especially to persons who
- created the record.
-
- A record is treated as a single unit for search
- purposes. When the searcher enters attributes or words or phrases
- in combination, the retrieval system responds by returning each set
- of data (each "record") that holds those terms or is described by
- those attributes. Some more examples of records: a paragraph, an
- article in a newspaper, a screenful of text, one house in a real
- estate database, a Bible verse, a dictionary entry, etc.
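-
- As a minimal sketch of a record being the unit of retrieval, the
- Python fragment below returns every record that holds all of the
- terms entered. The sample records and the function name search are
- assumptions made for illustration; real retrieval systems work from
- an index rather than scanning every record.

```python
# Hypothetical records: each one is treated as a single unit for search.
records = [
    {"id": 1, "text": "Five pairs of black Oxford shoes, style D-438, sold to ABC Company"},
    {"id": 2, "text": "Two pairs of brown loafers sold to XYZ Company"},
    {"id": 3, "text": "Black Oxford shoes returned by ABC Company"},
]

def search(records, *terms):
    """Return every record whose text contains all of the given terms."""
    wanted = [t.lower() for t in terms]
    return [r for r in records if all(t in r["text"].lower() for t in wanted)]

hits = search(records, "black", "abc")
print([r["id"] for r in hits])   # prints [1, 3]
```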
-
- Information is created when a person (or program)
- searches for, and selects, records according to a purpose. For
- example, the cumulative statistics combining records for all sales
- of black Oxford shoes, compared year to year and by region, are
- informative to a manufacturer who must decide on production plans
- for various shoe styles. Here we have a selection according to a
- purpose. Merely browsing through the data does not create
- information. A purpose is needed to clarify how some records are
- to be selected and all others rejected.
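-
- The shoe example can be sketched in Python as follows. The sample
- sales records, the field names and the function name
- pairs_by_year_and_region are all hypothetical; the point is only
- that a purpose (here, one shoe style) decides which records are
- kept before they are totalled by year and region.

```python
from collections import defaultdict

# Hypothetical sales records (year, region, style, pairs sold).
sales = [
    {"year": 1990, "region": "East", "style": "black Oxford", "pairs": 5},
    {"year": 1990, "region": "West", "style": "black Oxford", "pairs": 8},
    {"year": 1991, "region": "East", "style": "black Oxford", "pairs": 12},
    {"year": 1991, "region": "East", "style": "brown loafer", "pairs": 7},
]

def pairs_by_year_and_region(records, style):
    """Select records for one style, then total pairs per (year, region)."""
    totals = defaultdict(int)
    for r in records:
        if r["style"] == style:      # the purpose decides what is kept
            totals[(r["year"], r["region"])] += r["pairs"]
    return dict(totals)

summary = pairs_by_year_and_region(sales, "black Oxford")
for (year, region), pairs in sorted(summary.items()):
    print(year, region, pairs)
```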
-
- Knowledge is the accumulation of information linked
- into useful relationships within a human mind. (As a definition,
- this won't satisfy the teacher in Philosophy 101, but that isn't
- our purpose.) In a sense, knowledge consists of mutually
- reinforcing sets of information. Relationship or linkage is the
- key. Example: The shoe product manager puts together information
- on past production, current costs, financial condition of the
- company, status of equipment, available skilled labor, economic
- forecasts, market trends, analysis by sales people, and a personal
- awareness of the company and industry. It is this total set of
- linked information that forms the knowledge on which a decision
- will be reached.
-
- Let's put these ideas together:
-
- datum + datum + datum + .... = data
-
- enough related data to
- reconstruct an event = a record
-
- records selected with purpose = information
-
- information linked in a mind = knowledge
-
-
- ════════════════════════════
- 3.2 Why gather data?
- ════════════════════════════
-
- People and organizations accumulate data because it is
- a means to create value or add to value.
-
- Note that data is a means, not an end in itself. Data
- is raw material out of which information is derived. Data has
- value for its potential. But it remains potential until a purpose
- is applied to select and group the data into useful information.
- And it need not be one single purpose. Example: Tens or hundreds
- of thousands of copies of a large metropolitan area telephone book
- are distributed to the user public. Such a data base may be
- referred to for a hundred thousand different purposes in a single
- day. Copied into a personal telephone index or dialed on a
- telephone, the data takes on the value associated with the purpose
- for which it was selected. The value is often small... Time is
- saved, or the toll charge for phoning an Information operator is
- avoided. Sometimes the value is beyond estimation... Ask a parent
- who happened to have the number of the local poison control center
- handy when it was desperately needed.
-
- Data has zero value if it is not accessible. Data is
- a means. The value is according to purpose. If the purpose cannot
- be applied, there is no value. If you can't find it, it's of no
- use to you. If you can't find it, you cannot generate information
- and knowledge with it. Accessibility is the heart of the argument
- in favor of records and information management, quality indexing,
- and simple, powerful retrieval methods. Everything in the MIR
- series aims to add value to data.
-
-
- ═══════════════════════════════════
- 3.3 Who are data gatherers?
- ═══════════════════════════════════
-
- Who are data gatherers? Any person who ever recorded
- an observation, or collected observations made by others, qualifies
- as a data gatherer. If we leave the definition that broad, every
- civilization for which we have any recorded history had its data
- gatherers. Using this definition, even the early cave painters
- would qualify. Let's narrow the focus somewhat. For our purposes,
- data gatherers are organizations or persons who put facts and
- observations into a form that can be manipulated by use of a
- computer. The data gatherers may create new data, or alternatively
- collect existing data. In either case, their output is "machine
- readable". The data may be intended for uses that create value
- internally within the organization, or there may be possible profit
- in wider distribution or publication.
-
-
- ═══════════════════════════════
- 3.4 Keyboard data input
- ═══════════════════════════════
-
- To get data into machine readable form, some form of
- computer software is required.
-
- Many computer users are familiar with text processing.
- Such programs are devices for entering, modifying, deleting and
- formatting text data. They are particularly useful for continuous
- text, such as letters and reports. Typesetting software and
- desktop publishing software are variations that offer extended
- capabilities to prepare text for widespread distribution. These
- usually insert a wide variety of codes to control the format of the
- text. Format controls include underlining, bold text, margin
- sizes, paragraph indentation, centering and justifying text, font
- selection, type size, etc.
-
- Another method of input, good for highly structured
- data, is to present the user with a template in which fields may be
- filled in. For example, here's part of a primitive real estate
- template:
-
- ASKING PRICE, $: ___________ MAP GRID: _____________
- HOUSE #: _____ STREET NAME: _________________________
- DISTRICT: ____________________ CITY: ___________________
- LOT SIZE (sq ft): _________ HOUSE SIZE: _____________
- NO. OF BEDROOMS: ___ FIREPLACES: ____ GARAGE UNITS: __
- IN-GROUND POOL: ___ ABOVE-GROUND POOL: __ SAUNA: __
-
- So-called "fourth generation" programming languages are
- well adapted to creating and manipulating these templates. Each
- template may appear to have its own program. Actually, one program
- behind the scenes may manipulate data in a variety of templates,
- putting limits on the kinds of data and the value ranges that are
- acceptable in many of the fields.
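-
- A rough Python sketch of one program applying type and range
- limits to template fields follows. The field names come from the
- template above, but the limits in FIELD_RULES and the function name
- validate are assumptions chosen only for illustration.

```python
# Hypothetical rules: field name -> (type converter, minimum, maximum).
FIELD_RULES = {
    "ASKING PRICE": (float, 1.0, 100_000_000.0),
    "LOT SIZE": (float, 1.0, 10_000_000.0),
    "NO. OF BEDROOMS": (int, 0, 50),
    "FIREPLACES": (int, 0, 20),
}

def validate(entry):
    """Return a list of error messages; an empty list means the entry passed."""
    errors = []
    for field, (convert, low, high) in FIELD_RULES.items():
        raw = entry.get(field, "")
        try:
            value = convert(raw)
        except (TypeError, ValueError):
            errors.append(f"{field}: {raw!r} is not a valid {convert.__name__}")
            continue
        if not low <= value <= high:
            errors.append(f"{field}: {value} is outside {low}..{high}")
    return errors

good = {"ASKING PRICE": "125000", "LOT SIZE": "6000",
        "NO. OF BEDROOMS": "3", "FIREPLACES": "1"}
bad = {"ASKING PRICE": "abc", "LOT SIZE": "6000",
       "NO. OF BEDROOMS": "300", "FIREPLACES": "1"}
print(validate(good))   # no errors
print(validate(bad))    # one type error, one range error
```
-
- Keeping the rules in one table is what lets a single program serve
- many templates: a new template needs new rules, not new code.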
-
-
- ══════════════════════════════
- 3.5 Scanned data input
- ══════════════════════════════
-
- Many records are created by scanning devices. Point of
- purchase devices interpret universal product symbols; these are
- increasingly common in grocery and other retail stores. Entire
- warehouses can be automated with the help of bar code scanners
- stationed along control points of conveyor belts. Movements of
- goods are entered as records, with exceptional accuracy and
- efficiency.
-
- Not all scanning works that well. Optical scanners
- (which look very much like photocopiers) are used to input the text
- content of sheets of paper or pages of books. Scanning is only as
- good as the software that is used in conjunction with the scanner
- AND the quality of the text being scanned. Optical character
- recognition (OCR) has advanced dramatically with "omnifont"
- software that recognizes characteristics of letters as opposed to
- predetermined layouts. Curiously, the quality of printed text may
- be deteriorating with the spread of desktop publishing. Typeset
- text normally leaves clear space around each character. Low cost
- desktop equipment may cause individual letters to run together
- slightly... especially double letters ('ss' in assembly). I tried
- scanning a 1976 and a 1991 copy of an annual publication that had
- switched from typesetting in 1990. The error rate was 3 per page
- in the 1976 typeset copy, and 103 per page in the 1991 version!
- (One consolation... If desktop publishing was used, somebody
- somewhere may have backup of the computer files; in that case,
- scanning is unnecessary.)
-
- Scanning can present difficulty where the page is not
- a single block. Suppose the page is in three parallel columns.
- Can the system recognize the switch from one column to the next?
- Or is text horizontally in line across the columns run together as
- if it were continuous? Words hyphenated at column ends (and page
- ends) are particularly vulnerable to error.
-
- Early in the 1990s, 99 per cent accuracy in text
- scanning was considered very good. That may be acceptable for
- small databases. But think what a one per cent error rate means
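- for a gigabyte of scanned information. Assuming the average word
- is 7.6 characters long, a gigabyte holds about 131.6 million
- words. If one per cent of those words contain errors, roughly
- 1,316,000 words are wrong. A good portion of them might be found
- through comparison with listings of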
- accepted spellings. But a smudge can turn the word "leap" into the
- entirely different word "heap", and only the most sophisticated
- software has any chance of catching word substitutions of this
- sort. Correcting errors in very large databases is not as
- straightforward as in the typical letter or report; sheer size
- creates its own problems. (...Or opportunities! There will be
- more on data cleaning software in Tutorial FIVE.)
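-
- The arithmetic behind the gigabyte example can be checked with a
- few lines of Python. This sketch assumes a gigabyte of 10**9
- characters and reads the one per cent error rate as applying to
- words; both readings are assumptions, although they reproduce the
- figure quoted above.

```python
GIGABYTE_CHARS = 1_000_000_000   # one gigabyte taken as 10**9 characters
AVG_WORD_LEN = 7.6               # characters per word, as assumed in the text
ERROR_RATE = 0.01                # one per cent of words contain an error

total_words = GIGABYTE_CHARS / AVG_WORD_LEN
words_with_errors = total_words * ERROR_RATE

print(f"{total_words:,.0f} words in total")
print(f"{words_with_errors:,.0f} words with errors")
```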
-
- Data input is the most labor intensive part of making
- data accessible on computers. It is the area of greatest cost
- (barring outrageous royalty charges); input is an area that offers
- much opportunity for improvement in quality.
-
- Here are some considerations prior to scanning a large
- quantity of material:
-
-
- » If the work has been republished in recent years, was
- the text newly typeset? If yes, it may be possible to
- work from the typesetting tape or diskettes. Some
- desktop publishing systems make it easy to extract
- ASCII copies. Extracting text from typesetting codes
- is more complex, but it may be the quickest way to
- produce a clean copy of the text.
-
- » Consider scanning only when there is no really usable
- machine readable alternative. Search out the best
- possible copy of the typeface which is to be scanned.
- The poorer the quality, the higher the error rate.
- Also use recent scanning software, not more than two
- years old.
-
- » Scan a sample, then set a timer as someone proofs a portion of the result.
- Don't expect a spell checker to provide adequate
- proofing; very few check the context. Correct
- spellings of wrong words garble the result with
- surprising frequency.
-
- » If the tests above are within budget, go ahead.
- Otherwise seriously consider having the whole database
- entered at keyboards. (Sigh!)
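-
- One way to turn the timed proofing test into a budget decision is
- sketched below in Python. The function names and every figure in
- the usage lines (sample size, minutes, hourly rate, budget) are
- hypothetical; substitute measurements from your own sample.

```python
def projected_proofing_hours(sample_pages, sample_minutes, total_pages):
    """Scale the time taken to proof a sample up to the whole job."""
    minutes_per_page = sample_minutes / sample_pages
    return minutes_per_page * total_pages / 60

def within_budget(hours, hourly_rate, budget):
    """True if proofing the whole job at this rate fits the budget."""
    return hours * hourly_rate <= budget

# Hypothetical test: 10 pages proofed in 30 minutes, 2,000 pages in all.
hours = projected_proofing_hours(sample_pages=10, sample_minutes=30,
                                 total_pages=2000)
print(hours)                                               # 100.0
print(within_budget(hours, hourly_rate=20, budget=2500))   # True
```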
-
-
- ══════════════════════════════════════════════
- 3.6 Format, standards and common sense
- ══════════════════════════════════════════════
-
- From an indexer's point of view, the ideal world would
- be one in which all computerized data is received in a standard
- format on a standard, large scale medium with a standard, publicly
- shared set of markup codes. Notice a word being repeated? It
- comes from the experience over several years of having to figure
- out the most incredible variations in the way computer data is
- assembled.
-
- Non-standard media? It's still around; obsolete
- typesetting systems are the worst offenders in producing media that
- other machines simply cannot read. A variation is the nine-track
- tape (so far, so good) that turns out to have been created by
- back-up software that makes the tape unreadable for any machine not
- using the same operating system.
-
- Wrong scale media? Consider the friendly customer who
- provides 200 million characters of data on floppy diskettes,
- 360,000 bytes at a time. At the other extreme, I handled another
- database of 2.3 billion characters on good nine-track tape; hours
- were wasted because 2.0 billion characters were blank padding in
- empty fields. Then there was the neatly formatted hierarchical
- text database, beautifully ready in every detail but one. The
- paragraphs were set 90 characters wide. Since the target machine
- had the standard presentation width of 80 characters, it was back
- to the drawing board!
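-
- The scale of these mismatches is easy to work out. The Python
- sketch below assumes one byte per character; the function name
- rewrap is hypothetical, and plain re-flowing with the standard
- textwrap module is only a stand-in for whatever a real hierarchical
- database would need to preserve its structure.

```python
import math
import textwrap

# Diskettes needed for 200 million characters at 360,000 bytes each
# (assuming one byte per character).
diskettes = math.ceil(200_000_000 / 360_000)

# Share of the 2.3 billion character tape that was blank padding.
padding_share = 2.0e9 / 2.3e9

def rewrap(paragraph, width=80):
    """Re-flow one paragraph so that no line exceeds the target width."""
    return textwrap.fill(" ".join(paragraph.split()), width=width)

print(diskettes)                  # 556
print(f"{padding_share:.0%}")     # 87%
```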
-
- Why this small digression? Obviously it makes me feel
- good. Far, far more important... the failure to use standards
- costs the information industry and the end customer bundles of
- money. Standards save money! The use of standards and common
- sense greatly increases the accuracy of cost and time requirement
- forecasts. Jobs get done on time. The customer is well served.
-
- One last thought along this line: The searcher does
- not need to know the intricacies of a particular standard. What is
- important is that technical staff accept the responsibility to
- ensure standards are applied. For example, increasingly there are
- advantages to using some form of SGML (Standard Generalized Markup
- Language). It permits the end user control over the way data is
- presented for viewing on the screen or the way it is printed. The
- results are pleasing, particularly on computers that allow changes
- in print size and character fonts. (Again, the results are even
- more pleasing to the wallet.)
-
-
- ════════════════════════
- 3.7 Data quality
- ════════════════════════
-
- The potential value of data increases with accuracy.
- The single best protection against errors is neither accuracy
- checks nor precise verification methods. It's people who care. If
- there are trained workers with pride of workmanship who are
- permitted reasonable time to ensure quality, then quality has a
- fair chance. There is an attitude, all too common among managers,
- that data entry is a menial job to be done in the cheapest possible
- way. They get what they pay for... cheap performance. The real
- cost is borne by the searcher later on. Data entry errors lead to
- missed records, incomplete search results, and frustration.
-
- Some data input systems make accuracy easier. Template
- based software often includes data type and range checks for each
- field; this stops many errors at their source. Word processing
- packages have spell checkers which catch all but word substitution
- errors. These too should be used as part of the daily entry
- routine.
-
- Quality problems with Optical Character Recognition
- (OCR) equipment and software were mentioned earlier. Visual
- checking by a human is the only effective way to ensure validity of
- scanned numeric data or of words in isolation. Error checking of
- continuous text can be automated up to a point. But comparison to
- lists of correctly spelled words is not enough. Some kind of check
- of nearby vocabulary is needed to catch word substitutions. Since
- intelligent context checking software is not all that common, the
- cost of validating scanned input may turn out to be higher than
- that of the original scanning.
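-
- The gap between a word-list check and a check of nearby vocabulary
- can be shown with a toy Python sketch. The word list, the table of
- likely neighbours and both function names are hypothetical; real
- context checking relies on far larger statistics than this.

```python
# Hypothetical word list and hypothetical table of plausible neighbours.
WORD_LIST = {"a", "great", "leap", "heap", "forward", "of", "sand", "took", "she"}
LIKELY_NEXT = {
    "leap": {"forward", "of"},
    "heap": {"of"},
}

def spelling_errors(words):
    """Words that are not in the list of accepted spellings."""
    return [w for w in words if w not in WORD_LIST]

def suspect_words(words):
    """Correctly spelled words whose following word is not an expected neighbour."""
    suspects = []
    for current, following in zip(words, words[1:]):
        expected = LIKELY_NEXT.get(current)
        if expected is not None and following not in expected:
            suspects.append(current)
    return suspects

scanned = "she took a great heap forward".split()
print(spelling_errors(scanned))   # [] : "heap" is a correctly spelled word
print(suspect_words(scanned))     # ['heap'] : flagged by its neighbour
```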
-
- Timeliness is another aspect of quality. Have you ever
- kept receiving mail for the previous occupants of your home, five
- years after they have moved away? Mailing lists and many other
- forms of data are vulnerable to obsolescence. Again, the cost of
- errors is felt, not by the data gatherer, but by the user.
-
- Consistency is another quality issue that arises in
- text data that has been accumulated over an extended period of
- time. There may have been changes in the software used to enter
- the data. The change may be only in successive revisions of the
- software, so there may be reasonable consistency over time. But
- complete changeovers to different software packages do occur. In
- gigabyte size databases, the resulting inconsistencies may lie
- buried until an attempt is made to prepare the data for indexing
- and search. If so, expect unpredictable and undesirable results.
-
- Data quality can be summed up in terms of the
- willingness of the data gatherer to accept costs to ensure
- accuracy, timeliness, and consistency. Be ready to ask some tough
- questions of organizations providing data for you: How was the
- data gathered? Where was it entered into a computer, and under
- what conditions? Were the keypunchers working in their first
- language? What incentives for accuracy were given to keypunchers
- and to their supervisors? What measures were in place to ensure
- prompt, accurate updates as data changed?
-
-
- ═════════════════════════
- 3.8 Value of data
- ═════════════════════════
-
- Recall that people and organizations accumulate data
- because it is a means to create value or add to value. The primary
- marketing question in data gathering is: For whom? Who will gain
- by having the data available? What are the characteristics of
- persons or groups who are most likely to be able to create value
- using this data?
-
- Is a record worth creating in the first place? We
- don't know, apart from awareness of its potential use. Does the
- data have inherent worth? Any response is idle speculation, apart
- from awareness of who is the potential user. The wise data
- gatherer addresses marketing questions early in planning any new
- project.
-
- The way the data gatherer plans has a direct bearing on
- the quality and cost of use by the searcher (the end customer).
- Here is a series of marketing decisions that directly affect the
- searcher.
-
- Market capacity: If there are lots of people who
- already have a felt need for the data being offered, who have the
- computer equipment and money available, volume pricing might be
- used right from the start. If the data is specialized, and of
- interest to relatively few potential users, the market capacity is
- lower. In this case, expect a "cherry picking" strategy... top
- price at first to reach those most eager, then successive moderate
- price drops to broaden the customer base.
-
- Cost recovery strategy: How much was the investment in
- research and development for this project, and how quickly must
- those funds be recovered? If competition is threatening, these
- costs must be covered quickly. This boosts initial prices. But
- expect more dramatic reductions over time. Alternatively,
- overpricing for fast cost recovery may simply kill the market's
- interest in the product. This happened with great regularity in
- the early days of the CD-ROM industry. The standing joke was that
- the only companies making money on CD-ROMs were the mail couriers.
-
- Educating the market: If enough prospects recognize
- the potential value of the data, marketing and sales costs can be
- held to a moderate level. If on the other hand there must be heavy
- investment in communicating product benefits and in customer hand
- holding, these costs must be loaded into the price.
-
- Perception of value: It is easy to kill interest in a
- product by underpricing. "Oh, if it's only that much, it can't be
- very good." An effective marketing technique (perfected in the
- cosmetic industry and carried over into information products) is to
- build a mystique and sense of prestige around use of the product.
- The other end of the scale is give-away pricing... setting the
- information product price low or literally free in order to move
- associated products (usually computer hardware).
-
- Value added through combination: A database may
- attract limited interest in its own right. But combined with other
- data, whole new applications open up. A telephone book alone is
- useful for looking up individual addresses and phone numbers. Add
- mailing codes, type of dwelling, years in that residence, and
- demographics (relative rankings for small clusters of dwellings for
- income level, numbers of children, numbers of retired people, etc.)
- and then the combination proves potent for creating targeted
- mailing lists.
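-
- A small Python sketch of this kind of combination follows. The
- listings, the demographic rankings, the field names and the
- function name targeted_list are all invented for illustration;
- the assumption is simply that both sources share a mailing code.

```python
# Hypothetical telephone listings and a hypothetical demographic table
# keyed by mailing code (rank 1 = highest).
phone_book = [
    {"name": "A. Smith", "address": "12 Elm St", "mailing_code": "K1A"},
    {"name": "B. Jones", "address": "9 Oak Ave", "mailing_code": "M5V"},
    {"name": "C. Brown", "address": "40 Pine Rd", "mailing_code": "K1A"},
]
demographics = {
    "K1A": {"income_rank": 1, "retired_rank": 3},
    "M5V": {"income_rank": 4, "retired_rank": 1},
}

def targeted_list(listings, demographics, max_income_rank):
    """Join listings with demographics; keep those in the top income clusters."""
    selected = []
    for entry in listings:
        profile = demographics.get(entry["mailing_code"])
        if profile and profile["income_rank"] <= max_income_rank:
            selected.append({**entry, **profile})
    return selected

mailing = targeted_list(phone_book, demographics, max_income_rank=2)
print([m["name"] for m in mailing])   # ['A. Smith', 'C. Brown']
```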
-
-
- ══════════════════════════
- 3.9 Data ownership
- ══════════════════════════
-
- Data gatherers have an understandable interest in
- getting paid for their work. Public opinion has been rather
- casual. Copyright, at least in theory, provides protection for
- intellectual property. In reality, losses through illicit copying
- are substantial. The difficulty is that computer data is so very
- easily copied. Anti-piracy software and encryption of data offer
- partial protection; what they really do is raise the cost of
- illegal use high enough to discourage all but the most ardent
- computer hacker. Publishing media such as CD-ROM raise the cost by
- the sheer volume of data. Who would want to copy 600 megabytes
- onto hard disk? The worst nightmare for the data gatherer is the
- offshore commercial pirate who produces forged product and
- introduces it into the domestic market at lower prices.
-
-
- ═══════════════════
- 3.10 Summary
- ═══════════════════
-
- Collecting and entering data into a computer is the
- first stage in enabling people to find information quickly and
- easily in a gigabyte world. Data is raw material, selected
- according to a searcher's purposes, to create useful information.
- Data takes a variety of forms. Text data, that is, any data that
- can be entered through a keyboard, can be prepared for search much
- more readily than graphic or sound data.
-
- Methods of input directly affect the quality of data,
- and hence its potential value for the searcher. Use of standard
- media and data formatting dramatically lowers the costs of
- preparation for search. Marketing issues affect the cost and
- ultimately the quality of data products that are available for
- search.