- ═══════════════════════════════════════
-
- 6. WORKED EXAMPLES...
- VARIATIONS IN ASCII TEXT
-
- ═══════════════════════════════════════
-
-
- The next three topics will be richer to the extent that
- you and other readers provide samples that can be used to explain
- the various forms in which data is held in real files. The variety
- is staggering... in working with between 200 and 300 large
- databases in the late 1980s, I found that only a few formats and
- sets of rules were replicated entirely across databases. Many more
- databases had unique patterns or combinations of rules. But a word
- of encouragement... analysis really does get easier along the way.
-
-
- ════════════════════════════════
- 6.1 Other analysis tools
- ════════════════════════════════
-
- Here is an assortment of programs useful with ASCII
- text files. Source code is included on the diskettes.
-
- ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
- usage: lines file_name[s]
-
- Provides a quick count of the number of lines in each of
- one or more files.
-
- input: Any file[s], but most useful if ASCII text.
-
- output: A one line report on the screen of the number of lines in
- each input file.
-
- writeup: MIR TUTORIAL ONE, topic 6
- ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
-
- The DOS DIR command gives one measure of the size of a file
- (its length in bytes). The LINES command gives another, and it is
- quick. Try it on some of the source files. For example,
-
- LINES A_PATTRN.C A_BYTES.C LINES.C DOSIFY.C
-
- yielded this answer on the screen:
-
- 344 lines in file a_pattrn.c
- 335 lines in file a_bytes.c
- 132 lines in file lines.c
- 154 lines in file dosify.c
-
- 965 lines TOTAL
-
- The counts may differ when you try it, because your copy of each
- of these files may include later revisions.
-
- LINES is particularly useful in preparation for a SORT
- of a file.
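-
- As a rough illustration of what such a counter does, here is a
- minimal sketch in Python. It is not the LINES.C source from the
- diskettes; the function names count_lines and report are assumptions
- made for this example, and a final line with no terminating line
- feed is assumed to count as a line.
-
```python
def count_lines(path):
    """Count the lines in one file, reading raw bytes so any file can be measured."""
    count = 0
    last = b"\n"          # an empty file therefore reports zero lines
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            count += chunk.count(b"\n")
            last = chunk[-1:]
    # A non-empty file whose last byte is not a line feed has one more line.
    if last != b"\n":
        count += 1
    return count


def report(paths):
    """Build one report line per file, plus a total when several files are given."""
    counts = [(count_lines(path), path) for path in paths]
    rows = [f"{n:8d} lines in file {path}" for n, path in counts]
    if len(counts) > 1:
        rows.append(f"{sum(n for n, _ in counts):8d} lines TOTAL")
    return rows
```
-
- Calling report() with a list of file names and printing each
- returned row gives a screen report similar in shape to the one above.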
-
-
- ≡≡≡≡->> QUESTION:
- The programs that use lists of file names as inputs
- would be improved by a function to expand out wild
- cards in file names (each ? to be replaced by a single
- character, each * to be replaced by zero or one or
- several characters). Try your hand at it and share the
- result.
- <<-≡≡≡≡
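-
- One possible starting point, sketched in Python rather than in the
- C of the tutorial programs. It leans on the standard fnmatch module,
- which treats ? and * as described in the question (and also gives
- square brackets a special meaning, which DOS does not). The function
- name expand_wildcards and the file NOTES.TXT are made up for the
- example; names are compared in upper case on the assumption that DOS
- file names are not case sensitive.
-
```python
import fnmatch


def expand_wildcards(patterns, available):
    """Return the names in `available` matched by each pattern, in order, without repeats."""
    expanded = []
    for pattern in patterns:
        matches = [name for name in available
                   if fnmatch.fnmatchcase(name.upper(), pattern.upper())]
        # A pattern that matches nothing is passed through unchanged so the
        # calling program can report the missing file itself.
        for name in matches or [pattern]:
            if name not in expanded:
                expanded.append(name)
    return expanded


names = ["A_PATTRN.C", "A_BYTES.C", "LINES.C", "DOSIFY.C", "NOTES.TXT"]
print(expand_wildcards(["A_*.C", "?????.C"], names))
# prints ['A_PATTRN.C', 'A_BYTES.C', 'LINES.C']
```
-
- To expand against a real directory, the list of available names
- could come from os.listdir(), or the whole job could be handed to
- glob.glob().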
-
- Another program provides an analysis of line lengths:
-
- ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
- usage: a_len [interval] file_name[s]
-
- Analyze the distribution of line lengths up to 1024 bytes
- within any file. The reporting interval (an integer from
- 1 to 100) is a count of the lengths that will be grouped
- together. For example, an interval of 10 means that
- frequencies of length 0, length 1-10, length 11-20, etc.
- are shown in the report. The default interval is 10. If
- the first file name starts with numeric digits, give the
- interval explicitly so the name is not mistaken for it.
-
- input: Any ASCII file[s].
-
- output: file_name.len which reports the frequency of line lengths
- occurring in the file. Lengths exclude carriage returns
- and line feeds.
-
- writeup: MIR TUTORIAL ONE, topic 6
- ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
-
- Here is the result of
-
- A_LEN SVP_TXT
-
- 1 - 10: 63 1.5%
- 11 - 20: 154 3.7%
- 21 - 30: 160 3.9%
- 31 - 40: 186 4.5%
- 41 - 50: 185 4.5%
- 51 - 60: 832 20.2%
- 61 - 70: 2538 61.6%
-
- Over 80 per cent of lines are between 51 and 70 bytes
- long. None are longer. That's a very strong indication that the
- file is printable text. Non-displayable files are much more likely
- to have random distances between line feed characters. (For a more
- detailed list, try the command A_LEN 1 SVP_TXT.)
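-
- The grouping rule can be sketched in a few lines of Python. This is
- an illustration of the bucketing described in the usage note, not
- the A_LEN source; the function names are assumptions, and the
- 1024 byte limit of the real program is not modelled.
-
```python
from collections import Counter


def length_distribution(lines, interval=10):
    """Group line lengths into buckets: 0, 1..interval, interval+1..2*interval, ..."""
    buckets = Counter()
    for line in lines:
        length = len(line.rstrip("\r\n"))   # drop trailing carriage returns and line feeds
        # Bucket 0 holds empty lines; bucket k holds lengths (k-1)*interval+1 .. k*interval.
        buckets[(length + interval - 1) // interval] += 1
    return buckets


def format_report(buckets, interval=10):
    """Turn the bucket counts into report rows with a percentage of all lines."""
    total = sum(buckets.values())
    rows = []
    for k in sorted(buckets):
        low, high = (0, 0) if k == 0 else ((k - 1) * interval + 1, k * interval)
        share = 100.0 * buckets[k] / total
        rows.append(f"{low:4d} - {high:4d}: {buckets[k]:6d} {share:5.1f}%")
    return rows
```
-
- Passing an open text file to length_distribution() and printing
- the rows from format_report() gives a table of the same general
- shape as the one shown above.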
-
- A_LEN also tells us whether line-oriented utility
- programs are likely to work. Some line editors behave badly when
- data has long lines. It is common, for example, for versions of
- UNIX "vi" to choke on lines over 256 bytes in length.
-
- LINE_NUM has a variety of uses.
-
- ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
- usage: line_num [ starting_line_no ] < stdin > stdout
-
- Assign a sequence number to each line in a file, starting
- either at zero or at a user-specified sequence number.
-
- input: Any printable ASCII file.
-
- output: One line for each line of input. A sequence number is
- left justified, followed by a tab, then the input line
- exactly as received. Empty lines are counted, but left
- empty.
-
- writeup: MIR TUTORIAL ONE, topic 6
- ░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░
-
- Try it with the DOS command FIND. For example,
-
- LINE_NUM < A_BYTES.C | FIND "Usage"
-
- The result on the screen is:
-
- 62 void process(), Usage_(), report(), non_exist() ;
- 88 Usage_();
- 114 Usage_()
-
- In other words, the term "Usage" occurs on lines 62, 88 and 114.
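-
- The same pipeline can be imitated in Python. This is a sketch under
- assumptions, not the LINE_NUM source: the names number_lines and
- find are invented here, and find() only imitates the plain,
- case-sensitive substring match of the DOS FIND command.
-
```python
def number_lines(lines, start=0):
    """Yield each line as "number<TAB>text"; empty lines are counted but left empty."""
    for number, line in enumerate(lines, start):
        text = line.rstrip("\r\n")
        yield f"{number}\t{text}" if text else ""


def find(lines, needle):
    """Keep only the lines that contain the search string."""
    return [line for line in lines if needle in line]


source = ["void process(), Usage_();\n", "\n", "    Usage_();\n", "report();\n"]
for hit in find(number_lines(source, 1), "Usage"):
    print(hit)
```
-
- Because the sequence number travels with the line, any later
- filtering or sorting step still shows where each line came from.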
-
-
- ═════════════════════════════════
- 6.2 ASCII markup patterns
- ═════════════════════════════════
-
- The simplest form of text is called WYSIWYG (pronounced
- wizzy-wig). It stands for "What you see is what you get." A
- document or communication consists of content and form. Depending
- on how the author has arranged the content of a WYSIWYG file, you
- as a reader can get all of the content, but may miss much of the
- intended form.
-
- Does form matter? On even the simplest one page memo,
- yes! Lack of form imposes obstacles to understanding. We look to
- form to highlight answers to key questions: Who is this from?
- When was it prepared? What was uppermost in the author's mind?
- For whom is the message intended? What response is desired or
- intended? What is being stressed? What parts are subordinate
- explanation? How does one communicate a response to the author?
- Visualize a memo that simply runs everything together as a stream
- of words in a single paragraph. The answers may be there. But
- reading time goes way up. The message is therefore less likely to
- be read in full, and less likely to be understood correctly if it
- is read.
-
- The longer and the more complex the communication, the
- more form matters. Form guides the reader through the structure of
- the document. For example:
-
- » The level of a heading is indicated by location on a
- page, type size, case, font selection, proximity to
- other material, use of graphic enhancement, associated
- numbering, and even color. An unnumbered heading,
- centred and alone on a page, in bold upper case
- italics, carries a message quite distinct from exactly
- the same words preceded by "6.5.3", left justified,
- primarily lower case, and with following text
- continuing on the same line in the same font.
-
- » Structure is needed to sort out the cross referencing
- of complex material: Are footnotes at the bottom of
- the page, at the end of the chapter, or in their own
- section among the support sections at the end of the
- document?
-
- » A table of contents is intended to clarify structure.
-
- » A preface underlines purpose.
-
- » An index helps the user to find material of particular
- interest.
-
- You have just read a series of indented points. They
- are parallel and separated. The form is more readily understood
- than the same content run together into one paragraph.
-
- A computer file is essentially a stream of bytes. To
- indicate form, there has to be either:
-
- » a definition of the structure separate from the file;
- or
-
- » signals embedded within the file content.
-
- The simplest external definition is that a "line"
- contains a fixed number of bytes (often 80). Alternatively, the
- external definition may itself be a lengthy file.
-
- Internal signals or codes must be distinguishable from
- the content. These signals may involve characters unused in the
- text (nulls, non-printing characters), or they may be sequences of
- printable characters that are guaranteed not to appear as part of
- the content. Whatever their format, internal structure and
- formatting codes should also be consistent across the entire file.
- (But don't count on it. I was given a small database in which the
- sequence "<I>" had two or more quite distinct meanings. It turned
- out the database provider knew of the flaw and, instead of
- correcting it, resorted to physically pasting up copy for photo
- reproduction. Inconsistency is costly.)
-
-
- ══════════════════════════════════
- 6.3 Standard Generalized
- Markup Language (SGML)
- ══════════════════════════════════
-
- The word "rigor" is used in mathematics to convey
- notions of completeness, logic and disciplined consistency. The
- Standard Generalized Markup Language is an attempt to ensure rigor
- in the transfer of documents. Standards are paradoxical... at one
- and the same time an imposition and a convenience. In the case of
- SGML, the imposition is in having to learn and methodically apply
- methods of declaring structure and of representing codes within
- content. In exchange for the pain, the gain is clarity in what is
- intended by the author (or subsequent editor) in documents that
- adhere to the standard. As the standard takes hold, we indexers
- profit from ever greater consistency across databases. In
- the long run, we will spend much less time deciphering
- "orphan" database structures and markup methods.
-
- In SGML, the elements of form and structure are
- declared in a distinct document which can be transmitted with the
- database(s). Tagging is a separate task. Wide variability is
- allowed in the symbols used for tagging content, but consistency is
- enforced in the methods of tagging. One gets the impression that
- the early writers on SGML have inadvertently set more of a standard
- than they intended; one sees their particular choices of symbols
- being picked up as if they were part of the standard. Example:
- "the figures shown in <tableref>Table 7.2</tableref>..." where "<"
- starts an opening tag and "</" a closing tag. For our purposes,
- the more consistency the better! The less time spent setting up
- for alternative tag sets, the quicker we can get data into a form
- that permits automated indexing.
-
- One of the beauties of SGML is its breadth. WYSIWYG
- can be viewed as untagged SGML; one simply declares the structure
- to be an unstructured byte stream. We can write
- software to parse and manipulate this simplest form of text. The
- software may be expanded on an as-needed basis whenever we wish to
- add structure and formatting. In that sense, the form of text we
- use for automated indexing may be considered a form of SGML. If we
- want more speed in the inversion and indexing process, we can adopt
- a specific set of tags to be recognized by our parser. The
- cost is that we must convert data received from others to our tag
- types. If the incoming data is true SGML, then our preprocessing
- software can be a simple table replacement algorithm (easy stuff to
- write).
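-
- A table replacement of that kind might look like the following
- Python sketch. The tag names in TAG_TABLE are hypothetical (they come
- neither from the tutorial software nor from the SGML standard), and
- the pattern assumes the simple "<name>" and "</name>" delimiters of
- the example above, with no attributes or other SGML refinements.
-
```python
import re

# Hypothetical mapping from a supplier's tag names to the tag names our own
# parser recognizes.
TAG_TABLE = {
    "tableref": "tr",
    "title": "h1",
    "para": "p",
}

TAG_PATTERN = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)>")


def translate_tags(text, table=TAG_TABLE):
    """Replace each known tag name, keeping the "<" and "</" delimiters as they are."""
    def swap(match):
        slash, name = match.group(1), match.group(2)
        return f"<{slash}{table.get(name, name)}>"   # unknown tags pass through untouched
    return TAG_PATTERN.sub(swap, text)


print(translate_tags("the figures shown in <tableref>Table 7.2</tableref>..."))
# prints: the figures shown in <tr>Table 7.2</tr>...
```
-
- Keeping the table as data rather than code means a new supplier's
- tag set needs only a new table, not a new program.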
-
- Tutorial THREE deals with automated indexing. The
- primary version of the software is kept simple for instructional
- purposes. But there is nothing stopping us from building more
- sophisticated parsers based on more elaborate SGML declarations.
- Let's do it together, basing the expansions on real world needs
- that you encounter.
-
-
- ═════════════════════════════════════════
- 6.4 Free versus hierarchical text
- ═════════════════════════════════════════
-
- Text data includes virtually anything that can be typed
- on a computer keyboard. The most familiar is free text. Natural
- language is our normal way of communicating with one another.
- Words and phrases are grouped according to grammatical rules to
- form sentences and paragraphs. Words and phrases give each other
- context, so that communication creates a picture or impression in
- the mind of the person receiving it. The text is free in the sense
- that from a computer standpoint no divisions or rules are implied.
- There is no computer-based definition or limit to the length of a
- word or line or paragraph. This paragraph, considered by itself,
- is an example of free text.
-
- In lengthy communication, free text is sub-divided.
- The divisions may be as subtle as an extra line feed to mark the
- end of paragraphs. The more complex the document, the more likely
- it is divided into sections, chapters, or articles. Subdivisions
- are meant to be an aid to understanding. Our comprehension of a
- book is improved if we can associate what we are reading with a
- chapter name and book title. Hierarchical data is a term that
- covers this type. Each paragraph is connected with a hierarchy of
- headings (book, chapter, section, sub-section). Hierarchical data
- is also found in newspaper articles, business reports, dictionary
- entries, and encyclopedia entries. Each has one or more levels of
- headings.
-
- A variation on hierarchical data is text that is cross
- referenced. Cross references are an internal form of subject
- index. They are created manually and inserted within the text.
- Their creation is both expensive and subjective, since it
- relies on human judgment. Where a record touches on more
- than one subject, multiple cross reference paths may branch out
- from the one record. Heavily studied databases such as religious
- scriptures are often cross referenced.
-
- At some point, a threshold is passed whereby the rules
- are no longer (only) implied in the natural language, but are
- expressed in computer terms as well. The file is no longer treated
- as a simple byte stream. A hierarchical model of some sort is
- imposed on the data to distinguish among the various levels and
- categories of information. In Tutorial TWO we will see how the
- power of a hierarchical model can be wedded to the simplicity of
- a byte stream. It comes down to a simple trick of doing away with
- the assumption that records have to be numbered consecutively.
- More to follow!
-
-
- ════════════════════════════════════════
- 6.5 Fielded variable length text
- ════════════════════════════════════════
-
- Consider the following:
-
- Part Number: CL4-097-B
- Description: VALVE SPRING ASSEMBLY
- Quantity on hand, location 1: 57
- Quantity on hand, location 2: 0
- Quantity on hand, location 3: 16,212
- Quantity on hand, location 4: 8,004
- Usage, this month: 4,211
- Usage, same month last year: 3,654
- Usage, current year to date: 31,032
- Usage, last year to date: 23,996
- Economic order quantity: 1,000
- Cost per unit: $ 12.975
- Permitted substitute part #: CL4-997-X
- Permitted substitute part #: DY6-000-P
-
- The above is an inventory record. Each line is a
- "field", an element of data which describes one attribute of the
- item under discussion. Fielded data is very common in business
- records. Here is a variation on the same data:
-
- <pn>CL4-097-B\0<ds>VALVE SPRING ASSEMBLY\0<q1>57\0
- <q3>16212\0<q4>8004\0<um>4211\0<ul>3654\0<uy>31032
- \0<up>23996\0<eo>1000\0<co>12.975\0<sp>CL4-997-X\0
- <sp>DY6-000-P\0
-
- In the second version, the name or title is implied for
- each field by a mnemonic tag. The length of fields is variable in
- every sense... from one field to the next, and in the same field
- from one record to the next. The data size for one field may be as
- short as a single character. Since tags are used, fields may be
- dropped entirely when they are empty of data. Some fields may even
- be repeated (as in the substitute part field, which occurs twice).
-
- End of data within each field above is indicated by a
- marker, in this case "\0". Here is a variation that reduces the
- storage space for this particular record from 165 to 143 bytes.
- End of a field is signalled by the tag which begins the following
- field. An end of record tag has to be added so that the length of
- the last field is defined.
-
- <pn>CL4-097-B<ds>VALVE SPRING ASSEMBLY<q1>57<q3>16
- 212<q4>8004<um>4211<ul>3654<uy>31032<up>23996<eo>1
- 000<co>12.975<sp>CL4-997-X<sp>DY6-000-P<fn>
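-
- To make the tag-delimited form concrete, here is a small Python
- sketch that splits the record above back into fields. It assumes
- every tag is "<" plus two lower-case letters or digits plus ">", and
- that "<fn>" is the end of record tag; the names FIELD and
- parse_record are invented for the example. A value that itself
- contained "<" would defeat this simple parser.
-
```python
import re

RECORD = ("<pn>CL4-097-B<ds>VALVE SPRING ASSEMBLY<q1>57<q3>16212<q4>8004"
          "<um>4211<ul>3654<uy>31032<up>23996<eo>1000<co>12.975"
          "<sp>CL4-997-X<sp>DY6-000-P<fn>")

# Assumed tag shape: "<" + two lower-case letters or digits + ">".
FIELD = re.compile(r"<([a-z0-9]{2})>([^<]*)")


def parse_record(record):
    """Split a record into (tag, value) pairs, stopping at the end-of-record tag."""
    fields = []
    for tag, value in FIELD.findall(record):
        if tag == "fn":          # the end-of-record tag carries no data
            break
        fields.append((tag, value))
    return fields


fields = parse_record(RECORD)
print(len(RECORD))                                        # prints 143
print([value for tag, value in fields if tag == "sp"])    # the repeated field
```
-
- Fields come back as a list of pairs rather than a dictionary so
- that a repeated tag such as "sp" keeps every one of its values.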
-
-
- ≡≡≡≡->> QUESTION:
- Under what conditions does removal of the end of field
- marker sacrifice information or introduce ambiguity?
- <<-≡≡≡≡
-
- Why have fielded records been included in a topic on
- ASCII text? Precisely because each field contains printable ASCII
- characters. Analysis of ASCII fielded records is much simpler than
- analysis of their binary counterparts. Variable length ASCII
- fielded records are recognized by their frequent repetition of the
- tags. The A_PATTRN program can be used on the first character of
- the tag to pull out all occurrences.
-
- Here's another example of fielded variable length data:
-
- 000
- 001 Historic documents
- 002 United States
- 003 Civil War
- 004 Gettysburg Address
- 005 November 19, 1863
- 006 Lincoln, Abraham
- 006 President Abraham Lincoln
- 007 Fourscore and seven years ago our forefathers brought
- 007 forth upon this continent a new nation, conceived in liberty
- 007 and dedicated to the proposition that all men are created
- ...
-
- The example contains four levels of heading. 006 may
- be either an author or a historic person field; we don't know
- unless we view other records or are given access to the list of
- field definitions. Field 007 goes on at length. After field 007
- there may be other data such as cross references. The next record
- is identified by "000" at the left margin. The first three columns
- are, by virtue of position, field identifiers or tags. The fourth
- column is blank simply to enhance readability. This simple format
- is a first step from WYSIWYG toward SGML. Precisely because it is
- simple, I often use it as an intermediate step in setting up for
- indexing. (The earlier FindIt product used a more complex
- variation in which the fourth column contained codes. Hindsight
- shows that the simpler version is more powerful.)
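-
- Reading this column-tagged format takes very little code, as the
- following Python sketch suggests. It assumes exactly what is
- described above: a tag in columns one to three, a blank fourth
- column, and "000" opening each record. The name parse_records is
- invented here. The format does not say whether repeated tags are
- continuation lines (as 007 appears to be) or separate values (as 006
- appears to be), so the sketch keeps every line as its own entry.
-
```python
SAMPLE = """000
001 Historic documents
002 United States
003 Civil War
004 Gettysburg Address
005 November 19, 1863
006 Lincoln, Abraham
006 President Abraham Lincoln
007 Fourscore and seven years ago our forefathers brought
007 forth upon this continent a new nation, conceived in liberty
"""


def parse_records(lines):
    """Group lines into records; each record is a list of (tag, text) pairs."""
    records = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue
        tag, text = line[:3], line[4:]   # columns 1-3 hold the tag, column 4 is blank
        if tag == "000":                 # "000" at the left margin starts a new record
            records.append([])
        elif records:
            records[-1].append((tag, text))
        # Lines that appear before the first "000" are ignored in this sketch.
    return records


records = parse_records(SAMPLE.splitlines())
print(len(records), len(records[0]))
print([text for tag, text in records[0] if tag == "006"])
```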
-
- In fixed length ASCII fielded records, a pattern noticeably
- recurs... possibly blocks of white space, or fields containing
- dates (MAY 22 92), that stand out readily and shift the same number
- of columns left or right at regular vertical intervals.
-
-
- ══════════════════════════════════════════════
- 6.6 Independent versus continuous data
- ══════════════════════════════════════════════
-
- Records occur in some physical order within a computer
- file. The King in "Alice's Adventures in Wonderland" preferred a
- simple order for things: "Begin at the beginning, and go on till
- you come to the end: then stop." In a document such as a book
- or a business report, there is usually a distinguishable beginning
- and end. In a database of library books and periodicals, the
- physical order may be date of acquisition... the later the date of
- receipt, the further the record is toward the end of the sequence
- of records. In an inventory file, either part number or date of
- addition may govern the physical order.
-
- Try the command
-
- FIND "HEAD4" SVP_TXT
-
- It turns out that this data set is ordered according to sequential
- numbering of the various pieces of correspondence. The result of
- the above command has 106 lines, starting as follows:
-
- @HEAD4 = 417. - TO SAINT LOUISE DE MARILLAC,<B^>1<D> IN ANGERS
- @HEAD4 = 418. - TO LOUIS ABELLY,<B^>1<D> VICAR GENERAL OF BAYONNE
- @HEAD4 = 419. - TO SAINT LOUISE, IN ANGERS
- @HEAD4 = 420. - TO SAINT LOUISE, IN ANGERS
- @HEAD4 = 421. - TO SAINT LOUISE, IN ANGERS
- @HEAD4 = 422. - TO SAINT LOUISE, IN ANGERS
- @HEAD4 = 423. - TO LOUIS LEBRETON,<B^>1<D> IN ROME
- @HEAD4 = 424. - TO JACQUES THOLARD,<B^>1<D> IN ANNECY
-
- Being able to detect sequence within one field helps in
- the data analysis. This is because patterns show up more quickly
- when there is a strictly ordered field within the data. An ordered
- field helps to determine sequence when the total data set has to be
- pieced together from a number of files. And if the data has been
- garbled or truncated, order facilitates repair. Ah, yes, Virginia,
- there is damaged data out there... a lot of it. There is nothing
- like indexing a set of data to find all kinds of errors in it.
- Every garbled spelling and extraneous piece of garbage shows up in
- the list of indexed terms.
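-
- An ordered field also makes damage easy to detect by program. The
- Python sketch below scans heading lines of the "@HEAD4 = 417. - ..."
- form shown above and reports every place where the numbering does
- not go up by exactly one. The names HEADING and sequence_breaks are
- invented for the example, and the sample list leaves out letter 420
- on purpose to simulate a gap.
-
```python
import re

HEADING = re.compile(r"^@HEAD4 = (\d+)\.")


def sequence_breaks(lines):
    """Return (previous, current) pairs wherever the numbering does not go up by one."""
    breaks = []
    previous = None
    for line in lines:
        match = HEADING.match(line)
        if not match:
            continue                      # not a heading line
        current = int(match.group(1))
        if previous is not None and current != previous + 1:
            breaks.append((previous, current))
        previous = current
    return breaks


headings = [
    "@HEAD4 = 417. - TO SAINT LOUISE DE MARILLAC,<B^>1<D> IN ANGERS",
    "@HEAD4 = 418. - TO LOUIS ABELLY,<B^>1<D> VICAR GENERAL OF BAYONNE",
    "@HEAD4 = 419. - TO SAINT LOUISE, IN ANGERS",
    "@HEAD4 = 421. - TO SAINT LOUISE, IN ANGERS",
    "@HEAD4 = 422. - TO SAINT LOUISE, IN ANGERS",
]
print(sequence_breaks(headings))   # prints [(419, 421)]
```
-
- A gap points to missing or truncated data; a number that goes
- backward suggests files pieced together in the wrong order.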
-
- If records are truly independent and in no sequence,
- there is no need for the retrieval software to display nearby
- records. But if there is continuity in the data, the person
- searching through the data will want to see the records
- that occur before and after the records found by a search.
-
-
- * * * * *
-
- In this topic, we have looked at tools for analyzing
- ASCII text files. We found that formatting and markup of ASCII
- text is extremely varied. Indexing will become simpler and less
- time consuming as standards become widely accepted. Physical
- storage presupposes a flat data model; hierarchical models can be
- inferred by segregating data into fields. Continuous data is
- easier to work with in index preparation than truly independent
- non-sequential files.