- Xref: sparky comp.text:1499 comp.text.sgml:1238
- Newsgroups: comp.text,comp.text.sgml
- Path: sparky!uunet!gatech!destroyer!cs.ubc.ca!uw-beaver!fluke!inc
- From: inc@tc.fluke.COM (Gary Benson)
- Subject: Re: Marking up 'Automatically' -anyone
- Message-ID: <1992Dec25.230745.16019@tc.fluke.COM>
- Organization: John Fluke Mfg. Co., Inc., Everett, WA
- References: <20186.2b362c76@ul.ie> <19921223.044835.214@almaden.ibm.com>
- Date: Fri, 25 Dec 1992 23:07:45 GMT
- Lines: 110
-
- In article <19921223.044835.214@almaden.ibm.com> drmacro@ralvm13.VNET.IBM.COM writes:
- >In <20186.2b362c76@ul.ie> murraya@ul.ie writes:
- >>Can anybody give me details on an application to mark-up text from ordinary
- >>ascii files 'automatically', using certain Rules eg. two blank lines
- >>indicates the begining of a paragraph.
- >>
- >>Any information on the subject would be appreciated!
- >>
- >> <Aonghus>
- >>
- >
- >There are a number of products that can do this. I'm aware
- >of the following vendors:
- >
- >Avalanche Technology: FastTag and SGML Hammer products.
- >IBM: TextTagger/ESA, which is based on Avalanche's technology.
- >Zandar Corp: TagWrite for Windows
- >Software Exoterica: Omnimark
- >
-
- [ Remainder of posting (deleted) describes these products in more detail]
-
- Another alternative is to code it up yourself. Here at Fluke, we have been
- standardizing our input file format since the days of typesetting made it
- seem like a good idea to give consistent input to the typographer.
-
- Later, we automated most of the typesetting function using the technique you
- are asking about. This was at the exact moment of the debut of the Perl
- programming language, a natural for doing this kind of work. Here is how the
- man page starts: "Perl is an interpreted language optimized for scanning
- arbitrary text files, extracting information from those files, and
- generating reports based on that information."
-
- Now that we are using an SGML style of tagging for files we send into our Agfa
- CAPS publishing system, we are continually refining our programs and the
- input file format. The main reason we decided to roll our own was that none
- of the companies mentioned in this reply existed at the time that we had
- this need.
-
- One thing to be aware of if you start down the road of automatic tagging is
- that you should be ready to live with your decision long-term and be
- willing to commit to it. This is true whether you intend to write your own
- auto-gencoder or buy one. For example, I understand that FastTag from
- Avalanche uses a "Visual Recognition Engine" and some AI techniques to
- generate its output. This implies that you must have a way to feed it
- visually consistent hardcopy, and that you are willing to buy into that
- technique with whatever inherent weaknesses it might have.
-
- Our method, on the other hand, scanning ASCII text files, requires a real
- commitment, too, as well as the discipline involved in preparing the files.
-
- To give a picture of how our "gentext" program works, we have some general
- rules about files submitted for processing:
-
- 1. Standard text paragraphs must start at the left margin.
-
- 2. Structural headings start at the margin and indicate their level with
- the unique string ---n where n can be a number between 0 (Chapter)
- and 4.
-
- 3. Lists must occur at a new tab stop for each level of nesting.
-       A. Like this.
-       B. Or this.
-
- 4. Figure and Table titles must be on a line alone, separated above and
- below by a blank line, and at a different indent than the immediately
- preceding object.
-
- 5. NOTES, CAUTIONS, and WARNINGS must be indicated by the introducer
- word on a line by itself, a blank line, then the body of the NCW at a
- consistent indent, and one that is different from any object below.
-
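To make the rules above a little more concrete, here is a minimal Python sketch of two of the tests they imply: recognizing a heading marker and measuring a line's indent. It is not the gentext code. The function names are made up for illustration, and it assumes that the ---n marker opens the line with the title following it, and that tab stops fall every 8 columns.

```python
import re

# Assumption: the "---n" marker opens the line and the title follows it.
# Levels run from 0 (Chapter) to 4, as in rule 2.
HEADING_RE = re.compile(r"^---([0-4])\s*(.*)$")


def parse_heading(line):
    """Return (level, title) for a heading line, or None for anything else."""
    match = HEADING_RE.match(line)
    if match is None:
        return None
    return int(match.group(1)), match.group(2).strip()


def indent_width(line, tab_size=8):
    """Count leading whitespace columns, expanding tabs to tab stops."""
    expanded = line.expandtabs(tab_size)
    return len(expanded) - len(expanded.lstrip(" "))
```

A paragraph line gives parse_heading() a result of None and an indent_width() of 0, while a list item behind one tab has an indent of 8 under the assumed tab size.
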
- Using these general rules, we have a program scan the text one line at a
- time and check it for compliance with the definition of each object. If
- a line meets the definition of an object that includes indentation, any
- text at that indent is tucked away until the indent changes. In this way, we
- keep track of list nesting and multiple-paragraph objects. The default
- object is the standard text paragraph, so the program is fairly efficient.
- As a line is examined, gentext says, "is it a heading, is it a figure title,
- is it a NOTE, is it a numeric list???" and so on. As long as the answer is
- no, it just keeps checking. At the bottom, if everything else was no, then
- it must be a paragraph, and the line just goes to the output. Only when a
- check for an object, say "NOTE", succeeds do things slow down. If, for example,
- it finds the word NOTE alone on a line, it sets up some memory and begins
- shoveling following text into it as long as the indent does not change. The
- moment it does, the memory area is dumped to the output file, followed by the
- </note> code. Then the next line is examined.
-
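The scan-and-buffer loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the gentext program itself: tag_lines() and its helpers are hypothetical names, the opening <note> tag is assumed (only the closing </note> is mentioned above), body lines are written out with their indent stripped, and tabs are taken as 8 columns.

```python
# Assumed mapping from introducer word to tag name (only </note> is
# confirmed by the description; the other two are guesses).
ADMONITIONS = {"NOTE": "note", "CAUTION": "caution", "WARNING": "warning"}


def indent_of(line):
    """Leading whitespace width, with tabs expanded to 8-column stops."""
    expanded = line.expandtabs(8)
    return len(expanded) - len(expanded.lstrip(" "))


def tag_lines(lines):
    """Tag NOTE/CAUTION/WARNING objects; pass every other line through."""
    out = []
    tag = None          # tag name of the open object, or None
    body = []           # buffered body lines of the open object
    body_indent = None  # indent fixed by the first body line
    blanks = 0          # blank lines seen since the last body line

    def close_object():
        # Dump the buffer, then the closing tag, then any held blank lines.
        nonlocal tag, body_indent, blanks
        out.extend(body)
        out.append(f"</{tag}>")
        out.extend([""] * blanks)
        body.clear()
        tag, body_indent, blanks = None, None, 0

    for raw in lines:
        line = raw.rstrip("\n")
        if tag is not None:
            if not line.strip():
                blanks += 1          # hold blank lines until we know more
                continue
            if body_indent is None:
                body_indent = indent_of(line)
                blanks = 0           # drop the blank after the introducer
            if indent_of(line) == body_indent:
                body.extend([""] * blanks)
                blanks = 0
                body.append(line.strip())
                continue
            close_object()           # indent changed: classify this line anew
        word = line.strip()
        if word in ADMONITIONS:
            tag = ADMONITIONS[word]
            out.append(f"<{tag}>")
        else:
            out.append(line)         # default object: standard paragraph text
    if tag is not None:
        close_object()
    return out
```

Feeding it a file's lines (for instance tag_lines(open(path)) with some path of your own) returns the tagged lines. Headings, lists, and figure titles would each get their own test ahead of the paragraph default in the same chain.
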
- I only mention writing your own program as an alternative because I do not
- think that we are unique in having our own special requirements and wants.
- We are certainly interested in looking at commercially available products,
- but it may take them a while to reach the level of sophistication and
- extensibility we have now. Besides, by doing it ourselves, the software is
- completely accessible, something you definitely give up when you buy someone
- else's idea of the way it should be done!
-
- I am not trying to convince anyone of the relative merits of any particular
- approach, just affirming that the idea murraya@ul was asking about is indeed
- a viable way to tag files. I am looking forward to seeing what WordPerfect
- Markup looks like, too. A "tagging assistant" might be another way to get
- results conforming to your needs.
-
- --
- Gary Benson -_-_-_-_-_-_-_-_-_-inc@sisu.fluke.com_-_-_-_-_-_-_-_-_-_-_-_-_-_-
-
- Stupidity cannot be cured with money, or through education, or by legislation.
- Stupidity is not a sin; the victim can't help being stupid. But stupidity
- is the only universal capital crime; the sentence is death, there is no
- appeal, and execution is carried out automatically and without pity.
- -Lazarus Long
-