- TECHNICAL NOTES ON ARCSGML 1.0
-
- NOTE: ISO 8879 (Standard Generalized Markup Language) is
- required reading for anyone attempting to implement an SGML
- application. Its complete official text, incorporating
- the 1988 amendment, extensively annotated and cross-referenced
- and with additional tutorial material, can be found in my
- book, The SGML Handbook, published by Oxford University
- Press, ISBN 0-19-853737-9. It is available at bookstores and
- through the SGML Users' Group.
-
- A. Compilation and Such
-
- All Characters are Unsigned
-
- Work on ARCSGML began in 1983, at the dawn of PC C compiler
- technology, so a number of compilers have been used over the years.
- As a result, some occurrences of the "char" type are coded with the
- typedef "UNCH", which stands for "unsigned character". As compilers
- came along that offered compile-time switches to make all "chars"
- unsigned, I stopped using "UNCH" in places. The bottom line is that
- all chars must be unsigned -- how you accomplish it depends on your
- coding style and the facilities of your compiler.
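-
- As an illustration of why unsigned characters matter, here is a minimal
- sketch (not taken from the ARCSGML source). Only the typedef name UNCH
- comes from the text above; count_chars and its counts table are
- hypothetical names used for the example.
-

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char UNCH;   /* "unsigned character", as named above */

/* Hypothetical example: tally how often each character code occurs.
   Because the buffer elements are UNCH, codes 128 to 255 index slots
   128 to 255 of the table. With a plain char that happens to be signed,
   the same bytes could turn into negative indexes. */
void count_chars(const UNCH *buf, size_t len, unsigned counts[256])
{
    size_t i;

    assert(buf != NULL && counts != NULL);
    for (i = 0; i < len; i++)
        counts[buf[i]]++;
}
```

- A compiler switch that makes every "char" unsigned has the same effect
- without the typedef; the sketch only shows the explicit style.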
-
- Non-DOS SGML I/O
-
- NDSGMLIO.SRC is a version of SGMLIO.C with alternative code for
- other compilers and operating systems. A variant is selected by
- defining one of the following preprocessor names:
-
-    TURBOC   Borland C++ 2.0 (same code as in SGMLIO.C)
-    MICROC   Microsoft C 5.1
-    IBMC     IBM C for mainframes (not tested)
-    C88      DeSmet C88 (not tested recently)
-
- SGMLIO.C uses a Borland-specific DOS-specific function called
- "searchpath". Some alternatives for other environments can be found
- in NDSGMLIO.SRC.
-
- When NDSGMLIO.SRC is consulted, NDENVCB.SRC will also be of interest.
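-
- For readers porting to an environment without "searchpath", the
- following is one possible portable substitute, written only with the
- standard C library. It is a sketch and does not reproduce the code in
- NDSGMLIO.SRC; the name find_on_path, its parameters, and the use of '/'
- to join directory and file name are all assumptions made for the
- example.
-

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper, not part of ARCSGML: try `name` in each directory
   listed in `dirs` (entries separated by `sep`, for example ';' or ':').
   On success the full path is left in `out` and 1 is returned; if no
   directory holds a readable file of that name, 0 is returned. */
int find_on_path(const char *name, const char *dirs, char sep,
                 char *out, size_t outsz)
{
    char seps[2];

    assert(name != NULL && dirs != NULL && out != NULL);
    seps[0] = sep;
    seps[1] = '\0';

    while (*dirs != '\0') {
        size_t dlen = strcspn(dirs, seps);   /* length of this entry */

        /* Skip empty entries and paths that would not fit in `out`. */
        if (dlen > 0 && dlen + 1 + strlen(name) + 1 <= outsz) {
            FILE *fp;

            memcpy(out, dirs, dlen);
            out[dlen] = '/';
            strcpy(out + dlen + 1, name);

            fp = fopen(out, "r");            /* readable means found */
            if (fp != NULL) {
                fclose(fp);
                return 1;
            }
        }
        dirs += dlen;
        if (*dirs == sep)
            dirs++;                          /* step over the separator */
    }
    return 0;
}
```

- A caller would typically pass the value of an environment variable
- such as PATH as the directory list.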
-
- Lint
-
- Conventional comments for lint are for PC-Lint 2.11 by Gimpel
- Software. However, lint was not run for the SGMLUG original
- distribution (ARCSGML 1.0) because Borland C appears to provide
- sufficient checking.
-
- Borland C was run with ALL warnings enabled, which required lots of
- explicit type casting. No warnings or errors were issued.
-
- B. Sample Applications ("Text Processors")
-
- The sample text processor that performs markup validation is called
- VM2.C or VM2.EXE. If invoked with the /S option, it will print an
- excruciatingly detailed report of the parsing process (see VM2HELP.DOC
- for details). This text processor is used for testing the internal
- functions of the parser, so it needs access to "non-ESIS" information
- not provided through the usual API. As such, it is a poor model for
- normal text processors, which access the parser control block data
- through a friendlier interface provided by SGMLAPI.C. (See
- ARCSGML.DOC for details of the program architecture.)
-
- The sample text processor called APIMODEL is a better model (although
- presently untested), as it uses only the normal interfaces of SGMLAPI.
- It reads an SGML document and produces a slightly reformatted version.
- Although it is not complete, it is easy to see how to use it as a
- model for other "translation" applications, such as SGML to a word
- processor file.
-
- C. System Declaration Information
-
- A parser constructed with ARCSGML, when used in a conforming SGML
- system on an ASCII machine (e.g., PC or PS2), should be capable of
- conforming to the following system declaration:
-
- SYSTEM "ISO 8879:1986"
- CHARSET
- BASESET "ISO 646:1983//CHARSET
- International Reference Version (IRV)//ESC 2/5 4/0"
- DESCSET 0 128 0
-
- CAPACITY PUBLIC "ISO 8879:1986//CAPACITY Reference//EN"
- SCOPE DOCUMENT
- SYNTAX PUBLIC "ISO 8879:1986//SYNTAX Reference//EN"
-
- FEATURES
- MINIMIZE DATATAG NO OMITTAG YES RANK NO SHORTTAG YES
- LINK SIMPLE NO IMPLICIT NO EXPLICIT NO
- OTHER CONCUR NO SUBDOC NO FORMAL YES
-
- VALIDATE GENERAL YES
- MODEL NO EXCLUDE YES CAPACITY YES
- NONSGML YES SGML NO FORMAL YES
-
- D. Interface Specification Notes
-
- The function calls from the text processor to the parser are documented
- in SGMLAPI.H. That file also describes the C structures with which the
- API returns information from the parser. Other parser control blocks
- (with which the text processor normally need not be concerned) are
- documented in the other header files. The following notes supplement the
- information in SGMLAPI.H:
-
- 1. Attribute values and SDATA
- entities could contain delimited processing instructions.
- 2. Attribute values and CDATA, SDATA, and PI entities could contain
- non-SGML characters (as can content data, described below).
- 3. Inline processing instructions are not parsed (except for PIC),
- so character references are not recognized, and non-SGML characters
- cannot occur validly (i.e., through character references).
- 4. There are three internal delimiters
- that SGML will insert into returned data:
-    tpsw.delcdata  brackets a CDATA entity returned in an
-                   attribute value.
-    tpsw.delsdata  brackets an SDATA entity so returned.
-    tpsw.delnonch  precedes a non-SGML character (see below).
- In addition, an offset,
- tpsw.addnonch, is added to a non-SGML character (modulo 256).
- The delimiters and offset are set in the switch structure; the
- delimiters must be non-SGML characters other than 0, 8, 9, 10, 13, 26,
- and 28. (29, 30, and 31 are the recommended delimiters; 64 is the
- recommended offset.)
- The CDATA and SDATA brackets can be removed when the data
- is to be formatted.
- The NONCHAR prefix might be needed until the last moment
- for those NONCHARs that have special meaning to a text
- processor, so tpsw.delnonch should be chosen accordingly.
- 5. When the recommended delimiters and offset are in use,
- non-SGML characters will be returned as 31,char+64 (modulo 256).
- For example: decimal 22 will be returned as decimal 31, followed
- by decimal 86; decimal 255 as 31 63.
- 6. For a NOTATION attribute, SGML returns the resolved notation identifier
- as ADDATA(n), which is a pointer to a NOTATION control block (see below).
- 7. When a start-tag is returned, IDREFL(n) is 1 if the corresponding IDREF
- token refers to an existing ID; it is 0 if the ID hasn't been defined.
- 8. When the parser returns a start-tag with attributes, an attribute that
- must always be a single token is treated as CDATA, as distinguished
- from a list that just happens to be one token long.
- 9. The PUBLIC identifier is passed to SGMLIO in ipbfile.ipbp1
- when resolving external identifiers for entities and data content
- notations (programs that access non-SGML data entities).
- 10.For RCBDAF returns: if NDESW is on, the data pointer is to an NDATA
- entity control block. For AENTITY attributes, ADDATA(i) points to the
- NDATA ecb. The NDATA ecb (struct ne in ADL.H) in turn points to a NOTATION
- control block (struct dcncb in ENTITY.H). There is code in VM2.C
- that shows how to handle these control blocks. (You can embed
- image and graphic entities anywhere that a character entity could
- be referenced.)
- 11.You can support device-dependent versions of
- PUBLIC entities (e.g., non-keyable graphic characters), according
- to the syntax of formal public identifiers as defined by the Standard.
- In the FILENAME ipb, the parser uses four entity type codes to indicate
- that a public identifier is versionless and that the best available
- version should be substituted. SGMLIO must then find the
- best available version. (See SGMLCB.H and SGMLIO.C for details.)
- 12.If the document element terminates other than by its own end-tag
- or by the end of the document entity, an error message (89) is issued
- and the parser returns an EOD rcb. Any remaining text in the
- primary entity is ignored.
- 13.Multiple passes are supported. The
- pointers to rc and rcb in the text processor are passed in
- the switches structure, whose address is sent to the parser
- in the SGMLSET ipb. At the
- end of the run, SGMLEND must be sent to allow final
- cleanup by the parser. (See VM2.C for the details.)
- 14.SGML keeps a buffer only for the current file. SGMLIO saves and
- restores the location when a nested file is opened.
- 15.A stack of positions (source control blocks) is returned when an error
- occurs, so a full entity trace can be printed in error messages.
- 16.Trailing RE RS suppression and leading RS insertion for external entities
- are set by a user parameter (overridable in the system ID of the ENTITY
- declaration).
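-
- Notes 4 and 5 above can be illustrated with a short sketch of how a text
- processor might restore non-SGML characters from returned data. It
- assumes the recommended values (prefix 31, offset 64) and hard-codes
- them as macros; in a real program they would be read from tpsw.delnonch
- and tpsw.addnonch. The function names are hypothetical and are not part
- of SGMLAPI.
-

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char UNCH;

/* Recommended values from note 4 (assumed here, not read from tpsw). */
#define DELNONCH 31   /* prefix that precedes a non-SGML character */
#define ADDNONCH 64   /* offset added to the character, modulo 256 */

/* What the parser is described as returning after the prefix. */
UNCH encode_nonch(UNCH c)
{
    return (UNCH)(c + ADDNONCH);            /* wraps modulo 256 */
}

/* Hypothetical helper: copy `len` bytes from `in` to `out`, replacing
   every prefix,value pair by the original non-SGML character.
   `out` must have room for `len` bytes. Returns the bytes written.
   A prefix with nothing after it is copied through unchanged. */
size_t decode_nonch(const UNCH *in, size_t len, UNCH *out)
{
    size_t i = 0, n = 0;

    assert(in != NULL && out != NULL);
    while (i < len) {
        if (in[i] == DELNONCH && i + 1 < len) {
            /* Undo the offset; adding 256 keeps the value positive
               before the conversion back to UNCH. */
            out[n++] = (UNCH)(in[i + 1] + 256 - ADDNONCH);
            i += 2;
        } else {
            out[n++] = in[i++];
        }
    }
    return n;
}
```

- With these values, encode_nonch(22) gives 86 and encode_nonch(255)
- gives 63, matching the examples in note 5.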
-
- E. Debugging Aids
-
- VM2, because it reveals the internal state of the parser, can be a
- useful debugging aid when the /S option is enabled. In addition, the
- TRACESET module contains routines to format and print the parser's
- internal control blocks, and to display a vertical "tape" of the
- characters parsed and the state transitions they caused.
-
- The various traces are enabled by setting environment variables to an
- arbitrary value so that the variables will exist (e.g., SET T=1).
- There are separate variables for the prolog (for debugging the DTD
- parsing) and the document instance. They are:
-
-    Prolog   Instance
-    PT       T          trace state transitions
-    PA       A          trace attribute activity
-    (none)   C          trace context checking
-    PD       D          trace declaration parsing
-    PE       E          trace entity activity
-    PG       G          trace group creations
-    (none)   I          trace ID activity
-    PM       M          trace MS activity
-    PN       N          trace notation activity
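-
- The naming rule in the table above, and the "exists, value ignored"
- test, can be sketched as follows. This is an illustration only: the
- function names are hypothetical, and the use of getenv is an assumption
- about how such a check could be written, not a description of the
- TRACESET code.
-

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Build the environment variable name for one trace letter.
   `prolog` nonzero selects the prolog column (a 'P' prefix);
   C and I exist only for the document instance.
   Returns 1 and fills `name`, or 0 if the table has no such variable. */
int trace_var_name(char letter, int prolog, char name[3])
{
    assert(name != NULL);
    if (letter == '\0' || strchr("TACDEGIMN", letter) == NULL)
        return 0;                       /* not a letter from the table */
    if (prolog && (letter == 'C' || letter == 'I'))
        return 0;                       /* no prolog form for C and I */
    if (prolog) {
        name[0] = 'P';
        name[1] = letter;
        name[2] = '\0';
    } else {
        name[0] = letter;
        name[1] = '\0';
    }
    return 1;
}

/* A trace counts as enabled when its variable exists at all;
   the value (for example the "1" in SET T=1) is ignored. */
int trace_enabled(char letter, int prolog)
{
    char name[3];

    return trace_var_name(letter, prolog, name) && getenv(name) != NULL;
}
```

- For example, trace_enabled('T', 1) looks for PT and
- trace_enabled('T', 0) looks for T.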
-