home *** CD-ROM | disk | FTP | other *** search
-
-
- This file offers additional comments, notes and hints on the Dada
- compiler given in DADA.PAS and described in Computer Language for
- December, 1985.
-
- 1. Several of the types and variables that are defined globally in
- DADA.PAS might better be declared as typed constants under Turbo
- Pascal. They would then be local to the procedures that refer to
- them and hence out of harm's way if some other procedure starts
- meddling where it doesn't belong. They must be typed "constants"
- rather than variables because they are static quantities: an error
- message, for example, must retain its value between calls to the
- error-listing routine. An added convenience of the Turbo typed
- constants is that they can be initialized at the time they are
- declared; thus the procedures InitKeywords and InitErrorList could
- be discarded.
-
- I avoided using typed constants because they are not a feature of
- standard Pascal.
-
- The following global variables (and their associated defined
- types) might be made local in this way:
-
- OutLine, OutPoint, OutBuf
- CH
- LineCount
- TypeSet, TFset, RelOpSet, AddOpSet, MultOpSet
- CurrentScope
- Keywords
- ErrorList
-
-
- 2. For simplicity I have defined the symbol table as a linked linear
- list of invariant records. Several improvements could be consid-
- ered. A binary tree structure would speed up the compiler; a hash
- table would go even faster. The only routines that would have to
- be changed are Find, Declare and Blot. It hardly seems worth the
- trouble, however; Dada is a toy language, and the compiler is
- unlikely to be given long programs, where speed of compilation
- would become an issue.
-
- If space rather than time is critical, a variant record could be
- defined, with SymClass as the tag field. The VarType field is
- needed only for variable entries and the Scope field only for
- procedure entries. The savings, however, amount to only one byte
- for VarType and two for Scope.
-
-
- 3. Doing a better job with errors is a lot of work. Stopping at the
- first error is acceptable in a compiler that has a built-in
- editor, where the position of the error can be indicated more or
- less exactly on screen. In a batch-style compiler, the favored
- strategy is to continue reading the source text, so that a single
- compilation will in principle catch errors throughout the program.
- The challenge is to recover from the first error well enough to
- continue without issuing a Niagara of spurious error messages. The
- Dada compiler is not capable of recovery. Removing the Halt state-
- ment from the Error procedure will demonstrate this effectively.
-
-
- 4. The error-checking done here uses compiler directives peculiar to
- Turbo Pascal. The main idea is that {$I-} allows the program to
- continue after an input-output error, leaving an error code in the
- system variable IoResult. If the operation is successful, IoResult
- is zero. For other Pascal compilers equivalent services should be
- available. A fuller and friendlier file-handling routine would
- report the nature of the error, offer the user a second chance,
- supply a default name for the output file, warn if an existing
- file will be overwritten, etc.
-
- A more convenient way of getting the file names would be to
- retrieve them from the the command line. I decided against doing
- so because the methods of accessing command-line arguments vary
- between Pascal compilers and between operating systems.
-
-
- 5. The simplicity of the scanner owes much to the simplicity of the
- language being recognized, but it is also made possible in part by
- delaying certain lexical decisions until the syntactic or even the
- semantic phase of program analysis. The scanner returns the same
- code (Ident) for all identifiers, whether they refer to variables
- or to procedures. The distinction between variable and procedure
- names is really a lexical one, but it is not made until the parser
- looks up the identifier in the symbol table.
-
- Postponing this part of the lexical analysis allows the scanner to
- operate without either lookahead of backtracking. At any moment,
- the scanner knows only the current input symbol (the content of
- variable CH), and once it has consumed that character it can never
- go back to read it again.
-
- In many compilers the scanner has broader responsibilities and is
- consequently more complex. For example, if the scanner (rather
- than the parser) installs identifiers in the symbol table, they
- must be classified unambiguously at the outset. In Dada this might
- be done by looking ahead, after finding an identifier, to see if
- it is followed by an assignment operator, which would imply that
- the identifier is (or ought to be) a variable name. If the
- assignment operator were not found, the pointer to the input
- stream would have to be backed up to the character immediately
- following the identifier.
-
-
- 6. The organization of the symbol table and the three routines that
- access the table are based on the most-closely-nested rule of
- Pascal and other block-structured languages. Since the table is
- searched from youngest to oldest, a call to Find(IdentifierName)
- will always return a pointer to the most recent entry that matches
- IdentifierName. This allows the compiler to distinguish between
- identifiers that have the same name but differ in scope.
-
- Suppose a Dada program makes the following declarations, causing
- the parser to make the symbol-table calls shown. For each line the
- value of CurrentScope is given in brackets:
-
- procedure Proc1; [1] { Declare(Proc1) }
- procedure Proc1A; [2] { Declare(Proc1A) }
- begin [3]
- end; [2] { Blot }
- procedure Proc1B; [2] { Declare(Proc1B) }
- procedure Proc1Ba; [3] { Declare(Proc1Ba) }
- begin [4]
- end; [3] { Blot }
- begin [3]
- end; [2] { Blot }
- begin [2]
- end; [1] { Blot }
- procedure Proc2; [1] { Declare(Proc2) }
- begin [2]
- end; [1] { Blot }
-
- The corresponding symbol-table entries would be built up by
- Declare and subsequently eliminated by Blot in the following
- order:
- NAME SCOPE NEXT CURRENTSCOPE
-
- FirstSym --> Proc1 1 ^<earlier> 1
-
- Proc1 1 ^<earlier>
- FirstSym --> Proc1A 2 ^Proc1 2
-
- Proc1 1 ^<earlier>
- Proc1A 2 ^Proc1
- FirstSym --> Proc1B 2 ^Proc1A 2
-
- Proc1 1 ^<earlier>
- Proc1A 2 ^Proc1
- Proc1B 2 ^Proc1A
- FirstSym --> Proc1Ba 3 ^Proc1B 3
-
- Proc1 1 ^<earlier>
- Proc1A 2 ^Proc1
- FirstSym --> Proc1B 2 ^Proc1A 2
-
- Proc1 1 ^<earlier>
- FirstSym --> Proc2 1 ^Proc1 1
-
- The appending of a symbol to the table by Declare is straight-
- forward. A new instance of the record structure is created, then
- the Next field of the new record is made to point to the current
- FirstSym, and finally FirstSym is adjusted to point to the new
- record. Thus each new record is put at the "front" of the list.
-
- The culling of old symbols by Blot is not quite as obvious. Each
- name must be made inaccessible once the program passes beyond the
- scope of that name. In the fragment above, for example, Proc2
- should not be able to call Proc1A, Proc1B or Proc1Ba because those
- procedures are local to Proc1. In Dada this is accomplished by
- eradicating the symbol-table entry for a procedure once the "end"
- that marks the limit of its scope is reached. The global variable
- CurrentScope is incremented each time a block is entered; Blot
- decrements CurrentScope and then works up the list, unlinking any
- entries whose scope is numerically greater than CurrentScope.
-
- Two further notes are needed. First, observe that "end" can appear
- in two contexts in Dada (or Pascal): at the end of a block or at
- the end of a compound statement. Blot must be called only on
- exiting a block, not on completing a compound statement. Second,
- the technique of making symbol-table entries inaccessible by
- simply erasing them may not work in more elaborate compilers. If
- there is an optimizing pass, for example, it will probably need to
- consult the symbol table. The proper approach then is to include a
- flag field in the record, so that a procedure like Blot can mark a
- name as hidden without destroying the entry. This also allows
- better error messages--"Variable referenced outside its scope"
- rather than "Undeclared variable."
-
-
- 7. Forth systems vary a good deal, not only in the language they
- accept but also in the file format they expect. The code generator
- given here creates "screen" files made up of 1,024-byte blocks.
- Each such block is conventionally viewed as a screen of 16 lines
- with 64 characters per line. There are no end-of-line delimiters;
- every character position that is not occupied by Forth text is
- filled with an ASCII blank. Input in this form should be accept-
- able to many Forth systems, provided they use the operating-system
- disk format. The published routines were tested with versions of
- PC/FORTH, from Laboratory Microsystems, Inc.
-
- The plasticity and mutability of Forth itself raises other
- problems. The various standards (fig-Forth, Forth-79, Forth-83)
- differ appreciably, and there are differences even among systems
- that are meant to conform to the same standard. I have tried to
- confine myself to Forth words that are common to all systems, but
- no doubt incompatibilities will turn up. The code generator can
- produce the following Forth words:
-
- ( ) FORTH DEFINITIONS DECIMAL CONSTANT VARIABLE :
- ; ;S --> @ ! . CR MINUS 0= = < > DUP
- DROP SWAP OVER KEY EMIT IF ELSE THEN BEGIN WHILE
- REPEAT OR AND - * + / MOD
-
- In addition, the following synonyms are defined in the header that
- precedes every program:
- 1 CONSTANT TRUE
- 0 CONSTANT FALSE
- : NEGATE MINUS ;
- : NOT 0= ;
- : <> = NOT ;
- : >= < NOT ;
- : <= > NOT ;
-
- If any of this will make your Forth system choke, change it, or
- add definitions to the header to provide compatibility.
-
-
- 8. The Dada compiler employs the parsing technique called recursive
- descent, which directly reflects the recursive definition of the
- language syntax. The program itself also illustrates this
- structure: the routines are nested so that each one is accessible
- only to the nest-higher-level routine that calls it. The overall
- structure looks like this:
-
- ParseProgram
- ParseVariables
- ParseBlock
- ParseStatement
- ParseExpression
- ParseSimpleExpr
- ParseTerm
- ParseSignedFactor
- ParseFactor
-
- Thus to recognize a program the parser must find variable
- declarations and a block; in recognizing a block it tries to
- resolve the input stream into statements, and so on.
-
- The recursive structure takes care of much of the bookkeeping that
- would otherwise burden the program (and the programmer). This is
- clearest in the parsing of expressions. In Forth as well as in
- most machine languages expressions must be given in postfix
- notation, whereas Dada and other "algebraic" languages use infix
- notation. Making the conversion requires maintaining a stack of
- pending operators. For example, the infix expression 3 * (4 + 5)
- must be converted to the postfix 3 4 5 + *; to do so the operators
- "+" and "*" must be stacked until both their operands have been
- read from the input stream.
-
- Examination of the parsing routines reveals no stack-management
- apparatus. How is it done? The stack employed is the return stack
- created by the storage-allocation subsystem of the Pascal com-
- piler. Although HoldMultOp, for example, is defined as a single
- variable, the executing program can actually create many instances
- of it. Every time ParseTerm is called, a new instance of
- HoldMultOp is pushed onto the return stack.
-
- One final note on the parser: Just as the scanner can see only the
- current character in the source file, the parser has access only
- to the current token in the input stream. There is no explicit
- lookahead or backtracking. Whereas the scanner really does its
- analysis one symbol at a time, however, the parser saves a great
- deal of information on its (hidden) stack. Indeed, the first
- identifier in a program--the name of the program itself--is saved
- until the main block at the very end of the program is reached.
-
-
- 9. The type checking done in DADA.PAS is not quite complete. The
- operators are classified as RelOps, AddOps and MultOps according
- to their precedence relations--multiplication must be done before
- addition and addition before comparison. There is another dimen-
- sion of classification, however, that the compiler ignores. Log-
- ical OR has the same precedence as multiplication, and logical AND
- is at the same level as addition, but the logical and arithmetic
- operations properly apply to values of different types. As the
- program stands, the compiler makes certain that the two operands
- in each operation have the same type, but there is no assurance
- the operator is appropriate. Thus "6 + FALSE" would generate an
- error, but "6 AND 7" or "TRUE * FALSE" would pass without com-
- plaint. Because Forth encodes boolean values as integers, such
- expressions will also execute without causing a run-time error.
-
- The remedy is straightforward, although it makes the code somewhat
- bulkier. ParseSimpleExpr and ParseTerm must be made to analyze
- logical and arithmetic expressions separately. Both operands of
- AND and OR must be boolean values, and both operands of +, -, *
- and / must be integers. In ParseExpression some decisions need to
- be made. Boolean values can surely be compared for equality, but
- is "TRUE > FALSE" a meaningful expression? All this becomes still
- more complicated when additional data types, such as strings and
- characters, are introduced.
-
-
- 10. There is a minor discrepancy between grammar and compiler in the
- function ParseStatement. The grammar for Dada, following that for
- Pascal, defines the semicolon as a statement separator, not a
- statement terminator. Hence the proper syntax of a compound state-
- ment is: begin
- A := B;
- B := C;
- C := D
- end
- where there is no semicolon following the last simple statement
- and preceding the "end". It's hard enough, however, remembering
- where to put semicolons without also remembering special rules
- about where not to put them. Many Pascal compilers therefore make
- the final semicolon optional--unneeded but harmless if present.
- The same policy is adopted in Dada.
-