home *** CD-ROM | disk | FTP | other *** search
Text File | 1990-01-18 | 73.9 KB | 1,756 lines |
-
-
-
-
-
-
-
-
-
-
-
-
- LEX AND YACC - COMPILER WRITER'S TOOLS FOR TURBO PASCAL
- VERSION 2.0
-
- Albert Graef
- FB Mathematik
- Johannes Gutenberg-Universitaet Mainz
-
- November 22, 1989
-
- (ASCII version of manual: January 12, 1990)
-
-
- ABSTRACT
-
- We describe a reimplementation of the compiler writer's tools Lex and
- Yacc for Borland's Turbo Pascal, running under MS-DOS. These programs
- are useful tools for the development of compilers, lexical analyzers,
- and similar applications, and are intended for experienced Turbo Pas-
- cal programmers with a background in compiler design, and for courses
- in compiler construction.
-
- Note: This is a raw ASCII facsimile of the original Lex and Yacc
- manual contained in the TeX DVI file MAN.DVI on the distribution disk.
-
- CONTENTS
-
- Introduction
- 1. Installation
- 2. Lex
- 2.1 Regular Expressions
- 2.2 Actions
- 2.3 Lex Library
- 2.4 Character Tables
- 2.5 Turbo Pascal Tie-ins
- 2.6 Implementation Restrictions, Bugs, and Incompatibilities
- 3. Yacc
- 3.1 Actions
- 3.2 Lexical Analysis
- 3.3 Yacc Library
- 3.4 Ambiguity
- 3.5 Error Handling
- 3.6 Arbitrary Value Types
- 3.7 Debugging
- 3.8 Yacc Language Grammar
- 3.9 Additional Features, Implementation Restrictions and Bugs
- Conclusion
- References
- Appendix
-
- INTRODUCTION
-
- This manual describes two popular compiler writer's tools, Lex and
- Yacc, which have seen extensive use on the UNIX system, and have been
- reimplemented by the author for the MS-DOS operating system.
-
- The original (UNIX) versions of these programs are described in [2,3].
- Other public domain and commercial remakes of Lex and Yacc are avail-
- able under MS-DOS, e.g. from DECUS and Mortice Kern Systems. However,
- in difference to these implementations, the programs described in this
- manual are for use with Borland's Turbo Pascal, rather than with C. In
- particular, they support Turbo Pascal as their host and target
- programming language.
-
- The Turbo Pascal Lex and Yacc versions are an independent development
- of the author, not containing any fragments from the original sources,
- or other UNIX stuff (happily, the theory underlying Lex and Yacc has
- been published, e.g., in [1], und thus is public domain). However, the
- names Lex and Yacc (which, as far as I know, are not copyrighted) are
- justified by the fact that the programs described here are intended to
- be approximately compatible with the original versions.
-
- The intended audience for this manual are experienced (Turbo Pascal)
- programmers with knowledge of the basics of formal language and com-
- piler theory and students in compiler construction courses, not
- novices or casual programmers. Thus, the manual is particularily con-
- cise and compact, while trying to cover all essential aspects of Turbo
- Pascal Lex and Yacc.
-
- As a supplementary text, we strongly recommend the famous "dragon
- book" of Aho, Sethi and Ullman [1] which covers all important
- theoretical and practical aspects of compiler design and implementa-
- tion.
-
- The manual is organized as follows: Section 1 covers installation re-
- quirements and the installation process. Section 2 treats the lexical
- analyzer generator Lex; subsections are devoted to the format of
- regular expressions, the implementation of actions, the Lex library
- unit, character tables, Turbo Pascal tie-ins, implementation restric-
- tions, bugs, and incompatibilities. Section 3 discusses the parser
- generator Yacc; subsections deal with actions in Yacc grammars, the
- lexical analyzer routine used by Yacc-generated parsers, the Yacc
- library unit, ambiguities in Yacc grammars, syntactic error handling,
- arbitrary value types in actions, the debugging of Yacc-generated par-
- sers, the Yacc language syntax, and, finally, additional features, im-
- plementation restrictions and bugs. The appendix contains short des-
- criptions of Lex and Yacc in the style of UNIX manual pages.
-
- Note: Throughout the manual, the terms Lex and Yacc refer to the Turbo
- Pascal versions. The original (UNIX) versions are denoted by the terms
- UNIX or standard Lex resp. Yacc.
-
- 1. INSTALLATION
-
- Installation requirements:
- - IBM PC/XT/AT or compatible, 512 KB RAM
- - MS-DOS 3.10 or later (may also run under MS-DOS 2.x, but I have not
- tested it)
- - Turbo Pascal 4.0 or later (has been tested with 4.0 and 5.0)
-
- To install Turbo Pascal Lex and Yacc, simply copy the contents of the
- distribution disk to an appropriate disk and/or directory. You might
- also wish to put this directory on your DOS path. The programs
- generated with Lex and Yacc will need the LexLib and YaccLib units
- (.tpu files on the distribution disk) when compiled, so you might have
- to put them anywhere the Turbo Pascal compiler finds them (e.g., in
- the turbo.tpl library).
-
- Here's the contents of the distribution disk:
- lex.exe the Lex program
- lexlib.* source and .tpu file for the LexLib unit
- yacc.exe the Yacc program
- yacclib.* source and .tpu file for the YaccLib unit
- read.me if present, contains addenda to the manual
- makefile makefile for the sample programs
- *.l, *.y sample Lex and Yacc programs
- man.dvi TeX dvi file for the manual
-
- As shipped, the LexLib and YaccLib units are compiled with Turbo Pas-
- cal 4.0. If you're running Turbo Pascal 5.0 or later, you will have to
- recompile lexlib.pas and yacclib.pas with your version of the Turbo
- Pascal compiler.
-
- You can use the makefile to compile the sample programs on the dis-
- tribution disk (see the Turbo Pascal manual for a description of Bor-
- land's make, and the makefile for a description of its usage and the
- sample programs).
-
- To run Turbo Pascal Lex and Yacc on your grammar sourcefiles, refer to
- the manual pages in the appendix for a description of the Lex and Yacc
- command formats.
-
- 2. LEX
-
- Lex is a program to generate lexical analyzers from a given set of in-
- put patterns, specified as regular expressions. Table 1 summarizes the
- regular expressions Lex recognizes. In this table, c stands for any
- single character, r for a regular expression, and s for a string.
-
- Expression Matches Example
- --------------------------------------------------
- c any non-operator character c a
- \c character c literally \*
- "s" string s literally "**"
- . any character but newline a.*b
- ^ beginning of line ^abc
- $ end of line abc$
- [s] any character in s [abc]
- [^s] any character not in s [^abc]
- r* zero or more r's a*
- r+ one or more r's a+
- r? zero or one r a?
- r{m,n} m to n occurrences of r a{1,5}
- r1r2 r1 then r2 ab
- r1|r2 r1 or r2 a|b
- (r) r (a|b)
- r1/r2 r1 when followed by r2 abc/123
- --------------------------------------------------
- Table 1: Lex regular expressions (taken from [1, fig. 3.48]).
-
- A Lex program, or grammar, in general, consists of three sections
- separated with the delimiter %%:
- definitions
- %%
- rules
- %%
- auxiliary procedures
-
- Both definitions and rules section may be empty, and the auxiliary
- procedures section may be omitted, together with the second %%. Thus,
- the minimal Lex program is
- %%
- (no definitions, no rules).
-
- The rules section of a Lex program is a table of regular expressions
- and corresponding actions, specifying patterns to be matched in the
- input, and (Turbo Pascal) program statements to be executed when a
- pattern has been matched:
- expression statement;
- ...
-
- Here, expression and statement are delimited with whitespace. The
- statement must be a single Turbo Pascal statement (use begin ... end
- for compound statements) terminated with a semicolon; if the statement
- consists of multiple lines, the continuation lines must be indented
- with at least one blank or tab character. An action may also be
- replaced by the symbol |, in which case it is assumed that the action
- for the current rule is the same as that for the next one.
-
- As already indicated, the auxiliary procedures section is optional. If
- it is present, it is assumed to contain valid Turbo Pascal code (such
- as supplementing routines, or a main program) which is simply tacked
- on to the end of the output (Turbo Pascal) program Lex produces.
-
- The definitions section of a Lex program may contain regular defini-
- tions of the form
- name expression
- defining a (regular expression) substitution for an identifier name
- (according to Turbo Pascal syntax). name and expression must be
- separated by whitespace. Note that in difference to Pascal, upper- and
- lowercase in identifiers is always distinct.
-
- The value of a regular definition for name can be referred to lateron
- using the notation {name}. Thus, regular definitions provide a sort of
- "constant declaration" for regular expressions.
-
- From the source grammar, Lex produces an output program, written in
- Turbo Pascal, that defines a parameterless function
- function yylex : integer;
- implementing the lexical analyzer. When called, yylex reads an input
- file (standard input, by default), scanning for the patterns specified
- in the source grammar, and executing the corresponding actions as pat-
- terns are matched.
-
- Normally, yylex scans the whole input file and returns with value 0 to
- the calling program upon encountering end-of-file (actions may also
- return other values to the calling program, cf. 2.2). Thus, in the
- normal case, a suitable main program calling yylex is something like:
- begin
- if yylex=0 then { done }
- end.
-
- Such a main program must be supplied by the programmer, e.g., in the
- auxiliary procedures section (there is no default main program in the
- Lex library, as with UNIX Lex).
-
- The lexical analyzer routine yylex scans for all patterns
- simultaneously. If more than one pattern is matched, the longest match
- is preferred; if there still remains more than one pattern making the
- longest match, the first such rule in the source grammar is chosen.
- This makes rules like
- if writeln('keyword if');
- [A-Za-z][A-Za-z0-9]* writeln('identifier ', yytext);
- work as expected (i.e., input if will be matched by the first, if1 by
- the second rule).
-
- A Lex program may also be incomplete in that it does not specify a
- pattern for any possible input. In such a case, the lexical analyzer
- executes a default action on unrecognized parts of the input, which
- consists of copying the input to an output file (standard output, by
- default). Thus, the trivial Lex program
- %%
- yields a routine that copies the input to the output file unchanged.
- On the other hand, if the input has to be absorbed completely, the
- programmer must supply rules that match everything, e.g.:
- . |
- \n ;
-
- Example: The following Lex program counts words (sequences of non-
- whitespace characters) in an input file:
- uses LexLib;
- var count : integer;
- %%
- [^ \t\n]+ inc(count);
- . |
- \n ;
- %%
- begin
- count := 0;
- if yylex=0 then writeln('word count: ', count)
- end.
-
- A few remarks about the generated lexical analyzer routine are in or-
- der. Lex generates table-driven lexical analyzers in DFA technique [1,
- section 3.7] which usually are both quite compact and fast (though
- hand-coded lexical analyzers will often be more efficient). In
- particular, the matching time is proportional to the length of the in-
- put, unless ambiguity and lookahead requires a significant amount of
- rescanning. There are certain pathological regular expressions which
- cause exponential growth of the DFA table, however, they are rare.
-
- Lex-generated lexical analyzers interface nicely with Yacc, because
- the yylex routine just meets Yacc's requirements of its lexical
- analyzer routine (actually, it was designed that way). Thus, a yylex
- routine prepared with Lex can be incorporated directly into a Yacc-
- generated parser.
-
- 2.1 REGULAR EXPRESSIONS
-
- In the Lex language regular expressions are used to denote string pat-
- terns to be matched in the input stream. The basic regular expressions
- are (cf. table 1):
- - single characters: c stands for the literal character c itself, un-
- less it is an operator character (one of *, +, ?, etc., discussed
- below), in which case it must be quoted with the backslash \. Non-
- printable characters are specified using the C-like escapes listed
- in table 2. Note that \0 marks end-of-file, and \n (newline) stands
- for end-of-line, i.e. the sequence carriage-return/line-feed in MS-
- DOS text files.
-
- Escape Denotes
- ----------------------------------
- \b backspace
- \t tab
- \n newline
- \nnn character no. nnn in octal
- \\ backslash
- \c character c literally
- ----------------------------------
- Table 2: Lex character escapes.
-
- - strings: "s", where s is any character sequence, stands for the
- string s. To embed the double quote " in a string, it must be quoted
- with \.
- - character classes: [s] stands for all characters in s, and [^s]
- denotes the complement of [s]. A - sign in a character class denotes
- ranges, e.g. [a-z] is the class of all lowercase letters. The period
- . is an abbreviation for the class of all characters except newline,
- i.e. [^\n]. Note that a character class never contains the end-of-
- file marker \0, unless it is explicitly included.
-
- From these basic elements, larger regular expressions may be formed
- using the following operators:
- - r*: stands for an arbitrary sequence of r's (0 or more), where r is
- any regular expression.
- - r+: stands for 1 or more r's.
- - r?: stands for 0 or 1 r.
- - r{m,n}: stands for m to n r's (where m and n are nonnegative in-
- tegers). r{m} denotes excactly m r's.
- - r1r2: stands for r1, followed by r2, where r1 and r2 are arbitrary
- regular expressions.
- - r1|r2: stands for r1 or r2.
-
- The operators have been listed in order of decreasing precedence, i.e.
- *, +, ? and {m,n} bind stronger than r1r2 (concatenation), which in
- turn precedes over | (alternation). Parentheses ( ... ) can be used to
- group regular expressions and override default precedences.
-
- As already mentioned, subexpressions may be abbreviated with names,
- using regular definitions. If name has been defined as an expression
- r, {name} specifies the regular expression r. Note that (in difference
- to UNIX Lex) the substituted expression r must always be a complete
- legal regular expression, which is actually substituted as an expres-
- sion, not textually. This implies that {name} is treated as a
- parenthesized expression. Also note, that any references to name must
- follow its definition, i.e. recursive definitions are illegal.
-
- Lex also supplies a number of operators that are used to specify left
- and right context. The context does not actually belong to the pat-
- tern, but determines whether the pattern itself may be matched.
-
- Right context, or lookahead is specified using the lookahead operator
- /. r1/r2 stands for r1, but only if followed by r2, where r1 and r2
- are arbitrary regular expressions (which may, however, not contain the
- lookahead operator themselves). r$ may be used as an abbreviation for
- r/[\0\n], i.e. r followed by line end or end-of-file.
-
- The caret ^ stands for the beginning of the line, and thus marks im-
- mediate left context. More distant left context may be specified using
- user-defined start states. The expression <s1,...,sn>r denotes a pat-
- tern r that is valid (i.e., may be matched) only if the lexical
- analyzer is in any of the start states s1,...,sn. This requires one or
- more start state declarations in the definitions section that list all
- used start state identifiers (which, like expression names, must be
- legal Turbo Pascal identifiers).
- %start s1 s2 ... sn
- and
- %start s1
- ...
- %start sn
- are examples of valid start state declarations.
-
- By default, the lexical analyzer is in the default start state, which
- has number 0. Lex also assigns unique numbers to all user-defined
- start states, and the begin_ routine (cf. 2.2) can be used in an ac-
- tion or any other routine to put the lexical analyzer in the desired
- start state.
-
- Start states are useful when certain patterns have to be analyzed dif-
- ferently, depending on some left context (such as a special character
- at the beginning of the line), or when multiple lexical analyzers are
- working in concert. Note that a rule without start state prefix is
- valid in the default start state, as well as in any user-defined start
- state. This may be used to factor out patterns that are to be matched
- in either user-defined state.
-
- All the context operators may only appear in rules, not in regular
- definitions, and they may appear only once. Of course, context
- operators may be combined, as in <s>^r1/r2 which denotes a pattern r1
- that may only be matched if it occurs at the beginning of a line, is
- followed by an instance of r2, and if the lexical analyzer is in user-
- defined start state s.
-
- 2.2 ACTIONS
-
- A rule specifies a regular expression pattern and an action (a Turbo
- Pascal statement) to be executed when the pattern is matched. Lex sup-
- plies a number of variables and routines useful in the programming of
- actions:
- - yytext: a string variable, containing the matched text.
- - yyleng: the length of yytext, i.e. length(yytext). Note that the
- first and last character in yytext are yytext[1] and yytext[yyleng],
- respectively.
- - yylineno: the current line number in the input file; useful, e.g.,
- in giving diagnostics.
- - yymore: appends the next match to the current one (normally, the
- next match will overwrite the current one).
- - yyless: the counterpart of yymore; yyless(n) causes the current
- match to be restricted to the first n characters, and returns the
- remaining characters at the end of the match to be reread by the
- lexical analyzer. This supplies a crude form of "lookahead". Since
- Lex also supports a more general form of lookahead (cf. 2.1), this
- routine is largely obsolete.
- - reject: rejects the current match and executes whatever was "second
- choice" after the current rule, adjusting the match accordingly.
- This routine is useful when all (possibly overlapping) instances of
- patterns have to be detected; see the digram program on the dis-
- tribution disk for an example.
- - return: return(n), n an integer value, causes yylex to return with
- the indicated value. This routine is useful when the lexical
- analyzer is used to partition the input file for a parser; n will
- then typically denote a token number (cf. 3.2).
- - begin_: begin_(s) puts the lexical analyzer in start state s (cf.
- 2.1), where s is either 0 (default start state) or names a user-
- defined start state.
-
- These, and other Lex-supplied variables and routines are also dis-
- cussed in the interface of the LexLib unit (file lexlib.pas on the
- distribution disk).
-
- 2.3 LEX LIBRARY
-
- The LexLib (Lex library) unit supplies the input/output routines used
- by the lexical analyzer. The I/O files are implemented as Pascal text
- files yyin and yyout. These are - by default - assigned to the
- redirectable MS-DOS standard input/output devices. However, the user
- program may also assign them to any suitable files/devices.
-
- yylex accesses the input/output files through the following routines:
- - function input : char;
- returns the next character from the input file.
- - procedure unput(c : char);
- returns character c to the input buffer to be reread by a subsequent
- call to input.
- - procedure output(c : char);
- appends character c to the output file.
-
- All references to the yyin/yyout files of the LexLib unit are made
- through these three routines. Thus, a user program may well replace
- them altogether by other routines matching the specifications. This
- makes it possible for the lexical analyzer to access arbitrary in-
- put/output streams, such as special devices or internal memory.
-
- The LexLib I/O routines and files may also be accessed directly
- through actions or the main program. However, care must be heeded un-
- der certain circumstances. In particular, direct access to the yyin
- file will bypass the buffering of unput characters and thus may some-
- times not have the desired results.
-
- The yywrap routine is a parameterless boolean function that determines
- whether the lexical analyzer should perform normal wrapup at end-of-
- file. If it returns true, normal wrapup is performed; if it returns
- false, yylex ignores the end-of-file mark and continues lexical
- analysis. The LexLib unit supplies a default version of yywrap which
- always returns true. This routine may be replaced by a customized ver-
- sion that does application dependent processing at end-of-file. In
- particular, yywrap may arrange for more input and return false to
- resume lexical analysis (see the findproc program on the distribution
- disk for an example).
-
- Note that the LexLib unit must be loaded with (almost) any Lex program
- such that the lexical analyzer routine may access I/O and other
- routines. To achieve this, the line
- uses LexLib;
- should be put at the beginning of the Lex program (see section 2.5 on
- how to incorporate Turbo Pascal source lines into the definitions sec-
- tion of a Lex program).
-
- Refer also to the LexLib interface description contained in lexlib.pas
- on the distribution disk for a discussion of the LexLib I/O system and
- other routines.
-
- 2.4 CHARACTER TABLES
-
- The standard character encoding supported both by Lex and the LexLib
- unit is ASCII (to be more precise, IBM's 8-bit extension of ASCII).
- However the user may supply his own versions of input, unput and out-
- put (cf. 2.3), supporting their own character encoding. If such a
- customized character set is used, Lex must be told about it by means
- of a character table in the definitions section of the Lex program.
-
- This table has the format:
- %T
- charno. string
- ...
- %T
-
- Each line of the character table lists a character number (in the
- target code) and the corresponding ASCII representation(s) of this
- character. The usual escapes for non-printable characters are recog-
- nized (cf. table 2).
-
- Example: To map the lower- and uppercase letters into 1,...,26, the
- digits 0,...,9 into codes 27,...,36, and newline into 37, the fol-
- lowing table may be used:
-
- %T
- 1 Aa
- 2 Bb
- 3 Cc
- ...
- 26 Zz
- 27 0
- 28 1
- ...
- 36 9
- 37 \n
- %T
-
- If a character table is used, all characters (at least those actually
- used in the Lex grammar) should be in the table; all character numbers
- must be byte values (0..255), and no character may be mapped into two
- different codes.
-
- 2.5 TURBO PASCAL TIE-INS
-
- Frequently, a Lex program will not merely consist of definitions and
- rules, but also use other routines to be loaded with the lexical
- analyzer. One example is a main program that calls the yylex routine.
- We have already mentioned, that such a main program, and other sup-
- plementary routines, may be placed into the auxiliary procedures sec-
- tion.
-
- For other target language (i.e. Turbo Pascal) tie-ins, the Lex
- language allows arbitrary code fragments to be included in the defini-
- tions and at the beginning of the rules section of the Lex grammar, by
- the following means:
- - Any line in the source grammar starting in column one is assumed to
- contain Lex code (definitions or rules), which is processed by Lex.
- - Any line indented with at least one space or tab character, and any
- sequence of lines enclosed between %{ and %} is assumed to contain
- Turbo Pascal code, and is copied to the output file unchanged.
-
- Code in the definitions section is inserted at the beginning of the
- output program, at global scope, while code at the beginning of the
- rules section is inserted as local declarations into the action
- routine that contains the actions of all rules. Thus, an indented line
- var i : integer;
- at the beginning of the rules section will declare an integer variable
- local to the action statements.
-
- As a side-effect, these conventions allow comments to be included in a
- Lex program; comments should then follow host language (i.e. Turbo
- Pascal) syntax.
-
- Example: A typical setup for a Lex program with necessary declara-
- tions, supplementary routines, and main program may look as follows:
- uses LexLib;
- { global declarations }
- { Lex definitions }
- %%
- { local declarations }
- { Lex rules }
- %%
- { supplementary routines }
- begin { main program }
- ...
- if yylex=0 then { done };
- ...
- end.
-
- 2.6 IMPLEMENTATION RESTRICTIONS, BUGS, AND INCOMPATIBILITIES
-
- Lex poses some restrictions on the sizes of the input grammar and in-
- ternal tables. Maximum table sizes are printed out by Lex at the end
- of a translation, along with statistics about actual table space re-
- quirements. An error message table overflow can also indicate that not
- enough main memory is available to Lex (possibly because of too many
- programs loaded into memory).
-
- Since yytext is implemented as a Turbo Pascal string variable, the
- maximum size for a matched string is 255.
-
- As implemented, the reject routine (cf. 2.2) does not rescan the in-
- put, but uses internal state information to determine the next pos-
- sible match. Thus reject combined with modifications of the input
- stream may yield unexpected results.
-
- There is a subtle (and, as far as I know, undocumented) bug in Lex,
- concerning certain types of lookahead patterns, that sometimes causes
- lookahead not to be restored properly. E.g., when the pattern ab*/b is
- matched, the last b is never returned to the input, but instead the
- whole matched sequence will be returned in yytext. This actually is a
- "misfeature", and seems to be comformant with UNIX Lex and even with
- the method of Aho/Sethi/Ullman for handling the lookahead operator [1,
- section 3.8], which Lex' treatment of the lookahead operator is based
- on.
-
- When called, yylex partially initalizes itself. This implies that
- yymore and reject will not work between different invokations of
- yylex.
-
- As discussed in 2.1, Lex substitutes expressions, not text, when a
- regular definition is referred to with the {name} notation. This is in
- contrast with the (textual) macro expansion scheme used in UNIX Lex.
- Although the approach taken in Turbo Pascal Lex is more restrictive,
- we feel that it is actually an improvement. In particular, regular
- definitions can be parsed and checked for validity immediately, such
- that errors in regular definitions can be detected as soon as pos-
- sible. Also, the meaning of a regular definition is guaranteed to be
- independent of the context in which it is used.
-
- Another (minor) difference between Turbo Pascal and UNIX Lex in the
- syntax of regular definitions is that Turbo Pascal Lex requires names
- for regular expressions to be legal (Turbo Pascal) identifiers,
- whereas UNIX Lex admits arbitrary character strings as names.
-
- 3. YACC
-
- Yacc ("Yet Another Compiler-Compiler") is a parser generator, i.e. a
- program that translates the specification of an input language by its
- BNF (Backus Naur Form) grammar into a parser subroutine, written in
- Turbo Pascal.
-
- Similar to Lex, a Yacc program or grammar has the form
- definitions
- %%
- rules
- %%
- auxiliary procedures
- where the first section contains the definitions of the basic input
- symbols (terminals, also termed tokens) of the specified language, the
- second section the grammar rules for the nonterminal symbols of the
- language, and the third, and optional, section contains any additional
- Turbo Pascal code, such as supplementary routines and a main program,
- which will be tacked on to the end of the Turbo Pascal output program.
-
- The rules section of a Yacc program is simply a BNF grammar for the
- target language, possibly augmented with actions, program statements
- to be executed as certain syntactic constructs are recognized by the
- parser. By default, the left-hand side of the first rule marks the
- start symbol of the grammar. It is also possible to declare a start
- symbol explicitly by means of a declaration of the form
- %start A
- in the definitions section, where A is the desired start symbol.
-
- Grammar rules have the general format
- A : b1 ... bn;
- where A is the left-hand side nonterminal of the rule, and b1 ... bn
- is the (possibly empty) right-hand side sequence of nonterminal and
- terminal symbols bi.
-
- The terminating semicolon may be omitted, and several rules A : ui for
- the same left-hand side nonterminal A may be abbreviated as
- A : u1
- | u2
- ...
- | un
-
- Nonterminal symbols are denoted by identifiers (letters, followed by
- digits and letters, where underscore _ and period . count as letters,
- and upper- and lowercase are distinct).
-
- Terminal symbols may either be literals (single characters enclosed in
- single quotes) or identifiers that are declared explicitly as
- terminals by a %token definition of the form
- %token a1 ... an
-
- By convention, token identifiers are given in uppercase, such that
- they can be distinguished easily from nonterminal symbols.
-
- In literals, the usual C-like escapes are recognized (cf. table 3).
-
- Escape Denotes
- ------------------------------------
- '\n' newline
- '\r' carriage return
- '\'' single quote '
- '\\' backslash
- '\t' tab
- '\b' backspace
- '\f' form feed
- '\nnn' character no. nnn in octal
- ------------------------------------
- Table 3: Yacc character escapes.
-
- Grammar rules may be augmented with actions, Turbo Pascal statements
- enclosed between { ... }. Usually, actions appear at the end of rules,
- indicating the statements to be executed when an instance of a rule
- has been recognized (cf. 3.1).
-
- The Yacc language is free-format: blanks, tabs, and newlines are ig-
- nored, except when they serve as delimiters. Yacc language comments
- have the format:
- /* ... anything except */ ... */
-
- As with Lex, host language (i.e. Turbo Pascal) tie-ins may be
- specified by enclosing them in %{ ... %}. Such code fragments will be
- copied unchanged, and inserted into the output file at appropriate
- places (code in the definitions section at global scope, code at the
- beginning of the rules section as local declarations of the action
- routine).
-
- The class of grammars accepted by Yacc is LALR(1) with disambiguating
- rules (cf. [1, sections 4.7 and 4.8]). Yacc can successfully produce
- parsers for a large class of formal languages, including most modern
- programming languages (under UNIX, Yacc has been used to produce par-
- sers for C, Fortran, APL, Pascal, Ratfor, Modula-2, and others).
-
- From the source grammar, Yacc produces an output file containing the
- parser routine
- function yyparse : integer;
- together with any additional Turbo Pascal code supplied by the
- programmer.
-
- yyparse repeatedly calls a lexical analyzer routine yylex to obtain
- tokens from the input file, and parses the input accordingly, execut-
- ing appropriate actions as instances of grammar rules are recognized.
- yyparse returns with a value of 0 (successful parse terminated at end-
- of-file) or 1 (fatal error, such as parse stack overflow, or un-
- recoverable syntax error).
-
- Thus, a main program that calls the parser routine may look as fol-
- lows:
- begin
- ...
- if yyparse=0 then { done } else { error };
- ...
- end.
-
- Main program and lexical analyzer routine must be supplied by the
- programmer. The yylex routine can also be prepared with Lex and then
- loaded with the parser, cf. 3.2. The main program is usually included
- at the end of the auxiliary procedures section.
-
- The following is an example of a Yacc grammar for simple arithmetic
- expressions. Note that the symbol NUMBER is declared as a token ex-
- pected to be returned by the lexical analyzer as a single input sym-
- bol.
-
- %token NUMBER
- %%
- expr : term
- | expr '+' term
- ;
- term : factor
- | term '*' factor
- ;
- factor : NUMBER
- | '(' expr ')'
- | '-' factor
- ;
-
- A Lex program implementing the lexical analyzer for this Yacc program
- may look as follows:
- {$I expr.h} { definition of token numbers produced by
- Yacc }
- %%
- [0-9]+ return(NUMBER);
- . |
- \n return(ord(yytext[1]));
- { other literals returned as their character
- codes }
-
- 3.1 ACTIONS
-
- As already indicated, grammar rules may be associated with actions,
- program statements that are executed as rules are recognized during
- the parse. Among other things, actions may print out results, modify
- internal data structures such as a symbol table, or construct a parse
- tree. Actions may also return a value for the left-hand side non-
- terminal, and process values returned by previous actions for right-
- hand side symbols of the rule.
-
- For this purpose, Yacc assigns to each symbol of a rule a correspond-
- ing value (of type integer, by default, but arbitrary value types may
- be declared, cf. 3.6): $$ denotes the value of the left-hand side non-
- terminal, $i the value of the ith right-hand side symbol. Values are
- kept on a stack maintained by the parser as the input is parsed. The
- lifetime of a value begins when the corresponding syntactic entity
- (nonterminal or terminal) has been recognized by the parser, and ends
- when the parser reduces by an enclosing rule, thereby replacing the
- values associated with the right-hand side symbols by the value of the
- left-hand side nonterminal.
-
- Nonterminals A obtain their values through assignments of the form $$
- := v, where A is the left-hand side of the corresponding rule, and v
- is some value, usually obtained by a function applied to the right-
- hand side values $i.
-
- Terminals may also have associated values; these are set by the lexi-
- cal analyzer through an assignment to the variable yylval supplied by
- Yacc (cf. 3.2), as necessary.
-
- As an example, here is an extension of the arithmetic expression gram-
- mar, featuring actions that evaluate the input expression and a rule
- for the new start symbol line that is associated with an action that
- prints out the obtained result.
-
- %token NUMBER
- %%
- line : expr '\n' { writeln($1) }
- ;
- expr : term { $$ := $1 }
- | expr '+' term { $$ := $1 + $3 }
- ;
- term : factor { $$ := $1 }
- | term '*' factor { $$ := $1 * $3 }
- ;
- factor : NUMBER { $$ := $1 }
- | '(' expr ')' { $$ := $2 }
- | '-' factor { $$ := -$2 }
- ;
-
- Note that the lexical analyzer must set the values of NUMBER tokens,
- which are referred to by $1 in the rule factor : NUMBER. One can use a
- Lex rule like
- var code : integer;
- [0-9]+ begin
- val(yytext, yylval, code);
- return(NUMBER)
- end;
- that applies the Turbo Pascal standard procedure val to evaluate a
- NUMBER token.
-
- Actually, we could have omitted the "copy actions" of the form $$ :=
- $1 in the above grammar, since this is the default action automatical-
- ly assumed by Yacc for any rule without explicit action.
-
- Yacc also allows actions within rules, i.e. actions that are to be ex-
- ecuted before a rule has been fully parsed. A rule like
- A : b { p; } c
- will be treated as if it was written
- A : b $act c
- $act : { p; }
- introducing a new nonterminal $act matched to an empty right-hand
- side, and associated with the desired action.
-
- In particular, the action { p; } is treated as if it was a (nontermi-
- nal) grammar symbol, and thus can also return a value accessible with
- the usual $i notation. The action itself may also access values of
- symbols (and other actions) to the left of it. Thus, the rule A : b {
- p; } c is actually treated as if it consisted of three right-hand side
- symbols; $1 denotes the value of b, $2 the value of { p; } (set in p
- by an assignment to $$) and $3 is the value of c.
-
- Yacc's syntax-directed evaluation scheme makes it particularily easy
- to implement synthesized attributes along the guidelines of [1, sec-
- tion 5.6]. The evaluation of inherited attributes can often also be
- simulated by making use of marker nonterminals, see also [1]. To ac-
- cess marker nonterminals outside the scope of the rule to which they
- belong, Yacc also supports the notation $i, where i<=0, indicating a
- value to the left of the current rule, belonging to an enclosing rule
- ($0 denotes the first symbol to the left, $-1 the second, ...).
-
- Consider
- A : b B
- | b b' { $$ := $1 } B
- B : c { $$ := f($0) }
-
- The anonymous marker nonterminal implemented by the action { $$ := $1
- } assures that the value of b can always be accessed through $0 in the
- third rule. Note that without use of the marker nonterminal, the rela-
- tive position of b's value on the stack would not be predictable.
-
- Actions within rules, and access to values in enclosing rules, supply
- flexible means to implement syntax-directed evaluation schemes.
- However, care must be heeded that such actions do not give rise to un-
- wanted parsing conflicts caused by ambiguities (cf. 3.4).
-
- 3.2 LEXICAL ANALYSIS
-
- Yacc-generated parsers use a lexical analyzer routine yylex to obtain
- tokens from the input file. This routine must be supplied by the user,
- and is assumed to return an integer value denoting a basic input sym-
- bol. 0 (or negative) denotes end-of-file, and character literals are
- denoted by their character code. Usually, all other token numbers are
- assigned by Yacc automatically, in the order in which %token defini-
- tions appear in the grammar. Token numbers may also be assigned ex-
- plicitly, by a definition of the form
- %token a n
- where a is the terminal symbol (literal or identifier), and n is the
- desired token number.
-
- If there is a value associated with an input symbol, yylex should as-
- sign this value to the variable yylval supplied by Yacc. Usually, yyl-
- val has type integer, but this default can be overwritten (cf. 3.6).
-
- Declarations shared by parser and lexical analyzer are put in the
- header (.h) file Yacc generates along with the (.pas) output file con-
- taining the parser routine. The header file declares the yylval vari-
- able and lists the token numbers; each token identifier is declared as
- a corresponding integer constant.
-
- The header file should thus be included in a context where it is ac-
- cessible by both the parser and the lexical analyzer. For instance,
- one may include both header file and lexical analyzer, in that order,
- in the definitions section of the grammar, by means of the Turbo Pas-
- cal include directive ($I):
- %{
- {$I header filename }
- {$I lexical analyzer }
- %}
-
- As has already been indicated, the lexical analyzer generator Lex dis-
- cussed in section 2 of this manual is a useful tool to produce lexical
- analyzers to be incorporated into Yacc-generated parsers.
-
- 3.3 YACC LIBRARY
-
- The Yacc library unit YaccLib contains some default declarations used
- by Yacc-generated parsers. It should therefore be loaded with yyparse,
- which can be achieved by the uses clause
- %{
- uses Yacclib;
- %}
- at the beginning of the Yacc program. Note that if the program in-
- cludes a lexical analyzer prepared with Lex, the LexLib unit may also
- be required:
- %{
- uses Yacclib, LexLib;
- %}
-
- The routines implemented by the Yacc library are defaults that can be
- customized for the target application. In particular, these are the
- yymsg message printing routine and the yydebugmsg debug message print-
- ing routine. The Yacc library also declares defaults for the value-
- type YYSTYPE (cf. 3.6) and the size of the parser stack (yymaxdepth;
- see also 3.9).
-
- Refer to the interface description of the YaccLib unit for further in-
- formation. The interface also describes some additional variables and
- routines which are not actually implemented by the Yacc library, but
- are contained in the Yacc output file. Some of these will also be men-
- tioned in subsequent sections.
-
- 3.4 AMBIGUITY
-
- If a grammar is non-LALR(1), Yacc will detect parsing conflicts when
- constructing the parse table. Such parsing conflicts are usually
- caused by inconsistencies and ambiguities in the grammar. Yacc will
- report the number of detected parsing conflicts, and will try to
- resolve these conflicts, using the methods outlined in [1, section
- 4.8]. Thus, Yacc will generate a parser for any grammar, even if it is
- non-LALR. However, if unexpected parsing conflicts arise, it is wise
- to consult the parser description (.lst file, cf. 3.7) generated by
- Yacc to determine whether the conflicts were resolved correctly.
- Otherwise the parser may not behave as expected.
-
- An example of an ambigious grammar is the following, specifying a
- Pascal-like syntax of IF-THEN-ELSE statements:
- %token IF THEN ELSE
- %%
- stmt : IF expr THEN stmt /* 1 */
- | IF expr THEN stmt ELSE stmt /* 2 */
-
- The ambiguity in this grammar fragment, often referred to as the
- dangling-else-ambiguity, stems from the fact that it cannot be decided
- to which THEN a nested ELSE belongs: is
- IF e1 THEN IF e2 THEN s1 ELSE s2
- to be interpreted as:
- (1) IF e1 THEN ( IF e2 THEN s1 ELSE s2 )
- or as:
- (2) IF e1 THEN ( IF e2 THEN s1 ) ELSE s2 ?
-
- Let us take a look at how such an ambigious construct would be parsed.
- When the parser has seen IF e2 THEN s1, it could recognize (reduce by)
- the first rule, yielding the second interpretation; but it could as
- well read ahead, shifting the next symbol ELSE on top of the parser
- stack, then parse s2, and finally reduce by rule 2, which in effect
- yields the first interpretation. Thus, upon seeing the token ELSE, the
- parser is in a shift/reduce conflict: it cannot decide between shift
- (the token ELSE) and reduce (by the first rule).
-
- In the dangling-else example, the grammar - with some effort - can be
- rewritten to eliminate the ambiguity, see [1, section 4.3]. However,
- this is not possible in general (there are languages that are in-
- trinsically ambigious). Furthermore, an unambigious grammar may still
- be non-LALR (see [1, exercise 4.40] for an example). In particular,
- parsing decisions in an LALR parser are based on one-symbol-lookahead,
- which limits the class of grammars that may be used to construct a
- `pure' LALR parser.
-
- As implemented, Yacc always resolves shift/reduce conflicts in favour
- of shift. Thus a Yacc generated parser will correctly resolve the
- dangling-else ambiguity (assuming the common rule, that an ELSE
- belongs to the last unmatched THEN).
-
- Another type of ambiguity arises when the parser has to choose between
- two different rules. Consider the following grammar fragment (for sub-
- and superscripted expressions in the UNIX equation formatter eqn):
- %token SUB SUP
- %%
- expr : expr SUB expr SUP expr /* 1 */
- | expr SUB expr /* 2 */
- | expr SUP expr /* 3 */
-
- The rationale behind this example is that an expression involving both
- sub- and superscript is often set differently from a superscripted
- subscripted expression.
-
- The ambiguity arises in an expression of the form e1 SUB e2 SUP e3. At
- the end of the expression, the parser can apply both rule 1 (reduce
- the whole expression) and rule 3 (reduce the subexpression e2 SUP e3).
-
- This type of conflict is termed reduce/reduce conflict. Yacc resolves
- reduce/reduce conflicts in favour of the rule listed first in the Yacc
- grammar. Thus, "special case constructs" like the one above may be
- specified by listing them ahead of the more general rules. In our ex-
- ample, e1 SUB e2 SUP e3 is always interpreted as an instance of the
- first rule (which presumably is the intended result).
-
- To summarize, in absence of other strategies (to be discussed below),
- Yacc applies the following default disambiguating rules:
- - a shift/reduce conflict is resolved in favour of shift.
- - in a reduce/reduce conflict, the first applicable grammar rule is
- preferred.
-
- In any case, the number of shift/reduce and reduce/reduce conflicts is
- reported by Yacc, since they could indicate inconsistencies in the
- grammar. Also, Yacc reports rules that are never reduced (possibly,
- because they are completely ruled out by disambiguating rules). A more
- detailed description of the detected conflicts may be found in the
- parser description (cf. 3.7).
-
- The default disambiguating rules are often inappropriate in cases
- where an ambigious grammar is deliberately chosen as a more concise
- representation of an (unambigious) language. Consider the following
- grammar for arithmetic expressions:
- %token NUMBER
- %%
- expr : expr '+' expr
- | expr '*' expr
- | NUMBER
- | '(' expr ')'
- ;
-
- There are several reasons why such an ambigious grammar might be
- preferred over the corresponding unambigious grammar:
- %token NUMBER
- %%
- expr : term
- | expr '+' term
- ;
- term : factor
- | term '*' factor
- ;
- factor : NUMBER
- | '(' expr ')'
- ;
-
- In particular, the ambigious grammar is more concise and natural, and
- yields a more efficient parser [1, section 4.8].
-
- The ambiguities in the first grammar may be resolved by specifying
- precedences and associativity of the involved operator symbols. This
- may be done by means of the following precedence definitions:
- - %left operator symbols: specifies left-associative operators
- - %right operator symbols: specifies right-associative operators
- (e.g., exponentiation)
- - %nonassoc operator symbols: specifies non-associative operators (two
- operators of the same class may not be combined; e.g., relational
- operators in Pascal)
-
- Each operator symbol may be a literal or a token-identifier; token-
- names appearing in precedence definitions may, but need not be
- declared with %token as well.
-
- Each precedence declaration introduces a new precedence level, lowest
- precedence first. For the example above, assuming '+' and '*' to be
- left-associative, and '*' to take precedence over '+', the correspond-
- ing Yacc grammar is:
- %token NUMBER
- %left '+'
- %left '*'
- %%
- expr : expr '+' expr
- | expr '*' expr
- | NUMBER
- | '(' expr ')'
- ;
-
- This grammar unambigiously specifies how arbitrary expressions are to
- be parsed; e.g., e1+e2+e3 will be parsed as (e1+e2)+e3, and e1+e2*e3
- as e1+(e2*e3).
-
- Yacc resolves shift/reduce conflicts using precedences and as-
- sociativity in the following manner. With each grammar rule, it as-
- sociates the precedence of the righmost terminal symbol (this default
- may be overwritten using a %prec tag, see below). Now, when there is a
- conflict between shift a and reduce r, a a terminal and r a grammar
- rule, and both a and r have associated precedences p(a) and p(r),
- respectively, the conflict is resolved as follows:
- - if p(a)>p(r), choose `shift'.
- - if p(a)<p(r), choose `reduce'.
- - if p(a)=p(r), the associativity of a determines the resolution:
- - if a is left-associative: `reduce'.
- - if a is right-associative: `shift'.
- - if a is non-associative: `syntax error'.
-
- Shift/reduce conflicts resolved with precedence will, of course, not
- be reported by Yacc.
-
- Occasionally, it may be necessary to explicitly assign a precedence to
- a rule using a %prec tag, because the default choice (precedence of
- rightmost terminal) is inappropriate. Consider the following example
- in which '-' is used both as binary and unary minus:
- %token NUMBER
- %left '+' '-'
- %left '*' '/'
- %right UMINUS
- %%
- expr : expr '+' expr
- | expr '-' expr
- | expr '*' expr
- | expr '/' expr
- | NUMBER
- | '(' expr ')'
- | '-' expr %prec UMINUS
- ;
-
- UMINUS is not an actual input symbol, but serves to give unary minus a
- higher precedence than any other operator. Note that, by default, the
- last rule would otherwise have the precedence of (binary) '-'.
-
- 3.5 ERROR HANDLING
-
- This section is concerned with the built-in features Yacc provides for
- syntactic error handling. By default, when a Yacc-generated parser en-
- counters an errorneous input symbol, it will print out the string
- syntax error via the yymsg routine and return to the calling program
- with the value 1. Usually, an application program will have to do bet-
- ter than this, e.g. print appropriate error messages, and resume pars-
- ing after syntax errors. For this purpose, Yacc supplies a flexible
- error recovery scheme based on the notion of "error rules", cf. [1,
- section 4.8 and 4.9].
-
- The predefined token error is reserved for error handling; it is in-
- serted at places were syntax errors are expected, yielding the so-
- called error rules. A transition on the error token is never taken
- during a normal parse, but only when a syntax error occurs. In this
- case the parser pretends it has seen an error token immediately before
- the offending input symbol. Since in general the current state of the
- parser may not admit a transition on this fictitious error symbol, the
- parser pops its stack until it finds a suitable state, which admits a
- transition on the error token. If no such state exists, the default
- error handler is used which prints out syntax error and terminates the
- parse. If there is such a state, the parser shifts the error token on
- top of the stack, and resumes parsing.
-
- To prevent cascades of error messages, the parser then proceeds in a
- special "error" state in which other errorneous input symbols are
- quietly ignored. Normal parse is resumed only after three symbols have
- been read and accepted by the parser.
-
- As a simple example, consider the rule
- stmt : error { action }
- and assume a syntax error occurs while a statement is parsed. Then the
- parser will pop its stack, until the token error, as an instance of
- the rule stmt : error, can be accepted, shift the error token on top
- of the stack, reduce by the error rule immediately (executing action),
- and resume parsing in error state. The effect is, that the parser as-
- sumes that a statement (to which error reduces) has been found, and
- skips symbols until it finds something which can legally follow a
- statement.
-
- Similarly, the rule
- stmt : error ';' { action }
- will cause the parser to skip input symbols until a semicolon is
- found, after which the parser reduces by the error rule and executes
- action.
-
- Note that error rules are not restricted to the simple forms indicated
- above, but may consist of an arbitrary number of terminal and non-
- terminal symbols.
-
- Occasionally, the three-symbols resynchronization rule is inadequate.
- Consider, for example, an interactive application with an error rule
- input : error '\n' { write('reenter last line: ') }
- input { $$ := $4 }
- ;
-
- An error in an input line will then cause the parser to skip ahead be-
- hind the following line end, emit the message reenter last line:, and
- read another line. However, it will quietly skip invalid symbols on
- the new line, until three valid symbols have been found. This can be
- fixed by a call to the routine yyerrok which resets the parser to its
- normal mode of operation:
- input : error '\n' { write('reenter last line: ');
- yyerrok }
- input { $$ := $4 }
- ;
-
- There are a number of other Yacc-supplied routines which are useful in
- implementing better error diagnostics and handling:
- - yychar: an integer variable containing the current lookahead token.
- - yyclearin: deletes the current lookahead token.
- - yynerrs: current total number of errors reported with yymsg.
- - yyerror: simulates a syntax error; syntactic error recovery is
- started, as if the next symbol was illegal.
- - yyaccept: simulates accept action of the parser (yyparse returns 0).
- - yyabort: aborts the parse (yyparse returns 1), as if an un-
- recoverable syntax error occurred.
-
- These, and other routines are also described in the interface of the
- Yacc library (cf. 3.3).
-
- Syntactic error handling and recovery is a difficult area; see [1] for
- a more comprehensive treatment of this topic. The reader may also
- refer to [4] for a more detailed explanation of the Yacc error
- recovery mechanism and systematic techniques for developing error
- rules.
-
- 3.6 ARBITRARY VALUE TYPES
-
- As already noted, the default type for the $$ and $i values is integer
- (cf. 3.1). This type can be changed by putting a declaration
- %{
- type YYSTYPE = some_type;
- %}
- into the definitions section of the Yacc program, prior to inclusion
- of the header and lexical analyzer file.
-
- Yacc also supports explicit declaration of a (record) value type, by
- means of a definition
- %union{
- name(s) : type;
- ...
- }
-
- Such a definition will be translated to a corresponding (Turbo Pascal)
- record declaration which is put into the header file, and determines
- the type of stacked $$ values, as well as of the yylval variable (cf.
- 3.2). "Union tags" of the form <name>, where name is the name of a
- component of the record type, can then be assigned to terminal and
- nonterminal symbols through %token and %type definitions, respective-
- ly.
-
- Consider, for example, the definition:
- %union{
- integer_val : integer;
- real_val : real;
- }
- and the grammar rules:
- %token INT
- %%
- expr : expr '+' expr { $$ := $1 + $3 }
- | expr '*' expr { $$ := $1 * $3 }
- | INT { $$ := $1 }
- /* ... */
-
- To assign value type real to nonterminal expr, and type integer to the
- token INT, one might say:
- %token <integer_val> INT
- %type <real_val> expr
-
- The effect is, that Yacc will automatically replace references to $$
- and $i values by the appropriate record tags, i.e. $$ := $1 + $3 will
- be treated as $$.real_val := $1.real_val + $3.real_val, and the action
- $$ := $1 associated with the third rule will be interpreted as
- $$.real_val := $1.integer_val.
-
- Also, when arbitrary value types are used, Yacc checks whether each
- value referred to by an action has a well-defined type.
-
- Occasionally, there are values whose types cannot be determined by
- Yacc easily. This is the case for the $$ value of an action within a
- rule, as well as for values $i, i<=0 of symbols in enclosing rules.
- For such values the notations <name> and <name>i must be used, respec-
- tively, where <name> denotes the appropriate union tag.
-
- The expr grammar on the distribution disk is a `real-life' example of
- the use of arbitrary value types.
-
- 3.7 DEBUGGING
-
- As experience shows, debugging a parser can be quite tedious.
- Although, with some experience, writing a grammar is a quite easy and
- straightforward task, implementing, for instance, a good error
- recovery scheme may be quite tricky. Yacc supplies two debugging aids
- that help verify a parser.
-
- First of all, the parser description (.lst file) is useful when
- determining whether Yacc correctly resolved parsing conflicts, and
- when tracing the actions of a parser. The .lst file gives a descrip-
- tion of all generated parser states. For each state, the set of kernel
- items and the parse actions are given. The kernel items correspond to
- the grammar rules processed by the parser in a given state; the un-
- derscore _ in an item denotes the prefix of the rule that has already
- been seen, and the suffix yet to come. The parser actions specify what
- action the parser takes on a given input symbol. Here, the period .
- denotes the default action that is taken on any input symbol not men-
- tioned otherwise. Possible actions are:
- - shift an input symbol on top of the parser stack, and change the
- parser state accordingly;
- - goto a new state upon a certain nonterminal recognized through the
- previous reduction;
- - reduce by a certain grammar rule;
- - accept, i.e. successfully terminate the parse; and
- - error, start syntax error recovery.
- The default action in a state may either be reduce or error.
-
- The parser description also lists parsing conflicts and rules that are
- never reduced (cf. 3.4).
-
- Consider the ambigious grammar:
- %token NUMBER
- %%
- expr : expr '+' expr
- | expr '*' expr
- | NUMBER
- ;
-
- This grammar, when fed into Yacc will cause a number of shift/reduce
- conflicts. For instance, the description of parser state 5 in the .lst
- file will read as follows:
- state 5:
- shift/reduce conflict (shift 3, reduce 2) on '*'
- shift/reduce conflict (shift 4, reduce 2) on '+'
-
- expr : expr '*' expr _ (2)
- expr : expr _ '+' expr
- expr : expr _ '*' expr
-
- $end reduce 2
- '*' shift 3
- '+' shift 4
- . error
-
- As is apparent from this description, the conflicts are caused by
- missing precedences and associativities of '+' and '*'. Also, it can
- be seen that for both '+' and '*' Yacc chose shift, following the
- default shift/reduce disambiguating rule.
-
- Now let us resolve the conflicts in the grammar by adding appropriate
- precedence declarations:
- %token NUMBER
- %left '+'
- %left '*'
- %%
- expr : expr '+' expr
- | expr '*' expr
- | NUMBER
- ;
-
- Now, all conflicts are resolved and for the description of state 5 we
- get:
- state 5:
-
- expr : expr '*' expr _ (2)
- expr : expr _ '+' expr
- expr : expr _ '*' expr
-
- . reduce 2
-
- Thus, Yacc correctly resolved the ambiguities by chosing reduction on
- any input, which corresponds to the higher precedence and left-
- associativity of '*'.
-
- There are situations in which a parser seems to behave "strangely" in
- spite of an "obviously correct" grammar. If the problem cannot be
- found by careful analysis of the grammar, it is useful to trace the
- actions performed by the parser, to get an idea of what goes wrong.
-
- For this purpose a parser may be compiled with defined conditional
- yydebug, e.g.:
- yacc parser
- tpc parser /Dyydebug
-
- When run, the parser will print out the actions it performs in each
- step (together with parser states, numbers of shifted symbols, etc.),
- which can then be followed on a hardcopy of the parser description.
-
- Debug messages are printed via the yydebugmsg routine (cf. 3.3). You
- may also wish to tailor this routine to your target application, such
- that it prints more informative messages.
-
- Of course, the discussion above was rather sketchy. A more detailed
- treatment of these topics, however, would require a presentation of
- the LALR parsing technique, which is well beyond the scope of this
- manual. The reader instead is referred to [1, sections 4.7 and 4.8]
- for more information about the LALR technique. Also, [4] gives a more
- detailed explanation of (UNIX) Yacc parse tables and debug messages,
- which can mostly be applied to Turbo Pascal Yacc accordingly.
-
- 3.8 YACC LANGUAGE GRAMMAR
-
- This section specifies the Yacc language syntax, as a Yacc grammar.
- Actually, the Yacc language is more naturally expressed by an LR(2)
- grammar; the difficulty is to decide on the base of one-symbol
- lookahead whether an identifier at the end of a rule is followed by a
- colon, in which case it starts the next rule. Thus, we distinguish the
- token C_ID (an identifier followed by a colon) and "ordinary"
- identifiers ID. It is assumed to be the task of the lexical analysis
- to determine to which of these two classes an identifier belongs.
-
- The following grammar has been abstracted from the Turbo Pascal Yacc
- grammar actually used to implement Turbo Pascal Yacc.
-
- %token
- ID /* identifier; also literals enclosed in quotes */
- C_ID /* identifier followed by a colon */
- NUMBER /* nonnegative integers */
- TOKEN LEFT RIGHT NONASSOC TYPE START UNION PREC
- /* reserved words: %token, etc. */
- SEP /* separator %% */
- LCURL RCURL /* curly braces %{ and %} */
- ',' ':' ';' '|' '{' '}' '<' '>'
- /* single character literals */
-
- %start spec
-
- %%
-
- spec : defs SEP rules aux_procs
- ;
-
- /* auxiliary procedures section: *************************/
-
- aux_procs : /* empty: aux_procs is optional */
- | SEP { copy the rest of the file }
- ;
- /* definitions section: **********************************/
-
- defs : /* empty */
- | defs def
- ;
-
- def : START ID
- | UNION '{' { copy the union definition } '}'
- | LCURL { copy Turbo Pascal tie-in } RCURL
- | TOKEN tag token_list
- | LEFT tag token_list
- | RIGHT tag token_list
- | NONASSOC tag token_list
- | TYPE tag nonterm_list
- ;
-
- tag : /* empty: union tag is optional */
- | '<' ID '>'
- ;
-
- token_list : token_num
- | token_list token_num
- | token_list ',' token_num
- ;
-
- token_num : ID
- | ID NUMBER
- ;
-
- nonterm_list : nonterm
- | nonterm_list nonterm
- | nonterm_list ',' nonterm
- ;
-
- nonterm : ID
- ;
-
- /* rules section: ****************************************/
-
- rules : rule1
- | LCURL { copy Turbo Pascal tie-in } RCURL rule1
- | rules rule
- ;
-
- rule1 : C_ID ':' body preced
- ;
-
- rule : rule1
- | '|' body preced
- ;
-
- body : /* empty */
- | body ID
- | body action
- ;
-
- action : '{' { copy action, substitute $$, etc. } '}'
- ;
-
- preced : /* empty */
- | PREC ID
- | PREC ID action
- | preced ';'
- ;
-
- 3.9 ADDITIONAL FEATURES, IMPLEMENTATION RESTRICTIONS AND BUGS
-
- For backward compatibility, Turbo Pascal Yacc supports all additional
- language elements entitled as `Old Features Supported But not En-
- couraged' in the UNIX manual:
- - literals delimited by double quotes and multiple-character literals.
- - \ as a synonym for %, i.e. \\ is %%, \left is %left, etc.
- - other synonyms: %< = %left, %> = %right, %binary = %2 = %nonassoc,
- %term = %0 = %token, %= = %prec.
- - actions of the form ={...} and =single statement;.
- - host language tie-ins (%{...%}) at the beginning of the rules sec-
- tion (I think that this last one is really a must).
-
- See the UNIX Yacc manual for further information.
-
- As with Lex, Yacc poses some restrictions on internal table sizes for
- the source grammar and the constructed parser; these are printed out
- by Yacc together with statistics about actual table space require-
- ments, after a successful translation of a grammar. Also, make sure
- that enough main memory is available.
-
- The default size of the parser stack is yymaxdepth=1024 (cf. 3.3)
- which should be sufficient for any average application, but may also
- be enlarged (and shrinked) as needed. Note that right-recursive gram-
- mar rules may increase stack space requirements; thus it is a good
- idea to use left-recursive (and left-associative) rules wherever pos-
- sible.
-
- Standard (UNIX) Yacc has a bug that causes some (correct) error
- recovery schemes to hang in an endless loop, see [4]. This bug should
- be fixed in the Turbo Pascal implementation, at the cost of slightly
- increased parse table sizes.
-
- Yes, there is (at least) one bug in Turbo Pascal Yacc, namely that
- %union definitions (cf. 3.6) are translated to simple Pascal record
- types. They should be variant records instead. This will be fixed in
- the next release, if there ever is one. Note that this bug does not
- affect the proper functioning of the parser; it merely increases
- memory requirements for the parser's value stack. Anyhow, you may work
- around this by using %union definitions of the (Pascal variant record)
- form
- %union { case integer of
- 1: ( ... ) ;
- 2: ( ... ) ;
- ...
- }
-
- A final remark about the efficiency of Yacc-generated parsers is in
- order. The time needed to parse an input of length n is proportional
- to n. Although this may not convince everyone (Lex makes a similar
- claim, however most Lex-based analyzers seem to be considerably slower
- than hand-crafted ones), my experience is that Yacc-generated parsers
- are in fact fast, at least efficient enough for most applications
- (such as Turbo Pascal Yacc itself). The major bottleneck for compila-
- tion speed almost never seems to be the parser, but almost always the
- lexical analyzer, see [5, section 6.2]. Furthermore, one always has to
- consider that manual implementation of parsers is usually much more
- costly, compared to the use of a parser generator.
-
- Personally, I prefer a parser generator, because I'm really a lazy
- programmer; and if something seems not to be running at optimal speed,
- so what? We can always sit back and wait for still more efficient
- hardware to come (just kidding).
-
- CONCLUSION
-
- The Turbo Pascal Lex and Yacc versions described in this manual have
- been designed and tested carefully. I have used them myself to imple-
- ment, among other applications: a lexical analyzer and parser for Pas-
- cal (using a public domain ISO Level 0 grammar, also included in the
- distribution); Turbo Pascal Yacc itself, using bootstrapping; and a
- term rewriting system compiler.
-
- Also, quite a lot of smaller text and data processing and conversion
- routines have been implemented by the author, and others, using these
- programs.
-
- Personally, I feel that these tools are quite convenient and useful,
- and can safe a lot of trouble and time in software development,
- although they surely could still be improved in one direction or the
- other.
-
- Compiler construction tools are not only useful for the compiler
- writer, but can also be applied in the development of almost any other
- software tool that, in some sense, defines an input language. Also,
- the use of such utilities facilitates rapid prototyping, and enables
- the programmer to clarify language design issues in early stages of
- software projects.
-
- Turbo Pascal Lex and Yacc, as a starting point, bring to the Turbo
- Pascal programmer some of the merits of theoretically founded compiler
- technology, and thus may facilitate some of his work in trying to
- produce good, and reliable software.
-
- Author's address: Albert Graef, FB Mathematik, Johannes Gutenberg-
- Universitaet Mainz, 6500 Mainz (FRG). Email: Graef@DMZRZU71.bitnet.
-
- REFERENCES
-
- [1] Aho, Alfred V.; Ravi Sethi; Jeffrey D. Ullman: Compilers : prin-
- ciples, techniques and tools. Reading, Mass.: Addison-Wesley,
- 1986.
-
- [2] Johnson, S.C.: Yacc - yet another compiler-compiler. Murray
- Hill, N.J.: Bell Telephone Laboratories, 1974. (CSTR-32).
-
- [3] Lesk, M.E.: Lex - a lexical analyser generator. Murray Hill,
- N.J.: Bell Telephone Laboratories, 1975. (CSTR-39).
-
- [4] Schreiner, A.T.; H.G. Friedman: Introduction to compiler con-
- struction with UNIX. Prentice-Hall, 1985.
-
- [5] Waite, William M.; Gerhard Goos: Compiler construction. New
- York: Springer, 1985. (Texts and monographs in computer
- science).
-
- APPENDIX: LEX AND YACC MANUAL PAGES
-
- NAME
-
- Lex - lexical analyzer generator (MS-DOS/Turbo Pascal version)
-
- SYNOPSIS
-
- lex lex-file-name[.l] [output-file-name[.pas]]
-
- DESCRIPTION
-
- Lex compiles the regular expression grammar contained in lex-file-name
- (default suffix: .l) to the Turbo Pascal representation of a lexical
- analyzer for the language described by the input grammar, written to
- output-file-name (default suffix: .pas; default: lex-file-name with
- new suffix .pas).
-
- For each pattern in the input grammar an action is given, which is an
- arbitrary Turbo Pascal statement to execute when the corresponding
- pattern is matched in the input stream.
-
- The lexical analyzer is implemented as a table-driven deterministic
- finite automaton (DFA) routine named yylex, declared as follows:
- function yylex : integer;
-
- The return value of yylex may be 0, denoting end-of-file; all other
- return values are defined by the programmer and set through ap-
- propriate actions.
-
- The yylex routine can be compiled with the Turbo Pascal compiler (tpc
- or turbo). It is to be called in the context of a Turbo Pascal main
- program using the LexLib unit (which can be a Yacc-generated parser or
- any other program in a separate file, or incorporated into the input
- specification, and is to be supplied by the programmer).
-
- EXAMPLE
-
- A simple Lex program that counts words in an input file (obtained from
- standard input) can be implemented as follows:
- uses LexLib;
- var count : integer;
- %%
- [^ \t\n]+ inc(count);
- . |
- \n ;
- %%
- begin
- count := 0;
- if yylex=0 then writeln('word count: ', count)
- end.
-
- To compile and run this program, issue the following commands (assum-
- ing the Lex program to be in file wordcount.l):
- lex wordcount
- tpc wordcount
- wordcount <input-file
-
- DIAGNOSTICS
-
- In case of syntactic or semantic errors in the source file, Lex
- displays source line numbers and contents, error position and
- error message; a copy of the error messages is written to the file
- lex-file-name with new suffix .lst.
-
- NAME
-
- Yacc - yet another compiler-compiler (MS-DOS/Turbo Pascal version)
-
- SYNOPSIS
-
- yacc yacc-file-name[.y] [output-file-name[.pas]]
-
- DESCRIPTION
-
- Yacc compiles the BNF-like grammar contained in yacc-file-name
- (default suffix: .y) to the Turbo Pascal representation of an LALR(1)
- parser for the specified language, written to output-file-name
- (default suffix: .pas; default: yacc-file-name with new suffix .pas).
-
- Also, it generates a header file (output file name with new suffix .h)
- containing declarations to be shared between parser and the lexical
- analyzer routine yylex (discussed below), and a report file (yacc-
- file-name with new suffix .lst) that contains a description of the
- generated parser.
-
- The grammar rules in the specification can be augmented with actions,
- Turbo Pascal statements to execute when an instance of the correspond-
- ing grammar rule has been matched in the input.
-
- The parser is implemented as a table-driven deterministic pushdown-
- automaton routine yyparse, that performs a non-backtracking, bottom-up
- shift/reduce parse. yyparse is declared as follows:
- function yyparse : integer;
-
- The return value of this function is either 0 (normal termination of
- the parse) or 1 (exception occurred during the parse, e.g. stack over-
- flow, unrecoverable syntax error, or programmer action called
- yyabort).
-
- The yyparse routine can be compiled with the Turbo Pascal compiler
- (tpc or turbo). It is to be called in the context of a Turbo Pascal
- main program using the YaccLib unit and a lexical analyzer routine
- function yylex : integer;
- which can also be prepared with Lex; these must be supplied by the
- programmer. The header file Yacc generates summarizes declarations to
- be shared between parser and the yylex routine.
-
- If the yyparse routine is compiled with defined conditional yydebug,
- i.e.
- tpc filename /Dyydebug
- yyparse will trace all parsing actions on standard output.
-
- EXAMPLE
-
- The sample desktop calculator supplied on the distribution disk con-
- sists of the main program and input grammar in file expr.y and a lexi-
- cal analyzer in the Lex source file exprlex.l. It can be compiled and
- run by issuing the commands:
- yacc expr
- lex exprlex
- tpc expr
- expr
-
- To trace the steps made by the parser, compile expr.pas with
- tpc expr /Dyydebug
-
- DIAGNOSTICS
-
- When encountering syntactic or semantic errors in an input grammar,
- Yacc gives diagnostics on standard output and in the report file
- (yacc-file-name with new suffix .lst).
-
- Upon successful compilation of the input grammar, Yacc will report the
- number of shift/reduce and reduce/reduce conflicts encountered when
- constructing the parser (if there are any); also, Yacc will report the
- number of grammar rules that are never used in a reduction, and issue
- warnings when nonterminal grammar symbols do not appear on the left
- side of at least one grammar rule. Such items, in particular:
- shift/reduce and reduce/reduce conflicts, are discussed in more detail
- in the report (.lst) file.