Programmer 7500

home *** CD-ROM | disk | FTP | other *** search

/ Programmer 7500 / MAX_PROGRAMMERS.iso / PASCAL / LEX_YACC.ZIP / MAN.TXT < prev next >

Wrap

Text File | 1990-01-18 | 73.9 KB | 1,756 lines

LEX AND YACC - COMPILER WRITER'S TOOLS FOR TURBO PASCAL VERSION 2.0 Albert Graef FB Mathematik Johannes Gutenberg-Universitaet Mainz November 22, 1989 (ASCII version of manual: January 12, 1990) ABSTRACT We describe a reimplementation of the compiler writer's tools Lex and Yacc for Borland's Turbo Pascal, running under MS-DOS. These programs are useful tools for the development of compilers, lexical analyzers, and similar applications, and are intended for experienced Turbo Pas- cal programmers with a background in compiler design, and for courses in compiler construction. Note: This is a raw ASCII facsimile of the original Lex and Yacc manual contained in the TeX DVI file MAN.DVI on the distribution disk. CONTENTS Introduction 1. Installation 2. Lex 2.1 Regular Expressions 2.2 Actions 2.3 Lex Library 2.4 Character Tables 2.5 Turbo Pascal Tie-ins 2.6 Implementation Restrictions, Bugs, and Incompatibilities 3. Yacc 3.1 Actions 3.2 Lexical Analysis 3.3 Yacc Library 3.4 Ambiguity 3.5 Error Handling 3.6 Arbitrary Value Types 3.7 Debugging 3.8 Yacc Language Grammar 3.9 Additional Features, Implementation Restrictions and Bugs Conclusion References Appendix INTRODUCTION This manual describes two popular compiler writer's tools, Lex and Yacc, which have seen extensive use on the UNIX system, and have been reimplemented by the author for the MS-DOS operating system. The original (UNIX) versions of these programs are described in [2,3]. Other public domain and commercial remakes of Lex and Yacc are avail- able under MS-DOS, e.g. from DECUS and Mortice Kern Systems. However, in difference to these implementations, the programs described in this manual are for use with Borland's Turbo Pascal, rather than with C. In particular, they support Turbo Pascal as their host and target programming language. The Turbo Pascal Lex and Yacc versions are an independent development of the author, not containing any fragments from the original sources, or other UNIX stuff (happily, the theory underlying Lex and Yacc has been published, e.g., in [1], und thus is public domain). However, the names Lex and Yacc (which, as far as I know, are not copyrighted) are justified by the fact that the programs described here are intended to be approximately compatible with the original versions. The intended audience for this manual are experienced (Turbo Pascal) programmers with knowledge of the basics of formal language and com- piler theory and students in compiler construction courses, not novices or casual programmers. Thus, the manual is particularily con- cise and compact, while trying to cover all essential aspects of Turbo Pascal Lex and Yacc. As a supplementary text, we strongly recommend the famous "dragon book" of Aho, Sethi and Ullman [1] which covers all important theoretical and practical aspects of compiler design and implementa- tion. The manual is organized as follows: Section 1 covers installation re- quirements and the installation process. Section 2 treats the lexical analyzer generator Lex; subsections are devoted to the format of regular expressions, the implementation of actions, the Lex library unit, character tables, Turbo Pascal tie-ins, implementation restric- tions, bugs, and incompatibilities. Section 3 discusses the parser generator Yacc; subsections deal with actions in Yacc grammars, the lexical analyzer routine used by Yacc-generated parsers, the Yacc library unit, ambiguities in Yacc grammars, syntactic error handling, arbitrary value types in actions, the debugging of Yacc-generated par- sers, the Yacc language syntax, and, finally, additional features, im- plementation restrictions and bugs. The appendix contains short des- criptions of Lex and Yacc in the style of UNIX manual pages. Note: Throughout the manual, the terms Lex and Yacc refer to the Turbo Pascal versions. The original (UNIX) versions are denoted by the terms UNIX or standard Lex resp. Yacc. 1. INSTALLATION Installation requirements: - IBM PC/XT/AT or compatible, 512 KB RAM - MS-DOS 3.10 or later (may also run under MS-DOS 2.x, but I have not tested it) - Turbo Pascal 4.0 or later (has been tested with 4.0 and 5.0) To install Turbo Pascal Lex and Yacc, simply copy the contents of the distribution disk to an appropriate disk and/or directory. You might also wish to put this directory on your DOS path. The programs generated with Lex and Yacc will need the LexLib and YaccLib units (.tpu files on the distribution disk) when compiled, so you might have to put them anywhere the Turbo Pascal compiler finds them (e.g., in the turbo.tpl library). Here's the contents of the distribution disk: lex.exe the Lex program lexlib.* source and .tpu file for the LexLib unit yacc.exe the Yacc program yacclib.* source and .tpu file for the YaccLib unit read.me if present, contains addenda to the manual makefile makefile for the sample programs *.l, *.y sample Lex and Yacc programs man.dvi TeX dvi file for the manual As shipped, the LexLib and YaccLib units are compiled with Turbo Pas- cal 4.0. If you're running Turbo Pascal 5.0 or later, you will have to recompile lexlib.pas and yacclib.pas with your version of the Turbo Pascal compiler. You can use the makefile to compile the sample programs on the dis- tribution disk (see the Turbo Pascal manual for a description of Bor- land's make, and the makefile for a description of its usage and the sample programs). To run Turbo Pascal Lex and Yacc on your grammar sourcefiles, refer to the manual pages in the appendix for a description of the Lex and Yacc command formats. 2. LEX Lex is a program to generate lexical analyzers from a given set of in- put patterns, specified as regular expressions. Table 1 summarizes the regular expressions Lex recognizes. In this table, c stands for any single character, r for a regular expression, and s for a string. Expression Matches Example -------------------------------------------------- c any non-operator character c a \c character c literally \* "s" string s literally "**" . any character but newline a.*b ^ beginning of line ^abc $ end of line abc$ [s] any character in s [abc] [^s] any character not in s [^abc] r* zero or more r's a* r+ one or more r's a+ r? zero or one r a? r{m,n} m to n occurrences of r a{1,5} r1r2 r1 then r2 ab r1|r2 r1 or r2 a|b (r) r (a|b) r1/r2 r1 when followed by r2 abc/123 -------------------------------------------------- Table 1: Lex regular expressions (taken from [1, fig. 3.48]). A Lex program, or grammar, in general, consists of three sections separated with the delimiter %%: definitions %% rules %% auxiliary procedures Both definitions and rules section may be empty, and the auxiliary procedures section may be omitted, together with the second %%. Thus, the minimal Lex program is %% (no definitions, no rules). The rules section of a Lex program is a table of regular expressions and corresponding actions, specifying patterns to be matched in the input, and (Turbo Pascal) program statements to be executed when a pattern has been matched: expression statement; ... Here, expression and statement are delimited with whitespace. The statement must be a single Turbo Pascal statement (use begin ... end for compound statements) terminated with a semicolon; if the statement consists of multiple lines, the continuation lines must be indented with at least one blank or tab character. An action may also be replaced by the symbol |, in which case it is assumed that the action for the current rule is the same as that for the next one. As already indicated, the auxiliary procedures section is optional. If it is present, it is assumed to contain valid Turbo Pascal code (such as supplementing routines, or a main program) which is simply tacked on to the end of the output (Turbo Pascal) program Lex produces. The definitions section of a Lex program may contain regular defini- tions of the form name expression defining a (regular expression) substitution for an identifier name (according to Turbo Pascal syntax). name and expression must be separated by whitespace. Note that in difference to Pascal, upper- and lowercase in identifiers is always distinct. The value of a regular definition for name can be referred to lateron using the notation {name}. Thus, regular definitions provide a sort of "constant declaration" for regular expressions. From the source grammar, Lex produces an output program, written in Turbo Pascal, that defines a parameterless function function yylex : integer; implementing the lexical analyzer. When called, yylex reads an input file (standard input, by default), scanning for the patterns specified in the source grammar, and executing the corresponding actions as pat- terns are matched. Normally, yylex scans the whole input file and returns with value 0 to the calling program upon encountering end-of-file (actions may also return other values to the calling program, cf. 2.2). Thus, in the normal case, a suitable main program calling yylex is something like: begin if yylex=0 then { done } end. Such a main program must be supplied by the programmer, e.g., in the auxiliary procedures section (there is no default main program in the Lex library, as with UNIX Lex). The lexical analyzer routine yylex scans for all patterns simultaneously. If more than one pattern is matched, the longest match is preferred; if there still remains more than one pattern making the longest match, the first such rule in the source grammar is chosen. This makes rules like if writeln('keyword if'); [A-Za-z][A-Za-z0-9]* writeln('identifier ', yytext); work as expected (i.e., input if will be matched by the first, if1 by the second rule). A Lex program may also be incomplete in that it does not specify a pattern for any possible input. In such a case, the lexical analyzer executes a default action on unrecognized parts of the input, which consists of copying the input to an output file (standard output, by default). Thus, the trivial Lex program %% yields a routine that copies the input to the output file unchanged. On the other hand, if the input has to be absorbed completely, the programmer must supply rules that match everything, e.g.: . | \n ; Example: The following Lex program counts words (sequences of non- whitespace characters) in an input file: uses LexLib; var count : integer; %% [^ \t\n]+ inc(count); . | \n ; %% begin count := 0; if yylex=0 then writeln('word count: ', count) end. A few remarks about the generated lexical analyzer routine are in or- der. Lex generates table-driven lexical analyzers in DFA technique [1, section 3.7] which usually are both quite compact and fast (though hand-coded lexical analyzers will often be more efficient). In particular, the matching time is proportional to the length of the in- put, unless ambiguity and lookahead requires a significant amount of rescanning. There are certain pathological regular expressions which cause exponential growth of the DFA table, however, they are rare. Lex-generated lexical analyzers interface nicely with Yacc, because the yylex routine just meets Yacc's requirements of its lexical analyzer routine (actually, it was designed that way). Thus, a yylex routine prepared with Lex can be incorporated directly into a Yacc- generated parser. 2.1 REGULAR EXPRESSIONS In the Lex language regular expressions are used to denote string pat- terns to be matched in the input stream. The basic regular expressions are (cf. table 1): - single characters: c stands for the literal character c itself, un- less it is an operator character (one of *, +, ?, etc., discussed below), in which case it must be quoted with the backslash \. Non- printable characters are specified using the C-like escapes listed in table 2. Note that \0 marks end-of-file, and \n (newline) stands for end-of-line, i.e. the sequence carriage-return/line-feed in MS- DOS text files. Escape Denotes ---------------------------------- \b backspace \t tab \n newline \nnn character no. nnn in octal \\ backslash \c character c literally ---------------------------------- Table 2: Lex character escapes. - strings: "s", where s is any character sequence, stands for the string s. To embed the double quote " in a string, it must be quoted with \. - character classes: [s] stands for all characters in s, and [^s] denotes the complement of [s]. A - sign in a character class denotes ranges, e.g. [a-z] is the class of all lowercase letters. The period . is an abbreviation for the class of all characters except newline, i.e. [^\n]. Note that a character class never contains the end-of- file marker \0, unless it is explicitly included. From these basic elements, larger regular expressions may be formed using the following operators: - r*: stands for an arbitrary sequence of r's (0 or more), where r is any regular expression. - r+: stands for 1 or more r's. - r?: stands for 0 or 1 r. - r{m,n}: stands for m to n r's (where m and n are nonnegative in- tegers). r{m} denotes excactly m r's. - r1r2: stands for r1, followed by r2, where r1 and r2 are arbitrary regular expressions. - r1|r2: stands for r1 or r2. The operators have been listed in order of decreasing precedence, i.e. *, +, ? and {m,n} bind stronger than r1r2 (concatenation), which in turn precedes over | (alternation). Parentheses ( ... ) can be used to group regular expressions and override default precedences. As already mentioned, subexpressions may be abbreviated with names, using regular definitions. If name has been defined as an expression r, {name} specifies the regular expression r. Note that (in difference to UNIX Lex) the substituted expression r must always be a complete legal regular expression, which is actually substituted as an expres- sion, not textually. This implies that {name} is treated as a parenthesized expression. Also note, that any references to name must follow its definition, i.e. recursive definitions are illegal. Lex also supplies a number of operators that are used to specify left and right context. The context does not actually belong to the pat- tern, but determines whether the pattern itself may be matched. Right context, or lookahead is specified using the lookahead operator /. r1/r2 stands for r1, but only if followed by r2, where r1 and r2 are arbitrary regular expressions (which may, however, not contain the lookahead operator themselves). r$ may be used as an abbreviation for r/[\0\n], i.e. r followed by line end or end-of-file. The caret ^ stands for the beginning of the line, and thus marks im- mediate left context. More distant left context may be specified using user-defined start states. The expression <s1,...,sn>r denotes a pat- tern r that is valid (i.e., may be matched) only if the lexical analyzer is in any of the start states s1,...,sn. This requires one or more start state declarations in the definitions section that list all used start state identifiers (which, like expression names, must be legal Turbo Pascal identifiers). %start s1 s2 ... sn and %start s1 ... %start sn are examples of valid start state declarations. By default, the lexical analyzer is in the default start state, which has number 0. Lex also assigns unique numbers to all user-defined start states, and the begin_ routine (cf. 2.2) can be used in an ac- tion or any other routine to put the lexical analyzer in the desired start state. Start states are useful when certain patterns have to be analyzed dif- ferently, depending on some left context (such as a special character at the beginning of the line), or when multiple lexical analyzers are working in concert. Note that a rule without start state prefix is valid in the default start state, as well as in any user-defined start state. This may be used to factor out patterns that are to be matched in either user-defined state. All the context operators may only appear in rules, not in regular definitions, and they may appear only once. Of course, context operators may be combined, as in <s>^r1/r2 which denotes a pattern r1 that may only be matched if it occurs at the beginning of a line, is followed by an instance of r2, and if the lexical analyzer is in user- defined start state s. 2.2 ACTIONS A rule specifies a regular expression pattern and an action (a Turbo Pascal statement) to be executed when the pattern is matched. Lex sup- plies a number of variables and routines useful in the programming of actions: - yytext: a string variable, containing the matched text. - yyleng: the length of yytext, i.e. length(yytext). Note that the first and last character in yytext are yytext[1] and yytext[yyleng], respectively. - yylineno: the current line number in the input file; useful, e.g., in giving diagnostics. - yymore: appends the next match to the current one (normally, the next match will overwrite the current one). - yyless: the counterpart of yymore; yyless(n) causes the current match to be restricted to the first n characters, and returns the remaining characters at the end of the match to be reread by the lexical analyzer. This supplies a crude form of "lookahead". Since Lex also supports a more general form of lookahead (cf. 2.1), this routine is largely obsolete. - reject: rejects the current match and executes whatever was "second choice" after the current rule, adjusting the match accordingly. This routine is useful when all (possibly overlapping) instances of patterns have to be detected; see the digram program on the dis- tribution disk for an example. - return: return(n), n an integer value, causes yylex to return with the indicated value. This routine is useful when the lexical analyzer is used to partition the input file for a parser; n will then typically denote a token number (cf. 3.2). - begin_: begin_(s) puts the lexical analyzer in start state s (cf. 2.1), where s is either 0 (default start state) or names a user- defined start state. These, and other Lex-supplied variables and routines are also dis- cussed in the interface of the LexLib unit (file lexlib.pas on the distribution disk). 2.3 LEX LIBRARY The LexLib (Lex library) unit supplies the input/output routines used by the lexical analyzer. The I/O files are implemented as Pascal text files yyin and yyout. These are - by default - assigned to the redirectable MS-DOS standard input/output devices. However, the user program may also assign them to any suitable files/devices. yylex accesses the input/output files through the following routines: - function input : char; returns the next character from the input file. - procedure unput(c : char); returns character c to the input buffer to be reread by a subsequent call to input. - procedure output(c : char); appends character c to the output file. All references to the yyin/yyout files of the LexLib unit are made through these three routines. Thus, a user program may well replace them altogether by other routines matching the specifications. This makes it possible for the lexical analyzer to access arbitrary in- put/output streams, such as special devices or internal memory. The LexLib I/O routines and files may also be accessed directly through actions or the main program. However, care must be heeded un- der certain circumstances. In particular, direct access to the yyin file will bypass the buffering of unput characters and thus may some- times not have the desired results. The yywrap routine is a parameterless boolean function that determines whether the lexical analyzer should perform normal wrapup at end-of- file. If it returns true, normal wrapup is performed; if it returns false, yylex ignores the end-of-file mark and continues lexical analysis. The LexLib unit supplies a default version of yywrap which always returns true. This routine may be replaced by a customized ver- sion that does application dependent processing at end-of-file. In particular, yywrap may arrange for more input and return false to resume lexical analysis (see the findproc program on the distribution disk for an example). Note that the LexLib unit must be loaded with (almost) any Lex program such that the lexical analyzer routine may access I/O and other routines. To achieve this, the line uses LexLib; should be put at the beginning of the Lex program (see section 2.5 on how to incorporate Turbo Pascal source lines into the definitions sec- tion of a Lex program). Refer also to the LexLib interface description contained in lexlib.pas on the distribution disk for a discussion of the LexLib I/O system and other routines. 2.4 CHARACTER TABLES The standard character encoding supported both by Lex and the LexLib unit is ASCII (to be more precise, IBM's 8-bit extension of ASCII). However the user may supply his own versions of input, unput and out- put (cf. 2.3), supporting their own character encoding. If such a customized character set is used, Lex must be told about it by means of a character table in the definitions section of the Lex program. This table has the format: %T charno. string ... %T Each line of the character table lists a character number (in the target code) and the corresponding ASCII representation(s) of this character. The usual escapes for non-printable characters are recog- nized (cf. table 2). Example: To map the lower- and uppercase letters into 1,...,26, the digits 0,...,9 into codes 27,...,36, and newline into 37, the fol- lowing table may be used: %T 1 Aa 2 Bb 3 Cc ... 26 Zz 27 0 28 1 ... 36 9 37 \n %T If a character table is used, all characters (at least those actually used in the Lex grammar) should be in the table; all character numbers must be byte values (0..255), and no character may be mapped into two different codes. 2.5 TURBO PASCAL TIE-INS Frequently, a Lex program will not merely consist of definitions and rules, but also use other routines to be loaded with the lexical analyzer. One example is a main program that calls the yylex routine. We have already mentioned, that such a main program, and other sup- plementary routines, may be placed into the auxiliary procedures sec- tion. For other target language (i.e. Turbo Pascal) tie-ins, the Lex language allows arbitrary code fragments to be included in the defini- tions and at the beginning of the rules section of the Lex grammar, by the following means: - Any line in the source grammar starting in column one is assumed to contain Lex code (definitions or rules), which is processed by Lex. - Any line indented with at least one space or tab character, and any sequence of lines enclosed between %{ and %} is assumed to contain Turbo Pascal code, and is copied to the output file unchanged. Code in the definitions section is inserted at the beginning of the output program, at global scope, while code at the beginning of the rules section is inserted as local declarations into the action routine that contains the actions of all rules. Thus, an indented line var i : integer; at the beginning of the rules section will declare an integer variable local to the action statements. As a side-effect, these conventions allow comments to be included in a Lex program; comments should then follow host language (i.e. Turbo Pascal) syntax. Example: A typical setup for a Lex program with necessary declara- tions, supplementary routines, and main program may look as follows: uses LexLib; { global declarations } { Lex definitions } %% { local declarations } { Lex rules } %% { supplementary routines } begin { main program } ... if yylex=0 then { done }; ... end. 2.6 IMPLEMENTATION RESTRICTIONS, BUGS, AND INCOMPATIBILITIES Lex poses some restrictions on the sizes of the input grammar and in- ternal tables. Maximum table sizes are printed out by Lex at the end of a translation, along with statistics about actual table space re- quirements. An error message table overflow can also indicate that not enough main memory is available to Lex (possibly because of too many programs loaded into memory). Since yytext is implemented as a Turbo Pascal string variable, the maximum size for a matched string is 255. As implemented, the reject routine (cf. 2.2) does not rescan the in- put, but uses internal state information to determine the next pos- sible match. Thus reject combined with modifications of the input stream may yield unexpected results. There is a subtle (and, as far as I know, undocumented) bug in Lex, concerning certain types of lookahead patterns, that sometimes causes lookahead not to be restored properly. E.g., when the pattern ab*/b is matched, the last b is never returned to the input, but instead the whole matched sequence will be returned in yytext. This actually is a "misfeature", and seems to be comformant with UNIX Lex and even with the method of Aho/Sethi/Ullman for handling the lookahead operator [1, section 3.8], which Lex' treatment of the lookahead operator is based on. When called, yylex partially initalizes itself. This implies that yymore and reject will not work between different invokations of yylex. As discussed in 2.1, Lex substitutes expressions, not text, when a regular definition is referred to with the {name} notation. This is in contrast with the (textual) macro expansion scheme used in UNIX Lex. Although the approach taken in Turbo Pascal Lex is more restrictive, we feel that it is actually an improvement. In particular, regular definitions can be parsed and checked for validity immediately, such that errors in regular definitions can be detected as soon as pos- sible. Also, the meaning of a regular definition is guaranteed to be independent of the context in which it is used. Another (minor) difference between Turbo Pascal and UNIX Lex in the syntax of regular definitions is that Turbo Pascal Lex requires names for regular expressions to be legal (Turbo Pascal) identifiers, whereas UNIX Lex admits arbitrary character strings as names. 3. YACC Yacc ("Yet Another Compiler-Compiler") is a parser generator, i.e. a program that translates the specification of an input language by its BNF (Backus Naur Form) grammar into a parser subroutine, written in Turbo Pascal. Similar to Lex, a Yacc program or grammar has the form definitions %% rules %% auxiliary procedures where the first section contains the definitions of the basic input symbols (terminals, also termed tokens) of the specified language, the second section the grammar rules for the nonterminal symbols of the language, and the third, and optional, section contains any additional Turbo Pascal code, such as supplementary routines and a main program, which will be tacked on to the end of the Turbo Pascal output program. The rules section of a Yacc program is simply a BNF grammar for the target language, possibly augmented with actions, program statements to be executed as certain syntactic constructs are recognized by the parser. By default, the left-hand side of the first rule marks the start symbol of the grammar. It is also possible to declare a start symbol explicitly by means of a declaration of the form %start A in the definitions section, where A is the desired start symbol. Grammar rules have the general format A : b1 ... bn; where A is the left-hand side nonterminal of the rule, and b1 ... bn is the (possibly empty) right-hand side sequence of nonterminal and terminal symbols bi. The terminating semicolon may be omitted, and several rules A : ui for the same left-hand side nonterminal A may be abbreviated as A : u1 | u2 ... | un Nonterminal symbols are denoted by identifiers (letters, followed by digits and letters, where underscore _ and period . count as letters, and upper- and lowercase are distinct). Terminal symbols may either be literals (single characters enclosed in single quotes) or identifiers that are declared explicitly as terminals by a %token definition of the form %token a1 ... an By convention, token identifiers are given in uppercase, such that they can be distinguished easily from nonterminal symbols. In literals, the usual C-like escapes are recognized (cf. table 3). Escape Denotes ------------------------------------ '\n' newline '\r' carriage return '\'' single quote ' '\\' backslash '\t' tab '\b' backspace '\f' form feed '\nnn' character no. nnn in octal ------------------------------------ Table 3: Yacc character escapes. Grammar rules may be augmented with actions, Turbo Pascal statements enclosed between { ... }. Usually, actions appear at the end of rules, indicating the statements to be executed when an instance of a rule has been recognized (cf. 3.1). The Yacc language is free-format: blanks, tabs, and newlines are ig- nored, except when they serve as delimiters. Yacc language comments have the format: /* ... anything except */ ... */ As with Lex, host language (i.e. Turbo Pascal) tie-ins may be specified by enclosing them in %{ ... %}. Such code fragments will be copied unchanged, and inserted into the output file at appropriate places (code in the definitions section at global scope, code at the beginning of the rules section as local declarations of the action routine). The class of grammars accepted by Yacc is LALR(1) with disambiguating rules (cf. [1, sections 4.7 and 4.8]). Yacc can successfully produce parsers for a large class of formal languages, including most modern programming languages (under UNIX, Yacc has been used to produce par- sers for C, Fortran, APL, Pascal, Ratfor, Modula-2, and others). From the source grammar, Yacc produces an output file containing the parser routine function yyparse : integer; together with any additional Turbo Pascal code supplied by the programmer. yyparse repeatedly calls a lexical analyzer routine yylex to obtain tokens from the input file, and parses the input accordingly, execut- ing appropriate actions as instances of grammar rules are recognized. yyparse returns with a value of 0 (successful parse terminated at end- of-file) or 1 (fatal error, such as parse stack overflow, or un- recoverable syntax error). Thus, a main program that calls the parser routine may look as fol- lows: begin ... if yyparse=0 then { done } else { error }; ... end. Main program and lexical analyzer routine must be supplied by the programmer. The yylex routine can also be prepared with Lex and then loaded with the parser, cf. 3.2. The main program is usually included at the end of the auxiliary procedures section. The following is an example of a Yacc grammar for simple arithmetic expressions. Note that the symbol NUMBER is declared as a token ex- pected to be returned by the lexical analyzer as a single input sym- bol. %token NUMBER %% expr : term | expr '+' term ; term : factor | term '*' factor ; factor : NUMBER | '(' expr ')' | '-' factor ; A Lex program implementing the lexical analyzer for this Yacc program may look as follows: {$I expr.h} { definition of token numbers produced by Yacc } %% [0-9]+ return(NUMBER); . | \n return(ord(yytext[1])); { other literals returned as their character codes } 3.1 ACTIONS As already indicated, grammar rules may be associated with actions, program statements that are executed as rules are recognized during the parse. Among other things, actions may print out results, modify internal data structures such as a symbol table, or construct a parse tree. Actions may also return a value for the left-hand side non- terminal, and process values returned by previous actions for right- hand side symbols of the rule. For this purpose, Yacc assigns to each symbol of a rule a correspond- ing value (of type integer, by default, but arbitrary value types may be declared, cf. 3.6): $$ denotes the value of the left-hand side non- terminal, $i the value of the ith right-hand side symbol. Values are kept on a stack maintained by the parser as the input is parsed. The lifetime of a value begins when the corresponding syntactic entity (nonterminal or terminal) has been recognized by the parser, and ends when the parser reduces by an enclosing rule, thereby replacing the values associated with the right-hand side symbols by the value of the left-hand side nonterminal. Nonterminals A obtain their values through assignments of the form $$ := v, where A is the left-hand side of the corresponding rule, and v is some value, usually obtained by a function applied to the right- hand side values $i. Terminals may also have associated values; these are set by the lexi- cal analyzer through an assignment to the variable yylval supplied by Yacc (cf. 3.2), as necessary. As an example, here is an extension of the arithmetic expression gram- mar, featuring actions that evaluate the input expression and a rule for the new start symbol line that is associated with an action that prints out the obtained result. %token NUMBER %% line : expr '\n' { writeln($1) } ; expr : term { $$ := $1 } | expr '+' term { $$ := $1 + $3 } ; term : factor { $$ := $1 } | term '*' factor { $$ := $1 * $3 } ; factor : NUMBER { $$ := $1 } | '(' expr ')' { $$ := $2 } | '-' factor { $$ := -$2 } ; Note that the lexical analyzer must set the values of NUMBER tokens, which are referred to by $1 in the rule factor : NUMBER. One can use a Lex rule like var code : integer; [0-9]+ begin val(yytext, yylval, code); return(NUMBER) end; that applies the Turbo Pascal standard procedure val to evaluate a NUMBER token. Actually, we could have omitted the "copy actions" of the form $$ := $1 in the above grammar, since this is the default action automatical- ly assumed by Yacc for any rule without explicit action. Yacc also allows actions within rules, i.e. actions that are to be ex- ecuted before a rule has been fully parsed. A rule like A : b { p; } c will be treated as if it was written A : b $act c $act : { p; } introducing a new nonterminal $act matched to an empty right-hand side, and associated with the desired action. In particular, the action { p; } is treated as if it was a (nontermi- nal) grammar symbol, and thus can also return a value accessible with the usual $i notation. The action itself may also access values of symbols (and other actions) to the left of it. Thus, the rule A : b { p; } c is actually treated as if it consisted of three right-hand side symbols; $1 denotes the value of b, $2 the value of { p; } (set in p by an assignment to $$) and $3 is the value of c. Yacc's syntax-directed evaluation scheme makes it particularily easy to implement synthesized attributes along the guidelines of [1, sec- tion 5.6]. The evaluation of inherited attributes can often also be simulated by making use of marker nonterminals, see also [1]. To ac- cess marker nonterminals outside the scope of the rule to which they belong, Yacc also supports the notation $i, where i<=0, indicating a value to the left of the current rule, belonging to an enclosing rule ($0 denotes the first symbol to the left, $-1 the second, ...). Consider A : b B | b b' { $$ := $1 } B B : c { $$ := f($0) } The anonymous marker nonterminal implemented by the action { $$ := $1 } assures that the value of b can always be accessed through $0 in the third rule. Note that without use of the marker nonterminal, the rela- tive position of b's value on the stack would not be predictable. Actions within rules, and access to values in enclosing rules, supply flexible means to implement syntax-directed evaluation schemes. However, care must be heeded that such actions do not give rise to un- wanted parsing conflicts caused by ambiguities (cf. 3.4). 3.2 LEXICAL ANALYSIS Yacc-generated parsers use a lexical analyzer routine yylex to obtain tokens from the input file. This routine must be supplied by the user, and is assumed to return an integer value denoting a basic input sym- bol. 0 (or negative) denotes end-of-file, and character literals are denoted by their character code. Usually, all other token numbers are assigned by Yacc automatically, in the order in which %token defini- tions appear in the grammar. Token numbers may also be assigned ex- plicitly, by a definition of the form %token a n where a is the terminal symbol (literal or identifier), and n is the desired token number. If there is a value associated with an input symbol, yylex should as- sign this value to the variable yylval supplied by Yacc. Usually, yyl- val has type integer, but this default can be overwritten (cf. 3.6). Declarations shared by parser and lexical analyzer are put in the header (.h) file Yacc generates along with the (.pas) output file con- taining the parser routine. The header file declares the yylval vari- able and lists the token numbers; each token identifier is declared as a corresponding integer constant. The header file should thus be included in a context where it is ac- cessible by both the parser and the lexical analyzer. For instance, one may include both header file and lexical analyzer, in that order, in the definitions section of the grammar, by means of the Turbo Pas- cal include directive ($I): %{ {$I header filename } {$I lexical analyzer } %} As has already been indicated, the lexical analyzer generator Lex dis- cussed in section 2 of this manual is a useful tool to produce lexical analyzers to be incorporated into Yacc-generated parsers. 3.3 YACC LIBRARY The Yacc library unit YaccLib contains some default declarations used by Yacc-generated parsers. It should therefore be loaded with yyparse, which can be achieved by the uses clause %{ uses Yacclib; %} at the beginning of the Yacc program. Note that if the program in- cludes a lexical analyzer prepared with Lex, the LexLib unit may also be required: %{ uses Yacclib, LexLib; %} The routines implemented by the Yacc library are defaults that can be customized for the target application. In particular, these are the yymsg message printing routine and the yydebugmsg debug message print- ing routine. The Yacc library also declares defaults for the value- type YYSTYPE (cf. 3.6) and the size of the parser stack (yymaxdepth; see also 3.9). Refer to the interface description of the YaccLib unit for further in- formation. The interface also describes some additional variables and routines which are not actually implemented by the Yacc library, but are contained in the Yacc output file. Some of these will also be men- tioned in subsequent sections. 3.4 AMBIGUITY If a grammar is non-LALR(1), Yacc will detect parsing conflicts when constructing the parse table. Such parsing conflicts are usually caused by inconsistencies and ambiguities in the grammar. Yacc will report the number of detected parsing conflicts, and will try to resolve these conflicts, using the methods outlined in [1, section 4.8]. Thus, Yacc will generate a parser for any grammar, even if it is non-LALR. However, if unexpected parsing conflicts arise, it is wise to consult the parser description (.lst file, cf. 3.7) generated by Yacc to determine whether the conflicts were resolved correctly. Otherwise the parser may not behave as expected. An example of an ambigious grammar is the following, specifying a Pascal-like syntax of IF-THEN-ELSE statements: %token IF THEN ELSE %% stmt : IF expr THEN stmt /* 1 */ | IF expr THEN stmt ELSE stmt /* 2 */ The ambiguity in this grammar fragment, often referred to as the dangling-else-ambiguity, stems from the fact that it cannot be decided to which THEN a nested ELSE belongs: is IF e1 THEN IF e2 THEN s1 ELSE s2 to be interpreted as: (1) IF e1 THEN ( IF e2 THEN s1 ELSE s2 ) or as: (2) IF e1 THEN ( IF e2 THEN s1 ) ELSE s2 ? Let us take a look at how such an ambigious construct would be parsed. When the parser has seen IF e2 THEN s1, it could recognize (reduce by) the first rule, yielding the second interpretation; but it could as well read ahead, shifting the next symbol ELSE on top of the parser stack, then parse s2, and finally reduce by rule 2, which in effect yields the first interpretation. Thus, upon seeing the token ELSE, the parser is in a shift/reduce conflict: it cannot decide between shift (the token ELSE) and reduce (by the first rule). In the dangling-else example, the grammar - with some effort - can be rewritten to eliminate the ambiguity, see [1, section 4.3]. However, this is not possible in general (there are languages that are in- trinsically ambigious). Furthermore, an unambigious grammar may still be non-LALR (see [1, exercise 4.40] for an example). In particular, parsing decisions in an LALR parser are based on one-symbol-lookahead, which limits the class of grammars that may be used to construct a `pure' LALR parser. As implemented, Yacc always resolves shift/reduce conflicts in favour of shift. Thus a Yacc generated parser will correctly resolve the dangling-else ambiguity (assuming the common rule, that an ELSE belongs to the last unmatched THEN). Another type of ambiguity arises when the parser has to choose between two different rules. Consider the following grammar fragment (for sub- and superscripted expressions in the UNIX equation formatter eqn): %token SUB SUP %% expr : expr SUB expr SUP expr /* 1 */ | expr SUB expr /* 2 */ | expr SUP expr /* 3 */ The rationale behind this example is that an expression involving both sub- and superscript is often set differently from a superscripted subscripted expression. The ambiguity arises in an expression of the form e1 SUB e2 SUP e3. At the end of the expression, the parser can apply both rule 1 (reduce the whole expression) and rule 3 (reduce the subexpression e2 SUP e3). This type of conflict is termed reduce/reduce conflict. Yacc resolves reduce/reduce conflicts in favour of the rule listed first in the Yacc grammar. Thus, "special case constructs" like the one above may be specified by listing them ahead of the more general rules. In our ex- ample, e1 SUB e2 SUP e3 is always interpreted as an instance of the first rule (which presumably is the intended result). To summarize, in absence of other strategies (to be discussed below), Yacc applies the following default disambiguating rules: - a shift/reduce conflict is resolved in favour of shift. - in a reduce/reduce conflict, the first applicable grammar rule is preferred. In any case, the number of shift/reduce and reduce/reduce conflicts is reported by Yacc, since they could indicate inconsistencies in the grammar. Also, Yacc reports rules that are never reduced (possibly, because they are completely ruled out by disambiguating rules). A more detailed description of the detected conflicts may be found in the parser description (cf. 3.7). The default disambiguating rules are often inappropriate in cases where an ambigious grammar is deliberately chosen as a more concise representation of an (unambigious) language. Consider the following grammar for arithmetic expressions: %token NUMBER %% expr : expr '+' expr | expr '*' expr | NUMBER | '(' expr ')' ; There are several reasons why such an ambigious grammar might be preferred over the corresponding unambigious grammar: %token NUMBER %% expr : term | expr '+' term ; term : factor | term '*' factor ; factor : NUMBER | '(' expr ')' ; In particular, the ambigious grammar is more concise and natural, and yields a more efficient parser [1, section 4.8]. The ambiguities in the first grammar may be resolved by specifying precedences and associativity of the involved operator symbols. This may be done by means of the following precedence definitions: - %left operator symbols: specifies left-associative operators - %right operator symbols: specifies right-associative operators (e.g., exponentiation) - %nonassoc operator symbols: specifies non-associative operators (two operators of the same class may not be combined; e.g., relational operators in Pascal) Each operator symbol may be a literal or a token-identifier; token- names appearing in precedence definitions may, but need not be declared with %token as well. Each precedence declaration introduces a new precedence level, lowest precedence first. For the example above, assuming '+' and '*' to be left-associative, and '*' to take precedence over '+', the correspond- ing Yacc grammar is: %token NUMBER %left '+' %left '*' %% expr : expr '+' expr | expr '*' expr | NUMBER | '(' expr ')' ; This grammar unambigiously specifies how arbitrary expressions are to be parsed; e.g., e1+e2+e3 will be parsed as (e1+e2)+e3, and e1+e2*e3 as e1+(e2*e3). Yacc resolves shift/reduce conflicts using precedences and as- sociativity in the following manner. With each grammar rule, it as- sociates the precedence of the righmost terminal symbol (this default may be overwritten using a %prec tag, see below). Now, when there is a conflict between shift a and reduce r, a a terminal and r a grammar rule, and both a and r have associated precedences p(a) and p(r), respectively, the conflict is resolved as follows: - if p(a)>p(r), choose `shift'. - if p(a)<p(r), choose `reduce'. - if p(a)=p(r), the associativity of a determines the resolution: - if a is left-associative: `reduce'. - if a is right-associative: `shift'. - if a is non-associative: `syntax error'. Shift/reduce conflicts resolved with precedence will, of course, not be reported by Yacc. Occasionally, it may be necessary to explicitly assign a precedence to a rule using a %prec tag, because the default choice (precedence of rightmost terminal) is inappropriate. Consider the following example in which '-' is used both as binary and unary minus: %token NUMBER %left '+' '-' %left '*' '/' %right UMINUS %% expr : expr '+' expr | expr '-' expr | expr '*' expr | expr '/' expr | NUMBER | '(' expr ')' | '-' expr %prec UMINUS ; UMINUS is not an actual input symbol, but serves to give unary minus a higher precedence than any other operator. Note that, by default, the last rule would otherwise have the precedence of (binary) '-'. 3.5 ERROR HANDLING This section is concerned with the built-in features Yacc provides for syntactic error handling. By default, when a Yacc-generated parser en- counters an errorneous input symbol, it will print out the string syntax error via the yymsg routine and return to the calling program with the value 1. Usually, an application program will have to do bet- ter than this, e.g. print appropriate error messages, and resume pars- ing after syntax errors. For this purpose, Yacc supplies a flexible error recovery scheme based on the notion of "error rules", cf. [1, section 4.8 and 4.9]. The predefined token error is reserved for error handling; it is in- serted at places were syntax errors are expected, yielding the so- called error rules. A transition on the error token is never taken during a normal parse, but only when a syntax error occurs. In this case the parser pretends it has seen an error token immediately before the offending input symbol. Since in general the current state of the parser may not admit a transition on this fictitious error symbol, the parser pops its stack until it finds a suitable state, which admits a transition on the error token. If no such state exists, the default error handler is used which prints out syntax error and terminates the parse. If there is such a state, the parser shifts the error token on top of the stack, and resumes parsing. To prevent cascades of error messages, the parser then proceeds in a special "error" state in which other errorneous input symbols are quietly ignored. Normal parse is resumed only after three symbols have been read and accepted by the parser. As a simple example, consider the rule stmt : error { action } and assume a syntax error occurs while a statement is parsed. Then the parser will pop its stack, until the token error, as an instance of the rule stmt : error, can be accepted, shift the error token on top of the stack, reduce by the error rule immediately (executing action), and resume parsing in error state. The effect is, that the parser as- sumes that a statement (to which error reduces) has been found, and skips symbols until it finds something which can legally follow a statement. Similarly, the rule stmt : error ';' { action } will cause the parser to skip input symbols until a semicolon is found, after which the parser reduces by the error rule and executes action. Note that error rules are not restricted to the simple forms indicated above, but may consist of an arbitrary number of terminal and non- terminal symbols. Occasionally, the three-symbols resynchronization rule is inadequate. Consider, for example, an interactive application with an error rule input : error '\n' { write('reenter last line: ') } input { $$ := $4 } ; An error in an input line will then cause the parser to skip ahead be- hind the following line end, emit the message reenter last line:, and read another line. However, it will quietly skip invalid symbols on the new line, until three valid symbols have been found. This can be fixed by a call to the routine yyerrok which resets the parser to its normal mode of operation: input : error '\n' { write('reenter last line: '); yyerrok } input { $$ := $4 } ; There are a number of other Yacc-supplied routines which are useful in implementing better error diagnostics and handling: - yychar: an integer variable containing the current lookahead token. - yyclearin: deletes the current lookahead token. - yynerrs: current total number of errors reported with yymsg. - yyerror: simulates a syntax error; syntactic error recovery is started, as if the next symbol was illegal. - yyaccept: simulates accept action of the parser (yyparse returns 0). - yyabort: aborts the parse (yyparse returns 1), as if an un- recoverable syntax error occurred. These, and other routines are also described in the interface of the Yacc library (cf. 3.3). Syntactic error handling and recovery is a difficult area; see [1] for a more comprehensive treatment of this topic. The reader may also refer to [4] for a more detailed explanation of the Yacc error recovery mechanism and systematic techniques for developing error rules. 3.6 ARBITRARY VALUE TYPES As already noted, the default type for the $$ and $i values is integer (cf. 3.1). This type can be changed by putting a declaration %{ type YYSTYPE = some_type; %} into the definitions section of the Yacc program, prior to inclusion of the header and lexical analyzer file. Yacc also supports explicit declaration of a (record) value type, by means of a definition %union{ name(s) : type; ... } Such a definition will be translated to a corresponding (Turbo Pascal) record declaration which is put into the header file, and determines the type of stacked $$ values, as well as of the yylval variable (cf. 3.2). "Union tags" of the form <name>, where name is the name of a component of the record type, can then be assigned to terminal and nonterminal symbols through %token and %type definitions, respective- ly. Consider, for example, the definition: %union{ integer_val : integer; real_val : real; } and the grammar rules: %token INT %% expr : expr '+' expr { $$ := $1 + $3 } | expr '*' expr { $$ := $1 * $3 } | INT { $$ := $1 } /* ... */ To assign value type real to nonterminal expr, and type integer to the token INT, one might say: %token <integer_val> INT %type <real_val> expr The effect is, that Yacc will automatically replace references to $$ and $i values by the appropriate record tags, i.e. $$ := $1 + $3 will be treated as $$.real_val := $1.real_val + $3.real_val, and the action $$ := $1 associated with the third rule will be interpreted as $$.real_val := $1.integer_val. Also, when arbitrary value types are used, Yacc checks whether each value referred to by an action has a well-defined type. Occasionally, there are values whose types cannot be determined by Yacc easily. This is the case for the $$ value of an action within a rule, as well as for values $i, i<=0 of symbols in enclosing rules. For such values the notations <name> and <name>i must be used, respec- tively, where <name> denotes the appropriate union tag. The expr grammar on the distribution disk is a `real-life' example of the use of arbitrary value types. 3.7 DEBUGGING As experience shows, debugging a parser can be quite tedious. Although, with some experience, writing a grammar is a quite easy and straightforward task, implementing, for instance, a good error recovery scheme may be quite tricky. Yacc supplies two debugging aids that help verify a parser. First of all, the parser description (.lst file) is useful when determining whether Yacc correctly resolved parsing conflicts, and when tracing the actions of a parser. The .lst file gives a descrip- tion of all generated parser states. For each state, the set of kernel items and the parse actions are given. The kernel items correspond to the grammar rules processed by the parser in a given state; the un- derscore _ in an item denotes the prefix of the rule that has already been seen, and the suffix yet to come. The parser actions specify what action the parser takes on a given input symbol. Here, the period . denotes the default action that is taken on any input symbol not men- tioned otherwise. Possible actions are: - shift an input symbol on top of the parser stack, and change the parser state accordingly; - goto a new state upon a certain nonterminal recognized through the previous reduction; - reduce by a certain grammar rule; - accept, i.e. successfully terminate the parse; and - error, start syntax error recovery. The default action in a state may either be reduce or error. The parser description also lists parsing conflicts and rules that are never reduced (cf. 3.4). Consider the ambigious grammar: %token NUMBER %% expr : expr '+' expr | expr '*' expr | NUMBER ; This grammar, when fed into Yacc will cause a number of shift/reduce conflicts. For instance, the description of parser state 5 in the .lst file will read as follows: state 5: shift/reduce conflict (shift 3, reduce 2) on '*' shift/reduce conflict (shift 4, reduce 2) on '+' expr : expr '*' expr _ (2) expr : expr _ '+' expr expr : expr _ '*' expr $end reduce 2 '*' shift 3 '+' shift 4 . error As is apparent from this description, the conflicts are caused by missing precedences and associativities of '+' and '*'. Also, it can be seen that for both '+' and '*' Yacc chose shift, following the default shift/reduce disambiguating rule. Now let us resolve the conflicts in the grammar by adding appropriate precedence declarations: %token NUMBER %left '+' %left '*' %% expr : expr '+' expr | expr '*' expr | NUMBER ; Now, all conflicts are resolved and for the description of state 5 we get: state 5: expr : expr '*' expr _ (2) expr : expr _ '+' expr expr : expr _ '*' expr . reduce 2 Thus, Yacc correctly resolved the ambiguities by chosing reduction on any input, which corresponds to the higher precedence and left- associativity of '*'. There are situations in which a parser seems to behave "strangely" in spite of an "obviously correct" grammar. If the problem cannot be found by careful analysis of the grammar, it is useful to trace the actions performed by the parser, to get an idea of what goes wrong. For this purpose a parser may be compiled with defined conditional yydebug, e.g.: yacc parser tpc parser /Dyydebug When run, the parser will print out the actions it performs in each step (together with parser states, numbers of shifted symbols, etc.), which can then be followed on a hardcopy of the parser description. Debug messages are printed via the yydebugmsg routine (cf. 3.3). You may also wish to tailor this routine to your target application, such that it prints more informative messages. Of course, the discussion above was rather sketchy. A more detailed treatment of these topics, however, would require a presentation of the LALR parsing technique, which is well beyond the scope of this manual. The reader instead is referred to [1, sections 4.7 and 4.8] for more information about the LALR technique. Also, [4] gives a more detailed explanation of (UNIX) Yacc parse tables and debug messages, which can mostly be applied to Turbo Pascal Yacc accordingly. 3.8 YACC LANGUAGE GRAMMAR This section specifies the Yacc language syntax, as a Yacc grammar. Actually, the Yacc language is more naturally expressed by an LR(2) grammar; the difficulty is to decide on the base of one-symbol lookahead whether an identifier at the end of a rule is followed by a colon, in which case it starts the next rule. Thus, we distinguish the token C_ID (an identifier followed by a colon) and "ordinary" identifiers ID. It is assumed to be the task of the lexical analysis to determine to which of these two classes an identifier belongs. The following grammar has been abstracted from the Turbo Pascal Yacc grammar actually used to implement Turbo Pascal Yacc. %token ID /* identifier; also literals enclosed in quotes */ C_ID /* identifier followed by a colon */ NUMBER /* nonnegative integers */ TOKEN LEFT RIGHT NONASSOC TYPE START UNION PREC /* reserved words: %token, etc. */ SEP /* separator %% */ LCURL RCURL /* curly braces %{ and %} */ ',' ':' ';' '|' '{' '}' '<' '>' /* single character literals */ %start spec %% spec : defs SEP rules aux_procs ; /* auxiliary procedures section: *************************/ aux_procs : /* empty: aux_procs is optional */ | SEP { copy the rest of the file } ; /* definitions section: **********************************/ defs : /* empty */ | defs def ; def : START ID | UNION '{' { copy the union definition } '}' | LCURL { copy Turbo Pascal tie-in } RCURL | TOKEN tag token_list | LEFT tag token_list | RIGHT tag token_list | NONASSOC tag token_list | TYPE tag nonterm_list ; tag : /* empty: union tag is optional */ | '<' ID '>' ; token_list : token_num | token_list token_num | token_list ',' token_num ; token_num : ID | ID NUMBER ; nonterm_list : nonterm | nonterm_list nonterm | nonterm_list ',' nonterm ; nonterm : ID ; /* rules section: ****************************************/ rules : rule1 | LCURL { copy Turbo Pascal tie-in } RCURL rule1 | rules rule ; rule1 : C_ID ':' body preced ; rule : rule1 | '|' body preced ; body : /* empty */ | body ID | body action ; action : '{' { copy action, substitute $$, etc. } '}' ; preced : /* empty */ | PREC ID | PREC ID action | preced ';' ; 3.9 ADDITIONAL FEATURES, IMPLEMENTATION RESTRICTIONS AND BUGS For backward compatibility, Turbo Pascal Yacc supports all additional language elements entitled as `Old Features Supported But not En- couraged' in the UNIX manual: - literals delimited by double quotes and multiple-character literals. - \ as a synonym for %, i.e. \\ is %%, \left is %left, etc. - other synonyms: %< = %left, %> = %right, %binary = %2 = %nonassoc, %term = %0 = %token, %= = %prec. - actions of the form ={...} and =single statement;. - host language tie-ins (%{...%}) at the beginning of the rules sec- tion (I think that this last one is really a must). See the UNIX Yacc manual for further information. As with Lex, Yacc poses some restrictions on internal table sizes for the source grammar and the constructed parser; these are printed out by Yacc together with statistics about actual table space require- ments, after a successful translation of a grammar. Also, make sure that enough main memory is available. The default size of the parser stack is yymaxdepth=1024 (cf. 3.3) which should be sufficient for any average application, but may also be enlarged (and shrinked) as needed. Note that right-recursive gram- mar rules may increase stack space requirements; thus it is a good idea to use left-recursive (and left-associative) rules wherever pos- sible. Standard (UNIX) Yacc has a bug that causes some (correct) error recovery schemes to hang in an endless loop, see [4]. This bug should be fixed in the Turbo Pascal implementation, at the cost of slightly increased parse table sizes. Yes, there is (at least) one bug in Turbo Pascal Yacc, namely that %union definitions (cf. 3.6) are translated to simple Pascal record types. They should be variant records instead. This will be fixed in the next release, if there ever is one. Note that this bug does not affect the proper functioning of the parser; it merely increases memory requirements for the parser's value stack. Anyhow, you may work around this by using %union definitions of the (Pascal variant record) form %union { case integer of 1: ( ... ) ; 2: ( ... ) ; ... } A final remark about the efficiency of Yacc-generated parsers is in order. The time needed to parse an input of length n is proportional to n. Although this may not convince everyone (Lex makes a similar claim, however most Lex-based analyzers seem to be considerably slower than hand-crafted ones), my experience is that Yacc-generated parsers are in fact fast, at least efficient enough for most applications (such as Turbo Pascal Yacc itself). The major bottleneck for compila- tion speed almost never seems to be the parser, but almost always the lexical analyzer, see [5, section 6.2]. Furthermore, one always has to consider that manual implementation of parsers is usually much more costly, compared to the use of a parser generator. Personally, I prefer a parser generator, because I'm really a lazy programmer; and if something seems not to be running at optimal speed, so what? We can always sit back and wait for still more efficient hardware to come (just kidding). CONCLUSION The Turbo Pascal Lex and Yacc versions described in this manual have been designed and tested carefully. I have used them myself to imple- ment, among other applications: a lexical analyzer and parser for Pas- cal (using a public domain ISO Level 0 grammar, also included in the distribution); Turbo Pascal Yacc itself, using bootstrapping; and a term rewriting system compiler. Also, quite a lot of smaller text and data processing and conversion routines have been implemented by the author, and others, using these programs. Personally, I feel that these tools are quite convenient and useful, and can safe a lot of trouble and time in software development, although they surely could still be improved in one direction or the other. Compiler construction tools are not only useful for the compiler writer, but can also be applied in the development of almost any other software tool that, in some sense, defines an input language. Also, the use of such utilities facilitates rapid prototyping, and enables the programmer to clarify language design issues in early stages of software projects. Turbo Pascal Lex and Yacc, as a starting point, bring to the Turbo Pascal programmer some of the merits of theoretically founded compiler technology, and thus may facilitate some of his work in trying to produce good, and reliable software. Author's address: Albert Graef, FB Mathematik, Johannes Gutenberg- Universitaet Mainz, 6500 Mainz (FRG). Email: Graef@DMZRZU71.bitnet. REFERENCES [1] Aho, Alfred V.; Ravi Sethi; Jeffrey D. Ullman: Compilers : prin- ciples, techniques and tools. Reading, Mass.: Addison-Wesley, 1986. [2] Johnson, S.C.: Yacc - yet another compiler-compiler. Murray Hill, N.J.: Bell Telephone Laboratories, 1974. (CSTR-32). [3] Lesk, M.E.: Lex - a lexical analyser generator. Murray Hill, N.J.: Bell Telephone Laboratories, 1975. (CSTR-39). [4] Schreiner, A.T.; H.G. Friedman: Introduction to compiler con- struction with UNIX. Prentice-Hall, 1985. [5] Waite, William M.; Gerhard Goos: Compiler construction. New York: Springer, 1985. (Texts and monographs in computer science). APPENDIX: LEX AND YACC MANUAL PAGES NAME Lex - lexical analyzer generator (MS-DOS/Turbo Pascal version) SYNOPSIS lex lex-file-name[.l] [output-file-name[.pas]] DESCRIPTION Lex compiles the regular expression grammar contained in lex-file-name (default suffix: .l) to the Turbo Pascal representation of a lexical analyzer for the language described by the input grammar, written to output-file-name (default suffix: .pas; default: lex-file-name with new suffix .pas). For each pattern in the input grammar an action is given, which is an arbitrary Turbo Pascal statement to execute when the corresponding pattern is matched in the input stream. The lexical analyzer is implemented as a table-driven deterministic finite automaton (DFA) routine named yylex, declared as follows: function yylex : integer; The return value of yylex may be 0, denoting end-of-file; all other return values are defined by the programmer and set through ap- propriate actions. The yylex routine can be compiled with the Turbo Pascal compiler (tpc or turbo). It is to be called in the context of a Turbo Pascal main program using the LexLib unit (which can be a Yacc-generated parser or any other program in a separate file, or incorporated into the input specification, and is to be supplied by the programmer). EXAMPLE A simple Lex program that counts words in an input file (obtained from standard input) can be implemented as follows: uses LexLib; var count : integer; %% [^ \t\n]+ inc(count); . | \n ; %% begin count := 0; if yylex=0 then writeln('word count: ', count) end. To compile and run this program, issue the following commands (assum- ing the Lex program to be in file wordcount.l): lex wordcount tpc wordcount wordcount <input-file DIAGNOSTICS In case of syntactic or semantic errors in the source file, Lex displays source line numbers and contents, error position and error message; a copy of the error messages is written to the file lex-file-name with new suffix .lst. NAME Yacc - yet another compiler-compiler (MS-DOS/Turbo Pascal version) SYNOPSIS yacc yacc-file-name[.y] [output-file-name[.pas]] DESCRIPTION Yacc compiles the BNF-like grammar contained in yacc-file-name (default suffix: .y) to the Turbo Pascal representation of an LALR(1) parser for the specified language, written to output-file-name (default suffix: .pas; default: yacc-file-name with new suffix .pas). Also, it generates a header file (output file name with new suffix .h) containing declarations to be shared between parser and the lexical analyzer routine yylex (discussed below), and a report file (yacc- file-name with new suffix .lst) that contains a description of the generated parser. The grammar rules in the specification can be augmented with actions, Turbo Pascal statements to execute when an instance of the correspond- ing grammar rule has been matched in the input. The parser is implemented as a table-driven deterministic pushdown- automaton routine yyparse, that performs a non-backtracking, bottom-up shift/reduce parse. yyparse is declared as follows: function yyparse : integer; The return value of this function is either 0 (normal termination of the parse) or 1 (exception occurred during the parse, e.g. stack over- flow, unrecoverable syntax error, or programmer action called yyabort). The yyparse routine can be compiled with the Turbo Pascal compiler (tpc or turbo). It is to be called in the context of a Turbo Pascal main program using the YaccLib unit and a lexical analyzer routine function yylex : integer; which can also be prepared with Lex; these must be supplied by the programmer. The header file Yacc generates summarizes declarations to be shared between parser and the yylex routine. If the yyparse routine is compiled with defined conditional yydebug, i.e. tpc filename /Dyydebug yyparse will trace all parsing actions on standard output. EXAMPLE The sample desktop calculator supplied on the distribution disk con- sists of the main program and input grammar in file expr.y and a lexi- cal analyzer in the Lex source file exprlex.l. It can be compiled and run by issuing the commands: yacc expr lex exprlex tpc expr expr To trace the steps made by the parser, compile expr.pas with tpc expr /Dyydebug DIAGNOSTICS When encountering syntactic or semantic errors in an input grammar, Yacc gives diagnostics on standard output and in the report file (yacc-file-name with new suffix .lst). Upon successful compilation of the input grammar, Yacc will report the number of shift/reduce and reduce/reduce conflicts encountered when constructing the parser (if there are any); also, Yacc will report the number of grammar rules that are never used in a reduction, and issue warnings when nonterminal grammar symbols do not appear on the left side of at least one grammar rule. Such items, in particular: shift/reduce and reduce/reduce conflicts, are discussed in more detail in the report (.lst) file.