Text File | 1993-06-08 | 54.4 KB | 1,248 lines |
-
-
-
-
-
-
-
-
- PGEN
-
- Parser GENerator
-
- (v2.0)
-
-
-
-
- Copyright (C) 1992
-
- by
-
- Keith L. Robertson
-
-
-
-
-
-
-
-
-
- PGEN accepts a description of a context-free grammar and its semantic
- actions. It can output C language code for tables used to parse
- sentences in the described language and to interpret them via the
- semantic actions. The C language 'parse()' program that uses the
- tables is included with PGEN.
-
-
-
- PGEN is NOT public domain or free software. It is distributed as
- Shareware. Okay, folks. You're experienced enough to know what that
- means:
-
- You get to try it for free.
- If you like it and use it, buy it;
- keep the author in business.
- If not, burn it (your copies, not mine).
-
- See the section LEGALESE for licensing terms.
-
-
-
- PACKING LIST (Shareware):
-
- This is what you should have with the shareware version of PGEN:
- PGEN parser generator:
- PGEN.DOC This file. The documentation for PGEN and the files
- that come bundled with it.
- PGEN.EXE The parser generator program. Execute it with no
- arguments for brief usage instructions.
- ORDER.FRM Print this for a PGEN order form.
- TYPES.H Public domain header file with some convenient
- declarations. Put this in your INCLUDE directory.
- PG_PARSE.H Header file for types, globals, and procedures needed
- by the parser, parse tables, and semantic actions.
- PG_PARSE.C Source code of the parser.
-
- Included with the parser is the source code for EXPR, a simple
- expression evaluator used to demonstrate usage of the PGEN
- parser generator. It uses PG_PARSE.H and PG_PARSE.C.
- MAKEFILE EXPR's makefile. Compilation instructions are for
- 'ztc', the Zortech compiler. Easily modified for
- other compilers.
- EXPR_LEX.H EXPR's lexical analyzer header. Defines LEX_INFO,
- used by the parser.
- EXPR_LEX.C EXPR's lexical analyzer implementation.
- EXPR_PAR.PGN EXPR's grammar file for PGEN. Used to produce parsing
- tables. PGEN -C EXPR_PAR.PGN >EXPR_PAR.C
- EXPR_SEM.H EXPR's semantic actions header. Declares the semantic
- action routines used in EXPR_PAR.PGN and the resulting
- EXPR_PAR.C.
- EXPR_SEM.C EXPR's semantic actions implementation and the
- 'main()' program.
-
- Bundled with PGEN shareware are three useful utility programs.
- All of these print usage instructions when executed with no
- arguments.
- PPRINT.EXE Pretty print text files, with headings and (optional)
- line numbers. Uses Epson-compatible printer codes.
- The source code (received with commercial version
- registration) is easily modified for other printers.
- TAIL.EXE Prints the last lines of the given file or of the
- standard input stream.
- TEE.EXE Sends standard input to the given file and to standard
- output.
-
-
- NOTE: When you received the shareware version of PGEN, you probably
- received it in a self-extracting archive, PGEN_xxx.EXE, where xxx is
- the version number. On the same disk, there may also be these files:
- PGEN_xxx.EX2, PGEN_xxx.EX3, and PGEN_xxx.EX4. These are merely copies
- of PGEN_xxx.EXE. If the .EXE should be unreadable due to a sector
- gone bad, any of the other three may be renamed with a .EXE extension
- and executed to extract the same files. If your .EXE is okay, the
- others are not needed.
-
-
-
- PACKING LIST (Commercial):
-
- This is what you receive with the commercial version of PGEN:
- All of the above, except with the following change:
- PGEN.EXE Employs a sophisticated table reduction algorithm.
- The EXPR 'action_list' table shrinks from 101 to 73
- elements (28%). A larger application decreased from
- 963 to 577 elements (40%). The nonterminal and
- production tables will also have unused entries
- removed, although the gain is not as great.
-
- You will additionally receive the following utilities:
- CAT.EXE Prints in sequence (concatenates) to standard output
- the given file(s).
- CHOPCRLF.EXE Chops <CR,LF> off the end of a text file.
- CMDLINE.EXE Simply echoes the arguments given with wildcard
- replacement, so you can see what the other programs
- are receiving.
- CP.EXE Copies one file to another or several to a directory.
- MV.EXE Moves one file to another or several to a directory.
- RM.EXE Removes one or more files or directories.
- HSPLIT.EXE Breaks a long text file into several smaller text
- files. Useful when you have a file bigger than your
- editor can handle (like "PGEN -L" output for a large
- grammar).
- VSPLIT.EXE Breaks a wide text file (like "PGEN -A" output for a
- large grammar) into several others for printing.
- NCSL.EXE Reads a C or C++ source file, categorizing each line
- into one of four categories: Blank, Comment, Bracket,
- or Code.
- DIFF.EXE Compares two text files and prints their differences.
- DIRDIFF.EXE Compares two directory structures and prints the files
- unique to each. Useful for comparing working and
- backup directories.
- Each of the above prints usage instructions when executed
- without arguments.
-
- And best of all, commented source code:
- MAKEFILE, 8 header files, and 8 implementation files in C++ for PGEN.
- MAKEFILE and source files (mostly C, some C++) for all the above
- utilities.
-
- Registered users will also receive notice of any updates,
- which will be provided for no more than $5.00, plus non-
- U.S. Mail mailing costs (see REGISTRATION PROCEDURES below).
-
-
-
- REGISTRATION PROCEDURES:
-
- Simple: just print the file ORDER.FRM and fill in the requested
- information. Currently, PGEN is for IBM DOS (and compatible OSes) only.
-
- Make a check payable to "Keith L. Robertson" for the following amount:
- $ 20.00 (U.S.) per copy (TX residents add 8% sales tax)
- $ 1.00 per copy for 3 1/2"-1.44M (high density) disks
-
- Orders are shipped via U.S. Mail at no extra cost. Otherwise, add:
- $ 3.00 for UPS shipment
- $ 5.00 for shipment outside the U.S.A.
-
- Students using PGEN for coursework ONLY may register PGEN for $10.00
- per copy. This gets you the commercial PGEN.EXE and utilities, but not
- the source code.
- Please include photocopy of validated Student I.D. For personal or
- commercial project use, please register for the full $20.00.
-
- Mail the above to:
- Keith L. Robertson
- 11857 Gloger Street
- Houston, TX 77039-6309 (U.S.A.)
-
- Customer support is currently by mail only, to the same address.
- Unregistered users please include SASE if not ordering at the same
- time. PGEN is not a program for the masses; I must keep my costs low
- to offer it and support it at the requested price.
-
-
- LEGALESE (Licensing terms):
- (Sorry, but we have to go thru this.)
-
- In the following, reference to "PGEN" includes the parser generator
- and all source code and other utility programs that come with the
- shareware version. (Exception: TYPES.H is public domain.)
-
- (1) Unregistered users may copy PGEN and use it for two weeks
- solely for the purpose of determining if it suits their needs.
- If PGEN is used to create parsing tables for a programming
- project, whether for commercial or private use, continuing use
- of those tables (and therefore of PG_PARSE.H and PG_PARSE.C)
- within that project constitutes use of PGEN and requires
- registration, even if PGEN itself is never executed again.
-
- (2) A user may register as given above in REGISTRATION PROCEDURES.
-
- (3) A registered user may use PGEN on one computer per copy
- registered at any given time. If other people use PGEN
- routinely, even on the same computer, they must become
- registered users themselves.
-
- (4) A company using PGEN may have one or more employees using the
- same copy of PGEN. In this case, the registered user is the
- company, not the employee. However, if PGEN is available for
- use on more than one computer at a time, one copy must be
- registered for each such computer.
-
- (5) Anyone -- electronic BBS sysops, commercial data networks,
- shareware distribution houses, users, etc. -- may distribute
- the shareware version of PGEN, as long as there is no charge
- for the program itself, all files in the shareware package are
- distributed complete and unmodified, and no files from the
- commercial package are included. There may be a charge for
- normal connect time, or for disk copying and mailing costs, as
- long as this charge does not exceed $7.00 per disk; PGEN
- shareware must fit on one disk. The distributor or vendor
- must make it clear that it is charging for distribution costs
- and not for the software itself.
-
- (6) PGEN may not be bundled with any other commercial product
- without the express written consent of the author.
-
- (7) The commercial version of PGEN, sent to registered users, and
- all source code and utility programs bundled with it, are for
- the private use of the registrant. They may not be copied or
- distributed in any form.
-
- (8) Any products made using properly registered copies of PGEN are
- entirely the property of their creators. There are no
- royalties nor run-time license fees required by the author of
- PGEN for the creation and distribution of these products.
-
- (9) PGEN may be ported to other operating systems (I expect very
- easily), using the source code with the commercial version,
- only with written permission from the author. The ported
- version will be subject to the same licensing terms under the
- target platform. Once the author's permission has been
- obtained, a shareware version may be distributed for that
- platform, as long as it contains all files and only those
- files listed in the shareware packing list.
-
- Any usage of any part of the PGEN shareware or commercial packages,
- including source code or utility programs, constitutes legal acceptance
- of the above licensing terms. A user must be registered to use any of
- the source code or utility programs that come with PGEN beyond the
- evaluation period. If you use ANY of these, you must be registered
- for the entire PGEN package.
-
- THIS SOFTWARE IS SOLD "AS-IS" WITHOUT WARRANTY OF ANY KIND, EITHER
- EXPRESS OR IMPLIED, RESPECTING THE FITNESS OR MERCHANTABILITY OF THIS
- SOFTWARE FOR ANY USE. THE AUTHOR SHALL NOT BE LIABLE TO THE
- PURCHASER, USER, OR ANY OTHER PERSON OR ENTITY WITH RESPECT TO ANY
- LIABILITY, LOSS, OR DAMAGE CAUSED OR ALLEGED TO BE CAUSED DIRECTLY OR
- INDIRECTLY BY THIS SOFTWARE.
-
- The user is especially cautioned on the use of the "RM -R" feature of
- the "RM" remove files utility.
-
-
- Trademark Acknowledgements:
- Epson is a trademark of Epson Corporation.
- IBM is a trademark of International Business Machines Corporation.
- Zortech and Symantec are trademarks of Symantec Corporation.
-
-
-
- USING PGEN
-
-
- INTRODUCTION:
-
- A parser generator is a rather advanced programming tool which can be
- used in the creation of compilers, interpreters, structured editors,
- text formatters, command line parsers (e.g. for an adventure game),
- etc. This document is therefore written for the advanced programmer
- with knowledge of the principles of parser generation. If you don't
- know what a token (or terminal), nonterminal, production, or semantic
- action is, chances are you don't need a parser generator. For
- those interested in the topic, read
-
- Aho, A.V., R.Sethi, and J.D.Ullman,
- "Compilers, principles, techniques, and tools",
- 1986, Addison-Wesley, Reading, Mass.
-
- known as "the dragon book" in academia for the red dragon on its
- cover, which represents "complexity of compiler design". An LALR
- parser generator, like PGEN, is shown as the knight's sword used to
- slay the dragon.
-
- I created PGEN because I wanted to write a compiler, I didn't have
- access to YACC (Dragon book pp.257-266), and I couldn't find a public
- domain or shareware parser generator in the ShW/PD libraries. Because
- of this, I don't know how PGEN compares with other parser generators
- in execution speed, nor in size and speed of the parser generated (if
- anyone wants to compare PGEN with any other and send me the results, I
- would welcome the feedback). I do know I have used PGEN to create a
- working parser (I haven't finished the whole compiler; been working on
- PGEN!). I find it very useful and others may as well.
-
- Terminology:
- stdin = the standard input stream
- stdout = the standard output stream
- stderr = the standard error stream
-
-
-
- PGEN COMMAND LINE:
-
- Usage: PGEN [-I] [-TNFDPSKLWQAGCX] <grammar file>
-
- PGEN reads the given grammar file and prints its results to stdout,
- which is usually redirected to a file. Options can be given together
- or separately ("PGEN -IC x" == "PGEN -I -C x") in any order, but must
- precede the grammar file name. During its operation, PGEN prints some
- progress information to stderr, which will print to the screen even
- when stdout is redirected (as it should).
-
-
-
- PGEN GRAMMAR FILE FORMAT:
-
- The grammar file has a format similar to that expected for YACC, but
- there are significant differences. The format is as follows:
-
-
- /* C language code that will appear at the beginning of the
- C code parse table output. Everything up to the first line
- beginning with an open bracket ('[') is emitted unmodified.
- Usually header comments, includes, and defines.
- No line in this section can begin with '['.
- */
- #include "lexer.h" /* Get defns of token identifier names. */
- #define STACK_INC 10
-
- [ # Token definitions begin with a bracket, which must begin the
- # line. Comments begin with '#' and last to the end of the line.
- TOKEN1, # Token definitions are separated by commas.
- # The parser must return 0 for end-of-stream,
- # 1 for the first token (TOKEN1 here), etc.
- IDENTIFIER, # A token name can be an identifier, beginning
- NUMBER, # with a letter, followed by letters, numbers
- # & underscores, up to 128 characters. Case
- # is significant, although all upper case is
- # recommended for tokens.
- "%", # A token can also be a string, enclosed in
- # single or double quotes. Tokens must
- # appear in productions exactly as in the
- # definition including the quote characters.
- T ALT_NAME, # A token may be given an alternate name,
- # which will be used when printing the parse
- # table C code. It must be an identifier.
- ',' COMMA, # An alternate name is usually used with
- '(' LP, ')' RP, # string tokens, since the string name is not
- # valid for the C output tables.
- '=' 1R, # A token may be given a precedence number,
- # which can resolve some ambiguities (Dragon
- # p.261). A 'L' or 'R' giving associativity
- # can immediately follow the precedence.
- '+' 2, # Left associative is the default.
- '-' 2, # All operators with the same precedence
- # should be given the same associativity.
- '*' TIMES 3L, # When all three are present, the correct
- # order is "primary name", "alternate name",
- # "precedence/assoc".
- ] # Token defs end with a close bracket. Productions follow.
-
- # The first nonterminal defined is always the start symbol.
- # Each nonterminal must be defined exactly once but may have
- # multiple productions in the definition.
- Expr_list # Recommended style is initial capital.
- # Each production is introduced with a '|' or ':' and is defined
- # by a sequence of zero or more symbols (tokens and nonterminals)
- # on the right-hand-side. I recommend introducing all
- # productions with '|'.
- # Nonterminals can be used before they are defined.
- | Rep1_expr_list
- | # Empty productions are sometimes necessary.
- | error # Error prod's as in YACC (see ERROR HANDLING)
- ; # List of productions is ended with ';'.
- Rep1_expr_list
- # Productions may be directly or indirectly recursive, although
- # left recursion, as shown here, is preferable.
- | Rep1_expr_list ',' Expr
- | Expr
- ;
- Expr
- # Productions can have a priority also. The first token with a
- # priority gives its priority to the production. Alternately,
- # the priority can be given explicitly. Although this grammar is
- # ambiguous, the priority and associativity designations resolve
- # the ambiguity.
- | Expr '*' Expr # Implicit priority 3
- | 2 Expr Add_op Expr # Explicit priority 2
- # A production may end with a semantic action name introduced by
- # an at sign ('@'). This identifier is the name of the C
- # procedure to be called when the production is reduced.
- # It must be a simple identifier, not C code.
- | IDENTIFIER '=' Expr @ do_assign # Recommend lower case
- | '(' Expr ')' @ do_paren
- | Add_op Expr @ do_unary
- | IDENTIFIER [Fn_args] # Any single symbol may be enclosed in
- # brackets to make it optional.
- ;
- Fn_args
- | '(' Expr_list_2 ')'
- ;
- Expr_list_2
- | { Expr ',' ... } # A production can be defined by two
- # symbols in braces to specify a list.
- ;
- Add_op | '+' | '-' ;
- #--------------------------- End of File ---------------------------#
-
- The grammar above is not terribly useful as-is. See EXPR_PAR.PGN for
- a working example.
-
- For readability, I recommend that a nonterminal being defined sit on
- a line by itself, followed by its productions indented beneath, each
- production beginning a new line, and the terminating semicolon
- indented on a line by itself. (I don't entirely follow this
- convention in this document's examples for reasons of length.)
-
- For users of versions prior to 2.0, note that "{ sem }" has been
- changed to "@ sem" for semantic actions. If it were needed, I would
- have included a converter, but any decent text editor will have a
- search and replace operation that will allow you to selectively
- replace '{' with '@' and remove '}'. Braces were originally used to
- enclose the semantic action name for similarity to YACC, but they were
- made to enclose the repetition construct instead for similarity to
- EBNF (Extended Backus-Naur Formalism).
-
-
-
- PGEN COMMAND LINE OPTIONS:
-
- Once again, the usage line is
- Usage: PGEN [-I] [-TNFDPSKLWQAGCX] <grammar file>
- PGEN has many options, although all but one are print options.
-
- -I Inline productions. If this option is given, PGEN will
- substitute productions where possible. See PRODUCTION
- INLINING below.
-
- The remaining options tell PGEN what to print to stdout.
-
- -T Print tokens. The format is "Sequence number", "Priority",
- "Associativity", "Primary name". The alternate name is not
- printed. '#E_O_S#' (end of stream) is always the first token,
- and 'error' is always the last.
- -N Print nonterminals. The format is "Sequence number", "Number
- of the first production which defines the nonterminal",
- "Number of productions defining this nonterminal", "Number of
- uses on the right-hand-side of any production", "Stack size
- needed to parse this nonterminal", "Recursion mark", "Name".
- The first line gives a stack size and recursion mark for the
- entire grammar. These are explained later. A dummy start
- symbol '$_Start_$' is always the first nonterminal.
- -F Print the "first symbols in productions" for each nonterminal.
- -D Print the "first symbols in derivations" for each nonterminal.
- -F and -D were primarily useful for debugging PGEN itself, so
- the meaning of these is documented in the source file
- PGEN_SYM.CPP.
- -P Print productions. "Sequence number", "Number of the first
- SLR item for this production", "Production priority",
- "Production". The grammar is always augmented with the
- production
- $_Start_$ --> <the first defined nonterminal>.
- -S Print the SLR items. "Sequence number", "SLR item". The dot
- location is shown as '<>'.
- -K Print the item set kernels.
- -L Print the item set LALR closures.
- -K and -L are very useful in debugging grammars, especially in
- finding and resolving parsing conflicts.
- -W Modifies the printing of -K and -L to print each item set
- (W)ith its goto list. Each list shows what state the parser
- will go to for each symbol (both tokens and nonterminals).
- Symbols not listed are errors if seen.
- -Q Print parsing action conflicts to stdout instead of stderr.
- -A Print a table showing the parsing actions for each state and
- token.
- -G Print the goto table for each state and nonterminal.
- -A and -G combined give the type of information shown in the
- Dragon book Figs 4.31 (p.219), 4.40 (p.236), and 4.47 (p.250).
- -C Print the C language code for the generated parsing tables.
- (Our goal.)
- -X Print ALL of the above except -W and -Q (which you may specify
- separately). This will get REALLY BIG for a grammar of any
- size, and was mostly useful in debugging PGEN.
-
-
-
- PRODUCTION INLINING:
-
- Consider the following:
-
- Lhn | Rhn ; # This production has one nonterminal and no
- # semantic action. Lhn may have other prod's.
- Rhn | ... ; # Rhn is defined by one production, which
- # may or may not have a semantic action.
-
- PGEN will substitute for Rhn, resulting in:
-
- Lhn | ... ; # Rhn's production and semantic action.
-
- This will increase the parser's speed slightly, since it won't have to
- reduce by "Lhn --> Rhn". The rationale for this option is so we can
- write productions like
-
- Statement
- | For_statement
- | While_statement
- | If_statement
- # etc.
- ;
- For_statement
- | FOR IDENT ':=' Expr To_ Expr DO Statement @ for_st ;
- # etc.
-
- without any unnecessary overhead, rather than having to write
-
- Statement
- | FOR IDENT ':=' Expr To_ Expr DO Statement @ for_st ;
- # etc.
-
- which will be much less readable. This optimization often inlines
- and optimizes many more productions than you might expect.
-
-
-
- STACK SIZE AND RECURSIVE MARK:
-
- Two of the things printed on the "PGEN -N" option are the stack size
- and recursive mark for each nonterminal and for the entire grammar.
- The stack size needed to parse the grammar is simply that for the
- augmented start symbol '$_Start_$' plus one. The recursion mark for
- the grammar is the sum of that for all nonterminals.
-
- The stack size for a nonterminal is the worst-case size of the parsing
- stack needed to parse a sequence of tokens, starting at that
- nonterminal, barring any recursion other than left recursion. Left
- recursion, as with Rep1_expr_list above, causes no problems, and is
- not considered recursion for purposes of the mark or this discussion.
- Other recursion, as with both of these productions,
-
- Expr | '(' Expr ')'
- | IDENTIFIER '=' Expr ; # Right associative
-
- would require an infinite stack size to be guaranteed to parse any
- sequence of tokens that match the grammar's definition of an Expr.
- For example, the following is a valid Expr:
-
- ((((((((((((((((((((((((((( zebra )))))))))))))))))))))))))))
-
- Indeed, how many parentheses should we allow? A million? Since an
- infinite stack is impossible in any real computer, PGEN computes the
- stack size as if there were no such recursion, and informs the
- programmer in its -N and -C output if the grammar is recursive.
-
- The recursive mark for a nonterminal increases each time (non-left)
- recursion through the nonterminal is detected. Its main relevance,
- however, is whether or not it is zero, not the actual number. If the
- mark is nonzero, the nonterminal or grammar is recursive.
-
- The programmer may deal with a recursive grammar by deciding how much
- nesting to allow and setting an amount to increase the stack size in
- the "PGEN -C" output. This is done by defining the macro 'STACK_INC'
- in the C code part of the grammar file. For example,
-
- #define STACK_INC 10
-
- will allow at least 10 nested '(' or at least 5 nested "IDENT '='"
- (or six '(' and two "IDENT '='", etc.) without stack overflow, even
- in the worst case.
-
- The parser may still be asked to parse a string of tokens which is
- valid for the language, but which its stack cannot handle. It will
- not by default check if the parse stack overflows, and in this case
- the program will probably crash. Fortunately, with just a little
- leeway, stack overflow is rare in practice unless you are really
- trying. To get the parser to check for overflow, set STACK_SIZE in
- PG_PARSE.C to the actual stack size. The actual stack size is set by
- the "#define STACK_SIZE" in the "PGEN -C" output.
-
-
-
- FOLLOWING ALONG WITH PGEN:
-
- While executing, PGEN prints progress messages. As they are read, it
- prints the number of tokens, nonterminals, and productions taken from
- the input file. PGEN then checks that all nonterminals are properly
- defined and used. The number of SLR items is computed and printed.
-
- As the item sets are constructed, PGEN prints a running tab of two
- numbers. The first number is the item set PGEN is currently
- examining; the second is the total number of item sets created so far.
- Initially, there is only one item set, number 0. As each set is
- examined, zero, one, or many new item sets may be added. PGEN is done
- when the first number becomes equal to the second.
-
- Then, lookahead symbols are propagated from all item sets. PGEN
- prints the item set currently being examined. When it finishes the
- last item set, it may need to backtrack to an earlier set several
- times.
-
- Next, PGEN constructs a table of actions for each <state, token> pair.
- If any parsing conflicts are found (see PARSING ACTION CONFLICTS),
- they are printed during this phase.
-
- Finally, PGEN converts the tables to action lists which reduces the
- storage requirement (Dragon book p.244-247). In the commercial
- version, PGEN merges similar action lists as they are added, further
- reducing the needed storage. The lists then need to be sorted into
- the order they will be printed in the final C code output.
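-
- The action-list idea can be sketched in C. This is a generic
- illustration of the textbook technique, not PGEN's actual table
- layout; the state, tokens, and action values below are invented.
-
```c
#include <assert.h>

/* Instead of one full row of actions per state, each state keeps a
   short list of (token, action) pairs ending in a catch-all default.
   Similar lists can then share storage.                             */
typedef struct { int token; int action; } ACT;

#define ANY (-1)                 /* matches every token: the default */

static const ACT state3[] = {    /* a hypothetical state             */
    { 1, +4 },                   /* on token 1: shift to state 4     */
    { 2, -7 },                   /* on token 2: reduce by prod. 7    */
    { ANY, 0 }                   /* any other token: error           */
};

/* Scan the list for the token; fall through to the default.        */
static int lookup (const ACT *list, int token)
{
    while (list->token != ANY && list->token != token)
        ++list;
    return list->action;
}
```
-
- With lists like these, 'lookup (state3, 1)' yields 4, while an
- unlisted token falls through to the default action 0.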
-
-
-
- CREATING A LEXICAL ANALYZER:
-
- The parser requires an associated lexical analyzer (lexer).
- Fortunately, it is not very difficult to create an efficient, robust
- lexer by hand. EXPR_LEX.C is easily modified to recognize a different
- set of tokens and to accept its input from other than stdin.
-
- The header file for the lexer must declare all token identifier names
- and any types, global data, or routines needed by the parser and
- semantic actions. These are the items it must declare:
-
- 1) The type 'LEX_INFO'
- The type of objects on the parse stack. Defines attribute information
- relevant to token lexemes and nonterminal symbols.
-
- 2) The extern variable 'lexeme' of type 'LEX_INFO'
- Location where the lexer function 'get_token()' returns information
- related to the last token it read from the input stream. READ ONLY!
-
- 3) The extern variable 'last_token' of type 'USHORT'
- Token number returned by the last call to 'get_token()'. READ ONLY!
-
- 4) Function 'USHORT get_token()'
- Reads the input stream, returns the number of the token it saw
- directly and in 'last_token', and places related information in
- 'lexeme'.
-
- 5) Procedure 'VOID log_error (USHORT etype, CSTRING desc)'
- Prints an error message based on the given error type and description.
- See EXPR_LEX.H for a description.
-
- 6) Operations 'set_jumped' and 'was_jumped' on type 'LEX_INFO'
- #define set_jumped(lex,val)
- #define was_jumped(lex)
- The macro 'set_jumped()' marks 'lex' as jumped if 'val' is true and as
- not jumped if 'val' is false; 'was_jumped()' returns true if and only
- if 'lex' is marked as jumped.
-
- The type 'USHORT' above is defined in TYPES.H as "unsigned short".
- You may alternately use any integer type.
-
- Finally, if the parser is to stop parsing on something in the input
- stream other than the end-of-file, the lexer merely needs to return
- the end-of-stream token (token 0) for that thing.
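-
- As a sketch of this interface: the stand-in LEX_INFO, the two-token
- language, and reading from a string (rather than a real input
- stream) are all simplifications invented here; see EXPR_LEX.H and
- EXPR_LEX.C for the real thing.
-
```c
#include <ctype.h>

typedef unsigned short USHORT;
enum { E_O_S = 0, NUMBER = 1 };      /* token 0 is always end-of-stream */

/* Stand-in LEX_INFO: one attribute field plus the jumped flag. */
typedef struct { int jumped; double fvalue; } LEX_INFO;

LEX_INFO lexeme;                     /* info on the last token read     */
USHORT   last_token;                 /* number of the last token read   */

#define set_jumped(lex,val) ((lex).jumped = (val))
#define was_jumped(lex)     ((lex).jumped)

static const char *src = "";         /* input text, for simplicity      */

/* Returns NUMBER for a digit run, E_O_S for anything else. */
USHORT get_token (void)
{
    set_jumped (lexeme, 0);          /* a returned token was seen       */
    if (isdigit ((unsigned char) *src)) {
        double v = 0.0;
        while (isdigit ((unsigned char) *src))
            v = v * 10.0 + (*src++ - '0');
        lexeme.fvalue = v;
        return last_token = NUMBER;
    }
    return last_token = E_O_S;       /* anything else stops the parse   */
}
```
-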
-
-
-
- SEMANTIC ACTIONS:
-
- When the parser reduces by a production, its corresponding semantic
- action (after '@'), if any, is executed. Semantic actions must be
- written as individual C procedures which take no arguments and return
- no value.
-
- When a semantic action is called, a global pointer 'attr' points to an
- array of objects of type LEX_INFO representing the symbols of the
- associated production, beginning with 0. The semantic action returns
- attribute information in attr[0]. For example, in the EXPR example we
- have a production
-
- Expression | IDENTIFIER '=' Expression @ expr_assign ;
-
- In the code of 'expr_assign()', the identifier is represented by
- attr[0]. The lexer returned the string for token IDENTIFIER in its
- 'name' field. The '=' is attr[1]. The lexer did not return the token
- string, as all relevant information was carried by the token type
- EQUAL. In fact, that the parser saw '=' when it did is contained in
- the parse state, so 'expr_assign()' doesn't even consult attr[1].
- Finally, the attributes for the received Expression are in attr[2],
- of which only the field 't.fvalue' is relevant.
-
- A semantic action returns its attribute information in attr[0]. Note
- that the first production symbol and the return symbol use the same
- attribute object, attr[0]. Make sure the writes for the return
- information don't affect the attribute reads from the first symbol
- information. This is easy to ensure in practice. Also, if the return
- attributes are identical to the attr[0] input and no side effects are
- required, no semantic action is needed. This is the case for Binary1,
- Binary2, and Unary in EXPR_PAR.PGN.
-
- In EXPR, 'expr_assign()' is mostly useful for its side effect, saving
- the IDENTIFIER name and its assigned value in a symbol table. Since
- an assignment is also an Expression, the value is copied to
- attr[0].t.fvalue.
-
- It is an error for a semantic action to access or modify attr[n] where
- (n < 0) or ("the number of production symbols" <= n).
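-
- A sketch in the spirit of 'expr_assign()' may make this concrete.
- The LEX_INFO layout and the one-entry "symbol table" below are
- stand-ins invented for illustration; the actual code is in
- EXPR_SEM.C.
-
```c
#include <string.h>

/* Stand-in declarations; the real ones come from the lexer header. */
typedef struct { char name[32]; struct { double fvalue; } t; } LEX_INFO;

static LEX_INFO parse_stack[8];
LEX_INFO *attr = parse_stack;   /* parser aims this at the production  */

static char   sym_name[32];     /* toy one-entry symbol table          */
static double sym_val;

/* Expression | IDENTIFIER '=' Expression @ expr_assign ; */
void expr_assign (void)
{
    double val = attr[2].t.fvalue;   /* value of the right Expression  */
    strcpy (sym_name, attr[0].name); /* side effect: save IDENTIFIER   */
    sym_val = val;                   /* ...and its assigned value      */
    attr[0].t.fvalue = val;          /* an assignment is also an Expr  */
}
```
-
- Note that attr[0].name is read before attr[0].t.fvalue is written,
- so the return information does not clobber the first symbol's input.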
-
-
-
- OPTIONAL SYMBOLS:
-
- Any single symbol on the right-hand-side of a production can be made
- optional simply by enclosing it in brackets. Prior to version 2.0,
- you would usually need to clutter your grammar with the following:
-
- Nonterm | ... Opt_symbol ... @ p1_action ;
- Opt_symbol | Symbol
- | ;
-
- This also adds inefficiency. Time: When Symbol is parsed, we must
- needlessly reduce it to an Opt_symbol. Space: The above introduces
- one nonterminal and two productions which must appear in the final
- parse table output. The only time that you wouldn't need this is if
- Symbol is a nonterminal that is optional in ALL its uses. Then it
- just needs an empty production:
-
- Symbol | ... all its other productions
- | ;
-
- Now, it doesn't matter if Symbol is a token or nonterminal or if it is
- not optional in other uses. You may write
-
- Nonterm | ... [Symbol] ... @ p1_action ;
-
- When appropriate, Symbol will be jumped (skipped) with no semantic
- action taken.
-
- This introduces a potential problem. How does a semantic action like
- 'p1_action()' determine if one of its optional symbols appeared in the
- input stream or was jumped? It tests "was_jumped (attr [sym_num])",
- where 'sym_num' corresponds to the Symbol. As previously noted (in
- CREATING A LEXICAL ANALYZER), the lexer header is required to provide
- this macro.
-
- The lexer must ensure "was_jumped (lexeme)" is false, because any
- token it returns was seen, not jumped. The parser sometimes needs to
- mark stack symbols (pointed to by attr) as jumped or not, so the macro
- 'set_jumped()' must also appear in the lexer header.
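-
- A sketch of such a test follows. The types, macros, production, and
- 'p1_action' itself are hypothetical stand-ins, not code from the
- package.
-
```c
/* Stand-ins for what the lexer header really provides. */
typedef struct { int jumped; double fvalue; } LEX_INFO;

#define set_jumped(lex,val) ((lex).jumped = (val))
#define was_jumped(lex)     ((lex).jumped)

static LEX_INFO parse_stack[8];
LEX_INFO *attr = parse_stack;

static double result;

/* Nonterm | IDENTIFIER [Fn_args] @ p1_action ;   Fn_args is attr[1]. */
void p1_action (void)
{
    if (was_jumped (attr[1]))
        result = attr[0].fvalue;                  /* no argument list  */
    else
        result = attr[0].fvalue + attr[1].fvalue; /* arguments present */
}
```
-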
-
-
-
- LISTS, LISTS, LISTS:
-
- A construct that may repeat entirely was always easy to write.
-
- List | List <all repeating symbols>
- | # for 0-or-more, *OR*
- | <all repeating symbols> # for 1-or-more.
- ;
- # e.g. (empty) C; C;C; C;C;C; C;C;C;C; (etc.)
- # Here, <all repeating symbols> derives "C;"
-
- Optional symbols make the 1-or-more list even easier:
-
- List | [List] <all repeating symbols> ;
-
- However, a list of items and separators is more verbose to express
- without optional symbols, especially in the 0-or-more case:
-
- List_0 | List_1 # 0-or-more
- | ;
- List_1 | List_1 Separator Item # 1-or-more
- | Item ;
- # e.g. (empty) A A,A A,A,A A,A,A,A (etc.)
- # Here, Item derives "A" and Separator derives ","
-
- Optional symbols remove the need for nonterminal List_0 and its
- productions, but you must make your list optional in all its uses
- EXCEPT in the 1-or-more list.
-
- List_0_use | ... [List_0] ... ;
- List_0 | List_0 Separator Item # Really a 1-or-more
- | Item ;
-
 - This is error-prone; you might forget to make a usage optional.
- Computers are supposed to do the rote work for us, right?
-
- PGEN version 2.0 adds the following constructs:
-
- List_0 | { Item Separator ... } ; # 0-or-more list
- List_1 | { Item Separator ... }+ ; # 1-or-more list
-
 - There must be exactly two symbols in the production, the item and
 - the separator, and no others.  The ellipsis ('...') is optional but
 - highly recommended.  For example:
-
- A_list | { A ',' ... } ; # Add '+' to exclude (empty)
-
- PGEN converts both constructs to a 1-or-more list, turning the one
- production into two. For a 0-or-more list, it automatically scans all
- past and future productions, and makes optional all uses of the
- defined nonterminal. The "PGEN -P" output shows the productions after
- these translations.
-
- What about semantic actions? This production
-
- List | { Item Separator ... } @ bld_item_list ;
-
- and the similar 1-or-more list both become
-
- List | List Separator Item @ bld_item_list
- | Item @ bld_item_list1 ;
-
- Note that two semantic actions are needed by a list. The first to
- execute will have the semantic action name with the character '1'
- appended ('bld_item_list1'). It must convert a single item to a list
- with one item. Successive iterations will call the unmodified
- semantic action name ('bld_item_list'), which adds a single item to
- the list.
-
- WARNING!! There are two potential pitfalls the PGEN parser programmer
- must particularly pay heed to! (Oooh, love the alliteration.)
-
- 1) In the unchanged semantic action name (e.g. 'bld_item_list'),
- the List nonterminal is at attr[0], the Item at attr[2], and
- of course, the Separator at attr[1]. This is not obvious from
- the original list construct where the Item appears before the
- Separator, and the List doesn't appear at all.
-
- 2) In the usage of a 0-or-more list, an empty list results in a
- jumped symbol, NOT in an empty list. Be sure to check for
- this in semantic actions that use a 0-or-more list.
-
- 15
- Your test for the day:
-
- [ ALPHA, BETA, SEP, END ] # Tokens
- Start | List END @ start ;
- List | { Item SEP ... } @ list ;
- Item | ALPHA
- | BETA ;
- # Item derives ALPHA or BETA. The Separator is SEP.
-
- Implement semantic actions for the above grammar so that the number of
- ALPHA tokens found in the input stream is printed. Assume structure
- LEX_INFO has an integer field 'token' filled by the lexer with the
- appropriate token value and an integer field 'count' available for
- your use. Also, 'ALPHA', 'BETA', etc. are macro defined constants
- with the correct token values. See THAT'S ALL, FOLKS! for an answer.
-
-
-
- ERROR HANDLING:
-
 - Error handling is done similarly to YACC (Dragon book p.264-266),
- by deciding on a set of "major" nonterminals to have error recovery
- and giving those nonterminals an error production of the form
-
- Nonterminal | error @ nonterminal_error ;
-
- The error token must be the ONLY symbol in the production, and it must
- appear in lower case.
-
- When an unexpected token is detected in the input stream, 0 or more
- states are popped off the stack until the topmost state defines an
- action on the 'error' token (the state contains an item of the form
- "Nonterminal --> <> error"). The error token is shifted onto the
- stack, and the error production's semantic action is called when the
- error production is reduced. If the stack becomes empty and no such
- state is found, the parser aborts with a fatal error.
-
- The error semantic action routine is responsible for printing any
- needed error message and for skipping tokens until the parser can
- resume. Procedure 'log_error()' in EXPR_LEX.C provides the former
- facility for EXPR. PG_PARSE.C provides routines 'skip_to_token()' and
- 'skip_to_bittok()' for skipping to a particular token, or to one of a
- set of tokens given as a bitmap. (See the file EXPR_SEM.C for example
- usages of these.) The matching token is not consumed from the input
- stream by these procedures.
-
-
-
- THE JUMP ACTION:
-
- You already should be familiar with the quaternion of actions
- performed by bottom-up parsers -- shift, reduce, accept, and error.
- PGEN includes a fifth action, jump, which was added to accommodate
- optional symbols. Like the shift action, the jump action specifies
- which state to "shift" onto the stack. Unlike 'shift', it does not
- consume a token from the input stream. The LEX_INFO structure pushed
- 16
- represents the symbol being jumped. Its contents are undefined,
- except that 'was_jumped()' will return a true value when given that
- structure as its argument. The jump action, with all its consequences
- and solutions, is entirely this author's original invention.
-
- The parser also uses the jump action when it would otherwise reduce by
- an empty production with no semantic action. In fact, such a
 - production is eliminated, and all uses of the nonterminal on its
 - left-hand side are made optional.  This will appear in the "PGEN -P"
- output. The contents of the LEX_INFO structure representing the
- result of such a production would otherwise be completely undefined,
- but now you can test it with 'was_jumped()'.
-
- I could have implemented optional symbols by converting the grammar to
- an equivalent form without optional symbols, but that would mean
- adding extra nonterminals and productions just for this purpose, which
- would have increased the size of the resulting tables. The jump
- action is also slightly faster.
-
- This is how PGEN determines when it should use the jump action.
- Assume state 's' contains an LR(1) item of the following form:
-
- Nonterm --> ... <> [Opt_sym] <follow_string>, [{follow_tokens}]
-
- If token 't' is a member of FIRST(<follow_string>) or if
- <follow_string> derives the empty string and 't' is a member of
- {follow_tokens}, then action [s,t] == "jump goto (s, [Opt_sym])".
-
- Whoa! You know that "goto (s, Symbol)" is the state the parser goes
- to (pushes on the stack) when it is in state 's' and it sees 'Symbol'.
- (I consider the item set, "I sub s" to be equivalent to state 's'.)
- But what is "goto (s, [Symbol])"? It is the state the parser goes to
- when it jumps 'Symbol'. This may or may not be the same as
- "goto (s, Symbol)". Consider a state with the following items:
-
- SET a: Xn --> ... <> Symbol ...
- Yn --> ... <> [Symbol] ...
-
- The parser can either see a Symbol or jump it. If seen, both items
- should be in the resulting state.
-
- SET b: Xn --> ... Symbol <> ...
- Yn --> ... [Symbol] <> ...
-
- If jumped, however, only the items where 'Symbol' is optional should
- appear in the resulting state.
-
- SET c: Yn --> ... [Symbol] <> ...
-
- Here, "goto (a, Symbol) = b" and "goto (a, [Symbol]) = c". In the
- "PGEN -KW" output under set 'a', both will appear:
-
- SET a: ...
- Symbol b [Symbol] c
-
- While the jump action reduces size and time, it invites three new
- potential conflicts. See PARSING ACTION CONFLICTS.
-
-
- 17
- PARSING ACTION CONFLICTS:
-
- While generating the parsing actions for each <state, token> pair,
- PGEN may encounter parsing action conflicts (Dragon p.228-230), which
- occur when a pair is defined by two or more actions. These conflicts
- are printed to stderr as they are found. When they occur, the
- programmer should consult the "PGEN -K" or "PGEN -L" output for the
- same grammar to determine if the conflicts need to be resolved.
- Parsing conflicts don't always need to be eliminated; the default
- choice made by PGEN usually produces the intended result, and
- eliminating the conflict may add complexity to the grammar.
-
- A reduce/reduce conflict arises when the same state has two or more
- items that call for a reduction on the same token. For example:
-
- Sign --> '+' <>, [INTEGER REAL]
- Unary --> '+' <>, [INTEGER REAL]
-
- PGEN will reduce by the production that appeared first in the grammar
- file. In this case, "Sign --> '+'" is the production we wanted to
- reduce by, so we just make the definition of Sign appear before Unary
- in the grammar file.
-
 - A shift/reduce conflict occurs when the same state has an item that
 - calls for a shift action and an item that calls for a reduce on the
 - same token.  Example:
-
- Expr --> Expr '*' Expr <>, ['+' '-' '*' '/' '^' ')']
- Expr --> Expr <> '*' Expr, ['+' '-' '*' '/' '^' ')']
-
- The first item calls for reduce by production "Expr --> Expr '*' Expr"
- on token '*' (among others), while the second calls for a shift on
- '*'. This kind of conflict is best resolved by giving each of the
- symbol tokens the appropriate priority and associativity.
-
- Not all shift/reduce conflicts are appropriately resolved with
- associativity. Consider a state containing the following two LR(1)
- items (among others):
-
- Statement --> IF Expr THEN Statement <>, [ELSE ...]
- Statement --> IF Expr THEN Statement <> ELSE Statement, [...]
-
- The first item indicates that an IF Statement can potentially be
- reduced when ELSE is seen. The second item requires a shift. In this
- case and many like it, the shift is needed (we match each ELSE with
- the closest IF). If not, the grammar needs to be rearranged. PGEN
- always chooses shift when there is a shift/reduce conflict.
-
- 18
- The shift/jump conflict arises in states containing items similar to
- the following,
-
- # Opt_sym can represent either a token or a nonterminal
- Nt1 --> ... <> [Opt_sym] ..., [token_set]
- Nt2 --> ... <> TOKEN ...
-
- where TOKEN can follow the Opt_sym in the first production. (If TOKEN
- is a member of FIRST(Opt_sym), this conflict will invariably result.)
- PGEN must choose to make the parser shift the TOKEN or jump Opt_sym.
- It always chooses the shift action.
-
- The reduce/jump conflict results from the similar situation where a
- state contains items like the following:
-
- Nt1 --> ... <> [Opt_sym] ..., [token_set]
- Nt2 --> ... <>, [ ... TOKEN ... ]
-
- Again, TOKEN can follow Opt_sym in the first production. In this
- case, the second production specifies a reduce action on TOKEN. PGEN
- chooses the reduce. The parser can still jump Opt_sym after the
- reduce action if appropriate.
-
- Finally, a jump/jump conflict occurs when two or more different
- symbols could be jumped.
-
- Nt1 --> ... <> [Opt_sym1] ..., [token_set]
- Nt2 --> ... <> [Opt_sym2] ..., [token_set]
-
- Here, TOKEN can follow either Opt_sym1 or Opt_sym2, producing jump
- actions to two different states on the same token. PGEN chooses to
- jump the optional symbol in the first such item in the item set,
- Opt_sym1 in this case.
-
-
-
- PUTTING IT ALL TOGETHER:
-
- As in the EXPR example, your parser will need a lexical analyzer, a
- PGEN file to create the parsing table, the 'parse()' driver routine in
- PG_PARSE.C, the associated semantic actions, and a 'main()' startup
 - procedure.  How these are assembled into a working unit is more
 - easily learned by example than explained in prose.  Reading the
- EXPR*.* and PG_PARSE.? program text files is the best source for
- learning PGEN usage. That's why I created and included the EXPR
- example.
-
- A few things to notice:
-
- (1) The token definitions in the lexer header file parallel the
- definitions in the grammar file. The lexer header includes an
- end-of-stream token as token 0 and uses only identifiers,
- which are the alternate names for the grammar file tokens that
- are symbols.
-
- 19
- (2) The grammar file needs #includes at the top to include the
- token definitions from the lexer header and the semantic
- action names from the semantic action header.
-
- (3) The lexer and semantic actions are written to work in concert
- where each makes assumptions about the other. In EXPR, the
- lexer returns the lexeme string only for certain tokens. The
- semantic actions are responsible for utilizing or freeing the
- memory used. This is normal and is required for efficiency.
-
 - (4) In contrast, the parser driver makes no assumptions about
- the format or content of the LEX_INFO structure. It does
- expect 'set_jumped()' and 'was_jumped()' to be available as
- previously mentioned.
-
- (5) PG_PARSE.C needs to be modified slightly to include your lexer
- header. If your application needs it, as when parsing a
- programming language, either your lexer must skip comments or
- you will need to modify the parser to do so. For extremely
- large grammars PG_PARSE.H may need modification. See the file
- \PSOURCE\OVERVIEW (with registered copy) for information.
-
-
-
- THAT'S ALL, FOLKS!
-
- Here is an answer to your test in LISTS,LISTS,LISTS.
-
- void list1 ()
- /* If Item (pos 0) is ALPHA, init List with count of 1. */
- /* 0 otherwise. */
- { attr [0].count = (attr [0].token == ALPHA);
- }
-
- void list ()
- /* If Item (pos 2) is ALPHA, increment List (pos 0) count. */
- { if (attr [2].token == ALPHA) { attr [0].count += 1; }
- }
-
- void start ()
- /* Check for empty (jumped) list, and print the count. */
- { if (was_jumped (attr [0])) { attr [0].count = 0; }
- printf ("%d\n", attr [0].count);
- }
-
-
- That's all I can think of to say about PGEN. I appreciate all
- feedback -- likes, dislikes, suggested improvements, etc. I would
- especially like to hear from people who have used it. What extra
- information would you have liked to see in the documentation file and
- in the source code comments? Also, I'm always interested in others'
- applications for PGEN. If you care to tell me, please write and let
- me know what you are programming.
-
- I'll conclude with the usage instructions for all the included
- programmer's utilities:
-
-
- 20
- Usage: PPRINT [-SPBN] [-L<lines>] [-C<cols>] [-T<tabs>] <filepatrn> ...
- Pretty prints text files with headings and optional line numbers.
- Options:
- S Send output to standard out. (default: standard printer)
- P Print in current type size. (def: modify for output width)
- B Print in boldface type. (def: use current printer setting)
- N Do not print line numbers. (def: print line numbers)
- L Set lines per page. (def: 66)
- C Set columns per line (incl. line numbers). (def: stdprn-96, stdout-79)
- T Set tab stop width. (def: 8)
-
-
- Usage: command | TAIL [-<nlines>]
- or: TAIL [-<nlines>] <input file>
 - Prints the last <nlines> lines (default 10) from the standard
 - output of the piped command (form 1 above) or from the given
 - input file (form 2 above).
- Use instead [+<nlines>] to print all but the first <nlines> lines.
- Ex: DIR | TAIL -5 Print the last 5 lines of the DIR listing.
- TAIL +20 TEST Print all but the first 20 lines of file TEST.
-
-
- Usage: command | TEE <file>
- Sends text output by the command to the file and to standard output.
- Does not work with interactive commands under DOS.
-
-
- Usage: CAT <file pattern> ...
- Prints in sequence (conCATenates) all requested files.
-
-
- Usage: CHOPCRLF <text file> ...
- Removes the terminating carriage return/line feed
- from the end of text files.
-
-
- Usage: CHOPLL <text file> ...
- Removes all blank lines from the end of text files.
-
-
- Usage: CMDLINE <file pattern> ...
- Prints command line arguments with pattern substitution.
-
-
- Usage: CP [-VQ] <source file> <dest name>
- or: CP [-VQ] <file pattern> ... <dest directory>
- Copies source file to dest file, or a set of files to a directory.
- Options:
- V Verbose. Report what is being copied.
- Q Query user before copying.
-
-
- Usage: MV [-VQ] <source file> <dest name>
- or: MV [-VQ] <file pattern> ... <dest directory>
- Moves source file to dest file, or a set of files to a directory.
- Options:
- V Verbose. Report what is being moved.
- Q Query user before moving.
-
- 21
- Usage: RM [-RVQ] <file pattern> ...
- Removes all files matching the given file patterns.
- Options:
- R Remove all directory contents recursively.
- V Verbose. Report what is being deleted.
- Q Query user before deleting.
-
-
- Usage: HSPLIT <num> <infile.ext>
- Breaks a long text file into <num> shorter text files where each output
- file is a horizontal piece of the input file, approx equal in size.
- Output is to files: infile.H00 .. infile.H<num-1>
- <num> must be in the range 2 - 100.
-
-
- Usage: VSPLIT [-T<tab>] [-L<len>] <infile.ext>
- Breaks a wide text file into several thinner text files where each
- output file is a vertical strip of the input file.
- Output is to files: infile.V00 .. infile.V<num-1>
- where <num> is the largest input line length divided by <len>.
- <tab> is tab size (2-32, default 8).
- <len> is max line length for output files (16-256, default 80).
- <len> must be a multiple of <tab>
-
-
- Usage: NCSL <C/C++ source file pattern> ...
- Reads the requested C/C++ source files and categorizes each line as
- Blank, Comment, 1 Brace, or Code. The name comes from the industry
- metric for source code size, "Non-Commentary Source Lines".
- Commentary:
- Blank -- Lines with only whitespace; no comments or text
- Comment -- Lines containing only // or /* */ comments and whitespace.
- Includes blank lines within /* */ comments.
- Non-commentary:
- 1 Brace -- Lines with only a single { or } brace, whitespace and comments.
- Code -- All other lines (should be program code).
-
-
- Usage: DIFF [-B] <old text> <new text>
- Compares the contents of two text files and prints their differences.
- If -B is specified, a tab character is prepended to the output lines.
-
-
- Usage: DIRDIFF dir1[\patrn] dir2
- Compares the two directories for files matching the pattern
- and prints the matching files unique to each directory.
- Matching subdirectories are also compared recursively.
- This is useful for keeping working and archive directories consistent.
- The first argument is a directory or filename pattern.
- Ex: C:\DIR\*.C, DIR (patrn=*.*), *.EXE (dir=.)
- The second argument is a directory name.
-
-
- DIFF and DIRDIFF are written in C++. The others are in C.
-
-