Text File | 1993-06-08 | 54.4 KB | 1,248 lines |
-
-
-
-
-
-
-
-
- PGEN
-
- Parser GENerator
-
- (v2.0)
-
-
-
-
- Copyright (C) 1992
-
- by
-
- Keith L. Robertson
-
-
-
-
-
-
-
-
-
- PGEN accepts a description of a context-free grammar and its semantic
- actions. It can output C language code for tables used to parse
- sentences in the described language and to interpret them via the
- semantic actions. The C language 'parse()' program that uses the
- tables is included with PGEN.
-
-
-
- PGEN is NOT public domain or free software. It is distributed as
- Shareware. Okay, folks. You're experienced enough to know what that
- means:
-
- You get to try it for free.
- If you like it and use it, buy it;
- keep the author in business.
- If not, burn it (your copies, not mine).
-
- See the section LEGALESE for licensing terms.
-
-
-
- PACKING LIST (Shareware):
-
- This is what you should have with the shareware version of PGEN:
- PGEN parser generator:
- PGEN.DOC This file. The documentation for PGEN and the files
- that come bundled with it.
- PGEN.EXE The parser generator program. Execute it with no
- arguments for brief usage instructions.
- ORDER.FRM Print this for a PGEN order form.
- TYPES.H Public domain header file with some convenient
- declarations. Put this in your INCLUDE directory.
- PG_PARSE.H Header file for types, globals, and procedures needed
- by the parser, parse tables, and semantic actions.
- PG_PARSE.C Source code of the parser.
-
- Included with the parser is the source code for EXPR, a simple
- expression evaluator used to demonstrate usage of the PGEN
- parser generator. It uses PG_PARSE.H and PG_PARSE.C.
- MAKEFILE EXPR's makefile. Compilation instructions are for
- 'ztc', the Zortech compiler. Easily modified for
- other compilers.
- EXPR_LEX.H EXPR's lexical analyzer header. Defines LEX_INFO,
- used by the parser.
- EXPR_LEX.C EXPR's lexical analyzer implementation.
- EXPR_PAR.PGN EXPR's grammar file for PGEN. Used to produce parsing
- tables. PGEN -C EXPR_PAR.PGN >EXPR_PAR.C
- EXPR_SEM.H EXPR's semantic actions header. Declares the semantic
- action routines used in EXPR_PAR.PGN and the resulting
- EXPR_PAR.C.
- EXPR_SEM.C EXPR's semantic actions implementation and the
- 'main()' program.
-
- Bundled with PGEN shareware are three useful utility programs.
- All of these print usage instructions when executed with no
- arguments.
- PPRINT.EXE Pretty print text files, with headings and (optional)
- line numbers. Uses Epson-compatible printer codes.
- The source code (received with commercial version
- registration) is easily modified for other printers.
- TAIL.EXE Prints the last lines of the given file or of the
- standard input stream.
- TEE.EXE Sends standard input to the given file and to standard
- output.
-
-
- NOTE: When you received the shareware version of PGEN, you probably
- received it in a self-extracting archive, PGEN_xxx.EXE, where xxx is
- the version number. On the same disk, there may also be these files:
- PGEN_xxx.EX2, PGEN_xxx.EX3, and PGEN_xxx.EX4. These are merely copies
- of PGEN_xxx.EXE. If the .EXE should be unreadable due to a sector
- gone bad, any of the other three may be renamed with a .EXE extension
- and executed to extract the same files. If your .EXE is okay, the
- others are not needed.
-
-
-
- PACKING LIST (Commercial):
-
- This is what you receive with the commercial version of PGEN:
- All of the above, except with the following change:
- PGEN.EXE Employs a sophisticated table reduction algorithm.
- The EXPR 'action_list' table shrinks from 101 to 73
- elements (28%). A larger application decreased from
- 963 to 577 elements (40%). The nonterminal and
- production tables will also have unused entries
- removed, although the gain is not as great.
-
- You will additionally receive the following utilities:
- CAT.EXE Prints in sequence (concatenates) to standard output
- the given file(s).
- CHOPCRLF.EXE Chops <CR,LF> off the end of a text file.
- CMDLINE.EXE Simply echoes the arguments given with wildcard
- replacement, so you can see what the other programs
- are receiving.
- CP.EXE Copies one file to another or several to a directory.
- MV.EXE Moves one file to another or several to a directory.
- RM.EXE Removes one or more files or directories.
- HSPLIT.EXE Breaks a long text file into several smaller text
- files. Useful when you have a file bigger than your
- editor can handle (like "PGEN -L" output for a large
- grammar).
- VSPLIT.EXE Breaks a wide text file (like "PGEN -A" output for a
- large grammar) into several others for printing.
- NCSL.EXE Reads a C or C++ source file, categorizing each line
- into one of four categories: Blank, Comment, Bracket,
- or Code.
- DIFF.EXE Compares two text files and prints their differences.
- DIRDIFF.EXE Compares two directory structures and prints the files
- unique to each. Useful for comparing working and
- backup directories.
- Each of the above prints usage instructions when executed
- without arguments.
-
- And best of all, commented source code:
- MAKEFILE, 8 header files, and 8 implementation files in C++ for PGEN.
- MAKEFILE and source files (mostly C, some C++) for all the above
- utilities.
-
- Registered users will also receive notice of any updates,
- which will be provided for no more than $5.00, plus non-
- U.S. Mail mailing costs (see REGISTRATION PROCEDURES below).
-
-
-
- REGISTRATION PROCEDURES:
-
- Simple: just print the file ORDER.FRM and fill in the requested
- information. Currently, PGEN is for IBM DOS (and compatible OSes) only.
-
- Make a check payable to "Keith L. Robertson" for the following amount:
- $ 20.00 (U.S.) per copy (TX residents add 8% sales tax)
- $ 1.00 per copy for 3 1/2"-1.44M (high density) disks
-
- Orders are shipped via U.S. Mail at no extra cost. Otherwise, add:
- $ 3.00 for UPS shipment
- $ 5.00 for shipment outside the U.S.A.
-
- Students using PGEN for coursework ONLY may register PGEN for $10.00
- per copy. This gets you the commercial PGEN.EXE and utilities, but not
- the source code.
- Please include photocopy of validated Student I.D. For personal or
- commercial project use, please register for the full $20.00.
-
- Mail the above to:
- Keith L. Robertson
- 11857 Gloger Street
- Houston, TX 77039-6309 (U.S.A.)
-
- Customer support is currently by mail only, to the same address.
- Unregistered users please include SASE if not ordering at the same
- time. PGEN is not a program for the masses; I must keep my costs low
- to offer it and support it at the requested price.
-
-
- LEGALESE (Licensing terms):
- (Sorry, but we have to go thru this.)
-
- In the following, reference to "PGEN" includes the parser generator
- and all source code and other utility programs that come with the
- shareware version. (Exception: TYPES.H is public domain.)
-
- (1) Unregistered users may copy PGEN and use it for two weeks
- solely for the purpose of determining if it suits their needs.
- If PGEN is used to create parsing tables for a programming
- project, whether for commercial or private use, continuing use
- of those tables (and therefore of PG_PARSE.H and PG_PARSE.C)
- within that project constitutes use of PGEN and requires
- registration, even if PGEN itself is never executed again.
-
- (2) A user may register as given above in REGISTRATION PROCEDURES.
-
- (3) A registered user may use PGEN on one computer per copy
- registered at any given time. If other people use PGEN
- routinely, even on the same computer, they must become
- registered users themselves.
-
- (4) A company using PGEN may have one or more employees using the
- same copy of PGEN. In this case, the registered user is the
- company, not the employee. However, if PGEN is available for
- use on more than one computer at a time, one copy must be
- registered for each such computer.
-
- (5) Anyone -- electronic BBS sysops, commercial data networks,
- shareware distribution houses, users, etc. -- may distribute
- the shareware version of PGEN, as long as there is no charge
- for the program itself, all files in the shareware package are
- distributed complete and unmodified, and no files from the
- commercial package are included. There may be a charge for
- normal connect time, or for disk copying and mailing costs, as
- long as this charge does not exceed $7.00 per disk; PGEN
- shareware must fit on one disk. The distributor or vendor
- must make it clear that it is charging for distribution costs
- and not for the software itself.
-
- (6) PGEN may not be bundled with any other commercial product
- without the express written consent of the author.
-
- (7) The commercial version of PGEN, sent to registered users, and
- all source code and utility programs bundled with it, are for
- the private use of the registrant. They may not be copied or
- distributed in any form.
-
- (8) Any products made using properly registered copies of PGEN are
- entirely the property of their creators. There are no
- royalties nor run-time license fees required by the author of
- PGEN for the creation and distribution of these products.
-
- (9) PGEN may be ported to other operating systems (I expect very
- easily), using the source code with the commercial version,
- only with written permission from the author. The ported
- version will be subject to the same licensing terms under the
- target platform. Once the author's permission has been
- obtained, a shareware version may be distributed for that
- platform, as long as it contains all files and only those
- files listed in the shareware packing list.
-
- Any usage of any part of the PGEN shareware or commercial packages,
- including source code or utility programs, constitutes legal acceptance
- of the above licensing terms. A user must be registered to use any of
- the source code or utility programs that come with PGEN beyond the
- evaluation period. If you use ANY of these, you must be registered
- for the entire PGEN package.
-
- THIS SOFTWARE IS SOLD "AS-IS" WITHOUT WARRANTY OF ANY KIND, EITHER
- EXPRESS OR IMPLIED, RESPECTING THE FITNESS OR MERCHANTABILITY OF THIS
- SOFTWARE FOR ANY USE. THE AUTHOR SHALL NOT BE LIABLE TO THE
- PURCHASER, USER, OR ANY OTHER PERSON OR ENTITY WITH RESPECT TO ANY
- LIABILITY, LOSS, OR DAMAGE CAUSED OR ALLEGED TO BE CAUSED DIRECTLY OR
- INDIRECTLY BY THIS SOFTWARE.
-
- The user is especially cautioned on the use of the "RM -R" feature of
- the "RM" remove files utility.
-
-
- Trademark Acknowledgements:
- Epson is a trademark of Epson Corporation.
- IBM is a trademark of International Business Machines Corporation.
- Zortech and Symantec are trademarks of Symantec Corporation.
-
-
-
- USING PGEN
-
-
- INTRODUCTION:
-
- A parser generator is a rather advanced programming tool which can be
- used in the creation of compilers, interpreters, structured editors,
- text formatters, command line parsers (e.g. for an adventure game),
- etc. This document is therefore written for the advanced programmer
- with knowledge of the principles of parser generation. If you don't
- know what a token (or terminal), nonterminal, production, or semantic
- action is, chances are you don't need a parser generator. For
- those interested in the topic, read
-
- Aho, A.V., R.Sethi, and J.D.Ullman,
- "Compilers, principles, techniques, and tools",
- 1986, Addison-Wesley, Reading, Mass.
-
- known as "the dragon book" in academia for the red dragon on its
- cover, which represents "complexity of compiler design". An LALR
- parser generator, like PGEN, is shown as the knight's sword used to
- slay the dragon.
-
- I created PGEN because I wanted to write a compiler, I didn't have
- access to YACC (Dragon book pp.257-266), and I couldn't find a public
- domain or shareware parser generator in the ShW/PD libraries. Because
- of this, I don't know how PGEN compares with other parser generators
- in execution speed, nor in size and speed of the parser generated (if
- anyone wants to compare PGEN with any other and send me the results, I
- would welcome the feedback). I do know I have used PGEN to create a
- working parser (I haven't finished the whole compiler; been working on
- PGEN!). I find it very useful and others may as well.
-
- Terminology:
- stdin = the standard input stream
- stdout = the standard output stream
- stderr = the standard error stream
-
-
-
- PGEN COMMAND LINE:
-
- Usage: PGEN [-I] [-TNFDPSKLWQAGCX] <grammar file>
-
- PGEN reads the given grammar file and prints its results to stdout,
- which is usually redirected to a file. Options can be given together
- or separately ("PGEN -IC x" == "PGEN -I -C x") in any order, but must
- precede the grammar file name. During its operation, PGEN prints some
- progress information to stderr, which will print to the screen even
- when stdout is redirected (as it should).
-
-
-
- PGEN GRAMMAR FILE FORMAT:
-
- The grammar file has a format similar to that expected for YACC, but
- there are significant differences. The format is as follows:
-
-
- /* C language code that will appear at the beginning of the
- C code parse table output. Everything up to the first line
- beginning with an open bracket ('[') is emitted unmodified.
- Usually header comments, includes, and defines.
- No line in this section can begin with '['.
- */
- #include "lexer.h" /* Get defns of token identifier names. */
- #define STACK_INC 10
-
- [ # Token definitions begin with a bracket, which must begin the
- # line. Comments begin with '#' and last to the end of the line.
- TOKEN1, # Token definitions are separated by commas.
- # The parser must return 0 for end-of-stream,
- # 1 for the first token (TOKEN1 here), etc.
- IDENTIFIER, # A token name can be an identifier, beginning
- NUMBER, # with a letter, followed by letters, numbers
- # & underscores, up to 128 characters. Case
- # is significant, although all upper case is
- # recommended for tokens.
- "%", # A token can also be a string, enclosed in
- # single or double quotes. Tokens must
- # appear in productions exactly as in the
- # definition including the quote characters.
- T ALT_NAME, # A token may be given an alternate name,
- # which will be used when printing the parse
- # table C code. It must be an identifier.
- ',' COMMA, # An alternate name is usually used with
- '(' LP, ')' RP, # string tokens, since the string name is not
- # valid for the C output tables.
- '=' 1R, # A token may be given a precedence number,
- # which can resolve some ambiguities (Dragon
- # p.261). A 'L' or 'R' giving associativity
- # can immediately follow the precedence.
- '+' 2, # Left associative is the default.
- '-' 2, # All operators with the same precedence
- # should be given the same associativity.
- '*' TIMES 3L, # When all three are present, the correct
- # order is "primary name", "alternate name",
- # "precedence/assoc".
- ] # Token defs end with a close bracket. Productions follow.
-
- # The first nonterminal defined is always the start symbol.
- # Each nonterminal must be defined exactly once but may have
- # multiple productions in the definition.
- Expr_list # Recommended style is initial capital.
- # Each production is introduced with a '|' or ':' and is defined
- # by a sequence of zero or more symbols (tokens and nonterminals)
- # on the right-hand-side. I recommend introducing all
- # productions with '|'.
- # Nonterminals can be used before they are defined.
- | Rep1_expr_list
- | # Empty productions are sometimes necessary.
- | error # Error prod's as in YACC (see ERROR HANDLING)
- ; # List of productions is ended with ';'.
- Rep1_expr_list
- # Productions may be directly or indirectly recursive, although
- # left recursion, as shown here, is preferable.
- | Rep1_expr_list ',' Expr
- | Expr
- ;
- Expr
- # Productions can have a priority also. The first token with a
- # priority gives its priority to the production. Alternately,
- # the priority can be given explicitly. Although this grammar is
- # ambiguous, the priority and associativity designations resolve
- # the ambiguity.
- | Expr '*' Expr # Implicit priority 3
- | 2 Expr Add_op Expr # Explicit priority 2
- # A production may end with a semantic action name introduced by
- # an at sign ('@'). This identifier is the name of the C
- # procedure to be called when the production is reduced.
- # It must be a simple identifier, not C code.
- | IDENTIFIER '=' Expr @ do_assign # Recommend lower case
- | '(' Expr ')' @ do_paren
- | Add_op Expr @ do_unary
- | IDENTIFIER [Fn_args] # Any single symbol may be enclosed in
- # brackets to make it optional.
- ;
- Fn_args
- | '(' Expr_list_2 ')'
- ;
- Expr_list_2
- | { Expr ',' ... } # A production can be defined by two
- # symbols in braces to specify a list.
- ;
- Add_op | '+' | '-' ;
- #--------------------------- End of File ---------------------------#
-
- The grammar above is not terribly useful as-is. See EXPR_PAR.PGN for
- a working example.
-
- For readability, I recommend that a nonterminal being defined sit on
- a line by itself, followed by its productions indented beneath, each
- production beginning a new line, and the terminating semicolon
- indented on a line by itself. (I don't entirely follow this
- convention in this document's examples for reasons of length.)
-
- For users of versions prior to 2.0, note that "{ sem }" has been
- changed to "@ sem" for semantic actions. If it were needed, I would
- have included a converter, but any decent text editor will have a
- search and replace operation that will allow you to selectively
- replace '{' with '@' and remove '}'. Braces were originally used to
- enclose the semantic action name for similarity to YACC, but they were
- made to enclose the repetition construct instead for similarity to
- EBNF (Extended Backus-Naur Formalism).
-
-
-
- PGEN COMMAND LINE OPTIONS:
-
- Once again, the usage line is
- Usage: PGEN [-I] [-TNFDPSKLWQAGCX] <grammar file>
- PGEN has many options, although all but one are print options.
-
- -I Inline productions. If this option is given, PGEN will
- substitute productions where possible. See PRODUCTION
- INLINING below.
-
- The remaining options tell PGEN what to print to stdout.
-
- -T Print tokens. The format is "Sequence number", "Priority",
- "Associativity", "Primary name". The alternate name is not
- printed. '#E_O_S#' (end of stream) is always the first token,
- and 'error' is always the last.
- -N Print nonterminals. The format is "Sequence number", "Number
- of the first production which defines the nonterminal",
- "Number of productions defining this nonterminal", "Number of
- uses on the right-hand-side of any production", "Stack size
- needed to parse this nonterminal", "Recursion mark", "Name".
- The first line gives a stack size and recursion mark for the
- entire grammar. These are explained later. A dummy start
- symbol '$_Start_$' is always the first nonterminal.
- -F Print the "first symbols in productions" for each nonterminal.
- -D Print the "first symbols in derivations" for each nonterminal.
- -F and -D were primarily useful for debugging PGEN itself, so
- the meaning of these is documented in the source file
- PGEN_SYM.CPP.
- -P Print productions. "Sequence number", "Number of the first
- SLR item for this production", "Production priority",
- "Production". The grammar is always augmented with the
- production
- $_Start_$ --> <the first defined nonterminal>.
- -S Print the SLR items. "Sequence number", "SLR item". The dot
- location is shown as '<>'.
- -K Print the item set kernels.
- -L Print the item set LALR closures.
- -K and -L are very useful in debugging grammars, especially in
- finding and resolving parsing conflicts.
- -W Modifies the printing of -K and -L to print each item set
- (W)ith its goto list. Each list shows what state the parser
- will go to for each symbol (both tokens and nonterminals).
- Symbols not listed are errors if seen.
- -Q Print parsing action conflicts to stdout instead of stderr.
- -A Print a table showing the parsing actions for each state and
- token.
- -G Print the goto table for each state and nonterminal.
- -A and -G combined give the type of information shown in the
- Dragon book Figs 4.31 (p.219), 4.40 (p.236), and 4.47 (p.250).
- -C Print the C language code for the generated parsing tables.
- (Our goal.)
- -X Print ALL of the above except -W and -Q (which you may specify
- separately). This will get REALLY BIG for a grammar of any
- size, and was mostly useful in debugging PGEN.
-
-
-
- PRODUCTION INLINING:
-
- Consider the following:
-
- Lhn | Rhn ; # This production has one nonterminal and no
- # semantic action. Lhn may have other prod's.
- Rhn | ... ; # Rhn is defined by one production, which
- # may or may not have a semantic action.
-
- PGEN will substitute for Rhn, resulting in:
-
- Lhn | ... ; # Rhn's production and semantic action.
-
- This will increase the parser's speed slightly, since it won't have to
- reduce by "Lhn --> Rhn". The rationale for this option is so we can
- write productions like
-
- Statement
- | For_statement
- | While_statement
- | If_statement
- # etc.
- ;
- For_statement
- | FOR IDENT ':=' Expr To_ Expr DO Statement @ for_st ;
- # etc.
-
- without any unnecessary overhead, rather than having to write
-
- Statement
- | FOR IDENT ':=' Expr To_ Expr DO Statement @ for_st ;
- # etc.
-
- which will be much less readable. This optimization often inlines
- and optimizes many more productions than you might expect.
-
-
-
- STACK SIZE AND RECURSIVE MARK:
-
- Two of the things printed on the "PGEN -N" option are the stack size
- and recursive mark for each nonterminal and for the entire grammar.
- The stack size needed to parse the grammar is simply that for the
- augmented start symbol '$_Start_$' plus one. The recursion mark for
- the grammar is the sum of that for all nonterminals.
-
- The stack size for a nonterminal is the worst-case size of the parsing
- stack needed to parse a sequence of tokens, starting at that
- nonterminal, barring any recursion other than left recursion. Left
- recursion, as with Rep1_expr_list above, causes no problems, and is
- not considered recursion for purposes of the mark or this discussion.
- Other recursion, as with both of these productions,
-
- Expr | '(' Expr ')'
- | IDENTIFIER '=' Expr ; # Right associative
-
- would require an infinite stack size to be guaranteed to parse any
- sequence of tokens that match the grammar's definition of an Expr.
- For example, the following is a valid Expr:
-
- ((((((((((((((((((((((((((( zebra )))))))))))))))))))))))))))
-
- Indeed, how many parentheses should we allow? A million? Since an
- infinite stack is impossible in any real computer, PGEN computes the
- stack size as if there were no such recursion, and informs the
- programmer in its -N and -C output if the grammar is recursive.
-
- The recursive mark for a nonterminal increases each time (non-left)
- recursion through the nonterminal is detected. Its main relevance,
- however, is whether or not it is zero, not the actual number. If the
- mark is nonzero, the nonterminal or grammar is recursive.
-
- The programmer may deal with a recursive grammar by deciding how much
- nesting to allow and setting an amount to increase the stack size in
- the "PGEN -C" output. This is done by defining the macro 'STACK_INC'
- in the C code part of the grammar file. For example,
-
- #define STACK_INC 10
-
- will allow at least 10 nested '(' or at least 5 nested "IDENT '='"
- (or six '(' and two "IDENT '='", etc.) without stack overflow, even
- in the worst case.
-
- The parser may still be asked to parse a string of tokens which is
- valid for the language, but which its stack cannot handle. It will
- not by default check if the parse stack overflows, and in this case
- the program will probably crash. Fortunately, with just a little
- leeway, stack overflow is rare in practice unless you are really
- trying. To get the parser to check for overflow, set STACK_SIZE in
- PG_PARSE.C to the actual stack size. The actual stack size is set by
- the "#define STACK_SIZE" in the "PGEN -C" output.
-
-
-
- FOLLOWING ALONG WITH PGEN:
-
- While executing, PGEN prints progress messages. As they are read, it
- prints the number of tokens, nonterminals, and productions taken from
- the input file. PGEN then checks that all nonterminals are properly
- defined and used. The number of SLR items is computed and printed.
-
- As the item sets are constructed, PGEN prints a running tab of two
- numbers. The first number is the item set PGEN is currently
- examining; the second is the total number of item sets created so far.
- Initially, there is only one item set, number 0. As each set is
- examined, zero, one, or many new item sets may be added. PGEN is done
- when the first number becomes equal to the second.
-
- Then, lookahead symbols are propagated from all item sets. PGEN
- prints the item set currently being examined. When it finishes the
- last item set, it may need to backtrack to an earlier set several
- times.
-
- Next, PGEN constructs a table of actions for each <state, token> pair.
- If any parsing conflicts are found (see PARSING ACTION CONFLICTS),
- they are printed during this phase.
-
- Finally, PGEN converts the tables to action lists which reduces the
- storage requirement (Dragon book p.244-247). In the commercial
- version, PGEN merges similar action lists as they are added, further
- reducing the needed storage. The lists then need to be sorted into
- the order they will be printed in the final C code output.
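-
- The action-list idea can be sketched in C. This is a generic
- illustration of the textbook technique, not PGEN's actual table
- layout; the state, tokens, and action values below are invented.
-
```c
#include <assert.h>

/* Instead of one full row of actions per state, each state keeps a
   short list of (token, action) pairs ending in a catch-all default.
   Similar lists can then share storage.                             */
typedef struct { int token; int action; } ACT;

#define ANY (-1)                 /* matches every token: the default */

static const ACT state3[] = {    /* a hypothetical state             */
    { 1, +4 },                   /* on token 1: shift to state 4     */
    { 2, -7 },                   /* on token 2: reduce by prod. 7    */
    { ANY, 0 }                   /* any other token: error           */
};

/* Scan the list for the token; fall through to the default.        */
static int lookup (const ACT *list, int token)
{
    while (list->token != ANY && list->token != token)
        ++list;
    return list->action;
}
```
-
- With lists like these, 'lookup (state3, 1)' yields 4, while an
- unlisted token falls through to the default action 0.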
-
-
-
- CREATING A LEXICAL ANALYZER:
-
- The parser requires an associated lexical analyzer (lexer).
- Fortunately, it is not very difficult to create an efficient, robust
- lexer by hand. EXPR_LEX.C is easily modified to recognize a different
- set of tokens and to accept its input from other than stdin.
-
- The header file for the lexer must declare all token identifier names
- and any types, global data, or routines needed by the parser and
- semantic actions. These are the items it must declare:
-
- 1) The type 'LEX_INFO'
- The type of objects on the parse stack. Defines attribute information
- relevant to token lexemes and nonterminal symbols.
-
- 2) The extern variable 'lexeme' of type 'LEX_INFO'
- Location where the lexer function 'get_token()' returns information
- related to the last token it read from the input stream. READ ONLY!
-
- 3) The extern variable 'last_token' of type 'USHORT'
- Token number returned by the last call to 'get_token()'. READ ONLY!
-
- 4) Function 'USHORT get_token()'
- Reads the input stream, returns the number of the token it saw
- directly and in 'last_token', and places related information in
- 'lexeme'.
-
- 5) Procedure 'VOID log_error (USHORT etype, CSTRING desc)'
- Prints an error message based on the given error type and description.
- See EXPR_LEX.H for a description.
-
- 6) Operations 'set_jumped' and 'was_jumped' on type 'LEX_INFO'
- #define set_jumped(lex,val)
- #define was_jumped(lex)
- The macro 'set_jumped()' marks 'lex' as jumped if 'val' is true and as
- not jumped if 'val' is false; 'was_jumped()' returns true if and only
- if 'lex' is marked as jumped.
-
- The type 'USHORT' above is defined in TYPES.H as "unsigned short".
- You may alternately use any integer type.
-
- Finally, if the parser is to stop parsing on something in the input
- stream other than the end-of-file, the lexer merely needs to return
- the end-of-stream token (token 0) for that thing.
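-
- As a sketch of this interface: the stand-in LEX_INFO, the two-token
- language, and reading from a string (rather than a real input
- stream) are all simplifications invented here; see EXPR_LEX.H and
- EXPR_LEX.C for the real thing.
-
```c
#include <ctype.h>

typedef unsigned short USHORT;
enum { E_O_S = 0, NUMBER = 1 };      /* token 0 is always end-of-stream */

/* Stand-in LEX_INFO: one attribute field plus the jumped flag. */
typedef struct { int jumped; double fvalue; } LEX_INFO;

LEX_INFO lexeme;                     /* info on the last token read     */
USHORT   last_token;                 /* number of the last token read   */

#define set_jumped(lex,val) ((lex).jumped = (val))
#define was_jumped(lex)     ((lex).jumped)

static const char *src = "";         /* input text, for simplicity      */

/* Returns NUMBER for a digit run, E_O_S for anything else. */
USHORT get_token (void)
{
    set_jumped (lexeme, 0);          /* a returned token was seen       */
    if (isdigit ((unsigned char) *src)) {
        double v = 0.0;
        while (isdigit ((unsigned char) *src))
            v = v * 10.0 + (*src++ - '0');
        lexeme.fvalue = v;
        return last_token = NUMBER;
    }
    return last_token = E_O_S;       /* anything else stops the parse   */
}
```
-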
-
-
-
- SEMANTIC ACTIONS:
-
- When the parser reduces by a production, its corresponding semantic
- action (after '@'), if any, is executed. Semantic actions must be
- written as individual C procedures which take no arguments and return
- no value.
-
- When a semantic action is called, a global pointer 'attr' points to an
- array of objects of type LEX_INFO representing the symbols of the
- associated production, beginning with 0. The semantic action returns
- attribute information in attr[0]. For example, in the EXPR example we
- have a production
-
- Expression | IDENTIFIER '=' Expression @ expr_assign ;
-
- In the code of 'expr_assign()', the identifier is represented by
- attr[0]. The lexer returned the string for token IDENTIFIER in its
- 'name' field. The '=' is attr[1]. The lexer did not return the token
- string, as all relevant information was carried by the token type
- EQUAL. In fact, that the parser saw '=' when it did is contained in
- the parse state, so 'expr_assign()' doesn't even consult attr[1].
- Finally, the attributes for the received Expression are in attr[2],
- of which only the field 't.fvalue' is relevant.
-
- A semantic action returns its attribute information in attr[0]. Note
- that the first production symbol and the return symbol use the same
- attribute object, attr[0]. Make sure the writes for the return
- information don't affect the attribute reads from the first symbol
- information. This is easy to ensure in practice. Also, if the return
- attributes are identical to the attr[0] input and no side effects are
- required, no semantic action is needed. This is the case for Binary1,
- Binary2, and Unary in EXPR_PAR.PGN.
-
- In EXPR, 'expr_assign()' is mostly useful for its side effect, saving
- the IDENTIFIER name and its assigned value in a symbol table. Since
- an assignment is also an Expression, the value is copied to
- attr[0].t.fvalue.
-
- It is an error for a semantic action to access or modify attr[n] where
- (n < 0) or ("the number of production symbols" <= n).
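-
- A sketch in the spirit of 'expr_assign()' may make this concrete.
- The LEX_INFO layout and the one-entry "symbol table" below are
- stand-ins invented for illustration; the actual code is in
- EXPR_SEM.C.
-
```c
#include <string.h>

/* Stand-in declarations; the real ones come from the lexer header. */
typedef struct { char name[32]; struct { double fvalue; } t; } LEX_INFO;

static LEX_INFO parse_stack[8];
LEX_INFO *attr = parse_stack;   /* parser aims this at the production  */

static char   sym_name[32];     /* toy one-entry symbol table          */
static double sym_val;

/* Expression | IDENTIFIER '=' Expression @ expr_assign ; */
void expr_assign (void)
{
    double val = attr[2].t.fvalue;   /* value of the right Expression  */
    strcpy (sym_name, attr[0].name); /* side effect: save IDENTIFIER   */
    sym_val = val;                   /* ...and its assigned value      */
    attr[0].t.fvalue = val;          /* an assignment is also an Expr  */
}
```
-
- Note that attr[0].name is read before attr[0].t.fvalue is written,
- so the return information does not clobber the first symbol's input.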
-
-
-
- OPTIONAL SYMBOLS:
-
- Any single symbol on the right-hand-side of a production can be made
- optional simply by enclosing it in brackets. Prior to version 2.0,
- you would usually need to clutter your grammar with the following:
-
- Nonterm | ... Opt_symbol ... @ p1_action ;
- Opt_symbol | Symbol
- | ;
-
- This also adds inefficiency. Time: When Symbol is parsed, we must
- needlessly reduce it to an Opt_symbol. Space: The above introduces
- one nonterminal and two productions which must appear in the final
- parse table output. The only time that you wouldn't need this is if
- Symbol is a nonterminal that is optional in ALL its uses. Then it
- just needs an empty production:
-
- Symbol | ... all its other productions
- | ;
-
- Now, it doesn't matter if Symbol is a token or nonterminal or if it is
- not optional in other uses. You may write
-
- Nonterm | ... [Symbol] ... @ p1_action ;
-
- When appropriate, Symbol will be jumped (skipped) with no semantic
- action taken.
-
- This introduces a potential problem. How does a semantic action like
- 'p1_action()' determine if one of its optional symbols appeared in the
- input stream or was jumped? It tests "was_jumped (attr [sym_num])",
- where 'sym_num' corresponds to the Symbol. As previously noted (in
- CREATING A LEXICAL ANALYZER), the lexer header is required to provide
- this macro.
-
- The lexer must ensure "was_jumped (lexeme)" is false, because any
- token it returns was seen, not jumped. The parser sometimes needs to
- mark stack symbols (pointed to by attr) as jumped or not, so the macro
- 'set_jumped()' must also appear in the lexer header.
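-
- A sketch of such a test follows. The types, macros, production, and
- 'p1_action' itself are hypothetical stand-ins, not code from the
- package.
-
```c
/* Stand-ins for what the lexer header really provides. */
typedef struct { int jumped; double fvalue; } LEX_INFO;

#define set_jumped(lex,val) ((lex).jumped = (val))
#define was_jumped(lex)     ((lex).jumped)

static LEX_INFO parse_stack[8];
LEX_INFO *attr = parse_stack;

static double result;

/* Nonterm | IDENTIFIER [Fn_args] @ p1_action ;   Fn_args is attr[1]. */
void p1_action (void)
{
    if (was_jumped (attr[1]))
        result = attr[0].fvalue;                  /* no argument list  */
    else
        result = attr[0].fvalue + attr[1].fvalue; /* arguments present */
}
```
-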
-
-
-
- LISTS, LISTS, LISTS:
-
- A construct that may repeat entirely was always easy to write.
-
- List | List <all repeating symbols>
- | # for 0-or-more, *OR*
- | <all repeating symbols> # for 1-or-more.
- ;
- # e.g. (empty) C; C;C; C;C;C; C;C;C;C; (etc.)
- # Here, <all repeating symbols> derives "C;"
-
- Optional symbols make the 1-or-more list even easier:
-
- List | [List] <all repeating symbols> ;
-
- However, a list of items and separators is more verbose to express
- without optional symbols, especially in the 0-or-more case:
-
- List_0 | List_1 # 0-or-more
- | ;
- List_1 | List_1 Separator Item # 1-or-more
- | Item ;
- # e.g. (empty) A A,A A,A,A A,A,A,A (etc.)
- # Here, Item derives "A" and Separator derives ","
-
- Optional symbols remove the need for nonterminal List_0 and its
- productions, but you must make your list optional in all its uses
- EXCEPT in the 1-or-more list.
-
- List_0_use | ... [List_0] ... ;
- List_0 | List_0 Separator Item # Really a 1-or-more
- | Item ;
-
 - This is error-prone; you might forget to make a usage optional.
- Computers are supposed to do the rote work for us, right?
-
- PGEN version 2.0 adds the following constructs:
-
- List_0 | { Item Separator ... } ; # 0-or-more list
- List_1 | { Item Separator ... }+ ; # 1-or-more list
-
 - There must be exactly two symbols in the production, the item and
 - the separator, and no others.  The ellipsis ('...') is optional but
 - highly recommended.  For example:
-
- A_list | { A ',' ... } ; # Add '+' to exclude (empty)
-
- PGEN converts both constructs to a 1-or-more list, turning the one
- production into two. For a 0-or-more list, it automatically scans all
- past and future productions, and makes optional all uses of the
- defined nonterminal. The "PGEN -P" output shows the productions after
- these translations.
-
- What about semantic actions? This production
-
- List | { Item Separator ... } @ bld_item_list ;
-
- and the similar 1-or-more list both become
-
- List | List Separator Item @ bld_item_list
- | Item @ bld_item_list1 ;
-
- Note that two semantic actions are needed by a list. The first to
- execute will have the semantic action name with the character '1'
- appended ('bld_item_list1'). It must convert a single item to a list
- with one item. Successive iterations will call the unmodified
- semantic action name ('bld_item_list'), which adds a single item to
- the list.
-
- WARNING!! There are two potential pitfalls the PGEN parser programmer
- must particularly pay heed to! (Oooh, love the alliteration.)
-
- 1) In the unchanged semantic action name (e.g. 'bld_item_list'),
- the List nonterminal is at attr[0], the Item at attr[2], and
- of course, the Separator at attr[1]. This is not obvious from
- the original list construct where the Item appears before the
- Separator, and the List doesn't appear at all.
-
- 2) In the usage of a 0-or-more list, an empty list results in a
- jumped symbol, NOT in an empty list. Be sure to check for
- this in semantic actions that use a 0-or-more list.
-
- 15
- Your test for the day:
-
- [ ALPHA, BETA, SEP, END ] # Tokens
- Start | List END @ start ;
- List | { Item SEP ... } @ list ;
- Item | ALPHA
- | BETA ;
- # Item derives ALPHA or BETA. The Separator is SEP.
-
- Implement semantic actions for the above grammar so that the number of
- ALPHA tokens found in the input stream is printed. Assume structure
- LEX_INFO has an integer field 'token' filled by the lexer with the
- appropriate token value and an integer field 'count' available for
- your use. Also, 'ALPHA', 'BETA', etc. are macro defined constants
- with the correct token values. See THAT'S ALL, FOLKS! for an answer.
-
-
-
- ERROR HANDLING:
-
 - Error handling is done similarly to YACC (Dragon book p.264-266),
- by deciding on a set of "major" nonterminals to have error recovery
- and giving those nonterminals an error production of the form
-
- Nonterminal | error @ nonterminal_error ;
-
- The error token must be the ONLY symbol in the production, and it must
- appear in lower case.
-
- When an unexpected token is detected in the input stream, 0 or more
- states are popped off the stack until the topmost state defines an
- action on the 'error' token (the state contains an item of the form
- "Nonterminal --> <> error"). The error token is shifted onto the
- stack, and the error production's semantic action is called when the
- error production is reduced. If the stack becomes empty and no such
- state is found, the parser aborts with a fatal error.
-
- The error semantic action routine is responsible for printing any
- needed error message and for skipping tokens until the parser can
- resume. Procedure 'log_error()' in EXPR_LEX.C provides the former
- facility for EXPR. PG_PARSE.C provides routines 'skip_to_token()' and
- 'skip_to_bittok()' for skipping to a particular token, or to one of a
- set of tokens given as a bitmap. (See the file EXPR_SEM.C for example
- usages of these.) The matching token is not consumed from the input
- stream by these procedures.
-
-
-
- THE JUMP ACTION:
-
- You already should be familiar with the quaternion of actions
- performed by bottom-up parsers -- shift, reduce, accept, and error.
- PGEN includes a fifth action, jump, which was added to accommodate
- optional symbols. Like the shift action, the jump action specifies
- which state to "shift" onto the stack. Unlike 'shift', it does not
- consume a token from the input stream. The LEX_INFO structure pushed
- 16
- represents the symbol being jumped. Its contents are undefined,
- except that 'was_jumped()' will return a true value when given that
- structure as its argument. The jump action, with all its consequences
- and solutions, is entirely this author's original invention.
-
- The parser also uses the jump action when it would otherwise reduce by
- an empty production with no semantic action. In fact, such a
 - production is eliminated, and all uses of the nonterminal on its
 - left-hand side are made optional.  This will appear in the "PGEN -P"
- output. The contents of the LEX_INFO structure representing the
- result of such a production would otherwise be completely undefined,
- but now you can test it with 'was_jumped()'.
-
- I could have implemented optional symbols by converting the grammar to
- an equivalent form without optional symbols, but that would mean
- adding extra nonterminals and productions just for this purpose, which
- would have increased the size of the resulting tables. The jump
- action is also slightly faster.
-
- This is how PGEN determines when it should use the jump action.
- Assume state 's' contains an LR(1) item of the following form:
-
- Nonterm --> ... <> [Opt_sym] <follow_string>, [{follow_tokens}]
-
- If token 't' is a member of FIRST(<follow_string>) or if
- <follow_string> derives the empty string and 't' is a member of
- {follow_tokens}, then action [s,t] == "jump goto (s, [Opt_sym])".
-
- Whoa! You know that "goto (s, Symbol)" is the state the parser goes
- to (pushes on the stack) when it is in state 's' and it sees 'Symbol'.
- (I consider the item set, "I sub s" to be equivalent to state 's'.)
- But what is "goto (s, [Symbol])"? It is the state the parser goes to
- when it jumps 'Symbol'. This may or may not be the same as
- "goto (s, Symbol)". Consider a state with the following items:
-
- SET a: Xn --> ... <> Symbol ...
- Yn --> ... <> [Symbol] ...
-
- The parser can either see a Symbol or jump it. If seen, both items
- should be in the resulting state.
-
- SET b: Xn --> ... Symbol <> ...
- Yn --> ... [Symbol] <> ...
-
- If jumped, however, only the items where 'Symbol' is optional should
- appear in the resulting state.
-
- SET c: Yn --> ... [Symbol] <> ...
-
- Here, "goto (a, Symbol) = b" and "goto (a, [Symbol]) = c". In the
- "PGEN -KW" output under set 'a', both will appear:
-
- SET a: ...
- Symbol b [Symbol] c
-
- While the jump action reduces size and time, it invites three new
- potential conflicts. See PARSING ACTION CONFLICTS.
-
-
- 17
- PARSING ACTION CONFLICTS:
-
- While generating the parsing actions for each <state, token> pair,
- PGEN may encounter parsing action conflicts (Dragon p.228-230), which
- occur when a pair is defined by two or more actions. These conflicts
- are printed to stderr as they are found. When they occur, the
- programmer should consult the "PGEN -K" or "PGEN -L" output for the
- same grammar to determine if the conflicts need to be resolved.
- Parsing conflicts don't always need to be eliminated; the default
- choice made by PGEN usually produces the intended result, and
- eliminating the conflict may add complexity to the grammar.
-
- A reduce/reduce conflict arises when the same state has two or more
- items that call for a reduction on the same token. For example:
-
- Sign --> '+' <>, [INTEGER REAL]
- Unary --> '+' <>, [INTEGER REAL]
-
- PGEN will reduce by the production that appeared first in the grammar
- file. In this case, "Sign --> '+'" is the production we wanted to
- reduce by, so we just make the definition of Sign appear before Unary
- in the grammar file.
-
 - A shift/reduce conflict occurs when the same state has an item that
 - calls for a shift action and an item that calls for a reduce on the
 - same token.  Example:
-
- Expr --> Expr '*' Expr <>, ['+' '-' '*' '/' '^' ')']
- Expr --> Expr <> '*' Expr, ['+' '-' '*' '/' '^' ')']
-
- The first item calls for reduce by production "Expr --> Expr '*' Expr"
- on token '*' (among others), while the second calls for a shift on
- '*'. This kind of conflict is best resolved by giving each of the
- symbol tokens the appropriate priority and associativity.
-
- Not all shift/reduce conflicts are appropriately resolved with
- associativity. Consider a state containing the following two LR(1)
- items (among others):
-
- Statement --> IF Expr THEN Statement <>, [ELSE ...]
- Statement --> IF Expr THEN Statement <> ELSE Statement, [...]
-
- The first item indicates that an IF Statement can potentially be
- reduced when ELSE is seen. The second item requires a shift. In this
- case and many like it, the shift is needed (we match each ELSE with
- the closest IF). If not, the grammar needs to be rearranged. PGEN
- always chooses shift when there is a shift/reduce conflict.
-
- 18
- The shift/jump conflict arises in states containing items similar to
- the following,
-
- # Opt_sym can represent either a token or a nonterminal
- Nt1 --> ... <> [Opt_sym] ..., [token_set]
- Nt2 --> ... <> TOKEN ...
-
- where TOKEN can follow the Opt_sym in the first production. (If TOKEN
- is a member of FIRST(Opt_sym), this conflict will invariably result.)
- PGEN must choose to make the parser shift the TOKEN or jump Opt_sym.
- It always chooses the shift action.
-
- The reduce/jump conflict results from the similar situation where a
- state contains items like the following:
-
- Nt1 --> ... <> [Opt_sym] ..., [token_set]
- Nt2 --> ... <>, [ ... TOKEN ... ]
-
- Again, TOKEN can follow Opt_sym in the first production. In this
- case, the second production specifies a reduce action on TOKEN. PGEN
- chooses the reduce. The parser can still jump Opt_sym after the
- reduce action if appropriate.
-
- Finally, a jump/jump conflict occurs when two or more different
- symbols could be jumped.
-
- Nt1 --> ... <> [Opt_sym1] ..., [token_set]
- Nt2 --> ... <> [Opt_sym2] ..., [token_set]
-
- Here, TOKEN can follow either Opt_sym1 or Opt_sym2, producing jump
- actions to two different states on the same token. PGEN chooses to
- jump the optional symbol in the first such item in the item set,
- Opt_sym1 in this case.
-
-
-
- PUTTING IT ALL TOGETHER:
-
- As in the EXPR example, your parser will need a lexical analyzer, a
- PGEN file to create the parsing table, the 'parse()' driver routine in
- PG_PARSE.C, the associated semantic actions, and a 'main()' startup
 - procedure.  How these are assembled into a working unit is more
 - easily learned by example than explained in prose.  Reading the
- EXPR*.* and PG_PARSE.? program text files is the best source for
- learning PGEN usage. That's why I created and included the EXPR
- example.
-
- A few things to notice:
-
- (1) The token definitions in the lexer header file parallel the
- definitions in the grammar file. The lexer header includes an
- end-of-stream token as token 0 and uses only identifiers,
- which are the alternate names for the grammar file tokens that
- are symbols.
-
- 19
- (2) The grammar file needs #includes at the top to include the
- token definitions from the lexer header and the semantic
- action names from the semantic action header.
-
- (3) The lexer and semantic actions are written to work in concert
- where each makes assumptions about the other. In EXPR, the
- lexer returns the lexeme string only for certain tokens. The
- semantic actions are responsible for utilizing or freeing the
- memory used. This is normal and is required for efficiency.
-
 - (4) In contrast, the parser driver makes no assumptions about
- the format or content of the LEX_INFO structure. It does
- expect 'set_jumped()' and 'was_jumped()' to be available as
- previously mentioned.
-
- (5) PG_PARSE.C needs to be modified slightly to include your lexer
- header. If your application needs it, as when parsing a
- programming language, either your lexer must skip comments or
- you will need to modify the parser to do so. For extremely
- large grammars PG_PARSE.H may need modification. See the file
- \PSOURCE\OVERVIEW (with registered copy) for information.
-
-
-
- THAT'S ALL, FOLKS!
-
- Here is an answer to your test in LISTS,LISTS,LISTS.
-
- void list1 ()
- /* If Item (pos 0) is ALPHA, init List with count of 1. */
- /* 0 otherwise. */
- { attr [0].count = (attr [0].token == ALPHA);
- }
-
- void list ()
- /* If Item (pos 2) is ALPHA, increment List (pos 0) count. */
- { if (attr [2].token == ALPHA) { attr [0].count += 1; }
- }
-
- void start ()
- /* Check for empty (jumped) list, and print the count. */
- { if (was_jumped (attr [0])) { attr [0].count = 0; }
- printf ("%d\n", attr [0].count);
- }
-
-
- That's all I can think of to say about PGEN. I appreciate all
- feedback -- likes, dislikes, suggested improvements, etc. I would
- especially like to hear from people who have used it. What extra
- information would you have liked to see in the documentation file and
- in the source code comments? Also, I'm always interested in others'
- applications for PGEN. If you care to tell me, please write and let
- me know what you are programming.
-
- I'll conclude with the usage instructions for all the included
- programmer's utilities:
-
-
- 20
- Usage: PPRINT [-SPBN] [-L<lines>] [-C<cols>] [-T<tabs>] <filepatrn> ...
- Pretty prints text files with headings and optional line numbers.
- Options:
- S Send output to standard out. (default: standard printer)
- P Print in current type size. (def: modify for output width)
- B Print in boldface type. (def: use current printer setting)
- N Do not print line numbers. (def: print line numbers)
- L Set lines per page. (def: 66)
- C Set columns per line (incl. line numbers). (def: stdprn-96, stdout-79)
- T Set tab stop width. (def: 8)
-
-
- Usage: command | TAIL [-<nlines>]
- or: TAIL [-<nlines>] <input file>
 - Prints the last <nlines> lines (default 10) from the standard
 - output of the piped command (form 1 above) or from the given
 - input file (form 2 above).
- Use instead [+<nlines>] to print all but the first <nlines> lines.
- Ex: DIR | TAIL -5 Print the last 5 lines of the DIR listing.
- TAIL +20 TEST Print all but the first 20 lines of file TEST.
-
-
- Usage: command | TEE <file>
- Sends text output by the command to the file and to standard output.
- Does not work with interactive commands under DOS.
-
-
- Usage: CAT <file pattern> ...
- Prints in sequence (conCATenates) all requested files.
-
-
- Usage: CHOPCRLF <text file> ...
- Removes the terminating carriage return/line feed
- from the end of text files.
-
-
- Usage: CHOPLL <text file> ...
- Removes all blank lines from the end of text files.
-
-
- Usage: CMDLINE <file pattern> ...
- Prints command line arguments with pattern substitution.
-
-
- Usage: CP [-VQ] <source file> <dest name>
- or: CP [-VQ] <file pattern> ... <dest directory>
- Copies source file to dest file, or a set of files to a directory.
- Options:
- V Verbose. Report what is being copied.
- Q Query user before copying.
-
-
- Usage: MV [-VQ] <source file> <dest name>
- or: MV [-VQ] <file pattern> ... <dest directory>
- Moves source file to dest file, or a set of files to a directory.
- Options:
- V Verbose. Report what is being moved.
- Q Query user before moving.
-
- 21
- Usage: RM [-RVQ] <file pattern> ...
- Removes all files matching the given file patterns.
- Options:
- R Remove all directory contents recursively.
- V Verbose. Report what is being deleted.
- Q Query user before deleting.
-
-
- Usage: HSPLIT <num> <infile.ext>
- Breaks a long text file into <num> shorter text files where each output
- file is a horizontal piece of the input file, approx equal in size.
- Output is to files: infile.H00 .. infile.H<num-1>
- <num> must be in the range 2 - 100.
-
-
- Usage: VSPLIT [-T<tab>] [-L<len>] <infile.ext>
- Breaks a wide text file into several thinner text files where each
- output file is a vertical strip of the input file.
- Output is to files: infile.V00 .. infile.V<num-1>
- where <num> is the largest input line length divided by <len>.
- <tab> is tab size (2-32, default 8).
- <len> is max line length for output files (16-256, default 80).
- <len> must be a multiple of <tab>
-
-
- Usage: NCSL <C/C++ source file pattern> ...
- Reads the requested C/C++ source files and categorizes each line as
- Blank, Comment, 1 Brace, or Code. The name comes from the industry
- metric for source code size, "Non-Commentary Source Lines".
- Commentary:
- Blank -- Lines with only whitespace; no comments or text
- Comment -- Lines containing only // or /* */ comments and whitespace.
- Includes blank lines within /* */ comments.
- Non-commentary:
- 1 Brace -- Lines with only a single { or } brace, whitespace and comments.
- Code -- All other lines (should be program code).
-
-
- Usage: DIFF [-B] <old text> <new text>
- Compares the contents of two text files and prints their differences.
- If -B is specified, a tab character is prepended to the output lines.
-
-
- Usage: DIRDIFF dir1[\patrn] dir2
- Compares the two directories for files matching the pattern
- and prints the matching files unique to each directory.
- Matching subdirectories are also compared recursively.
- This is useful for keeping working and archive directories consistent.
- The first argument is a directory or filename pattern.
- Ex: C:\DIR\*.C, DIR (patrn=*.*), *.EXE (dir=.)
- The second argument is a directory name.
-
-
- DIFF and DIRDIFF are written in C++. The others are in C.
-
-