home *** CD-ROM | disk | FTP | other *** search
Text File | 1990-04-11 | 71.3 KB | 2,260 lines |
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- DECUS C LANGUAGE SYSTEM
-
-
- LEX
- A lexical Analyser Generator
-
-
- by
-
- Charles H. Forsyth
-
- University of Waterloo
- Waterloo, Ontario, N2L 3G1
- Canada
-
-
- Revised by
-
- Robert B. Denny & Martin Minow
-
-
- LEX transforms a regular-expression grammar and associated
- action routines into a C function and set of tables, yielding a
- table-driven lexical analyser which manages to be compact and
- rapid.
-
-
- DECUS Structured Languages SIG
-
- Version of 30-Oct-82
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- Copyright (C) 1978, Charles H. Forsyth
-
-
-
-
-
- Modifications Copyright (C) 1980, 1982, DECUS
-
- General permission to copy or modify, but not for
- profit, is hereby granted, provided that the above
- copyright notice is included and reference made to the
- fact that reproduction privileges were granted by DECUS.
-
- The information in this document is subject to change
- without notice and should not be construed as a
- commitment by Digital Equipment Corporation or by DECUS.
-
- Neither Digital Equipment Corporation, DECUS, nor the
- authors assume any responsibility for the use or
- reliability of this document or the described software.
-
- This software is made available without any support
- whatsoever. The person responsible for an
- implementation of this system should expect to have to
- understand and modify the source code if any problems
- are encountered in implementing or maintaining the
- compiler or its run-time library. The DECUS 'Structured
- Languages Special Interest Group' is the primary focus
- for communication among users of this software.
-
-
-
-
-
-
-
-
-
- UNIX is a trademark of Bell Telephone Laboratories. RSX,
- RSTS/E, RT-11 and VMS are trademarks of Digital Equipment
- Corporation.
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- A Lexical Analyser Generator Page 3
-
-
- 1.0 Introduction
- ____________
-
-
- A computer program often has an input stream which is composed
- of small elements, such as a stream of characters, and which it
- would like to convert to larger elements in order to process the
- data conveniently. A compiler is a common example of such a
- program: it reads a stream of characters forming a program, and
- would like to turn this sequence of characters into a sequence
- of larger items, namely identifiers, numbers, and operators, for
- parsing. In a compiler, the procedures which do this are
- collectively called the lexical analyser, or scanner; this
- _______ ________ _______
- terminology will be used more generally here.
-
- It may happen that the speed with which this transformation is
- done will noticeably affect the speed at which the rest of the
- program operates. It is certainly true that although such code
- is rarely difficult to write, writing and debugging it is both
- tedious, and time-consuming, and one typically would rather
- spend the time on the hard parts of the program. It is also
- true that while certain transformations are easily thought of,
- they are often hard to express succinctly in the usual
- general-purpose programming languages (eg, the description of a
- floating-point number).
-
- LEX is a program which tries to give a programmer a good deal of
- help in this, by writing large parts of the lexical analyser
- automatically, based on a description supplied by the programmer
- of the items to be recognised (which are known as tokens), as
- ______
- patterns of the more basic elements of the input stream. The
- LEX description is very much a special-purpose language for
- writing lexical analysers, and LEX is then simply a translator
- for this language. The LEX language is easy to write, and the
- resulting processor is compact and fast running.
-
- The purpose of a LEX program is to read an input stream, and
- recognise tokens. As the lexical analyser will usually exist as
- ______
- a subroutine in a larger set of programs, it will return a
- "token number", which indicates which token was found, and
- possibly a "token value", which provides more detailed
- information about the token (eg, a copy of the token itself, or
- an index into a symbol table). This need not be the only
- possibility; a LEX program is often a good description of the
- structure of the whole computation, and in such a case, the
- lexical analyser might choose to call other routines to perform
- the necessary actions whenever a particular token is recognised,
- without reference to its own caller.
-
-
-
-
-
-
-
-
-
-
-
-
- A Lexical Analyser Generator Page 4
-
-
- 2.0 The Lex Language
- ___ ___ ________
-
- LEX transforms a regular-expression grammar into a deterministic
- finite-state automaton that recognizes that grammar. Each rule
- of the grammar is associated with an action which is to be
- performed when that rule successfully matches some part of the
- input data.
-
- Because of the nature of regular expression grammars, certain
- language constructions cannot be recognized by LEX programs.
- Specifically, expressions with balanced parentheses cannot be
- recognized. This means that LEX cannot be used to recognize all
- Fortran keywords as some (DO, IF, and FORMAT, for example)
- require elaborate recognition to distinguish between ambiguous
- constructions.
-
-
-
- 2.1 Elementary Things
- __________ ______
-
-
- Strings, characters, sets of characters called character
- classes, and operators to form these into patterns, are the
- fundamental elements of the LEX language.
-
- A string is a sequence of characters, not including newline,
- ______
- enclosed in quotes, or apostrophes. Within a string, the
- following escape sequences
- (which are those of the C language) allow any 8-bit character to
- be represented, including the escape character, quotes, and
- newlines:
-
- \n NL (012)
- \r CR (015)
- \b BS (010)
- \t TAB (011)
- \" "
- \' '
- \c c
- \\ \
- \NNN (NNN)
-
- where NNN is a number in octal, and c is any printable
- _
- character. A string may be continued across a line by writing
- the escape character before the newline.
-
- Outside a string, a sequence of upper-case letters stands for a
- sequence of the equivalent lower-case letters, while a sequence
- __________
- of lower-case letters is taken as the name of a LEX expression,
- and handled specially, as described below. These conventions
- make the use of named expressions and the description of
- lower-case keywords (the usual case on Unix) fairly convenient.
- Keywords in case-independent languages, such as Fortran, require
- additional effort to match, as will be noted.
-
-
-
-
- A Lexical Analyser Generator Page 5
-
-
- An Ascii character other than one of
-
- () {} [ * | = ; % / \ ' " -
-
- may be used in LEX to stand for itself.
-
- A sequence of characters enclosed by brackets ('[' and ']')
- forms a character class, which stands for all those characters
- within the brackets. If a circumflex ('^') follows the opening
- bracket, then the class will instead stand for all those
- characters but those inside the brackets. The escapes used in
- ___
- strings may be used in character classes as well.
-
- Within a character class, the construction "A-B" (where "A" and
- "B" are arbitrary characters) stands for the range of characters
- between "A" and "B" inclusive.
-
- For example,
-
- "ABC" matches "abc"
-
- "[ABC]" matches "A" or "B" or "C"
-
- "[A-Za-z0-9]" matches all letters and digits
-
- Case-independent keyword recognition may be described by using
- auxiliary definitions to define expressions that match either
- case. For example,
-
- a = [Aa];
- b = [Bb];
- ...
- z = [Zz];
- %%
- d o Matches "DO", "do", "Do", or "dO"
-
-
-
-
- 2.2 Putting Things Together
- _______ ______ ________
-
-
- Several operators are provided to allow construction of a type
- of pattern called a regular expression. Such expressions can be
- _______ __________
- implemented as finite-state automata (without memory or stacks).
- A reference to an "occurrence" of a regular expression is
- generally taken to mean an occurrence of any string matched by
- that regular expression. The operators are presented in order
- of decreasing priority. In all cases, operators work on either
- strings or character classes, or on other regular expressions.
-
- Any string or character class forms a regular expression which
- matches whatever the string or character class stands for (as
- described above).
-
-
-
-
-
-
- A Lexical Analyser Generator Page 6
-
-
- The operator '*' applied following a regular expression forms a
- new regular expression which matches an arbitrary number (ie,
- zero or more) of adjacent occurrences of the first regular
- expression. The operation is often referred to as (Kleene)
- closure.
-
- The operation of concatenation of two regular expressions is
- _____________
- expressed simply by writing the regular expressions adjacent to
- each other. The resulting regular expression matches any
- occurrence of the first regular expression followed directly by
- an occurrence of the second regular expression.
-
- The operator '|', alternation, written between two regular
- ___________
- expressions forms a regular expression which matches an
- occurrence of the first regular expression or an occurrence of
- the second regular expression.
-
- Any regular expression may be enclosed in parentheses to cause
- the priority of operators to be overridden in the usual manner.
-
- A few examples should help to make all of this clear:
-
- "[0-9]*" matches a (possibly empty) sequence of
- digits.
-
- "[A-Za-z_$][A-Za-z0-9_$]*"
- matches a C identifier.
-
- "([A-Za-z_$]|[0-9])*"
- matches a C identifier, or a sequence of
- digits, or a sequence of letters and
- digits intermixed, or nothing.
-
-
-
-
- 2.3 The General Form of Lex Programs
- ___ _______ ____ __ ___ ________
-
-
- A LEX source input file consists of three sections: a section
- containing auxiliary definitions, a section containing
- _________ ___________
- translations, and a section containing programs.
- ____________ ________
-
- Throughout a LEX program, spaces, tabs, and newlines may be used
- freely, and PL/1-style comments:
-
- /* ... anything but '*/' ... */
-
- may be used, and are treated as a space.
-
- The auxiliary definition section must be present, separated from
- _________ __________
- following sections by the two-character sequence '%%', but may
- be empty. This section allows definition of named regular
-
-
-
-
- expressions, which provide the useful ability to use names of
- regular expressions in the translation section, in place of
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- A Lexical Analyser Generator Page 7
-
-
- common sub-expressions, or to make that section more readable.
-
- The translation section follows the '%%' sequence, and contains
- ___________
- regular expressions paired with actions which describe what the
- _______
- lexical analyser should do when it discovers an occurrence of a
- given regular expression in its input stream.
-
- The program section may be omitted; if it is present it must be
- _______
- separated from the translation section by the '%%' sequence. If
- present, it may contain anything in general, as it is simply
- tacked on to the end of the LEX output file.
-
- The style of this layout will be familiar to users of Yacc. As
- LEX is often used with that processor, it seemed reasonable to
- keep to a similar format.
-
-
-
- 2.4 Auxiliary Definitions
- _________ ___________
-
-
- Given the set of regular expressions forming a complete syntax,
- there are often common sub-expressions. LEX allows these to be
- named, defined but once, and referred to by name in any
- subsequent regular expression. Note that definition must
- precede use. A definition has the form:
-
- expression_name = regular_expression ;
-
- where a name is composed of a lower-case letter followed by a
- sequence string of letters and digits, and where an underscore
- is considered a letter. For example,
-
- digit = [0-9];
- letter = [a-zA-Z];
- name = letter(letter|digit)*;
-
- The semicolon is needed to resolve some ambiguities in the LEX
- syntax.
-
- Three auxiliary definitions have special meaning to LEX:
- "break", "illegal", and "ignore." They are all defined as
- character classes ("break = [,.?]", for example) and are used as
- follows:
-
- break An input token will always terminate if a member
- of the "break" class is scanned.
-
- illegal The "illegal" class allows simplification of
- error detection, as will be described in a later
- section. If this class is defined, and the
- lexical analyser stops at a character that
- "cannot" occur in its present context, the
- analyser will output a suitable error message
- and ignore the offender.
-
-
-
-
- A Lexical Analyser Generator Page 8
-
-
- ignore This class defines a set of characters that are
- ignored by the analyser's input routine.
-
-
-
-
- 2.5 Translations
- ____________
-
-
- One would like to provide a description of the action to be
- taken when a particular sequence of input characters has been
- matched by a given regular expression. The kind of action taken
- might vary considerably, depending upon the application. In a
- compiler, typical actions are: enter an identifer into a symbol
- table, read and store a string, or return a particular token to
- the parser. In text processing, one might wish to reproduce
- most of the input stream on an output stream unchanged, making
- substitutions when a particular sequence of characters is found.
- In general, it is hard to predict what form the action might
- take, and so, in LEX the nature of the action is left to the
- user, by allowing specification, for each regular expression of
- interest, C-language code to be executed when a string matching
- that expression is discovered by the driving program of the
- lexical analyser. An action, together with its regular
- expression, is called a translation, and has the format:
- ___________
-
- regular_expression { action }
-
- All of this may be spread across several lines. The action may
- be empty, but the braces must appear.
-
- Earlier, it was argued that most general-purpose languages are
- inappropriate for writing lexical analysers, and it is important
- to see that the subsequent use of such a language to form the
- actions is not a contradiction. Most languages are fairly good
- at expressing the actions described above (symbol table
- manipulation, writing character strings, and such). Leaving
- this part of the lexical analyser to those languages therefore
- not only makes sense, but also ensures that decisions by the
- writer of the lexical analyser generator will not unduly cramp
- the user's style. However, general-purpose languages do not as
- a rule provide inexpensive pattern matching facilities, or input
- description formats, appropriate for describing or structuring a
- lexical analyser.
-
- Allowing a user to provide his own code is not really enough, as
- he will need some help from LEX to obtain a copy of, or a
- pointer to, the current token, if nothing else. LEX provides a
- library of C functions which may be called to obtain controlled
- access to some of the data structures used by the driving
- programs of the lexical analyser. These are described in a
- later section.
-
-
-
-
-
-
-
-
-
- A Lexical Analyser Generator Page 9
-
-
- 2.5.1 Numbers and Values -
- _______ ___ ______
-
- Typically, a lexical analyser will return a value to its caller
- indicating which token has been found. Within an action, this
- is done by writing a C return statement, which returns the
- ______
- appropriate value:
-
- BEGIN {
- return(T_BEGIN);
- }
- name {
- lookup(token(NULL));
- return(T_NAME);
- }
- "/" {
- return('/');
- }
-
- Note that function lookup() is provided by the user program.
-
- In many cases, other information must be supplied to its caller
- by the scanner. When an identifier is recognised, for example,
- both a pointer to a symbol-table entry, and the token number
- T_NAME must be returned, yet the C return statement can return
- ______
- but a single value. Yacc has a similar problem, and so its
- lexical analyser sets an external word 'yylval' to the token
- value, while the token number is returned by the scanner. LEX
- uses the external 'yylval' (to be compatible), but, to make LEX
- programs more readable when used alone, the name 'lexval' is set
- by a #define statement to 'yylval'. For example,
-
- name {
- lexval = lookup(token(NULL));
- return(T_NAME);
- }
-
-
- Certain token numbers are treated specially; these are
- automatically defined as manifests (see section 3.2) by LEX, and
- all begin with the sequence 'LEX...' so as not to clash with the
- user's own names. There are two such tokens defined at present:
-
- LEXSKIP When returned by a user's action routine,
- LEXSKIP causes the lexical analyser to ignore
- the current token (ie, it does not inform the
- parser of its presence), and to look instead for
- a new token. This may be used when a comment
- sequence has been discovered, and discarded. It
- is also useful when the action routine completes
- processing of the token. See the discussion of
- the comment() library function for an example of
- its usage.
-
- LEXERR This is returned by the lexical analyser
- (function yylex()) when an unrecognizable input
-
-
-
-
-
- A Lexical Analyser Generator Page 10
-
-
- sequence has been detected. By default, LEXERR
- is 256. This the same as the yacc error value.
-
-
- To summarise, the token number is set by the action with a
- return statement, and the token value (ie, auxiliary
- ______
- information) is set by assigning this value to the external
- integer 'lexval'.
-
-
-
- 2.6 Declaration Sections
- ___________ ________
-
-
- Declarations in the language of the actions may be included in
- both the auxiliary definition section and in the translation
- section. In the former case, these declarations will be
- external to the lexical analyser, and in the latter case, they
- will be local to the lexical analyser (ie, static, or automatic
- storage). Declaration sections consist of a sequence of
- declarations surrounded by the special bracketing sequences '%{'
- and '%}' (as in Yacc). The characters within these brackets are
- copied unchanged into the appropriate spots in the lexical
- analyser program that LEX writes. The examples in appendix A
- suggest how these might be used.
-
-
-
- 3.0 Using Lex from C
- _____ ___ ____ _
-
-
- The present version of LEX is intended for use with C; and it
- is this usage which will be described here.
-
-
-
- 3.1 The Function yylex()
- ___ ________ _______
-
-
- The structure of LEX programs is influenced by what Yacc
- requires of its lexical analyser.
-
- To begin with, the lexical analyser must be named 'yylex', has
- no parameters, and is expected to return a token number, where
- that number is determined by Yacc. The token number for an
- Ascii character is its Ascii value (ie, its value as a C
- character constant). Named tokens, defined in yacc '%token'
- statements, have a number above 256, with the particular number
- accessible through a Yacc-produced #define of the given token
- name as its number. Yacc also allows 'yylex' to pass a value to
- the Yacc action routines, by assigning that value to the
- external 'yylval'.
-
- LEX thus provides a lexical analyser function named 'yylex',
- which interprets tables constructed by the LEX program returning
-
-
-
-
- A Lexical Analyser Generator Page 11
-
-
- the token number returned by the actions it performs. Values
- assigned to lexval are available in 'yylval', so that use with
- Yacc is straightforward.
-
- A value of zero is returned by 'yylex' at end-of-file, and in
- the absence of a return statement in an action, a non-zero value
- ______
- is returned. If computation is performed entirely by the
- lexical analyser, then a suitable main program would be
-
- main()
- {
- while (yylex()) ;
- }
-
-
-
-
- 3.2 Serial Re-Use of yylex()
- ______ ______ __ _______
-
-
- The yylex() function contains several variables which are
- statically initialized at compile time. Once yylex() sees an
- EOF (-1) input character, it will continue to return NULL. If
- yylex() is to be used inside a loop which processes multiple
- files, it must be re-initialized at the beginning of each new
- file with a call to the LEX library routine llinit(). For
- example (slightly extending the previous example):
-
- main()
- {
- getfilelist();
- for(file = first; file != last; file = next)
- {
- llinit();
- while (yylex());
- }
- printf("All files done\n");
- }
-
- The call to llinit() is unnecessary if yylex() is to process
- only one file, or is kept from seeing an EOF input character.
-
-
-
- 3.3 The Lex Table File
- ___ ___ _____ ____
-
-
- In the absence of instructions to the contrary (see below), LEX
- reads a given LEX language file, (from the standard input, if an
- input file has not been specified) and produces a C program file
- 'lextab.c' which largely consists of tables which are then
- interpreted by 'yylex()' (which is in the LEX library). The
- actions supplied by the user in each translation are combined
- with a switch statement into a single function, which is called
- ______
- by the table interpreter when a particular token is found. The
-
-
-
-
- A Lexical Analyser Generator Page 12
-
-
- contents of the program section of the LEX file are added at the
- end of the output file (lextab.c by default). Normally, LEX
- also inserts the lines
-
- #include <stdio.h>
- #include <lex.h>
-
- at the top of the file; this causes declarations required by
- the standard I/O library and by LEX to be included when the C
- program is compiled.
-
-
-
- 3.4 Analyzers Which Don't Use "Standard I/O"
- _________ _____ _____ ___ _________ ____
-
-
- With the current release, LEX supports the generation of
- analyzers which may be incorporated into programs which do not
- use the "standard I/O" library. By setting the "-s" switch, as
- shown below, the generation of the "#include <stdio.h>" line is
- supressed. All references to standard I/O specific files and
- stdio.h have been removed from the LEX library (described in a
- later section), with the exception of lexgetc(), lexerror(),
- mapch() and lexecho(), which are standard I/O dependent.
-
- The declaration of yylex()'s input file iov pointer "lexin" now
- resides in LEXGET.C (lexgetc()). The code which defaults lexin
- to stdin has been moved from yylex() to the table file. yylex()
- now calls the routine llstin(), which is generated into the
- table file. There are no longer any hardwired references to the
- variable "lextab", the default table name. Instead, LEX
- generates a call to lexswitch() in llstin(), which initializes
- yylex() to use the table whose name was given in a "-t" or "-e"
- option in LEX's command line. If neither was given, the default
- name "lextab" is used. Once the initial table has been set up,
- further automatic calls to lexswitch() are supressed, allowing
- the user to manually switch tables as before.
-
- In addition, If the "-s" switch is not given (i.e., normal use
- with standard I/O), llstin() defaults lexin to stdin. If "-s"
- is given, llstin() is generated to do the lexswitch() mentioned
- above only. In any case, yylex() contains no references to the
- standard I/O system.
-
- What all of this means is that under normal operation, you won't
- notice any change in LEX's characteristics. In addition, you
- may use the "-e" ("easy") switch, which will generate a C output
- file and LEX tables which (conveniently) have the same name as
- the input file, and everything will get set up automagically.
- If you specify the "-s" switch, the table file will contain no
- references to the standard I/O package, and you may use any of
- the lexlib routines except lexgetc(), lexerror(), mapch() or
- lexecho().
-
- Don't forget that you must supply your own startup routine
-
-
-
-
-
-
-
- A Lexical Analyser Generator Page 13
-
-
- "$$main" if you do not want the standard I/O library. With a
- bit of care in this regard, it will be possible to link your
- program with the C library without dragging in any I/O modules.
- This prevents your having to build another library in order to
- access non-I/O library functions. Just make the reference to
- the C library the last one given to the linker or taskbuilder so
- that only those routines which have not already been found are
- pulled from CLIB.
-
-
- NOTE
-
- Programs that use LEX-generated analyzers and do not use
- the standard I/O package must supply their own lexgetc()
- and lexerror() routines. Failure to do so will result
- in undefined globals.
-
-
-
-
-
- 3.5 Operating LEX
- _________ ___
-
-
- LEX normally reads the grammar from the standard input, writing
- the C program to the file 'lextab.c'. It may be further
- controlled by using the following flags upon invocation:
-
- -i filename The grammar is read from 'filename'.
-
- -o filename The analyser is written to 'filename'.
-
- -t tablename The default finite-state automaton is
- named lextab (and it is, by default,
- written to file 'lextab.c'). The -t
- switch causes the internal tables to be
- named 'tablename' and, if the -o switch
- is not given, written to file
- 'tablename.c'. This is necessary if the
- processor-switching capabilities
- described in a later section are to be
- used.
-
- -e name "Easy" command line. "-e name" is
- equivalent to typing
-
- -i name.LXI -o name.C -t name
-
- Do not include device names or file
- extensions on the "easy" command line.
-
- -v [filename] Internal state information is written to
- 'filename.' If not present, state
- information is written to file
- 'lex.out.'
-
-
-
-
-
-
-
- A Lexical Analyser Generator Page 14
-
-
- -d Enable various debugging printouts.
-
- -s Generate analyzer without references to
- standard I/O
-
-
- The command line for compilation of the table file should
- contain no surprises:
-
- cc -c -O lextab.c (on Unix)
-
- xcc lextab -a (on Dec operating systems)
-
- but when one is producing the running program, one must be
- careful to include the necessary libraries. On Unix, the proper
- sequence is:
-
- cc userprog.o lextab.o -ll -lS
-
- The '-ll' causes the LEX library (described below) to be
- searched, and the '-lS' causes the Standard I/O library to be
- used; both libraries are required. If Yacc is used as well,
- the library '-ly' should be included before the '-ll'. The
- ______
- actual order and content of the rest of the command line is
- determined by the user's own requirements.
-
- If using the Decus C compiler, the lexical analyser built by LEX
- is linked with c:lexlib.
-
- The complete process (assuming the Decus compiler running on
- RSTS/E in RT11 mode) is thus:
-
- mcr lex -i grammar.lxi -o grammar.c ! Build analyser
- cc grammar ! Compile the
- as grammar ! grammar table
- link out=in,grammar,c:lexlib,c:suport,c:clib/b:2000
-
-
-
-
- 4.0 The Lex Library
- ___ ___ _______
-
-
- All programs using grammars generated by LEX must be linked
- together with the LEX library. On Unix, this is '/lib/libl.a'
- (or '-ll' on the cc command line) and on DEC operating systems,
- C:LEXLIB (LB:[1,1]LEX.OLB for RSX). It contains routines which
- are either essential or merely useful to users of LEX. The
- essential routines include a routine to obtain a copy of the
- current token, and a routine to switch to a different set of
- scanning tables. Routines of the second, useful, class perform
- functions which might well be written by the user himself, but
- are there to save him the bother; including a routine to
- process various forms of comments and a routine to transform
- numbers written in arbitrary bases. Both sets of routines are
-
-
-
-
-
-
- A Lexical Analyser Generator Page 15
-
-
- expected to grow as LEX sees use.
-
- Those functions which produce diagnostics do so by calling
- lexerror(), which is called as
-
- lexerror(string, arg1, ..., argN)
-
- and is expected to write its arguments (likely using the "remote
- format" facility of the fprintf() function), followed by a
- newline, on some output stream. A Lexerror() function is
- included in the LEX library, but a user is free to include his
- own. The routine in the LEX library is standard I/O specific.
-
-
- NOTE
-
- The VAX/VMS native C library does not support remote
- formats. The Lexerror function in the LEX library
- conditionally compiles to support a call to lexerror()
- with only an error message string. Remote formats are
- supported under Decus C. Learn to use them, they are
- very nice!
-
-
-
-
-
- 4.0.1 Comment -- skip over a comment -
- _______ __ ____ ____ _ _______
-
- comment(delim)
- char delim[];
-
- Comment() may be called by a translation when the sequence of
- characters which mark the start of a comment in the given syntax
- has been recognised by LEX. It takes a string which gives the
- sequence of characters which mark the end of a comment, and
- skips over characters in the input stream until this sequence is
- found. Newlines found while skipping characters cause the
- external 'yyline' to be incremented; an unexpected end-of-file
- produces a suitable diagnostic. Thus, 'comment("*/")' matches
- C-style comments, and 'comment("\n")' matches as-style comments.
- There are other methods of handling comments in LEX; the
- comment() function is usually the best with regard to both space
- and time.
-
-
-
- 4.0.2 Gettoken -- obtain a copy of token -
- ________ __ ______ _ ____ __ _____
-
- gettoken(buf, sizeof(buf))
- char buf[];
-
- Gettoken() takes the address of a character buffer, and its size
- in bytes, and copies the token most recently matched by LEX into
- the buffer. A null byte is added to mark the end of the token
-
-
-
-
-
-
- A Lexical Analyser Generator Page 16
-
-
- in the buffer, but, as null bytes are legitimate characters to
- LEX, the true length of the token is returned by gettoken().
-
- For example, the following function calls lexlength() to obtain
- the length of a token. It then calls the storage allocator to
- allocate sufficient storage for the token and copies the token
- into the allocated area.
-
- char *
- save()
- /*
- * Save current token, return a pointer to it
- */
- {
- register char *tbuffer;
- register int len;
- register char *tend;
- extern char *token();
- extern char *copy();
-
- len = lexlength() + 1;
- if (tbuffer = malloc(len)) == NULL)
- error("No room for token");
- gettoken(tbuffer, len);
- return(tbuffer);
- }
-
-
-
-
- 4.0.3 Integ -- long integer, any base -
- _____ __ ____ ________ ___ ____
-
- long
- integ(nptr, base)
- char *nptr;
-
- Integ() converts the Ascii string at 'nptr' into a long integer,
- which it returns. Conversion stops at the first non-digit,
- where the digits are taken from the class "[0-9a-zA-Z]" as
- limited by the given 'base'. Integ() does not understand signs,
- nor are blanks or tabs allowed in the string.
-
-
-
- 4.0.4 Lexchar -- steal character -
- _______ __ _____ _________
-
- lexchar()
-
- Lexchar() returns the next character from the LEX input stream.
- (This means that LEX will no longer see it.) LEX uses a
- look-ahead buffer to handle complex languages, and this function
- takes this into account.
-
-
-
-
-
-
-
-
-
- A Lexical Analyser Generator Page 17
-
-
- 4.0.5 Lexecho -- write token to a file (STDIO ONLY) -
- _______ __ _____ _____ __ _ ____ ______ _____
-
- lexecho(fp);
- FILE *fp;
-
- Lexecho() may be called by a LEX action routine to write the
- current token to a specified file.
-
-
- NOTE
-
- Programs using analyzers built with LEX's "-s" switch
- must supply their own lexecho() function if needed.
-
-
-
-
-
- 4.0.6 Lexgetc -- supply characters to yylex (STDIO ONLY) -
- _______ __ ______ __________ __ _____ ______ _____
-
- lexgetc()
-
- Lexgetc() is called by the lexical analyser to obtain characters
- from its input stream. The version in the library is dependent
- on the standard I/O package, and is:
-
- FILE *lexin; /* Declare iov address locally */
- lexgetc()
- {
- return(getc(lexin));
- }
-
- If lexin is NULL when yylex() is entered, it will be assigned to
- stdin. This is done by yylex() calling the function llstin(),
- which is generated in the table file. Unless the "-s" switch is
- given to LEX, the llstin() function assigns lexin to stdin if
- lexin is NULL. If the "-s" switch was given, the llstin()
- routine is a no-op. The user may provide his own version of
- lexgetc() to pre-process the data to the lexical analyser. An
- example of this is shown in the appendix.
-
-
- NOTE
-
- Programs using analyzers built with LEX's "-s" switch
- must supply their own lexgetc() function, and "lexin"
- has no meaning in this context.
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- A Lexical Analyser Generator Page 18
-
-
- 4.0.7 Lexlength -- return length of a token. -
- _________ __ ______ ______ __ _ ______
-
- lexlength();
-
- Lexlength() may be called by a LEX action routine to obtain the
- length of the current token in bytes. An example of this is
- shown in the description of gettoken().
-
-
-
- 4.0.8 Lexpeek -- examine character -
- _______ __ _______ _________
-
- lexpeek()
-
- Lexpeek() performs a function similar to that of Lexchar(), but
- does not have the side-effect of removing the character from
- LEX's view.
-
-
-
- 4.0.9 Lexswitch -- switch scanning tables -
- _________ __ ______ ________ ______
-
- struct lextab *
- lexswitch(newtb)
- struct lextab *newtb;
-
- Lexswitch() is called to cause LEX to use a different scanning
- table; it returns a pointer to the one previously in use. This
- facility is useful if certain objects of the language (eg,
- strings in C) have a fairly complicated structure of their own
- which cannot be handled within the translation section of the
- LEX description of the larger language.
-
-
-
- 4.0.10 Llinit -- Reinitialize yylex() -
- ______ __ ____________ _______
-
- llinit()
-
- Llinit() is a function which resets the state of yylex() to it's
- cold-start condition. Several of yylex()'s variables are
- initialized at compile time, and must be reinitialized if it is
- to be serially re-used. An example of this is where yylex() is
- repeatedly called inside a loop which processes multiple input
- files. Each time a new file is started, llinit() must be called
- before the first call to yylex() for the new file.
-
-
-
-
-
-
-
-
-
-
-
-
-
- A Lexical Analyser Generator Page 19
-
-
- 4.0.11 Mapch -- Handle C escapes within strings (STDIO ONLY) -
- _____ __ ______ _ _______ ______ _______ ______ _____
-
- int mapch(delim, esc)
- char delim;
- char esc;
-
- Mapch() is a function which handles C "escape" characters such
- as "\n" and "\nnn". It will scan off the entire escape sequence
- and return the equivalent ASCII code as an integer. It is meant
- for use with YACC while scanning quoted strings and character
- constants.
-
- If it encounters EOF while scanning, it calls lexerror() to
- print an error message warning of "Unterminated string". If a
- normal character is read, it returns the ASCII value. If
- "delim" (usually " or ') is read, it returns EOF. If a newline
- (ASCII linefeed) is read, it increments the global "yyline" and
- calls itself recursively for the next line of input. It may use
- _____ ______ ___________
- the ungetc() function to back up in the input stream.
-
-
- NOTE
-
- This routine is very application-specific for use by LEX
- and YACC when they are working together. You should
- read the code in MAPCH.C before using this function.
-
-
-
-
-
- 4.0.12 Token -- get pointer to token -
- _____ __ ___ _______ __ _____
-
- char *
- token(end_pointer)
- char **end_pointer;
-
- Token() locates the first byte of the current token and returns
- its address. It takes an argument which is either NULL or a
- pointer to a character pointer; if the latter, that pointer is
- set to point to the byte after the last byte of the current
- _____
- token. Token() is slightly faster, and more convenient than
- gettoken() for those cases where the token is only one or two
- bytes long.
-
-
-
-
- 5.0 Error Detection and Recovery
- _____ _________ ___ ________
-
-
- If a character is detected in the input stream which cannot be
- added to the last-matched string, and which cannot start a
- string, then that character is considered illegal by LEX. LEX
-
-
-
-
- may be instructed to return a special 'error' token, or to write
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- A Lexical Analyser Generator Page 20
-
-
- a diagnostic with lexerror(). At present, the former is the
- default action.
-
- The token LEXERR is a special value which is recognised by Yacc,
- and causes it to start its own error recovery. It is defined by
- the header file lex.h for use by other programs.
-
- Often, it makes more sense to simply type a suitable diagnostic,
- and continue by ignoring the offending character. It is fairly
- easy to cause LEX to do this, by including the auxiliary
- definition:
-
- illegal = [\0-\377];
-
- which defines a character class "illegal" which is handled
- specially by LEX. If the character that is causing the trouble
- is a member of that character class (and in the example, all
- characters are), then LEX will write a diagnostic, and ignore
- it; otherwise, it will return the special token LEXERR
-
- More comprehensive techniques may be added as they become
- apparent.
-
-
-
- 6.0 Ambiguity and Look-ahead
- _________ ___ __________
-
- Many computer languages have ambiguous grammars in that an input
- token may represent more than one logical entity. This section
- discusses the way in which grammars built by LEX resolve
- ambiguous input, as well as a way for the grammar to assign
- unique meaning to a token by looking ahead in the input stream.
-
-
-
- 6.1 Resolving Ambiguities
- _________ ___________
-
-
- A LEX program may be ambiguous, in the sense that a particular
- input string or strings might be matched by the regular
- expression of more than one translation. Consider,
-
- [a-z] { putchar(*token(NULL)); }
- aaa* { printf("abc"); }
-
- in which the string 'aa' is matched by both regular expressions
- (twice by the first, and once by the second). Also, the string
- 'aaaaaa' may be matched in many different ways. LEX has to
- decide somehow which actions should be performed.
- (Alternatively, it could produce a diagnostic, and give up. As
- it happens, LEX never does this.)
-
- Consider a second example,
-
- letter = [a-z];
-
-
-
-
-
-
- A Lexical Analyser Generator Page 21
-
-
- %%
- A(letter)* { return(1); }
- AB(letter)* { return(2); }
-
- which attempts to distinguish sequences of letters that begin
- with 'a' from similar sequences that begin with 'ab'. These two
- examples illustrate two different kinds of ambiguity, and the
- following indicates how LEX resolves these.
-
- In the first example, it seems likely that the intent was to
- have both 'aa' and 'aaaaaa' perform the second action, while all
- single letters 'a' cause the first action to be performed. LEX
- does this by ensuring that the longest possible part of the
- input stream will be used to determine the match. Thus,
-
- < { return(LESS); }
- <= { return(LESSEQ); }
-
- or
-
- digit(digit)* { return(NUMBER); }
- letter(letter|digit)* { return(NAME); }
-
- would work as one might expect.
-
- In the second example, the longest-string need not work. On the
- string "abb9", either action could apply, and so another rule
- must be followed. This states that if, after the longest-string
- rule has been applied, there remains an ambiguity, then the
- action which appears first in the LEX program file is to be
- performed. As the second example is written, the second action
- will never be performed. It would have been written as:
-
- letter = [a-z];
- %%
- AB(letter)* { return(1); }
- A(letter)* { return(2); }
-
- The two rules together completely determine a string.
-
- At present, LEX produces no diagnostic in either case; it
- merely applies the rules and proceeds. In the case where
- priority is given to the first-appearing rule, it might be a
- good idea to produce a diagnostic.
-
-
-
- 6.2 Look-ahead
- __________
-
-
- Some facility for looking ahead in the input stream is sometimes
- required. (This facility might also be regarded as a way for
- the programmer to more closely control LEX's ambiguity
- resolution process.) For example, in C, a name followed by "("
- is to be contextually declared as an external function if it is
-
-
-
-
-
-
-
- A Lexical Analyser Generator Page 22
-
-
- otherwise undefined.
-
- In Pascal, look-ahead is required to determine that
-
- 123..1234
-
- is an integer 123, followed by the subrange symbol "..",
- followed by the integer 1234, and not simply two real numbers
- run together.
-
- In both of these cases, the desire is to look ahead in the input
- stream far enough to be able to make a decision, but without
- losing tokens in the process.
-
- A special form of regular expression is used to indicate
- look-ahead:
-
- re1 '/' re2 '{' action '}'
-
- where 're1' and 're2' are regular expressions. The slash is
- treated as concatenation for the purposes of matching incoming
- characters; thus both 're1' and 're2' must match adjacently for
- the action to be performed. 'Re1' indicates that part of the
- input string which is the token to be returned, while 're2'
- indicates the context. The characters matched by 're2' will be
- re-read at the next call to yylex(), and broken into tokens.
-
- Note that you cannot write:
-
- name = re1 / re2;
-
- The look-ahead operator must be part of the rule. It is not
- valid in definitions.
-
- In the first example, the look-ahead operator would be used as:
-
- name / "(" {
- if (name undefined)
- declare name a global function;
- }
- name { /* usual processing for identifiers */
- }
-
- In the second example, the range construction would be parsed as
- follows:
-
- digit = [0-9];
- int = digit(digit)*;
- %%
- int / ".." int { /* Start of a range */
- ".." int { /* End of a range */
-
-
- Note that right-context is not sufficient to handle certain
- types of ambiguity, as is found in several places in the Fortran
-
-
-
-
-
-
-
-
- A Lexical Analyser Generator Page 23
-
-
- language. For example,
-
- do i = 1 Is an assignment statement
- do i = 1, 4 Is a DO statement
-
- It is not sufficient to use right-context scanning to look for
- the comma, as it may occur within a parenthesized
- sub-expression:
-
- do i = j(k,l) Is an assignment statement
-
- In Fortran, similar problems exist for IF and FORMAT statements,
- as well as counted (Hollerith) string constants. All of these
- require a more powerful grammar than is possible with LEX
- regular-expressions.
-
-
-
- 7.0 Multiple Scanning Tables; Processor Switching
- ________ ________ _______ _________ _________
-
-
- Even a fairly simple syntax may be difficult, or impossible, to
- describe and process with a single set of translations. An
- example of this may be found in C, where strings, which are part
- of the language, have quite a different structure, and in order
- to process them, either a function must be called which reads
- and parses the input stream for itself, or some mechanism within
- LEX must be invoked to cause a (usually massive) change of
- state.
-
- LEX does provide such a facility, which is known, after AED, as
- 'processor switching'. Yylex() locates its tables through a
- pointer; if one simply changes the pointer to point at a new
- set of tables, one will have effected the required change of
- state. The LEX library function lexswitch(), which is described
- elsewhere in this guide, arranges to do this; it also returns
- the old value of the pointer so that it may be restored by a
- later call to Lexswitch. Thus, scanning environments may be
- stacked, or not, as the user requires.
-
-
-
- 7.1 Creation of a Processor
- ________ __ _ _________
-
-
- It should be clear that if all the tables produced by LEX from a
- user's translation file have the same name, someone (the loader)
- is bound to object. Some method must be provided to change the
- name of the table.
-
- This is done by an option flag to the LEX command:
-
- -t name
-
- will cause the scanning table to be declared as
-
-
-
-
-
-
- A Lexical Analyser Generator Page 24
-
-
- struct lextab name;
-
- so that it may be passed to LEXswitch:
-
- lexswitch(&name);
-
- LEX also writes the program file to the file "name.c" rather
- than to "lextab.c."
-
-
- NOTE
-
- If you use the "easy" command line ("-e name") when
- running LEX, the output file and table names will
- correspond nicely. Re-read the section on operating LEX
- for more details.
-
-
-
-
-
- 8.0 Conclusion
- __________
-
-
- LEX seems to handle most lexical analysis tasks easily. Indeed,
- LEX may be more generally used to write commands of a
- text-processing nature; an example of such usage may be found
- in an appendix. LEX programs are far easier to write than the
- equivalent C programs, and generally consume less space
- (although there is an initial overhead for the more general
- table-interpreter program). The encoding suggested in [4]
- achieves a reasonable compromise between table size, and
- scanning speed. Certainly lexical analysers are less tedious
- and time-consuming to write.
-
- It is expected that most change in the future will be through
- additions to the LEX library. The LEX language may change
- slightly to accomodate common kinds of processing (eg, break
- characters), or to extend its range of application. Neither
- kind of change should affect existing LEX programs.
-
- LEX produces tables and programs for the C language. The tables
- are in a very simple (and stylised) format, and when LEX copies
- the action routines or the program section, the code might as
- well be Fortran for all it cares. One could write Unix filters
- to translate the very simple C format tables into other
- languages, allowing LEX to be used with a larger number of
- languages, with little extra development cost. This seems a
- likely future addition.
-
- Because of the look-ahead necessary to implement the "longest
- string match" rule, LEX is unsuitable for interactive programs
- whose overall structure is:
-
- for (;;) {
-
-
-
-
-
-
-
- A Lexical Analyser Generator Page 25
-
-
- prompt_user();
- get_input();
- process();
- print_output();
- }
-
- If these are rewritten as LEX-generated grammars, the user will
- be confused by the fact the second input datum must be entered
- before the first is processed. It is possible to solve this
- dilemna by rewriting function lexgetc() to return an
- "end-of-line" character until processing is complete for that
- line. An example is shown in the Appendix.
-
-
-
- 9.0 Acknowledgements
- ________________
-
-
- LEX is based on a processor of the same name at Bell
- Laboratories, which also runs under Unix [3], and, more
- distantly, on AED-0 [1]. This version of LEX was based on the
- description and suggestions of [4], although the implementation
- differs significantly in a number of ways.
-
-
-
- 10.0 References
- __________
-
-
- 1. Johnson, W.L., et. al., "Automatic generation of
- efficient lexical analysers using finite state
- techniques", CACM Vol. 11, No. 12, pp. 805-813, 1968.
-
- 2. Johnson, S.C., "Yacc -- Yet Another Compiler-Compiler",
- CSTR-32, Bell Telephone Laboratories, Murray Hill, New
- Jersey, 1974.
-
- 3. Lesk, M.E., "Lex - a lexical analyser generator",
- CSTR-39, Bell Telephone Laboratories, Murray Hill, New
- Jersey, 1975.
-
- 4. Aho, A.V., Ullman, J.D., Principles of Compiler Design,
- __________ __ ________ _______
- Addison-Wesley, Don Mills, Ontario, 1977.
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- APPENDIX A
-
- LEX SOURCE GRAMMAR
-
-
-
- The following is a grammar of LEX programs which generally
- follows Bacus-Naur conventions. In the rules, "||" stands for
- alternation (choose one or the other). Other graphic text
- stands for itself. Several grammar elements have special
- meaning:
-
- <anything> Any text not including the following
- grammar element (either a literal or
- end-of-file).
-
- <nothing> Nothing -- used for optional rule
- elements.
-
- <name> A variable name.
-
- <char_class> A character class specifier.
-
- <string> A string (text inclosed in '"').
-
- <EOF> The end of the input file.
-
- This grammar was abstracted from the Yacc grammar used to
- describe LEX.
-
- program :== aux_section trans_section
-
- aux_section ::= auxiliaries %%
- || %%
-
- auxiliaries ::= auxiliaries aux_def
- || aux_def
-
- aux_def ::= name_def = reg_exp ;
- || %{ <anything> %}
-
- name_def ::= <name>
-
- reg_exp ::= <char_class>
- || <string>
- || <name>
-
-
-
-
-
-
-
-
- A Lexical Analyser Generator Page A-2
- LEX Source Grammar
-
-
- || reg_exp *
- || reg_exp | reg_exp
- || reg_exp reg_exp
- || ( reg_exp )
-
- trans_section ::= translations
- || <nothing>
-
- translations ::= translations translation
- || translation
-
- translation ::= pattern action
- || %{ <anything> %}
- || %% <anything> <EOF>
-
- pattern ::= reg_exp / reg_exp
- || reg_exp
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- APPENDIX B
-
- SOME SMALL EXAMPLES
-
-
-
-
-
- The following example illustrates the use of the look-ahead
- operator, and various other of the nuances of using LEX.
-
-
-
- B.1 A Complete Command
- _ ________ _______
-
-
- The C programming language has had two different ways of writing
- its assignment operators. The original method was to write a
- binary operator immediately following the ordinary assignment
- operator, forming a compound operator. Thus 'a =+ b' caused the
- value of 'a+b' to be assigned to 'a'. Similarly,
-
- =- =/ =% =* =<< =>> =| =& =^
-
- were written for the assignment operators corresponding to
- subtraction, division, modulus, multiplication, left shift,
- right shift, logical OR, logical AND, and exclusive OR. In the
- current version of the language, the binary operator is written
- to the left of the assignment operator, to remove potential
- ambiguity.
-
- The LEX program "ctoc" is a filter which converts programs
- written in the older style into programs written in the newer
- style. It uses the look-ahead operator, and the various
- dis-ambiguating rules to ensure that sequences like
-
- a==-1 a=++b
-
- remain unchanged.
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- A Lexical Analyser Generator Page B-2
- Some Small Examples
-
-
- /*
- * ctoc.lxi -- Convert old C operators to new C form
- *
- * Adapted from example in C. Forsythe's LEX manual.
- *
- * NOTE:
- * Forsythe's program put an entire comment into the token
- * buffer. Either define a huge token buffer for my typical
- * monster comments, or filter text within comments as if
- * it were real C code. This is what I did. So =+ inside
- * a comment will get changed to +=, etc. Note tnat you
- * may use the commen() function in LEXLIB if you want the
- * comments eaten. I wanted 'em in the output.
- * by
- * Bob Denny
- * 31-Feb-81
- */
- %{
- char tbuf[80]; /* Token buffer */
- main()
- {
- while (yylex())
- ;
- }
- %}
- any = [\0-\177];
- nesc = [^\\];
- nescquote = [^\\"];
- nescapost = [^\\'];
- schar = "\\" any | nescquote;
- cchar = "\\" any | nescapost;
- string = '"' schar* '"';
- charcon = "'" cchar* "'";
- %%
- "=" ( << | >> | "*" | + | - | "/" | "%" | "&" | "|" | "^" )
- {
- gettoken(tbuf, sizeof tbuf);
- printf("%s=",tbuf+1);
- }
- /*
- * The following will overflow the token buffer on any but a
- * small comment:
- */
- /*********
- "/*" ([^*] | "*"[^/])* "*/"
- {
- lexecho(stdout);
- }
- **********/
- [<=>!]"=" | "="[<>]
- {
- lexecho(stdout);
- }
- "=" / ( ++ | -- )
-
-
-
-
-
-
-
-
- A Lexical Analyser Generator Page B-3
- Some Small Examples
-
-
- {
- lexecho(stdout);
- }
- charcon
- {
- lexecho(stdout);
- }
- string
- {
- lexecho(stdout);
- }
- [\0-\377]
- {
- lexecho(stdout);
- }
-
-
- Assuming the Decus compiler running on RSTS/E in RT11 mode, the
- above program would be built and executed as follows:
-
- mcr lex -i ctoc.lxi -o ctoc.c
- cc ctoc/v
- as ctoc/d
- link ctoc=ctoc,c:lexlib,c:suport,c:clib/b:2000
-
- mcr ctoc <old.c >new.c
-
-
-
-
- B.2 Interactive Lexical Analysis
- ___________ _______ ________
-
- The following program reads words from the terminal, counting
- each as they are entered. The interaction with the operator is
- "natural" in the sense that processing for one line is complete
- before the next line is input. To implement this program, it
- was necessary to include a special version of lexgetc() which
- returns <NULL> if the current line has been completely
- transmitted to the parser. Because the parser must still have
- some look-ahead context, it will return the "end-of-line" token
- twice at the beginning of processing. This required some
- _____
- additional tests in the main program.
-
- /*
- * Count words -- interactively
- */
- white = [\n\t ]; /* End of a word */
- eol = [\0]; /* End of input line */
- any = [!-~]; /* All printing char's */
- illegal = [\0-\377]; /* Skip over junk */
- %{
- char line[133];
- char *linep = &line;
- int iseof = 0;
-
-
-
-
-
-
- A Lexical Analyser Generator Page B-4
- Some Small Examples
-
-
- int wordct = 0;
- #define T_EOL 1
- main()
- {
- register int i;
- while ((i = yylex()) != 0) {
- /*
- * If the "end-of-line" token is
- * returned AND we're really at
- * the end of a line, read the
- * next line. Note that T_EOL is
- * returned twice when the program
- * starts because of the nature of
- * the look-ahead algorithms.
- */
- if (i == T_EOL && !is_eof
- && *linep == 0) {
- printf("* ");
- fflush(stdout);
- getline();
- }
- }
- printf("%d words\n", wordct);
- }
- %}
- %%
- any(any)* {
- /*
- * Write each word on a
- * seperate output line.
- */
- lexecho(stdout);
- printf("\n");
- wordct++;
- return(LEXSKIP);
- }
- eol {
- return(T_EOL);
- }
- white(white)* {
- return(LEXSKIP);
- }
- %%
- getline()
- /*
- * Read a line for lexgetc()
- */
- {
- is_eof = (fgets(line, sizeof line, stdin)
- == NULL);
- linep = &line;
- }
- lexgetc()
- /*
-
-
-
-
-
-
-
-
- A Lexical Analyser Generator Page B-5
- Some Small Examples
-
-
- * Homemade lexgetc -- return zero while at the
- * end of an input line or EOF at end of file. If
- * more on this line, return the next byte.
- */
- {
- return( (is_eof) ? EOF
- : (*linep == 0) ? 0
- : *linep++);
- }
- EOF
- : (*linep == 0) ? 0
- : *linep++);