home *** CD-ROM | disk | FTP | other *** search
-
-
- FLEX(1) UNIX Programmer's Manual FLEX(1)
-
-
- NAME
- flex - fast lexical analyzer generator
-
- SYNOPSIS
- flex [-bcdfinpstvFILT8 -C[efmF] -Sskeleton] [_f_i_l_e_n_a_m_e ...]
-
- DESCRIPTION
- _f_l_e_x is a tool for generating _s_c_a_n_n_e_r_s: programs which
- recognized lexical patterns in text. _f_l_e_x reads the given
- input files, or its standard input if no file names are
- given, for a description of a scanner to generate. The
- description is in the form of pairs of regular expressions
- and C code, called _r_u_l_e_s. _f_l_e_x generates as output a C
- source file, lex.yy.c, which defines a routine yylex(). This
- file is compiled and linked with the -lfl library to produce
- an executable. When the executable is run, it analyzes its
- input for occurrences of the regular expressions. Whenever
- it finds one, it executes the corresponding C code.
-
- For full documentation, see flexdoc(1). This manual entry is
- intended for use as a quick reference.
-
- OPTIONS
- _f_l_e_x has the following options:
-
- -b Generate backtracking information to _l_e_x._b_a_c_k_t_r_a_c_k.
- This is a list of scanner states which require back-
- tracking and the input characters on which they do so.
- By adding rules one can remove backtracking states. If
- all backtracking states are eliminated and -f or -F is
- used, the generated scanner will run faster.
-
- -c is a do-nothing, deprecated option included for POSIX
- compliance.
-
- NOTE: in previous releases of _f_l_e_x -c specified table-
- compression options. This functionality is now given
- by the -C flag. To ease the the impact of this change,
- when _f_l_e_x encounters -c, it currently issues a warning
- message and assumes that -C was desired instead. In
- the future this "promotion" of -c to -C will go away in
- the name of full POSIX compliance (unless the POSIX
- meaning is removed first).
-
- -d makes the generated scanner run in _d_e_b_u_g mode. When-
- ever a pattern is recognized and the global
- yy_flex_debug is non-zero (which is the default), the
- scanner will write to _s_t_d_e_r_r a line of the form:
-
- --accepting rule at line 53 ("the matched text")
-
- The line number refers to the location of the rule in
- the file defining the scanner (i.e., the file that was
-
-
- Printed 4/3/91 26 May 1990 1
-
-
-
-
- FLEX(1) UNIX Programmer's Manual FLEX(1)
-
-
- fed to flex). Messages are also generated when the
- scanner backtracks, accepts the default rule, reaches
- the end of its input buffer (or encounters a NUL; the
- two look the same as far as the scanner's concerned),
- or reaches an end-of-file.
-
- -f specifies (take your pick) _f_u_l_l _t_a_b_l_e or _f_a_s_t _s_c_a_n_n_e_r.
- No table compression is done. The result is large but
- fast. This option is equivalent to -Cf (see below).
-
- -i instructs _f_l_e_x to generate a _c_a_s_e-_i_n_s_e_n_s_i_t_i_v_e scanner.
- The case of letters given in the _f_l_e_x input patterns
- will be ignored, and tokens in the input will be
- matched regardless of case. The matched text given in
- _y_y_t_e_x_t will have the preserved case (i.e., it will not
- be folded).
-
- -n is another do-nothing, deprecated option included only
- for POSIX compliance.
-
- -p generates a performance report to stderr. The report
- consists of comments regarding features of the _f_l_e_x
- input file which will cause a loss of performance in
- the resulting scanner.
-
- -s causes the _d_e_f_a_u_l_t _r_u_l_e (that unmatched scanner input
- is echoed to _s_t_d_o_u_t) to be suppressed. If the scanner
- encounters input that does not match any of its rules,
- it aborts with an error.
-
- -t instructs _f_l_e_x to write the scanner it generates to
- standard output instead of lex.yy.c.
-
- -v specifies that _f_l_e_x should write to _s_t_d_e_r_r a summary of
- statistics regarding the scanner it generates.
-
- -F specifies that the _f_a_s_t scanner table representation
- should be used. This representation is about as fast
- as the full table representation (-_f), and for some
- sets of patterns will be considerably smaller (and for
- others, larger). See flexdoc(1) for details.
-
- This option is equivalent to -CF (see below).
-
- -I instructs _f_l_e_x to generate an _i_n_t_e_r_a_c_t_i_v_e scanner, that
- is, a scanner which stops immediately rather than look-
- ing ahead if it knows that the currently scanned text
- cannot be part of a longer rule's match. Again, see
- flexdoc(1) for details.
-
- Note, -I cannot be used in conjunction with _f_u_l_l or
- _f_a_s_t _t_a_b_l_e_s, i.e., the -f, -F, -Cf, or -CF flags.
-
-
-
- Printed 4/3/91 26 May 1990 2
-
-
-
-
- FLEX(1) UNIX Programmer's Manual FLEX(1)
-
-
- -L instructs _f_l_e_x not to generate #line directives in
- lex.yy.c. The default is to generate such directives so
- error messages in the actions will be correctly located
- with respect to the original _f_l_e_x input file, and not
- to the fairly meaningless line numbers of lex.yy.c.
-
- -T makes _f_l_e_x run in _t_r_a_c_e mode. It will generate a lot
- of messages to _s_t_d_o_u_t concerning the form of the input
- and the resultant non-deterministic and deterministic
- finite automata. This option is mostly for use in
- maintaining _f_l_e_x.
-
- -8 instructs _f_l_e_x to generate an 8-bit scanner. On some
- sites, this is the default. On others, the default is
- 7-bit characters. To see which is the case, check the
- verbose (-v) output for "equivalence classes created".
- If the denominator of the number shown is 128, then by
- default _f_l_e_x is generating 7-bit characters. If it is
- 256, then the default is 8-bit characters.
-
- -C[efmF]
- controls the degree of table compression.
-
- -Ce directs _f_l_e_x to construct _e_q_u_i_v_a_l_e_n_c_e _c_l_a_s_s_e_s,
- i.e., sets of characters which have identical lexical
- properties. Equivalence classes usually give dramatic
- reductions in the final table/object file sizes (typi-
- cally a factor of 2-5) and are pretty cheap
- performance-wise (one array look-up per character
- scanned).
-
- -Cf specifies that the _f_u_l_l scanner tables should be
- generated - _f_l_e_x should not compress the tables by tak-
- ing advantages of similar transition functions for dif-
- ferent states.
-
- -CF specifies that the alternate fast scanner represen-
- tation (described in flexdoc(1)) should be used.
-
- -Cm directs _f_l_e_x to construct _m_e_t_a-_e_q_u_i_v_a_l_e_n_c_e _c_l_a_s_s_e_s,
- which are sets of equivalence classes (or characters,
- if equivalence classes are not being used) that are
- commonly used together. Meta-equivalence classes are
- often a big win when using compressed tables, but they
- have a moderate performance impact (one or two "if"
- tests and one array look-up per character scanned).
-
- A lone -C specifies that the scanner tables should be
- compressed but neither equivalence classes nor meta-
- equivalence classes should be used.
-
- The options -Cf or -CF and -Cm do not make sense
- together - there is no opportunity for meta-equivalence
-
-
- Printed 4/3/91 26 May 1990 3
-
-
-
-
- FLEX(1) UNIX Programmer's Manual FLEX(1)
-
-
- classes if the table is not being compressed. Other-
- wise the options may be freely mixed.
-
- The default setting is -Cem, which specifies that _f_l_e_x
- should generate equivalence classes and meta-
- equivalence classes. This setting provides the highest
- degree of table compression. You can trade off
- faster-executing scanners at the cost of larger tables
- with the following generally being true:
-
- slowest & smallest
- -Cem
- -Cm
- -Ce
- -C
- -C{f,F}e
- -C{f,F}
- fastest & largest
-
-
- -C options are not cumulative; whenever the flag is
- encountered, the previous -C settings are forgotten.
-
- -Sskeleton_file
- overrides the default skeleton file from which _f_l_e_x
- constructs its scanners. You'll never need this option
- unless you are doing _f_l_e_x maintenance or development.
-
- SUMMARY OF FLEX REGULAR EXPRESSIONS
- The patterns in the input are written using an extended set
- of regular expressions. These are:
-
- x match the character 'x'
- . any character except newline
- [xyz] a "character class"; in this case, the pattern
- matches either an 'x', a 'y', or a 'z'
- [abj-oZ] a "character class" with a range in it; matches
- an 'a', a 'b', any letter from 'j' through 'o',
- or a 'Z'
- [^A-Z] a "negated character class", i.e., any character
- but those in the class. In this case, any
- character EXCEPT an uppercase letter.
- [^A-Z\n] any character EXCEPT an uppercase letter or
- a newline
- r* zero or more r's, where r is any regular expression
- r+ one or more r's
- r? zero or one r's (that is, "an optional r")
- r{2,5} anywhere from two to five r's
- r{2,} two or more r's
- r{4} exactly 4 r's
- {name} the expansion of the "name" definition
- (see above)
- "[xyz]\"foo"
-
-
- Printed 4/3/91 26 May 1990 4
-
-
-
-
- FLEX(1) UNIX Programmer's Manual FLEX(1)
-
-
- the literal string: [xyz]"foo
- \X if X is an 'a', 'b', 'f', 'n', 'r', 't', or 'v',
- then the ANSI-C interpretation of \x.
- Otherwise, a literal 'X' (used to escape
- operators such as '*')
- \123 the character with octal value 123
- \x2a the character with hexadecimal value 2a
- (r) match an r; parentheses are used to override
- precedence (see below)
-
-
- rs the regular expression r followed by the
- regular expression s; called "concatenation"
-
-
- r|s either an r or an s
-
-
- r/s an r but only if it is followed by an s. The
- s is not part of the matched text. This type
- of pattern is called as "trailing context".
- ^r an r, but only at the beginning of a line
- r$ an r, but only at the end of a line. Equivalent
- to "r/\n".
-
-
- <s>r an r, but only in start condition s (see
- below for discussion of start conditions)
- <s1,s2,s3>r
- same, but in any of start conditions s1,
- s2, or s3
-
-
- <<EOF>> an end-of-file
- <s1,s2><<EOF>>
- an end-of-file when in start condition s1 or s2
-
- The regular expressions listed above are grouped according
- to precedence, from highest precedence at the top to lowest
- at the bottom. Those grouped together have equal pre-
- cedence.
-
- Some notes on patterns:
-
- - Negated character classes _m_a_t_c_h _n_e_w_l_i_n_e_s unless "\n"
- (or an equivalent escape sequence) is one of the char-
- acters explicitly present in the negated character
- class (e.g., "[^A-Z\n]").
-
- - A rule can have at most one instance of trailing con-
- text (the '/' operator or the '$' operator). The start
- condition, '^', and "<<EOF>>" patterns can only occur
- at the beginning of a pattern, and, as well as with '/'
-
-
- Printed 4/3/91 26 May 1990 5
-
-
-
-
- FLEX(1) UNIX Programmer's Manual FLEX(1)
-
-
- and '$', cannot be grouped inside parentheses. The
- following are all illegal:
-
- foo/bar$
- foo|(bar$)
- foo|^bar
- <sc1>foo<sc2>bar
-
-
- SUMMARY OF SPECIAL ACTIONS
- In addition to arbitrary C code, the following can appear in
- actions:
-
- - ECHO copies yytext to the scanner's output.
-
- - BEGIN followed by the name of a start condition places
- the scanner in the corresponding start condition.
-
- - REJECT directs the scanner to proceed on to the "second
- best" rule which matched the input (or a prefix of the
- input). yytext and yyleng are set up appropriately.
- Note that REJECT is a particularly expensive feature in
- terms scanner performance; if it is used in _a_n_y of the
- scanner's actions it will slow down _a_l_l of the
- scanner's matching. Furthermore, REJECT cannot be used
- with the -_f or -_F options.
-
- Note also that unlike the other special actions, REJECT
- is a _b_r_a_n_c_h; code immediately following it in the
- action will _n_o_t be executed.
-
- - yymore() tells the scanner that the next time it
- matches a rule, the corresponding token should be
- _a_p_p_e_n_d_e_d onto the current value of yytext rather than
- replacing it.
-
- - yyless(n) returns all but the first _n characters of the
- current token back to the input stream, where they will
- be rescanned when the scanner looks for the next match.
- yytext and yyleng are adjusted appropriately (e.g.,
- yyleng will now be equal to _n ).
-
- - unput(c) puts the character _c back onto the input
- stream. It will be the next character scanned.
-
- - input() reads the next character from the input stream
- (this routine is called yyinput() if the scanner is
- compiled using C++).
-
- - yyterminate() can be used in lieu of a return statement
- in an action. It terminates the scanner and returns a
- 0 to the scanner's caller, indicating "all done".
-
-
-
- Printed 4/3/91 26 May 1990 6
-
-
-
-
- FLEX(1) UNIX Programmer's Manual FLEX(1)
-
-
- By default, yyterminate() is also called when an end-
- of-file is encountered. It is a macro and may be rede-
- fined.
-
- - YY_NEW_FILE is an action available only in <<EOF>>
- rules. It means "Okay, I've set up a new input file,
- continue scanning".
-
- - yy_create_buffer( file, size ) takes a _F_I_L_E pointer and
- an integer _s_i_z_e. It returns a YY_BUFFER_STATE handle to
- a new input buffer large enough to accomodate _s_i_z_e
- characters and associated with the given file. When in
- doubt, use YY_BUF_SIZE for the size.
-
- - yy_switch_to_buffer( new_buffer ) switches the
- scanner's processing to scan for tokens from the given
- buffer, which must be a YY_BUFFER_STATE.
-
- - yy_delete_buffer( buffer ) deletes the given buffer.
-
- VALUES AVAILABLE TO THE USER
- - char *yytext holds the text of the current token. It
- may not be modified.
-
- - int yyleng holds the length of the current token. It
- may not be modified.
-
- - FILE *yyin is the file which by default _f_l_e_x reads
- from. It may be redefined but doing so only makes
- sense before scanning begins. Changing it in the mid-
- dle of scanning will have unexpected results since _f_l_e_x
- buffers its input. Once scanning terminates because an
- end-of-file has been seen, void yyrestart( FILE
- *new_file ) may be called to point _y_y_i_n at the new
- input file.
-
- - FILE *yyout is the file to which ECHO actions are done.
- It can be reassigned by the user.
-
- - YY_CURRENT_BUFFER returns a YY_BUFFER_STATE handle to
- the current buffer.
-
- MACROS THE USER CAN REDEFINE
- - YY_DECL controls how the scanning routine is declared.
- By default, it is "int yylex()", or, if prototypes are
- being used, "int yylex(void)". This definition may be
- changed by redefining the "YY_DECL" macro. Note that
- if you give arguments to the scanning routine using a
- K&R-style/non-prototyped function declaration, you must
- terminate the definition with a semi-colon (;).
-
- - The nature of how the scanner gets its input can be
- controlled by redefining the YY_INPUT macro.
-
-
- Printed 4/3/91 26 May 1990 7
-
-
-
-
- FLEX(1) UNIX Programmer's Manual FLEX(1)
-
-
- YY_INPUT's calling sequence is
- "YY_INPUT(buf,result,max_size)". Its action is to
- place up to _m_a_x__s_i_z_e characters in the character array
- _b_u_f and return in the integer variable _r_e_s_u_l_t either
- the number of characters read or the constant YY_NULL
- (0 on Unix systems) to indicate EOF. The default
- YY_INPUT reads from the global file-pointer "yyin". A
- sample redefinition of YY_INPUT (in the definitions
- section of the input file):
-
- %{
- #undef YY_INPUT
- #define YY_INPUT(buf,result,max_size) \
- { \
- int c = getchar(); \
- result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \
- }
- %}
-
-
- - When the scanner receives an end-of-file indication
- from YY_INPUT, it then checks the yywrap() function.
- If yywrap() returns false (zero), then it is assumed
- that the function has gone ahead and set up _y_y_i_n to
- point to another input file, and scanning continues.
- If it returns true (non-zero), then the scanner ter-
- minates, returning 0 to its caller.
-
- The default yywrap() always returns 1. Presently, to
- redefine it you must first "#undef yywrap", as it is
- currently implemented as a macro. It is likely that
- yywrap() will soon be defined to be a function rather
- than a macro.
-
- - YY_USER_ACTION can be redefined to provide an action
- which is always executed prior to the matched rule's
- action.
-
- - The macro YY_USER_INIT may be redefined to provide an
- action which is always executed before the first scan.
-
- - In the generated scanner, the actions are all gathered
- in one large switch statement and separated using
- YY_BREAK, which may be redefined. By default, it is
- simply a "break", to separate each rule's action from
- the following rule's.
-
- FILES
- _f_l_e_x._s_k_e_l
- skeleton scanner.
-
- _l_e_x._y_y._c
- generated scanner (called _l_e_x_y_y._c on some systems).
-
-
- Printed 4/3/91 26 May 1990 8
-
-
-
-
- FLEX(1) UNIX Programmer's Manual FLEX(1)
-
-
- _l_e_x._b_a_c_k_t_r_a_c_k
- backtracking information for -b flag (called _l_e_x._b_c_k on
- some systems).
-
- -lfl library with which to link the scanners.
-
- SEE ALSO
- flexdoc(1), lex(1), yacc(1), sed(1), awk(1).
-
- M. E. Lesk and E. Schmidt, _L_E_X - _L_e_x_i_c_a_l _A_n_a_l_y_z_e_r _G_e_n_e_r_a_t_o_r
-
- DIAGNOSTICS
- _r_e_j_e_c_t__u_s_e_d__b_u_t__n_o_t__d_e_t_e_c_t_e_d _u_n_d_e_f_i_n_e_d or
-
- _y_y_m_o_r_e__u_s_e_d__b_u_t__n_o_t__d_e_t_e_c_t_e_d _u_n_d_e_f_i_n_e_d - These errors can
- occur at compile time. They indicate that the scanner uses
- REJECT or yymore() but that _f_l_e_x failed to notice the fact,
- meaning that _f_l_e_x scanned the first two sections looking for
- occurrences of these actions and failed to find any, but
- somehow you snuck some in (via a #include file, for exam-
- ple). Make an explicit reference to the action in your _f_l_e_x
- input file. (Note that previously _f_l_e_x supported a
- %used/%unused mechanism for dealing with this problem; this
- feature is still supported but now deprecated, and will go
- away soon unless the author hears from people who can argue
- compellingly that they need it.)
-
- _f_l_e_x _s_c_a_n_n_e_r _j_a_m_m_e_d - a scanner compiled with -s has encoun-
- tered an input string which wasn't matched by any of its
- rules.
-
- _f_l_e_x _i_n_p_u_t _b_u_f_f_e_r _o_v_e_r_f_l_o_w_e_d - a scanner rule matched a
- string long enough to overflow the scanner's internal input
- buffer (16K bytes - controlled by YY_BUF_MAX in
- "flex.skel").
-
- _s_c_a_n_n_e_r _r_e_q_u_i_r_e_s -_8 _f_l_a_g - Your scanner specification
- includes recognizing 8-bit characters and you did not
- specify the -8 flag (and your site has not installed flex
- with -8 as the default).
-
- _f_a_t_a_l _f_l_e_x _s_c_a_n_n_e_r _i_n_t_e_r_n_a_l _e_r_r_o_r--_e_n_d _o_f _b_u_f_f_e_r _m_i_s_s_e_d -
- This can occur in an scanner which is reentered after a
- long-jump has jumped out (or over) the scanner's activation
- frame. Before reentering the scanner, use:
-
- yyrestart( yyin );
-
-
- _t_o_o _m_a_n_y %_t _c_l_a_s_s_e_s! - You managed to put every single char-
- acter into its own %t class. _f_l_e_x requires that at least
- one of the classes share characters.
-
-
-
- Printed 4/3/91 26 May 1990 9
-
-
-
-
- FLEX(1) UNIX Programmer's Manual FLEX(1)
-
-
- AUTHOR
- Vern Paxson, with the help of many ideas and much inspira-
- tion from Van Jacobson. Original version by Jef Poskanzer.
-
- See flexdoc(1) for additional credits and the address to
- send comments to.
-
- DEFICIENCIES / BUGS
- Some trailing context patterns cannot be properly matched
- and generate warning messages ("Dangerous trailing con-
- text"). These are patterns where the ending of the first
- part of the rule matches the beginning of the second part,
- such as "zx*/xy*", where the 'x*' matches the 'x' at the
- beginning of the trailing context. (Note that the POSIX
- draft states that the text matched by such patterns is unde-
- fined.)
-
- For some trailing context rules, parts which are actually
- fixed-length are not recognized as such, leading to the
- abovementioned performance loss. In particular, parts using
- '|' or {n} (such as "foo{3}") are always considered
- variable-length.
-
- Combining trailing context with the special '|' action can
- result in _f_i_x_e_d trailing context being turned into the more
- expensive _v_a_r_i_a_b_l_e trailing context. For example, this hap-
- pens in the following example:
-
- %%
- abc |
- xyz/def
-
-
- Use of unput() invalidates yytext and yyleng.
-
- Use of unput() to push back more text than was matched can
- result in the pushed-back text matching a beginning-of-line
- ('^') rule even though it didn't come at the beginning of
- the line (though this is rare!).
-
- Pattern-matching of NUL's is substantially slower than
- matching other characters.
-
- _f_l_e_x does not generate correct #line directives for code
- internal to the scanner; thus, bugs in _f_l_e_x._s_k_e_l yield bogus
- line numbers.
-
- Due to both buffering of input and read-ahead, you cannot
- intermix calls to <stdio.h> routines, such as, for example,
- getchar(), with _f_l_e_x rules and expect it to work. Call
- input() instead.
-
-
-
-
- Printed 4/3/91 26 May 1990 10
-
-
-
-
- FLEX(1) UNIX Programmer's Manual FLEX(1)
-
-
- The total table entries listed by the -v flag excludes the
- number of table entries needed to determine what rule has
- been matched. The number of entries is equal to the number
- of DFA states if the scanner does not use REJECT, and some-
- what greater than the number of states if it does.
-
- REJECT cannot be used with the -_f or -_F options.
-
- Some of the macros, such as yywrap(), may in the future
- become functions which live in the -lfl library. This will
- doubtless break a lot of code, but may be required for
- POSIX-compliance.
-
- The _f_l_e_x internal algorithms need documentation.
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
- Printed 4/3/91 26 May 1990 11
-
-
-