Support for yylval assumes that `YYSTYPE' is a valid type. Support
for yylloc assumes that `YYLTYPE' is a valid type. Typically, these
types are generated by `bison', and are included in section 1 of the
`flex' input.
File: flex, Node: Lex and Posix, Next: Memory Management, Prev: Reentrant, Up: Top
20 Incompatibilities with Lex and Posix
***************************************
`flex' is a rewrite of the AT&T Unix _lex_ tool (the two
implementations do not share any code, though), with some extensions and
incompatibilities, both of which are of concern to those who wish to
write scanners acceptable to both implementations. `flex' is fully
compliant with the POSIX `lex' specification, except that when using
`%pointer' (the default), a call to `unput()' destroys the contents of
`yytext', which is counter to the POSIX specification. In this section
we discuss all of the known areas of incompatibility between `flex',
AT&T `lex', and the POSIX specification. `flex''s `-l' option turns on
maximum compatibility with the original AT&T `lex' implementation, at
the cost of a major loss in the generated scanner's performance. We
note below which incompatibilities can be overcome using the `-l'
option. `flex' is fully compatible with `lex' with the following
exceptions:
* The undocumented `lex' scanner internal variable `yylineno' is not
supported unless `-l' or `%option yylineno' is used.
* `yylineno' should be maintained on a per-buffer basis, rather than
a per-scanner (single global variable) basis.
* `yylineno' is not part of the POSIX specification.
* The `input()' routine is not redefinable, though it may be called
to read characters following whatever has been matched by a rule.
If `input()' encounters an end-of-file the normal `yywrap()'
processing is done. A "real" end-of-file is returned by `input()'
as `EOF'.
* Input is instead controlled by defining the `YY_INPUT()' macro.
* The `flex' restriction that `input()' cannot be redefined is in
accordance with the POSIX specification, which simply does not
specify any way of controlling the scanner's input other than by
making an initial assignment to `yyin'.
* The `unput()' routine is not redefinable. This restriction is in
accordance with POSIX.
* `flex' scanners are not as reentrant as `lex' scanners. In
particular, if you have an interactive scanner and an interrupt
handler which long-jumps out of the scanner, and the scanner is
subsequently called again, you may get the following message:
fatal flex scanner internal error--end of buffer missed
To reenter the scanner, first use:
yyrestart( yyin );
Note that this call will throw away any buffered input; usually
this isn't a problem with an interactive scanner. *Note
Reentrant::, for `flex''s reentrant API.
* Also note that `flex' C++ scanner classes _are_ reentrant, so if
using C++ is an option for you, you should use them instead.
*Note Cxx::, and *Note Reentrant:: for details.
* `output()' is not supported. Output from the ECHO macro is done
to the file-pointer `yyout' (default `stdout').
* `output()' is not part of the POSIX specification.
* `lex' does not support exclusive start conditions (%x), though they
are in the POSIX specification.
* When definitions are expanded, `flex' encloses them in parentheses.
With `lex', the following:
NAME [A-Z][A-Z0-9]*
%%
foo{NAME}? printf( "Found it\n" );
%%
will not match the string `foo' because when the macro is expanded
the rule is equivalent to `foo[A-Z][A-Z0-9]*?' and the precedence
is such that the `?' is associated with `[A-Z0-9]*'. With `flex',
the rule will be expanded to `foo([A-Z][A-Z0-9]*)?' and so the
string `foo' will match.
* Note that if the definition begins with `^' or ends with `$' then
it is _not_ expanded with parentheses, to allow these operators to
appear in definitions without losing their special meanings. But
the `<s>', `/', and `<<EOF>>' operators cannot be used in a `flex'
definition.
* Using `-l' results in the `lex' behavior of no parentheses around
the definition.
* The POSIX specification is that the definition be enclosed in
parentheses.
* Some implementations of `lex' allow a rule's action to begin on a
separate line, if the rule's pattern has trailing whitespace:
%%
foo|bar<space here>
{ foobar_action();}
`flex' does not support this feature.
* The `lex' `%r' (generate a Ratfor scanner) option is not
supported. It is not part of the POSIX specification.
* After a call to `unput()', _yytext_ is undefined until the next
token is matched, unless the scanner was built using `%array'.
This is not the case with `lex' or the POSIX specification. The
`-l' option does away with this incompatibility.
* The precedence of the `{,}' (numeric range) operator is different.
The AT&T and POSIX specifications of `lex' interpret `abc{1,3}'
as "match one, two, or three occurrences of `abc'", whereas `flex'
interprets it as "match `ab' followed by one, two, or three
occurrences of `c'". The `-l' and `--posix' options do away with
this incompatibility.
* The precedence of the `^' operator is different. `lex' interprets
`^foo|bar' as "match either `foo' at the beginning of a line, or
`bar' anywhere", whereas `flex' interprets it as "match either
`foo' or `bar' if they come at the beginning of a line". The
latter is in agreement with the POSIX specification.
* The special table-size declarations such as `%a' supported by
`lex' are not required by `flex' scanners. `flex' ignores them.
* The name `FLEX_SCANNER' is `#define''d so scanners may be written
for use with either `flex' or `lex'. Scanners also include
`YY_FLEX_MAJOR_VERSION', `YY_FLEX_MINOR_VERSION' and
`YY_FLEX_SUBMINOR_VERSION' indicating which version of `flex'
generated the scanner. For example, for the 2.5.22 release, these
defines would be 2, 5 and 22 respectively. If the version of
`flex' being used is a beta version, then the symbol `FLEX_BETA'
is defined.
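As a minimal sketch of how these symbols might be used, the following C
fragment selects code at compile time depending on whether `flex'
generated the scanner. The helper names `scanner_generator' and
`scanner_version' are hypothetical; only the macro names come from the
text above. When the fragment is compiled outside a `flex'-generated
scanner, none of the macros is defined and the fallback branches are
taken.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper (not part of flex or lex): reports which
 * generator produced the scanner this code is compiled into. */
const char *scanner_generator(void)
{
#ifdef FLEX_SCANNER
    return "flex";
#else
    return "lex";
#endif
}

/* Hypothetical helper: writes "major.minor" into buf when the flex
 * version macros are available, or "n/a" otherwise.  Returns the
 * value of snprintf(). */
int scanner_version(char *buf, size_t size)
{
    assert(buf != NULL && size > 0);
#if defined(FLEX_SCANNER) && defined(YY_FLEX_MAJOR_VERSION) && \
    defined(YY_FLEX_MINOR_VERSION)
    return snprintf(buf, size, "%d.%d",
                    YY_FLEX_MAJOR_VERSION, YY_FLEX_MINOR_VERSION);
#else
    return snprintf(buf, size, "n/a");
#endif
}
```

Placed in the user code of a scanner specification, the same functions
would take the `flex' branches, since `FLEX_SCANNER' is defined in the
generated file.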
The following `flex' features are not included in `lex' or the POSIX
specification:
* C++ scanners
* %option
* start condition scopes
* start condition stacks
* interactive/non-interactive scanners
* yy_scan_string() and friends
* yyterminate()
* yy_set_interactive()
* yy_set_bol()
* YY_AT_BOL()
* <<EOF>>
* <*>
* YY_DECL
* YY_START
* YY_USER_ACTION
* YY_USER_INIT
* #line directives
* %{}'s around actions
* reentrant C API
* multiple actions on a line
* almost all of the `flex' command-line options
The feature "multiple actions on a line" refers to the fact that
with `flex' you can put multiple actions on the same line, separated
with semi-colons, while with `lex', the following:
foo handle_foo(); ++num_foos_seen;
is (rather surprisingly) truncated to
foo handle_foo();
`flex' does not truncate the action. Actions that are not enclosed
in braces are simply terminated at the end of the line.
File: flex, Node: Memory Management, Next: Serialized Tables, Prev: Lex and Posix, Up: Top
21 Memory Management
********************
This chapter describes how flex handles dynamic memory, and how you can
override the default behavior.
* Menu:
* The Default Memory Management::
* Overriding The Default Memory Management::
* A Note About yytext And Memory::
File: flex, Node: The Default Memory Management, Next: Overriding The Default Memory Management, Prev: Memory Management, Up: Memory Management
21.1 The Default Memory Management
==================================
Flex allocates dynamic memory during initialization, and once in a
while from within a call to yylex(). Initialization takes place during
the first call to yylex(). Thereafter, flex may reallocate more memory
if it needs to enlarge a buffer. As of version 2.5.9 Flex will clean up
all memory when you call `yylex_destroy'. *Note faq-memory-leak::.
Flex allocates dynamic memory for five purposes, listed below: (1)
16kB for the input buffer.
Flex allocates memory for the character buffer used to perform
pattern matching. Flex must read ahead from the input stream and
store it in a large character buffer. This buffer is typically
the largest chunk of dynamic memory flex consumes. This buffer
will grow if necessary, doubling the size each time. Flex frees
this memory when you call yylex_destroy(). The default size of
this buffer (16384 bytes) is almost always too large. The ideal
size for this buffer is the length of the longest token expected.
Flex will allocate a few extra bytes for housekeeping.
16kB for the REJECT state. This will only be allocated if you use REJECT.
The size is the same as the input buffer, so if you override the
size of the input buffer, then you automatically override the size
of this buffer as well.
100 bytes for the start condition stack.
Flex allocates memory for the start condition stack. This is the
stack used for pushing start states, i.e., with yy_push_state().
It will grow if necessary. Since the states are simply integers,
this stack doesn't consume much memory. This stack is not present
if `%option stack' is not specified. You will rarely need to tune
this buffer. The ideal size for this stack is the maximum depth
expected. The memory for this stack is automatically destroyed
when you call yylex_destroy(). *Note option-stack::.
40 bytes for each YY_BUFFER_STATE.
Flex allocates memory for each YY_BUFFER_STATE. The buffer state
itself is about 40 bytes, plus an additional large character
buffer (described above.) The initial buffer state is created
during initialization, and with each call to yy_create_buffer().
You can't tune the size of this, but you can tune the character
buffer as described above. Any buffer state that you explicitly
create by calling yy_create_buffer() is _NOT_ destroyed
automatically. You must call yy_delete_buffer() to free the
memory. The exception to this rule is that flex will delete the
current buffer automatically when you call yylex_destroy(). If you
delete the current buffer, be sure to set it to NULL. That way,
flex will not try to delete the buffer a second time (possibly
crashing your program!) At the time of this writing, flex does not
provide a growable stack for the buffer states. You have to
manage that yourself. *Note Multiple Input Buffers::.
84 bytes for the reentrant scanner guts
Flex allocates about 84 bytes for the reentrant scanner structure
when you call yylex_init(). It is destroyed when the user calls
yylex_destroy().
---------- Footnotes ----------
(1) The quantities given here are approximate, and may vary due to
host architecture, compiler configuration, or due to future
enhancements to flex.
File: flex, Node: Overriding The Default Memory Management, Next: A Note About yytext And Memory, Prev: The Default Memory Management, Up: Memory Management
21.2 Overriding The Default Memory Management
=============================================
Flex calls the functions `yyalloc', `yyrealloc', and `yyfree' when it
needs to allocate or free memory. By default, these functions are
wrappers around the standard C functions, `malloc', `realloc', and
`free', respectively. You can override the default implementations by
telling flex that you will provide your own implementations.
To override the default implementations, you must do two things:
1. Suppress the default implementations by specifying one or more of
the following options:
* `%option noyyalloc'
* `%option noyyrealloc'
* `%option noyyfree'.
2. Provide your own implementation of the following functions: (1)
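The two steps above can be sketched as follows. This is only an
illustration, and it assumes the non-reentrant signatures (one size
argument for `yyalloc', a pointer and a size for `yyrealloc', a pointer
for `yyfree'); a reentrant scanner is assumed to pass an additional
scanner argument, and the generated code may use its own size type.
The `live_blocks' counter is a hypothetical piece of bookkeeping added
for the example, not something flex provides.

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative bookkeeping, not part of flex: number of blocks
 * currently allocated through these wrappers. */
static long live_blocks = 0;

/* Replacement for the default yyalloc (assumed non-reentrant form). */
void *yyalloc(size_t bytes)
{
    void *p = malloc(bytes);
    if (p != NULL)
        live_blocks++;
    return p;
}

/* Replacement for the default yyrealloc.  realloc(NULL, n) behaves
 * like malloc(n), so that case creates a new block.  A size of 0 is
 * not handled specially in this sketch. */
void *yyrealloc(void *ptr, size_t bytes)
{
    void *p = realloc(ptr, bytes);
    if (ptr == NULL && p != NULL)
        live_blocks++;
    return p;
}

/* Replacement for the default yyfree. */
void yyfree(void *ptr)
{
    if (ptr != NULL) {
        assert(live_blocks > 0);
        live_blocks--;
        free(ptr);
    }
}
```

With `%option noyyalloc noyyrealloc noyyfree' in the scanner
specification, definitions like these would be linked in place of the
defaults. Checking that `live_blocks' is zero after `yylex_destroy()'
is one way to look for leaks.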
The following is a list of `flex' diagnostic messages:
* `warning, rule cannot be matched' indicates that the given rule
cannot be matched because it follows other rules that will always
match the same text as it. For example, in the following `foo'
cannot be matched because it comes after an identifier "catch-all"
rule:
[a-z]+ got_identifier();
foo got_foo();
Using `REJECT' in a scanner suppresses this warning.
* `warning, -s option given but default rule can be matched' means
that it is possible (perhaps only in a particular start condition)
that the default rule (match any single character) is the only one
that will match a particular input. Since `-s' was given,
presumably this is not intended.
* `reject_used_but_not_detected undefined' or
`yymore_used_but_not_detected undefined'. These errors can occur
at compile time. They indicate that the scanner uses `REJECT' or
`yymore()' but that `flex' failed to notice the fact, meaning that
`flex' scanned the first two sections looking for occurrences of
these actions and failed to find any, but somehow you snuck some in
(via a #include file, for example). Use `%option reject' or
`%option yymore' to indicate to `flex' that you really do use
these features.
* `flex scanner jammed'. A scanner compiled with `-s' has
encountered an input string which wasn't matched by any of its
rules. This error can also occur due to internal problems.
* `token too large, exceeds YYLMAX'. Your scanner uses `%array' and
one of its rules matched a string longer than the `YYLMAX'
constant (8K bytes by default). You can increase the value by
#define'ing `YYLMAX' in the definitions section of your `flex'
input.
* `scanner requires -8 flag to use the character 'x''. Your scanner
specification includes recognizing the 8-bit character `'x'' and
you did not specify the -8 flag, and your scanner defaulted to
7-bit because you used the `-Cf' or `-CF' table compression
options. See the discussion of the `-7' flag, *Note Scanner
Options::, for details.
* `flex scanner push-back overflow'. You used `unput()' to push back
so much text that the scanner's buffer could not hold both the
pushed-back text and the current token in `yytext'. Ideally the
scanner should dynamically resize the buffer in this case, but at
present it does not.
* `input buffer overflow, can't enlarge buffer because scanner uses
REJECT'. The scanner was working on matching an extremely large
token and needed to expand the input buffer. This doesn't work
with scanners that use `REJECT'.
* `fatal flex scanner internal error--end of buffer missed'. This can
occur in a scanner which is reentered after a long-jump has jumped
out (or over) the scanner's activation frame. Before reentering
the scanner, use:
yyrestart( yyin );
or, as noted above, switch to using the C++ scanner class.
* `too many start conditions in <> construct!'. You listed more start
conditions in a <> construct than exist (so you must have listed at
least one of them twice).
File: flex, Node: Limitations, Next: Bibliography, Prev: Diagnostics, Up: Top
24 Limitations
**************
Some trailing context patterns cannot be properly matched and generate
warning messages (`dangerous trailing context'). These are patterns
where the ending of the first part of the rule matches the beginning of
the second part, such as `zx*/xy*', where the `x*' matches the `x' at
the beginning of the trailing context. (Note that the POSIX draft
states that the text matched by such patterns is undefined.)
For some trailing context rules, parts which are actually fixed-length
are not recognized as such, leading to the abovementioned performance
loss. In particular, parts using `|' or `{n}' (such as `foo{3}') are
always considered variable-length.
Combining trailing context with the special `|' action can result in
_fixed_ trailing context being turned into the more expensive
_variable_ trailing context. For example, in the following:
%%
abc |
xyz/def
Use of `unput()' invalidates yytext and yyleng, unless the `%array'
directive or the `-l' option has been used.
Pattern-matching of `NUL's is substantially slower than matching other
characters.
Dynamic resizing of the input buffer is slow, as it entails rescanning
all the text matched so far by the current (generally huge) token.
Due to both buffering of input and read-ahead, you cannot intermix
calls to `<stdio.h>' routines, such as `getchar()', with `flex' rules
and expect it to work. Call `input()' instead.
The total table entries listed by the `-v' flag excludes the number of
table entries needed to determine what rule has been matched. The
number of entries is equal to the number of DFA states if the scanner
does not use `REJECT', and somewhat greater than the number of states
if it does.
`REJECT' cannot be used with the `-f' or `-F' options.
The `flex' internal algorithms need documentation.
File: flex, Node: Bibliography, Next: FAQ, Prev: Limitations, Up: Top
25 Additional Reading
*********************
You may wish to read more about the following programs:
* lex
* yacc
* sed
* awk
The following books may contain material of interest:
John Levine, Tony Mason, and Doug Brown, _Lex & Yacc_, O'Reilly and
Associates. Be sure to get the 2nd edition.
M. E. Lesk and E. Schmidt, _LEX - Lexical Analyzer Generator_
Alfred Aho, Ravi Sethi and Jeffrey Ullman, _Compilers: Principles,
Techniques and Tools_, Addison-Wesley (1986). Describes the
pattern-matching techniques used by `flex' (deterministic finite
automata).
File: flex, Node: FAQ, Next: Appendices, Prev: Bibliography, Up: Top
FAQ
***
From time to time, the `flex' maintainer receives certain questions.
Rather than repeat answers to well-understood problems, we publish them
here.
* Menu:
* When was flex born?::
* How do I expand \ escape sequences in C-style quoted strings?::
* Why do flex scanners call fileno if it is not ANSI compatible?::
* Does flex support recursive pattern definitions?::
* How do I skip huge chunks of input (tens of megabytes) while using flex?::
* Flex is not matching my patterns in the same order that I defined them.::
* My actions are executing out of order or sometimes not at all.::
* How can I have multiple input sources feed into the same scanner at the same time?::
* Can I build nested parsers that work with the same input file?::
* How can I match text only at the end of a file?::
* How can I make REJECT cascade across start condition boundaries?::
* Why cant I use fast or full tables with interactive mode?::
* How much faster is -F or -f than -C?::
* If I have a simple grammar cant I just parse it with flex?::
* Why doesnt yyrestart() set the start state back to INITIAL?::
* How can I match C-style comments?::
* The period isnt working the way I expected.::
* Can I get the flex manual in another format?::
* Does there exist a "faster" NDFA->DFA algorithm?::
* How does flex compile the DFA so quickly?::
* How can I use more than 8192 rules?::
* How do I abandon a file in the middle of a scan and switch to a new file?::
* How do I execute code only during initialization (only before the first scan)?::
* How do I execute code at termination?::
* Where else can I find help?::
* Can I include comments in the "rules" section of the file?::
* I get an error about undefined yywrap().::
* How can I change the matching pattern at run time?::
* How can I expand macros in the input?::
* How can I build a two-pass scanner?::
* How do I match any string not matched in the preceding rules?::
* I am trying to port code from AT&T lex that uses yysptr and yysbuf.::
* Is there a way to make flex treat NULL like a regular character?::
* Whenever flex can not match the input it says "flex scanner jammed".::
* Why doesnt flex have non-greedy operators like perl does?::
* Memory leak - 16386 bytes allocated by malloc.::
* How do I track the byte offset for lseek()?::
* How do I use my own I/O classes in a C++ scanner?::
* How do I skip as many chars as possible?::
* deleteme00::
* Are certain equivalent patterns faster than others?::
* Is backing up a big deal?::
* Can I fake multi-byte character support?::
* deleteme01::
* Can you discuss some flex internals?::
* unput() messes up yy_at_bol::
* The | operator is not doing what I want::
* Why can't flex understand this variable trailing context pattern?::
* The ^ operator isn't working::
* Trailing context is getting confused with trailing optional patterns::
* Is flex GNU or not?::
* ERASEME53::
* I need to scan if-then-else blocks and while loops::
* ERASEME55::
* ERASEME56::
* ERASEME57::
* Is there a repository for flex scanners?::
* How can I conditionally compile or preprocess my flex input file?::
* Where can I find grammars for lex and yacc?::
* I get an end-of-buffer message for each character scanned.::
* unnamed-faq-62::
* unnamed-faq-63::
* unnamed-faq-64::
* unnamed-faq-65::
* unnamed-faq-66::
* unnamed-faq-67::
* unnamed-faq-68::
* unnamed-faq-69::
* unnamed-faq-70::
* unnamed-faq-71::
* unnamed-faq-72::
* unnamed-faq-73::
* unnamed-faq-74::
* unnamed-faq-75::
* unnamed-faq-76::
* unnamed-faq-77::
* unnamed-faq-78::
* unnamed-faq-79::
* unnamed-faq-80::
* unnamed-faq-81::
* unnamed-faq-82::
* unnamed-faq-83::
* unnamed-faq-84::
* unnamed-faq-85::
* unnamed-faq-86::
* unnamed-faq-87::
* unnamed-faq-88::
* unnamed-faq-90::
* unnamed-faq-91::
* unnamed-faq-92::
* unnamed-faq-93::
* unnamed-faq-94::
* unnamed-faq-95::
* unnamed-faq-96::
* unnamed-faq-97::
* unnamed-faq-98::
* unnamed-faq-99::
* unnamed-faq-100::
* unnamed-faq-101::
File: flex, Node: When was flex born?, Next: How do I expand \ escape sequences in C-style quoted strings?, Up: FAQ
When was flex born?
===================
Vern Paxson took over the `Software Tools' lex project from Jef
Poskanzer in 1982. At that point it was written in Ratfor. Around
1987 or so, Paxson translated it into C, and a legend was born :-).
File: flex, Node: How do I expand \ escape sequences in C-style quoted strings?, Next: Why do flex scanners call fileno if it is not ANSI compatible?, Prev: When was flex born?, Up: FAQ
How do I expand \ escape sequences in C-style quoted strings?
=============================================================
A key point when scanning quoted strings is that you cannot (easily)
write a single rule that will precisely match the string if you allow
things like embedded escape sequences and newlines. If you try to
match strings with a single rule then you'll wind up having to rescan
the string anyway to find any escape sequences.
Instead you can use exclusive start conditions and a set of rules,
one for matching non-escaped text, one for matching a single escape,
one for matching an embedded newline, and one for recognizing the end
of the string. Each of these rules is then faced with the question of
where to put its intermediary results. The best solution is for the
rules to append their local value of `yytext' to the end of a "string
literal" buffer. A rule like the escape-matcher will append to the
buffer the meaning of the escape sequence rather than the literal text
in `yytext'. In this way, `yytext' does not need to be modified at all.
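The "string literal" buffer described above might look like the
following sketch. All names here (`str_begin', `str_append',
`str_append_escape', `STR_MAX') are hypothetical helpers invented for
the illustration, the buffer has an arbitrary fixed capacity, and only
a few escapes are translated.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define STR_MAX 256              /* illustrative fixed capacity */

static char   str_buf[STR_MAX];  /* the "string literal" buffer */
static size_t str_len = 0;

/* Start a new string literal, e.g. when the opening quote is seen. */
void str_begin(void)
{
    str_len = 0;
    str_buf[0] = '\0';
}

/* Append n bytes of matched text (such as yytext with yyleng).
 * Returns 0 on success, -1 if the buffer would overflow. */
int str_append(const char *text, size_t n)
{
    assert(text != NULL);
    if (str_len + n >= STR_MAX)
        return -1;
    memcpy(str_buf + str_len, text, n);
    str_len += n;
    str_buf[str_len] = '\0';
    return 0;
}

/* Append the meaning of an escape sequence, given the character
 * that follows the backslash, rather than its literal text. */
int str_append_escape(char c)
{
    char out;
    switch (c) {
    case 'n': out = '\n'; break;
    case 't': out = '\t'; break;
    case 'r': out = '\r'; break;
    default:  out = c;    break;  /* \\, \" and unknown escapes */
    }
    return str_append(&out, 1);
}
```

In a scanner, the rule for non-escaped text would call
`str_append(yytext, yyleng)' and the rule for a single escape would
call `str_append_escape(yytext[1])', so `yytext' itself is never
modified.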
File: flex, Node: Why do flex scanners call fileno if it is not ANSI compatible?, Next: Does flex support recursive pattern definitions?, Prev: How do I expand \ escape sequences in C-style quoted strings?, Up: FAQ
Why do flex scanners call fileno if it is not ANSI compatible?
==============================================================
Flex scanners call `fileno()' in order to get the file descriptor
corresponding to `yyin'. The file descriptor may be passed to
`isatty()' or `read()', depending upon which `%options' you specified.
If your system does not have `fileno()' support, to get rid of the
`read()' call, do not specify `%option read'. To get rid of the
`isatty()' call, you must specify one of `%option always-interactive' or
`%option never-interactive'.
File: flex, Node: Does flex support recursive pattern definitions?, Next: How do I skip huge chunks of input (tens of megabytes) while using flex?, Prev: Why do flex scanners call fileno if it is not ANSI compatible?, Up: FAQ
Does flex support recursive pattern definitions?
================================================
e.g.,
%%
block "{"({block}|{statement})*"}"
No. You cannot have recursive definitions. The pattern-matching
power of regular expressions in general (and therefore flex scanners,
too) is limited. In particular, regular expressions cannot "balance"
parentheses to an arbitrary degree. For example, it's impossible to
write a regular expression that matches all strings containing the same
number of '{'s as '}'s. For more powerful pattern matching, you need a
parser, such as `GNU bison'.
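To see why, consider what checking properly nested braces takes in
ordinary C. The sketch below (the function name `braces_balanced' is
hypothetical) needs a counter that can grow without bound as the
nesting gets deeper. A finite automaton has only a fixed number of
states and so cannot keep such a count for arbitrary input.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 if every '{' in s is closed by a later '}' and no '}'
 * appears without an open '{'; returns 0 otherwise.  Characters
 * other than braces are ignored. */
int braces_balanced(const char *s)
{
    size_t depth = 0;            /* current nesting depth */

    assert(s != NULL);
    for (; *s != '\0'; s++) {
        if (*s == '{') {
            depth++;
        } else if (*s == '}') {
            if (depth == 0)
                return 0;        /* closing brace with nothing open */
            depth--;
        }
    }
    return depth == 0;           /* everything opened was closed */
}
```

A parser generator keeps the equivalent of `depth' on its parse stack,
which is why a recursive rule such as `block' belongs in the grammar
rather than in the scanner.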
File: flex, Node: How do I skip huge chunks of input (tens of megabytes) while using flex?, Next: Flex is not matching my patterns in the same order that I defined them., Prev: Does flex support recursive pattern definitions?, Up: FAQ
How do I skip huge chunks of input (tens of megabytes) while using flex?
========================================================================
Use `fseek()' (or `lseek()') to position yyin, then call `yyrestart()'.
File: flex, Node: Flex is not matching my patterns in the same order that I defined them., Next: My actions are executing out of order or sometimes not at all., Prev: How do I skip huge chunks of input (tens of megabytes) while using flex?, Up: FAQ
Flex is not matching my patterns in the same order that I defined them.
=======================================================================
`flex' picks the rule that matches the most text (i.e., the longest
possible input string). This is because `flex' uses an entirely
different matching technique ("deterministic finite automata") that
actually does all of the matching simultaneously, in parallel. (Seems
impossible, but it's actually a fairly simple technique once you
understand the principles.)
A side-effect of this parallel matching is that when the input
matches more than one rule, `flex' scanners pick the rule that matched
the _most_ text. This is explained further in the manual, in the
section *Note Matching::.
If you want `flex' to choose a shorter match, then you can work
around this behavior by expanding your short rule to match more text,
then put back the extra:
data_.* yyless( 5 ); BEGIN BLOCKIDSTATE;
Another fix would be to make the second rule active only during the
`<BLOCKIDSTATE>' start condition, and make that start condition
exclusive by declaring it with `%x' instead of `%s'.
A final fix is to change the input language so that the ambiguity for
`data_' is removed, by adding characters to it that don't match the
identifier rule, or by removing characters (such as `_') from the
identifier rule so it no longer matches `data_'. (Of course, you might
also not have the option of changing the input language.)
File: flex, Node: My actions are executing out of order or sometimes not at all., Next: How can I have multiple input sources feed into the same scanner at the same time?, Prev: Flex is not matching my patterns in the same order that I defined them., Up: FAQ
My actions are executing out of order or sometimes not at all.
==============================================================
Most likely, you have (in error) placed the opening `{' of the action
block on a different line than the rule, e.g.,
^(foo|bar)
{ <<<--- WRONG!
}
`flex' requires that the opening `{' of an action associated with a
rule begin on the same line as does the rule. You need instead to
write your rules as follows:
^(foo|bar) { // CORRECT!
}
File: flex, Node: How can I have multiple input sources feed into the same scanner at the same time?, Next: Can I build nested parsers that work with the same input file?, Prev: My actions are executing out of order or sometimes not at all., Up: FAQ
How can I have multiple input sources feed into the same scanner at the same time?
==================================================================================
If
* your scanner is free of backtracking (verified using `flex''s `-b'
flag),
* AND you run your scanner interactively (`-I' option; default
unless using special table compression options),
* AND you feed it one character at a time by redefining `YY_INPUT'
to do so,
then every time it matches a token, it will have exhausted its input
buffer (because the scanner is free of backtracking). This means you
can safely use `select()' at that point and only call `yylex()' for
another token if `select()' indicates there's data available.
That is, move the `select()' out from the input function to a point
where it determines whether `yylex()' gets called for the next token.
With this approach, you will still have problems if your input can
arrive piecemeal; `select()' could inform you that the beginning of a
token is available, you call `yylex()' to get it, but it winds up
blocking waiting for the later characters in the token.
Here's another way: Move your input multiplexing inside of
`YY_INPUT'. That is, whenever `YY_INPUT' is called, it `select()''s to
see where input is available. If input is available for the scanner,
it reads and returns the next byte. If input is available from another
source, it calls whatever function is responsible for reading from that
source. (If no input is available, it blocks until some input is
available.) I've used this technique in an interpreter I wrote that
both reads keyboard input using a `flex' scanner and IPC traffic from
sockets, and it works fine.
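The second approach can be sketched roughly as below, assuming a POSIX
system with `select()'. The function `multiplexed_read' and the
example handler `drain_other' are hypothetical names; a real program
would call something like `multiplexed_read' from its own `YY_INPUT'
definition and supply a handler suited to its other input source. The
handler is expected to consume whatever made the other descriptor
readable, otherwise the loop would spin.

```c
#include <assert.h>
#include <sys/select.h>
#include <unistd.h>

/* Hypothetical helper: block until one byte can be read from
 * scanner_fd, servicing other_fd through handle_other whenever it
 * becomes readable first.  Returns 1 with *out set, 0 at end of the
 * scanner's input, or -1 on error. */
int multiplexed_read(int scanner_fd, int other_fd,
                     void (*handle_other)(int fd), char *out)
{
    assert(out != NULL && handle_other != NULL);
    for (;;) {
        fd_set readable;
        int maxfd = scanner_fd > other_fd ? scanner_fd : other_fd;

        FD_ZERO(&readable);
        FD_SET(scanner_fd, &readable);
        FD_SET(other_fd, &readable);

        /* No timeout: block until some input is available. */
        if (select(maxfd + 1, &readable, NULL, NULL, NULL) < 0)
            return -1;

        if (FD_ISSET(other_fd, &readable))
            handle_other(other_fd);      /* not scanner input */

        if (FD_ISSET(scanner_fd, &readable)) {
            ssize_t n = read(scanner_fd, out, 1);
            return n < 0 ? -1 : (int)n;  /* 1 byte, or 0 at EOF */
        }
    }
}

/* Example handler for illustration: consume one byte and count it. */
int other_events = 0;
void drain_other(int fd)
{
    char c;
    if (read(fd, &c, 1) == 1)
        other_events++;
}
```

Returning one byte at a time keeps the scanner from buffering input it
has not yet been asked to match, at the cost of one `select()' and one
`read()' per character.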
File: flex, Node: Can I build nested parsers that work with the same input file?, Next: How can I match text only at the end of a file?, Prev: How can I have multiple input sources feed into the same scanner at the same time?, Up: FAQ
Can I build nested parsers that work with the same input file?
==============================================================
This is not going to work without some additional effort. The reason is
that `flex' block-buffers the input it reads from `yyin'. This means
that the "outermost" `yylex()', when called, will automatically slurp
up the first 8K of input available on yyin, and subsequent calls to
other `yylex()''s won't see that input. You might be tempted to work
around this problem by redefining `YY_INPUT' to only return a small
amount of text, but it turns out that that approach is quite difficult.
Instead, the best solution is to combine all of your scanners into one
large scanner, using a different exclusive start condition for each.
File: flex, Node: How can I match text only at the end of a file?, Next: How can I make REJECT cascade across start condition boundaries?, Prev: Can I build nested parsers that work with the same input file?, Up: FAQ
How can I match text only at the end of a file?
===============================================
There is no way to write a rule which is "match this text, but only if
it comes at the end of the file". You can fake it, though, if you
happen to have a character lying around that you don't allow in your
input. Then you redefine `YY_INPUT' to call your own routine which, if
it sees an `EOF', returns the magic character first (and remembers to
return a real `EOF' next time it's called). Then you could write:
<COMMENT>(.|\n)*{EOF_CHAR} /* saw comment at EOF */
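The input routine described above might be sketched as follows. The
name `read_with_eof_marker' and the choice of `\001' as the magic
character are assumptions made for the example; any byte that never
occurs in your real input would do. A read error is treated the same
as end-of-file here, which a production scanner may not want.

```c
#include <assert.h>
#include <stdio.h>

#define EOF_CHAR '\001'   /* assumed never to occur in real input */

static int eof_marker_sent = 0;

/* Hypothetical routine for use from YY_INPUT: stores up to max_size
 * bytes from fp in buf and returns how many were stored.  At
 * end-of-file it first delivers the magic character, and only on the
 * following call returns 0 to signal the real end of input. */
int read_with_eof_marker(FILE *fp, char *buf, int max_size)
{
    size_t n;

    assert(fp != NULL && buf != NULL && max_size > 0);
    n = fread(buf, 1, (size_t)max_size, fp);
    if (n > 0)
        return (int)n;            /* ordinary input */
    if (!eof_marker_sent) {
        eof_marker_sent = 1;      /* remember for the next call */
        buf[0] = EOF_CHAR;
        return 1;
    }
    return 0;                     /* the real end-of-file */
}
```

The scanner would then define `YY_INPUT(buf,result,max_size)' to
assign the return value of this routine to `result', and define
`EOF_CHAR' as the same byte in the definitions section so that the
rule above can refer to it.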
File: flex, Node: How can I make REJECT cascade across start condition boundaries?, Next: Why cant I use fast or full tables with interactive mode?, Prev: How can I match text only at the end of a file?, Up: FAQ
How can I make REJECT cascade across start condition boundaries?
================================================================
You can do this as follows. Suppose you have a start condition `A', and
after exhausting all of the possible matches in `<A>', you want to try
matches in `<INITIAL>'. Then you could use the following:
%x A
%%
<A>rule_that_is_long ...; REJECT;
<A>rule ...; REJECT; /* shorter rule */
<A>etc.
...
<A>.|\n {
/* Shortest and last rule in <A>, so
* cascaded REJECT's will eventually
* wind up matching this rule. We want
* to now switch to the initial state
* and try matching from there instead.
*/
yyless(0); /* put back matched text */
BEGIN(INITIAL);
}
File: flex, Node: Why cant I use fast or full tables with interactive mode?, Next: How much faster is -F or -f than -C?, Prev: How can I make REJECT cascade across start condition boundaries?, Up: FAQ
Why can't I use fast or full tables with interactive mode?
==========================================================
One of the assumptions flex makes is that interactive applications are
inherently slow (they're waiting on a human after all). It has to do
with how the scanner detects that it must be finished scanning a token.
For interactive scanners, after scanning each character the current
state is looked up in a table (essentially) to see whether there's a
chance of another input character possibly extending the length of the
match. If not, the scanner halts. For non-interactive scanners, the
end-of-token test is much simpler, basically a compare with 0, so no
memory bus cycles. Since the test occurs in the innermost scanning
loop, one would like to make it go as fast as possible.
Still, it seems reasonable to allow the user to choose to trade off
a bit of performance in this area to gain the corresponding
flexibility. There might be another reason, though, why fast scanners
don't support the interactive option.
File: flex, Node: How much faster is -F or -f than -C?, Next: If I have a simple grammar cant I just parse it with flex?, Prev: Why cant I use fast or full tables with interactive mode?, Up: FAQ
How much faster is -F or -f than -C?
====================================
Much faster (factor of 2-3).
File: flex, Node: If I have a simple grammar cant I just parse it with flex?, Next: Why doesnt yyrestart() set the start state back to INITIAL?, Prev: How much faster is -F or -f than -C?, Up: FAQ
If I have a simple grammar can't I just parse it with flex?
===========================================================
Is your grammar recursive? That's almost always a sign that you're
better off using a parser/scanner rather than just trying to use a
scanner alone.
File: flex, Node: Why doesnt yyrestart() set the start state back to INITIAL?, Next: How can I match C-style comments?, Prev: If I have a simple grammar cant I just parse it with flex?, Up: FAQ
Why doesn't yyrestart() set the start state back to INITIAL?
============================================================
There are two reasons. The first is that there might be programs that
rely on the start state not changing across file changes. The second
is that beginning with `flex' version 2.4, use of `yyrestart()' is no
longer required, so fixing the problem there doesn't solve the more
general problem.
File: flex, Node: How can I match C-style comments?, Next: The period isnt working the way I expected., Prev: Why doesnt yyrestart() set the start state back to INITIAL?, Up: FAQ
How can I match C-style comments?
=================================
You might be tempted to try something like this:
"/*".*"*/" // WRONG!
or, worse, this:
"/*"(.|\n)*"*/" // WRONG!
The above rules will eat too much input, and blow up on things like:
/* a comment */ do_my_thing( "oops */" );
Here is one way that allows you to track line information:
%x IN_COMMENT
%%
<INITIAL>{
"/*"      BEGIN(IN_COMMENT);
}
<IN_COMMENT>{
"*/"      BEGIN(INITIAL);
[^*\n]+   // eat comment in chunks
"*"       // eat the lone star
\n        yylineno++;
}
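The start-condition approach above is an ordinary two-state machine, and
it can help to see it outside of `flex'. The C sketch below is
illustrative only and is not code that `flex' generates; the name
`strip_c_comments' and its interface are made up for this example. It
copies everything outside comments and counts newlines in both states,
the way the `\n' rule does. Like the rules above, it knows nothing about
string literals.

```c
#include <assert.h>
#include <stddef.h>

enum state { INITIAL_STATE, IN_COMMENT_STATE };

/* Copy src to dst with C-style comments removed. dst must have room for
 * strlen(src) + 1 bytes. Returns the number of newlines seen, including
 * those inside comments. */
int strip_c_comments(const char *src, char *dst)
{
    enum state st = INITIAL_STATE;
    int lines = 0;
    size_t i = 0, o = 0;

    assert(src != NULL && dst != NULL);
    while (src[i] != '\0') {
        if (src[i] == '\n')
            lines++;
        if (st == INITIAL_STATE) {
            if (src[i] == '/' && src[i + 1] == '*') {
                st = IN_COMMENT_STATE;  /* like BEGIN(IN_COMMENT) */
                i += 2;
            } else {
                dst[o++] = src[i++];    /* ordinary text: keep it */
            }
        } else {
            if (src[i] == '*' && src[i + 1] == '/') {
                st = INITIAL_STATE;     /* like BEGIN(INITIAL) */
                i += 2;
            } else {
                i++;                    /* eat comment text */
            }
        }
    }
    dst[o] = '\0';
    return lines;
}
```

Fed the troublesome line shown earlier, this drops `/* a comment */' and
copies the `"oops */"' string through untouched, because the comment
ends at the first `*/' and a `*/' seen outside a comment is just text.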
File: flex, Node: The period isnt working the way I expected., Next: Can I get the flex manual in another format?, Prev: How can I match C-style comments?, Up: FAQ
The '.' isn't working the way I expected.
=========================================
Here are some tips for using `.':
* A common mistake is to place the closing parenthesis of a group
AFTER an operator, when you really meant to place it BEFORE the
operator, e.g., you probably want this `(foo|bar)+' and NOT
this `(foo|bar+)'.
The first pattern matches the words `foo' or `bar' any number of
times, e.g., it matches the text `barfoofoobarfoo'. The second
pattern matches a single instance of `foo' or a single instance of
`bar' followed by one or more `r's, e.g., it matches the text
`barrrr'.
* A `.' inside `[]''s just means a literal `.' (period), and NOT "any
character except newline".
* Remember that `.' matches any character EXCEPT `\n' (and `EOF').
If you really want to match ANY character, including newlines,
then use `(.|\n)'. Beware that the regex `(.|\n)+' will match your
entire input!
* Finally, if you want to match a literal `.' (a period), then use
`[.]' or `"."'.
File: flex, Node: Can I get the flex manual in another format?, Next: Does there exist a "faster" NDFA->DFA algorithm?, Prev: The period isnt working the way I expected., Up: FAQ
Can I get the flex manual in another format?
============================================
The `flex' source distribution includes a texinfo manual. You are free
to convert that texinfo into whatever format you desire. The `texinfo'
package includes tools for conversion to a number of formats.
File: flex, Node: Does there exist a "faster" NDFA->DFA algorithm?, Next: How does flex compile the DFA so quickly?, Prev: Can I get the flex manual in another format?, Up: FAQ
Does there exist a "faster" NDFA->DFA algorithm?
================================================
There's no way around the potential exponential running time - it can
take you exponential time just to enumerate all of the DFA states. In
practice, though, the running time is closer to linear, or sometimes
quadratic.
File: flex, Node: How does flex compile the DFA so quickly?, Next: How can I use more than 8192 rules?, Prev: Does there exist a "faster" NDFA->DFA algorithm?, Up: FAQ
How does flex compile the DFA so quickly?
=========================================
There are two big speed wins that `flex' uses:
1. It analyzes the input rules to construct equivalence classes for
those characters that always make the same transitions. It then
rewrites the NFA using equivalence classes for transitions instead
of characters. This cuts down the NFA->DFA computation time
dramatically, to the point where, for uncompressed DFA tables, the
DFA generation is often I/O bound in writing out the tables.
2. It maintains hash values for previously computed DFA states, so
testing whether a newly constructed DFA state is equivalent to a
previously constructed state can be done very quickly, by first
comparing hash values.
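To make the first point concrete, here is a simplified C sketch of
grouping characters into equivalence classes. It is not `flex''s
implementation (which works from the NFA's transitions); it assumes you
already have each character set used by the rules as a 256-entry
membership table, and `build_equiv_classes' is a hypothetical name. Two
characters land in the same class when no set distinguishes them.

```c
#include <assert.h>
#include <string.h>

#define NCHARS 256

/* sets[k][c] is nonzero when character c belongs to the k-th character
 * set. On return, ec[c] holds a class number starting at 1; two
 * characters share a class iff they belong to exactly the same sets.
 * Returns the number of classes. */
int build_equiv_classes(unsigned char sets[][NCHARS], int nsets, int ec[NCHARS])
{
    int nclasses = 0;
    int c, d, k;

    assert(nsets >= 0);
    memset(ec, 0, NCHARS * sizeof ec[0]);
    for (c = 0; c < NCHARS; c++) {
        if (ec[c] != 0)
            continue;                   /* already placed in a class */
        ec[c] = ++nclasses;             /* c starts a new class */
        for (d = c + 1; d < NCHARS; d++) {
            int same = 1;

            if (ec[d] != 0)
                continue;
            for (k = 0; k < nsets; k++) {
                if ((sets[k][c] != 0) != (sets[k][d] != 0)) {
                    same = 0;           /* set k tells c and d apart */
                    break;
                }
            }
            if (same)
                ec[d] = ec[c];
        }
    }
    return nclasses;
}
```

With just the sets `[a-z]' and `[0-9]', this yields three classes
(letters, digits, everything else), so the NFA->DFA step has 3 input
symbols to consider instead of 256.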
File: flex, Node: How can I use more than 8192 rules?, Next: How do I abandon a file in the middle of a scan and switch to a new file?, Prev: How does flex compile the DFA so quickly?, Up: FAQ
How can I use more than 8192 rules?
===================================
`Flex' is compiled with an upper limit of 8192 rules per scanner. If
you need more than 8192 rules in your scanner, you'll have to recompile
`flex' with the following changes in `flexdef.h':
< #define YY_TRAILING_MASK 0x2000
< #define YY_TRAILING_HEAD_MASK 0x4000
--
> #define YY_TRAILING_MASK 0x20000000
> #define YY_TRAILING_HEAD_MASK 0x40000000
This should work okay as long as your C compiler uses 32-bit
integers. But you might want to think about whether using such a huge
number of rules is the best way to solve your problem.
The following may also be relevant:
With luck, you should be able to increase the definitions in
flexdef.h for:
#define JAMSTATE -32766 /* marks a reference to the state that always jams */
#define MAXIMUM_MNS 31999
#define BAD_SUBSCRIPT -32767
recompile everything, and it'll all work. Flex only has these
16-bit-like values built into it because a long time ago it was
developed on a machine with 16-bit ints. I've given this advice to
others in the past but haven't heard back from them whether it worked
okay or not...
File: flex, Node: How do I abandon a file in the middle of a scan and switch to a new file?, Next: How do I execute code only during initialization (only before the first scan)?, Prev: How can I use more than 8192 rules?, Up: FAQ
How do I abandon a file in the middle of a scan and switch to a new file?
=========================================================================
Just call `yyrestart(newfile)'. Be sure to reset the start state if you
want a "fresh start", since `yyrestart()' does NOT reset the start state
back to `INITIAL'.
File: flex, Node: How do I execute code only during initialization (only before the first scan)?, Next: How do I execute code at termination?, Prev: How do I abandon a file in the middle of a scan and switch to a new file?, Up: FAQ
How do I execute code only during initialization (only before the first scan)?
==============================================================================
You can specify an initial action by defining the macro `YY_USER_INIT'
(though note that `yyout' may not be available at the time this macro
is executed). Or you can add to the beginning of your rules section:
%%
        /* Must be indented! */
        static int did_init = 0;
        if ( ! did_init ){
                do_my_init();
                did_init = 1;
        }
File: flex, Node: How do I execute code at termination?, Next: Where else can I find help?, Prev: How do I execute code only during initialization (only before the first scan)?, Up: FAQ
How do I execute code at termination?
=====================================
You can specify an action for the `<<EOF>>' rule.
File: flex, Node: Where else can I find help?, Next: Can I include comments in the "rules" section of the file?, Prev: How do I execute code at termination?, Up: FAQ
Where else can I find help?
===========================
You can find the flex homepage on the web at
`http://lex.sourceforge.net/'. See that page for details about flex
mailing lists as well.
File: flex, Node: Can I include comments in the "rules" section of the file?, Next: I get an error about undefined yywrap()., Prev: Where else can I find help?, Up: FAQ
Can I include comments in the "rules" section of the file?
==========================================================
Yes, just about anywhere you want to. See the manual for the specific
syntax.
File: flex, Node: I get an error about undefined yywrap()., Next: How can I change the matching pattern at run time?, Prev: Can I include comments in the "rules" section of the file?, Up: FAQ
I get an error about undefined yywrap().
========================================
You must supply a `yywrap()' function of your own, or link to `libfl.a'
(which provides one), or use
%option noyywrap
in your source to say you don't want a `yywrap()' function.
File: flex, Node: How can I change the matching pattern at run time?, Next: How can I expand macros in the input?, Prev: I get an error about undefined yywrap()., Up: FAQ
How can I change the matching pattern at run time?
==================================================
You can't; it's compiled into a static table when flex builds the
scanner.
File: flex, Node: How can I expand macros in the input?, Next: How can I build a two-pass scanner?, Prev: How can I change the matching pattern at run time?, Up: FAQ
How can I expand macros in the input?
=====================================
The best way to approach this problem is at a higher level, e.g., in
the parser.
However, you can do this using multiple input buffers.
%%
macro/[a-z]+ {
/* Saw the macro "macro" followed by extra stuff. */
You probably will want a stack of expansion buffers to allow nested
macros. From the above though hopefully the idea is clear.
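The stack of expansion buffers can be pictured with a few lines of plain
C. This sketch does not use the `flex' buffer functions at all;
`input_stack', `push_input' and `next_char' are hypothetical names, and
the macro text is assumed to be an ordinary NUL-terminated string.
Pushing a buffer suspends the current one, and reading resumes there
once the expansion is used up.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEPTH 8
#define END_OF_INPUT (-1)

struct input_stack {
    const char *pos[MAX_DEPTH];  /* read position in each buffer */
    int depth;                   /* number of buffers in use */
};

/* Start reading from text. Returns 0, or -1 if nesting is too deep. */
int push_input(struct input_stack *st, const char *text)
{
    assert(st != NULL && text != NULL);
    if (st->depth >= MAX_DEPTH)
        return -1;
    st->pos[st->depth++] = text;
    return 0;
}

/* Return the next character from the innermost buffer. Exhausted
 * buffers are popped so that reading resumes in the enclosing one. */
int next_char(struct input_stack *st)
{
    assert(st != NULL);
    while (st->depth > 0) {
        const char **p = &st->pos[st->depth - 1];

        if (**p != '\0')
            return (unsigned char)*(*p)++;
        st->depth--;             /* expansion used up: pop it */
    }
    return END_OF_INPUT;
}
```

A scanner built this way would call `push_input()' with the expansion
text when it recognizes a macro; a macro found inside that text simply
pushes another level.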
File: flex, Node: How can I build a two-pass scanner?, Next: How do I match any string not matched in the preceding rules?, Prev: How can I expand macros in the input?, Up: FAQ
How can I build a two-pass scanner?
===================================
One way to do it is to filter the first pass to a temporary file, then
process the temporary file on the second pass. You will probably see a
performance hit, due to all the disk I/O.
When you need to look ahead far forward like this, it almost always
means that the right solution is to build a parse tree of the entire
input, then walk it after the parse in order to generate the output.
In a sense, this is a two-pass approach, once through the text and once
through the parse tree, but the performance hit for the latter is
usually an order of magnitude smaller, since everything is already
classified, in binary format, and residing in memory.
File: flex, Node: How do I match any string not matched in the preceding rules?, Next: I am trying to port code from AT&T lex that uses yysptr and yysbuf., Prev: How can I build a two-pass scanner?, Up: FAQ
How do I match any string not matched in the preceding rules?
=============================================================
One way to assign precedence is to place the more specific rules
first. If two rules would match the same input (same sequence of
characters) then the first rule listed in the `flex' input wins, e.g.,
%%
foo[a-zA-Z_]+ return FOO_ID;
bar[a-zA-Z_]+ return BAR_ID;
[a-zA-Z_]+ return GENERIC_ID;
Note that the rule `[a-zA-Z_]+' must come *after* the others. It
will match the same amount of text as the more specific rules, and in
that case the `flex' scanner will pick the first rule listed in your
scanner as the one to match.
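The selection rule can be sketched in plain C: take the longest match,
and on a tie keep the rule that was listed first. The code below
hard-codes the three rules from the example as "prefix followed by one
or more of `[a-zA-Z_]'"; `pick_rule' and the helper names are invented
for illustration and say nothing about how `flex' does it internally
(it runs a single DFA rather than trying each rule in turn).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum token { NO_MATCH, FOO_ID, BAR_ID, GENERIC_ID };

/* Number of leading characters of s that are in [a-zA-Z_]. */
static size_t ident_len(const char *s)
{
    size_t n = 0;

    while ((s[n] >= 'a' && s[n] <= 'z') ||
           (s[n] >= 'A' && s[n] <= 'Z') || s[n] == '_')
        n++;
    return n;
}

/* Length matched by prefix[a-zA-Z_]+ at the start of s, or 0. */
static size_t match_rule(const char *prefix, const char *s)
{
    size_t plen = strlen(prefix);
    size_t rest;

    if (strncmp(s, prefix, plen) != 0)
        return 0;
    rest = ident_len(s + plen);
    return rest > 0 ? plen + rest : 0;
}

/* Pick the rule for the text at input; *len receives the match length. */
enum token pick_rule(const char *input, size_t *len)
{
    static const struct {
        const char *prefix;
        enum token tok;
    } rules[] = {               /* same order as in the scanner */
        { "foo", FOO_ID },
        { "bar", BAR_ID },
        { "",    GENERIC_ID }
    };
    enum token best = NO_MATCH;
    size_t best_len = 0;
    size_t i;

    assert(input != NULL && len != NULL);
    for (i = 0; i < sizeof rules / sizeof rules[0]; i++) {
        size_t n = match_rule(rules[i].prefix, input);

        if (n > best_len) {     /* strict '>': earlier rule wins ties */
            best_len = n;
            best = rules[i].tok;
        }
    }
    *len = best_len;
    return best;
}
```

Note that the input `foo' on its own goes to the generic rule, since
`foo[a-zA-Z_]+' needs at least one more character after the prefix.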
File: flex, Node: I am trying to port code from AT&T lex that uses yysptr and yysbuf., Next: Is there a way to make flex treat NULL like a regular character?, Prev: How do I match any string not matched in the preceding rules?, Up: FAQ
I am trying to port code from AT&T lex that uses yysptr and yysbuf.
===================================================================
Those are internal variables pointing into the AT&T scanner's input
buffer. I imagine they're being manipulated in user versions of the
`input()' and `unput()' functions. If so, what you need to do is
analyze those functions to figure out what they're doing, and then
replace `input()' with an appropriate definition of `YY_INPUT'. You
shouldn't need to (and must not) replace `flex''s `unput()' function.
File: flex, Node: Is there a way to make flex treat NULL like a regular character?, Next: Whenever flex can not match the input it says "flex scanner jammed"., Prev: I am trying to port code from AT&T lex that uses yysptr and yysbuf., Up: FAQ
Is there a way to make flex treat NULL like a regular character?
================================================================
Yes, `\0' and `\x00' should both do the trick. Perhaps you have an
ancient version of `flex'. The latest release is version 2.5.31.
File: flex, Node: Whenever flex can not match the input it says "flex scanner jammed"., Next: Why doesnt flex have non-greedy operators like perl does?, Prev: Is there a way to make flex treat NULL like a regular character?, Up: FAQ
Whenever flex can not match the input it says "flex scanner jammed".
====================================================================
You need to add a rule that matches the otherwise-unmatched text. e.g.,
%option yylineno
%%
[[a bunch of rules here]]
. printf("bad input character '%s' at line %d\n", yytext, yylineno);
See `%option default' for more information.
File: flex, Node: Why doesnt flex have non-greedy operators like perl does?, Next: Memory leak - 16386 bytes allocated by malloc., Prev: Whenever flex can not match the input it says "flex scanner jammed"., Up: FAQ
Why doesn't flex have non-greedy operators like perl does?
==========================================================
A DFA can do a non-greedy match by stopping the first time it enters an
accepting state, instead of consuming input until it determines that no
further matching is possible (a "jam" state). This is actually easier
to implement than longest leftmost match (which flex does).
But it's also much less useful than longest leftmost match. In
general, when you find yourself wishing for non-greedy matching, that's
usually a sign that you're trying to make the scanner do some parsing.
That's generally the wrong approach, since it lacks the power to do a
decent job. Better is to either introduce a separate parser, or to
split the scanner into multiple scanners using (exclusive) start
conditions.
You might have a separate start state once you've seen the `BEGIN'.
In that state, you might then have a regex that will match `END' (to
kick you out of the state), and perhaps `(.|\n)' to get a single
character within the chunk ...
This approach also has much better error-reporting properties.
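The difference between the two stopping conditions is small in code.
Here is a sketch using a hand-written DFA for `[0-9]+'; the names
`step', `accepting' and `match_len' are hypothetical, and a generated
scanner is table-driven rather than written like this.

```c
#include <assert.h>
#include <stddef.h>

enum { JAM = -1 };

/* DFA for [0-9]+ : state 0 is the start state, state 1 is accepting. */
static int step(int state, char c)
{
    if (state != JAM && c >= '0' && c <= '9')
        return 1;
    return JAM;
}

static int accepting(int state)
{
    return state == 1;
}

/* Length of the match at the start of s. With stop_at_first_accept set,
 * scanning stops the first time an accepting state is entered
 * (non-greedy); otherwise it runs until the DFA jams and reports the
 * last accepting position seen (longest match). */
size_t match_len(const char *s, int stop_at_first_accept)
{
    int state = 0;
    size_t i = 0, last_accept = 0;

    assert(s != NULL);
    while (s[i] != '\0') {
        state = step(state, s[i]);
        if (state == JAM)
            break;
        i++;
        if (accepting(state)) {
            last_accept = i;
            if (stop_at_first_accept)
                break;
        }
    }
    return last_accept;
}
```

On the input `123x' the longest-match variant consumes all three digits
as one token, while the non-greedy variant stops after the first digit,
which is rarely what a scanner wants.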
File: flex, Node: Memory leak - 16386 bytes allocated by malloc., Next: How do I track the byte offset for lseek()?, Prev: Why doesnt flex have non-greedy operators like perl does?, Up: FAQ
Memory leak - 16386 bytes allocated by malloc.
==============================================
UPDATED 2002-07-10: As of `flex' version 2.5.9, this leak means that
you did not call `yylex_destroy()'. If you are using an earlier version
of `flex', then read on.
The leak is about 16426 bytes. That is, (8192 * 2 + 2) for the
read-buffer, and about 40 for `struct yy_buffer_state' (depending upon
alignment). The leak is in the non-reentrant C scanner only (NOT in the
reentrant scanner, NOT in the C++ scanner). Since `flex' doesn't know
when you are done, the buffer is never freed.
However, the leak won't multiply since the buffer is reused no
matter how many times you call `yylex()'.
If you want to reclaim the memory when you are completely done
scanning, then you might try this:
/* For non-reentrant C scanner only. */
yy_delete_buffer(YY_CURRENT_BUFFER);
yy_init = 1;
Note: `yy_init' is an "internal variable", and hasn't been tested in
this situation. It is possible that some other globals may need
resetting as well.
File: flex, Node: How do I track the byte offset for lseek()?, Next: How do I use my own I/O classes in a C++ scanner?, Prev: Memory leak - 16386 bytes allocated by malloc., Up: FAQ
How do I track the byte offset for lseek()?
===========================================
> We thought that it would be possible to have this number through the
While this is the right idea, it has two problems. The first is that
it's possible that `flex' will request less than `YY_READ_BUF_SIZE'
during an invocation of `YY_INPUT' (or that your input source will
return less even though `YY_READ_BUF_SIZE' bytes were requested). The
second problem is that when refilling its internal buffer, `flex' keeps
some characters from the previous buffer (because usually it's in the
middle of a match, and needs those characters to construct `yytext' for
the match once it's done). Because of this, `yy_c_buf_p -
YY_CURRENT_BUFFER->yy_ch_buf' won't be exactly the number of characters
already read from the current buffer.
An alternative solution is to count the number of characters you've
matched since starting to scan. This can be done by using
`YY_USER_ACTION'. For example,
#define YY_USER_ACTION num_chars += yyleng;
(You need to be careful to update your bookkeeping if you use
`yymore()', `yyless()', `unput()', or `input()'.)
File: flex, Node: How do I use my own I/O classes in a C++ scanner?, Next: How do I skip as many chars as possible?, Prev: How do I track the byte offset for lseek()?, Up: FAQ
How do I use my own I/O classes in a C++ scanner?
=================================================
When the flex C++ scanning class rewrite finally happens, then this
sort of thing should become much easier.
You can do this by passing the various functions (such as
`LexerInput()' and `LexerOutput()') NULL `iostream*''s, and then
dealing with your own I/O classes surreptitiously (i.e., stashing them
in special member variables). This works because the only assumption
the lexer makes about the iostreams is that they're ultimately passed
to `LexerInput()' and `LexerOutput()', which then do whatever is
necessary with them.
File: flex, Node: How do I skip as many chars as possible?, Next: deleteme00, Prev: How do I use my own I/O classes in a C++ scanner?, Up: FAQ
How do I skip as many chars as possible?
========================================
How do I skip as many chars as possible - without interfering with the
other patterns?
In the example below, we want to skip over characters until we see
the phrase "endskip". The following will _NOT_ work correctly (do you
see why not?)
/* INCORRECT SCANNER */
%x SKIP
%%
<INITIAL>startskip BEGIN(SKIP);
...
<SKIP>"endskip" BEGIN(INITIAL);
<SKIP>.* ;
The problem is that the pattern `.*' will eat up the word "endskip".
The simplest (but slow) fix is:
<SKIP>"endskip" BEGIN(INITIAL);
<SKIP>. ;
A faster fix involves making the second rule match more, without
making it match "endskip" plus something else. So for example:
<SKIP>"endskip" BEGIN(INITIAL);
<SKIP>[^e]+ ;
<SKIP>. ;/* so you eat up e's, too */
File: flex, Node: deleteme00, Next: Are certain equivalent patterns faster than others?, Prev: How do I skip as many chars as possible?, Up: FAQ
deleteme00
==========
QUESTION:
When was flex born?
Vern Paxson took over the Software Tools lex project from Jef Poskanzer
in 1982. At that point it was written in Ratfor. Around 1987 or so,
Paxson translated it into C, and a legend was born :-).
File: flex, Node: Are certain equivalent patterns faster than others?, Next: Is backing up a big deal?, Prev: deleteme00, Up: FAQ
Are certain equivalent patterns faster than others?
===================================================
> (in a very complicated flex program) caused the program to slow from
> 300K+/min to 100K/min (no other changes were done).
These two are not equivalent. For example, the first can match "footnote."
but the second can only match "footnote". This is almost certainly the
cause of the discrepancy - the slower scanner run is matching more tokens,
and/or having to do more backing up.
> 2. Which of these two are better: [Ff]oot or (F|f)oot ?
From a performance point of view, they're equivalent (modulo presumably
minor effects such as memory cache hit rates; and the presence of trailing
context, see below). From a space point of view, the first is slightly
preferable.
> 3. I have a pattern that look like this:
> pats {p1}|{p2}|{p3}|...|{p50} (50 patterns ORd)
>
> running yet another complicated program that includes the following rule:
> <snext>{and}/{no4}{bb}{pats}
>
> gets me to "too complicated - over 32,000 states"...
I can't tell from this example whether the trailing context is variable-length
or fixed-length (it could be the latter if {and} is fixed-length). If it's
variable length, which flex -p will tell you, then this reflects a basic
performance problem, and if you can eliminate it by restructuring your
scanner, you will see significant improvement.
> so I divided {pats} to {pats1}, {pats2},..., {pats5} each consists of about
> 10 patterns and changed the rule to be 5 rules.
> This did compile, but what is the rule of thumb here ?
The rule is to avoid trailing context other than fixed-length, in which for
a/b, either the 'a' pattern or the 'b' pattern has a fixed length. Use
of the '|' operator automatically makes the pattern variable length, so in
this case '[Ff]oot' is preferred to '(F|f)oot'.
> 4. I changed a rule that looked like this:
> <snext8>{and}{bb}/{ROMAN}[^A-Za-z] { BEGIN...
>
> to the next 2 rules:
> <snext8>{and}{bb}/{ROMAN}[A-Za-z] { ECHO;}
> <snext8>{and}{bb}/{ROMAN} { BEGIN...
>
> Again, I understand the using [^...] will cause a great performance loss
Actually, it doesn't cause any sort of performance loss. It's a surprising
fact about regular expressions that they always match in linear time
regardless of how complex they are.
> but are there any specific rules about it ?
See the "Performance Considerations" section of the man page, and also
the example in MISC/fastwc/.
Vern
File: flex, Node: Is backing up a big deal?, Next: Can I fake multi-byte character support?, Prev: Are certain equivalent patterns faster than others?, Up: FAQ
Is backing up a big deal?
=========================
To: Adoram Rogel <adoram@hybridge.com>
Subject: Re: Flex 2.5.2 performance questions
In-reply-to: Your message of Thu, 19 Sep 96 10:16:04 EDT.
Date: Thu, 19 Sep 96 09:58:00 PDT
From: Vern Paxson <vern>
> a lot about the backing up problem.
> I believe that there lies my biggest problem, and I'll try to improve
> it.
Since you have variable trailing context, this is a bigger performance
problem. Fixing it is usually easier than fixing backing up, which in a
complicated scanner (yours seems to fit the bill) can be extremely
difficult to do correctly.
You also don't mention what flags you are using for your scanner.
-f makes a large speed difference, and -Cfe buys you nearly as much
speed but the resulting scanner is considerably smaller.
> I have an | operator in {and} and in {pats} so both of them are variable
> length.
-p should have reported this.
> Is changing one of them to fixed-length is enough ?
Yes.
> Is it possible to change the 32,000 states limit ?
Yes. I've appended instructions on how. Before you make this change,
though, you should think about whether there are ways to fundamentally
simplify your scanner - those are certainly preferable!
Vern
To increase the 32K limit (on a machine with 32 bit integers), you increase
the magnitude of the following in flexdef.h:
#define JAMSTATE -32766 /* marks a reference to the state that always jams */
#define MAXIMUM_MNS 31999
#define BAD_SUBSCRIPT -32767
#define MAX_SHORT 32700
Adding a 0 or two after each should do the trick.
File: flex, Node: Can I fake multi-byte character support?, Next: deleteme01, Prev: Is backing up a big deal?, Up: FAQ
Can I fake multi-byte character support?
========================================
To: Heeman_Lee@hp.com
Subject: Re: flex - multi-byte support?
In-reply-to: Your message of Thu, 03 Oct 1996 17:24:04 PDT.
Date: Fri, 04 Oct 1996 11:42:18 PDT
From: Vern Paxson <vern>
> I assume as long as my *.l file defines the
> range of expected character code values (in octal format), flex will
> scan the file and read multi-byte characters correctly. But I have no
> confidence in this assumption.
Your lack of confidence is justified - this won't work.
Flex has in it a widespread assumption that the input is processed
one byte at a time. Fixing this is on the to-do list, but is involved,
so it won't happen any time soon. In the interim, the best I can suggest
(unless you want to try fixing it yourself) is to write your rules in
terms of pairs of bytes, using definitions in the first section:
X \xfe\xc2
...
%%
foo{X}bar found_foo_fe_c2_bar();
etc. Definitely a pain - sorry about that.
By the way, the email address you used for me is ancient, indicating you
have a very old version of flex. You can get the most recent, 2.5.4, from
ftp.ee.lbl.gov.
Vern
File: flex, Node: deleteme01, Next: Can you discuss some flex internals?, Prev: Can I fake multi-byte character support?, Up: FAQ
> Here I'm mainly puzzled by 'nxt-chk'. I guess it means 'next-check'. (?)
> However, 'template next-check entries' doesn't make much sense to me. To be
> able to find a good translation I need to know a little bit more about it.
There is a scheme in the Aho/Sethi/Ullman compiler book for compressing
scanner tables. It involves creating two pairs of tables. The first has
"base" and "default" entries, the second has "next" and "check" entries.
The "base" entry is indexed by the current state and yields an index into
the next/check table. The "default" entry gives what to do if the state
transition isn't found in next/check. The "next" entry gives the next
state to enter, but only if the "check" entry verifies that this entry is
correct for the current state. Flex creates templates of series of
next/check entries and then encodes differences from these templates as a
way to compress the tables.
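The lookup described above fits in a few lines of C. This sketch follows
the textbook scheme, not the tables `flex' actually emits (it leaves out
templates); the struct, the function name and the tiny example tables
are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define NO_STATE (-1)

struct compressed_dfa {
    const int *base;    /* per state: offset into nxt/chk */
    const int *def;     /* per state: fallback state, or NO_STATE */
    const int *nxt;     /* candidate next state */
    const int *chk;     /* state that owns each nxt entry */
};

/* Follow the transition out of state on input symbol sym. If the
 * next/check entry is not owned by the current state, retry from its
 * default state. Returns NO_STATE when there is no transition. */
int next_state(const struct compressed_dfa *t, int state, int sym)
{
    assert(t != NULL && sym >= 0);
    while (state != NO_STATE) {
        int i = t->base[state] + sym;

        if (t->chk[i] == state)
            return t->nxt[i];
        state = t->def[state];
    }
    return NO_STATE;
}

/* Example with symbols 0..2. State 0: 0->1, 1->2, 2->0. State 1 is
 * like state 0 except 1->1, so it stores one entry of its own and
 * defaults to state 0. State 2 has no transitions at all. */
static const int ex_base[] = { 0, 2, 5 };
static const int ex_def[]  = { NO_STATE, 0, NO_STATE };
static const int ex_nxt[]  = { 1, 2, 0, 1, NO_STATE, NO_STATE, NO_STATE, NO_STATE };
static const int ex_chk[]  = { 0, 0, 0, 1, NO_STATE, NO_STATE, NO_STATE, NO_STATE };
static const struct compressed_dfa example_dfa = { ex_base, ex_def, ex_nxt, ex_chk };
```

Here state 1 costs a single next/check entry instead of three; the
saving grows with the number of states that mostly resemble another
state.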
> #: main.c:533
> msgid " %d/%d base-def entries created\n"
>
> The same problem here for 'base-def'.
See above.
Vern
File: flex, Node: unput() messes up yy_at_bol, Next: The | operator is not doing what I want, Prev: Can you discuss some flex internals?, Up: FAQ
unput() messes up yy_at_bol
===========================
To: Xinying Li <xli@npac.syr.edu>
Subject: Re: FLEX ?
In-reply-to: Your message of Wed, 13 Nov 1996 17:28:38 PST.
Date: Wed, 13 Nov 1996 19:51:54 PST
From: Vern Paxson <vern>
> "unput()" them to input flow, question occurs. If I do this after I scan
> a carriage, the variable "YY_CURRENT_BUFFER->yy_at_bol" is changed. That
> means the carriage flag has gone.
You can control this by calling yy_set_bol(). It's described in the manual.
> And if in pre-reading it goes to the end of file, is anything done
> to control the end of curren buffer and end of file?
No, there's no way to put back an end-of-file.
> By the way I am using flex 2.5.2 and using the "-l".
The latest release is 2.5.4, by the way. It fixes some bugs in 2.5.2 and
2.5.3. You can get it from ftp.ee.lbl.gov.
Vern
File: flex, Node: The | operator is not doing what I want, Next: Why can't flex understand this variable trailing context pattern?, Prev: unput() messes up yy_at_bol, Up: FAQ
The | operator is not doing what I want
=======================================
To: Alain.ISSARD@st.com
Subject: Re: Start condition with FLEX
In-reply-to: Your message of Mon, 18 Nov 1996 09:45:02 PST.
Date: Mon, 18 Nov 1996 10:41:34 PST
From: Vern Paxson <vern>
> I am not able to use the start condition scope and to use the | (OR) with
> rules having start conditions.
The problem is that if you use '|' as a regular expression operator, for
example "a|b" meaning "match either 'a' or 'b'", then it must *not* have
any blanks around it. If you instead want the special '|' *action* (which
from your scanner appears to be the case), which is a way of giving two
different rules the same action:
foo |
bar matched_foo_or_bar();
then '|' *must* be separated from the first rule by whitespace and *must*
be followed by a new line. You *cannot* write it as:
foo | bar matched_foo_or_bar();
even though you might think you could because yacc supports this syntax.
The reason for this unfortunate incompatibility is historical, but it's
unlikely to be changed.
Your problems with start condition scope are simply due to syntax errors
from your use of '|' later confusing flex.
Let me know if you still have problems.
Vern
File: flex, Node: Why can't flex understand this variable trailing context pattern?, Next: The ^ operator isn't working, Prev: The | operator is not doing what I want, Up: FAQ
Why can't flex understand this variable trailing context pattern?
=================================================================
In-reply-to: Your message of Sat, 23 Nov 1996 16:50:09 PST.
Date: Sat, 23 Nov 1996 17:07:32 PST
From: Vern Paxson <vern>
> Enclosed is a lex file that "real" lex will process, but I cannot get
> flex to process it. Could you try it and maybe point me in the right direction?
Your problem is that some of the definitions in the scanner use the '/'
trailing context operator, and have it enclosed in ()'s. Flex does not
allow this operator to be enclosed in ()'s because doing so allows undefined
regular expressions such as "(a/b)+". So the solution is to remove the
parentheses. Note that you must also be building the scanner with the -l
option for AT&T lex compatibility. Without this option, flex automatically
encloses the definitions in parentheses.
Vern
File: flex, Node: The ^ operator isn't working, Next: Trailing context is getting confused with trailing optional patterns, Prev: Why can't flex understand this variable trailing context pattern?, Up: FAQ
The ^ operator isn't working
============================
To: Thomas Hadig <hadig@toots.physik.rwth-aachen.de>
Subject: Re: Flex Bug ?
In-reply-to: Your message of Tue, 26 Nov 1996 14:35:01 PST.
Date: Tue, 26 Nov 1996 11:15:05 PST
From: Vern Paxson <vern>
> In my lexer code, i have the line :
> ^\*.* { }
>
> Thus all lines starting with an astrix (*) are comment lines.
> This does not work !
I can't get this problem to reproduce - it works fine for me. Note
though that if what you have is slightly different:
COMMENT ^\*.*
%%
{COMMENT} { }
then it won't work, because flex pushes back macro definitions enclosed
in ()'s, so the rule becomes
(^\*.*) { }
and now that the '^' operator is not at the immediate beginning of the
line, it's interpreted as just a regular character. You can avoid this
behavior by using the "-l" lex-compatibility flag, or "%option lex-compat".
Vern
File: flex, Node: Trailing context is getting confused with trailing optional patterns, Next: Is flex GNU or not?, Prev: The ^ operator isn't working, Up: FAQ
Trailing context is getting confused with trailing optional patterns
====================================================================
Subject: Re: Possible mistake in Flex v2.5 document
In-reply-to: Your message of Fri, 05 Sep 1997 16:07:24 PDT.
Date: Fri, 05 Sep 1997 10:01:54 PDT
From: Vern Paxson <vern>
> In that example you show how to count comment lines when using
> C style /* ... */ comments. My question is, shouldn't you take into
> account a scenario where end of a comment marker occurs inside
> character or string literals?
The scanner certainly needs to also scan character and string literals.
However it does that (there's an example in the man page for strings), the
lexer will recognize the beginning of the literal before it runs across the
embedded "/*". Consequently, it will finish scanning the literal before it
even considers the possibility of matching "/*".
Example:
'([^']*|{ESCAPE_SEQUENCE})'
will match all the text between the ''s (inclusive). So the lexer
considers this as a token beginning at the first ', and doesn't even
attempt to match other tokens inside it.
I think this subtlety is not worth putting in the manual, as I suspect
it would confuse more people than it would enlighten.
Vern
File: flex, Node: ERASEME57, Next: Is there a repository for flex scanners?, Prev: ERASEME56, Up: FAQ
ERASEME57
=========
To: "Marty Leisner" <leisner@sdsp.mc.xerox.com>
Subject: Re: flex limitations
In-reply-to: Your message of Sat, 06 Sep 1997 11:27:21 PDT.
Date: Mon, 08 Sep 1997 11:38:08 PDT
From: Vern Paxson <vern>
> %%
> [a-zA-Z]+ /* skip a line */
> { printf("got %s\n", yytext); }
> %%
What version of flex are you using? If I feed this to 2.5.4, it complains:
"bug.l", line 5: EOF encountered inside an action
"bug.l", line 5: unrecognized rule
"bug.l", line 5: fatal parse error
Not the world's greatest error message, but it manages to flag the problem.
(With the introduction of start condition scopes, flex can't accommodate
an action on a separate line, since it's ambiguous with an indented rule.)
You can get 2.5.4 from ftp.ee.lbl.gov.
Vern
File: flex, Node: Is there a repository for flex scanners?, Next: How can I conditionally compile or preprocess my flex input file?, Prev: ERASEME57, Up: FAQ
Is there a repository for flex scanners?
========================================
Not that we know of. You might try asking on comp.compilers.
File: flex, Node: How can I conditionally compile or preprocess my flex input file?, Next: Where can I find grammars for lex and yacc?, Prev: Is there a repository for flex scanners?, Up: FAQ
How can I conditionally compile or preprocess my flex input file?
Flex doesn't have a preprocessor like C does. You might try using m4,
or the C preprocessor plus a sed script to clean up the result.
File: flex, Node: Where can I find grammars for lex and yacc?, Next: I get an end-of-buffer message for each character scanned., Prev: How can I conditionally compile or preprocess my flex input file?, Up: FAQ
Where can I find grammars for lex and yacc?
===========================================
In the sources for flex and bison.
File: flex, Node: I get an end-of-buffer message for each character scanned., Next: unnamed-faq-62, Prev: Where can I find grammars for lex and yacc?, Up: FAQ
I get an end-of-buffer message for each character scanned.