the '>' character is only treated as a delimiter at the outermost
level of the code block, so the directive is parsed correctly.
=head2 C<extract_multiple>
The C<extract_multiple> subroutine takes a string to be processed and a
list of extractors (subroutines or regular expressions) to apply to that string.
In an array context C<extract_multiple> returns an array of substrings
of the original string, as extracted by the specified extractors.
In a scalar context, C<extract_multiple> returns the first
substring successfully extracted from the original string. In both
scalar and void contexts the original string has the first successfully
extracted substring removed from it. In all contexts
C<extract_multiple> starts at the current C<pos> of the string, and
sets that C<pos> appropriately after it matches.
Hence, the aim of a call to C<extract_multiple> in a list context
is to split the processed string into as many non-overlapping fields as
possible, by repeatedly applying each of the specified extractors
to the remainder of the string. Thus C<extract_multiple> is
a generalized form of Perl's C<split> subroutine.
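As a rough illustration of the context rules above, the following sketch runs the same extractor list once in list context and once in scalar context. The sample string and the names C<@extractors>, C<$copy> and C<$first> are invented for this example, and the empty prefix passed to C<extract_bracketed> is a choice made here to keep the fields easy to follow.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_multiple extract_bracketed);

# One extractor: a brace-delimited block, with no prefix skipping ('').
my @extractors = ( sub { extract_bracketed($_[0], '{}', '') } );

# List context: every field is returned and $text itself is left intact.
my $text   = 'a {b} c';
my @fields = extract_multiple($text, \@extractors);
# @fields holds 'a ', '{b}' and ' c'

# Scalar context: only the first extracted field is returned, and it is
# removed from the string. The true fourth argument skips unmatched text.
my $copy  = 'a {b} c';
my $first = extract_multiple($copy, \@extractors, undef, 1);
# $first is '{b}', and '{b}' no longer appears in $copy
```

The list-context call behaves much like C<split> with a capturing pattern: the bracketed block and the text around it all come back as fields.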
The subroutine takes up to four optional arguments:
=over 4
=item 1.
A string to be processed (C<$_> if the string is omitted or C<undef>)
=item 2.
A reference to a list of subroutine references and/or qr// objects and/or
literal strings and/or hash references, specifying the extractors
to be used to split the string. If this argument is omitted (or
C<undef>) the list:
    [
        sub { extract_variable($_[0], '') },
        sub { extract_quotelike($_[0],'') },
        sub { extract_codeblock($_[0],'{}','') },
    ]
is used.
=item 3.
A number specifying the maximum number of fields to return. If this
argument is omitted (or C<undef>), split continues as long as possible.
If the third argument is I<N>, then extraction continues until I<N> fields
have been successfully extracted, or until the string has been completely
processed.
Note that in scalar and void contexts the value of this argument is
automatically reset to 1 (under C<-w>, a warning is issued if the argument
has to be reset).
=item 4.
A value indicating whether unmatched substrings (see below) within the
text should be skipped or returned as fields. If the value is true,
such substrings are skipped. Otherwise, they are returned.
=back
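The effect of the third and fourth arguments can be sketched with a single regex extractor. This is a minimal example under stated assumptions: the sample strings and variable names are made up, and a fresh string is used for each call because C<extract_multiple> starts from the string's current C<pos>.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_multiple);

# A single extractor: a run of digits.
my @extractors = ( qr/\d+/ );

# No limit, unmatched text skipped: all four numbers come back.
my $all   = '10, 20, 30, 40';
my @every = extract_multiple($all, \@extractors, undef, 1);
# @every holds '10', '20', '30', '40'

# Limit of two fields, unmatched text skipped.
my $some = '10, 20, 30, 40';
my @two  = extract_multiple($some, \@extractors, 2, 1);
# @two holds '10', '20'

# Limit of two fields, unmatched text returned: the separator ', '
# counts as a field, so it uses up the second slot.
my $kept  = '10, 20, 30, 40';
my @mixed = extract_multiple($kept, \@extractors, 2);
# @mixed holds '10', ', '
```

Note that when unmatched substrings are returned, they count towards the field limit just like extracted ones.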
The extraction process works by applying each extractor in
sequence to the text string.
If the extractor is a subroutine it is called in a list context and is
expected to return a list of a single element, namely the extracted
text. It may optionally also return two further arguments: a string
representing the text left after extraction (like $' for a pattern
match), and a string representing any prefix skipped before the
extraction (like $` in a pattern match). Note that this is designed
to facilitate the use of other Text::Balanced subroutines with
C<extract_multiple>. Note too that the value returned by an extractor
subroutine need not bear any relationship to the corresponding substring
of the original text (see examples below).
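A hand-written extractor subroutine might look like the sketch below. It is only an illustration: the name C<$upper_word> and the sample text are hypothetical, and the subroutine advances C<pos> itself by matching with C<\G> and the C</gc> modifiers on C<$_[0]>. The returned value is an upper-cased copy, so it differs from the substring actually consumed.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_multiple);

# Hypothetical extractor: consume a run of lowercase letters at the
# current pos of the string and return an upper-cased copy of it.
my $upper_word = sub {
    # An empty list signals failure; pos is left unchanged by /c.
    return unless $_[0] =~ /\G([a-z]+)/gc;
    return uc $1;
};

my $text   = 'abc 123 def';
my @fields = extract_multiple($text, [ $upper_word ], undef, 1);
# @fields holds 'ABC', 'DEF'
```

Because C<$_[0]> is an alias for the string being processed, the match inside the subroutine moves the position that C<extract_multiple> continues from.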
If the extractor is a precompiled regular expression or a string,
it is matched against the text in a scalar context with a leading
'\G' and the gc modifiers enabled. The extracted value is either
$1 if that variable is defined after the match, or else the
complete match (i.e. $&).
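The difference between the two cases can be seen in a small sketch with one capturing and one non-capturing pattern. The sample string is invented for illustration.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_multiple);

my $text   = 'key=value;flag';
my @fields = extract_multiple($text,
    [
        qr/(\w+)=\w+/,   # has a capture group: the field is $1
        qr/\w+/,         # no capture group: the field is the whole match
    ],
    undef, 1);
# @fields holds 'key', 'flag'
```

The first pattern consumes all of C<key=value> but contributes only the captured C<key>, so capturing parentheses are a way to drop part of a match from the returned field.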
If the extractor is a hash reference, it must contain exactly one element.
The value of that element is one of the
above extractor types (subroutine reference, regular expression, or string).
The key of that element is the name of a class into which the successful
return value of the extractor will be blessed.
If an extractor returns a defined value, that value is immediately
treated as the next extracted field and pushed onto the list of fields.
If the extractor was specified in a hash reference, the field is also
blessed into the appropriate class.
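One way to inspect blessed fields is sketched below. It assumes that a blessed field is a reference to a scalar holding the extracted text, a detail the description above does not spell out. The class names C<Quoted> and C<Braced> and the sample text are hypothetical.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_multiple extract_delimited extract_bracketed);

my $text   = q{say "hi" to {them}};
my @fields = extract_multiple($text,
    [
        { Quoted => sub { extract_delimited($_[0], q{"}, '') } },
        { Braced => sub { extract_bracketed($_[0], '{}', '') } },
    ]);

# Assumption: blessed fields are scalar references, so $$_ yields the text.
# Unblessed fields are plain strings.
my @described = map { ref($_) ? ref($_) . ':' . $$_ : 'plain:' . $_ } @fields;
# @described holds 'plain:say ', 'Quoted:"hi"', 'plain: to ', 'Braced:{them}'
```

Checking C<ref> on each field is enough to tell which extractor produced it, without re-parsing the text.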
If the extractor fails to match (in the case of a regex extractor), or
returns an empty list or an undefined value (in the case of a subroutine
extractor), it is assumed to have failed to extract.
If none of the extractor subroutines succeeds, then one
character is extracted from the start of the text and the extraction
subroutines reapplied. Characters which are thus removed are accumulated and
eventually become the next field (unless the fourth argument is true, in which
case they are discarded).
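The accumulation of unmatched characters can be sketched by running the same extractor with and without the fourth argument. The sample strings and variable names are made up for this example.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_multiple);

# Fourth argument omitted: unmatched runs become fields of their own.
my $keep = 'ab12cd';
my @with_unmatched = extract_multiple($keep, [ qr/\d+/ ]);
# @with_unmatched holds 'ab', '12', 'cd'

# Fourth argument true: unmatched characters are dropped.
my $skip = 'ab12cd';
my @digits_only = extract_multiple($skip, [ qr/\d+/ ], undef, 1);
# @digits_only holds '12'
```

The characters C<a> and C<b> are removed one at a time, but they are returned together as the single field C<ab>.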
For example, the following extracts substrings that are valid Perl variables:
    @fields = extract_multiple($text,
                               [ sub { extract_variable($_[0]) } ],
                               undef, 1);
This example separates a text into fields which are quote delimited,
curly bracketed, and anything else. The delimited and bracketed
parts are also blessed to identify them (the "anything else" is unblessed):
    @fields = extract_multiple($text,
                               [
                                   { Delim => sub { extract_delimited($_[0],q{'"}) } },
                                   { Brack => sub { extract_bracketed($_[0],'{}') } },
                               ]);
This call extracts the next single substring that is a valid Perl quotelike
operator (and removes it from $text):
    $quotelike = extract_multiple($text,
                                  [
                                      sub { extract_quotelike($_[0]) },
                                  ], undef, 1);
Finally, here is yet another way to do comma-separated value parsing:
    @fields = extract_multiple($csv_text,
                               [
                                   sub { extract_delimited($_[0],q{'"}) },
                                   qr/([^,]+)/,
                               ],
                               undef,1);
The list in the second argument means:
I<"Try and extract a ' or " delimited string, otherwise extract anything up to a comma...">.
The undef third argument means:
I<"...as many times as possible...">,
and the true value in the fourth argument means
I<"...discarding anything else that appears (i.e. the commas)">.
If you wanted the commas preserved as separate fields (i.e. like split
does if your split pattern has capturing parentheses), you would
just make the last parameter undefined (or remove it).
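A self-contained sketch of both variants follows. The sample line and variable names are invented, and the second extractor is written here as C<qr/([^,]+)/> so that each match stops at the next comma.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_multiple extract_delimited);

my @extractors = (
    sub { extract_delimited($_[0], q{'"}) },   # a quoted field
    qr/([^,]+)/,                               # anything up to a comma
);

# True fourth argument: the commas are discarded.
my $csv    = q{a,"b,c",d};
my @values = extract_multiple($csv, \@extractors, undef, 1);
# @values holds 'a', '"b,c"', 'd'

# Fourth argument omitted: the commas come back as separate fields.
my $again      = q{a,"b,c",d};
my @with_seps  = extract_multiple($again, \@extractors);
# @with_seps holds 'a', ',', '"b,c"', ',', 'd'
```

The quoted field keeps its delimiters, and the comma inside it is not treated as a separator because C<extract_delimited> is tried first.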
=head2 C<gen_delimited_pat>
The C<gen_delimited_pat> subroutine takes a single (string) argument and
builds a Friedl-style optimized regex that matches a string delimited
by any one of the characters in the single argument. For example:
    gen_delimited_pat(q{'"})
returns the regex:
    (?:\"(?:\\\"|(?!\").)*\"|\'(?:\\\'|(?!\').)*\')
Note that the specified delimiters are automatically quotemeta'd.
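The returned pattern is an ordinary string that can be interpolated into a match. The sketch below uses it to pull every quoted substring out of a sample text. The text and variable names are hypothetical, and the exact form of the generated pattern may differ between versions of the module.

```perl
use strict;
use warnings;
use Text::Balanced qw(gen_delimited_pat);

# Build a pattern matching a '...' or "..." delimited string.
my $pat = gen_delimited_pat(q{'"});

my $text   = q{say "it's" and 'ok "then"'};

# Wrap the pattern in capturing parentheses and collect every match.
my @quoted = $text =~ /($pat)/g;
# @quoted holds "it's" (with its double quotes) and 'ok "then"' (with its single quotes)
```

Each match runs from an opening delimiter to the next unescaped occurrence of the same delimiter, so the other kind of quote may appear freely inside it.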
A typical use of C<gen_delimited_pat> would be to build special purpose tags
for C<extract_tagged>. For example, to properly ignore "empty" XML elements
(which might contain quoted strings):
    my $empty_tag = '<(' . gen_delimited_pat(q{'"}) . '|.)+/>';