- Path: sparky!uunet!caen!sol.ctr.columbia.edu!news.cs.columbia.edu!mail
- Newsgroups: comp.lang.perl
- Date: Tue, 17 Nov 92 23:06:32 EST
- From: thayer@cs.columbia.edu (Charles Thayer)
- Message-ID: <9211180406.AA03308@cs.columbia.edu>
- Subject: Re: Partial RegExp's
- References: <1992Nov12.003038.18617@netlabs.com>
- <BxGFIy.LuA@news.cso.uiuc.edu>
- Lines: 105
-
- lwall writes:
- chappell@math.uiuc.edu (Glenn Chappell) writes:
- : Here's a question I've been pondering for a few weeks:
- :
- : Inherent in the design of Perl seems to be the idea that you always have
- : enough space in memory for the entirety of any file you'd ever want to
- : work on. But what if you don't? How does one do wonderful things like
- : "split" and various pattern matches & such on files that are just too big?
-
- Yes, I know what you mean. For me it is very tempting to read in a whole
- file, split on white space, then shift to get tokens, which is silly
- unless you need to match token sequences that are far apart, such as
- across line boundaries or deep in a parse.
-
- : Well, if there's already a standard, ok-I'll-spell-it-all-out-for-a-
- : neophyte-like-you-but-next-time-read-the-book-okay answer to this
- : question, I'd like to hear it. If, not, an idea:
-
- Not really, but there are the usual ways to break up input.
-
- : Of course, the way you deal with huge files is to read them in chunks.
- : The problem with that is that you miss a pattern match that starts on
- : one chunk and ends on another.
- :
- : Currently, the result of an attempt at a pattern match gives one of two
- : responses: "Got a match" or "Didn't get a match". What if there were a
- : way to tell the pattern matcher that some patterns may extend off the
- : end of the currently available text, and we gave the matcher the ability
- : to give two other responses: "Got a partial match, and if you give me
- : more data, I may get a match" and "Got a match, which may turn into a
- : bigger match if you give me more data". The matcher would also return
- : the place at which the partial match began.
-
- A better solution would probably be to write an FSM (finite-state
- machine) package or, more specifically, a little parser package. Which
- reminds me: someone (jeffrey@netcom.com) recently posted a package called
- marpa. Although I haven't used it, the author describes it as a
- "prototype of a hacker's parser for perl."
-
- : Now, from what I know of pattern matching, it seems to me that this would
- : be an easy modification to do. The question, then, is whether it would
- : be worthwhile. So, does anyone think so? Or is it just me?
-
- If that is your specific goal then you may want to do something
- simpler, such as breaking your expression into two (if you know where
- to expect boundaries). Let's say you'd like to replace "foo bar" with
- "moo man" even when the whitespace between the two words includes a
- line break:
-
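- #!/bin/perl
- $reg1 = 'foo';
- $reg2 = 'bar';
-
- &prog while (<>);
-
- sub prog {
-     # Replace every complete "foo<whitespace>bar" already in the buffer.
-     s/$reg1([ \t\n]+)$reg2/moo$1man/g;
-     # A "foo" left dangling at the end of the buffer may have its "bar"
-     # on the next line: pull that line in and try again. Stopping at
-     # end of input keeps the recursion from running forever.
-     if (/$reg1\s*$/ && !eof()) {
-         $_ .= <>;
-         &prog;
-     } else { print; }
- }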
-
- [damn, there I go recursing again. If it's not an eval it's recursion]
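-
- For fixed-size chunks rather than lines, the boundary problem can also be
- worked around without touching the matcher, provided you can assume an
- upper bound on how long a match can be. What follows is only a sketch
- under that assumption, written for a Perl 5 interpreter: the name
- find_in_chunks and its parameters are made up for illustration, matches
- are assumed to be non-empty and no longer than $max_len characters, and
- patterns using anchors or lookbehind would need more care. It keeps the
- unsafe tail of each buffer and rescans it once more data has arrived.

```perl
use strict;
use warnings;

# Hypothetical helper: return the absolute offsets of all matches of $re
# in the stream $fh, reading $chunk_size characters at a time.
# Assumption: no match is empty or longer than $max_len characters.
sub find_in_chunks {
    my ($fh, $re, $max_len, $chunk_size) = @_;
    my ($buf, $offset, @hits) = ('', 0);
    while (1) {
        my $n = read($fh, my $chunk, $chunk_size);
        die "read failed: $!" unless defined $n;
        $buf .= $chunk if $n;
        my $eof = ($n == 0);

        # A match starting at or after $safe could still be cut off by
        # the end of the buffer, so it is left for the next pass.
        my $safe = $eof ? length($buf) : length($buf) - $max_len + 1;
        $safe = 0 if $safe < 0;

        my $keep_from = $safe;
        pos($buf) = 0;
        while ($buf =~ /$re/g) {
            last if $-[0] >= $safe;
            push @hits, $offset + $-[0];
            # Never rescan text that a reported match already consumed.
            $keep_from = $+[0] if $+[0] > $keep_from;
        }
        last if $eof;

        # Drop the part that is fully processed; keep the tail.
        $offset += $keep_from;
        substr($buf, 0, $keep_from, '');
    }
    return @hits;
}
```

- Memory use stays bounded by roughly $chunk_size + $max_len characters,
- which is the point of the exercise. It does not give you the "partial
- match" answer asked for above; it just makes sure no complete match is
- lost at a chunk boundary.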
-
- lw> This is one of the reasons that Mark and I have been talking about
- lw> variants of *, + and {} that would match the shortest string
- lw> possible. Currently there are only two ways to do first-match
- lw> rather than last-match: 1) have the thing you're looking for be the
- lw> first thing in a pattern, or 2) carefully write the intermediate
- lw> stuff to exclude the pattern you're looking for. Both of these
- lw> approaches leave something to be desired.
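-
- To make the difference concrete, here is a small sketch of the greedy
- match next to approach 2 (excluding the terminator from the intermediate
- stuff). The sample string and sub names are invented for illustration.
- The third sub uses the `*?` spelling that Perl 5 later adopted for
- shortest-match quantifiers, so it needs a Perl 5 interpreter.

```perl
use strict;
use warnings;

# Greedy: .* grabs as much as it can, so the match runs to the LAST </b>.
sub greedy_span {
    my ($s) = @_;
    my ($m) = $s =~ m{<b>(.*)</b>};
    return $m;
}

# Approach 2: write the middle so that it cannot contain the terminator.
# Assumption for this sketch: no "<" appears inside the wanted text.
sub excluding_span {
    my ($s) = @_;
    my ($m) = $s =~ m{<b>([^<]*)</b>};
    return $m;
}

# Shortest-match quantifier as spelled in Perl 5 and later.
sub minimal_span {
    my ($s) = @_;
    my ($m) = $s =~ m{<b>(.*?)</b>};
    return $m;
}

my $text = 'a <b>one</b> and <b>two</b> z';
print greedy_span($text),    "\n";   # one</b> and <b>two
print excluding_span($text), "\n";   # one
print minimal_span($text),   "\n";   # one
```

- The excluding version only works because the terminator starts with a
- single character that can be banned outright; with a multi-character
- terminator that may partially appear in the text, writing the exclusion
- gets ugly fast, which is presumably the "something to be desired".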
-
- lw> A thing that might help with option 1 is if we attached the state
- lw> of m//g searches to the searched string rather than to the pattern
- lw> itself. This would let you switch from one pattern to another in
- lw> mid search. One could then do tokenizing without s///, for
- lw> instance. Of course, that doesn't help the
- lw> slurp-in-more-file-if-you-run-out problem.
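-
- Perl 5 did end up keeping the m//g position with the string: pos()
- reads or sets it, \G anchors a pattern at it, and the /c modifier keeps
- it in place after a failed match. A sketch of tokenizing that way,
- without s///, might look like the following. The token classes and the
- name tokenize are made up for the example, and it will not run on
- Perl 4.

```perl
use strict;
use warnings;

# Hypothetical tokenizer: switches between patterns in mid search, all of
# them anchored with \G at the current position of $src.
sub tokenize {
    my ($src) = @_;
    my @tokens;
    pos($src) = 0;
    while (pos($src) < length($src)) {
        if    ($src =~ /\G\s+/gc)            { next }   # skip whitespace
        elsif ($src =~ /\G(\d+)/gc)          { push @tokens, [NUM   => $1] }
        elsif ($src =~ /\G([A-Za-z_]\w*)/gc) { push @tokens, [IDENT => $1] }
        elsif ($src =~ /\G(.)/gcs)           { push @tokens, [OP    => $1] }
    }
    return @tokens;
}

# Prints: IDENT:x1 OP:= NUM:42 OP:+ IDENT:y
print join(' ', map { "$_->[0]:$_->[1]" } tokenize('x1 = 42+y')), "\n";
```

- Because the string is never modified, nothing is copied per token. As
- noted above, though, this does nothing for the case where the input runs
- out in the middle of a token.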
-
- One simple and useful change to perl might be to let one assign a
- regex to $/ for doing on-the-fly parsing. (To remain compatible with
- old scripts, this magic wouldn't turn on unless you did a 'study $/'.)
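-
- Until something like that exists, the effect can be approximated in
- user code. This is a rough sketch, not a proposal for how $/ itself
- should behave: make_record_reader, its arguments, and the default chunk
- size are all invented here, the separator is assumed never to match the
- empty string, and it uses Perl 5 features (closures, qr//, @- and @+).
- A separator is only accepted once more data follows it (or at end of
- input), so a separator split across two chunks is not cut short.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a closure that yields one record per call,
# with records delimited by the regex $sep_re, and undef at end of input.
sub make_record_reader {
    my ($fh, $sep_re, $chunk_size) = @_;
    $chunk_size ||= 4096;
    my ($buf, $eof) = ('', 0);
    return sub {
        while (1) {
            # Trust a separator only if data follows it or input is done.
            if ($buf =~ /$sep_re/ and ($eof or $+[0] < length($buf))) {
                my ($start, $end) = ($-[0], $+[0]);
                my $record = substr($buf, 0, $start);
                substr($buf, 0, $end, '');   # drop record and separator
                return $record;
            }
            if ($eof) {
                return unless length($buf);  # nothing left
                my $record = $buf;
                $buf = '';
                return $record;
            }
            my $n = read($fh, my $chunk, $chunk_size);
            die "read failed: $!" unless defined $n;
            if ($n) { $buf .= $chunk } else { $eof = 1 }
        }
    };
}
```

- Usage would be along the lines of
- `my $next = make_record_reader(\*STDIN, qr/\s*[,;]+\s*/);` followed by
- `while (defined(my $rec = $next->())) { ... }`. A record with no
- separator anywhere in it will still grow the buffer without limit, so
- this helps with parsing more than with the huge-file problem.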
-
- lw> I don't see any theoretical problems with regular expression
- lw> operators that backtrack by attempting to match more rather than
- lw> less. There's certainly no problem with oscillation, since any
- lw> given operator always progresses in the same direction.
-
- This is probably one of those cases where the two sets of operators can
- be proven equivalent in what they are able to match, so that providing
- the shortest-match operators alongside the existing ones is strictly
- redundant, but possibly convenient.
-
- lw> As for the feature in question, it's just one of those things where
- lw> you try to figure out the most general (and practical) way to do
- lw> it, then think about whether you want to do it at all, and then
- lw> maybe find time to do it. None of these is a trivial undertaking.
- lw> Language design is not a haphazard activity, appearances
- lw> notwithstanding. :-)
-
- /charles
- ==============================================================================
- Charles Thayer, Columbia University, Dept. CS, Tech-staff... my words only.