NetNews Usenet Archive 1992 #27

Internet Message Format | 1992-11-17 | 5.1 KB

Path: sparky!uunet!caen!sol.ctr.columbia.edu!news.cs.columbia.edu!mail
Newsgroups: comp.lang.perl
Date: Tue, 17 Nov 92 23:06:32 EST
From: thayer@cs.columbia.edu (Charles Thayer)
Message-ID: <9211180406.AA03308@cs.columbia.edu>
Subject: Re: Partial RegExp's
References: <1992Nov12.003038.18617@netlabs.com> <BxGFIy.LuA@news.cso.uiuc.edu>
Lines: 105

lwall writes:
chappell@math.uiuc.edu (Glenn Chappell) writes:
: Here's a question I've been pondering for a few weeks:
:
: Inherent in the design of Perl seems to be the idea that you always have
: enough space in memory for the entirety of any file you'd ever want to
: work on. But what if you don't? How does one do wonderful things like
: "split" and various pattern matches & such on files that are just too big?

Yes, I know what you mean. For me it is very tempting to read in a
whole file, split on white space, then shift to get tokens. Which is
silly unless you are looking to match disparate token sequences, such
as across line boundaries or deep in a parse.

: Well, if there's already a standard, ok-I'll-spell-it-all-out-for-a-
: neophyte-like-you-but-next-time-read-the-book-okay answer to this
: question, I'd like to hear it. If not, an idea:

Not really, but there are the usual ways to break up input.

: Of course, the way you deal with huge files is to read them in chunks.
: The problem with that is that you miss a pattern match that starts on
: one chunk and ends on another.
:
: Currently, the result of an attempt at a pattern match gives one of two
: responses: "Got a match" or "Didn't get a match". What if there were a
: way to tell the pattern matcher that some patterns may extend off the
: end of the currently available text, and we gave the matcher the ability
: to give two other responses: "Got a partial match, and if you give me
: more data, I may get a match" and "Got a match, which may turn into a
: bigger match if you give me more data".
: The matcher would also return the place at which the partial match began.

A better solution would probably be to write an FSM (finite-state
machine) package or, more specifically, a little parser package. Which
reminds me, someone (jeffrey@netcom.com) did recently post a package
called marpa. Although I haven't used it, the author describes it as a
"prototype of a hacker's parser for perl."

: Now, from what I know of pattern matching, it seems to me that this would
: be an easy modification to do. The question, then, is whether it would
: be worthwhile. So, does anyone think so? Or is it just me?

If that is your specific goal then you may want to do something
simpler, such as breaking your expression into two (if you know where
you can expect boundaries). Let's say you'd like to replace "foo bar"
with "moo man" across whitespace boundaries:

    #!/bin/perl

    $reg1 = 'foo';
    $reg2 = 'bar';

    &prog while (<>);

    sub prog {
        # Rewrite every complete match in the text read so far.
        s/$reg1([ \t\n]+)$reg2/moo${1}man/g;
        # A match can only straddle the line break if $reg1 (plus
        # optional whitespace) ends the buffer, so pull in another line
        # only then, and stop at end of input instead of recursing forever.
        if (/$reg1[ \t\n]*$/ && !eof()) {
            $_ .= <>;
            &prog;
        } else {
            print;
        }
    }

[damn, there I go recursing again. If it's not an eval it's recursion]

lw> This is one of the reasons that Mark and I have been talking about
lw> variants of *, + and {} that would match the shortest string
lw> possible. Currently there are only two ways to do first-match
lw> rather than last-match: 1) have the thing you're looking for be the
lw> first thing in a pattern, or 2) carefully write the intermediate
lw> stuff to exclude the pattern you're looking for. Both of these
lw> approaches leave something to be desired.

lw> A thing that might help with option 1 is if we attached the state
lw> of m//g searches to the searched string rather than to the pattern
lw> itself. This would let you switch from one pattern to another in
lw> mid search. One could then do tokenizing without s///, for
lw> instance. Of course, that doesn't help the
lw> slurp-in-more-file-if-you-run-out problem.
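
For the slurp-in-more-file-if-you-run-out problem, one workaround that
needs no change to the matcher is to read fixed-size chunks and carry
the tail of each chunk, when it could still grow into a match, over to
the next one. What follows is only a rough sketch under stated
assumptions: it hard-codes the "foo<whitespace>bar" pattern from the
example above, it uses Perl 5 syntax, and the name replace_chunked and
its arguments are made up for illustration. The "could this tail still
match?" pattern is written by hand here; nothing derives it from the
main pattern automatically.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): copy $in to $out, rewriting
# "foo<whitespace>bar" as "moo<whitespace>man" while keeping roughly one
# chunk in memory at a time.
sub replace_chunked {
    my ($in, $out, $chunk_size) = @_;
    my $buf = '';
    while (read($in, my $chunk, $chunk_size)) {
        $buf .= $chunk;
        # Rewrite every complete match currently in the buffer.
        $buf =~ s/foo(\s+)bar/moo${1}man/g;
        # Hold back the tail that could still become a match once more
        # data arrives: "f", "fo", "foo", "foo<ws>", "foo<ws>b", "foo<ws>ba".
        my $tail = '';
        if ($buf =~ /(fo?|foo(?:\s+(?:ba?)?)?)\z/) {
            $tail = $1;
            substr($buf, -length($tail), length($tail), '');
        }
        print {$out} $buf;
        $buf = $tail;
    }
    # At end of input the held-back tail can no longer complete a match.
    print {$out} $buf;
}
```

Calling replace_chunked(\*STDIN, \*STDOUT, 65536) would filter standard
input this way. Note that the held-back tail is unbounded when "foo" is
followed by a very long run of whitespace, so this bounds memory only
for reasonable input.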

One thing that might be a simple/useful change to perl would be to let
one assign a regex to $/ for doing on-the-fly parsing. (To remain
compatible with old scripts, this magic wouldn't turn on unless you did
a 'study $/'.)

lw> I don't see any theoretical problems with regular expression
lw> operators that backtrack by attempting to match more rather than
lw> less. There's certainly no problem with oscillation, since any
lw> given operator always progresses in the same direction.

This is probably one of those things that can be proven to be an
equivalent set of operations, where providing some operators of the
other set is extraneous, but possibly convenient.

lw> As for the feature in question, it's just one of those things where
lw> you try to figure out the most general (and practical) way to do
lw> it, then think about whether you want to do it at all, and then
lw> maybe find time to do it. None of these is a trivial undertaking.
lw> Language design is not a haphazard activity, appearances
lw> notwithstanding. :-)

/charles
==============================================================================
Charles Thayer, Columbia University, Dept. CS, Tech-staff... my words only.