Chip 2001 May

Text File | 1999-11-17 | 4KB | 78 lines

Three areas must be addressed to provide full Kanji compatibility.
Only #1 (for the non-regular expression case) has been implemented
directly in our grep/egrep-compatible Boyer-Moore-based code.

(1) false middle match

    (a) meta-free Kanji
    (b) Kanji regexprs

    Kanji 16-bit "EUC" data codes (see Jung/Kalash, "Yunikkusu wa
    Nihongo o Hanasemasu", p. 209, Atlanta Usenix, 1986) have the
    upper bit on in both bytes, so as to allow intermixing of ASCII
    while preserving end-of-string detection. 'grep' must beware of
    matching two Kanji byte pairs in the interior of two unrelated
    Kanji characters. e.g.

        text:     a (k1 k2) b (k3 k4) (k5 k6)
        pattern:  (k4 k5)

    is a bad match, given ASCII bytes 'a' and 'b', and Kanji
    characters (k1 k2), (k3 k4), and (k5 k6).

    The solution for Kanji grep using the traditional algorithm might
    be to anchor the pattern only at Kanji pair boundaries while
    scanning forward. Boyer-Moore methods cannot afford this. So we
    allow false matches, then scan backwards for legality (the first
    ASCII byte in the text occurring before the candidate match
    disambiguates).

    Another appealing method, for "layered" processing via regexp(3),
    is to convert the meta-free Kanji to '(^|[^\000-\177])k1k2',
    assuming Henry Spencer's code is "8-bit clean".

    Case (b) (e.g. regexprs like 'k1k2.*k3k4') is similar, though
    syntax translation may be more difficult.

(2) closures

    Eight-bit egrep, given a Kanji closure such as 'k1k2*' [where the
    '*' may be '+' or '?'], would wrongly apply the closure to the
    previous byte instead of the byte pair. One solution (without
    touching the existing 'regexp(3)' or 'e?grep' source) is to simply
    parenthesize reg exprs 'k1k2*' -> '(k1k2)*'. [only works with
    egrep syntax, so should occur after the grep->egrep expr xlation].

(3) character classes

    (a) easy case: [k1k2k3k4k5k6] -- just map to (k1k2|k3k4|k5k6).

    (b) hard: ranges [k1k2-k3k4] fail for byte-oriented char class
        code. Kanji interpretation (how do ideograms collate?) is
        also problematic. Translation to egrep
        '.*((k1k2)|(k1k2++)...|(k3k4)).*', where '++' denotes "16-bit
        successor", is conceivable, but farfetched.

Now, translations (1) and (2) may be done [messily] w/o touching
Spencer's code, while (3) could be farmed out to standard Kanji egrep
via the process exec mechanism already established (see
pep4grep.doc[123]). But if (3) were done this way (invoking exec()),
then the other cases might also be done without recourse to the above
xlations [just match "regmust" first, then pass false drops to the
Japan Unix std.] However, r.e.'s handled in such a manner would make
hybrid Boyer-Moore slow for small files, except for systems running
MACH.

We could have ad hoc file size vs. exec() tradeoff detectors control
things for Kanji (it's already done for Anglo exprs), but previous
success has hinged upon having the regexp(3) layer compatible with the
r.e. style of the coarser egrep utility. Thus we take the easy way out
and make fast grep only apply to simple non-r.e. Kanji.

The very best approach remains modification of proprietary Kanji egrep
to incorporate Boyer-Moore directly, by doing Boyer-Moore on the
buffers first before rescanning with the Kanji r.e. machine. Someday.

        -- James A. Woods (ames!jaw)

Postscript: The several articles in the special issue of UNIX Review
(March 1987) have delineated the bewildering variety of codesets
(shifted JIS, HP 15/16, many EUC flavors, etc.). A late addition to
[ef]?grep Kanji support is capability for intermixed Katakana (SS2).
Full testing on real Kanji files has not been done. Comments are
welcome.
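
Illustration: the backward legality scan described under (1) can be
sketched roughly as below. This is not the code shipped with this
grep. It is written in Python rather than C, uses bytes.find as a
stand-in for the Boyer-Moore search, and assumes that every non-ASCII
character is exactly two bytes with the upper bit on, so it does not
cover shifted JIS or sequences of any other length. The names
is_kanji_byte, on_pair_boundary and find_kanji are hypothetical.

```python
def is_kanji_byte(b: int) -> bool:
    """Assumption: a byte with the upper bit on belongs to a Kanji pair."""
    return b >= 0x80


def on_pair_boundary(text: bytes, pos: int) -> bool:
    """True if pos starts a character rather than splitting a Kanji pair.

    Walk backwards from pos to the nearest ASCII byte (or the start of
    the buffer), counting upper-bit bytes. An even count means the run
    before pos consists of whole pairs, so pos is legally aligned.
    """
    run = 0
    i = pos - 1
    while i >= 0 and is_kanji_byte(text[i]):
        run += 1
        i -= 1
    return run % 2 == 0


def find_kanji(text: bytes, pattern: bytes) -> int:
    """Offset of the first legal match of a meta-free pattern, or -1."""
    start = 0
    while True:
        # Fast search first (here bytes.find, standing in for Boyer-Moore);
        # false middle matches are allowed at this stage.
        hit = text.find(pattern, start)
        if hit < 0:
            return -1
        # Then scan backwards to check the candidate for legality.
        if on_pair_boundary(text, hit):
            return hit
        start = hit + 1
```

With the text from the example in (1) encoded as
b"a\xb1\xb2b\xb3\xb4\xb5\xb6" (k1..k6 taken as 0xB1..0xB6 purely for
illustration), the raw search finds the pattern (k4 k5) at offset 5,
but the backward scan counts one upper-bit byte before it and rejects
the match, so find_kanji returns -1. The pattern (k3 k4) is accepted
at offset 4.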