- Newsgroups: comp.sources.misc
- From: jkl@osc.edu (Jan Labanowski)
- Subject: v36i023: translit - transliterate foreign alphabets, Part01/10
- Message-ID: <csm-v36i023=translit.163954@sparky.IMD.Sterling.COM>
- X-Md4-Signature: 1fdf62718ac15c13f16020f8f731cbf8
- Date: Fri, 19 Mar 1993 22:40:58 GMT
- Approved: kent@sparky.imd.sterling.com
-
- Submitted-by: jkl@osc.edu (Jan Labanowski)
- Posting-number: Volume 36, Issue 23
- Archive-name: translit/part01
- Environment: UNIX, MS-DOS, VMS
-
- Available-from: kekule.osc.edu (128.146.36.48) in /pub/russian/translit
- Copyright-note: Yes, you have to distribute the complete package.
-
- Translit is a general transliteration program. It transliterates
- between different alphabet representations of different languages.
-
- It is frequently necessary to convert from one representation to another
- representation of the foreign alphabet. E.g., in the Library of Congress
- transliteration, the Russian letter sha is transliterated as two Latin
- letters "sh" while the popular word processors use a code 232 (decimal),
- the RELCOM network uses a code 219, and the KOI7 set uses character "["
- for the same letter. So if your screen driver, printer, word processor,
- etc. uses different codes than the text file which you have, you need to
- transliterate.
-
- The TRANSLIT program is a powerful tool for such tasks. It converts an input
- file in one representation to an output file in another representation using
- an appropriate, user-defined transliteration table. The transliteration table
- allows for very elaborate transliteration tasks and includes provisions for plain
- character sequences, character lists, regular expressions (flexible matches),
- SHIFT-OUT/IN sequences and more. The program comes with documentation and
- examples of popular transliteration schemes. The Russian language serves
- as an example. Other files will be added with your collaboration.
-
- The most current version of translit will be available from ftp kekule.osc.edu
- (or ftp 128.146.36.48) in the directory /pub/russian/translit
-
- Via E-mail, first retrieve the file readme.doc. It describes the files in
- the program distribution and has detailed instructions on how to obtain the
- program. Send the message:
-
- send translit/readme.doc from russian
-
- to OSCPOST@osc.edu or OSCPOST@OHSTPY.BITNET. The file readme.doc will be
- forwarded to your mailbox.
-
- Enjoy,
-
- Author coordinates:
- Jan Labanowski
- P.O. Box 21821
- Columbus, OH 43221-0821, USA
- jkl@osc.edu, JKL@OHSTPY.BITNET
- -------
- #! /bin/sh
- # This is a shell archive. Remove anything before this line, then feed it
- # into a shell via "sh file" or similar. To overwrite existing files,
- # type "sh file -c".
- # Contents: translit.1
- # Wrapped by kent@sparky on Fri Mar 19 16:00:08 1993
- PATH=/bin:/usr/bin:/usr/ucb:/usr/local/bin:/usr/lbin ; export PATH
- echo If this archive is complete, you will see the following message:
- echo ' "shar: End of archive 1 (of 10)."'
- if test -f 'translit.1' -a "${1}" != "-c" ; then
- echo shar: Will not clobber existing file \"'translit.1'\"
- else
- echo shar: Extracting \"'translit.1'\" \(56776 characters\)
- sed "s/^X//" >'translit.1' <<'END_OF_FILE'
- X.TH TRANSLIT JKL "23-Jan-1993" JKL "Version 1.0"
- X.DA 20 Jan 1993
- X.SH NAME
- X.IP \fITRANSLIT\fR
- XProgram to transliterate texts in different character sets. The program
- Xconverts input character codes (or sequences of codes) to a different set
- Xof output character codes (or sequences of codes). Intended for
- Xtransliteration to/from phonetic representation of foreign letters with
- XLatin letters from/to special national codes used for these letters.
- XIt supports simple matches, character lists and flexible matches via
- Xregular expressions. New transliteration schemes are easily added
- Xby creating simple transliteration tables. Multiple character sets
- Xare supported for input and output. It does not yet support UNICODE,
- Xbut some day it will.
- X
- X.SH COPYRIGHT
- XCopyright (c) 1993 Jan Labanowski and JKL Enterprises, Inc.
- X.br
- XYou may distribute the Software only as a complete set of files.
- XYou may distribute the modified Software only if you retain the
- XCopyright notice and you do not delete original code, data, documentation
- Xand associated files.
- XThe Software is copyrighted. You may not sell the software or incorporate
- Xit in the commercial product without written permission from
- XJan Labanowski or JKL Enterprises, Inc. You are allowed to charge for media
- Xand copying if you distribute the whole unaltered package.
- X
- X.SH SYNOPSIS
- X.B translit
- X[
- X.B -i
- X.I inpfile
- X][
- X.B -o
- X.I outfile
- X][
- X.B -d
- X][
- X.B -t
- X.I transtbl \|\||\|\| transtbl
- X]
- X.br
- X
- X.SH OPTIONS
- X.IP "\fB-i\fP \fIinpfile\fP"
- X.I inpfile
- Xis the name of the input file to be transliterated.
- XIf "\fB-i\fP" is not specified, the input is taken from
- Xstandard input.
- X.IP "\fB-o\fP \fIoutfile\fP"
- X.I outfile
- Xis the output file, where the transliterated
- Xtext is stored. If "\fB-o\fP" is not specified, the output is
- Xdirected to the standard output. The program will not overwrite an existing
- Xfile. If the file exists, you need to delete it first.
- X.IP "\fB-d\fP"
- XSome information on character codes read from the transliteration table file
- Xis sent to standard error ("\fIstderr\fP"). Useful when developing
- Xnew transliteration tables.
- X.IP "\fB-t\fP \fItranstbl\fP"
- X.I transtbl
- Xis a transliteration table file which you want to use. The "\fB-t\fP"
- Xoption may be omitted if the \fItranstbl\fR
- Xis specified as the last parameter on the
- Xcommand line. The program first tries to locate \fItranstbl\fR
- Xfile in the current directory, and if not found, it
- Xsearches the directory chosen at compilation/installation time in
- X"\fIpaths.h\fP". If no "\fItranstbl\fP" is given, the default file name
- Xspecified in "\fIpaths.h\fP" is taken. The compile/installation
- Xtime defaults in
- X"\fIpaths.h\fR" for the search directory and the default
- Xfile name can be overridden
- Xby setting environment variables: TRANSP and TRANSF, respectively (see below).
- X
- X.SH ENVIRONMENT VARIABLES
- XThe default path to the directory holding transliteration tables can
- Xbe overridden by setting the environment variable TRANSP. The default name
- Xfor the transliteration table can be overridden by setting the TRANSF environment
- Xvariable. However, when the transliteration file is given on the command line,
- Xit will override the defaults and environment settings.
- XHere are some examples of setting environment
- Xvariables for different operating systems:
- X.sp
- X.in +2m
- X.br
- X\fIUN*X System\fR
- X.br
- X.nf
- X If you are using \fIcsh\fR (C-shell):
- X setenv TRANSP /home/john/translit/
- X setenv TRANSF koi8-tex.rus
- X If you are using \fIsh\fR (Bourne Shell):
- X TRANSP=/home/john/translit/
- X export TRANSP
- X TRANSF=koi8-tex.rus
- X export TRANSF
- X\fIVAX-VMS System\fR
- X TRANSP:==SYS$USER:[JOHN.TRANSLIT]
- X TRANSF:==KOI8-TEX.TBL
- X\fIPC-DOS or MS-DOS\fR
- X SET TRANSP=C:\|\\\|JOHN\|\\\|TRANSLIT\|\\
- X SET TRANSF=KOI8-TEX.TBL
- X.fi
- X.in -2m
- XNote that the directory path has to include concluding
- Xslashes, \|\\\| or \|/\|\|.
- X
- X
- X.SH EXAMPLES
- X.ta 5m
- X.br
- X cat text.koi8 \|\||\|\| translit koi8-tex.rus > text.tex
- X.br
- Xin UN*X is equivalent to:
- X.sp 1
- X translit -t koi8-tex.rus -o text.tex -i text.koi8
- X.br
- Xand converts file text.koi8 to file text.tex using transliteration
- Xspecified in the file koi8-tex.rus.
- X.sp 1
- X translit -i text.koi8 koi8-lc.rus
- X.br
- Xdisplays the converted text from file text.koi8 on your terminal. The
- Xconversion table is koi8-lc.rus (KOI8 --> Library of Congress).
- X.sp 1
- X translit -i text.alt -t alt-koi8.rus \|\||\|\| translit -o text.tex -t koi8-tex.rus
- X.br
- Xis essentially equivalent to the following two commands in UN*X or MS-DOS:
- X.br
- X translit -i text.alt -o junkfile -t alt-koi8.rus
- X.br
- X translit -i junkfile -o text.tex -t koi8-tex.rus
- X.br
- Xand converts the file in ALT character set to a LaTeX file for printing.
- X.sp
- X translit -i russ.txt pho-koi8.rus \|\||\|\| translit -o russ.tex koi8-tex.rus
- X.br
- Xconverts file russ.txt from phonetic transliteration to LaTeX file russ.tex
- Xfor printing.
- X.sp 2
- X
- X.SH TRANSLITERATION TABLES
- XThe following transliteration files are available with the current
- Xdistribution. Consult the comments in the individual files for details.
- X.IP \fIkoi8-tex.rus\fP
- XConversion table which changes the file in KOI8 (8 bit character set
- Xused by RELCOM news service) to a LaTeX file for printing with
- X\fIAMS\fR WNCYR fonts.
- X.IP \fItex-koi8.rus\fP
- XConversion table for the LaTeX to KOI8 conversion. Note that it will not
- Xhandle complicated cases, since LaTeX is a program, and only TeX can
- Xconvert a LaTeX source to the characters. However, it should work OK
- Xfor simple cases of text only files, and may need some editing for
- Xcomplicated cases.
- X.IP \fIalt-gos.rus\fP
- XThis is a transliteration data file for converting from ALT (Bryabrin's
- Xalternativnyj variant used in many popular word processors)
- Xto GOSTSCII 84 (approx. ISO-8859-5?).
- X.IP \fIalt-koi8.rus\fP
- XThis is a transliteration data file for converting from ALT to KOI8.
- XKOI8 is meant to be GOST 19768-74 (as used by RELCOM).
- X.IP \fIgos-alt.rus\fP
- XThis is a transliteration data file for converting from GOSTSCII 84
- X(approx. ISO-8859-5?) to ALT (Bryabrin's alternativnyj variant).
- X.IP \fIgos-koi8.rus\fP
- XThis is a transliteration data file for converting from GOSTSCII 84
- X(approx. ISO-8859-5?) to KOI8 as used by RELCOM.
- XKOI8 is meant to be GOST 19768-74.
- X.IP \fIkoi8-alt.rus\fP
- XThis is a transliteration data file for converting from KOI8
- X(meant to be GOST 19768-74) to ALT (Bryabrin's alternativnyj variant).
- X.IP \fIkoi8-gos.rus\fP
- XThis is a transliteration data file for converting from KOI8 (RELCOM,
- Xmeant to be GOST 19768-74) to GOSTSCII 84 (approx. ISO-8859-5?).
- X.IP \fIkoi8-7.rus\fP
- XThis file converts from KOI8 to KOI7.
- X.IP \fIkoi7-8.rus\fP
- XThis file converts from KOI7 to KOI8. Before you attempt the conversion,
- Xyou might need to perform a simple edit on your file. You MUST read the
- Xcomments in \fIkoi7-8.rus\fR before you attempt this conversion.
- X.IP \fIkoi7nl-8.rus\fP
- XThis file assumes that there are only Russian letters (no Latin)
- Xin the input file. If you have Latin letters, and you inserted SHIFT-OUT/IN
- Xcharacters, use file \fIkoi7-8.rus\fP.
- X.IP \fIkoi8-lc.rus\fP
- XThis file converts KOI8 to the Library of Congress transliteration.
- XSome extensions are added.
- X.IP \fIkoi8-php.rus\fP
- XThis file converts KOI8 to the Pokrovsky transliteration.
- X.IP \fIphp-koi8.rus\fP
- XThis file converts from Pokrovsky transliteration to KOI8.
- X.IP \fIkoi8-phg.rus\fP
- XThis file converts from KOI8 to GOST transliteration.
- X.IP \fIphg-koi8.rus\fP
- XThis file converts from GOST transliteration to KOI8.
- X.IP \fIpho-koi8.rus\fP
- XThis is a table which will convert from many "phonetic" transliteration
- Xschemes to KOI8. It is elaborate and it takes a lot of time to
- Xtransliterate the file using this table. Some transliterations are
- Xhopeless and internally inconsistent (as humans...), so the results
- Xcannot be bug free.
- XYou might want to modify the file, if your transliteration
- Xpatterns are different than those assumed in this file. You may also want
- Xto simplify this file if the phonetic transliteration you are converting
- Xis a sound one (most are not, e.g., they use e for je and e oborotnoye,
- Xts for c and t-s, h for kha, i for i-kratkoe, etc.).
- X.sp
- X
- X.SH INTRODUCTION
- XIf you do not intend to write your own transliteration tables, you may
- Xskip this description and go directly to the installation and
- Xcopyright sections. However, you might want to read this material anyhow,
- Xto better understand the traps and complexities of transliteration.
- XIt is frequently necessary to transliterate text, i.e., to change one set
- Xof characters (or composite characters, phonemes, etc.) to another set.
- X.PP
- XOn computers, the transliteration operation consists of converting the input
- Xfile in some character set to the output file in another character set.
- X.PP
- XIn the simplest case, single characters are transliterated, i.e., their
- Xcodes are changed according to some transliteration table. This is called
- Xremapping and, assuming a one-to-one mapping, the task can be accomplished
- Xby a simple pseudo program:
- X.br
- X new_char_code = character_map[old_char_code];
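As a rough illustration of the remapping above, here is a minimal Python sketch. Python is not part of translit, and the helper names `build_map` and `remap` are made up for this example. The single table entry uses the codes quoted in this document for the letter sha (219 in KOI8, 232 in the ALT set used by popular word processors). Codes without an entry are kept unchanged, which is option 2 of the possibilities discussed in this section.

```python
def build_map(pairs):
    """Return a 256-entry table; codes without an entry map to themselves."""
    table = list(range(256))
    for old, new in pairs.items():
        table[old] = new
    return bytes(table)

# Illustrative entry only: sha is 219 in KOI8 and 232 in the ALT set.
character_map = build_map({219: 232})

def remap(data: bytes) -> bytes:
    # new_char_code = character_map[old_char_code] for every input byte
    return data.translate(character_map)

print(list(remap(bytes([65, 219]))))
```

A full table would simply list all 256 code pairs; the lookup itself stays a single indexing operation per character.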
- X.PP
- XIf the one-to-one correspondence does not exist (i.e., some codes may
- Xbe present in one set, but do not have corresponding codes in another set),
- Xprecise transliteration is not possible. In such cases there are 3 obvious
- Xpossibilities:
- X.br
- X 1. skip characters which do not have counterparts,
- X.br
- X 2. retain unchanged codes of these characters,
- X.br
- X 3. convert the codes to multicharacter sequences.
- X.br
- XIn some cases, the file can contain more than one character set, e.g.,
- Xthe file can contain Latin characters (e.g. English text) and Cyrillic
- Xcharacters (e.g. Russian text). If the character codes assigned to
- Xcharacters in different sets do not overlap, this is still a simple mapping
- Xproblem. This is the case with the KOI8 or GOSTSCII character tables for Russian,
- Xwhich reserve codes 0-127 for standard ASCII codes (which include
- Xall Latin characters) and codes above 127 for Cyrillic letters.
- X.PP
- XIf character codes overlap, there is a SHIFT-OUT/SHIFT-IN technique in
- Xwhich the meaning of a series of characters is determined by the
- XSHIFT-OUT character (or sequence of character codes) which precedes them.
- XThe SHIFT-IN character (or sequence) following the
- Xseries of characters returns the "reader" to the default or previous status.
- XTwo schemes are used:
- X.br
- X (char_set_1)(SHIFT-IN[1])(SHIFT-OUT[2])(char_set_2)...
- X.br
- Xor
- X.br
- X (char_set_1)(SHIFT-OUT[2])(char_set_2)(SHIFT-OUT[1])char_set_1...
- X.br
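The SHIFT-OUT/SHIFT-IN idea can be sketched for the simplest two-set case. The Python fragment below is an illustration only, not translit code. It assumes the single-character KOI7 convention mentioned later in this manual (CTRL-N, code 0x0E, as SHIFT-OUT and CTRL-O, code 0x0F, as SHIFT-IN); the function name `split_by_set` is hypothetical. It splits the input into runs, each tagged with the number of the character set in force.

```python
SHIFT_OUT = 0x0E  # CTRL-N: switch to the second (e.g. Cyrillic) set
SHIFT_IN = 0x0F   # CTRL-O: return to the default (e.g. Latin) set

def split_by_set(data: bytes):
    """Return a list of (set_number, bytes) runs; set 0 is the default."""
    runs, current, start = [], 0, 0
    for i, code in enumerate(data):
        if code in (SHIFT_OUT, SHIFT_IN):
            if i > start:
                runs.append((current, data[start:i]))
            # The shift code itself is dropped from the output runs.
            current = 1 if code == SHIFT_OUT else 0
            start = i + 1
    if start < len(data):
        runs.append((current, data[start:]))
    return runs

print(split_by_set(b"ab\x0e[A\x0fcd"))
```

Each run could then be remapped with the table that belongs to its set number. Multi-character shift sequences and nesting are not modelled here.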
- X.sp 1
- XSince computer keyboards, screens, printers, software, etc., are by necessity
- Xlanguage specific (the most popular being ASCII), there is a problem of typing
- Xforeign language text which contains letters different than standard Latin
- Xalphabet. For this reason, many transliteration schemes use several Latin
- Xletters to represent a single letter of foreign alphabet, for example:
- X.br
- Xzh is used to represent the Cyrillic letter zhe, \|\\\|"o may be used to
- Xrepresent the o umlaut, etc.
- X
- XIf there is a one-to-one mapping of such sequences to another alphabet, it
- Xis also easy to process. However, it is necessary to substitute the longest
- Xsequences first. For example, a frequently used transliteration
- Xfor Cyrillic letters:
- X.br
- X.ta 2mL 7mL 11mL 24mL
- X \fIshch\fR --- letter \fBshcha\fR 221 (decimal KOI8 code)
- X.br
- X \fIsh\fR --- letter \fBsha\fR 219
- X.br
- X \fIch\fR --- letter \fBche\fR 222
- X.br
- X \fIc\fR --- letter \fBtse\fR 195
- X.br
- X \fIh\fR --- letter \fBkha\fR 200
- X.br
- X \fIa\fR --- letter \fBa\fR 193
- X.PP
- XObviously, in this case, we should first convert all \fIshch\fR
- Xsequences to the letter \fBshcha\fR, then the two-character sequences \fIsh\fR
- Xand \fIch\fR, and then the single
- Xcharacters \fIc\fR and \fIh\fR.
- XGenerally, for one-to-one transliteration, the longest
- Xsequences should be processed first, and the order of conversion within
- Xsequences of the same length makes no difference.
- XFor example, converting the word "shchah" to KOI8 should proceed in the following
- Xway:
- X.br
- X \fIshchah\fR --> (221)\fIah\fR, (221)\fIah\fR --> (221)(193)\fIh\fR, (221)(193)\fIh\fR --> (221)(193)(200)
- X.br
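The "longest sequence first" rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the program's actual algorithm: the names `RULES` and `transliterate` are invented here, the table holds only the six entries listed above, and the sketch uses a single left-to-right scan that tries the longest sequence at each position instead of separate passes. For the "shchah" example both approaches give the same result.

```python
# Illustrative subset of the table above: Latin sequence -> decimal KOI8 code.
RULES = {"shch": 221, "sh": 219, "ch": 222, "c": 195, "h": 200, "a": 193}

def transliterate(text: str) -> list:
    """Greedy longest-match conversion; unmatched characters keep their code."""
    # Try longer sequences first so that "shch" wins over "sh" and "ch".
    ordered = sorted(RULES, key=len, reverse=True)
    out, i = [], 0
    while i < len(text):
        for seq in ordered:
            if text.startswith(seq, i):
                out.append(RULES[seq])
                i += len(seq)
                break
        else:
            # No rule matched at this position: pass the character through.
            out.append(ord(text[i]))
            i += 1
    return out

print(transliterate("shchah"))
```

If the shorter sequences were tried first, "shchah" would come out as sha, che, a, kha, which is exactly the mistake the ordering rule is meant to prevent.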
- XThere is a multitude of reasons why transliteration is done. I wrote this
- Xprogram having in mind the following ones:
- X.br
- X 1) to print cyrillic text using TeX/LaTeX and cyrillic fonts
- X.br
- X 2) to read KOI8 encoded messages from Russia on my ASCII terminal.
- X.br
- XHowever, I tried to make it flexible enough to accommodate other uses.
- X
- X.SH PROGRAM OPERATION
- XThe program converts the input file to an output file using
- Xtransliteration rules from the transliteration rule file which
- Xyou specify with option \fB-t\fR.
- XSome examples of transliteration rule files are enclosed.
- XBefore the program can be used, the transliteration rules need to be specified.
- X.PP
- XThese are given as a file which consists of the following parts,
- Xdescribed below:
- X.br
- X.in +2m
- X.in +5m
- X.ti -5m
- X1) File format number (it is 1 at this moment)
- X.ti -5m
- X2) Delimiters used to enclose a) simple strings, b) character lists,
- Xc) regular expressions
- X.ti -5m
- X3) Starting sequence for output
- X.ti -5m
- X4) Ending sequence for output
- X.ti -5m
- X5) Number of input "character sets"
- X.ti -5m
- X6) SHIFT-OUT/SHIFT-IN sequences for each input character set
- X.ti -5m
- X7) Number of output "character sets"
- X.ti -5m
- X8) SHIFT-OUT/SHIFT-IN sequences for each output character set
- X.ti -5m
- X9) Transliteration table
- X.in -5m
- X.in -2m
- X.PP
- X\fIGENERAL COMMENTS\fR
- X.br
- XThe transliteration rules file consists of comments and data.
- XThe comments may be included in the file as:
- X.in +5m
- X.ti -2m
- Xa) line comments --- lines starting with ! or # character (# or ! must be
- Xin the first column of a line) are treated as comments and are not
- Xread in by the program.
- X.ti -2m
- Xb) comments following all required entries on the line. They must be
- Xseparated by at least one space from the last data entry on the line
- Xand need not start with any particular character. These comments cannot
- Xbe used within multiline sequences.
- X.br
- X.in -5m
- X.PP
- XThe data entries consist of integer numbers and strings.
- XThe strings may represent:
- X.br
- X a) plain strings
- X.br
- X b) character lists
- X.br
- X c) regular expressions
- X.br
- X.PP
- XAll strings which appear in the file, are processed through the
- X"string processor", which allows entering unprintable characters as codes.
- XThe character code is specified as a backslash "\|\\\|" followed by at least
- X2 digit(s) (i.e., \|\\\|01 produces code=1, but \|\|\\\|1 is passed unchanged). The
- Xfollowing formats are supported:
- X.br
- X \|\\\|0123 character of octal code 123 (when leading zero present)
- X.br
- X \|\\\|123 character of decimal code 123 (when leading digit is not zero)
- X.br
- X \|\\\|0o123 or \|\\\|0O123 character of octal code 123
- X.br
- X \|\\\|0d123 or \|\\\|0D123 character of decimal code 123
- X.br
- X \|\\\|0xA3 or \|\\\|0XA3 or \|\\\|0xa3 character of hexadecimal code A3
- X.br
- X.PP
- XThe allowed digits are 0-7 for octal codes, 0-9 for decimal codes and
- X0-F (and/or 0-f) for hexadecimal codes.
- XIn a situation when a code has to be followed by a character which
- Xcould be read as one of its digits, you need to enter that
- Xcharacter as a code. E.g., if you want character \|\\\|0xA3 followed by a letter C,
- Xyou need to specify letter C as a code (\|\\\|0x43 or \|\\\|0103 or \|\\\|0o103 or \|\\\|0d67)
- Xand type the sequence as, e.g., \|\\\|0xA3\|\\\|0103.
- XA character resulting in a code 0 (zero) (e.g., \|\\\|00) is special. It tells:
- X"skip everything that follows me in this string".
- XIt does not make sense to use it, since you can always terminate the
- Xsequence with a delimiter. When you use an empty string as a matching
- Xsequence, remember that it does not match anything.
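The code notations listed above can be illustrated with a small Python decoder. This is a sketch, not the string processor shipped with translit. It assumes that each notation consumes as many valid digits as follow it (the manual does not state a maximum), and it does not model the special meaning of code 0 or continuation lines. The name `decode_codes` is hypothetical.

```python
import re

# One alternative per notation; digits are consumed greedily (an assumption).
_CODE = re.compile(
    r"\\(?:0[xX]([0-9A-Fa-f]+)"   # \0xA3   hexadecimal
    r"|0[oO]([0-7]+)"             # \0o123  octal
    r"|0[dD]([0-9]+)"             # \0d123  decimal
    r"|0([0-7]+)"                 # \0123   octal (leading zero)
    r"|([1-9][0-9]+))"            # \123    decimal (at least 2 digits)
)

def decode_codes(text: str) -> str:
    """Replace every recognized backslash code by the character it names."""
    def repl(m):
        hex_, oct_, dec_, oct0, dec0 = m.groups()
        if hex_:
            return chr(int(hex_, 16))
        if oct_ or oct0:
            return chr(int(oct_ or oct0, 8))
        return chr(int(dec_ or dec0, 10))
    return _CODE.sub(repl, text)

print(repr(decode_codes(r"\0d34quoted\0x22")))
```

Under the greedy assumption a hexadecimal code directly followed by a letter such as C would swallow that letter as a digit, which is why the following character then has to be written as a code of its own.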
- X.sp
- XIf the line with entries is too long, you can break it between the
- Xfields.
- XIf the string is too long to fit a line, you can break it before any nonblank
- Xcharacter by the \|\\\| (backslash) followed by white space (i.e., new lines,
- Xspaces, tabs, etc.). The \|\\\| and the following white space will be removed
- Xfrom the string by the string processor. However, you are not allowed
- Xto break the individual character codes (and you probably would never
- Xdo it for aesthetic reasons anyway).
- XFor example:
- X.br
- X "experi\\
- X.br
- X mental design"
- X.br
- Xis equivalent to:
- X.br
- X "experimental design"
- X.br
- Xwhile:
- X.br
- X "experimental\\
- X.br
- X design"
- X.br
- Xis equivalent to:
- X.br
- X "experimentaldesign"
- X.br
- XIf you need to have \|\\\| followed by a space in your string, you need to
- Xenter either a backslash or a space following it as an explicit character
- Xcode, for example:
- X.br
- X "\|\\\|\|\\\|0o40"
- X.br
- Xwill produce a \|\\\| followed by the space, while the string:
- X.br
- X "\|\\\| "
- X.br
- Xwill be empty.
- X.sp 1
- XThe preprocessor knows only about comments, plain characters, character codes,
- Xand continuation lines. However, some characters and their combinations
- Xmay have a special meaning in lists and regular expressions.
- X.sp 2
- X\fIDETAILS OF FILE STRUCTURE\fR
- X.sp
- X.PP
- X.in +3m
- X.ti -3m
- XAd.1) File format number. This is simply a digit 1 on a line by itself at the
- Xmoment. This entry is included to allow future extensions of the
- Xtransliteration description file without the need to modify older
- Xtransliteration descriptions (program will read data according to
- Xthe current file format number given in the file).
- X.sp
- X.ti -3m
- XAd.2) String delimiters. The subsequent 3 lines specify pairs of
- Xsingle character delimiters for 3 types of text data.
- XThe line format is:
- X.br
- X opening_character closing_character.
- X.br
- XThese are needed to mark the beginning/end and the type of the text data.
- XEach string (text datum) is saved starting from the first character after
- Xopening delimiter, and ends at the last character before the closing
- Xdelimiter. If you need to use the closing delimiter within a string,
- Xyou need to specify it as its code (e.g., if you are using () pair as
- Xdelimiters, specify ")" as \|\\\|0x29). The opening delimiter may be the same
- Xor different from the closing delimiter.
- X.sp
- X.in +2m
- X.ti -2m
- Xa) The first line contains characters used to enclose (bracket)
- Xa \fIplain string\fR. Plain strings are directly matched to input data or
- Xdirectly sent to output.
- XI suggest sticking to the " " pair for plain strings.
- XThe ASCII code for " is \|\\\|0d34 = \|\\\|0x22 = \|\\\|0o42 if you need it inside the
- Xstring itself.
- X.sp
- X.ti -2m
- Xb) The second line contains characters to mark the beginning and the end
- Xof the \fIlist\fR. Lists are used to translate single character codes.
- XI suggest [ and ] delimiters for the list (ASCII code of "]" is:
- X\|\\\|0d93 = \|\\\|0x5D = \|\\\|0o135). The lists may include ranges, for example:
- X[a-zA-Z0-9] will include all Latin letters (small and capital) and digits.
- XNote that order is important: [a-d] is equivalent to [abcd], while
- X[d-a] will result in an error. If you want to include "-" (minus) in the
- Xlist, you need to place it as the first or the last character. There are only
- Xtwo special characters on the list, the "-" described above, and the "]"
- Xcharacter. You need to enter the "]" as its code. E.g., for
- XASCII character table [*--] is equivalent to [*+,-], is equivalent to
- X[\|\\\|42\|\\\|43\|\\\|44\|\\\|45]. The order of characters in the list does not matter
- Xunless the input list corresponds to the output list (this will be
- Xexplained later). Empty lists do not make sense.
- X.sp
- X.ti -2m
- Xc) The third line of delimiter specification contains delimiters for
- X\fIregular expression\fRs and \fIsubstitution expression\fRs.
- XThese strings are used for "flexible" matches
- Xto the text in the input file. They are very similar to the ones used in
- XUN*X for searching text in utilities like: grep, sed, vi, awk, etc., though
- Xonly a subset of full UN*X regular expression syntax is used here.
- XI suggest enclosing them within braces { and } (ASCII code for } is
- X\|\\\|0d125 = \|\\\|0x7D = \|\\\|0o175). Actually, regular expressions can only
- Xbe used for input sequences, and for output sequences the {} are
- Xused to enclose substitution sequences. This will be explained
- Xbelow. The description of the
- Xsyntax for regular/substitution expressions is
- Xadapted from the documentation for the regexp package of Henry
- XSpencer, University of Toronto --- this regular expression package
- Xwas incorporated, after minute modifications, into the program.
- X.br
- X.sp 2
- X.ce
- X\fBREGULAR EXPRESSION SYNTAX\fR
- X.br
- XA regular expression is zero or more branches, separated by
- X`\|\||\|\|'. It matches anything that matches one of the branches.
- XThe `\|\||\|\|' simply means "or".
- X.ti +2m
- XA branch is zero or more pieces, concatenated. It matches a
- Xmatch for the first, followed by a match for the second,
- Xetc.
- X.ti +2m
- XA piece is an atom possibly followed by `*', `+', or `?'.
- XAn atom followed by `*' matches a sequence of 0 or more
- Xmatches of the atom. An atom followed by `+' matches a
- Xsequence of 1 or more matches of the atom. An atom followed
- Xby `?' matches zero or one occurrences of atom.
- X.ti +2m
- XAn atom is a regular expression in parentheses (matching a
- Xmatch for the regular expression), a range (see below), `.'
- X(matching any single character), a `\|\\\|' followed by
- Xa single character (matching that character), or a
- Xsingle character with no other significance (matching that
- Xcharacter).
- X.ti +2m
- XA range is a sequence of characters enclosed in `[\|\|]'. It
- Xnormally matches any single character from the sequence. If
- Xthe sequence begins with `^', it matches any single character
- Xnot from the rest of the sequence. If two characters in
- Xthe sequence are separated by `-', this is shorthand for the
- Xfull list of ASCII characters between them (e.g. `[0-9]'
- Xmatches any decimal digit). To include a literal `]' in the
- Xsequence, make it the first character (following a possible
- X`^'). To include a literal `-', make it the first or last
- Xcharacter. The regular expression can contain subexpressions
- Xwhich are enclosed in a (\|\|) pair. These subexpressions are numbered
- X1 to 9 and can be nested. The numbering of subexpressions is
- Xgiven in the order of their opening parentheses "(". For
- Xexample:
- X.br
- X.ta 6mL
- X (111)...(22(333)222(444)222)...(555)
- X.br
- XNote that expression 2 contains within itself expressions 3 and 4.
- X.br
- XThese subexpressions can be referenced in the substitution string, which
- Xis described below, or can be used to delimit
- Xatoms.
- X.in +2m
- XExamples:
- X.in +2m
- X.ti -2m
- X{[\|\\\|0d32\|\\\|0d09]\|\\\|0d10} --- will match space or tab followed by new line
- X.ti -2m
- X{[Tt][Ss]} --- will match TS, Ts, tS and ts
- X.ti -2m
- X{TS\|\||\|\|Ts\|\||\|\|tS\|\||\|\|ts} --- same as above
- X.ti -2m
- X{[\|\\\|0d09-\|\\\|0d15 ][^hH][^uU][a-zA-Z]*[\|\\\|0d09-\|\\\|0d15 ]} --- words whose
- Xfirst character is not h or H and whose second character is not u or U
- X(this excludes hu, Hu, hU, HU, but also words like "hat"). There is a space between
- X\|\\\|0d15 and ].
- X.br
- XNote that specifying expressions like {.*} (i.e., match all characters)
- Xdoes not make much sense, since it would mean here: match the whole input
- Xfile. However, expressions like {A.*B} should be acceptable, since they
- Xmatch a pair of A and B, and everything in between them, e.g. for a
- Xstring like: "This is Mr. Allen and this is Mr. Brown." this expression
- Xshould match the string: "Allen and this is Mr. B".
- X.br
- X.in -4m
- XRemember to put a backslash "\|\\\|" in front of the following
- Xcharacters: .\|\|[\|\|(\|\|)\|\||\|\|?\|\|+\|\|*\|\|\|\\\| if you want
- Xtheir literal meaning outside the
- Xrange enclosed in [\|\|]. Inside the range they have their literal meaning.
- XIf you know the syntax of UN*X regular expressions, please note that
- X\|\|^\|\| and \|$\| anchors are not supported and are treated as normal
- Xcharacters (with the exception of \|\|^\|\| negation within [\|\|]).
- X.sp
- X.ce
- X\fBSUBSTITUTION EXPRESSIONS\fR
- X.br
- XAfter finding a match for a regular expression in the input text,
- Xa substitution is made.
- XIt can be a simple substitution where the whole matching string
- Xis replaced by another string, or it may reuse a portion or
- Xthe whole matching string. The subexpressions (the ones enclosed
- Xin parentheses) within the regular
- Xexpression which matched the input text can be referenced in the
- Xsubstitution expression.
- XOnly the following characters have special meaning within substitution
- Xexpression:
- X.in +4m
- X.ta 3m
- X.br
- X.ti -2m
- X& --- will put the whole matching string.
- X.ti -2m
- X\|\\\|1 --- will put the match for the 1st subexpression in (\|\|).
- X.ti -2m
- X\|\\\|2 --- will put the string which matched 2nd subexpression,
- Xetc.
- X.ti -2m
- X\|\\\|9 --- will place in the replacement string the 9th
- Xsubexpression (provided that there were 9 (\|\|) pairs in
- Xthe regular expression)
- X.in -4m
- X.sp
- XOnly 9 subexpressions are allowed.
- XAll other characters and sequences within the substitution expression
- Xwill be placed in a substitution string as written. To be able to put
- Xa single backslash there, you need to put two of them.
- XTo be able to place the unchanged codes of the
- Xabove characters (i.e., to make them literals), you need to precede them
- Xwith a backslash "\|\\\|", i.e., to get & in the output string
- Xyou need to write it as \|\\\|&. Similarly, to place literal
- X\|\\\|1, \|\\\|2, etc., you need to enter it as \|\\\|\|\\\|1, \|\\\|\|\\\|2, etc.
- XNote that characters .+[]()^, etc. which had a special meaning in
- Xthe regular expressions, do not have any special meaning in the
- Xsubstitution expression and will be output as written.
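The substitution rules can be sketched as a short Python function. It is an illustration only: the name `expand` is invented, Python's `re` module stands in for the regexp package used by translit, and the treatment of a backslash before any character other than a digit 1-9, & or another backslash (it is kept as written) is an assumption.

```python
import re

def expand(substitution: str, match) -> str:
    """Expand &, \\1..\\9 and the escapes \\& and \\\\ against a regex match."""
    out, i = [], 0
    while i < len(substitution):
        ch = substitution[i]
        nxt = substitution[i + 1:i + 2]      # "" at the end of the string
        if ch == "&":
            out.append(match.group(0))       # the whole matching string
        elif ch == "\\" and nxt and nxt in "123456789":
            out.append(match.group(int(nxt)) or "")  # Nth subexpression
            i += 1
        elif ch == "\\" and nxt and nxt in "\\&":
            out.append(nxt)                  # \& gives &, \\ gives \
            i += 1
        else:
            out.append(ch)                   # everything else as written
        i += 1
    return "".join(out)

m = re.search(r"([Tt])([Ss])", "Tsar")
print(expand(r"\1.\2", m))
```

Referencing a subexpression number that does not exist in the regular expression raises an error in this sketch; how translit itself reacts is not described here.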
- X.in +2m
- XExample:
- X.br
- XThe regular expression:
- X.in +2m
- X.ti -2m
- X{([Tt])([Ss])} and the corresponding substitution expression {\|\\\|1.\|\\\|2}
- Xputs a period
- Xbetween adjoining letters t and s preserving their letter case.
- X.br
- XThe expression:
- X.ti -2m
- X{([A-Za-z]+)-[ \|\\\|0x09]*([\|\\\|0x0A-\|\\\|0x0D]+)[ \|\\\|0x09]*([A-Za-z,.?;:"\|\\\|)'`!]+)[ \|\\\|0x09]}
- X.br
- Xand the substitution expression {\|\\\|1\|\\\|3\|\\\|2} dehyphenate words (when you
- Xunderstand this one, you are a guru...). For example:
- Xcon- (NL)cert is changed to concert(NL), where NL stands for New
- XLine. The expression looks for one or more letters (saved as subexpression 1)
- Xfollowed by a hyphen, which may be followed by zero or more spaces
- Xor tabs. These must be followed by a NewLine sequence (ASCII characters
- X0A-0D hex form various new line sequences), which is saved
- Xas subexpression 2.
- XThen it looks for zero or more tabs and spaces (at the beginning of
- Xthe line). Then it looks for the rest of the hyphenated word and
- Xsaves it as subexpression 3. The word may have punctuation attached.
- XFinally it looks for one space or tab. The substitution expression
- Xdiscards all sequences which were not within (), i.e., the hyphen and
- Xspaces/tabs, and inserts only the subexpressions, but in a different
- Xorder. The \|\\\|1 (word beginning) is followed by \|\\\|3 (word end) and
- Xthen by the NewLine --- \|\\\|2. The {\|\\\|2\|\\\|1\|\\\|3} would
- Xprobably be equally good, though you would need to move the punctuation
- Xmatching to the beginning of the regular expression.
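For readers who want to experiment with the dehyphenation expression, here is the same pattern written for Python's `re` module. This is only an approximation of what translit does: Python's regular expression engine is not the regexp package used by the program, the codes are written with Python escapes instead of the \0x notation, and the names `DEHYPHENATE` and `dehyphenate` are invented for the example.

```python
import re

# Same structure as the expression above, written with Python escapes:
# \t is code 0x09 and [\n-\r] is the range 0x0A-0x0D.
DEHYPHENATE = re.compile(
    r"([A-Za-z]+)-[ \t]*([\n-\r]+)[ \t]*([A-Za-z,.?;:\"\)'`!]+)[ \t]"
)

def dehyphenate(text: str) -> str:
    # \1 = word beginning, \3 = word end, \2 = the saved new-line sequence
    return DEHYPHENATE.sub(r"\1\3\2", text)

print(repr(dehyphenate("a con- \ncert hall")))
```

Note that the pattern requires a space or tab after the word end, so a hyphenated word at the very end of the input is left alone.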
- X.in -6m
- X.ti -3m
- XAd.3) Starting sequence. This sequence will be sent to the output before
- Xany text. It is enclosed in the pair of string delimiters. I use it
- Xto output LaTeX preamble. However, it can be empty, if not used.
- XThe (sequence) may contain any characters, including new lines, etc.
- X.nf
- X.ta 2m 4m
- X Example:
- X "" # empty sequence
- X.sp
- X Example:
- X "\|\\\|documentstyle{article}
- X \|\\\|input cyracc
- X \|\\\|begin{document}
- X "
- X is right (note a new line at the end), but
- X.br
- X "\|\\\|documentstyle{article}
- X \|\\\|input cyracc # this comment will be included!
- X \|\\\|begin{document}" # while this will not
- X is wrong.
- X.sp
- X.fi
- X.ti -3m
- XAd.4) Ending sequence. Similar to 3), but will be appended at the end of the
- Xoutput file.
- X.nf
- X For example:
- X "\|\\\|end{document}
- X "
- X.fi
- X.sp
- X.ti -3m
- XAd.5) Number of input character sets. For example, in some incarnation of
- XKOI7, there are two character sets: Latin and Cyrillic. Cyrillic
- Xcharacter sequence follows SHIFT-OUT character (CTRL-N), \|\\\|0x0e,
- Xand is terminated by SHIFT-IN character (CTRL-O), \|\\\|0x0f.
- XAnother way of looking at it is that Latin characters follow
- XCTRL-O and cyrillic ones follow CTRL-N.
- X.sp
- XIf there is only one character set on input you should specify 0
- Xas a number of input char sets,
- Xsince the input file obviously does not contain any SHIFT-OUT/IN
- Xsequences.
- X.sp
- X.ti -3m
- XAd.6) SHIFT-OUT/SHIFT-IN sequences for each input character set.
- XThese lines appear only if you specified nonzero number of character sets.
- XThese lines contain also "nesting sequences", which will be
- Xexplained later in this section.
- XYou do not use "nesting sequences" frequently, and let us assume
- Xfor a moment that nesting data are empty strings.
- XThe strings or regular expressions specified here are matched
- Xwith the contents of input text. If a match is found, the matching sequence
- Xis usually deleted from the input text and:
- X.in +4m
- X.ti -2m
- Xa) for SHIFT-OUT sequence: the current input character set number is changed
- Xto the new one corresponding to the SHIFT-OUT sequence, or
- X.ti -2m
- Xb) for SHIFT-IN sequence: the previous input character set number is restored,
- X(i.e., the one which preceded the SHIFT-OUT sequence for the current set).
- XNote that only the SHIFT-IN sequence for the current set is matched.
- XThe SHIFT-IN sequences for other character sets than the current set are
- Xnot matched.
- XThe bracketing of sets is assumed
- Xperfect. If the SHIFT-IN sequence for the current set is an empty string,
- Xthe input set number is changed when SHIFT-OUT sequence of the new set
- Xis detected.
- X.in -4m
- XFor each input character set, you have to specify a line consisting
- Xof 6 strings/expressions separated by spaces:
- X.br
- X SO-match SO-subs NEST-up NEST-down SI-match SI-subs
- X.br
- Xwhere:
- X.br
- X.in +2m
- X.ti -2m
- XSO-match --- the string or regular expression for the SHIFT-OUT sequence
- Xfor the current character set. If detected, the input character set is
- Xchanged to this set.
- X.ti -2m
- XSO-subs --- this is usually an empty string (i.e., the input sequence
- Xmatching SO-match is removed). But it can be a replacement string or
- Xa substitution expression, which will substitute the original matching
- XSHIFT-OUT sequence.
- X.ti -2m
- XNEST-up --- this string (or a regular expression) is usually an empty
- Xstring. However, it can be used to count brackets for detection of SHIFT-IN
- Xbracket, if SHIFT-IN sequence is not unique. Its use is explained below.
- X.ti -2m
- XNEST-down --- a counterpart of NEST-up. It is explained later.
- X.ti -2m
- XSI-match --- when a sequence in an input file matches the string or regular
- Xexpression given as SI-match for a current input character set, the
- Xinput character set number is restored to the previous set. Note, that
- Xonly SI-match for a current set is matched with input characters.
- X.ti -2m
- XSI-subs --- this is usually an empty string (i.e., input sequence which
- Xmatched SI-match is removed), but if it is not, the input characters which
- Xmatched the SI-match are replaced with the SI-subs.
- X.sp
- X.in -2m
- X.br
- XThe KOI7 case described above may be specified as:
- X.nf
- X.ta 5m 10m 15m 20m 25m
- X.nf
- X 2 # 2 input sets
- X ""\0\0\0\0 ""\0\0\0\0 ""\0\0\0\0 ""\0\0\0\0 ""\0\0\0\0 ""\0\0\0\0 # Latin(set 1)
- X "\|\\\|016" ""\0\0\0\0 ""\0\0\0\0 ""\0\0\0\0 "\|\\\|017" ""\0\0\0\0 # Cyrillic(set 2)
- X or
- X 2 # 2 sets
- X "\|\\\|017" ""\0\0\0\0 ""\0\0\0\0 ""\0\0\0\0 ""\0\0\0\0 ""\0\0\0\0 # Latin(set 1)
- X "\|\\\|016" ""\0\0\0\0 ""\0\0\0\0 ""\0\0\0\0 ""\0\0\0\0 ""\0\0\0\0 # Cyrillic(set 2)
- X.fi
- X.br
- XBefore the input is processed, the program starts in the first character
- Xset listed. In the above case, this is important, since the declaration:
- X.nf
- X 2 # 2 sets
- X "\|\\\|016" ""\0\0\0\0 ""\0\0\0\0 ""\0\0\0\0 ""\0\0\0\0 ""\0\0\0\0 # Cyrillic(set 1)
- X "\|\\\|017" ""\0\0\0\0 ""\0\0\0\0 ""\0\0\0\0 ""\0\0\0\0 ""\0\0\0\0 # Latin(set 2)
- X.br
- X.fi
- Xwould be wrong and would mess up the Latin characters preceding
- Xthe first Cyrillic sequence.
- X.sp 1
- XThe nesting sequences are used only for specific situations. I needed them
- Xto write a transliteration table from LaTeX to KOI8.
- XIn LaTeX the { } pair is used for grouping and appears frequently in
- Xthe text. The sequence of cyrillic characters is also a group
- Xin LaTeX.
- XThe SHIFT-OUT sequence for Russian letters in LaTeX is (at least in
- Xmy case): "{\|\\\|cyr ", and the end
- Xof the Russian letters is marked by "}", but the "}" has to be the
- Xbracket matching the opening "{" in "{\|\\\|cyr ", not just any bracket.
- XFor this reason, my SHIFT-OUT/IN entry was in this case:
- X.br
- X "{\|\\\|cyr " "" "{" "}" "}" "" # Cyrillic codes
- X.br
- XWhenever the "{\|\\\|cyr " was found, the program zeroes the counter.
- XIt adds +1 to it, when NEST-up sequence (i.e., the "{" here) is found, and
- Xsubtracts 1 from it, when the NEST-down sequence is found (i.e., the "}").
- XThe checking for a SHIFT-IN sequence (i.e., the "}") for cyrillic set
- Xis done only when
- Xthe counter value is zero (i.e., all pairs inside the cyrillic text are
- Xmatched). In fact, the process is more
- Xcomplicated than that (the counter for an opened character set is
- Xplaced on the stack), but these are details you can find in the code
- Xitself.
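The counting described above can be sketched as follows. This is a simplified illustration, not the code of translit itself: it handles only two sets (Latin as set 1, Cyrillic as set 2), uses plain strings instead of regular expressions, and keeps a single counter instead of a stack. The function name and its defaults are hypothetical.

```python
def split_by_set(text, so="{\\cyr ", nest_up="{", nest_down="}", si="}"):
    """Split text into (set_number, chunk) pieces using a nest counter."""
    segments = []
    buf = []
    current_set = 1   # processing starts in the first set
    depth = 0         # nest counter for the opened Cyrillic set
    i = 0

    def flush():
        if buf:
            segments.append((current_set, "".join(buf)))
            buf.clear()

    while i < len(text):
        if current_set == 1 and text.startswith(so, i):
            flush()
            current_set, depth = 2, 0   # SHIFT-OUT found: zero the counter
            i += len(so)                # SO-subs is "", so the match is dropped
            continue
        if current_set == 2:
            # SHIFT-IN is checked only when all inner pairs are matched.
            if depth == 0 and text.startswith(si, i):
                flush()
                current_set = 1
                i += len(si)
                continue
            if text.startswith(nest_up, i):
                depth += 1
            elif text.startswith(nest_down, i):
                depth -= 1
        buf.append(text[i])
        i += 1
    flush()
    return segments


if __name__ == "__main__":
    print(split_by_set("a {\\cyr b{c}d} e"))
```

The inner pair in b{c}d raises and lowers the counter, so only the last closing bracket is taken as the SHIFT-IN sequence.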
- X.sp
- X.ti -3m
- XAd.7) Number of output "character sets". This is analogous to the input case.
- XThe characters sent to output may belong to different sets. For example,
- Xwhen the character (or the sequence) from set 2 is followed by the character
- X(or the sequence) from set 1,
- Xthe program first sends the SHIFT-IN sequence for set 2 (if it is not
- Xempty) and then the SHIFT-OUT sequence for set 1 (if it is not empty). If the
- Xoutput character (or sequence) is assigned to set 0, then no SHIFT-IN/SHIFT-OUT
- Xsequences are sent to output.
- X.br
- XIf there is only one set of output characters, you should specify 0.
- XNote that you may have several input sets and several output sets, though
- Xthis is rare. Usually, you have one input set and many
- Xoutput character sets, or vice versa. Again, if you have only one output set,
- Xyou do not have any SHIFT-IN/SHIFT-OUT sequences, since those are
- Xsent to output only when a set number is changed.
- XBut you are free to experiment.
- X.sp
- X.ti -3m
- XAd.8) SHIFT-OUT/SHIFT-IN sequences for each output character set. It is
- Xsimilar to the input case, however, the NEST-up and NEST-down sequences
- Xare not used here. Again, before any text is sent to output, the
- Xcharacter set specified as the first one is assumed. If SHIFT-OUT/IN
- Xsequences are not used (i.e., you have only one output character set),
- Xyou will not have any SHIFT-OUT/SHIFT-IN data lines.
- XThe KOI8 (single character set containing all Latin and Russian letters)
- Xto KOI7 (the set using overlapping codes switched by SHIFT-OUT/IN sequences)
- Xconversion could be therefore accomplished by the following table:
- X.br
- X 2 # 2 output sets
- X.br
- X ""\0\0\0\0 ""\0\0\0\0 # Latin Letters
- X.br
- X "\|\\\|016" "\|\\\|017" # Russian Letters
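The switching of output sets described in Ad.7 and Ad.8 can be sketched as below. This is an illustration under stated assumptions: the emit function and the shape of its arguments are hypothetical, output starts in set 1, and nothing is appended at the end of the data (whether translit closes an open set at end of file is not stated here).

```python
def emit(pieces, shifts):
    """Join (set_number, data) pieces, adding SHIFT-IN/SHIFT-OUT bytes.

    shifts maps a set number to its (SHIFT-OUT, SHIFT-IN) pair.
    Set number 0 means: keep the current output set unchanged.
    """
    current = 1                          # the first set is assumed at start
    out = bytearray()
    for set_no, data in pieces:
        if set_no != 0 and set_no != current:
            out += shifts[current][1]    # SHIFT-IN of the set being left
            out += shifts[set_no][0]     # SHIFT-OUT of the set being entered
            current = set_no
        out += data
    return bytes(out)


# The two output sets of the KOI7 table above: Latin has empty sequences,
# Russian uses \016 (SO) and \017 (SI).
KOI7_SHIFTS = {1: (b"", b""), 2: (b"\x0e", b"\x0f")}

if __name__ == "__main__":
    pieces = [(1, b"ab"), (2, b"[{"), (0, b" "), (2, b"A"), (1, b"c")]
    print(emit(pieces, KOI7_SHIFTS))
```

Empty sequences simply add nothing, which matches the "if it is not empty" wording of Ad.7.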
- X.sp
- X.ti -3m
- XAd.9) Transliteration table for individual character or their sequences.
- XIt is a core of your transliteration data.
- XThere are 4 columns in the transliteration
- Xtable:
- X.br
- X.in +3m
- X(inp_set_no) (inp_seq) (out_set_no) (out_seq)
- X.br
- X.in -3m
- XThese 4 columns are separated by spaces. The (input_set_number)
- Xcorresponds to the input character set number as specified above for
- Xinput SHIFT-OUT/SHIFT-IN data, or zero.
- XIf zero is used (even if number of input sets is not zero), the
- X(input_sequence) will always be matched, irrespective of the current
- Xinput character set imposed by the SHIFT-OUT sequence. This is useful,
- Xsince some characters are universal (e.g., new lines, spaces, pluses,
- Xminuses, etc.) irrespective of the current character set.
- XThe (input_sequence) is the sequence of characters to be matched with
- Xcharacters in the input file, and if found (within the character set
- Xspecified) it is replaced by the (output_sequence) and sent to output
- X(i.e., the matching is interrupted, the (output_sequence) sent to output,
- Xthe input file pointer is moved to the first character after the
- Xmatched sequence and matching resumes).
- XThe (output_set_number) specifies the output character set. When the
- Xoutput character set changes during transliteration, the appropriate SHIFT-IN
- Xsequence of the previous set and the current set's SHIFT-OUT sequence are sent
- Xto output. The (output_set_number) may also be zero (even if number of
- Xoutput sets is not zero). In this case, the current output set status
- Xis not changed, and no SHIFT-IN/OUT sequences are sent to output. Lastly, the
- Xoutput set code may be -1, -2 or -3.
- XIn this case, the substitution is performed
- Xwithin input string that matched but the output sequence is not sent to
- Xthe output yet. Depending on the code, the following action is performed:
- X.in +4m
- X.ti -2m
- X-1 --- program makes the substitution in the input string (i.e., replaces
- Xthe matching string with the output sequence in the input buffer).
- XIt does not send the output sequence to the output, but
- Xcontinues matching input sequences following the currently
- Xmatched one.
- X.ti -2m
- X-2 --- like code -1, but matching is resumed from the first sequence on
- Xthe list.
- X.ti -2m
- X-3 --- like code -1, but matching is resumed from the input SHIFT-OUT/IN
- Xsequences.
- X.in -4m
- XE.g., if the unprocessed text in the input file is:
- X.br
- X mental procedure was not successful since..........
- X.br
- Xand there was a line in transliteration table:
- X.br
- X 0 "me" -1 "you"
- X.br
- Xthe input text would be changed to:
- X.br
- X yountal procedure was not successful since..........
- X.br
- Xand all remaining matching data would be applied to this text, rather than
- Xoriginal text.
- XThe -2 code backsteps to the point where the matching of
- Xtransliteration starts.
- XThe -3 code backsteps even further, to the point where the
- Xinput SHIFT-OUT and SHIFT-IN sequences are matched.
- XSince the order of sequences to match
- Xis crucial here, for the case of output set code -1/-2/-3
- Xeven one-character input sequences are matched in the order specified.
- XBE CAREFUL HERE. You may create infinite loops. If you use
- Xcode -2/-3, be sure that the resulting sequence after substitution
- Xwith the code -2/-3, will not match previous sequences
- Xwith codes -2/-3.
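The behaviour of code -1 can be sketched as below. This is a simplified model that assumes plain, non-empty input strings, ignores character sets and codes -2/-3, and tries the rules strictly in the order given; the function name and the rule layout (input, code, output) are hypothetical.

```python
def transliterate(text, rules):
    """Apply (input, code, output) rules; code -1 rewrites the input buffer."""
    buf = text
    out = []
    pos = 0
    while pos < len(buf):
        emitted = False
        for inp, code, repl in rules:
            if not buf.startswith(inp, pos):
                continue
            if code == -1:
                # Substitute inside the input buffer, send nothing to
                # output, and go on with the rules that follow this one.
                buf = buf[:pos] + repl + buf[pos + len(inp):]
                continue
            out.append(repl)           # ordinary rule: send to output
            pos += len(inp)            # and move past the matched text
            emitted = True
            break
        if not emitted and pos < len(buf):
            out.append(buf[pos])       # unmatched character is copied as is
            pos += 1
    return "".join(out)


if __name__ == "__main__":
    rules = [("me", -1, "you"), ("you", 0, "YOU")]
    print(transliterate("mental procedure", rules))
```

With these two rules, "me" is first changed to "you" in the buffer, and the second rule then matches the substituted text rather than the original.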
- X.br
- XThe (output_sequence)
- Xis a sequence which substitutes the corresponding (input_sequence).
- XIf (output_sequence) is "" (i.e., empty string) then (input_sequence)
- Xis effectively deleted.
- XThe (input_sequence)s are compared with input in the order specified
- Xunless backstepping -2/-3 code is used (the matching is done from the
- Xfirst sequence again). I use the code -1 e.g.,
- Xto dehyphenate words when changing to LaTeX.
- XCode -2 is useful if you want to skip next comparisons, and the resulting
- Xsubstitution string will match earlier matching expressions.
- XI do not see any use for the code -3, but you may have one.
- XThe order for multicharacter sequences is
- Xtherefore important (the single character sequences are always compared
- Xafter all multicharacter sequences, and can be therefore put anywhere).
- XThe longer multicharacter sequences should be specified before
- Xshorter ones, unless they are some "preprocessing" steps with codes
- X-1/-2/-3. The order may sometimes be crucial.
- XIf you need single character sequences matched in a specific order,
- Xenter them as regular expressions, i.e., as {c} instead of "c".
- XIn short, the multicharacter input sequences and regular expressions
- Xare matched to input text in the order specified. For the sake of
- Xefficiency, the single character input sequences (with exception of
- Xoutput set code -1/-2/-3) and input lists are handled as a case of remapping
- Xand are matched in the order of character codes associated with them.
- XIf you specify the same single input character twice for a given input set,
- Xthe program will complain.
- XThe following combinations of input and output sequences are allowed:
- X.nf
- X.ta 2m 24m
- X Input Sequence Output Sequence
- X "\fIplain string\fR" only "\fIplain string\fR"
- X [\fIlist\fR] [\fIlist\fR] or "\fIplain string\fR"
- X {\fIregular expression\fR} {\fIsubstitution expression\fR} or
- X.br
- X "\fIplain string\fR"
- X.br
- X.fi
- XWhen a match is found, the matching sequence is removed and substituted
- Xwith an output sequence. If this results in changing the current output
- Xcharacter set, the appropriate SHIFT-IN/SHIFT-OUT pair is sent to the
- Xoutput before the transliterated output sequence. If list is
- Xused as the input sequence, you may either use:
- X.br
- X.in +2m
- X.ti -2m
- Xa) plain string as output
- Xsequence. In this case, if current input character belongs to the input list,
- Xit is replaced by the output string. I use it to delete ranges of
- Xcharacters which do not have any corresponding characters in the output
- Xset (e.g., some graphics characters). In this case, the order of
- Xcharacters on the input list is not important.
- X.ti -2m
- Xb) if the output string is also a
- Xlist then it has to contain exactly the same number of characters as
- Xthe input list. In this case, the 1st character from the input list
- Xis replaced by the 1st character from the output list, the 2nd one
- Xby the 2nd one, etc. Therefore, the order of characters is important.
- X.br
- X.in -2m
- XTheoretically, if there is one-to-one correspondence between characters
- Xin the input set and characters in the output set,
- Xyou can make the conversion by
- Xusing a single line consisting of two lists. But it looks ugly... And is
- Xdifficult to read.
- XAnd for the program, the substitution takes the same time, if
- Xthe characters are specified separately, or when they are specified
- Xas matching lists.
- XIf regular expression is used to match the input characters, the matching
- Xsequence may be replaced by a plain string or a substitution string,
- Xwhich was described above.
- X.in +3m
- XExamples:
- X.br
- X.ta 3m 10m 20m 30m 40m
- X 2 "CCCP" 0 ""\0\0\0\0
- X.br
- Xwill delete all occurrences of CCCP from the input file (but not Cccp or
- XCCCp) for input set 2.
- X.sp 1
- X 0 "\|\\\|0xD1" 0 "ya"
- X.br
- Xwill replace all occurrences of character of the code \|\\\|0xD1 with a two
- Xletter sequence "ya".
- X.sp 1
- X 0 \|\\\|0xD1 2 q
- X.br
- Xwill replace all characters \|\\\|0xD1 with a character "q" and output
- XSHIFT-IN/OUT sequence if necessary.
- X.sp 1
- X 2 "q" 0 "\|\\\|0xD1"
- X.br
- Xwill replace letter q (if the current input set is 2) with a code \|\\\|0xD1.
- X.sp 1
- X 0 "\|\\\|0xD1" 2 "ya"
- X.br
- Xwill replace code \|\\\|0xD1 with a sequence ya (assuming that SHIFT-OUT
- Xand SHIFT-IN sequences
- Xfor output set 2 are: {\|\\\|cyr and }, respectively, you will get {\|\\\|cyr ya}).
- X.sp
- XIf a character is not specified in the transliteration table, it will
- Xbe output as is, i.e., it corresponds to a line:
- X.br
- X 0 "c" 0 "c"
- X.br
- Xwhere c is the character. If you want to delete certain characters, you
- Xneed to explicitly specify this, e.g.:
- X.br
- X 0 [a-z] 0 ""
- X.br
- Xwill delete all lower case Latin letters from the text.
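The two ways of using an input list (list to plain string, and list to list) behave like a per-character remapping. A minimal sketch with Python's str.translate follows; the function names are hypothetical and character sets are ignored.

```python
import string


def map_list(text, in_list, out_list):
    """List to list: the n-th input character becomes the n-th output one."""
    if len(in_list) != len(out_list):
        raise ValueError("input and output lists must have the same length")
    return text.translate(str.maketrans(in_list, out_list))


def replace_list(text, in_list, out_string):
    """List to plain string: every listed character becomes out_string."""
    return text.translate({ord(c): out_string for c in in_list})


if __name__ == "__main__":
    print(map_list("cab", "abc", "xyz"))
    # Equivalent of the line: 0 [a-z] 0 ""
    print(replace_list("Hello World", string.ascii_lowercase, ""))
```

In map_list the order of characters matters; in replace_list it does not, as noted above.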
- X.in -3m
- XBefore you decide to create your own transliteration file, please examine
- Xexisting transliteration files. Do yourself (and others) a favor --- put
- Xas many comments as possible there. If you allow others to use your
- Xtransliteration files, please include your name and e-mail address
- Xand file creation date.
- X.in -4m
- X.sp 2
- XProgram matches the sequences in a specific order:
- X.in +4m
- X.ti -2m
- X\01) Match/substitute input SHIFT-OUT sequences
- X.ti -2m
- X\02) If matched, save current set and start new one
- X.ti -2m
- X\03) If matched, zero nest counter for NEST sequences
- X.ti -2m
- X\04) Match/substitute current set SHIFT-IN-sequence
- X.ti -2m
- X\05) If matched, restore previous set number
- X.ti -2m
- X\06) If matched, restore previous set nest counter
- X.ti -2m
- X\07) Match/substitute transliteration sequences
- X.ti -2m
- X\08) If matched and code = -1 make substitution in input buffer and
- Xcontinue matching the next sequence.
- X.ti -2m
- X\09) If matched and code = -2 make substitution and goto 7)
- X.ti -2m
- X10) If matched and code = -3 make substitution and goto 1)
- X.ti -2m
- X11) Match (no substitution) NEST-up and NEST-down to input buffer
- X.ti -2m
- X12) If NEST-up matched, increment counter for current set
- X.ti -2m
- X13) If NEST-down matched, decrement counter for current set
- X.ti -2m
- X14) If match in 7) send substitute sequence to output
- X.ti -2m
- X15) If no match in 7) (or code -1) output current input character
- X.ti -2m
- X16) Advance input pointer to point at new characters
- X.ti -2m
- X17) If End of File, break
- X.ti -2m
- X18) Goto 1)
- X.br
- X.fi
- X
- X.PP
- X.SH ASCII CHARACTER CODES
- X.nf
- X.ta 2m 6m 9m 13m 16m 20m 22m 26m 29m 33m 36m 40m
- X dec hx oct ch dec hx oct ch
- X
- X \0\00 00 000 ^@ NUL \064 40 100 @
- X \0\01 01 001 ^A SOH \065 41 101 A
- X \0\02 02 002 ^B STX \066 42 102 B
- X \0\03 03 003 ^C ETX \067 43 103 C
- X \0\04 04 004 ^D EOT \068 44 104 D
- X \0\05 05 005 ^E ENQ \069 45 105 E
- X \0\06 06 006 ^F ACK \070 46 106 F
- X \0\07 07 007 ^G BEL \071 47 107 G
- X \0\08 08 010 ^H BS \072 48 110 H
- X \0\09 09 011 ^I HT \073 49 111 I
- X \010 0a 012 ^J LF \074 4a 112 J
- X \011 0b 013 ^K VT \075 4b 113 K
- X \012 0c 014 ^L FF \076 4c 114 L
- X \013 0d 015 ^M CR \077 4d 115 M
- X \014 0e 016 ^N SO \078 4e 116 N
- X \015 0f 017 ^O SI \079 4f 117 O
- X \016 10 020 ^P DLE \080 50 120 P
- X \017 11 021 ^Q DC1 \081 51 121 Q
- X \018 12 022 ^R DC2 \082 52 122 R
- X \019 13 023 ^S DC3 \083 53 123 S
- X \020 14 024 ^T DC4 \084 54 124 T
- X \021 15 025 ^U NAK \085 55 125 U
- X \022 16 026 ^V SYN \086 56 126 V
- X \023 17 027 ^W ETB \087 57 127 W
- X \024 18 030 ^X CAN \088 58 130 X
- X \025 19 031 ^Y EM \089 59 131 Y
- X \026 1a 032 ^Z SUB \090 5a 132 Z
- X \027 1b 033 ^[ ESC \091 5b 133 [
- X \028 1c 034 ^\\ FS \092 5c 134 \\
- X \029 1d 035 ^] GS \093 5d 135 ]
- X \030 1e 036 ^^ RS \094 5e 136 ^
- X \031 1f 037 ^_ US \095 5f 137 _
- X \032 20 040 SP \096 60 140 `
- X \033 21 041 ! \097 61 141 a
- X \034 22 042 " \098 62 142 b
- X \035 23 043 # \099 63 143 c
- X \036 24 044 $ 100 64 144 d
- X \037 25 045 % 101 65 145 e
- X \038 26 046 & 102 66 146 f
- X \039 27 047 ' 103 67 147 g
- X \040 28 050 ( 104 68 150 h
- X \041 29 051 ) 105 69 151 i
- X \042 2a 052 * 106 6a 152 j
- X \043 2b 053 + 107 6b 153 k
- X \044 2c 054 , 108 6c 154 l
- X \045 2d 055 - 109 6d 155 m
- X \046 2e 056 . 110 6e 156 n
- X \047 2f 057 / 111 6f 157 o
- X \048 30 060 0 112 70 160 p
- X \049 31 061 1 113 71 161 q
- X \050 32 062 2 114 72 162 r
- X \051 33 063 3 115 73 163 s
- X \052 34 064 4 116 74 164 t
- X \053 35 065 5 117 75 165 u
- X \054 36 066 6 118 76 166 v
- X \055 37 067 7 119 77 167 w
- X \056 38 070 8 120 78 170 x
- X \057 39 071 9 121 79 171 y
- X \058 3a 072 : 122 7a 172 z
- X \059 3b 073 ; 123 7b 173 {
- X \060 3c 074 < 124 7c 174 |
- X \061 3d 075 = 125 7d 175 }
- X \062 3e 076 > 126 7e 176 ~
- X \063 3f 077 ? 127 7f 177 DEL
- X
- X.br
- X
- X.SH CONVERSION: DECIMAL<-->OCTAL<-->HEX.
- X.nf
- X.cs R 24
- X 000 000 00 064 100 40 128 200 80 192 300 C0
- X 001 001 01 065 101 41 129 201 81 193 301 C1
- X 002 002 02 066 102 42 130 202 82 194 302 C2
- X 003 003 03 067 103 43 131 203 83 195 303 C3
- X 004 004 04 068 104 44 132 204 84 196 304 C4
- X 005 005 05 069 105 45 133 205 85 197 305 C5
- X 006 006 06 070 106 46 134 206 86 198 306 C6
- X 007 007 07 071 107 47 135 207 87 199 307 C7
- X 008 010 08 072 110 48 136 210 88 200 310 C8
- X 009 011 09 073 111 49 137 211 89 201 311 C9
- X 010 012 0A 074 112 4A 138 212 8A 202 312 CA
- X 011 013 0B 075 113 4B 139 213 8B 203 313 CB
- X 012 014 0C 076 114 4C 140 214 8C 204 314 CC
- X 013 015 0D 077 115 4D 141 215 8D 205 315 CD
- X 014 016 0E 078 116 4E 142 216 8E 206 316 CE
- X 015 017 0F 079 117 4F 143 217 8F 207 317 CF
- X 016 020 10 080 120 50 144 220 90 208 320 D0
- X 017 021 11 081 121 51 145 221 91 209 321 D1
- X 018 022 12 082 122 52 146 222 92 210 322 D2
- X 019 023 13 083 123 53 147 223 93 211 323 D3
- X 020 024 14 084 124 54 148 224 94 212 324 D4
- X 021 025 15 085 125 55 149 225 95 213 325 D5
- X 022 026 16 086 126 56 150 226 96 214 326 D6
- X 023 027 17 087 127 57 151 227 97 215 327 D7
- X 024 030 18 088 130 58 152 230 98 216 330 D8
- X 025 031 19 089 131 59 153 231 99 217 331 D9
- X 026 032 1A 090 132 5A 154 232 9A 218 332 DA
- X 027 033 1B 091 133 5B 155 233 9B 219 333 DB
- X 028 034 1C 092 134 5C 156 234 9C 220 334 DC
- X 029 035 1D 093 135 5D 157 235 9D 221 335 DD
- X 030 036 1E 094 136 5E 158 236 9E 222 336 DE
- X 031 037 1F 095 137 5F 159 237 9F 223 337 DF
- X 032 040 20 096 140 60 160 240 A0 224 340 E0
- X 033 041 21 097 141 61 161 241 A1 225 341 E1
- X 034 042 22 098 142 62 162 242 A2 226 342 E2
- X 035 043 23 099 143 63 163 243 A3 227 343 E3
- X 036 044 24 100 144 64 164 244 A4 228 344 E4
- X 037 045 25 101 145 65 165 245 A5 229 345 E5
- X 038 046 26 102 146 66 166 246 A6 230 346 E6
- X 039 047 27 103 147 67 167 247 A7 231 347 E7
- X 040 050 28 104 150 68 168 250 A8 232 350 E8
- X 041 051 29 105 151 69 169 251 A9 233 351 E9
- X 042 052 2A 106 152 6A 170 252 AA 234 352 EA
- X 043 053 2B 107 153 6B 171 253 AB 235 353 EB
- X 044 054 2C 108 154 6C 172 254 AC 236 354 EC
- X 045 055 2D 109 155 6D 173 255 AD 237 355 ED
- X 046 056 2E 110 156 6E 174 256 AE 238 356 EE
- X 047 057 2F 111 157 6F 175 257 AF 239 357 EF
- X 048 060 30 112 160 70 176 260 B0 240 360 F0
- X 049 061 31 113 161 71 177 261 B1 241 361 F1
- X 050 062 32 114 162 72 178 262 B2 242 362 F2
- X 051 063 33 115 163 73 179 263 B3 243 363 F3
- X 052 064 34 116 164 74 180 264 B4 244 364 F4
- X 053 065 35 117 165 75 181 265 B5 245 365 F5
- X 054 066 36 118 166 76 182 266 B6 246 366 F6
- X 055 067 37 119 167 77 183 267 B7 247 367 F7
- X 056 070 38 120 170 78 184 270 B8 248 370 F8
- X 057 071 39 121 171 79 185 271 B9 249 371 F9
- X 058 072 3A 122 172 7A 186 272 BA 250 372 FA
- X 059 073 3B 123 173 7B 187 273 BB 251 373 FB
- X 060 074 3C 124 174 7C 188 274 BC 252 374 FC
- X 061 075 3D 125 175 7D 189 275 BD 253 375 FD
- X 062 076 3E 126 176 7E 190 276 BE 254 376 FE
- X 063 077 3F 127 177 7F 191 277 BF 255 377 FF
- X.cs R
- X.br
- X.sp
- X.fi
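A row of the conversion table above can be recomputed for any code with a few lines of Python. The helper name is hypothetical; it only formats one number in the three bases.

```python
def conversion_row(code):
    """Return one table row: decimal, octal and hex forms of a byte value."""
    if not 0 <= code <= 255:
        raise ValueError("character code must be in the range 0-255")
    return f"{code:03d} {code:03o} {code:02X}"


if __name__ == "__main__":
    for code in (14, 15, 232):
        print(conversion_row(code))
```

This is handy when a transliteration table needs \0x.. or octal codes for characters known only by their decimal value.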
- X
- X.SH INSTALLATION
- XThe program is distributed in source form. It has been compiled and run under
- XUN*X, VMS and MS-DOS systems. The file \fIreadme.doc\fR contains the details
- Xon how to obtain the whole package. You can retrieve this file
- Xfrom anonymous ftp on kekule.osc.edu in the directory /pub/russian/translit.
- XYou can also obtain it via e-mail by sending a message:
- X.br
- X get translit/readme.doc from russian
- X.br
- Xto OSCPOST@osc.edu or OSCPOST@OHSTPY.BITNET.
- X.sp
- XThe source of the program consists of several files:
- X.br
- X.IP \fIpaths.h\fR
- Xmust be edited before compilation. Its comments explain
- Xwhat to do. The defines in this file relate to the operating
- Xsystem you are using and the default path for searching transliteration
- Xtable.
- X.br
- X.IP \fItranslit.c\fR
- XIt contains the main program.
- XThis was intended to be a portable code.
- X.br
- X.IP \fIreg_exp.h\fR
- Xthe include file for regular expression matching
- Xlibrary of Henry Spencer from the University of Toronto. This regular
- Xexpression package was posted to comp.sources.misc (volume 3). Also 4 patches
- Xwere posted (in volumes: 3, 4, 4, 10). I applied the patches to the original
- Xcode and made small modifications to the code, which are marked in the
- Xsource code.
- X.br
- X.IP \fIreg_exp.c\fR
- Xthe regular expression library for compilation and
- Xmatching of regular expressions.
- X.br
- X.IP \fIreg_sub.c\fR
- Xthe regular expression substitution routine.
- X.br
- X.sp
- X.PP
- XBefore you compile this program you have to edit \fIpaths.h\fR.
- XRead comments in the file.
- XDuring compilation, all source code should reside in the
- Xcurrent directory.
- X.br
- XThen you may compile the program under UN*X as (for example):
- X.br
- X cc -o translit translit.c reg_exp.c reg_sub.c
- X.br
- Xand copy the program \fItranslit\fR to some standard directory which is
- Xin users' path (for example: /usr/local/bin). Then you need to copy
- Xtransliteration tables to the directory which you have chosen in \fIpaths.h\fR.
- XIf you get errors, please report them to the author (with
- Xall the gory details: error message, line number, machine, operating system,
- Xetc.).
- X.sp
- XUnder VMS (VAXes) you need to compile it as:
- X.br
- X cc translit
- X.br
- X cc reg_exp
- X.br
- X cc reg_sub
- X.br
- X link translit+reg_exp+reg_sub,sys$library:vaxcrtl/lib
- X.br
- Xand before you can use the program, you need to type (or better put into your
- XLOGIN.COM file) a line:
- X.br
- X translit == "$SYS$USER:[ME.TRA]TRANSLIT.EXE"
- X.br
- Xor whatever is the full path to the \fItranslit\fR executable image which
- Xyou created with LINK. Note the quotes and the $ sign in front of program
- Xpath.
- X.sp
- XOn an IBM-PC I used Microsoft C 5.1 as:
- X.br
- X.in +2m
- X.ti -1m
- Xcl /FeTRANSLIT /AL /FPc /W1 /F 5000 /Ox /Gs translit.c reg_exp.c reg_sub.c
- X.in -2m
- X.sp 2
- X.SH RULES, CONDITIONS AND AUTHOR'S WISHES
- XYou can distribute this code and associated files under these conditions:
- X.br
- X.in +4m
- X.ti -2m
- X 1) You will distribute all files (even if you
- Xthink that they are garbage). You may get the complete set from anonymous
- Xftp at kekule.osc.edu in /pub/russian/translit. You can also get the program
- Xand associated files via e-mail. To get the instructions for e-mail
- Xdistribution send a line:
- X.br
- X send translit/readme.doc from russian
- X.br
- Xto OSCPOST@osc.edu or OSCPOST@OHSTPY.BITNET.
- XYou are not allowed to distribute an incomplete package. The following
- Xfiles should be present in the distribution:
- X.ta 2m 22n
- X.nf
- X alt-gos.rus - ALT to GOSTCII table
- X alt-koi8.rus - ALT to KOI8 table
- X example.alt.uu - uuencoded example in ALT
- X example.ko8.uu - uuencoded example in KOI8
- X example.pho - phonetic transliteration example
- X example.tex - LaTeX example
- X gos-alt.rus - GOSTCII to ALT table
- X gos-koi8.rus - GOSTCII to KOI8 table
- X koi7-8.rus - KOI7 to KOI8 table
- X koi7nl-8.rus - KOI7 (no Latin) to KOI8 table
- X koi8-7.rus - KOI8 to KOI7 table
- X koi8-alt.rus - KOI8 to ALT table
- X koi8-gos.rus - KOI8 to GOSTCII table
- X koi8-lc.rus - KOI8 to Library of Congress table
- X koi8-phg.rus - KOI8 to GOST transliteration
- X koi8-php.rus - KOI8 to Pokrovsky transliteration
- X koi8-tex.rus - KOI8 to LaTeX conversion
- X order.txt - Order form for ordering the program
- X paths.h - Include file for translit.c
- X phg-koi8.rus - GOST transliteration to KOI8
- X pho-8sim.rus - Simple phonetic to KOI8
- X pho-koi8.rus - Various phonetic to KOI8
- X php-koi8.rus - Pokrovsky to KOI8
- X readme.doc - short description of the files
- X reg_exp.c - regular expression code by Henry Spencer
- X reg_exp.h - include for reg_exp.c and reg_sub.c
- X reg_sub.c - regular expression code by H. Spencer
- X tex-koi8.rus - LaTeX to KOI8
- X translit.c - TRANSLIT main program
- X translit.ps - TRANSLIT manual in PostScript
- X translit.1 - TRANSLIT manual in *roff
- X translit.txt - Plain ASCII TRANSLIT manual
- X.sp 1
- X.fi
- X.ti -2m
- X 2) You may expand/change the files and the program and distribute modified
- Xfiles, provided that you do
- Xnot delete anything (you can always comment the unnecessary portions out)
- Xand clearly mark your changes. Please send a copy of the modified
- Xversion to the author, though you are not required to do so.
- XI will give you all the credit for your enhancements. I simply wish that
- Xthere is a single point of distribution for this code, so it is maintained
- Xto some extent. If you create additional transliteration definition files,
- Xplease send them to the author if you can. I will add them to the program
- Xdistribution. I want to fix bugs and expand/optimize this code,
- Xbut I need your help.
- XI need your transliteration files for languages which I do not know or
- Xdo not use currently.
- XYour suggestions for improving documentation are most welcome (I am not
- Xa native English speaker).
- X.ti -2m
- X3) You will not charge money for the program and/or associated files,
- Xexcept for media and copying costs. If you want to sell it, contact the author
- Xfirst. Bear in mind
- Xthat the regular expression package by Henry Spencer has some
- Xcopyright restrictions.
- XBut there are other regular expression packages which do not have these
- Xrestrictions (which are not violated by this offering).
- X.ti -2m
- X4) I will gladly help you with advice on compiling this software and
- Xtry to fix bugs when time allows. However, if you want a ready to run
- Xexecutable, you need to order it for a very nominal fee from
- X\fIJKL ENTERPRISES, INC.\fR as described in the file \fIorder.txt\fR
- Xwhich must be a part of a complete distribution.
- X.in -4m
- X
- X.SH AUTHOR
- XJan Labanowski, P.O. Box 21821, Columbus, OH 43221-0821, USA.
- XE-mail: jkl@osc.edu, JKL@OHSTPY.BITNET.
- X
- END_OF_FILE
- if test 56776 -ne `wc -c <'translit.1'`; then
- echo shar: \"'translit.1'\" unpacked with wrong size!
- fi
- # end of 'translit.1'
- fi
- echo shar: End of archive 1 \(of 10\).
- cp /dev/null ark1isdone
- MISSING=""
- for I in 1 2 3 4 5 6 7 8 9 10 ; do
- if test ! -f ark${I}isdone ; then
- MISSING="${MISSING} ${I}"
- fi
- done
- if test "${MISSING}" = "" ; then
- echo You have unpacked all 10 archives.
- rm -f ark[1-9]isdone ark[1-9][0-9]isdone
- else
- echo You still must unpack the following archives:
- echo " " ${MISSING}
- fi
- exit 0
- exit 0 # Just in case...
-