ProfitPress Mega CDROM2 Shareware Freeware (MSDOS)(1992)(Eng)


Text File | 1990-08-05 | 36.0 KB | 860 lines

@setfilename gptx.info

Copyright @copyright{} 1990 Free Software Foundation, Inc.
Francois Pinard <pinard@@iro.umontreal.ca>, 1988.

$Id$

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 1, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.

@node Top, Usage, , (DIR)
@section @code{gptx} - GNU permuted index generator

This is the GNU prerelease of @code{gptx}, a permuted index generator. The main goal of this prerelease is to provide an @emph{almost} compatible replacement for @code{ptx}, able to handle small files quickly, while providing a platform for more development.

This version reimplements and extends standard @code{ptx}. In particular, it can produce a readable @dfn{KWIC} without the need for @code{nroff}. This program does not repeat all @code{ptx} disposition quirks (but should it really?). Also, this version does not yet handle huge input files, that is, those files which do not fit in memory all at once.

@menu
* Usage::       How to use the program, its options and parameters.
* Regexps::     How a regular expression is written and used.
* ptx mode::    In which ways @code{ptx} mode is different.
* Future::      What are the development lines of this program.
@end menu

@node Usage, Regexps, Top, Top
@subsection How to use this program

This tool reads a text file and essentially produces a permuted index, with each keyword in its context.
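
To make the notion of a permuted index more concrete, here is a minimal sketch in Python. It is only an illustration, not the algorithm @code{gptx} itself uses: the function name @code{kwic}, the fixed context width and the use of Python's @code{\w+} as the keyword pattern are all assumptions made for this example.

```python
import re

def kwic(text, width=30):
    """Return one (left, keyword, right) triple per word, sorted by keyword."""
    entries = []
    for match in re.finditer(r"\w+", text):
        # Keep at most `width` characters of context on each side.
        left = text[:match.start()][-width:]
        right = text[match.end():][:width]
        entries.append((left, match.group(), right))
    # Fold case for sorting only; the keyword itself is left untouched.
    entries.sort(key=lambda entry: entry[1].lower())
    return entries

# Hypothetical usage: right-justify the left context so keywords line up.
for left, keyword, right in kwic("The quick brown fox"):
    print(left.rjust(30) + keyword + right)
```

Each word of the input becomes a keyword in turn and is listed, in alphabetical order, between whatever text precedes and follows it.
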
The calling sketch is one of:

@example
gptx [@var{option}]@dots{} [@var{input}]@dots{} >@var{output}
@end example

or:

@example
ptx [@var{option}]@dots{} [@var{input} [@var{output}]]
@end example

If the program is called as @code{ptx} instead of @code{gptx}, or if the @code{-p} option is selected, this implies @code{ptx} compatibility mode, which disallows extensions, introduces some limitations, and changes several of the program's default option values. @xref{ptx mode}, for a list of differences.

As usual, each option is represented by a hyphen followed by a single letter. Some options require a parameter in the form of a decimal number or a file name, in which case the parameter follows the option after some whitespace. Option letters may be grouped and tied together as a string which follows only one hyphen; if one or several of them require parameters, the parameters should follow the combined options in the order of appearance of the individual letters in the string. Individual options are explained below.

When @emph{not} in @code{ptx} compatibility mode, there may be zero, one or several parameters after the options. If there are no parameters, the program reads the standard input. If there are one or several parameters, they give the names of input files, which are all read in turn, somewhat as if all the input files were concatenated. However, there is a full contextual break between each file; and when automatic referencing is requested, file names and line numbers refer to individual text input files. In all cases, the program produces the permuted index onto the standard output.

When in @code{ptx} compatibility mode, besides the options, there may be zero, one or two parameters. If there are no parameters, the program reads the standard input and produces the permuted index onto the standard output. If there is only one parameter, it names the text file to be read instead of the standard input.
If two parameters are given, they give respectively the name of the file to read and the name of the file to produce.

Note that for @emph{any} file named as the value of an option or as an input text file, a single dash @kbd{-} may be used, in which case standard input is assumed. However, it would not make sense to use this convention more than once per program invocation.

@menu
* General options::     Options which affect general program behaviour.
* Charset selection::   Underlying character set considerations.
* Input processing::    Input fields, contexts, and keyword selection.
* Output formatting::   Types of output format, and sizing the fields.
@end menu

@node General options, Charset selection, , Usage
@subsubsection General options

@table @code

@item -p
This requests @code{ptx} behaviour, as far as we understand it. This option is selected by default when the program is installed under the name @code{ptx}. This option is not available once the program is operating in @code{ptx} compatibility mode.

@item -C
Prints a short note about the Copyright and copying conditions.

@end table

@node Charset selection, Input processing, General options, Usage
@subsubsection Charset selection

As it is set up now, the program assumes that the input file is coded using the 8-bit ISO 8859-1 code, also known as the Latin-1 character set, @emph{unless} it is compiled for MS-DOS, in which case it uses the character set of the IBM-PC. Compared to 7-bit ASCII, the set of characters which are letters is then different, and this alters the behaviour of regular expression matching. Thus, the default regular expression for a keyword allows foreign or diacriticized letters. Keyword sorting, however, is still crude; it obeys the underlying character set ordering quite blindly.

@table @code

@item -f
Fold lower case letters to upper case for sorting.
@end table

@node Input processing, Output formatting, Charset selection, Usage
@subsubsection Word selection

@table @code

@item -b @var{file}
This option is an alternative to option @code{-W} for describing which characters make up words. It introduces the name of a file which contains a list of characters which can@emph{not} be part of a word; this file is called the @dfn{Break file}. Any character which is not part of the Break file is a word constituent. If both options @code{-b} and @code{-W} are specified, then @code{-W} has precedence and @code{-b} is ignored.

In normal mode, the only way to avoid newline as a break character is to write all the break characters in the file with no newline at all, not even at the end of the file. In @code{ptx} compatibility mode, spaces, tabs and newlines are always considered as break characters even if not included in the Break file.

@item -i @var{file}
The file associated with this option contains a list of words which will never be taken as keywords in concordance output. It is called the @dfn{Ignore file}. The file contains exactly one word on each line; the end of line separation of words is not subject to the value of the @code{-S} option.

If not specified, there might be a default Ignore file. Default Ignore files are not necessarily the same in normal mode and in @code{ptx} compatibility mode. Unless changed by the local installation, there is @emph{no} default Ignore file in normal mode, and the Ignore file is @code{/usr/lib/eign} in @code{ptx} compatibility mode. If you want to deactivate a default Ignore file, use @code{/dev/null} instead.

@item -o @var{file}
The file associated with this option contains a list of words which will be retained in concordance output; any word not mentioned in this file is ignored. The file is called the @dfn{Only file}. The file contains exactly one word on each line; the end of line separation of words is not subject to the value of the @code{-S} option.
There is no default for the Only file. When there are both an Only file and an Ignore file, a word is eligible as a keyword only if it is given in the Only file and not given in the Ignore file.

@item -r
On each input line, the leading sequence of non-white characters will be taken to be a reference whose purpose is to identify this input line in the produced permuted index. @xref{Output formatting}, for more information about reference production. Using this option changes the default value for option @code{-S}.

Using this option, the program does not try very hard to remove references from contexts in output, but it succeeds in doing so @emph{when} the context ends exactly at the newline. If option @code{-r} is used with the @code{-S} default value, or when in @code{ptx} compatibility mode, this condition is always met and references are completely excluded from the output contexts.

@item -S @var{regexp}
This option selects which regular expression will describe the end of a line or the end of a sentence. In fact, there is no other distinction between end of lines and end of sentences than the effect of this regular expression, and input line boundaries have no special significance outside this option. By default, in @code{ptx} compatibility mode or if the @code{-r} option is used, end of lines are used; in this case, the @var{regexp} used is very simple:

@example
\n
@end example

In normal mode and if the @code{-r} option is not used, by default, end of sentences are used; the precise @var{regexp} is imported from GNU Emacs:

@example
[.?!][]\"')@}]*\\($\\|\t\\|  \\)[ \t\n]*
@end example

An empty @var{regexp} is equivalent to completely disabling end of line or end of sentence recognition. In this case, the whole file is considered to be a single big line or sentence. The user might want to disallow all truncation flag generation as well, through option @code{-F ""}. On regular expression writing and usage, see @ref{Regexps}.
When the keywords happen to be near the beginning of the input line or sentence, this often creates an unused area at the beginning of the output context line; when the keywords happen to be near the end of the input line or sentence, this often creates an unused area at the end of the output context line. The program tries to fill those unused areas by wrapping around context in them; the tail of the input line or sentence is used to fill the unused area on the left of the output line; the head of the input line or sentence is used to fill the unused area on the right of the output line.

This option is not available when the program is operating in @code{ptx} compatibility mode.

@item -W @var{regexp}
This option selects which regular expression will describe each keyword. By default, in @code{ptx} compatibility mode, a word is anything which ends with a space, a tab or a newline; the @var{regexp} used is @code{[^ \t\n]+}. In normal mode, a word is a sequence of letters; the @var{regexp} used is @code{\w+}.

An empty @var{regexp} is equivalent to not using this option, letting the default take effect. On regular expression writing and usage, see @ref{Regexps}.

This option is not available when the program is operating in @code{ptx} compatibility mode.

@end table

@node Output formatting, , Input processing, Usage
@subsubsection Output formatting

Output format is mainly controlled by the @code{-O} and @code{-T} options, described in the table below. However, when neither @code{-O} nor @code{-T} is selected, and if we are not running in @code{ptx} compatibility mode, the program chooses an output format suited for a dumb terminal. This is the default format when working in normal mode. Each keyword occurrence is output to the center of one line, surrounded by its left and right contexts. Each field is properly justified, so the concordance output can readily be observed.
As a special feature, if automatic references are selected by option @code{-A} and are output before the left context, that is, if option @code{-R} is @emph{not} selected, then a colon is added after the reference; this nicely interfaces with GNU Emacs @code{next-error} processing. In this default output format, each white space character, like newline and tab, is merely changed to exactly one space, with no special attempt to compress consecutive spaces. This might change in the future. Except for those white space characters, every other character of the underlying set of 256 characters is transmitted verbatim.

Output format is further controlled by the following options.

@table @code

@item -g @var{number}
Select the size of the minimum white gap between the fields on the output line.

@item -w @var{number}
Select the output maximum width of each final line. If references are used, they are included or excluded from the output maximum width depending on the value of option @code{-R}. If option @code{-R} is not selected, that is, when references are output before the left context, the output maximum width takes into account the maximum length of all references. If option @code{-R} is selected, that is, when references are output after the right context, the output maximum width does not take into account the space taken by references, nor the gap that precedes them.

@item -A
Select automatic references. Each input line will have an automatic reference made up of the file name, an open parenthesis, the line ordinal and a close parenthesis. However, the file name will be empty when standard input is being read. If both @code{-A} and @code{-r} are selected, then the input reference is still read and skipped, but the automatic reference is used at output time, overriding the input reference.

This option is not available when the program is operating in @code{ptx} compatibility mode.
@item -R
In default output format, when option @code{-R} is not used, any reference produced by the effect of options @code{-r} or @code{-A} is given at the beginning of each output line, before the left context. In default output format, when option @code{-R} is specified, references are rather given to the far right of output lines, after the right context. For any other output format, option @code{-R} is almost ignored, except for the fact that the width of references is @emph{not} taken into account in the total output width given by @code{-w} whenever @code{-R} is selected.

This option is not explicitly selectable when the program is operating in @code{ptx} compatibility mode. However, in this case, it is always implicitly selected.

@item -F @var{string}
This option will request that any truncation in the output be reported using the string @var{string}. Most output fields theoretically extend towards the beginning or the end of the current line, or current sentence, as selected with option @code{-S}. But there is a maximum allowed output line width, changeable through option @code{-w}, which is further divided into space for various output fields. When a field has to be truncated because it cannot extend to the beginning or the end of the current line and still fit in its allotted space, a truncation occurs. By default, the string used is a single slash, as in @code{-F /}.

@var{string} may have more than one character, as in @code{-F ...}. Also, in the particular case where @var{string} is empty (@code{-F ""}), truncation flagging is disabled, and no truncation marks are appended.

This option is not available when the program is operating in @code{ptx} compatibility mode.

@item -O
Choose an output format suitable for @code{nroff} or @code{troff} processing.
Each output line will look like:

@example
.xx "@var{tail}" "@var{before}" "@var{keyword_and_after}" "@var{head}" "@var{ref}"
@end example

so it will be possible to write an @samp{.xx} roff macro to take care of the output typesetting. This is the default output format when working in @code{ptx} compatibility mode.

In this output format, each non-graphical character, like newline and tab, is merely changed to exactly one space, with no special attempt to compress consecutive spaces. Each quote character @kbd{"} is doubled so it will be correctly processed by @code{nroff} or @code{troff}. All characters having their eighth bit set are turned into spaces in this version. Diacriticized characters should eventually be correctly expressed in @code{roff} terms, once I learn how to do this. So, let me know how to improve this special character processing.

This option is not available when the program is operating in @code{ptx} compatibility mode. In fact, it then becomes the default and sole output format.

@item -T
Choose an output format suitable for @TeX{} processing. Each output line will look like:

@example
\xx @{@var{tail}@}@{@var{before}@}@{@var{keyword}@}@{@var{after}@}@{@var{head}@}@{@var{ref}@}
@end example

@noindent
so it will be possible to write a @code{\xx} definition to take care of the output typesetting. Note that when references are not being produced, that is, when neither option @code{-A} nor option @code{-r} is selected, the last parameter of each @code{\xx} call is inhibited.

In this output format, some special characters, like @kbd{$}, @kbd{%}, @kbd{&}, @kbd{#} and @kbd{_} are automatically protected with a backslash. Curly brackets @kbd{@{}, @kbd{@}} are also protected with a backslash, but also enclosed in a pair of dollar signs to force mathematical mode. The backslash itself produces the sequence @code{\backslash@{@}}. Circumflex and tilde diacritics produce the sequences @code{^\@{ @}} and @code{~\@{ @}} respectively.
Other diacriticized characters of the underlying character set produce an appropriate @TeX{} sequence as far as possible. The other non-graphical characters, like newline and tab, and all other characters which are not part of ASCII, are merely changed to exactly one space, with no special attempt to compress consecutive spaces. Let me know how to improve this special character processing for @TeX{}.

This option is not available when the program is operating in @code{ptx} compatibility mode.

@end table

@node Regexps, ptx mode, Usage, Top
@subsection Syntax of Regular Expressions

@c This node is taken from the GNU emacs 18.55 manual. The best thing
@c would be that it is linked from here. But, to obviate various usages
@c in installation, it is simpler to take a mere copy of it for now.
@c
@c I also removed \s@var{code} and \S@var{code} documentation, which is
@c not compiled into regex.c, and the reference to GNU emacs Syntax
@c node. Some references to the Emacs buffer should be changed too.

Regular expressions have a syntax in which a few characters are special constructs and the rest are @dfn{ordinary}. An ordinary character is a simple regular expression which matches that character and nothing else. The special characters are @samp{$}, @samp{^}, @samp{.}, @samp{*}, @samp{+}, @samp{?}, @samp{[}, @samp{]} and @samp{\}; no new special characters will be defined. Any other character appearing in a regular expression is ordinary, unless a @samp{\} precedes it.@refill

For example, @samp{f} is not a special character, so it is ordinary, and therefore @samp{f} is a regular expression that matches the string @samp{f} and no other string. (It does @i{not} match the string @samp{ff}.) Likewise, @samp{o} is a regular expression that matches only @samp{o}.@refill

Any two regular expressions @var{a} and @var{b} can be concatenated.
The result is a regular expression which matches a string if @var{a} matches some amount of the beginning of that string and @var{b} matches the rest of the string.@refill

As a simple example, we can concatenate the regular expressions @samp{f} and @samp{o} to get the regular expression @samp{fo}, which matches only the string @samp{fo}. Still trivial. To do something nontrivial, you need to use one of the special characters. Here is a list of them.

@table @kbd
@item .@: @r{(Period)}
is a special character that matches any single character except a newline. Using concatenation, we can make regular expressions like @samp{a.b} which matches any three-character string which begins with @samp{a} and ends with @samp{b}.@refill

@item *
is not a construct by itself; it is a suffix, which means the preceding regular expression is to be repeated as many times as possible. In @samp{fo*}, the @samp{*} applies to the @samp{o}, so @samp{fo*} matches one @samp{f} followed by any number of @samp{o}s. The case of zero @samp{o}s is allowed: @samp{fo*} does match @samp{f}.@refill

@samp{*} always applies to the @i{smallest} possible preceding expression. Thus, @samp{fo*} has a repeating @samp{o}, not a repeating @samp{fo}.@refill

The matcher processes a @samp{*} construct by matching, immediately, as many repetitions as can be found. Then it continues with the rest of the pattern. If that fails, backtracking occurs, discarding some of the matches of the @samp{*}-modified construct in case that makes it possible to match the rest of the pattern. For example, matching @samp{ca*ar} against the string @samp{caaar}, the @samp{a*} first tries to match all three @samp{a}s; but the rest of the pattern is @samp{ar} and there is only @samp{r} left to match, so this try fails. The next alternative is for @samp{a*} to match only two @samp{a}s.
With this choice, the rest of the regexp matches successfully.@refill

@item +
is a suffix character similar to @samp{*} except that it requires that the preceding expression be matched at least once. So, for example, @samp{ca+r} will match the strings @samp{car} and @samp{caaaar} but not the string @samp{cr}, whereas @samp{ca*r} would match all three strings.@refill

@item ?
is a suffix character similar to @samp{*} except that it can match the preceding expression either once or not at all. For example, @samp{ca?r} will match @samp{car} or @samp{cr}; nothing else.

@item [ @dots{} ]
@samp{[} begins a @dfn{character set}, which is terminated by a @samp{]}. In the simplest case, the characters between the two form the set. Thus, @samp{[ad]} matches either one @samp{a} or one @samp{d}, and @samp{[ad]*} matches any string composed of just @samp{a}s and @samp{d}s (including the empty string), from which it follows that @samp{c[ad]*r} matches @samp{cr}, @samp{car}, @samp{cdr}, @samp{caddaar}, etc.@refill

Character ranges can also be included in a character set, by writing two characters with a @samp{-} between them. Thus, @samp{[a-z]} matches any lower-case letter. Ranges may be intermixed freely with individual characters, as in @samp{[a-z$%.]}, which matches any lower case letter or @samp{$}, @samp{%} or period.@refill

Note that the usual special characters are not special any more inside a character set. A completely different set of special characters exists inside character sets: @samp{]}, @samp{-} and @samp{^}.@refill

To include a @samp{]} in a character set, you must make it the first character. For example, @samp{[]a]} matches @samp{]} or @samp{a}. To include a @samp{-}, write @samp{---}, which is a range containing only @samp{-}. To include @samp{^}, make it other than the first character in the set.@refill

@item [^ @dots{} ]
@samp{[^} begins a @dfn{complement character set}, which matches any character except the ones specified.
Thus, @samp{[^a-z0-9A-Z]} matches all characters @i{except} letters and digits.@refill

@samp{^} is not special in a character set unless it is the first character. The character following the @samp{^} is treated as if it were first (@samp{-} and @samp{]} are not special there).

Note that a complement character set can match a newline, unless newline is mentioned as one of the characters not to match.

@item ^
is a special character that matches the empty string, but only if at the beginning of a line in the text being matched. Otherwise it fails to match anything. Thus, @samp{^foo} matches a @samp{foo} which occurs at the beginning of a line.

@item $
is similar to @samp{^} but matches only at the end of a line. Thus, @samp{xx*$} matches a string of one @samp{x} or more at the end of a line.

@item \
has two functions: it quotes the special characters (including @samp{\}), and it introduces additional special constructs.

Because @samp{\} quotes special characters, @samp{\$} is a regular expression which matches only @samp{$}, and @samp{\[} is a regular expression which matches only @samp{[}, and so on.@refill
@end table

Note: for historical compatibility, special characters are treated as ordinary ones if they are in contexts where their special meanings make no sense. For example, @samp{*foo} treats @samp{*} as ordinary since there is no preceding expression on which the @samp{*} can act. It is poor practice to depend on this behavior; better to quote the special character anyway, regardless of where it appears.@refill

For the most part, @samp{\} followed by any character matches only that character. However, there are several exceptions: characters which, when preceded by @samp{\}, are special constructs. Such characters are always ordinary when encountered on their own. Here is a table of @samp{\} constructs.

@table @kbd
@item \|
specifies an alternative.
Two regular expressions @var{a} and @var{b} with @samp{\|} in between form an expression that matches anything that either @var{a} or @var{b} will match.@refill

Thus, @samp{foo\|bar} matches either @samp{foo} or @samp{bar} but no other string.@refill

@samp{\|} applies to the largest possible surrounding expressions. Only a surrounding @samp{\( @dots{} \)} grouping can limit the grouping power of @samp{\|}.@refill

Full backtracking capability exists to handle multiple uses of @samp{\|}.

@item \( @dots{} \)
is a grouping construct that serves three purposes:

@enumerate
@item
To enclose a set of @samp{\|} alternatives for other operations. Thus, @samp{\(foo\|bar\)x} matches either @samp{foox} or @samp{barx}.

@item
To enclose a complicated expression for the postfix @samp{*} to operate on. Thus, @samp{ba\(na\)*} matches @samp{bananana}, etc., with any (zero or more) number of @samp{na} strings.@refill

@item
To mark a matched substring for future reference.

@end enumerate

This last application is not a consequence of the idea of a parenthetical grouping; it is a separate feature which happens to be assigned as a second meaning to the same @samp{\( @dots{} \)} construct because there is no conflict in practice between the two meanings. Here is an explanation of this feature:

@item \@var{digit}
after the end of a @samp{\( @dots{} \)} construct, the matcher remembers the beginning and end of the text matched by that construct. Then, later on in the regular expression, you can use @samp{\} followed by @var{digit} to mean ``match the same text matched the @var{digit}'th time by the @samp{\( @dots{} \)} construct.''@refill

The strings matching the first nine @samp{\( @dots{} \)} constructs appearing in a regular expression are assigned numbers 1 through 9 in the order in which the open-parentheses appear in the regular expression. @samp{\1} through @samp{\9} may be used to refer to the text matched by the corresponding @samp{\( @dots{} \)} construct.
For example, @samp{\(.*\)\1} matches any newline-free string that is composed of two identical halves. The @samp{\(.*\)} matches the first half, which may be anything, but the @samp{\1} that follows must match the same exact text.

@item \`
matches the empty string, provided it is at the beginning of the buffer.

@item \'
matches the empty string, provided it is at the end of the buffer.

@item \b
matches the empty string, provided it is at the beginning or end of a word. Thus, @samp{\bfoo\b} matches any occurrence of @samp{foo} as a separate word. @samp{\bballs?\b} matches @samp{ball} or @samp{balls} as a separate word.@refill

@item \B
matches the empty string, provided it is @i{not} at the beginning or end of a word.

@item \<
matches the empty string, provided it is at the beginning of a word.

@item \>
matches the empty string, provided it is at the end of a word.

@item \w
matches any word-constituent character. The editor syntax table determines which characters these are.

@item \W
matches any character that is not a word-constituent.
@end table

Here is a complicated regexp, used by Emacs to recognize the end of a sentence together with any whitespace that follows. It is given in Lisp syntax to enable you to distinguish the spaces from the tab characters. In Lisp syntax, the string constant begins and ends with a double-quote. @samp{\"} stands for a double-quote as part of the regexp, @samp{\\} for a backslash as part of the regexp, @samp{\t} for a tab and @samp{\n} for a newline.

@example
"[.?!][]\"')]*\\($\\|\t\\|  \\)[ \t\n]*"
@end example

@noindent
This contains four parts in succession: a character set matching period, @samp{?} or @samp{!}; a character set matching close-brackets, quotes or parentheses, repeated any number of times; an alternative in backslash-parentheses that matches end-of-line, a tab or two spaces; and a character set matching whitespace characters, repeated any number of times.
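
As an illustration, the sentence-end regexp above might be tried out as follows. This is a sketch in Python, not part of @code{gptx}: Python writes grouping and alternation as @samp{( | )} rather than @samp{\( \| \)}, the @code{re.MULTILINE} flag is assumed so that @samp{$} matches at every end of line, and the helper name @code{split_sentences} is hypothetical.

```python
import re

# Python spelling of the Emacs sentence-end regexp. The close-bracket is
# escaped inside the character set, and the alternative lists
# end-of-line, a tab, or two spaces.
SENTENCE_END = re.compile(r"""[.?!][\]"')]*($|\t|  )[ \t\n]*""", re.MULTILINE)

def split_sentences(text):
    """Split text after each match of SENTENCE_END, dropping empty pieces."""
    pieces = []
    start = 0
    for match in SENTENCE_END.finditer(text):
        pieces.append(text[start:match.end()].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        pieces.append(tail)
    return pieces

print(split_sentences("Is it late?  Yes.\nGo home, Mr. Smith!"))
# prints ['Is it late?', 'Yes.', 'Go home, Mr. Smith!']
```

Note that the period in @samp{Mr.} does not end a sentence here, since it is followed by a single space rather than by two spaces, a tab or an end of line.
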
@node ptx mode, Future, Regexps, Top
@subsection @code{ptx} compatibility mode

This section outlines the differences between this program and standard @code{ptx}. There is also a @code{ptx} compatibility mode in this program which is activated implicitly when the program is called under the name @code{ptx} or explicitly through the use of option @code{-p}. For someone used to standard @code{ptx}, here are some points worth noticing when not using @code{ptx} compatibility mode:

@itemize @bullet

@item
In normal mode, concordance output is not formatted for @code{troff} or @code{nroff}. By default, output is rather formatted for a dumb terminal. @code{troff} or @code{nroff} output may still be selected through option @code{-O}.

@item
In normal mode, unless the @code{-R} option is used, the maximum reference width is subtracted from the total output line width. In @code{ptx} compatibility mode, the width of references is not taken into account in the output line width computations.

@item
In normal mode, all 256 characters, even @kbd{NUL}s, are read and processed from the input file with no adverse effect. No attempt is made to limit this in @code{ptx} compatibility mode. However, standard @code{ptx} does not accept 8-bit characters, a few control characters are rejected, and the tilde @kbd{~} is condemned.

@item
In normal mode, input lines may be of unlimited length. No attempt is made to limit this in @code{ptx} compatibility mode. However, standard @code{ptx} processes only the first 200 characters in each line.

@item
In normal mode, the break (non-word) characters default to every character except letters. In @code{ptx} compatibility mode, the break characters default to space, tab and newline only.

@item
In some circumstances, output lines are filled a little more completely in normal mode than in @code{ptx} compatibility mode. Even in @code{ptx} mode, there are some slight disposition glitches this program does not completely reproduce, even if it comes quite close.
@item
The Ignore file default in @code{ptx} compatibility mode is not the same as in normal mode. In the default installation, the default Ignore file is @file{/usr/lib/eign} in @code{ptx} compatibility mode, and there is none in normal mode.

@item
Standard @code{ptx} disallows specifying both the Ignore file and the Only file at the same time. This version allows both, and specifying an Only file does not inhibit processing the Ignore file.

@end itemize

@node Future, , ptx mode, Top
@subsection Development guidelines

This should evolve towards a concordance package for GNU, able to tackle true, real, big concordance jobs, while being fast and easy to use for little jobs. The starting point is standard @code{ptx}.

Because several packages of this kind are awfully slow, I should reasonably try to keep speed in mind. On the other hand, I do not want to burden myself too much about interactive query for now; so, a future reorientation along this topic might require some work.

Here is a @emph{What To Do Next} list, in expected execution order.

@enumerate

@item
Increase short term usability:

@itemize @bullet

@item
Support the program for the GNU community. As directed by user comments, test and debug the whole thing more fully, and on bigger examples. Solve portability glitches as long as this does not induce too ugly things in the code.

@item
Provide sample macros in the documentation.

@item
Understand and mimic the @code{-t} option, if I can.

@item
See how TeX mode could be made more useful, and if a texinfo mode would mean something to someone.

@item
Sort keywords intelligently for Latin-1 code. See how to interface this character set with various output formats. Also, introduce options to inverse-sort and possibly to reverse-sort.

@item
Improve speed for Ignore and Only tables. Consider hashing instead of sorting. Consider playing with obstacks to digest them.
@item
Provide better handling of format effectors obtained from input, and also attempt white space compression on output which would still maximize full output width usage.

@end itemize

@item
Provide multiple language support. Most of the boosting work should go along the line of fast recognition of multiple and complex boundaries, which define various `languages'. Each such language has its own rules for words, sentences, paragraphs, and reporting requests. This is less difficult than I first thought:

@itemize @bullet

@item
Learn how to use getopt, or write something if necessary. Recognize language modifiers with each option. At least @code{-b}, @code{-i}, @code{-o}, @code{-W}, @code{-S}, and also new language switcher options, will have such modifiers. Modifiers on language switchers will allow or disallow language transitions.

@item
Complete the transformation of underlying variables into arrays in the code.

@item
Implement a heap of positions in the input file. There is one entry in the heap for each compiled regexp; it is initialized by a @code{re_search} after each regexp compile. Regexps reschedule themselves in the heap when their position passes while scanning input. In this way, looking simultaneously for a lot of regexps should not be too inefficient, once the scanning starts. If this works ok, maybe consider accepting regexps in Only and Ignore tables.

@item
Merge boundary processing options with language processing, really integrating @code{-S} processing as a special case. Maybe, implement several levels of boundaries. See how to implement a stack of languages, for handling quotations. See if more sophisticated references could be handled as another special case of a language.

@end itemize

@item
Tackle other aspects, in a more long term view:

@itemize @bullet

@item
Add options for statistics, frequency lists, referencing, and all other prescreening tools and subsidiary tasks of concordance production.

@item
Develop an interactive mode. Even better, construct a GNU emacs interface.
I'm looking at the suffix arrays of Gene Myers <gene@@cs.arizona.edu> as a possible implementation along those ideas.

@item
Implement hooks so that word classification and tagging can be merged in. See how to effectively hook in lemmatisation or other morphological features. It is far from clear by now how to interface this correctly, so some experimentation is mandatory.

@item
Profile and speed up the whole thing.

@item
Make it work on small address space machines. Consider three levels of hugeness for files, and three corresponding algorithms to make optimal use of memory. The first case is when all the input files and all the word references fit in memory: this is the case currently implemented. The second case is when the files cannot fit all together in memory, but the word references do. The third case is when even the word references cannot fit in memory.

@item
There also are subsidiary developments for in-core incremental sort routines as well as for a huge external sort package. The need for more flexible sort packages comes partly from the fact that linguists use kinds of keys which compare in unusual and more sophisticated ways.

@end itemize

@end enumerate
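
The heap of positions proposed in the list above for multiple language support might look like the following sketch. It describes a to-do item, not existing code. It is written in Python, with the standard @code{heapq} and @code{re} modules standing in for the C regexp library and its @code{re_search}; the function name @code{scan_boundaries} and the shape of its results are hypothetical.

```python
import heapq
import re

def scan_boundaries(text, patterns):
    """Yield (position, pattern_index, matched_text) in input order."""
    compiled = [re.compile(pattern) for pattern in patterns]
    heap = []
    # One heap entry per regexp, initialised by a first search.
    for index, regexp in enumerate(compiled):
        match = regexp.search(text)
        if match:
            heapq.heappush(heap, (match.start(), index, match.end()))
    while heap:
        # The entry with the smallest position is the next boundary.
        start, index, end = heapq.heappop(heap)
        yield start, index, text[start:end]
        # Reschedule this regexp past the match just reported; step at
        # least one character so an empty match cannot loop forever.
        match = compiled[index].search(text, max(end, start + 1))
        if match:
            heapq.heappush(heap, (match.start(), index, match.end()))

# Hypothetical usage: letters and digit runs as two kinds of boundary.
for hit in scan_boundaries("a1b22c", [r"[a-z]", r"[0-9]+"]):
    print(hit)
```

Each regexp is searched again only when its own match has been consumed, so at any moment the heap holds at most one pending position per regexp.
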