- @setfilename gptx.info
- Copyright @copyright{} 1990 Free Software Foundation, Inc.
- Francois Pinard <pinard@@iro.umontreal.ca>, 1988.
-
- $Id$
-
- This program is free software; you can redistribute it and/or modify
- it under the terms of the GNU General Public License as published by
- the Free Software Foundation; either version 1, or (at your option)
- any later version.
-
- This program is distributed in the hope that it will be useful, but
- WITHOUT ANY WARRANTY; without even the implied warranty of
- MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
- General Public License for more details.
-
- You should have received a copy of the GNU General Public License
- along with this program; if not, write to the Free Software
- Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
-
-
- @node Top, Usage, , (DIR)
- @section @code{gptx} - GNU permuted index generator
-
- This is the GNU prerelease of @code{gptx}, a permuted index generator.
- This prerelease has the main goal of providing a @code{ptx}
- @emph{almost} compatible replacement, able to handle small files
- quickly, while providing a platform for more development.
-
- This version reimplements and extends standard @code{ptx}. In
- particular, it can produce a readable @dfn{KWIC} without the need of
- @code{nroff}. This program does not reproduce all of the layout quirks
- of @code{ptx} (but should it really?). Also, this version does not yet
- handle huge input files, that is, files which do not fit in memory
- all at once.
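-
- To make the idea of a permuted index concrete, here is a minimal Python
- sketch. It is @emph{not} how @code{gptx} is implemented; it only shows
- the general shape of a @dfn{KWIC} listing. The function name
- @code{kwic}, the @code{width} parameter, and the rule that a keyword is
- any run of ASCII letters are all assumptions made for illustration.

```python
import re

def kwic(lines, width=30):
    """Return one output line per keyword occurrence, sorted by keyword."""
    entries = []
    for line in lines:
        # Assumption: a keyword is any run of ASCII letters.
        for match in re.finditer(r"[A-Za-z]+", line):
            left = line[:match.start()].rstrip()
            right = line[match.start():]  # the keyword, then its right context
            entries.append((match.group().lower(), left, right))
    # list.sort() is stable, so equal keywords keep their input order.
    entries.sort(key=lambda entry: entry[0])
    return [f"{left[-width:]:>{width}}  {right[:width]}"
            for _, left, right in entries]


for output_line in kwic(["the cat sat"], width=12):
    print(output_line)
```

- Each keyword starts in the same column, with the left context
- right-justified before it, so keywords can be scanned vertically.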
-
- @menu
- * Usage:: How to use the program, its options and parameters.
- * Regexps:: How a regular expression is written and used.
- * ptx mode:: In which ways @code{ptx} mode is different.
- * Future:: What are the development lines of this program.
- @end menu
-
-
- @node Usage, Regexps, Top, Top
- @subsection How to use this program
-
- This tool reads a text file and essentially produces a permuted index, with
- each keyword in its context. The calling sketch is one of:
-
- @example
- gptx [@var{option}]@dots{} [@var{input}]@dots{} >@var{output}
- @end example
-
- or:
-
- @example
- ptx [@var{option}]@dots{} [@var{input} [@var{output}]]
- @end example
-
- If the program is called as @code{ptx} instead of @code{gptx}, or if
- the @code{-p} option is selected, this implies @code{ptx} compatibility
- mode, disallowing extensions, introducing some limitations, and changing
- several of the program's default option values. @xref{ptx mode}, for
- a list of differences.
-
- As usual, each option is represented by a hyphen followed by a single
- letter. Some options require a parameter in the form of a decimal number
- or a file name, in which case the parameter follows the option after some
- whitespace. Option letters may be grouped and tied together as a string
- which follows only one hyphen; if one or several of them require
- parameters, these should follow the combined options in the order of
- appearance of individual letters in the string. Individual options are
- explained below.
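-
- The grouping rule can be sketched in Python as follows. This is only an
- illustration of the rule as stated above, not the parser used by the
- program. The name @code{parse_options} is hypothetical, and the set of
- letters taking a parameter is copied from the option tables of this
- manual.

```python
# Option letters that take a parameter, per the tables in this manual.
WITH_PARAM = set("bioSWgwF")

def parse_options(args):
    """Split args into (letter, parameter-or-None) pairs and operands."""
    args = list(args)
    options = []
    # A lone "-" is an operand (standard input), not an option group.
    while args and args[0].startswith("-") and len(args[0]) > 1:
        letters = args.pop(0)[1:]
        for letter in letters:
            if letter in WITH_PARAM:
                if not args:
                    raise ValueError(f"option -{letter} requires a parameter")
                # Parameters are consumed in the order the letters appear.
                options.append((letter, args.pop(0)))
            else:
                options.append((letter, None))
    return options, args


print(parse_options(["-fwg", "72", "3", "input.txt"]))
```

- Here @code{-fwg 72 3} is read as @code{-f}, @code{-w 72} and
- @code{-g 3}, leaving @file{input.txt} as the input file.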
-
- When @emph{not} in @code{ptx} compatibility mode, there may be zero, one
- or several parameters after the options. If there are no parameters, the
- program reads the standard input. If there is one or several
- parameters, they give the names of input files, which are all read in
- turn, a little as if all the input files were concatenated. However,
- there is a full contextual break between each file; and when automatic
- referencing is requested, file names and line numbers refer to
- individual text input files. In all cases, the program produces the
- permuted index onto the standard output.
-
- When in @code{ptx} compatibility mode, besides the options, there may be
- zero, one or two parameters. If there are no parameters, the program
- reads the standard input and produces the permuted index onto the
- standard output. If there is only one parameter, it names the text file
- to be read instead of the standard input. If two parameters are given,
- they give respectively the name of the file to read and the name of the
- file to produce.
-
- Note that for @emph{any} file named as the value of an option or as an
- input text file, a single dash @kbd{-} may be used, in which case
- standard input is assumed. However, it would not make sense to use this
- convention more than once per program invocation.
-
-
- @menu
- * General options:: Options which affect general program behaviour.
- * Charset selection:: Underlying character set considerations.
- * Input processing:: Input fields, contexts, and keyword selection.
- * Output formatting:: Types of output format, and sizing the fields.
- @end menu
-
-
- @node General options, Charset selection, , Usage
- @subsubsection General options
-
- @table @code
-
- @item -p
- This requests @code{ptx} behaviour, as far as we understand it. This
- option is selected by default when the program is installed under the
- name @code{ptx}.
-
- This option is not available once the program is operating in @code{ptx}
- compatibility mode.
-
- @item -C
- Prints a short note about the Copyright and copying conditions.
-
- @end table
-
-
- @node Charset selection, Input processing, General options , Usage
- @subsubsection Charset selection
-
- As it is set up now, the program assumes that the input file is coded
- using 8-bit ISO 8859-1 code, also known as the Latin-1 character set,
- @emph{unless} it is compiled for MS-DOS, in which case it uses the
- character set of the IBM-PC. Compared to 7-bit ASCII, the set of
- characters which are letters is then different, and this fact alters the
- behaviour of regular expression matching. Thus, the default regular
- expression for a keyword allows foreign or diacriticized letters.
- Keyword sorting, however, is still crude; it obeys the underlying
- character set ordering quite blindly.
-
- @table @code
-
- @item -f
- Fold lower case letters to upper case for sorting.
-
- @end table
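-
- The following Python sketch illustrates what sorting by the underlying
- character set ordering means, and how folding changes it. It relies on
- the fact that Python compares strings by code point and that Latin-1
- codes coincide with the first 256 code points. The function name
- @code{sort_keywords} is hypothetical, and Python's own case conversion
- stands in for whatever folding the program performs.

```python
def sort_keywords(words, fold=False):
    """Sort by character code; with fold=True, compare upper-cased forms."""
    def key(word):
        return word.upper() if fold else word
    return sorted(words, key=key)


words = ["apple", "Zebra", "été"]
print(sort_keywords(words))
print(sort_keywords(words, fold=True))
```

- Without folding, every upper case ASCII letter sorts before every lower
- case one. In both cases the accented word comes last, which is the
- crude ordering mentioned above.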
-
-
- @node Input processing, Output formatting, Charset selection, Usage
- @subsubsection Word selection
-
- @table @code
-
- @item -b @var{file}
-
- This option is an alternative to option @code{-W} for describing
- which characters make up words. This option introduces the name of a
- file which contains a list of characters which can@emph{not} be part of
- one word; this file is called the @dfn{Break file}. Any character which
- is not part of the Break file is a word constituent. If both options
- @code{-b} and @code{-W} are specified, then @code{-W} has precedence and
- @code{-b} is ignored.
-
- In normal mode, the only way to avoid newline as a break character is to
- write all the break characters in the file with no newline at all, not
- even at the end of the file. In @code{ptx} compatibility mode, spaces,
- tabs and newlines are always considered as break characters even if not
- included in the Break file.
-
- @item -i @var{file}
-
- The file associated with this option contains a list of words which will
- never be taken as keywords in concordance output. It is called the
- @dfn{Ignore file}. The file contains exactly one word in each line; the
- end of line separation of words is not subject to the value of the
- @code{-S} option.
-
- If not specified, there might be a default Ignore file. Default Ignore
- files are not necessarily the same in normal mode or in @code{ptx}
- compatibility mode. Unless changed by the local installation, there is
- @emph{no} default Ignore file in normal mode, and the Ignore file is
- @code{/usr/lib/eign} in @code{ptx} compatibility mode. If you want to
- deactivate a default Ignore file, use @code{/dev/null} instead.
-
- @item -o @var{file}
-
- The file associated with this option contains a list of words which will
- be retained in concordance output; any word not mentioned in this file
- is ignored. The file is called the @dfn{Only file}. The file contains
- exactly one word in each line; the end of line separation of words is
- not subject to the value of the @code{-S} option.
-
- There is no default for the Only file. When both an Only file and an
- Ignore file are given, a word is considered as a keyword
- only if it is given in the Only file and not given in the Ignore file.
-
- @item -r
- On each input line, the leading sequence of non-whitespace characters will be
- taken to be a reference that has the purpose of identifying this input
- line in the produced permuted index. @xref{Output formatting}, for
- more information about reference production. Using this option changes
- the default value for option @code{-S}.
-
- Using this option, the program does not try very hard to remove
- references from contexts in output, but it succeeds in doing so
- @emph{when} the context ends exactly at the newline. If option
- @code{-r} is used with @code{-S} default value, or when in @code{ptx}
- compatibility mode, this condition is always met and references are
- completely excluded from the output contexts.
-
- @item -S @var{regexp}
- This option selects which regular expression will describe the end of a
- line or the end of a sentence. In fact, there is other distinction
- between end of lines or end of sentences than the effect of this regular
- expression, and input line boundaries have no special significance
- outside this option. By default, in @code{ptx} compatibility mode or if
- @code{-r} option is used, end of lines are used; in this case, the
- @var{regexp} used is very simple:
-
- @example
- \n
- @end example
-
- In normal mode and if @code{-r} option is not used, by default, end of
- sentences are used; the precise @var{regexp} is imported from GNU Emacs:
-
- @example
- [.?!][]\"')@}]*\\($\\|\t\\| \\)[ \t\n]*
- @end example
-
- An empty @var{regexp} is equivalent to completely disabling end of line or end
- of sentence recognition. In this case, the whole file is considered to
- be a single big line or sentence. The user might want to disallow all
- truncation flag generation as well, through option @code{-F ""}. On
- regular expression writing and usage, see @ref{Regexps}.
-
- When the keywords happen to be near the beginning of the input line or
- sentence, this often creates an unused area at the beginning of the
- output context line; when the keywords happen to be near the end of the
- input line or sentence, this often creates an unused area at the end of
- the output context line. The program tries to fill those unused areas
- by wrapping around context in them; the tail of the input line or
- sentence is used to fill the unused area on the left of the output line;
- the head of the input line or sentence is used to fill the unused area
- on the right of the output line.
-
- This option is not available when the program is operating in @code{ptx}
- compatibility mode.
-
- @item -W @var{regexp}
- This option selects which regular expression will describe each keyword.
- By default, in @code{ptx} compatibility mode, a word is anything which
- ends with a space, a tab or a newline; the @var{regexp} used is @code{[^
- \t\n]+}.
-
- In normal mode, a word is a sequence of letters; the
- @var{regexp} used is @code{\w+}.
-
- An empty @var{regexp} is equivalent to not using this option, letting the
- default apply. On regular expression writing and usage, see
- @ref{Regexps}.
-
- This option is not available when the program is operating in @code{ptx}
- compatibility mode.
-
- @end table
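-
- The interplay of the Only file and the Ignore file described under
- @code{-i} and @code{-o} can be summarised by the Python sketch below.
- The names @code{load_word_list} and @code{is_keyword} are hypothetical,
- and exact, case-sensitive comparison of words is an assumption: this
- manual does not say how the program compares them.

```python
def load_word_list(text):
    """Parse the contents of an Only or Ignore file: one word per line."""
    return {line for line in text.split("\n") if line}

def is_keyword(word, only=None, ignore=None):
    """Apply the Only list first, then the Ignore list."""
    if only is not None and word not in only:
        return False  # an Only file is given and does not list the word
    if ignore is not None and word in ignore:
        return False  # the Ignore file lists the word
    return True


ignore = load_word_list("the\na\n")
only = {"cat", "the"}
print([w for w in ["the", "cat", "dog"] if is_keyword(w, only, ignore)])
```

- Passing @code{None} for either list models the case where the
- corresponding file is not in use.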
-
-
- @node Output formatting, , Input processing, Usage
- @subsubsection Output formatting
-
- Output format is mainly controlled by @code{-O} and @code{-T} options,
- described in the table below. However, when neither @code{-O} nor
- @code{-T} is selected, and if we are not running in @code{ptx}
- compatibility mode, the program chooses an output format suited for a
- dumb terminal. This is the default format when working in normal mode.
- Each keyword occurrence is output to the center of one line, surrounded
- by its left and right contexts. Each field is properly justified, so
- the concordance output can readily be observed. As a special feature,
- if automatic references are selected by option @code{-A} and are output
- before the left context, that is, if option @code{-R} is @emph{not}
- selected, then a colon is added after the reference; this nicely
- interfaces with GNU Emacs @code{next-error} processing. In this default
- output format, each white space character, like newline and tab, is
- merely changed to exactly one space, with no special attempt to compress
- consecutive spaces. This might change in the future. Except for those
- white space characters, every other character of the underlying set of
- 256 characters is transmitted verbatim.
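-
- As a rough illustration of this default format, the Python sketch below
- flattens white space and lays out one line with a reference placed
- before the left context. It is not the program's layout algorithm. The
- names @code{flatten_whitespace} and @code{format_line}, the set of
- characters counted as white space, and the @code{half} and @code{gap}
- values are assumptions; the reference is written as a file name followed
- by a line number in parentheses, the form described under @code{-A}.

```python
WHITE = " \t\n\v\f\r"  # assumed set of white space characters

def flatten_whitespace(text):
    """Turn each white space character into one space, without compressing."""
    return "".join(" " if ch in WHITE else ch for ch in text)

def format_line(left, right, reference=None, half=20, gap=2):
    """Right-justify the left context, then the keyword and right context."""
    left = flatten_whitespace(left)[-half:].rjust(half)
    right = flatten_whitespace(right)[:half]
    # A colon follows a reference that precedes the left context.
    prefix = "" if reference is None else reference + ":" + " " * gap
    return prefix + left + " " * gap + right


print(format_line("the quick", "brown\tfox", reference="in.txt(3)", half=10))
```

- The leading @samp{in.txt(3):} is the part that tools scanning for a file
- name and line number can pick up.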
-
- Output format is further controlled by the following options.
-
- @table @code
-
- @item -g @var{number}
- Select the size of the minimum white gap between the fields on the output
- line.
-
- @item -w @var{number}
- Select the output maximum width of each final line. If references are
- used, they are included or excluded from the output maximum width
- depending on the value of option @code{-R}. If this option is not
- selected, that is, when references are output before the left context,
- the output maximum width takes into account the maximum length of all
- references. If this option is selected, that is, when references are
- output after the right context, the output maximum width does not take
- into account the space taken by references, nor the gap that precedes
- them.
-
- @item -A
- Select automatic references. Each input line will have an automatic
- reference made up of the file name, an open parenthesis, the line
- ordinal and a close parenthesis. However, the file name will be empty
- when standard input is being read. If both @code{-A} and @code{-r} are
- selected, then the input reference is still read and skipped, but the
- automatic reference is used at output time, overriding the input
- reference.
-
- This option is not available when the program is operating in @code{ptx}
- compatibility mode.
-
- @item -R
- In default output format, when option @code{-R} is not used, any
- reference produced by the effect of options @code{-r} or @code{-A} is
- given at the beginning of each output line, before the left context. In
- default output format, when option @code{-R} is specified, references
- are rather given to the far right of output lines, after the right
- context. For any other output format, option @code{-R} is almost
- ignored, except for the fact that the width of references is @emph{not}
- taken into account in total output width given by @code{-w} whenever
- @code{-R} is selected.
-
- This option is not explicitly selectable when the program is operating
- in @code{ptx} compatibility mode. However, in this case, it is always
- implicitly selected.
-
- @item -F @var{string}
- This option will request that any truncation in the output be reported
- using the string @var{string}. Most output fields theoretically extend
- towards the beginning or the end of the current line, or current
- sentence, as selected with option @code{-S}. But there is a maximum
- allowed output line width, changeable through option @code{-w}, which is
- further divided into space for various output fields. When a field
- cannot extend all the way to the beginning or the end of the current
- line and still fit in its allotted space, it has to be truncated, and the
- truncation is flagged. By default, the string used is a single slash,
- as in @code{-F /}.
-
- @var{string} may have more than one character, as in @code{-F ...}.
- Also, in the particular case @var{string} is empty (@code{-F ""}),
- truncation flagging is disabled, and no truncation marks are appended in
- this case.
-
- This option is not available when the program is operating in @code{ptx}
- compatibility mode.
-
- @item -O
- Choose an output format suitable for @code{nroff} or @code{troff}
- processing. Each output line will look like:
-
- @example
- .xx "@var{tail}" "@var{before}" "@var{keyword_and_after}" "@var{head}" "@var{ref}"
- @end example
-
- so it will be possible to write an @samp{.xx} roff macro to take care of
- the output typesetting. This is the default output format when working
- in @code{ptx} compatibility mode.
-
- In this output format, each non-graphical character, like newline and
- tab, is merely changed to exactly one space, with no special attempt to
- compress consecutive spaces. Each quote character @kbd{"} is doubled
- so it will be correctly processed by @code{nroff} or @code{troff}. All
- characters having their eighth bit set are turned into spaces in this
- version. Diacriticized characters should eventually be
- correctly expressed in @code{roff} terms, once I learn how to do this. So,
- let me know how to improve this special character processing.
-
- This option is not available when the program is operating in @code{ptx}
- compatibility mode. In fact, it then becomes the default and sole output
- format.
-
- @item -T
- Choose an output format suitable for @TeX{} processing. Each output
- line will look like:
-
- @example
- \xx @{@var{tail}@}@{@var{before}@}@{@var{keyword}@}@{@var{after}@}@{@var{head}@}@{@var{ref}@}
- @end example
-
- @noindent
- so it will be possible to write a @code{\xx} definition to take
- care of the output typesetting. Note that when references are not being
- produced, that is, neither option @code{-A} nor option @code{-r} is
- selected, the last parameter of each @code{\xx} call is inhibited.
-
- In this output format, some special characters, like @kbd{$}, @kbd{%},
- @kbd{&}, @kbd{#} and @kbd{_} are automatically protected with a
- backslash. Curly brackets @kbd{@{}, @kbd{@}} are also protected with a
- backslash, but also enclosed in a pair of dollar signs to force
- mathematical mode. The backslash itself produces the sequence
- @code{\backslash@{@}}. Circumflex and tilde diacritics produce the
- sequence @code{^\@{ @}} and @code{~\@{ @}} respectively. Other
- diacriticized characters of the underlying character set produce an
- appropriate @TeX{} sequence as far as possible. The other non-graphical
- characters, like newline and tab, and all other characters which are
- not part of ASCII, are merely changed to exactly one space, with no
- special attempt to compress consecutive spaces. Let me know how to
- improve this special character processing for @TeX{}.
-
- This option is not available when the program is operating in @code{ptx}
- compatibility mode.
-
- @end table
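-
- The character protection rules given under @code{-O} and @code{-T} can
- be sketched in Python as below. This is an approximation written from
- the descriptions above, not the program's code, and the function names
- are hypothetical. Two simplifications are assumed: control characters
- are taken to be the non-graphical ones, and in the @TeX{} case
- circumflex and tilde pass through unchanged while every non-ASCII
- character becomes a space, whereas the program produces @TeX{} sequences
- for diacriticized characters where it can.

```python
def roff_field(text):
    """Protect one field for the roff output format."""
    out = []
    for ch in text:
        if ch == '"':
            out.append('""')           # double each quote character
        elif ord(ch) < 0x20 or ord(ch) >= 0x7F:
            out.append(" ")            # control or eighth-bit character
        else:
            out.append(ch)
    return "".join(out)

def roff_line(tail, before, keyword_and_after, head, ref):
    fields = (tail, before, keyword_and_after, head, ref)
    return ".xx " + " ".join('"' + roff_field(f) + '"' for f in fields)

def tex_field(text):
    """Protect one field for the TeX output format (simplified)."""
    out = []
    for ch in text:
        if ch in "$%&#_":
            out.append("\\" + ch)      # protected with a backslash
        elif ch in "{}":
            out.append("$\\" + ch + "$")  # backslash, inside math mode
        elif ch == "\\":
            out.append("\\backslash{}")
        elif ch in "^~":
            out.append(ch)             # simplification, see the text above
        elif ord(ch) < 0x20 or ord(ch) > 0x7E:
            out.append(" ")
        else:
            out.append(ch)
    return "".join(out)


print(roff_line("", "the", 'cat "sat"', "", "f(1)"))
print(tex_field("50% of {x}"))
```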
-
-
- @node Regexps, ptx mode, Usage, Top
- @subsection Syntax of Regular Expressions
-
- @c This node is taken from the GNU emacs 18.55 manual. The best thing
- @c would be that it is linked from here. But, to obviate various usages
- @c in installation, it is simpler to take a mere copy of it for now.
- @c
- @c I also removed \s@var{code} and \S@var{code} documentation, which is
- @c not compiled into regex.c, and the reference to GNU emacs Syntax
- @c node. Some references to the Emacs buffer should be changed too.
-
- Regular expressions have a syntax in which a few characters are special
- constructs and the rest are @dfn{ordinary}. An ordinary character is a
- simple regular expression which matches that character and nothing else.
- The special characters are @samp{$}, @samp{^}, @samp{.}, @samp{*},
- @samp{+}, @samp{?}, @samp{[}, @samp{]} and @samp{\}; no new special
- characters will be defined. Any other character appearing in a regular
- expression is ordinary, unless a @samp{\} precedes it.@refill
-
- For example, @samp{f} is not a special character, so it is ordinary, and
- therefore @samp{f} is a regular expression that matches the string @samp{f}
- and no other string. (It does @i{not} match the string @samp{ff}.) Likewise,
- @samp{o} is a regular expression that matches only @samp{o}.@refill
-
- Any two regular expressions @var{a} and @var{b} can be concatenated. The
- result is a regular expression which matches a string if @var{a} matches
- some amount of the beginning of that string and @var{b} matches the rest of
- the string.@refill
-
- As a simple example, we can concatenate the regular expressions @samp{f}
- and @samp{o} to get the regular expression @samp{fo}, which matches only
- the string @samp{fo}. Still trivial. To do something nontrivial, you
- need to use one of the special characters. Here is a list of them.
-
- @table @kbd
- @item .@: @r{(Period)}
- is a special character that matches any single character except a newline.
- Using concatenation, we can make regular expressions like @samp{a.b} which
- matches any three-character string which begins with @samp{a} and ends with
- @samp{b}.@refill
-
- @item *
- is not a construct by itself; it is a suffix, which means the
- preceding regular expression is to be repeated as many times as
- possible. In @samp{fo*}, the @samp{*} applies to the @samp{o}, so
- @samp{fo*} matches one @samp{f} followed by any number of @samp{o}s.
- The case of zero @samp{o}s is allowed: @samp{fo*} does match
- @samp{f}.@refill
-
- @samp{*} always applies to the @i{smallest} possible preceding
- expression. Thus, @samp{fo*} has a repeating @samp{o}, not a
- repeating @samp{fo}.@refill
-
- The matcher processes a @samp{*} construct by matching, immediately,
- as many repetitions as can be found. Then it continues with the rest
- of the pattern. If that fails, backtracking occurs, discarding some
- of the matches of the @samp{*}-modified construct in case that makes
- it possible to match the rest of the pattern. For example, matching
- @samp{ca*ar} against the string @samp{caaar}, the @samp{a*} first
- tries to match all three @samp{a}s; but the rest of the pattern is
- @samp{ar} and there is only @samp{r} left to match, so this try fails.
- The next alternative is for @samp{a*} to match only two @samp{a}s.
- With this choice, the rest of the regexp matches successfully.@refill
-
- @item +
- Is a suffix character similar to @samp{*} except that it requires that
- the preceding expression be matched at least once. So, for example,
- @samp{ca+r} will match the strings @samp{car} and @samp{caaaar}
- but not the string @samp{cr}, whereas @samp{ca*r} would match all
- three strings.@refill
-
- @item ?
- Is a suffix character similar to @samp{*} except that it can match the
- preceding expression either once or not at all. For example,
- @samp{ca?r} will match @samp{car} or @samp{cr}; nothing else.
-
- @item [ @dots{} ]
- @samp{[} begins a @dfn{character set}, which is terminated by a
- @samp{]}. In the simplest case, the characters between the two form
- the set. Thus, @samp{[ad]} matches either one @samp{a} or one
- @samp{d}, and @samp{[ad]*} matches any string composed of just
- @samp{a}s and @samp{d}s (including the empty string), from which it
- follows that @samp{c[ad]*r} matches @samp{cr}, @samp{car}, @samp{cdr},
- @samp{caddaar}, etc.@refill
-
- Character ranges can also be included in a character set, by writing
- two characters with a @samp{-} between them. Thus, @samp{[a-z]}
- matches any lower-case letter. Ranges may be intermixed freely with
- individual characters, as in @samp{[a-z$%.]}, which matches any lower
- case letter or @samp{$}, @samp{%} or period.@refill
-
- Note that the usual special characters are not special any more inside
- a character set. A completely different set of special characters
- exists inside character sets: @samp{]}, @samp{-} and @samp{^}.@refill
-
- To include a @samp{]} in a character set, you must make it the first
- character. For example, @samp{[]a]} matches @samp{]} or @samp{a}. To
- include a @samp{-}, write @samp{---}, which is a range containing only
- @samp{-}. To include @samp{^}, make it other than the first character
- in the set.@refill
-
- @item [^ @dots{} ]
- @samp{[^} begins a @dfn{complement character set}, which matches any
- character except the ones specified. Thus, @samp{[^a-z0-9A-Z]}
- matches all characters @i{except} letters and digits.@refill
-
- @samp{^} is not special in a character set unless it is the first
- character. The character following the @samp{^} is treated as if it
- were first (@samp{-} and @samp{]} are not special there).
-
- Note that a complement character set can match a newline, unless
- newline is mentioned as one of the characters not to match.
-
- @item ^
- is a special character that matches the empty string, but only if at
- the beginning of a line in the text being matched. Otherwise it fails
- to match anything. Thus, @samp{^foo} matches a @samp{foo} which occurs
- at the beginning of a line.
-
- @item $
- is similar to @samp{^} but matches only at the end of a line. Thus,
- @samp{xx*$} matches a string of one @samp{x} or more at the end of a line.
-
- @item \
- has two functions: it quotes the special characters (including
- @samp{\}), and it introduces additional special constructs.
-
- Because @samp{\} quotes special characters, @samp{\$} is a regular
- expression which matches only @samp{$}, and @samp{\[} is a regular
- expression which matches only @samp{[}, and so on.@refill
- @end table
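-
- The constructs in this table are written the same way in Python's
- @code{re} module, so the examples above can be checked there. The
- sketch below is only a convenience for experimenting; Python's matcher
- is not the one used by this program, and the helper names are
- hypothetical. @code{re.fullmatch} is used because the text speaks of
- matching a whole string.

```python
import re

def matches(pattern, text):
    """True when the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

# (pattern, text, expected), taken from the examples in the table above.
CASES = [
    ("a.b", "axb", True), ("a.b", "a\nb", False),
    ("fo*", "f", True), ("fo*", "fooo", True), ("fo*", "fofo", False),
    ("ca*ar", "caaar", True),
    ("ca+r", "car", True), ("ca+r", "caaaar", True), ("ca+r", "cr", False),
    ("ca*r", "cr", True),
    ("ca?r", "car", True), ("ca?r", "cr", True), ("ca?r", "caar", False),
    ("c[ad]*r", "cr", True), ("c[ad]*r", "caddaar", True),
    ("[]a]", "]", True),
    ("[^a-z0-9A-Z]", "\n", True), ("[^a-z0-9A-Z]", "q", False),
]

def failures():
    """Return the cases whose outcome differs from what the text states."""
    return [(p, t) for p, t, want in CASES if matches(p, t) != want]


print(failures())
```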
-
- Note: for historical compatibility, special characters are treated as
- ordinary ones if they are in contexts where their special meanings make no
- sense. For example, @samp{*foo} treats @samp{*} as ordinary since there is
- no preceding expression on which the @samp{*} can act. It is poor practice
- to depend on this behavior; better to quote the special character anyway,
- regardless of where it appears.@refill
-
- For the most part, @samp{\} followed by any character matches only
- that character. However, there are several exceptions: characters
- which, when preceded by @samp{\}, are special constructs. Such
- characters are always ordinary when encountered on their own. Here
- is a table of @samp{\} constructs.
-
- @table @kbd
- @item \|
- specifies an alternative.
- Two regular expressions @var{a} and @var{b} with @samp{\|} in
- between form an expression that matches anything that either @var{a} or
- @var{b} will match.@refill
-
- Thus, @samp{foo\|bar} matches either @samp{foo} or @samp{bar}
- but no other string.@refill
-
- @samp{\|} applies to the largest possible surrounding expressions. Only a
- surrounding @samp{\( @dots{} \)} grouping can limit the grouping power of
- @samp{\|}.@refill
-
- Full backtracking capability exists to handle multiple uses of @samp{\|}.
-
- @item \( @dots{} \)
- is a grouping construct that serves three purposes:
-
- @enumerate
- @item
- To enclose a set of @samp{\|} alternatives for other operations.
- Thus, @samp{\(foo\|bar\)x} matches either @samp{foox} or @samp{barx}.
-
- @item
- To enclose a complicated expression for the postfix @samp{*} to operate on.
- Thus, @samp{ba\(na\)*} matches @samp{bananana}, etc., with any (zero or
- more) number of @samp{na} strings.@refill
-
- @item
- To mark a matched substring for future reference.
-
- @end enumerate
-
- This last application is not a consequence of the idea of a
- parenthetical grouping; it is a separate feature which happens to be
- assigned as a second meaning to the same @samp{\( @dots{} \)} construct
- because there is no conflict in practice between the two meanings.
- Here is an explanation of this feature:
-
- @item \@var{digit}
- after the end of a @samp{\( @dots{} \)} construct, the matcher remembers the
- beginning and end of the text matched by that construct. Then, later on
- in the regular expression, you can use @samp{\} followed by @var{digit}
- to mean ``match the same text matched the @var{digit}'th time by the
- @samp{\( @dots{} \)} construct.''@refill
-
- The strings matching the first nine @samp{\( @dots{} \)} constructs appearing
- in a regular expression are assigned numbers 1 through 9 in order that the
- open-parentheses appear in the regular expression. @samp{\1} through
- @samp{\9} may be used to refer to the text matched by the corresponding
- @samp{\( @dots{} \)} construct.
-
- For example, @samp{\(.*\)\1} matches any newline-free string that is
- composed of two identical halves. The @samp{\(.*\)} matches the first
- half, which may be anything, but the @samp{\1} that follows must match
- the same exact text.
-
- @item \`
- matches the empty string, provided it is at the beginning
- of the buffer.
-
- @item \'
- matches the empty string, provided it is at the end of
- the buffer.
-
- @item \b
- matches the empty string, provided it is at the beginning or
- end of a word. Thus, @samp{\bfoo\b} matches any occurrence of
- @samp{foo} as a separate word. @samp{\bballs?\b} matches
- @samp{ball} or @samp{balls} as a separate word.@refill
-
- @item \B
- matches the empty string, provided it is @i{not} at the beginning or
- end of a word.
-
- @item \<
- matches the empty string, provided it is at the beginning of a word.
-
- @item \>
- matches the empty string, provided it is at the end of a word.
-
- @item \w
- matches any word-constituent character. The editor syntax table
- determines which characters these are.
-
- @item \W
- matches any character that is not a word-constituent.
- @end table
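-
- For readers who want to try these constructs in Python's @code{re}
- module, note that its notation differs: alternation and grouping are
- written without the backslash there, while back references and
- @samp{\b} keep the same form. The sketch below pairs some examples from
- this table with a Python spelling and checks them. The Python forms are
- my own transcription, not something this program accepts, and the
- helper names are hypothetical.

```python
import re

# Notation of this manual -> equivalent Python notation.
PYTHON_FORM = {
    r"foo\|bar": r"foo|bar",
    r"\(foo\|bar\)x": r"(foo|bar)x",
    r"ba\(na\)*": r"ba(na)*",
    r"\(.*\)\1": r"(.*)\1",
    r"\bballs?\b": r"\bballs?\b",
}

# (manual pattern, text, expected) when matching the whole text.
WHOLE = [
    (r"foo\|bar", "foo", True), (r"foo\|bar", "foobar", False),
    (r"\(foo\|bar\)x", "barx", True), (r"\(foo\|bar\)x", "foo", False),
    (r"ba\(na\)*", "bananana", True), (r"ba\(na\)*", "ba", True),
    (r"\(.*\)\1", "abcabc", True), (r"\(.*\)\1", "abcabd", False),
]

# (manual pattern, text, expected) when searching inside the text.
SEARCH = [
    (r"\bballs?\b", "two balls here", True),
    (r"\bballs?\b", "a ballroom", False),
]

def failures():
    """Return the cases whose outcome differs from what the text states."""
    bad = []
    for pattern, text, want in WHOLE:
        if (re.fullmatch(PYTHON_FORM[pattern], text) is not None) != want:
            bad.append((pattern, text))
    for pattern, text, want in SEARCH:
        if (re.search(PYTHON_FORM[pattern], text) is not None) != want:
            bad.append((pattern, text))
    return bad


print(failures())
```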
-
- Here is a complicated regexp, used by Emacs to recognize the end of a
- sentence together with any whitespace that follows. It is given in Lisp
- syntax to enable you to distinguish the spaces from the tab characters. In
- Lisp syntax, the string constant begins and ends with a double-quote.
- @samp{\"} stands for a double-quote as part of the regexp, @samp{\\} for a
- backslash as part of the regexp, @samp{\t} for a tab and @samp{\n} for a
- newline.
-
- @example
- "[.?!][]\"')]*\\($\\|\t\\| \\)[ \t\n]*"
- @end example
-
- @noindent
- This contains four parts in succession: a character set matching period,
- @samp{?} or @samp{!}; a character set matching close-brackets,
- quotes or parentheses, repeated any number of times; an alternative in
- backslash-parentheses that matches end-of-line, a tab or two spaces; and a
- character set matching whitespace characters, repeated any number of times.
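-
- The same four parts can be tried out in Python, as in the sketch below.
- The pattern is my transcription into Python notation (grouping and
- alternation lose their backslashes, and @code{re.MULTILINE} makes
- @samp{$} match at each end of line); it is an assumption that this
- behaves like the matcher used by the program. The function name
- @code{split_sentences} is hypothetical.

```python
import re

SENTENCE_END = re.compile(r"""[.?!][]"')]*($|\t|  )[ \t\n]*""", re.MULTILINE)

def split_sentences(text):
    """Cut text after each sentence end and drop trailing white space."""
    sentences = []
    start = 0
    for match in SENTENCE_END.finditer(text):
        sentences.append(text[start:match.end()].rstrip())
        start = match.end()
    rest = text[start:].rstrip()
    if rest:
        sentences.append(rest)  # text after the last recognised end
    return sentences


print(split_sentences('It works.  Does it?  "Yes!"\nEnd'))
print(split_sentences("Mr. Smith left."))
```

- A period followed by a single space, as in @samp{Mr. Smith}, is not
- taken as a sentence end, because the third part requires an end of
- line, a tab or two spaces.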
-
-
- @node ptx mode, Future, Regexps, Top
- @subsection @code{ptx} compatibility mode
-
- This section outlines the differences between this program and standard
- @code{ptx}. There is also a @code{ptx} compatibility mode in this
- program which is activated implicitly when the program is called under
- the name @code{ptx} or explicitly through the usage of option
- @code{-p}. For someone used to standard @code{ptx}, here are some
- points worth noticing when not using @code{ptx} compatibility mode:
-
- @itemize @bullet
-
- @item
- In normal mode, concordance output is not formatted for @code{troff} or
- @code{nroff}. By default, output is rather formatted for a dumb
- terminal. @code{troff} or @code{nroff} output may still be selected
- through option @code{-O}.
-
- @item
- In normal mode, unless @code{-R} option is used, the maximum reference
- width is subtracted from the total output line width. In @code{ptx}
- compatibility mode, the width of references is not taken into account in
- the output line width computations.
-
- @item
- In normal mode, all 256 characters, even @kbd{NUL}s, are read and
- processed from input file with no adverse effect. No attempt is made to
- limit this in @code{ptx} compatibility mode. However, standard
- @code{ptx} does not accept 8-bit characters, a few control characters
- are rejected, and the tilde @kbd{~} is condemned.
-
- @item
- In normal mode, input lines may be of infinite length. No attempt is
- made to limit this in @code{ptx} compatibility mode. However, standard
- @code{ptx} processes only the first 200 characters in each line.
-
- @item
- In normal mode, the break (non-word) characters default to be every
- character except letters. In @code{ptx} compatibility mode, the break
- characters default to space, tab and newline only.
-
- @item
- In some circumstances, output lines are filled a little more completely
- in normal mode than in @code{ptx} compatibility mode. Even in
- @code{ptx} mode, there are some slight disposition glitches this
- program does not completely reproduce, even if it comes quite close.
-
- @item
- The default Ignore file in @code{ptx} compatibility mode is not the same
- as in normal mode. In the default installation, the default Ignore file is
- @file{/usr/lib/eign} in @code{ptx} compatibility mode, and there is none in
- normal mode.
-
- @item
- Standard @code{ptx} disallows specifying both the Ignore file and the
- Only file at the same time. This version allows both, and specifying an
- Only file does not inhibit processing the Ignore file.
-
- @end itemize
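-
- The difference between the two default break character sets can be
- illustrated with a minimal Python sketch. It is not taken from the
- program itself: the name @code{split_words} is hypothetical, and
- ``letters'' is assumed here to mean ASCII letters only, which may not
- match what the program really does.
-
```python
import re

# Assumption: "letters" means ASCII letters; the real program may differ.
# Normal mode: every character except letters is a break character.
NORMAL_WORD = re.compile(r"[A-Za-z]+")

# ptx compatibility mode: only space, tab and newline break words.
PTX_WORD = re.compile(r"[^ \t\n]+")


def split_words(line, ptx_mode=False):
    """Return the words found in line under the chosen default."""
    pattern = PTX_WORD if ptx_mode else NORMAL_WORD
    return pattern.findall(line)
```
-
- Under these assumptions, an input such as @samp{don't stop-now 42}
- yields four words in normal mode, since the apostrophe, the hyphen and
- the digits all act as breaks, and three words in @code{ptx}
- compatibility mode.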
-
-
- @node Future, , ptx mode, Top
- @subsection Development guidelines
-
- This should evolve towards a concordance package for GNU, able to tackle
- true, real, big concordance jobs, while being fast and easy to use for
- little jobs. The starting point is standard @code{ptx}. Because several
- packages of this kind are awfully slow, I should reasonably try to keep
- speed in mind. On the other hand, I do not want to burden myself too
- much with interactive queries for now; so, a future reorientation toward
- this topic might require some work.
-
- Here is a @emph{What To Do Next} list, in expected execution order.
-
- @enumerate
-
- @item
- Increase short term usability:
-
- @itemize @bullet
-
- @item
- Support the program for the GNU community. As directed by user
- comments, test and debug the whole thing more fully, and on bigger
- examples. Solve portability glitches as long as this does not make the
- code too ugly.
-
- @item
- Provide sample macros in the documentation.
-
- @item
- Understand and mimic the @code{-t} option, if I can.
-
- @item
- See how TeX mode could be made more useful, and if a texinfo mode would
- mean something to someone.
-
- @item
- Sort keywords intelligently for Latin-1 code. See how to interface this
- character set with various output formats. Also, introduce options to
- inverse-sort and possibly to reverse-sort.
-
- @item
- Improve speed for Ignore and Only tables. Consider hashing instead of
- sorting. Consider playing with obstacks to digest them.
-
- @item
- Provide better handling of format effectors obtained from input, and
- also attempt white space compression on output which would still
- maximize full output width usage.
-
- @end itemize
-
- @item
- Provide multiple language support.
-
- Most of the boosting work should go along the line of fast recognition
- of multiple and complex boundaries, which define various `languages'.
- Each such language has its own rules for words, sentences, paragraphs,
- and reporting requests. This is less difficult than I first thought:
-
- @itemize @bullet
-
- @item
- Learn how to use getopt, or write something if necessary. Recognize
- language modifiers with each option. At least -b, -i, -o, -W, -S, and
- also new language switcher options, will have such modifiers. Modifiers
- on language switchers will allow or disallow language transitions.
-
- @item
- Complete the transformation of underlying variables into arrays in the
- code.
-
- @item
- Implement a heap of positions in the input file. There is one entry in
- the heap for each compiled regexp; it is initialized by a re_search
- after each regexp compile. Regexps reschedule themselves in the heap
- when their position passes while scanning input. In this way, looking
- simultaneously for a lot of regexps should not be too inefficient, once
- the scanning starts. If this works ok, maybe consider accepting regexps
- in Only and Ignore tables.
-
- @item
- Merge the boundary processing options with language processing, really
- integrating -S processing as a special case. Maybe implement several
- levels of boundaries. See how to implement a stack of languages, for
- handling quotations. See if more sophisticated references could be
- handled as another special case of a language.
-
- @end itemize
-
- @item
- Tackle other aspects, in a more long term view:
-
- @itemize @bullet
-
- @item
- Add options for statistics, frequency lists, referencing, and all other
- prescreening tools and subsidiary tasks of concordance production.
-
- @item
- Develop an interactive mode. Even better, construct a GNU Emacs
- interface. I'm looking at Gene Myers' <gene@@cs.arizona.edu> suffix
- arrays as a possible implementation of those ideas.
-
- @item
- Implement hooks so that word classification and tagging can be merged in.
- See how to effectively hook in lemmatisation or other morphological
- features. It is far from clear at this point how to interface this
- correctly, so some experimentation is mandatory.
-
- @item
- Profile and speed up the whole thing.
-
- @item
- Make it work on small address space machines. Consider three levels of
- hugeness for files, and three corresponding algorithms to make optimal
- use of memory. The first case is when all the input files and all the
- word references fit in memory: this is the case currently implemented.
- The second case is when the files cannot fit all together in memory, but
- the word references do. The third case is when even the word references
- cannot fit in memory.
-
- @item
- There also are subsidiary developments for in-core incremental sort
- routines as well as for a huge external sort package. The need for more
- flexible sort packages comes partly from the fact that linguists use
- kinds of keys which compare in unusual and more sophisticated ways.
-
- @end itemize
-
- @end enumerate
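-
- The heap of input positions proposed above for multiple language
- support can be sketched as follows. This is only an illustration
- under stated assumptions, not the planned implementation: it uses
- Python's @code{re} and @code{heapq} modules in place of
- @code{re_search}, and the name @code{scan} is hypothetical. Each
- compiled regexp owns at most one heap entry, keyed by the position of
- its next match, and reschedules itself once scanning passes that
- position.
-
```python
import heapq
import re


def scan(text, patterns):
    """Yield (position, pattern_index, matched_text) in input order."""
    compiled = [re.compile(p) for p in patterns]
    heap = []

    # Initialize the heap with one search right after each compile.
    for index, regexp in enumerate(compiled):
        match = regexp.search(text)
        if match:
            heapq.heappush(heap, (match.start(), index, match))

    while heap:
        # The entry with the smallest position is the next one in the input.
        position, index, match = heapq.heappop(heap)
        yield position, index, match.group()

        # Reschedule this regexp past its match; step over empty matches
        # so that the scan always makes progress.
        restart = match.end() if match.end() > match.start() else match.start() + 1
        following = compiled[index].search(text, restart)
        if following:
            heapq.heappush(heap, (following.start(), index, following))
```
-
- Since every regexp has at most one pending entry, the heap never holds
- more entries than there are regexps, and finding the next boundary
- costs a heap operation rather than a fresh search with every regexp.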
-