- Info file gptx.info, produced by Makeinfo, -*- Text -*- from input
- file gptx.texinfo.
-
- Copyright (C) 1990 Free Software Foundation, Inc. Francois Pinard
- <pinard@iro.umontreal.ca>, 1988.
-
- $Id$
-
- This program is free software; you can redistribute it and/or
- modify it under the terms of the GNU General Public License as
- published by the Free Software Foundation; either version 1, or (at
- your option) any later version.
-
- This program is distributed in the hope that it will be useful,
- but WITHOUT ANY WARRANTY; without even the implied warranty of
- MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
- General Public License for more details.
-
- You should have received a copy of the GNU General Public License
- along with this program; if not, write to the Free Software
- Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
-
-
- File: gptx.info, Node: Top, Next: Usage, Up: (DIR)
-
- `gptx' - GNU permuted index generator
- =====================================
-
- This is the GNU prerelease of `gptx', a permuted index generator.
- The main goal of this prerelease is to provide an *almost*
- compatible replacement for `ptx', able to handle small files quickly,
- while providing a platform for more development.
-
- This version reimplements and extends standard `ptx'. In
- particular, it can produce a readable "KWIC" (keyword in context)
- index without needing `nroff'. This program does not reproduce all
- of the layout quirks of `ptx' (but should it really?). Also, this
- version does not yet handle huge input files, that is, files which
- do not fit in memory all at once.
-
- * Menu:
-
- * Usage:: How to use the program, its options and parameters.
- * Regexps:: How a regular expression is written and used.
- * ptx mode:: In which ways `ptx' mode is different.
- * Future:: What are the development lines of this program.
-
-
- File: gptx.info, Node: Usage, Next: Regexps, Prev: Top, Up: Top
-
- How to use this program
- -----------------------
-
- This tool reads a text file and essentially produces a permuted
- index, with each keyword in its context. It is invoked in one of
- two ways:
-
- gptx [OPTION]... [INPUT]... >OUTPUT
-
- or:
-
- ptx [OPTION]... [INPUT [OUTPUT]]
-
- If the program is called as `ptx' instead of `gptx', or if the `-p'
- option is selected, it runs in `ptx' compatibility mode, which
- disallows extensions, introduces some limitations, and changes
- several of the program's default option values. See *Note ptx mode::
- for a list of differences.
-
- As usual, each option is represented by a hyphen followed by a
- single letter. Some options require a parameter in the form of a
- decimal number, a file name, a string or a regular expression, in
- which case the parameter follows the option after some whitespace.
- Option letters may be grouped and tied together as a string which
- follows only one hyphen; if one or several of them require
- parameters, the parameters should follow the combined options in the
- order of appearance of the individual letters in the string.
- Individual options are explained below.
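-
- The grouping rule can be sketched as follows. This is only an
- illustration in Python, not the program's own parser: the name
- `parse_grouped' and the set `TAKES_PARAM' are assumptions made for
- this example, built from the option letters documented in this file.
-
```python
# Letters that take a parameter, according to this manual (assumed table).
TAKES_PARAM = {"b", "i", "o", "S", "W", "g", "w", "F"}

def parse_grouped(argv, takes_param=TAKES_PARAM):
    """Parse options whose letters may be grouped after a single hyphen.

    Returns (options, remaining), where options is a list of
    (letter, parameter_or_None) pairs in order of appearance.
    """
    options = []
    i = 0
    # A lone "-" names standard input, so it is not an option group.
    while i < len(argv) and argv[i].startswith("-") and argv[i] != "-":
        letters = argv[i][1:]
        i += 1
        for letter in letters:
            if letter in takes_param:
                # Parameters follow the group, in the order of the letters.
                if i >= len(argv):
                    raise ValueError(f"option -{letter} requires a parameter")
                options.append((letter, argv[i]))
                i += 1
            else:
                options.append((letter, None))
    return options, argv[i:]
```
-
- With this sketch, the arguments `-fbi breaks ignore input.txt' give
- option `-f' with no parameter, `-b' with `breaks', `-i' with
- `ignore', and leave `input.txt' as the only remaining parameter.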
-
- When *not* in `ptx' compatibility mode, there may be zero, one or
- several parameters after the options. If there are no parameters, the
- program reads the standard input. If there are one or several
- parameters, they give the names of the input files, which are all
- read in turn, somewhat as if all the input files were concatenated.
- However, there is a full contextual break between each file; and when
- automatic referencing is requested, file names and line numbers refer
- to individual text input files. In all cases, the program writes
- the permuted index to the standard output.
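-
- To make the notion of a permuted index concrete, here is a minimal
- sketch in Python. It is not how `gptx' is implemented: the function
- names and the fixed column width are assumptions, and Python's `\w'
- also accepts digits and underscores, which may differ from the
- program's own notion of a word.
-
```python
import re

def kwic_entries(lines):
    """Return (left, keyword, right) triples, sorted by keyword."""
    entries = []
    for line in lines:
        # Every word of the line becomes a keyword in turn.
        for m in re.finditer(r"\w+", line):
            left = line[:m.start()].rstrip()
            right = line[m.end():].lstrip()
            entries.append((left, m.group(), right))
    # Sort on the keyword first, then on the text that follows it.
    entries.sort(key=lambda e: (e[1], e[2]))
    return entries

def format_entries(entries, half=20):
    """Lay out each entry with its keyword starting at a fixed column."""
    lines = []
    for left, keyword, right in entries:
        tail = (keyword + " " + right).rstrip()[:half]
        # Right-justify the left context so that keywords line up.
        lines.append(f"{left[-half:]:>{half}}  {tail}")
    return lines
```
-
- Each input word appears once as a keyword, with the words before it
- on its left and the words after it on its right.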
-
- When in `ptx' compatibility mode, besides the options, there may
- be zero, one or two parameters. If there are no parameters, the
- program reads the standard input and writes the permuted index to
- the standard output. If there is only one parameter, it names the
- text file to be read instead of the standard input. If two
- parameters are given, they give respectively the name of the file to
- read and the name of the file to produce.
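-
- The two calling conventions described above can be summarized by a
- small sketch. The function `resolve_io' is hypothetical; it only
- decides which names are inputs and which one, if any, is the output.
-
```python
def resolve_io(params, ptx_mode):
    """Return (input_names, output_name) for the non-option parameters.

    An input name of "-" stands for standard input; an output name of
    None stands for standard output.
    """
    if ptx_mode:
        # ptx mode: [INPUT [OUTPUT]]
        if len(params) > 2:
            raise ValueError("ptx mode accepts at most INPUT and OUTPUT")
        inputs = list(params[:1]) or ["-"]
        output = params[1] if len(params) == 2 else None
    else:
        # Normal mode: every parameter is an input file, read in turn.
        inputs = list(params) or ["-"]
        output = None
    return inputs, output
```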
-
- Note that for *any* file named as the value of an option or as an
- input text file, a single dash `-' may be used, in which case
- standard input is assumed. However, it would not make sense to use
- this convention more than once per program invocation.
-
- * Menu:
-
- * General options:: Options which affect general program behaviour.
- * Charset selection:: Underlying character set considerations.
- * Input processing:: Input fields, contexts, and keyword selection.
- * Output formatting:: Types of output format, and sizing the fields.
-
-
- File: gptx.info, Node: General options, Next: Charset selection, Up: Usage
-
- General options
- ...............
-
- `-p'
- This requests `ptx' behaviour, as far as we understand it. This
- option is selected by default when the program is installed
- under the name `ptx'.
-
- This option is not available once the program is operating in
- `ptx' compatibility mode.
-
- `-C'
- Prints a short note about the Copyright and copying conditions.
-
-
- File: gptx.info, Node: Charset selection, Next: Input processing, Prev: General options, Up: Usage
-
- Charset selection
- .................
-
- As it is set up now, the program assumes that the input file is
- coded using 8-bit ISO 8859-1 code, also known as the Latin-1
- character set, *unless* it is compiled for MS-DOS, in which case it
- uses the character set of the IBM-PC. Compared to 7-bit ASCII, the
- set of characters which are letters is then different; this alters
- the behaviour of regular expression matching. Thus, the default
- regular expression for a keyword allows foreign or diacriticized
- letters. Keyword sorting, however, is still crude; it obeys the
- underlying character set ordering quite blindly.
-
- `-f'
- Fold lower case letters to upper case for sorting.
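-
- The effect of folding on the sort order can be illustrated with a
- short Python sketch. The function name is an assumption, and Python
- strings are Unicode rather than Latin-1, so this only approximates
- the blind character set ordering described above.
-
```python
def sort_keywords(words, fold=False):
    """Sort keywords by code point, optionally folding to upper case."""
    if fold:
        # Compare the upper-case forms; the words themselves are unchanged.
        return sorted(words, key=str.upper)
    return sorted(words)
```
-
- Without folding, all upper case words sort before any lower case
- word; with folding, `Alpha' and `alpha' end up next to each other.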
-
-
- File: gptx.info, Node: Input processing, Next: Output formatting, Prev: Charset selection, Up: Usage
-
- Word selection
- ..............
-
- `-b FILE'
- This option is an alternative to option `-W' for describing
- which characters make up words. This option introduces the name
- of a file which contains a list of characters which can*not* be
- part of a word; this file is called the "Break file". Any
- character which is not part of the Break file is a word
- constituent. If both options `-b' and `-W' are specified, then
- `-W' has precedence and `-b' is ignored.
-
- In normal mode, the only way to avoid newline as a break
- character is to write all the break characters in the file with
- no newline at all, not even at the end of the file. In `ptx'
- compatibility mode, spaces, tabs and newlines are always
- considered as break characters even if not included in the Break
- file.
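-
- A sketch of how a Break file could drive word splitting follows.
- The function names are hypothetical, and the Break file contents are
- passed in as a string rather than read from disk.
-
```python
def read_break_chars(data, ptx_mode=False):
    """Build the set of break characters from Break file contents."""
    breaks = set(data)
    if ptx_mode:
        # In ptx mode these three always break words.
        breaks |= {" ", "\t", "\n"}
    return breaks

def split_words(text, breaks):
    """Split text into maximal runs of non-break characters."""
    words, current = [], []
    for ch in text:
        if ch in breaks:
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words
```
-
- Note how a Break file written without any newline leaves newline as
- a word constituent in normal mode, as the text above explains.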
-
- `-i FILE'
- The file associated with this option contains a list of words
- which will never be taken as keywords in concordance output. It
- is called the "Ignore file". The file contains exactly one word
- in each line; the end of line separation of words is not subject
- to the value of the `-S' option.
-
- If not specified, there might be a default Ignore file. Default
- Ignore files are not necessarily the same in normal mode or in
- `ptx' compatibility mode. Unless changed by the local
- installation, there is *no* default Ignore file in normal mode,
- and the Ignore file is `/usr/lib/eign' in `ptx' compatibility
- mode. If you want to deactivate a default Ignore file, use
- `/dev/null' instead.
-
- `-o FILE'
- The file associated with this option contains a list of words
- which will be retained in concordance output; any word not
- mentioned in this file is ignored. The file is called the
- "Only file". The file contains exactly one word on each line;
- the end of line separation of words is not subject to the value
- of the `-S' option.
-
- There is no default for the Only file. When there are
- both an Only file and an Ignore file, a word is eligible to
- be a keyword only if it is given in the Only file and not given
- in the Ignore file.
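-
- The combined rule for the Only and Ignore files can be written as a
- small predicate. This is a sketch with assumed names; file contents
- are given as strings, one word per line.
-
```python
def load_word_list(text):
    """Return the set of words found one per line in text."""
    return {line for line in text.split("\n") if line}

def is_keyword(word, only=None, ignore=None):
    """Decide whether word may be used as a keyword.

    only and ignore are sets of words, or None when the corresponding
    file was not given.
    """
    if only is not None and word not in only:
        return False
    if ignore is not None and word in ignore:
        return False
    return True
```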
-
- `-r'
- On each input line, the leading sequence of non-white characters
- will be taken to be a reference that has the purpose of
- identifying this input line in the produced permuted index. See
- *Note Output formatting:: for more information about reference
- production. Using this option changes the default value for
- option `-S'.
-
- Using this option, the program does not try very hard to remove
- references from contexts in output, but it succeeds in doing so
- *when* the context ends exactly at the newline. If option `-r'
- is used with `-S' default value, or when in `ptx' compatibility
- mode, this condition is always met and references are completely
- excluded from the output contexts.
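-
- For the simple case where each context is one input line, taking
- the reference off the front of a line might look like this. The
- function name is hypothetical, and dropping the white space between
- the reference and the text is an assumption of this sketch.
-
```python
import re

def split_reference(line):
    """Return (reference, rest) for one input line."""
    # The reference is the leading run of non-white characters.
    m = re.match(r"\S+", line)
    if m is None:
        return "", line
    return m.group(), line[m.end():].lstrip()
```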
-
- `-S REGEXP'
- This option selects which regular expression will describe the
- end of a line or the end of a sentence. In fact, there is no other
- distinction between end of lines or end of sentences than the
- effect of this regular expression, and input line boundaries
- have no special significance outside this option. By default,
- in `ptx' compatibility mode or if the `-r' option is used, end of
- lines are used; in this case, the REGEXP used is very simple:
-
- \n
-
- In normal mode and if the `-r' option is not used, by default, end
- of sentences are used; the precise REGEXP is imported from GNU
- Emacs:
-
- [.?!][]\"')}]*\\($\\|\t\\|  \\)[ \t\n]*
-
- An empty REGEXP is equivalent to completely disabling end of line
- or end of sentence recognition. In this case, the whole file is
- considered to be a single big line or sentence. The user might
- want to disallow all truncation flag generation as well, through
- option `-F ""'. On regular expression writing and usage, see
- *Note Regexps::.
-
- When the keywords happen to be near the beginning of the input
- line or sentence, this often creates an unused area at the
- beginning of the output context line; when the keywords happen
- to be near the end of the input line or sentence, this often
- creates an unused area at the end of the output context line.
- The program tries to fill those unused areas by wrapping around
- context in them; the tail of the input line or sentence is used
- to fill the unused area on the left of the output line; the head
- of the input line or sentence is used to fill the unused area on
- the right of the output line.
-
- This option is not available when the program is operating in
- `ptx' compatibility mode.
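-
- The two default boundaries can be tried out with the sketch below.
- It uses Python's `re' module, whose syntax differs from the one
- described in *Note Regexps::, so the sentence pattern is a
- hand-made translation (grouping is written `(?:...|...)' and the
- two spaces are written out). The names are assumptions, as is the
- choice to end each unit after the matched boundary and trim
- trailing white space.
-
```python
import re

END_OF_LINE = re.compile(r"\n")
# Translation of the Emacs sentence-end expression to Python syntax.
END_OF_SENTENCE = re.compile(
    r"""[.?!][\]"')}]*(?:$|\t|  )[ \t\n]*""", re.MULTILINE)

def split_units(text, boundary):
    """Split text into lines or sentences at each boundary match.

    boundary is a compiled pattern, or None for an empty REGEXP, in
    which case the whole text is a single unit.
    """
    if boundary is None:
        return [text] if text else []
    units, start = [], 0
    for m in boundary.finditer(text):
        if m.end() > start:
            unit = text[start:m.end()].rstrip()
            if unit:
                units.append(unit)
            start = m.end()
    last = text[start:].rstrip()
    if last:
        units.append(last)
    return units
```
-
- A period followed by a single space, as in an abbreviation, does
- not end a sentence with this pattern; two spaces, a tab or an end
- of line are required.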
-
- `-W REGEXP'
- This option selects which regular expression will describe each
- keyword. By default, in `ptx' compatibility mode, a word is
- anything which ends with a space, a tab or a newline; the REGEXP
- used is `[^ \t\n]+'.
-
- In normal mode, a word is a sequence of letters; the REGEXP used
- is `\w+'.
-
- An empty REGEXP is equivalent to not using this option, letting
- the default apply. On regular expression writing and usage,
- see *Note Regexps::.
-
- This option is not available when the program is operating in
- `ptx' compatibility mode.
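-
- The difference between the two default keyword expressions shows
- up on a word such as `don't'. The sketch below uses Python's `re'
- module as a stand-in; Python's `\w' also accepts digits and
- underscores, so it is only an approximation of "a sequence of
- letters". The names are assumptions.
-
```python
import re

PTX_WORD = re.compile(r"[^ \t\n]+")   # default in ptx compatibility mode
NORMAL_WORD = re.compile(r"\w+")      # default in normal mode

def keywords(text, word_re):
    """Return (offset, word) for every keyword matched in text."""
    return [(m.start(), m.group()) for m in word_re.finditer(text)]
```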
-
-
- File: gptx.info, Node: Output formatting, Prev: Input processing, Up: Usage
-
- Output formatting
- .................
-
- Output format is mainly controlled by the `-O' and `-T' options,
- described in the table below. However, when neither `-O' nor `-T' is
- selected, and if we are not running in `ptx' compatibility mode, the
- program chooses an output format suited for a dumb terminal. This is
- the default format when working in normal mode. Each keyword
- occurrence is output to the center of one line, surrounded by its
- left and right contexts. Each field is properly justified, so the
- concordance output can be read easily. As a special feature,
- if automatic references are selected by option `-A' and are output
- before the left context, that is, if option `-R' is *not* selected,
- then a colon is added after the reference; this interfaces nicely
- with GNU Emacs `next-error' processing. In this default output
- format, each white space character, like newline and tab, is merely
- changed to exactly one space, with no special attempt to compress
- consecutive spaces. This might change in the future. Except for
- those white space characters, every other character of the underlying
- set of 256 characters is transmitted verbatim.
-
- Output format is further controlled by the following options.
-
- `-g NUMBER'
- Select the size of the minimum white gap between the fields on
- the output line.
-
- `-w NUMBER'
- Select the maximum width of each final output line. If
- references are used, they are included in or excluded from the
- output maximum width depending on the value of option `-R'. If
- option `-R' is not selected, that is, when references are output
- before the left context, the output maximum width takes into
- account the maximum length of all references. If option `-R'
- is selected, that is, when references are output after the right
- context, the output maximum width does not take into account the
- space taken by references, nor the gap that precedes them.
-
- `-A'
- Select automatic references. Each input line will have an
- automatic reference made up of the file name, an open
- parenthesis, the line ordinal and a close parenthesis. However,
- the file name will be empty when standard input is being read.
- If both `-A' and `-r' are selected, then the input reference is
- still read and skipped, but the automatic reference is used at
- output time, overriding the input reference.
-
- This option is not available when the program is operating in
- `ptx' compatibility mode.
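-
- The shape of an automatic reference, as described above, can be
- sketched in a few lines. The function and parameter names are
- hypothetical; the trailing colon follows the description of the
- default output format at the start of this node.
-
```python
def automatic_reference(file_name, line_number, before_context=True):
    """Build the reference for one input line.

    file_name is None or "-" when standard input is being read.
    before_context says whether the reference is output before the
    left context in the default output format, in which case a colon
    is appended.
    """
    name = "" if file_name in (None, "-") else file_name
    reference = f"{name}({line_number})"
    return reference + ":" if before_context else reference
```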
-
- `-R'
- In default output format, when option `-R' is not used, any
- references produced by the effect of options `-r' or `-A' are
- given at the beginning of each output line, before the left
- context. In default output format, when option `-R' is specified,
- references are instead given at the far right of output lines,
- after the right context. For any other output format,
- option `-R' is almost ignored, except for the fact that the
- width of references is *not* taken into account in total output
- width given by `-w' whenever `-R' is selected.
-
- This option is not explicitly selectable when the program is
- operating in `ptx' compatibility mode. However, in this case,
- it is always implicitly selected.
-
- `-F STRING'
- This option will request that any truncation in the output be
- reported using the string STRING. Most output fields
- theoretically extend towards the beginning or the end of the
- current line, or current sentence, as selected with option `-S'.
- But there is a maximum allowed output line width, changeable
- through option `-w', which is further divided into space for
- various output fields. When a field cannot extend all the way to
- the beginning or the end of the current line and still fit in its
- share of that width, it is truncated. By default, the string
- used is a single slash, as in `-F /'.
-
- STRING may have more than one character, as in `-F ...'. Also,
- in the particular case STRING is empty (`-F ""'), truncation
- flagging is disabled, and no truncation marks are appended in
- this case.
-
- This option is not available when the program is operating in
- `ptx' compatibility mode.
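-
- Truncation flagging can be pictured with the following sketch. It
- is a simplification under stated assumptions: the function names
- are invented, fields are cut at any character rather than at word
- boundaries, and the flag is placed at the cut end of the field.
-
```python
def fit_right(field, room, flag="/"):
    """Fit a field that extends rightwards into room columns."""
    if len(field) <= room:
        return field
    keep = max(room - len(flag), 0)
    # Keep the head of the field and mark the cut on the right.
    return (field[:keep] + flag)[:max(room, 0)]

def fit_left(field, room, flag="/"):
    """Fit a field that extends leftwards into room columns."""
    if len(field) <= room:
        return field
    if room <= 0:
        return ""
    keep = max(room - len(flag), 0)
    # Keep the tail of the field and mark the cut on the left.
    tail = field[len(field) - keep:]
    return (flag + tail)[-room:]
```
-
- With an empty flag, as in `-F ""', the field is cut silently.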
-
- `-O'
- Choose an output format suitable for `nroff' or `troff'
- processing. Each output line will look like:
-
- .xx "TAIL" "BEFORE" "KEYWORD_AND_AFTER" "HEAD" "REF"
-
- so it will be possible to write an `.xx' roff macro to take care
- of the output typesetting. This is the default output format
- when working in `ptx' compatibility mode.
-
- In this output format, each non-graphical character, like
- newline and tab, is merely changed to exactly one space, with no
- special attempt to compress consecutive spaces. Each quote
- character `"' is doubled so it will be correctly processed by
- `nroff' or `troff'. All characters having their eighth bit set
- are turned into spaces in this version. It is expected that
- diacriticized characters will be correctly expressed in `roff'
- terms once I learn how to do this. So, let me know how to improve
- this special character processing.
-
- This option is not available when the program is operating in
- `ptx' compatibility mode. In fact, it then becomes the default and
- sole output format.
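-
- The character handling described for this format can be sketched
- as follows. The function names are assumptions; the field order is
- the one shown in the sample `.xx' line above.
-
```python
def roff_field(text):
    """Prepare one field for inclusion in a quoted roff argument."""
    out = []
    for ch in text:
        if ch == '"':
            out.append('""')          # double each quote character
        elif " " <= ch <= "~":
            out.append(ch)            # printable 7-bit characters pass
        else:
            out.append(" ")           # control and 8-bit characters
    return "".join(out)

def roff_line(tail, before, keyword_and_after, head, ref, macro="xx"):
    """Build one output line calling the roff macro."""
    fields = (tail, before, keyword_and_after, head, ref)
    quoted = " ".join('"' + roff_field(f) + '"' for f in fields)
    return "." + macro + " " + quoted
```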
-
- `-T'
- Choose an output format suitable for TeX processing. Each
- output line will look like:
-
- \xx {TAIL}{BEFORE}{KEYWORD}{AFTER}{HEAD}{REF}
-
- so it will be possible to write a `\xx' definition to take
- care of the output typesetting. Note that when references are
- not being produced, that is, neither option `-A' nor option `-r'
- is selected, the last parameter of each `\xx' call is omitted.
-
- In this output format, some special characters, like `$', `%',
- `&', `#' and `_' are automatically protected with a backslash.
- Curly brackets `{', `}' are also protected with a backslash, but
- also enclosed in a pair of dollar signs to force mathematical
- mode. The backslash itself produces the sequence
- `\backslash{}'. Circumflex and tilde diacritics produce the
- sequence `^\{ }' and `~\{ }' respectively. Other diacriticized
- characters of the underlying character set produce an
- appropriate TeX sequence as far as possible. The other
- non-graphical characters, like newline and tab, and all other
- characters which are not part of ASCII, are merely changed to
- exactly one space, with no special attempt to compress
- consecutive spaces. Let me know how to improve this special
- character processing for TeX.
-
- This option is not available when the program is operating in
- `ptx' compatibility mode.
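-
- The protection rules listed for this format can be sketched as
- below. The function names are assumptions. The sketch leaves out
- the table of diacriticized characters and simply writes a space
- for every non-ASCII character, which the program itself does only
- for characters it cannot express in TeX.
-
```python
def tex_field(text):
    """Protect the characters of one field for TeX."""
    out = []
    for ch in text:
        if ch in "$%&#_":
            out.append("\\" + ch)             # protect with a backslash
        elif ch in "{}":
            out.append("$\\" + ch + "$")      # backslash, in math mode
        elif ch == "\\":
            out.append("\\backslash{}")
        elif ch in "^~":
            out.append(ch + "\\{ }")          # sequences given above
        elif " " <= ch <= "~":
            out.append(ch)
        else:
            out.append(" ")                   # simplification, see text
    return "".join(out)

def tex_line(tail, before, keyword, after, head, ref=None):
    """Build one `\\xx' call; ref is None when no references are made."""
    fields = [tail, before, keyword, after, head]
    if ref is not None:
        fields.append(ref)
    return "\\xx " + "".join("{" + tex_field(f) + "}" for f in fields)
```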
-
-
- File: gptx.info, Node: Regexps, Next: ptx mode, Prev: Usage, Up: Top
-
- Syntax of Regular Expressions
- -----------------------------
-
- Regular expressions have a syntax in which a few characters are
- special constructs and the rest are "ordinary". An ordinary
- character is a simple regular expression which matches that character
- and nothing else. The special characters are `$', `^', `.', `*',
- `+', `?', `[', `]' and `\'; no new special characters will be
- defined. Any other character appearing in a regular expression is
- ordinary, unless a `\' precedes it.
-
- For example, `f' is not a special character, so it is ordinary,
- and therefore `f' is a regular expression that matches the string `f'
- and no other string. (It does not match the string `ff'.) Likewise,
- `o' is a regular expression that matches only `o'.
-
- Any two regular expressions A and B can be concatenated. The
- result is a regular expression which matches a string if A matches
- some amount of the beginning of that string and B matches the rest of
- the string.
-
- As a simple example, we can concatenate the regular expressions
- `f' and `o' to get the regular expression `fo', which matches only
- the string `fo'. Still trivial. To do something nontrivial, you
- need to use one of the special characters. Here is a list of them.
-
- `. (Period)'
- is a special character that matches any single character except
- a newline. Using concatenation, we can make regular expressions
- like `a.b' which matches any three-character string which begins
- with `a' and ends with `b'.
-
- `*'
- is not a construct by itself; it is a suffix, which means the
- preceding regular expression is to be repeated as many times as
- possible. In `fo*', the `*' applies to the `o', so `fo*'
- matches one `f' followed by any number of `o's. The case of
- zero `o's is allowed: `fo*' does match `f'.
-
- `*' always applies to the smallest possible preceding
- expression. Thus, `fo*' has a repeating `o', not a repeating
- `fo'.
-
- The matcher processes a `*' construct by matching, immediately,
- as many repetitions as can be found. Then it continues with the
- rest of the pattern. If that fails, backtracking occurs,
- discarding some of the matches of the `*'-modified construct in
- case that makes it possible to match the rest of the pattern.
- For example, matching `ca*ar' against the string `caaar', the
- `a*' first tries to match all three `a's; but the rest of the
- pattern is `ar' and there is only `r' left to match, so this try
- fails. The next alternative is for `a*' to match only two `a's.
- With this choice, the rest of the regexp matches successfully.
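-
- The backtracking just described can be traced by hand. The sketch
- below is written for the single pattern `ca*ar' only and is not a
- general matcher; the function name is an assumption. It records
- how many `a's the `a*' part holds at each try.
-
```python
import re

def match_ca_star_ar(s):
    """Match `ca*ar' against the whole of s, recording each try.

    Returns (matched, tries), where tries lists the number of `a's
    given to `a*' at each attempt, in order.
    """
    tries = []
    if not s.startswith("c"):
        return False, tries
    # Greedy step: take as many `a's as can be found after the `c'.
    n = 0
    while 1 + n < len(s) and s[1 + n] == "a":
        n += 1
    # Backtracking: give back one `a' at a time until `ar' fits.
    for count in range(n, -1, -1):
        tries.append(count)
        if s[1 + count:] == "ar":
            return True, tries
    return False, tries
```
-
- Against `caaar', the first try holds three `a's and fails, and the
- second try holds two and succeeds, as in the text.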
-
- `+'
- is a suffix character similar to `*' except that it requires
- that the preceding expression be matched at least once. So, for
- example, `ca+r' will match the strings `car' and `caaaar' but
- not the string `cr', whereas `ca*r' would match all three strings.
-
- `?'
- is a suffix character similar to `*' except that it can match
- the preceding expression either once or not at all. For
- example, `ca?r' will match `car' or `cr'; nothing else.
-
- `[ ... ]'
- `[' begins a "character set", which is terminated by a `]'. In
- the simplest case, the characters between the two form the set.
- Thus, `[ad]' matches either one `a' or one `d', and `[ad]*'
- matches any string composed of just `a's and `d's (including the
- empty string), from which it follows that `c[ad]*r' matches
- `cr', `car', `cdr', `caddaar', etc.
-
- Character ranges can also be included in a character set, by
- writing two characters with a `-' between them. Thus, `[a-z]'
- matches any lower-case letter. Ranges may be intermixed freely
- with individual characters, as in `[a-z$%.]', which matches any
- lower case letter or `$', `%' or period.
-
- Note that the usual special characters are not special any more
- inside a character set. A completely different set of special
- characters exists inside character sets: `]', `-' and `^'.
-
- To include a `]' in a character set, you must make it the first
- character. For example, `[]a]' matches `]' or `a'. To include
- a `-', write `--', which is a range containing only `-'. To
- include `^', make it other than the first character in the set.
-
- `[^ ... ]'
- `[^' begins a "complement character set", which matches any
- character except the ones specified. Thus, `[^a-z0-9A-Z]'
- matches all characters except letters and digits.
-
- `^' is not special in a character set unless it is the first
- character. The character following the `^' is treated as if it
- were first (`-' and `]' are not special there).
-
- Note that a complement character set can match a newline, unless
- newline is mentioned as one of the characters not to match.
-
- `^'
- is a special character that matches the empty string, but only
- if at the beginning of a line in the text being matched.
- Otherwise it fails to match anything. Thus, `^foo' matches a
- `foo' which occurs at the beginning of a line.
-
- `$'
- is similar to `^' but matches only at the end of a line. Thus,
- `xx*$' matches a string of one `x' or more at the end of a line.
-
- `\'
- has two functions: it quotes the special characters (including
- `\'), and it introduces additional special constructs.
-
- Because `\' quotes special characters, `\$' is a regular
- expression which matches only `$', and `\[' is a regular
- expression which matches only `[', and so on.
-
- Note: for historical compatibility, special characters are treated
- as ordinary ones if they are in contexts where their special meanings
- make no sense. For example, `*foo' treats `*' as ordinary since
- there is no preceding expression on which the `*' can act. It is
- poor practice to depend on this behavior; better to quote the special
- character anyway, regardless of where it appears.
-
- For the most part, `\' followed by any character matches only that
- character. However, there are several exceptions: characters which,
- when preceded by `\', are special constructs. Such characters are
- always ordinary when encountered on their own. Here is a table of
- `\' constructs.
-
- `\|'
- specifies an alternative. Two regular expressions A and B with
- `\|' in between form an expression that matches anything that
- either A or B will match.
-
- Thus, `foo\|bar' matches either `foo' or `bar' but no other
- string.
-
- `\|' applies to the largest possible surrounding expressions.
- Only a surrounding `\( ... \)' grouping can limit the grouping
- power of `\|'.
-
- Full backtracking capability exists to handle multiple uses of
- `\|'.
-
- `\( ... \)'
- is a grouping construct that serves three purposes:
-
- 1. To enclose a set of `\|' alternatives for other operations.
- Thus, `\(foo\|bar\)x' matches either `foox' or `barx'.
-
- 2. To enclose a complicated expression for the postfix `*' to
- operate on. Thus, `ba\(na\)*' matches `bananana', etc.,
- with any (zero or more) number of `na' strings.
-
- 3. To mark a matched substring for future reference.
-
- This last application is not a consequence of the idea of a
- parenthetical grouping; it is a separate feature which happens
- to be assigned as a second meaning to the same `\( ... \)'
- construct because there is no conflict in practice between the
- two meanings. Here is an explanation of this feature:
-
- `\DIGIT'
- after the end of a `\( ... \)' construct, the matcher remembers
- the beginning and end of the text matched by that construct.
- Then, later on in the regular expression, you can use `\'
- followed by DIGIT to mean "match the same text matched the
- DIGIT'th time by the `\( ... \)' construct."
-
- The strings matching the first nine `\( ... \)' constructs
- appearing in a regular expression are assigned numbers 1 through
- 9 in order that the open-parentheses appear in the regular
- expression. `\1' through `\9' may be used to refer to the text
- matched by the corresponding `\( ... \)' construct.
-
- For example, `\(.*\)\1' matches any newline-free string that is
- composed of two identical halves. The `\(.*\)' matches the
- first half, which may be anything, but the `\1' that follows
- must match the same exact text.
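-
- The same example can be tried with Python's `re' module, where
- grouping is written `(...)' instead of `\( ... \)' but `\1' has
- the same meaning. This is only a stand-in for the matcher used by
- the program; the names are assumptions.
-
```python
import re

# Python spelling of the expression `\(.*\)\1' discussed above.
TWO_HALVES = re.compile(r"(.*)\1")

def is_doubled(s):
    """Tell whether s is made of two identical newline-free halves."""
    return TWO_HALVES.fullmatch(s) is not None
```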
-
- `\`'
- matches the empty string, provided it is at the beginning of the
- buffer.
-
- `\''
- matches the empty string, provided it is at the end of the buffer.
-
- `\b'
- matches the empty string, provided it is at the beginning or end
- of a word. Thus, `\bfoo\b' matches any occurrence of `foo' as a
- separate word. `\bballs?\b' matches `ball' or `balls' as a
- separate word.
-
- `\B'
- matches the empty string, provided it is not at the beginning or
- end of a word.
-
- `\<'
- matches the empty string, provided it is at the beginning of a
- word.
-
- `\>'
- matches the empty string, provided it is at the end of a word.
-
- `\w'
- matches any word-constituent character. The editor syntax table
- determines which characters these are.
-
- `\W'
- matches any character that is not a word-constituent.
-
- Here is a complicated regexp, used by Emacs to recognize the end
- of a sentence together with any whitespace that follows. It is given
- in Lisp syntax to enable you to distinguish the spaces from the tab
- characters. In Lisp syntax, the string constant begins and ends with
- a double-quote. `\"' stands for a double-quote as part of the
- regexp, `\\' for a backslash as part of the regexp, `\t' for a tab
- and `\n' for a newline.
-
- "[.?!][]\"')]*\\($\\|\t\\|  \\)[ \t\n]*"
-
- This contains four parts in succession: a character set matching
- period, `?' or `!'; a character set matching close-brackets, quotes
- or parentheses, repeated any number of times; an alternative in
- backslash-parentheses that matches end-of-line, a tab or two spaces;
- and a character set matching whitespace characters, repeated any
- number of times.
-
-
- File: gptx.info, Node: ptx mode, Next: Future, Prev: Regexps, Up: Top
-
- `ptx' compatibility mode
- ------------------------
-
- This section outlines the differences between this program and
- standard `ptx'. There is also a `ptx' compatibility mode in this
- program which is activated implicitly when the program is called
- under the name `ptx' or explicitly through the use of option `-p'.
- For someone used to standard `ptx', here are some points worth
- noticing when not using `ptx' compatibility mode:
-
- * In normal mode, concordance output is not formatted for `troff'
- or `nroff'. By default, output is rather formatted for a dumb
- terminal. `troff' or `nroff' output may still be selected
- through option `-O'.
-
- * In normal mode, unless the `-R' option is used, the maximum
- reference width is subtracted from the total output line width.
- In `ptx' compatibility mode, the width of references is not taken
- into account in the output line width computations.
-
- * In normal mode, all 256 characters, even `NUL's, are read and
- processed from input file with no adverse effect. No attempt is
- made to limit this in `ptx' compatibility mode. However,
- standard `ptx' does not accept 8-bit characters, a few control
- characters are rejected, and the tilde `~' is condemned.
-
- * In normal mode, input lines may be of arbitrary length. No
- attempt is made to limit this in `ptx' compatibility mode.
- However, standard `ptx' processes only the first 200 characters
- in each line.
-
- * In normal mode, the break (non-word) characters default to
- every character except letters. In `ptx' compatibility mode,
- the break characters default to space, tab and newline only.
-
- * In some circumstances, output lines are filled a little more
- completely in normal mode than in `ptx' compatibility mode.
- Even in `ptx' mode, there are some slight disposition glitches
- this program does not completely reproduce, even if it comes
- quite close.
-
- * The Ignore file default in `ptx' compatibility mode is not the
- same as in normal mode. In default installation, default Ignore
- files are `/usr/lib/eign' in `ptx' compatibility mode, and
- nothing in normal mode.
-
- * Standard `ptx' disallows specifying both the Ignore file and the
- Only file at the same time. This version allows both, and
- specifying an Only file does not inhibit processing the Ignore
- file.
-
-
- File: gptx.info, Node: Future, Prev: ptx mode, Up: Top
-
- Development guidelines
- ----------------------
-
- This should evolve towards a concordance package for GNU, able to
- tackle true, real, big concordance jobs, while being fast and easy
- to use for little jobs. The starting point is standard `ptx'.
- Because several packages of this kind are awfully slow, I should
- reasonably try to keep speed in mind. On the other hand, I do not
- want to burden myself too much about interactive query for now; so,
- a future reorientation along this topic might require some work.
-
- Here is a *What To Do Next* list, in expected execution order.
-
- 1. Increase short term usability:
-
- * Support the program for the GNU community. As directed by
- user comments, test and debug the whole thing more fully,
- and on bigger examples. Solve portability glitches as long
- as this does not make the code too ugly.
-
- * Provide sample macros in the documentation.
-
- * Understand and mimic `-t' option, if I can.
-
- * See how TeX mode could be made more useful, and if a
- texinfo mode would mean something to someone.
-
- * Sort keywords intelligently for Latin-1 code. See how to
- interface this character set with various output formats.
- Also, introduce options to inverse-sort and possibly to
- reverse-sort.
-
- * Improve speed for Ignore and Only tables. Consider hashing
- instead of sorting. Consider playing with obstacks to
- digest them.
-
- * Provide better handling of format effectors obtained from
- input, and also attempt white space compression on output
- which would still maximize full output width usage.
-
- 2. Provide multiple language support.
-
- Most of the boosting work should go along the line of fast
- recognition of multiple and complex boundaries, which define
- various `languages'. Each such language has its own rules for
- words, sentences, paragraphs, and reporting requests. This is
- less difficult than I first thought:
-
- * Learn how to use getopt, or write something if necessary.
- Recognize language modifiers with each option. At least
- -b, -i, -o, -W, -S, and also new language switcher options,
- will have such modifiers. Modifiers on language switchers
- will allow or disallow language transitions.
-
- * Complete the transformation of underlying variables into
- arrays in the code.
-
- * Implement a heap of positions in the input file. There is
- one entry in the heap for each compiled regexp; it is
- initialized by a re_search after each regexp compile.
- Regexps reschedule themselves in the heap when their
- position passes while scanning input. In this way, looking
- simultaneously for a lot of regexps should not be too
- inefficient, once the scanning starts. If this works ok,
- maybe consider accepting regexps in Only and Ignore tables.
-
- * Merge with language processing boundary processing options,
- really integrating -S processing as a special case. Maybe,
- implement several levels of boundaries. See how to
- implement a stack of languages, for handling quotations.
- See if more sophisticated references could be handled as
- another special case of a language.
-
- 3. Tackle other aspects, in a more long term view:
-
- * Add options for statistics, frequency lists, referencing,
- and all other prescreening tools and subsidiary tasks of
- concordance production.
-
- * Develop an interactive mode. Even better, construct a GNU
- Emacs interface. I'm looking at Gene Myers'
- <gene@cs.arizona.edu> suffix arrays as a possible
- implementation along those ideas.
-
- * Implement hooks so that word classification and tagging can
- be merged in. See how to effectively hook in lemmatisation
- or other morphological features. It is far from being
- clear by now how to interface this correctly, so some
- experimentation is mandatory.
-
- * Profile and speed up the whole thing.
-
- * Make it work on small address space machines. Consider
- three levels of hugeness for files, and three corresponding
- algorithms to make optimal use of memory. The first case
- is when all the input files and all the word references fit
- in memory: this is the case currently implemented. The
- second case is when the files cannot fit all together in
- memory, but the word references do. The third case is when
- even the word references cannot fit in memory.
-
- * There also are subsidiary developments for in-core
- incremental sort routines as well as for a huge external
- sort package. The need for more flexible sort packages
- comes partly from the fact that linguists use kinds of keys
- which compare in unusual and more sophisticated ways.
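-
- The heap of positions proposed in item 2 above can be prototyped
- in a few lines. This is a sketch of the idea only, not existing
- code of the program: it uses Python's `heapq' and `re' modules in
- place of `re_search', and the function name and the pattern names
- in the usage note are assumptions.
-
```python
import heapq
import re

def scan(text, patterns):
    """Yield (position, name, matched_text) in order of position.

    patterns maps a name to a regular expression string. The heap
    holds one entry per regexp: the position of its next match.
    """
    compiled = {name: re.compile(p) for name, p in patterns.items()}
    heap = []
    # Initialize the heap with one search per compiled regexp.
    for name, rx in compiled.items():
        m = rx.search(text)
        if m:
            heapq.heappush(heap, (m.start(), name, m))
    while heap:
        position, name, m = heapq.heappop(heap)
        yield position, name, m.group()
        # Reschedule this regexp past the match just reported.
        resume = m.end() if m.end() > m.start() else m.end() + 1
        if resume <= len(text):
            following = compiled[name].search(text, resume)
            if following:
                heapq.heappush(heap, (following.start(), name, following))
```
-
- For instance, scanning `ab. cd! ef' with a `word' pattern
- `[a-z]+' and an `end' pattern `[.!?]' reports the words and the
- sentence ends interleaved in input order, while each regexp is
- searched again only after its own previous match has been passed.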
-
-
-
- Tag Table:
- Node: Top856
- Node: Usage1818
- Node: General options4537
- Node: Charset selection4995
- Node: Input processing5781
- Node: Output formatting11223
- Node: Regexps18062
- Node: ptx mode28171
- Node: Future30648
- End Tag Table
-