IXPARSE
Section: User Commands (1)
Updated: August 24, 1993
NAME
ixparse - generate and convert text processing information files
SYNOPSIS
/usr/bin/ixparse
[ -aAbChHfgnNprUvwWx ]
[ -ttype ]
[ -Dfile ] [ -Ffile ] [ -Sfile ]
[ -Llanguage ]
[ -M# ] [ -P# ]
[ -ystring ] [ file ... ]
DESCRIPTION
Given a list of files, or a stream on standard input,
ixparse
generates one of four types of profiling information
on standard output.
With the -v option,
ixparse
can also generate profiling information
for each input file; the output
is put into separate files named by adding an extension
to the input file's name
(see below for the extensions).
The four types of profile are:
weighting domain (a binary format defined
by the Indexing Kit's IXWeightingDomain class),
histogram,
description,
and
Attribute Reader Format.
The binary weighting domain format is undocumented.
A description is a short summary that
can be derived from some file formats,
such as UNIX manual pages.
Attribute Reader Format is described
in the Indexing Kit documentation in the
NEXTSTEP General Reference.
Histogram format is described below.
Weighting domain files can be used with
ixbuild(1)
or again with
ixparse
to alter the weighting of tokens in the index
or profile.
For example, a weighting domain could
be generated for all the source files in a
development project:
-
ixparse -W *.[cm] >project.weight
and that file could be used again with
ixparse:
-
ixparse -Hp -Dproject.weight MyObject.m
The result would be a histogram where the
weights of words are skewed such that if two
words occur the same number of times in
MyObject.m, those occurring less frequently in
the entire set of source files (that is, in
the domain file project.weight)
have higher weights.
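The skew described above can be illustrated with a small Python sketch. The formula is an assumption made purely for illustration: this page does not document how ixparse computes peculiarity weights. The sketch divides a word's relative frequency in the file by its relative frequency in the domain, so that of two words equally common in the file, the one that is rarer in the domain scores higher. The function name and the sample counts are hypothetical.

```python
from collections import Counter

def peculiarity_weights(file_counts, domain_counts):
    """Illustrative only: ratio of file frequency to domain frequency.

    This is NOT ixparse's documented formula (none is given); it merely
    reproduces the qualitative behavior described in the text.
    """
    file_total = sum(file_counts.values())
    domain_total = sum(domain_counts.values())
    weights = {}
    for word, count in file_counts.items():
        file_freq = count / file_total
        # Assumption of this sketch: a word missing from the domain is
        # treated as seen once, which avoids dividing by zero.
        domain_freq = domain_counts.get(word, 1) / domain_total
        weights[word] = file_freq / domain_freq
    return weights

# Hypothetical counts: both words occur 4 times in the file,
# but "self" is far more common across the whole project.
file_counts = Counter({"self": 4, "archiver": 4})
domain_counts = Counter({"self": 900, "archiver": 10, "return": 90})
weights = peculiarity_weights(file_counts, domain_counts)
```

In this made-up data, "archiver" receives a much higher weight than "self" even though both occur equally often in the file, which matches the behavior the text describes.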
In addition to generating profiling
information for text files,
ixparse
can read existing
profiles in
weighting domain format,
histogram format, and
NEXTSTEP Release 2 Word Frequency Table (WFTable) format,
converting that information to one of the other
formats.
HISTOGRAM FORMAT
Each line of a file in histogram format has the form:
- token weight rank
token is the token or word in the index,
weight is its weight (frequency) in the domain,
and rank is its ordinal rank in the domain
(1 = most common, 2 = second most common, and so on).
rank is only present in histograms produced
by converting from weighting domains.
The fields of the line are separated by single spaces.
Parse each line backward from its end, splitting off
rank (if present) and weight first, because the token
itself can contain embedded spaces or tabs.
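A minimal Python sketch of the backward parse follows. It assumes the caller already knows whether the histogram was converted from a weighting domain (and therefore carries a rank field), and that weights can be read as numbers; neither detail is specified on this page. The function name parse_histogram_line is hypothetical.

```python
def parse_histogram_line(line, has_rank):
    """Split one histogram line into (token, weight, rank).

    has_rank must be supplied by the caller: rank appears only in
    histograms converted from weighting domains, and it cannot be
    detected reliably because a token may itself end in a number.
    """
    line = line.rstrip("\n")
    # Split from the right on single spaces so that spaces or tabs
    # embedded in the token are left untouched.
    fields = line.rsplit(" ", 2 if has_rank else 1)
    if len(fields) != (3 if has_rank else 2):
        raise ValueError(f"malformed histogram line: {line!r}")
    token = fields[0]
    weight = float(fields[1])  # assumption: weights parse as numbers
    rank = int(fields[2]) if has_rank else None
    return token, weight, rank
```

Splitting from the right with a fixed field count keeps the token intact no matter how many spaces or tabs it contains.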
OPTIONS
- --
-
List these options.
The following options select input and output formats.
At most one of the input options
-t, -h, -w, and -x
and at most one of the output options
-H, -g, -W, and -b
can be specified.
- -ttype
-
Interpret input as of file type type
(for example, -trtf for Rich Text Format).
By default,
ixparse
attempts to determine the file type
for each file automatically.
- -w
-
Interpret input as weighting domain format.
- -h
-
Interpret input as histogram format.
- -x
-
Interpret input as NEXTSTEP Release 2 WFTable format.
- -H
-
Generate output in histogram format.
This is the default.
- -g
-
Generate output as descriptions of file contents.
- -W
-
Generate output in weighting domain format.
- -b
-
Generate output in Attribute Reader Format.
- -v
-
Vector mode. Generate an output file for each input file.
Histogram and Attribute Reader Format files
have an extension of .histogram
(this is a bug; Attribute Reader Format files should use .arf).
Weighting domain format files have an extension of .weight.
Description files have an extension of .description.
The remaining options control other parsing switches and
weighting calculations.
- -a
-
Use absolute weighting.
The weight of a token (word) is its number
of occurrences in the input.
- -A
-
Don't fold plural word forms.
The default is to do plural folding.
- -C
-
Don't fold case to lower case.
The default is to fold case.
- -Dfile
-
Use the supplied weighting domain file (default .index.domain).
This is used for generating peculiarity weighting.
- -f
-
Use frequency weighting (number of occurrences / total tokens).
- -Ffile
-
Use the supplied file type table file (default .index.ftype).
See the
ixbuild(1)
manual page for more information on file type tables.
- -Llanguage
-
Parse files as though they contain text in the language language.
If no language is specified, the system default language is used.
- -M#
-
Use the supplied minimum weight;
words below this weight are dropped from the index.
The default is no minimum weight.
This option excludes use of the -P option.
- -n
-
Sort histogram output by name rather than weight.
- -N
-
Do not sort histogram output.
- -p
-
Use peculiarity weighting in conjunction
with a weighting domain (see -D).
- -P#
-
Use the supplied percentage passed;
words below this percentage are dropped from the index.
The default is 100% passed.
This option excludes use of the -M option.
- -r
-
Reduce words to stems; for example, writer -> write.
The default is not to do this.
- -Sfile
-
Use the supplied stop words file
(default .index.swords).
See the
ixbuild(1)
manual page for more information on stop words files.
- -U
-
Disable uniquing in Attribute Reader Format.
See the Attribute Reader Format documentation
for more information.
- -ystring
-
Use the supplied punctuation string to delimit words;
for example, -y".,; ".
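The weighting and delimiter options above can be pictured with a short Python sketch. It is not ixparse's implementation: the function names are hypothetical, the delimiter string is simply the -y example from this page, and plural folding, stemming, and stop words are left out. It shows absolute weighting (compare -a), frequency weighting (compare -f), and dropping tokens below a minimum weight (compare -M).

```python
from collections import Counter

def tokenize(text, delimiters=".,; ", fold_case=True):
    """Split text on any character in delimiters (compare -y and -C)."""
    if fold_case:
        text = text.lower()
    tokens, current = [], []
    for ch in text:
        if ch in delimiters:
            # A delimiter ends the current token, if there is one.
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens

def absolute_weights(tokens):
    """Weight = number of occurrences (compare -a)."""
    return dict(Counter(tokens))

def frequency_weights(tokens):
    """Weight = occurrences / total tokens (compare -f)."""
    total = len(tokens)
    return {tok: n / total for tok, n in Counter(tokens).items()}

def drop_below(weights, minimum):
    """Discard tokens whose weight is below minimum (compare -M)."""
    return {tok: w for tok, w in weights.items() if w >= minimum}

tokens = tokenize("Index the index; parse the file.")
absolute = absolute_weights(tokens)
frequency = frequency_weights(tokens)
kept = drop_below(absolute, 2)
```

Absolute weights grow with the size of the input, while frequency weights are normalized by the token count, so the latter are easier to compare across files of different lengths.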
SEE ALSO
ixbuild(1), ixsearch(1),
Indexing Kit Documentation in NEXTSTEP General Reference
BUGS
ixparse
doesn't read data in Attribute Reader Format.
ixparse
filters files from various formats during parsing.
It should make the intermediate filtered formats
available as output options.
Sorting options don't apply when converting from weighting domain
format to histogram format.
Output files generated by vector mode in Attribute Reader Format
should use .arf as their extension, not .histogram.