htmlchek version 4.1, February 20 1995
Name:
htmlchek.awk, htmlchek.pl - Syntactically checks HTML 2.0 or 3.0
files for a number of possible errors; can do local link cross-
reference checking, and generate a rudimentary reference-dependency
map. Runs under awk or perl. Includes a number of supplemental
utilities for HTML file processing:
dehtml - Removes all HTML markup, preliminary to spell check.
entify - Replaces high Latin-1 alphabetic characters with
ampersand entities for safe 7-bit transport.
metachar - Trivial program to protect HTML/SGML metacharacters
"&<>" in plain text that is to be included in an HTML file.
makemenu - Makes simple menu for HTML files, based on each file's
<TITLE>; can also make a simple table of contents based on
<H1>-<H6> headings.
xtraclnk.pl - Extracts links/anchors from HTML files; isolates
text contained in <A> and <TITLE> elements.
Typical Command Lines:
awk -f htmlchek.awk [options] infiles.html > outfile.check
perl htmlchek.pl [options] infiles.html > outfile.check
The options are in the form "option=value" (see the sections ``Command-line
Options'' and ``Language Customization Options'' below). The following is
an alternative invocation of htmlchek.awk under Unix (to ensure, as far as
possible, that the program is not run under incompatible ``old awk''):
sh htmlchek.sh [options] infiles.html > outfile.check
(If the files htmlchek.awk, htmlchek.pl, or htmlchek.sh are not in the
current directory, the pathname to where they are located will have to
be prefixed -- but see ``shell scripts'' below.)
Description:
This program checks for quite a number of possible defects in the HTML
(Hyper-Text Mark-up Language) version 2.0 SGML files used on the
World-Wide Web. (Files with Netscape extensions, or with features from
the preliminary Arena/HTML 3.0 document, can also be checked by
specifying the appropriate options, as explained below.) Diagnostic
messages are output to STDOUT and so generally appear on the
terminal/window, unless they are redirected to an output file, as is
done in the examples given above (of course, this all depends on the
operating system -- the Macintosh doesn't even have a "command line" as
such, but you can set up "droplets" with MacPerl).
The output of htmlchek is divided into two parts, for each file
checked:
First, if any possible problems are detected, these are signaled by
messages (one per line) on each problem. (Note that lines of output
which signal errors and warnings all contain the character `!'.)
"ERROR!"
This string is included when there is a definite error in the
input HTML source code. Sometimes multiple error messages can be
generated by a single error (see ``Limitations'' below), in which
case only the first message may be significant.
"Warning!"
This string is included in messages which point out stylistically
deprecated HTML coding, or the absence of certain recommended
features. Such messages are intended to be more or less advisory.
Second, at the end of each file's output, diagnostics are generated as
to the tags used in the file and the options used with each tag, along
with possible additional global warnings (these final diagnostics/
warnings can be longer than 80 columns).
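As a rough illustration of the convention described above, the following Python sketch sorts lines of htmlchek output into errors, warnings, and everything else by looking for the marker strings "ERROR!" and "Warning!". Only the two marker strings and the quoted message fragments come from this document; the sample lines and their layout are invented for the example, and real output may look different.

```python
def classify(lines):
    """Split htmlchek output lines by the marker strings "ERROR!" and "Warning!"."""
    errors, warnings, other = [], [], []
    for line in lines:
        if "ERROR!" in line:
            errors.append(line)
        elif "Warning!" in line:
            warnings.append(line)
        else:
            other.append(line)
    return errors, warnings, other


# Hypothetical sample lines; the exact wording and layout are assumptions.
sample = [
    "ERROR! <LI> outside list",
    "Warning! Jump from header level H0",
    "a file-final diagnostic line without a marker",
]
errors, warnings, other = classify(sample)
print(len(errors), "error(s),", len(warnings), "warning(s)")
```

Since several messages can stem from a single mistake, a script built on this idea might reasonably show only the first error per file.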
A very limited form of cross-reference checking (making sure that
file-local <...HREF="#..."> references actually exist) is automatically
performed within each file; for larger-scale cross-reference checking
see the appropriate section below.
If you process more than one file at a time (by specifying multiple
files or wildcards on the command line, e.g. ``perl htmlchek.pl *.html''
or ``awk -f htmlchek.awk *.html''), then errors are located by filename
and line number.
HTML Error Messages:
Most of the error and warning messages should be fairly self-evident,
assuming a familiarity with the basic HTML language documentation; the
following is a basic glossary of terms used (note that tag "options" are
what are called "attributes" in SGML):
An "element" is <X>...</X> (for example, <A HREF="#page2">Page 2</A>).
A "tag name" is <X...> (for example, "A" in <A HREF="#page2">).
An "option" is <...Y="..."> (for example, "HREF" in <A HREF="#page2">).
An "option value" is <...="Z"> (for example, "#page2" in <A HREF="#page2">).
One warning that may be obscure, "Jump from header level H0", means
that the first heading in the file is not <H1>; to be consistent with a
system of sub-sections, the first heading should be <H1>, there should
only be one <H1> heading in a file, and the heading level should never
increase in value by more than 1 between two successive headers, as in
the following scheme:
 _____________________________________________________
|                  [whole document]                   |
| H1 ------------------------------------------------ |
|_____________________________________________________|
|             [first-level subdivisions]              |
| H2 ------------ | H2 ------------ | H2 ------------ |
|_________________|_________________|_________________|
|             [second-level subdivisions]             |
| H3 --- | H3 --- | H3 --- | H3 --- | H3 --- | H3 --- |
|________|________|________|________|________|________|
                         etc.
To check whether or not the headings in a file reflect the file's
logical organization, run the file through makemenu with the toc=1
command-line option, and see what you get.
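The heading rules above can be sketched in a few lines of Python. This is not htmlchek's code, only an illustration of the three rules (start at <H1>, a single <H1>, never go deeper by more than one level at a time). The function name and the message wording beyond "Jump from header level H0" are assumptions made for the example.

```python
def heading_problems(levels):
    """Check a sequence of heading levels (1 for <H1> ... 6 for <H6>).

    Returns a list of human-readable problems; an empty list means the
    sequence follows the sub-section scheme described above.
    """
    problems = []
    previous = 0          # "H0" stands for "no heading seen yet"
    h1_count = 0
    for level in levels:
        if level == 1:
            h1_count += 1
        if level > previous + 1:
            problems.append(f"Jump from header level H{previous} to H{level}")
        previous = level
    if h1_count > 1:
        problems.append("More than one H1 heading")
    return problems


print(heading_problems([2, 3]))         # first heading is not <H1>
print(heading_problems([1, 2, 3, 2]))   # well-formed outline
```

Note that moving back up the hierarchy (for example from <H3> to <H2>) is always allowed; only downward jumps of more than one level are reported.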
An error that can sometimes be counter-intuitive is a ``<LI> outside
list'' or ``<DT>/<DD> outside <DL>...</DL>'' error: in the sequence
<UL><B><LI></B></UL>, the <LI> is actually not in the list, since it is
not immediately contained within the <UL>...</UL> element (but is rather
immediately contained within the non-list <B>...</B> element).
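To make the "immediately contained" rule concrete, here is a small Python sketch that tracks open elements on a stack and reports each <LI> whose innermost open container is not a list. It works on a pre-tokenized list of tag names rather than real HTML, and the set of list tags used is an assumption for the example; htmlchek's own logic is certainly more involved.

```python
LIST_TAGS = {"UL", "OL", "MENU", "DIR"}   # assumed set of list containers


def li_outside_list(tags):
    """Return the indexes of <LI> tags not directly inside a list element.

    `tags` is a sequence such as ["UL", "B", "LI", "/B", "/UL"].
    """
    stack = []
    bad = []
    for i, tag in enumerate(tags):
        if tag.startswith("/"):
            name = tag[1:]
            if name in stack:
                # Close the innermost matching element and anything inside it.
                while stack.pop() != name:
                    pass
        elif tag == "LI":
            # <LI> itself is not pushed; only its immediate parent matters here.
            if not stack or stack[-1] not in LIST_TAGS:
                bad.append(i)
        else:
            stack.append(tag)
    return bad


print(li_outside_list(["UL", "B", "LI", "/B", "/UL"]))
```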
The htmlchek program performs a fairly comprehensive job of checking
for HTML errors, but does not always exactly follow the official
standard. Bad stylistic practices are warned against, as well as actual
HTML errors, and in some cases htmlchek is stricter than the standard,
in order to accommodate the peculiarities of some browsers (the idea is
that HTML code should be ruggedized for the real world, not just
SGML-ically correct -- see below under ``nowswarn=1'', ``nonrecurpair='',
``metachar='', and SHORTTAG). And htmlchek is also laxer than the
standard in allowing <ADDRESS>, <HR>, and <H1>-<H6> headings within
<LI> and <DD> list items (since these tags can occur legitimately
within a <BLOCKQUOTE>...</BLOCKQUOTE> element itself in a list item --
though they are not supposed to occur directly within a list item).
Similarly htmlchek does not check for <IMG> directly in <PRE> (which
is not allowed by the official standard), since the standard does
allow <IMG> indirectly in <PRE>. See further under ``Limitations''
below.
Command-line Options:
Options are in the form "option=value" (where the `=' should not have
spaces on either side of it); options should be specified on the command
line PRECEDING any names of HTML files to check (see ``Typical Command
Lines'' above). Options which follow filenames will not necessarily
take effect (they are silently ignored in Posix-compliant awk, and
generate an error in perl). Also, misspelled options will be silently
ignored in awk. (On Unix, the shell scripts htmlchek.sh and htmlchkp.sh
automatically check for command-line option errors, so you don't have to
worry about these problems.)
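If you start htmlchek from a script of your own rather than through the supplied shell scripts, a small wrapper can enforce the ordering and spacing rules above. The following Python sketch only builds the argument list; the function name and the pattern used to validate options are assumptions for the example, and it does not detect misspelled option names.

```python
import re


def build_command(script, options, files):
    """Return an argv list with every option=value item before the file names."""
    for opt in options:
        # Lower-case option name, '=', and a value with no whitespace anywhere.
        if not re.fullmatch(r"[a-z0-9]+=\S*", opt):
            raise ValueError(f"not in option=value form: {opt!r}")
    return ["awk", "-f", script] + list(options) + list(files)


argv = build_command("htmlchek.awk", ["sugar=1", "nowswarn=1"], ["index.html"])
print(" ".join(argv))
```

The resulting list could be handed to subprocess.run(); keeping the options in a separate list from the file names makes it hard to put them in the wrong order by accident.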
Options that affect the definition of the HTML language used to
interpret and check files are discussed in the ``Language Customization
Options'' section below; the other options are "inline=1", "nowswarn=",
"sugar=", "xref=", "map=", "refsfile=", "append=", "dirprefix=",
"usebase=", and "subtract=".
Output Options:
The three options "inline=1", "nowswarn=", and "sugar=1" control features
of the output of htmlchek:
inline=1
If this command-line option is specified, the output contains
htmlchek messages intermixed with the lines of the original HTML input
file, with error and warning messages placed after the lines to which
they apply. Lines which belong to the htmlchek output, rather than
being copied from the input, begin with "HTMLCHEK:". The inline=1
option is incompatible with (and overrides) the sugar=1 option.
nowswarn=1
If this option is specified, it turns off messages that warn you
about inappropriate whitespace (which may confuse browsers) in
elements commonly rendered with underlining. These warning
messages can be numerous enough to make it difficult to pick out other
warning and error messages.
sugar=1
If this option is specified, then "filename: linenumber:" is
prefixed to non-file-final error and warning messages (for
compatibility with editors such as emacs which use diagnostic output
which is formatted in this way, from Unix tools such as ``cc'' and
``lint'').
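Because inline=1 marks htmlchek's own lines with the "HTMLCHEK:" prefix, the combined output can be split back into the original input and the messages. The Python sketch below shows one way to do that; the sample lines are made up, and the assumption that the prefix is followed directly by the message text is mine, not something this document specifies.

```python
PREFIX = "HTMLCHEK:"


def split_inline(lines):
    """Separate inline=1 output into original input lines and htmlchek messages."""
    source, messages = [], []
    for line in lines:
        if line.startswith(PREFIX):
            messages.append(line[len(PREFIX):].strip())
        else:
            source.append(line)
    return source, messages


# Hypothetical inline output; the message wording is illustrative only.
sample = [
    "<H2>Introduction</H2>",
    "HTMLCHEK: Warning! Jump from header level H0",
    "<P>Some text.",
]
source, messages = split_inline(sample)
print(messages)
```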
Cross-reference Checking Output Options:
These options are connected with details of multi-file cross-reference
checking; if you intend to do such cross-reference checking using the
run?chek.sh shell scripts under Unix, you can ignore these options and
jump to the next section below.
xref=1
If this option is specified, cross-reference checking is performed
on the files that are checked. If the refsfile= option is not
specified, then the results (unresolved locations and references) are
put at the end of the STDOUT output. If refsfile=``prefixname'' is
specified, then the cross-reference checking results are put in
separate files: locations in the checked files which were not
referenced from within these files are in a file named
``prefixname.NAME'', references from the checked files which are not
to locations found within the files are in ``prefixname.HREF'', and
references to in-line images are in ``prefixname.SRC''. (See further
discussion under ``Cross-reference checking'' below.)
map=1
If this option is specified along with xref=1, information about
which files refer to which other files and resources (i.e. a
dependency map) is generated; this can be quite large. If
refsfile=``prefixname'' is specified as well, the dependency map will
be placed in a separate file called ``prefixname.MAP''.
refsfile=``prefixname''
If this option is specified along with xref=1, then the output of
internal cross-reference checking is put in separate files. If
refsfile= is specified without xref=1 also being specified, then raw
lists of non-cross-checked references are output to separate files
(this was used for external cross-reference checking in earlier
versions of htmlchek, as can still be done using the rducfil?.sh
scripts, if desired). All HREF="..." references contained in the HTML
files being checked are output to a file named ``prefixname.HREF'',
all references to in-line images contained in the HTML files are
output to a file ``prefixname.SRC'', and the destination locations
specified in the HTML files are output to a file ``prefixname.NAME''.
(Note that <...HREF="..."> references to non-inline images will be
found in the .HREF file, not the .SRC file.)
append=1
If this option is specified along with refsfile=, then the resulting
three files (or four, if the optional .MAP file is generated) will be
appended to, if they already exist from a previous run, instead of
being replaced. This may be useful for cross-reference checking of
files which are not in a single sub-directory tree on a single machine
(see ``External Cross-Reference Checking'' below). A blank line is
added to each file at the beginning of each run, so that the output
due to successive runs can be separated (but these are not preserved
when the rducfil?.sh scripts are run).
Cross-reference Checking URL Prefix Options:
dirprefix=``pathname''
When "xref=1" and/or "refsfile=" is also specified, then the value
of ``pathname'' (which should be a valid absolute or quasi-absolute
URL pathname beginning) is prefixed to destination locations and
relative URL's. This can be useful in cross-reference checking, in
order to resolve relative URL's to absolute URL's when the files you
are checking sometimes cross-refer with absolute URL's (see the next
section below).
usebase=1
When "usebase=1" is specified, the URL specified in <BASE
HREF="..."> in each file is assumed to be the name of the file (and
the "dirprefix=", if any, is ignored in the processing of the file
after the <BASE> is found). This only takes effect after the <BASE>
tag is encountered in each file, so that <BASE HREF="..."> should be
the first of the tags with NAME, HREF, ID, etc. options in each file.
subtract=``pathname''
If this option is specified, then ``pathname'' is removed from the
beginnings of filenames specified on the command line, before the
dirprefix= or usebase=1 prefix is added, for URL purposes. This can
be useful for running cross-reference checking on files not located in
the current directory. If a filename is given which does not begin
with the specified prefix, then htmlchek stops with an error.
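The interplay of subtract= and dirprefix= can be pictured as a two-step string operation on each filename. The Python sketch below assumes plain prefix removal followed by plain concatenation, which is a simplification; the function name and the sample paths are hypothetical, and htmlchek's actual URL handling may differ in detail.

```python
def location_url(filename, dirprefix="", subtract=""):
    """Turn a command-line filename into the URL used for cross-referencing."""
    if subtract:
        if not filename.startswith(subtract):
            # Mirrors the documented behaviour of stopping with an error.
            raise ValueError(f"{filename!r} does not begin with {subtract!r}")
        filename = filename[len(subtract):]
    return dirprefix + filename


print(location_url("/home/me/public_html/sub/a.html",
                   dirprefix="/~me/",
                   subtract="/home/me/public_html/"))
```

With these (invented) values, a file stored outside the current directory ends up with the same URL that other documents would use to link to it.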
Cross-Reference Checking:
Here ``cross-reference checking'' does not mean traversing the Web and
finding out whether off-site remote URL's actually exist. It only means
gathering together all the locations and references in a local HTML
file, or a collection of local HTML files, and finding all the locations
which are unreferenced within these files, and all the references which
are not to a location within this collection. (A dependency map of what
HTML files reference what resources can also be optionally generated.)
You should generally delay cross-reference checking until you have more
or less debugged your HTML files and corrected syntactically malformed
references.
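In essence this kind of cross-reference checking is a pair of set differences. The Python sketch below illustrates the idea with made-up locations and references; it says nothing about how htmlchek stores or normalizes URLs internally.

```python
def cross_check(locations, references):
    """Compare the locations defined in a set of files with the references made.

    Returns (unreferenced, unresolved):
      unreferenced - locations that nothing in the collection points to
      unresolved   - references that point outside the collection
    """
    locs, refs = set(locations), set(references)
    return sorted(locs - refs), sorted(refs - locs)


# Hypothetical data for illustration.
locations = ["a.html", "a.html#top", "b.html"]
references = ["b.html", "c.html", "http://example.org/"]
unreferenced, unresolved = cross_check(locations, references)
print(unreferenced)
print(unresolved)
```

An unresolved reference is not necessarily an error: in this example one of the two is simply an off-site URL, which this kind of checking never follows.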
The programs htmlchek.awk and htmlchek.pl do cross-reference checking
when the xref=1 command-line option is specified. Under Unix or Posix
1003.2, the shell scripts runachek.sh (for cross-reference checking with
htmlchek.awk) and runpchek.sh (for cross-reference checking with
htmlchek.pl) take care of many of the details of specifying htmlchek
command-line options for cross-reference checking, and invoke the Unix
find utility to look at all the *.html files in a directory hierarchy.
These scripts have the following syntax (where ``dirprefix'' and
``outfileprefix'' stand for the first two shell script command-line
parameters, the presence of which is obligatory):
sh runachek.sh dirprefix outfileprefix [directory] [options]
sh runpchek.sh dirprefix outfileprefix [directory] [options]
The third, optional, command-line parameter ``directory'' is the path to
the top directory of the tree in which all .html files are to be checked
(for example, "$HOME/public_html"); if this parameter is not present,
then the current directory is used by default. This path should not end
with a trailing `/' character. If you specify an incorrect directory
path and get an error message from the Unix `find' utility, then the awk or
Perl interpreter may end up looking for input from STDIN (the keyboard);
press control-D to get back to your shell.
The first parameter ``dirprefix'' should be either the null string (''),
or an absolute or quasi-absolute URL pathname beginning. What
``dirprefix'' should be specified as, depends on how the files you are
checking cross-refer to each other with <... HREF="..."> links. In the
situation in which the files refer to each other strictly with simple
relative URL's (i.e. which do not begin with "//" or "/" -- ignoring the
optional access-method prefix), such as "subdir/otherfile.html#section1",
then you don't need the ``dirprefix'' mechanism, and you can get away
with specifying the first parameter of run?chek.sh as the null string
(and skip the rest of this paragraph). However, if you have
non-relative URL's in your cross references, then you need to specify a
``dirprefix'' (note that if there are files in more than one directory,
and files in subordinate directories refer to files further up the
hierarchy, then you may want to use non-relative URL's, since while
"../" is legitimate in a URL, relative URL's beginning with "../" can
sometimes cause problems). The value to use for ``dirprefix'' should be
the string used, in cross-references among your files with non-relative
URL's, to refer to the root of the tree of files being checked (i.e. the
value of the optional ``directory'' parameter, or, if this is not
specified, the current directory when run?chek.sh is being run).
Whichever type of non-relative URL your documents use for this purpose
(whether a host-local reference like "/~myself/subdir/", a full
reference with access method like "http://myhost.edu/~myself/subdir/",
or any intermediate form), you should use the appropriate prefix as your
``dirprefix'' string; if your files use an inconsistent mixture of these
different reference types, then no single ``dirprefix'' can work, and
cross-reference checking will partially fail. Finally, if each file has
its own name specified in a <BASE HREF="..."> reference, you can let
``dirprefix'' be the null string, and use the option usebase=1.
The second parameter ``outfileprefix'' is the name of the files (with
the extensions ".ERR", ".NAME", ".HREF", and ".SRC" -- and also ".MAP"
if the map=1 parameter is included on the command line) in which the
output of the HTML-checking and cross-referencing process will be put.
After these parameters, optional parameters that follow on the
remainder of the command line can be any of the "option=value" pairs
discussed in the ``Command-line Options'' section above or the
``Language Customization Options'' section below (except for refsfile=,
dirprefix=, xref=1, and subtract=, which are specified within the
run?chek.sh scripts, and listfile=).
So the following are some typical command lines (remember that putting
the name of a shell script first in a command line, as in the first
example, implies that you have set execute permission by running chmod):
runpchek.sh http://uts.cc.utexas.edu/~churchh/ check configfile=example.cfg
sh runachek.sh '' out $HOME/public_html map=1 &
The second example shows how cross-reference checking may be run as a
background process.
If no error has occurred, then when the shell script has finished,
non-cross-referencing errorcheck data is in the file ``outfileprefix.ERR'',
locations in the checked files which were not referenced from within
these files are in a file named ``outfileprefix.NAME'', references from
the checked files which are not to locations found within the files are
in ``outfileprefix.HREF'', and references to in-line images are in
``outfileprefix.SRC''.
If ``outfileprefix.HREF'' and ``outfileprefix.NAME'' have file lengths
greater than one, this does not necessarily signal an error:
``outfileprefix.HREF'' will contain not only `dangling' references to
local HTML files, but also references to non-inline images, sounds,
etc., and external references (to files not in the directory tree being
checked, including files on other WWW sites) as well. Similarly, the
file ``outfileprefix.NAME'' contains locations which are not referenced
locally, but these locations might be referenced from outside the local
directory tree.
It would be nice to check for the existence of local images listed in
``outfileprefix.SRC'' (and also the local non-inline images in
``outfileprefix.HREF''), but in general the references to these images
are in URL format there (rather than in local filesystem format), so
that there is no way to do this at the Unix shell level.
External Cross-Reference Checking:
If you have several directory trees of HTML files which cross-refer,
and each hierarchy needs a different ``dirprefix'', you can still do
cross-reference checking, if you run cross-reference checking for each
individual directory tree, specifying the same output file names for
each run, and use "append=1" as one of the options. (You can even do
cross-reference checking across multiple machines, if you have an
account on each machine, and transfer the cumulative .NAME, .HREF, and
.SRC files to each machine before running local cross-reference checking
on that machine -- of course, the ``dirprefix'' string on each machine
will have to include a hostname for this to work.) Under Unix, the
rducfil?.sh shell scripts will then reduce the resulting .HREF and .NAME
files, so that they only contain unresolved references, by removing all
items which are contained in both the files (these shell scripts will
also sort the .SRC file and collapse duplicate entries). The
rducfil?.sh scripts take only one command-line parameter, the common
prefix of the .NAME, .HREF, and .SRC files to be resolved.
Another way of doing cross-reference checking across multiple
directory trees on a single machine is to generate an external file
containing a list of the HTML files to be checked; this external list
can then be specified by the listfile= or lf= command-line option, as
explained in the ``MS-DOS'' section below (though the listfile=/lf=
parameter is not MS-DOS specific).
Language Customization Options:
By default, htmlchek checks HTML files more or less according to
version 1.24 of the HTML 2.0 standard (with some departures,
as discussed above). However, htmlchek is not limited to checking
HTML files according to a single language definition.
Language Extensions (Arena/HTML3, Netscape):
The following options add extensions to the HTML language checked for:
arena=1 or html3=1 or htmlplus=1
Specifying any of these options means that files are checked
according to a preliminary (December 1994) version of the emerging
HTML 3.0 specification, and not as HTML 2.0. (Note that htmlchek
doesn't check for the differences between MATH and non-MATH.)
netscape=1
Specifying this option means that the Netscape extensions do not
generate error messages.
Since the HTML language will continue to evolve, the HTML 3.0 definition
is still preliminary, and the Netscape extensions document is unclear on
some points (and uses the word "tag" rather confusingly) -- therefore,
the language definitions coded in the htmlchek program are clearly not
cast in stone. For this reason I have also provided htmlchek with the
following command-line or configuration file options to customize many
features of how htmlchek treats individual tags, and thus the language
that is checked for:
Tag definition options:
nonpair=
Defines a tag or a list of tags as non-pairing (i.e. only <X> is
encountered, never </X>). If more than one tag is to be defined as
non-pairing, then they should be separated by commas: "nonpair=x,Y,z".
(On the command line, there can be no whitespace on either side of the
equals sign or commas; in the configuration file the syntax is less
strict.) The alphabetic case of the tag names does not matter, as
seen in this example, but the case of the option does ("NONPAIR=..."
will not work -- on VMS I think this means you'll have to quote the
whole "option=value" unit).
Non-pairing tags in HTML 2.0 include <BR>, <HR>, <IMG>, and <LINK>.
loosepair=
Defines a tag or a comma-separated list of tags as optionally
pairing (a <X> can be followed by a matching </X>, but need not be).
Optionally pairing tags in HTML 2.0 include <P>, <LI>, <DD>, <DT>,
and <OPTION>.
strictpair=
Defines a tag or a comma-separated list of tags as obligatorily
pairing (a <X> must always be followed by a matching </X>). (So
"strictpair=p" would cause <P> to be checked as a paragraph
container.) Most tags in HTML are of this type.
nonrecurpair=
Defines a tag or a comma-separated list of tags as obligatorily
pairing, and in addition specifies that each tag is non-self-nesting
-- i.e. one occurrence of an <X>...</X> element can never occur inside
another occurrence of <X>...</X> (no matter how many intervening
levels of structure there are). Thus since <A> is a non-self-nesting
tag, the sequence <A>...<B>...<A>...</A>...</B>...</A> is forbidden.
Tags which are specially defined as non-self-nesting in HTML 2.0
are <A> and <FORM>; also, a number of other tags turn out to be
non-self-nesting (the headings <H1>-<H6>, <ADDRESS>, <PRE>, <DT>,
<MENU>, and <DIR>).
Declaring an obligatorily-pairing tag to be non-self-nesting is a
powerful technique for detecting missing closing tags, which
unintentionally result in an element being much bigger than it should be
(the other checks in htmlchek may only detect such errors much later
on, possibly at the end of the document, while a self-nesting error
will generally show up close to the site where the missing closing tag
should be). For this reason, and because self-nesting is almost always
a mistake, I have defined most of the HTML 2.0
obligatorily-pairing tags as non-self-nesting in htmlchek, although
this is stricter than the official standard (to restore the "official"
behavior, the configuration file html2dtd.cfg can be used, as
discussed below).
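The following Python sketch shows why a non-self-nesting check catches a missing closing tag early. It operates on a simplified list of tag names, not on real HTML, and the function name and default tag set are choices made for the example rather than htmlchek internals.

```python
def self_nesting_errors(tags, nonrecur=("A", "FORM")):
    """Return indexes where a non-self-nesting tag opens inside itself.

    `tags` is a sequence such as ["A", "B", "A", "/A", "/B", "/A"].
    """
    open_tags = []
    errors = []
    for i, tag in enumerate(tags):
        if tag.startswith("/"):
            name = tag[1:]
            if name in open_tags:
                # Drop the innermost matching element and anything opened inside it.
                idx = len(open_tags) - 1 - open_tags[::-1].index(name)
                del open_tags[idx:]
        else:
            if tag in nonrecur and tag in open_tags:
                errors.append(i)
            open_tags.append(tag)
    return errors


# The forbidden sequence from the text: <A>...<B>...<A>...</A>...</B>...</A>
print(self_nesting_errors(["A", "B", "A", "/A", "/B", "/A"]))
# A forgotten </A>: the problem is flagged at the very next <A>.
print(self_nesting_errors(["A", "A", "/A"]))
```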
If a new tag is declared with any of the preceding four options, it
becomes a "known" tag to htmlchek. The options in the following
sub-section should only be applied to tags which have been declared in
this way (or are already known to htmlchek), or the results may not be
what you expect.
Other tag behavior options:
lowlevelpair=
Defines an obligatorily pairing tag, or a comma-separated list of
such tags, as low-level markup. Low-level markup elements can
generally only include each other (and not things such as lists,
headings, paragraphs, and blockquotes).
Low-level markup tags in HTML 2.0 include <A>, <B>, <EM>, etc. (By
special dispensation, the <A> element is allowed to contain <H1>-<H6>
headings, though a warning is generated.)
lowlevelnonpair=
Defines a non-obligatorily-pairing tag, or a comma-separated list
of such tags, to be allowable within low-level markup and non-block
elements.
Non-obligatorily-pairing low-level markup tags in HTML 2.0 are
<BR> and <IMG>. (By special dispensation, a <PRE> element is allowed
to contain <HR>, and <ADDRESS> is allowed to contain <P>.)
nonblock=
Defines a pairing tag, or a comma-separated list of such tags, to
only contain low-level markup (the difference from lowlevelpair= is
that nonblock= elements cannot contain each other). Making an
optionally-pairing tag (such as <P> in the default definition) a
non-block tag will not in general work, since htmlchek will not assume
an implicit closing tag (such as </P>) before lists, headings,
blockquotes, etc. (<DT> and <LI> do work, since they're confined to
lists).
Non-block tags in HTML 2.0 include <DT>, headings <H1>-<H6>, <PRE>,
and <ADDRESS> (with the exceptions noted under the lowlevelnonpair=
option), and also <LI> within a <MENU> or <DIR> list.
deprecated=
Defines a tag or a comma-separated list of tags as deprecated and
obsolescent. If such a tag occurs in the file, there is a warning
message in the file-final tag diagnostics.
Deprecated tags in HTML 2.0 include <LISTING>, <PLAINTEXT>, and
<XMP> (note that htmlchek doesn't use the special deprecated
tag-insensitive pseudo-SGML-"CDATA" mode in parsing within these
elements).
tagopts=
Defines allowed options for tags. Uses a different syntax than the
above options to htmlchek; here comma-separated "tag,option" pairs are
themselves separated by colons. So to allow the <P> tag to have the
options ALIGN and NOWRAP, one could specify "tagopts=P,align:p,nowrap"
on the command line, or in the configuration file.
novalopts=
Defines allowed options for tags, with the same syntax as tagopts=;
the difference is that options defined with novalopts= are not
required to have a value (like the HTML 2.0 options COMPACT, ISMAP,
etc.).
reqopts=
Defines required options for tags. Uses the same syntax as
tagopts=, and causes an implicit tagopts= definition. So
"reqopts=IMG,WIDTH:img,height" means that IMG tags are required to
have WIDTH and HEIGHT options (which will be included in HTML 3.0, and
can greatly speed the display of documents in Netscape).
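The "tag,option" pair syntax shared by tagopts=, novalopts=, and reqopts= is easy to get wrong by hand, so a small parser may help when generating or checking such values. This Python sketch takes the part after the equals sign and groups options by tag. Folding both tag and option names to upper case is an assumption suggested by the mixed-case examples above, not a documented rule.

```python
def parse_tagopts(value):
    """Parse a value such as "P,align:p,nowrap" into {"P": {"ALIGN", "NOWRAP"}}."""
    allowed = {}
    for pair in value.split(":"):          # colon separates the pairs
        tag, option = pair.split(",")      # comma separates tag from option
        allowed.setdefault(tag.strip().upper(), set()).add(option.strip().upper())
    return allowed


print(parse_tagopts("P,align:p,nowrap"))
print(parse_tagopts("IMG,WIDTH:img,height"))
```

A pair without exactly one comma raises ValueError here, which is a convenient way to catch a colon typed where a comma was meant.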
dlstrict=1 or dlstrict=2 or dlstrict=3
This option controls how <DT> and <DD> tags are distributed within
a <DL>...</DL> list. With dlstrict=3, every <DD> must be immediately
preceded by a <DT> (as in previous drafts of the HTML 2.0 standard;
the SGML "content model" is "(DT,DD?)+"). With dlstrict=2, <DD> can
be indirectly preceded by <DT> (SGML content model "(DT,DD*)+" or
"DT,(DT|DD)*"). With dlstrict=1 (the default behaviour of htmlchek)
<DD> and <DT> can be freely intermixed in the list (SGML content model
"(DT|DD)+").
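The three content models quoted for dlstrict= translate directly into regular expressions if each <DT> is written as "T" and each <DD> as "D". The Python sketch below is only a model of the rule, under the assumption that a list is reduced to its sequence of DT/DD items; it is not how htmlchek itself is implemented.

```python
import re

# One pattern per dlstrict level, mirroring the SGML content models in the text.
MODELS = {
    3: re.compile(r"(TD?)+"),   # (DT,DD?)+  every DD directly follows a DT
    2: re.compile(r"(TD*)+"),   # (DT,DD*)+  list starts with DT, DDs may repeat
    1: re.compile(r"[TD]+"),    # (DT|DD)+   any mixture
}
CODES = {"DT": "T", "DD": "D"}


def dl_ok(items, dlstrict=1):
    """Check a sequence like ["DT", "DD", "DD"] against the chosen model."""
    encoded = "".join(CODES[item] for item in items)
    return MODELS[dlstrict].fullmatch(encoded) is not None


for level in (1, 2, 3):
    print(level, dl_ok(["DT", "DD", "DD"], dlstrict=level))
```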
Beware that some of the above definitions have the effect of undefining
things that are incompatible with what you are defining (to avoid
logical inconsistencies). For example, if you define "lowlevelpair=p",
then the tag <P> will be undefined as a loosely-pairing tag (since this
is incompatible with ``lowlevelpair'' status). This means it will be
treated as an unknown tag, unless you add an explicit "strictpair=p" or
"nonrecurpair=p" declaration.
General parsing configuration options:
metachar=1 or metachar=2 or metachar=3
This option controls how htmlchek responds to `<' and `>'
characters in tags. If metachar=3 is specified, then these characters
are allowed within comments and quoted option values (following the
SGML syntax), so that <IMG SRC="leftarrow.gif" ALT="->"> or <!-- <HR>
--> etc. would not cause error messages. The default value,
metachar=2, does not allow `<' and `>' in tags or comments (so that
`>' inside a quoted option value will be interpreted as prematurely
ending the tag); this more accurately reflects the behaviour of some
HTML browsers. Finally, metachar=1 restricts comments further by
requiring them to be on a single line (another limitation of some
browsers); the warning "Complex comment" is then generated for
multi-line <!-- --> constructs.
nogtwarn=1
If this option is specified, no warnings are generated for loose
`>' characters outside of tags. Such loose `>' characters are bad
style (it is better to use "&gt;"), and warning about them can be a
useful error-detecting technique, but they are not actually incorrect
SGML.
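The difference between metachar=2 and metachar=3 for quoted option values comes down to where a scanner decides the tag ends. The Python sketch below illustrates that one point using the <IMG> example from the metachar= description. It deliberately ignores comments and the single-line restriction of metachar=1, and the function is an illustration, not htmlchek's parser.

```python
def tag_end(text, start, metachar=2):
    """Return the index of the '>' that closes the tag opening at text[start].

    With metachar=3, '>' inside a quoted option value is skipped (SGML style);
    with metachar=2 or 1, the first '>' always ends the tag. Returns -1 if no
    closing '>' is found.
    """
    quote = None
    for i in range(start + 1, len(text)):
        ch = text[i]
        if metachar == 3 and ch in "\"'":
            if quote is None:
                quote = ch            # entering a quoted value
            elif quote == ch:
                quote = None          # leaving it
        elif ch == ">" and quote is None:
            return i
    return -1


tag = '<IMG SRC="leftarrow.gif" ALT="->">'
print(tag[:tag_end(tag, 0, metachar=3) + 1])   # whole tag
print(tag[:tag_end(tag, 0, metachar=2) + 1])   # cut short inside ALT="->"
```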
Configuration File:
Since it is cumbersome to specify long strings on the command line,
there is an alternative configuration file mechanism. Specifying
configfile=``filename'' on the command line will cause htmlchek to read
in options from the file (cf= is recognized as an abbreviated synonym for
configfile=). The same "option=value" units that are recognized on the
command line should be specified one per line in the configuration file
(note that all lines in the configuration file which do not contain the
`=' character are treated as comment lines and silently ignored).
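The configuration file rules just described (one "option=value" unit per line, lines without `=' ignored, whitespace tolerated around the separators) can be modelled with a few lines of Python. The sample file content is invented, and removing all whitespace from the value is a simplifying assumption of this sketch rather than documented behaviour.

```python
def parse_config(text):
    """Return (option, value) pairs in file order; lines without '=' are comments."""
    settings = []
    for line in text.splitlines():
        if "=" not in line:
            continue                      # comment line, silently ignored
        option, value = line.split("=", 1)
        # Spaces around '=', ',' and ':' are allowed in the file, so drop them.
        settings.append((option.strip(), "".join(value.split())))
    return settings


# Hypothetical configuration file content.
sample_cfg = """\
This line has no equals sign and is therefore a comment
nonpair = br, hr
tagopts = P, align : p, nowrap
"""
for option, value in parse_config(sample_cfg):
    print(f"{option}={value}")
```

Keeping the pairs in a list rather than a dictionary preserves their order and any repeated options, which matters because repeated definitions in the file are cumulative and later ones can override earlier ones. Note also that a would-be comment containing an `=' character is treated as an option line.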
Two sample configuration files are included in the htmlchek
distribution, example.cfg and html2dtd.cfg. If html2dtd.cfg is invoked
(by using configfile=html2dtd.cfg on the command line), then htmlchek
conforms more strictly to the official HTML 2.0 DTD (following the SGML
treatment of the `<' and `>' characters, and allowing low-level mark-up
tags to self-nest).
There are some differences between specifying options on the command
line and in the configuration file. On the command line, if there are
multiple instances of the same "xxx=" option, all but the last will be
silently ignored, but in the configuration file such multiple
definitions will have cumulative effect. Also the relative order of
evaluation on the command line is undefined (if you have both
"strictpair=p" and "nonrecurpair=p" definitions on the command line,
you don't know which will override the other), while the order of
statements in a configuration file is significant, since later
definitions will override previous ones. Also, there can be no spaces
or tabs around the `=', `,' or `:' characters on the command line, but
this requirement is relaxed in the configuration file.
You can include definitions both on the command line and in the
configuration file, in which case command line definitions will override
those in the configfile= (specify an "arena=off" on the command line to
override an "arena=1" in the configuration file, and similarly with
html3=, htmlplus=, and netscape=). The internal definitions invoked by
"arena=1" etc. and "netscape=1" will override definitions specified in
the configuration file, but not those on the command line.
Note that the options discussed in the ``Command-line Options'' section
above (append=, dirprefix=, refsfile=, sugar=, and usebase=) cannot be
specified in the configuration file (nor, obviously, can configfile= or
cf= itself be specified there). This is because the configfile= is a
language definition file, not a user preference file. (If I ever
implement a user preference file in a future version of htmlchek, it
will be separate from the configfile=.) Since nowswarn= is actually a
language configuration option, it can be specified in the configuration
file.
Summary of htmlchek command-line options:
append=1                          Cross-reference Checking Output option
arena=1 or arena=off              Language Customization option
configfile=filename
  or cf=filename                  Language Customization option
deprecated=tag1,tag2...           Language Customization option
dirprefix=urlprefix               Cross-reference Checking URL Prefix option
dlstrict=1 or dlstrict=2
  or dlstrict=3                   Language Customization option
html3=1 or html3=off              Language Customization option
htmlplus=1 or htmlplus=off        Language Customization option
inline=1                          Output option
listfile=filename
  or lf=filename                  Listfile option
loosepair=tag1,tag2...            Language Customization option
lowlevelnonpair=tag1,tag2...      Language Customization option
lowlevelpair=tag1,tag2...         Language Customization option
map=1                             Cross-reference Checking Output option
metachar=1 or metachar=2
  or metachar=3                   Language Customization option
netscape=1 or netscape=off        Language Customization option
nogtwarn=1                        Language Customization option
nonblock=tag1,tag2...             Language Customization option
nonpair=tag1,tag2...              Language Customization option
nonrecurpair=tag1,tag2...         Language Customization option
novalopts=tag1,opt1:tag2,opt2...  Language Customization option
nowswarn=1                        Output option
refsfile=filenameprefix           Cross-reference Checking Output option
reqopts=tag1,opt1:tag2,opt2...    Language Customization option
strictpair=tag1,tag2...           Language Customization option
subtract=pathprefix               Cross-reference Checking URL Prefix option
sugar=1                           Output option
tagopts=tag1,opt1:tag2,opt2...    Language Customization option
usebase=1                         Cross-reference Checking URL Prefix option
xref=1                            Cross-reference Checking Output option
Supplemental HTML-file processing programs: dehtml, entify, and metachar
dehtml
dehtml removes all HTML markup from a file so you can spell-check
the darn thing. The commoner ampersand entities are translated to the
appropriate single characters, so you can spell check if you're
writing in a non-English language, and your spelling checker
understands 8-bit Latin-1 alphabetic characters. Note that dehtml
makes no pretensions to being an intelligent HTML-to-text translator;
it completely ignores everything within <...>, and passes everything
outside <...> through completely unaltered (except known ampersand
entities).
Typical command lines:
awk -f dehtml.awk infile.html > outfile.txt
perl dehtml.pl infile.html > outfile.txt
The shell script file dehtml.sh runs dehtml.awk using the best
available interpreter (under Unix):
sh dehtml.sh infile.html > outfile.txt
This program processes all files on the command line to STDOUT; to
process a number of files individually, use the iteration mechanism of
your shell; for example:
for a in *.html ; do awk -f dehtml.awk "$a" > "otherdir/$a" ; done
in Unix sh, or:
for %a in (*.htm) do call dehtml %a otherdir\%a
in MS-DOS, where dehtml.bat is the following one-line batch file:
gawk -f dehtml.awk %1 > %2
While dehtml isn't primarily an error-checking program, if it does
happen to find errors connected with its functioning (or encounters
HTML code beyond its capacity to handle), it writes the error messages
on lines beginning "&&^", which are intermixed with the non-error
output.
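The core of what dehtml does (ignore everything within <...>, and
translate known ampersand entities) can be approximated as follows.
This is a rough Python sketch for illustration only, not the dehtml.awk
or dehtml.pl source; it assumes that "known" entities are the named
entities falling in the Latin-1 range, and it does not emit the "&&^"
diagnostics. The function name is reused here purely for convenience.

```python
import re
from html.entities import name2codepoint


def dehtml(text):
    """Strip <...> markup and translate named Latin-1 entities (sketch)."""
    # Ignore everything between `<' and the next `>'.
    stripped = re.sub(r"<[^>]*>", "", text)

    def translate(match):
        code = name2codepoint.get(match.group(1))
        # Assumption: only entities in the Latin-1 range count as known;
        # anything else is passed through unaltered.
        if code is None or code > 0xFF:
            return match.group(0)
        return chr(code)

    return re.sub(r"&([A-Za-z][A-Za-z0-9]*);", translate, stripped)


print(dehtml("<H1>Caf&eacute; &amp; bar</H1>"))
```

The markup is removed before the entities are translated, so that a
`<' produced from an entity is never mistaken for the start of a tag.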
entify
The relatively tiny entify program translates Latin-1 high
alphabetic characters in a file to HTML ampersand entities for safety
when moving the file through non-8-bit-safe transport mechanisms
(principally non-MIME RFC-822 e-mail and Usenet). This is for the
greater convenience of those writing European languages with editors
which use Latin-1 characters; entify can be run just before
distributing an HTML file externally.
Typical command lines:
awk -f entify.awk infile.8bit > outfile.html
perl entify.pl infile.8bit > outfile.html
(Note that entify doesn't help in checking whether an HTML file is OK,
but is rather used as a precautionary measure to prevent the file from
being mangled by archaic 7-bit software.)
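The translation entify performs can be pictured with the Python sketch
below. It is not the entify.awk or entify.pl source; in particular, the
exact set of characters handled is an assumption here (the Latin-1
letters 0xC0 through 0xFF, leaving out the multiplication and division
signs, which are not alphabetic).

```python
from html.entities import codepoint2name


def entify(text):
    """Replace high Latin-1 letters with named entities (sketch)."""
    out = []
    for ch in text:
        code = ord(ch)
        # Assumed range: 0xC0-0xFF minus the two non-letters 0xD7 and 0xF7.
        if 0xC0 <= code <= 0xFF and code not in (0xD7, 0xF7):
            out.append("&" + codepoint2name[code] + ";")
        else:
            out.append(ch)  # 7-bit text and markup pass through untouched
    return "".join(out)


print(entify("caf\u00e9"))
```

Because `&', `<' and `>' are left alone, existing markup in the file
survives the conversion.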
metachar
This relatively trivial script protects the HTML/SGML metacharacters
`&', `<' and `>' by replacing them with the appropriate ampersand entity
references; it is useful for importing plain text into an HTML file.
Typical command lines:
awk -f metachar.awk infile.text > outfile.htmltext
perl metachar.pl infile.text > outfile.htmltext
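As an illustration of the replacement metachar makes, here is a minimal
Python sketch (not the metachar.awk or metachar.pl source; the function
name protect_metachars is hypothetical):

```python
def protect_metachars(text):
    """Replace `&', `<' and `>' with entity references (sketch)."""
    # `&' must be handled first; otherwise the ampersands introduced by
    # the other two replacements would themselves be escaped again.
    text = text.replace("&", "&amp;")
    text = text.replace("<", "&lt;")
    return text.replace(">", "&gt;")


print(protect_metachars("if a < b && b > c"))
```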
Supplemental link extraction programs: makemenu and xtraclnk.pl
makemenu
This program creates a simple menu for HTML files specified on the
command line; the text in each input file's <TITLE>...</TITLE> element
is placed in a link to that file in the output menu file. If the toc=1
command-line option is specified, makemenu also includes a simple table
of contents for each input file in the menu, based on the file's
-