- htmlsrpl version 1.11, January 22 1995
-
- Name:
-
- htmlsrpl.pl - HTML-aware search-and-replace program, with
- either literal strings or regular expressions. Acts either
- only outside HTML/SGML tags, or only within tags; can be
- restricted to operate only within and/or only outside
- specified elements; can also upper-case tag names. Runs
- under perl.
-
-
- Typical use:
-
- perl htmlsrpl.pl [options] infile.html > outfile.html
-
- Command-line options have the form "option=value" (with no whitespace on
- either side of the `=' character), and all options should precede filename
- arguments on the command line.
-
-
- Basic command-line options:
-
- old="..." String or expression to be replaced. Must be defined and
- non-null (unless the upcase=1 option is specified).
-
- new="..." The new replacement string or expression. If ``new='' is
- absent or null, the old="..." string is deleted.
-
- intags=1 If this option is specified on the command line, strings
- within tags are changed, but not text outside of tags. (The
- default action, if this option is absent, is to only replace
- text outside of tags.)
-
-
- Element inclusion/exclusion command-line options:
-
- inside=... The value of this option is a tagname or a comma-separated
- list of tagnames (e.g. inside=A or inside=b,i). Search and
- replace operations will only take place in material that is
- contained within all the specified elements. So if inside=b,i
- has been specified on the command line, only "Text3" in the
- following input file would be subject to search and replace:
- "Text1<B>Text2<I>Text3</I></B>". The order of inclusion makes
- no difference (so that <B> nested inside <I> would be treated
- exactly the same as <I> nested inside <B>).
-
-
- outside=... Search and replace will only take place outside the tag or
- (comma-separated) list of tags specified with this option. So
- if outside=b,i is specified, nothing contained within a
- <B>...</B> or <I>...</I> element will be subject to search and
- replace.
-
- inmost=... The same as inside=, except that search and replace only
- occurs _immediately_ within the element specified (i.e.
- inmost=b would mean that only "Text2" would be subject to
- search and replace in "Text1<B>Text2<I>Text3</I></B>").
-
- If more than one of these options is specified, search-and-replace only
- takes place when all the conditions specified in the options are satisfied.
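-
- As an illustration only, the way the three options combine can be sketched
- in Python (this is not the program's Perl source). The function name
- eligible() and its parameters are hypothetical, and the sketch assumes that
- inmost= names a single element.
-

```python
def eligible(open_elements, inside=(), outside=(), inmost=None):
    """Decide whether text nested in open_elements may be searched.

    open_elements lists the containing elements, outermost first.
    All names here are hypothetical; this is not htmlsrpl.pl's code.
    """
    names = [name.lower() for name in open_elements]
    # inside=: every listed element must contain the text (order is irrelevant).
    if any(tag.lower() not in names for tag in inside):
        return False
    # outside=: none of the listed elements may contain the text.
    if any(tag.lower() in names for tag in outside):
        return False
    # inmost=: the named element must be the immediate container.
    if inmost is not None and (not names or names[-1] != inmost.lower()):
        return False
    return True


# "Text1<B>Text2<I>Text3</I></B>": Text3 sits inside B and I, Text2 inside B only.
print(eligible(["B", "I"], inside=("b", "i")))   # Text3 with inside=b,i
print(eligible(["B"], inmost="b"))               # Text2 with inmost=b
```
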
-
- This program uses a rather simple-minded algorithm for determining what
- is contained within an element. There is a small list of known non-pairing
- tags (such as <IMG>, <BR>, etc.). When any opening tag not on this list is
- encountered, it is pushed onto a stack of presently-containing elements.
- When any closing tag is encountered, the most-recently occurring matching
- tagname is removed from the stack, along with everything above it in the
- stack (if no matching opening tag has been encountered, htmlsrpl.pl exits
- with an error -- use the htmlchek program in this package to help find the
- HTML error). This means, for example, that a <P> element unclosed by a </P>
- will often be considered to extend much farther than it should according
- to the HTML DTD; also, in a list such as "<DL><DT>Text1<DD>Text2</DL>",
- "Text2" is actually considered to be contained within a <DT> element.
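-
- The stack behavior described above can be sketched in Python as follows.
- This is an illustrative reconstruction, not the program's Perl code: the
- names NON_PAIRING, TAG, and containers() are hypothetical, and the
- non-pairing list holds only the two tags the text mentions.
-

```python
import re

# Assumed subset of the "known non-pairing tags"; the text names only IMG and BR.
NON_PAIRING = {"img", "br"}

# An optional slash, then the tag name, then anything up to the closing '>'.
TAG = re.compile(r"<(/?)([^\s<>/]+)[^<>]*>")


def containers(html):
    """Yield (text, stack) for each run of text found outside tags."""
    stack = []
    pos = 0
    for m in TAG.finditer(html):
        if m.start() > pos:
            yield html[pos:m.start()], list(stack)
        closing, name = m.group(1), m.group(2).lower()
        if closing:
            if name not in stack:
                raise ValueError("closing tag </%s> has no matching opening tag" % name)
            # Remove the most recent matching element and everything above it.
            idx = len(stack) - 1 - stack[::-1].index(name)
            del stack[idx:]
        elif name not in NON_PAIRING:
            stack.append(name)
        pos = m.end()
    if pos < len(html):
        yield html[pos:], list(stack)


for text, stack in containers("<DL><DT>Text1<DD>Text2</DL>"):
    print(text, stack)
```

- In this sketch "Text2" is reported with DT still on the stack, which is the
- over-extension the paragraph above warns about.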
-
- Note that when the inside=, inmost=, or outside= options are used
- together with the intags=1 option, a tag is never considered to be
- contained within the element which it itself delimits (i.e. the inclusion
- and exclusion relationships established by a tag come into force at the end
- of the tag if it is an opening tag, and at the beginning of the tag if it
- is a closing tag). Also, inclusions and exclusions are always calculated
- from the unprocessed input, before any search and replace has taken place.
-
-
- Regexp command-line options:
-
- regexp=1 If this option is specified, old="..." is used as a Perl
- regular expression, rather than as a simple literal string
- (the default is that both old="..." and new="..." are handled
- as simple literal strings). See the Perl documentation for
- information on regular expressions. Special characters that
- are shell metacharacters will have to be quoted on the
- command line, to protect them from interpretation by the
- shell. The `/' character should be escaped by a preceding
- backslash, or should be written as "\057", since this
- character is used as the delimiter in the Perl s/.../.../
- construct.
-
- regeval=1 If this option is specified, old="..." is used as a
- regular expression, and new="..." is a statement to be
- evaluated, as in the Perl s/.../statement/e construct.
- Special variables such as $`, $&, $', $1 etc. can be used as
- part of such a statement (remember that the "." operator is
- used to concatenate string values). If you use an erroneous
- expression, you will get a Perl error message (not an htmlsrpl
- error message), which you will have to interpret using the Perl
- manual.
-
- case=1 If this option is specified along with the regexp=1,
- regeval=1, or delete=1 options, then they operate without
- caring about alphabetic case.
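-
- The difference between the three modes can be sketched in Python. This is
- a rough analogue under stated assumptions, not the program's behavior:
- substitute() is a hypothetical name, Python's regular-expression syntax
- differs from Perl's in places, new= is treated as plain text in regexp
- mode, and a Python callable stands in for the evaluated statement of
- regeval=1.
-

```python
import re


def substitute(chunk, old, new="", regexp=False, regeval=False, case=False):
    """Replace old with new in one chunk (hypothetical helper)."""
    flags = re.IGNORECASE if case else 0
    if regeval:
        # Analogue of s/.../statement/e: new is called with the match object.
        return re.sub(old, new, chunk, flags=flags)
    if regexp:
        # old is a pattern; new is inserted as plain text (an assumption).
        return re.sub(old, lambda m: new, chunk, flags=flags)
    # Default: both strings are literal, and case is always significant.
    return chunk.replace(old, new)


print(substitute("abc ABC", "abc", "x"))
print(substitute("abc ABC", "a.c", "x", regexp=True, case=True))
print(substitute("a1 b22", r"\d+", lambda m: str(len(m.group(0))), regeval=True))
```
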
-
-
- Command-line options that affect what is matched against:
-
-
- lines=1 If this option is specified, the chunks of the input file
- that will be individually searched and replaced are those
- that result when tag beginnings (`<') and tag endings (`>')
- are boundaries; these chunks can contain embedded newlines.
- (Remember that in Perl the regexp /./ does not match newline
- ("\n"); you can use [^\000] instead.)
- If the lines=1 option is not specified, then the default
- behavior is that linebreaks are also boundaries; the chunks
- then do not contain newlines. The `<' and `>' characters
- themselves are never part of the chunks matched against (they
- can only be altered by use of the delete=1 option), except
- for `>' characters outside of tags, which are treated as
- ordinary text.
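-
- A Python sketch of the two chunking modes may make the boundaries clearer.
- It is illustrative only: chunks() and the "text"/"tag"/"delim" labels are
- hypothetical, and raising an error for a `<' inside a tag is an assumption
- about what counts as an inappropriate place.
-

```python
def chunks(text, lines=False):
    """Split text into (kind, piece) pairs; joining the pieces restores text."""
    out = []
    buf = []
    in_tag = False

    def flush():
        # Emit the pending chunk, labelled by where it was collected.
        if buf:
            out.append(("tag" if in_tag else "text", "".join(buf)))
            buf.clear()

    for ch in text:
        if ch == "<":
            if in_tag:
                raise ValueError("'<' found inside a tag")
            flush()
            out.append(("delim", ch))
            in_tag = True
        elif ch == ">" and in_tag:
            flush()
            out.append(("delim", ch))
            in_tag = False
        elif ch == "\n" and not lines:
            # Default mode: linebreaks are chunk boundaries too.
            flush()
            out.append(("delim", ch))
        else:
            # Ordinary characters, including a loose '>' outside tags.
            buf.append(ch)
    flush()
    return out


sample = "a>b\n<I\nx>c"
print(chunks(sample))
print(chunks(sample, lines=True))
```
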
-
- slash=1 If this option is specified, then the `/' slash character
- immediately following the `<' character of a closing tag is
- not matched against, and is not affected by any search-and-
- replace operation (except, of course, tag deletion with
- delete=1). Implies intags=1.
-
- delete=1 If this option is specified, old="..." is treated as a
- regexp and is matched against tagnames (not against the entire
- contents of tags); where tagnames match, the entire tag,
- including the surrounding `<' and `>' characters, is deleted.
- This option implies intags=1 and slash=1, and is incompatible
- with regexp=1, regeval=1, or a non-null value of new=.
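-
- Tag deletion can be approximated in Python as below. This is a sketch, not
- the program's code: delete_tags() is a hypothetical name, the pattern is
- searched for anywhere in the tagname (an assumption about anchoring), and
- Python regular expressions stand in for Perl's.
-

```python
import re


def delete_tags(html, old, case=False):
    """Remove every tag whose name matches the pattern old (hypothetical helper)."""
    name_re = re.compile(old, re.IGNORECASE if case else 0)

    def drop(m):
        # group(1) is the tagname; the slash of a closing tag is skipped,
        # mirroring the implied slash=1.
        return "" if name_re.search(m.group(1)) else m.group(0)

    return re.sub(r"</?([^\s<>]*)[^<>]*>", drop, html)


print(delete_tags("a<BLINK>b</blink>c<B>d</B>", "blink", case=True))
```

- Because the match is unanchored in this sketch, a pattern such as "b" would
- also remove BLINK and BR tags; anchoring with ^ and $ avoids that.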
-
-
- Uppercasing option:
-
- upcase=1 If this option is present, then tag names (the sequence of
- non-whitespace immediately following a `<' character) are
- upper-cased. Does not upper-case tag options (attributes).
- If old= is null or absent, then this is the only thing that
- htmlsrpl.pl does, and any other command-line options are
- ignored. Otherwise, uppercasing is done first, before any
- specified search-and-replace operation (and the intags=1
- option is assumed). Note that qualifiers like `inmost=' will
- govern the scope of any search-and-replace operation that
- accompanies uppercasing, but uppercasing itself always
- affects all tags.
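-
- The uppercasing rule (the run of non-whitespace after `<') is simple enough
- to sketch in one Python expression. The function name upcase_tags() is
- hypothetical and this is not the program's Perl code.
-

```python
import re


def upcase_tags(html):
    """Upper-case the non-whitespace run that follows each '<' (hypothetical helper)."""
    return re.sub(r"<[^\s<>]+", lambda m: m.group(0).upper(), html)


print(upcase_tags('<a href="x.html">text</a><br>'))
```

- Attributes are left alone because the match stops at the first whitespace
- character inside the tag.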
-
-
- Final status message:
-
- At the end of processing, if no errors occurred, htmlsrpl.pl writes a
- message to STDERR (either "Changed!" or "Unchanged"), reporting whether
- or not any substitutions were actually performed.
-
-
- Summary:
-
- You can do some cute things by playing around with these options. For
- example, ``perl htmlsrpl.pl regexp=1 old=".*"'' deletes all text (except
- newlines) outside tags, while adding ``intags=1'' to this command line means
- that all text inside tags is deleted instead (leaving ghostly ``<>'' markers
- behind). The command line ``perl htmlsrpl.pl delete=1 case=1 old="blink"''
- nukes any <BLINK> tags (yay!), while ``perl htmlsrpl.pl slash=1 case=1
- lines=1 regexp=1 old="^blink[^\000]*" new="I"'' replaces every BLINK tag,
- together with any attributes (possibly spanning multiple lines), with the
- corresponding opening <I> or closing </I> tag. A command like ``perl
- htmlsrpl.pl outside=cite,h1,h2,h3,h4,h5,h6,title old="Pride and Prejudice"
- new="<cite>Pride and Prejudice</cite>"'' can be used to add mark-up in the
- appropriate places.
-
-
- Limitations:
-
- A limitation of this program is that it always treats `<' and `>' in the
- input file as tag-beginning and tag-ending characters (even in comments),
- and terminates prematurely if `<' and `>' are found in inappropriate places
- (except that loose `>' characters outside tags are harmless). In this case
- a "die" message will be output to STDERR, and the last line of the output
- will be "ERROR!".
-
- If you misspell an option name, then you'll either get an error when Perl
- tries to open a file with that name, or you'll get an indiscriminate
- "No `old=' string was specified" error message.
-
- The program processes all files on the command line to STDOUT; to process a
- number of files individually, use the iteration mechanism of your shell; for
- example:
-
- for a in *.html ; do perl htmlsrpl.pl old=ABC new=XYZ "$a" > "otherdir/$a" ; done
-
- in Unix sh, or:
-
- for %a in (*.htm) do call htmlsrpl %a otherdir\%a
-
- in MS-DOS, where htmlsrpl.bat is the following one-line batch file:
-
- perl htmlsrpl.pl old=ABC new=XYZ %1 > %2
-
-
- Author:
-
- Copyright H. Churchyard 1994, 1995 -- freely redistributable. This code is
- functional but not very well commented or aesthetic -- sorry! If you find
- an error in this program, e-mail me at churchh@uts.cc.utexas.edu.
-