htmllib
The HTMLParser class is designed to be used as a base class for other classes in order to add functionality, and allows most of its methods to be extended or overridden. In turn, this class is derived from and extends the SGMLParser class defined in module sgmllib. Two implementations of formatter objects are provided in the formatter module; refer to the documentation for that module for information on the formatter interface.
The following is a summary of the interface defined by sgmllib.SGMLParser:
Data is supplied to a parser instance through the feed() method, which takes a string argument. It can be called with as little or as much text at a time as desired; p.feed(a); p.feed(b) has the same effect as p.feed(a+b). When the data contains complete HTML tags, these are processed immediately; incomplete elements are saved in a buffer. To force processing of all unprocessed data, call the close() method.
For example, to parse the entire contents of a file, use:
# parser is an existing HTMLParser instance
with open('myfile.html') as f:
    parser.feed(f.read())
parser.close()
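The equivalence between one large feed() call and several small ones can be checked directly. htmllib and sgmllib exist only in Python 2, so the sketch below assumes Python 3 and uses html.parser.HTMLParser from the standard library as a stand-in. It is a different class, but its feed() and close() methods follow the same buffering contract. The TagRecorder class, the parse_in_pieces() helper, and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser  # Python 3 stand-in for htmllib.HTMLParser


class TagRecorder(HTMLParser):
    """Hypothetical helper: records tag events and accumulates text."""

    def __init__(self):
        super().__init__()
        self.events = []  # ("start" | "end", tag name) in document order
        self.text = ""    # all character data, concatenated

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.text += data


def parse_in_pieces(*pieces):
    """Feed each piece separately, then flush the buffer with close()."""
    parser = TagRecorder()
    for piece in pieces:
        parser.feed(piece)
    parser.close()
    return parser.events, parser.text


# The split falls in the middle of the opening tag, so the first feed()
# call cannot process it and has to keep it in the buffer.
a = '<h1 cla'
b = 'ss="x">Title</h1>'
whole = parse_in_pieces(a + b)
split = parse_in_pieces(a, b)
```

Both calls should yield the same tag events and the same text, because the incomplete tag from the first piece is only processed once the second piece completes it.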
To define the semantics of HTML tags, derive a class and define methods named start_tag(), end_tag(), or do_tag(), where tag stands for the name of the tag being handled. The parser will call these at appropriate moments: start_tag() or do_tag() is called when an opening tag of the form <tag ...> is encountered; end_tag() is called when a closing tag of the form </tag> is encountered. If an opening tag requires a corresponding closing tag, like <H1> ... </H1>, the class should define the start_tag() method; if a tag requires no closing tag, like <P>, the class should define the do_tag() method.
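The naming convention amounts to looking up a handler method by name at run time. The following is a simplified, self-contained model of that lookup, not sgmllib's actual code: TagDispatcher, open_tag(), close_tag(), and HeadingHandler are hypothetical names, and the sketch assumes that tag names are lowercased before lookup and that start_ and do_ handlers receive the tag's attributes as a single argument.

```python
class TagDispatcher:
    """Toy model of the start_/do_/end_ naming convention."""

    def __init__(self):
        self.log = []

    def open_tag(self, tag, attrs=()):
        # Prefer start_<tag> (element with a closing tag), then do_<tag>.
        for prefix in ("start_", "do_"):
            method = getattr(self, prefix + tag.lower(), None)
            if method is not None:
                method(attrs)
                return
        self.log.append("unhandled <%s>" % tag)

    def close_tag(self, tag):
        method = getattr(self, "end_" + tag.lower(), None)
        if method is not None:
            method()
        else:
            self.log.append("unhandled </%s>" % tag)


class HeadingHandler(TagDispatcher):
    # <H1> ... </H1> needs a closing tag, so it gets a start_/end_ pair.
    def start_h1(self, attrs):
        self.log.append("h1 opened")

    def end_h1(self):
        self.log.append("h1 closed")

    # <P> needs no closing tag, so a single do_ method is enough.
    def do_p(self, attrs):
        self.log.append("paragraph break")


handler = HeadingHandler()
handler.open_tag("H1")
handler.close_tag("H1")
handler.open_tag("P")
handler.open_tag("EM")  # no handler defined for this tag
```

After these calls, handler.log holds one entry per tag, with the EM tag reported as unhandled because HeadingHandler defines no method for it.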
In addition to tag methods, the HTMLParser class provides further methods and instance variables for use within tag methods.
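nofill
Boolean flag that should be true when whitespace should not be collapsed, and false when it should be. In general, it should be true only when character data is to be treated as preformatted text, as within a <PRE> element. The default value is false. This affects the operation of handle_data() and save_end().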
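anchor_bgn(href, name, type)
This method is called at the start of an anchor region. The arguments correspond to the attributes of the <A> tag with the same names. The default implementation maintains a list of hyperlinks (defined by the href argument) within the document. The list of hyperlinks is available as the data attribute anchorlist.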
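anchor_end()
This method is called at the end of an anchor region. The default implementation adds a textual footnote marker using an index into the list of hyperlinks created by anchor_bgn().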
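The anchor bookkeeping can be modelled in a few lines. This is a self-contained sketch of the behaviour described for the default implementation, not htmllib's source: AnchorTracker and its output list are hypothetical, and the "[n]" marker format and the decision to skip anchors without an href are assumptions of the sketch.

```python
class AnchorTracker:
    """Toy model of the anchor bookkeeping; not htmllib's own code."""

    def __init__(self):
        self.anchorlist = []  # every href seen so far, in document order
        self.anchor = None    # href of the anchor currently open, if any
        self.output = []      # stands in for data passed to handle_data()

    def handle_data(self, data):
        self.output.append(data)

    def anchor_bgn(self, href, name, type):
        self.anchor = href
        if href:
            self.anchorlist.append(href)

    def anchor_end(self):
        if self.anchor:
            # Footnote marker: 1-based position of the link in anchorlist.
            self.handle_data("[%d]" % len(self.anchorlist))
            self.anchor = None


tracker = AnchorTracker()
tracker.anchor_bgn("http://example.com/a", "", "")
tracker.handle_data("first link")
tracker.anchor_end()
tracker.anchor_bgn("", "top", "")  # named anchor without an href
tracker.handle_data("target")
tracker.anchor_end()
tracker.anchor_bgn("http://example.com/b", "", "")
tracker.handle_data("second link")
tracker.anchor_end()
```

Each hyperlink is followed by a marker pointing at its entry in anchorlist, while the named anchor without an href adds neither a list entry nor a marker.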
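handle_image(source, alt[, ismap[, align[, width[, height]]]])
This method is called to handle images. The default implementation simply passes the alt value to the handle_data() method.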
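save_bgn()
Begins saving character data in a buffer instead of sending it to the formatter object. Retrieve the stored data via save_end(). Use of the save_bgn() / save_end() pair may not be nested.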
save_end()
Ends buffering character data and returns all data saved since the preceding call to save_bgn(). If the nofill flag is false, whitespace is collapsed to single spaces. A call to this method without a preceding call to save_bgn() will raise a TypeError exception.
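The interaction between save_bgn(), save_end(), handle_data(), and nofill can be illustrated with a small stand-alone model. It is a sketch of the documented behaviour under stated assumptions, not the module's implementation: SaveBuffer, its savedata attribute, and the sent list (standing in for the formatter) are hypothetical, and the TypeError is raised explicitly here to match the description.

```python
class SaveBuffer:
    """Toy model of save_bgn()/save_end(); not htmllib's own code."""

    def __init__(self):
        self.nofill = False   # false: collapse whitespace in saved data
        self.savedata = None  # None means "not currently saving"
        self.sent = []        # stands in for data sent to the formatter

    def handle_data(self, data):
        if self.savedata is not None:
            self.savedata += data   # buffering: keep the data back
        else:
            self.sent.append(data)  # normal path: pass it on

    def save_bgn(self):
        self.savedata = ""

    def save_end(self):
        if self.savedata is None:
            raise TypeError("save_end() called without save_bgn()")
        data, self.savedata = self.savedata, None
        if not self.nofill:
            # Collapse every run of whitespace to a single space.
            data = " ".join(data.split())
        return data


buf = SaveBuffer()
buf.handle_data("before ")
buf.save_bgn()
buf.handle_data("  Hello,\n")
buf.handle_data("   world  ")
title = buf.save_end()

# With nofill set, the saved data comes back exactly as it was received.
buf.nofill = True
buf.save_bgn()
buf.handle_data("a  b\n")
raw = buf.save_end()
```

In this model, data received between the two calls never reaches the formatter stand-in, and a second save_end() without a new save_bgn() raises TypeError.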