-
- <HTML>
- <HEAD>
- <TITLE>HTML::Parser - SGML parser class</TITLE>
- <LINK REL="stylesheet" HREF="../../../Active.css" TYPE="text/css">
- <LINK REV="made" HREF="mailto:">
- </HEAD>
-
- <BODY>
- <TABLE BORDER=0 CELLPADDING=0 CELLSPACING=0 WIDTH=100%>
- <TR><TD CLASS=block VALIGN=MIDDLE WIDTH=100% BGCOLOR="#cccccc">
- <STRONG><P CLASS=block> HTML::Parser - SGML parser class</P></STRONG>
- </TD></TR>
- </TABLE>
-
- <A NAME="__index__"></A>
- <!-- INDEX BEGIN -->
-
- <UL>
-
- <LI><A HREF="#name">NAME</A></LI><LI><A HREF="#supportedplatforms">SUPPORTED PLATFORMS</A></LI>
-
- <LI><A HREF="#synopsis">SYNOPSIS</A></LI>
- <LI><A HREF="#description">DESCRIPTION</A></LI>
- <LI><A HREF="#efficiency">EFFICIENCY</A></LI>
- <LI><A HREF="#see also">SEE ALSO</A></LI>
- <LI><A HREF="#copyright">COPYRIGHT</A></LI>
- </UL>
- <!-- INDEX END -->
-
- <HR>
- <P>
- <H1><A NAME="name">NAME</A></H1>
- <P>HTML::Parser - SGML parser class</P>
- <P>
- <HR>
- <H1><A NAME="supportedplatforms">SUPPORTED PLATFORMS</A></H1>
- <UL>
- <LI>Linux</LI>
- <LI>Solaris</LI>
- <LI>Windows</LI>
- </UL>
- <HR>
- <H1><A NAME="synopsis">SYNOPSIS</A></H1>
- <PRE>
- require HTML::Parser;
- $p = HTML::Parser->new; # should really be a subclass
- $p->parse($chunk1);
- $p->parse($chunk2);
- #...
- $p->eof; # signal end of document</PRE>
- <PRE>
- # Parse directly from file
- $p->parse_file("foo.html");
- # or
- open(F, "foo.html") || die "Can't open foo.html: $!";
- $p->parse_file(\*F);</PRE>
- <P>
- <HR>
- <H1><A NAME="description">DESCRIPTION</A></H1>
- <P>The <CODE>HTML::Parser</CODE> tokenizes an HTML document as it is passed to the <A HREF="#item_parse"><CODE>parse()</CODE></A>
- method, invoking a callback method for each token it recognizes. The document to
- be parsed can be supplied in arbitrary chunks.</P>
- <P>The external interface to an <EM>HTML::Parser</EM> is:</P>
- <DL>
- <DT><STRONG><A NAME="item_new">$p = HTML::Parser->new</A></STRONG><BR>
- <DD>
- The object constructor takes no arguments.
- <P></P>
- <DT><STRONG><A NAME="item_parse">$p->parse( $string );</A></STRONG><BR>
- <DD>
- Parse $string as the next chunk of the HTML document. Can be called multiple times.
- The return value is a reference to the parser object.
- <P></P>
- <DT><STRONG><A NAME="item_eof">$p->eof</A></STRONG><BR>
- <DD>
- Signals end of document. Call <A HREF="#item_eof"><CODE>eof()</CODE></A> to flush any remaining buffered
- text. The return value is a reference to the parser object.
- <P></P>
- <DT><STRONG><A NAME="item_parse_file">$p->parse_file( $file );</A></STRONG><BR>
- <DD>
- This method can be called to parse text from a file. The argument can
- be a filename or an already opened file handle. The return value from
- <A HREF="#item_parse_file"><CODE>parse_file()</CODE></A> is a reference to the parser object.
- <P></P>
- <DT><STRONG><A NAME="item_strict_comment">$p->strict_comment( [$bool] )</A></STRONG><BR>
- <DD>
- By default we parse comments similarly to how the popular browsers (like
- Netscape and MSIE) do it. This means that comments will always be
- terminated by the first occurrence of ``-->''. This is not correct
- according to the ``official'' HTML standards. The official behaviour
- can be enabled by calling the <A HREF="#item_strict_comment"><CODE>strict_comment()</CODE></A> method with a TRUE
- argument.
- <P>The return value from <A HREF="#item_strict_comment"><CODE>strict_comment()</CODE></A> is the old attribute value.</P>
- <P></P></DL>
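- <P>As a minimal sketch of the interface above: the markup strings are made-up sample input, and a plain <CODE>HTML::Parser</CODE> object ignores every token, so nothing is printed. It only illustrates that <CODE>parse()</CODE> and <CODE>eof()</CODE> return the parser object (so calls can be chained) and that <CODE>strict_comment()</CODE> hands back the previous setting:</P>

```perl
use strict;
use warnings;
require HTML::Parser;

my $p = HTML::Parser->new;

# strict_comment() returns the old attribute value, so it can be restored later.
my $old = $p->strict_comment(1);

# parse() and eof() both return the parser object, so the calls chain.
# The chunk boundary is arbitrary; here it falls in the middle of a word.
my $ret = $p->parse("<p>Hello, <b>wor")->parse("ld</b></p>")->eof;

# Put the comment handling back the way it was.
$p->strict_comment($old);
```

- <P>Input that cannot be tokenized yet stays buffered until a later <CODE>parse()</CODE> call or the final <CODE>eof()</CODE>, which is why the document may be split at any point.</P>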
- <P>In order to make the parser do anything interesting, you must make a
- subclass where you override one or more of the following methods as
- appropriate:</P>
- <DL>
- <DT><STRONG><A NAME="item_declaration">$self-><CODE>declaration($decl)</CODE></A></STRONG><BR>
- <DD>
- This method is called when a <EM>markup declaration</EM> has been
- recognized. For typical HTML documents, the only declaration you are
- likely to find is &lt;!DOCTYPE ...&gt;. The initial ``&lt;!'' and ending ``&gt;'' are
- not part of the string passed as argument. Comments are removed and
- entities will <STRONG>not</STRONG> be expanded.
- <P></P>
- <DT><STRONG><A NAME="item_start">$self->start($tag, $attr, $attrseq, $origtext)</A></STRONG><BR>
- <DD>
- This method is called when a complete start tag has been recognized.
- The first argument is the tag name (in lower case) and the second
- argument is a reference to a hash that contains all attributes found
- within the start tag. The attribute keys are converted to lower case.
- Entities found in the attribute values are already expanded. The
- third argument is a reference to an array with the lower case
- attribute keys in the original order. The fourth argument is the
- original HTML text.
- <P></P>
- <DT><STRONG><A NAME="item_end">$self->end($tag, $origtext)</A></STRONG><BR>
- <DD>
- This method is called when an end tag has been recognized. The
- first argument is the lower case tag name, the second the original
- HTML text of the tag.
- <P></P>
- <DT><STRONG><A NAME="item_text">$self-><CODE>text($text)</CODE></A></STRONG><BR>
- <DD>
- This method is called when plain text in the document is recognized.
- The text is passed on unmodified and might contain multiple lines.
- Note that for efficiency reasons entities in the text are <STRONG>not</STRONG>
- expanded. You should call HTML::Entities::decode($text) before you
- process the text any further.
- <P>A sequence of text in the HTML document can be broken between several
- invocations of $self->text. The parser will make sure that it does
- not break a word or a sequence of spaces between two invocations of
- $self->text().</P>
- <P></P>
- <DT><STRONG><A NAME="item_comment">$self-><CODE>comment($comment)</CODE></A></STRONG><BR>
- <DD>
- This method is called as comments are recognized. The leading and
- trailing ``--'' sequences have been stripped off the comment text.
- <P></P></DL>
- <P>The default implementations of these methods do nothing, i.e., the
- tokens are just ignored.</P>
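- <P>The following is a sketch of such a subclass, not a recommended design. The package name <CODE>OutlineParser</CODE>, the <CODE>outline_*</CODE> hash keys and the sample markup are hypothetical, and it assumes the parser object is a hash reference in which a subclass may keep its own keys. Text is decoded with <CODE>HTML::Entities::decode_entities()</CODE> before it is stored:</P>

```perl
use strict;
use warnings;

package OutlineParser;    # hypothetical subclass name
require HTML::Parser;
require HTML::Entities;
our @ISA = ('HTML::Parser');

# Called for every start tag: remember the tag name and any link target.
sub start {
    my ($self, $tag, $attr, $attrseq, $origtext) = @_;
    push @{ $self->{outline_tags} }, $tag;
    push @{ $self->{outline_links} }, $attr->{href}
        if $tag eq 'a' && defined $attr->{href};
}

# Called for plain text, possibly several times for one run of text.
# Entities are not expanded by the parser, so decode them here.
sub text {
    my ($self, $text) = @_;
    $self->{outline_text} .= HTML::Entities::decode_entities($text);
}

package main;

my $p = OutlineParser->new;
$p->parse('<h1>Fish &amp; ')->parse('chips</h1><a href="menu.html">Menu</a>');
$p->eof;
```

- <P>Because one run of text can arrive in several <CODE>text()</CODE> calls, the sketch appends to a string instead of assuming one call per text node. The methods that are not overridden (<CODE>end()</CODE>, <CODE>comment()</CODE>, <CODE>declaration()</CODE>) keep their do-nothing defaults.</P>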
- <P>There is really nothing in the basic parser that is HTML specific, so
- it is likely that the parser can parse other kinds of SGML documents.
- SGML has many obscure features (not implemented by this module) that
- prevent us from renaming this module as <CODE>SGML::Parser</CODE>.</P>
- <P>
- <HR>
- <H1><A NAME="efficiency">EFFICIENCY</A></H1>
- <P>The parser is fairly inefficient if the chunks passed to $p-><A HREF="#item_parse"><CODE>parse()</CODE></A>
- are too big. The reason is probably that perl ends up with a lot of
- character copying when tokens are removed from the beginning of the
- strings. A chunk size of about 256-512 bytes was optimal in a test I
- made with some real-world HTML documents. (The parser was about 3
- times slower with a chunk size of 20K.)</P>
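- <P>One way to follow this advice is to read the input in fixed-size pieces and hand each piece to <CODE>parse()</CODE>. The helper name <CODE>parse_in_chunks</CODE>, the 512-byte default and the temporary sample file are assumptions made for this sketch. Whether 512 bytes is the best size for a given installation is something to measure, not something the code guarantees:</P>

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
require HTML::Parser;

# Feed a file to the parser in chunks of $size bytes (hypothetical helper).
# Returns the number of chunks that were passed to parse().
sub parse_in_chunks {
    my ($p, $path, $size) = @_;
    $size ||= 512;
    open(my $fh, '<', $path) or die "Can't open $path: $!";
    my $chunks = 0;
    while (read($fh, my $buf, $size)) {
        $p->parse($buf);
        $chunks++;
    }
    close($fh);
    $p->eof;    # flush whatever is still buffered
    return $chunks;
}

# Sample input: a 1507-byte document written to a temporary file.
my ($tmp, $path) = tempfile(UNLINK => 1);
print {$tmp} "<p>" . ("word " x 300) . "</p>";
close($tmp);

my $chunks = parse_in_chunks(HTML::Parser->new, $path, 512);
```

- <P>With a real subclass in place of the plain <CODE>HTML::Parser</CODE> object, the same loop drives the callbacks while keeping each string passed to <CODE>parse()</CODE> small.</P>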
- <P>
- <HR>
- <H1><A NAME="see also">SEE ALSO</A></H1>
- <P><A HREF="../../../site/lib/HTML/Entities.html">the HTML::Entities manpage</A>, <A HREF="../../../site/lib/HTML/TokeParser.html">the HTML::TokeParser manpage</A>, <A HREF="../../../site/lib/HTML/Filter.html">the HTML::Filter manpage</A>,
- <A HREF="../../../site/lib/HTML/HeadParser.html">the HTML::HeadParser manpage</A>, <A HREF="../../../site/lib/HTML/LinkExtor.html">the HTML::LinkExtor manpage</A></P>
- <P><A HREF="../../../site/lib/HTML/TreeBuilder.html">the HTML::TreeBuilder manpage</A> (part of the <EM>HTML-Tree</EM> distribution)</P>
- <P>
- <HR>
- <H1><A NAME="copyright">COPYRIGHT</A></H1>
- <P>Copyright 1996-1999 Gisle Aas. All rights reserved.</P>
- <P>This library is free software; you can redistribute it and/or
- modify it under the same terms as Perl itself.</P>
-
- </BODY>
-
- </HTML>
-