-
- <HTML>
- <HEAD>
- <TITLE>HTML::TokeParser - Alternative HTML::Parser interface</TITLE>
- <LINK REL="stylesheet" HREF="../../../Active.css" TYPE="text/css">
- <LINK REV="made" HREF="mailto:">
- </HEAD>
-
- <BODY>
- <TABLE BORDER=0 CELLPADDING=0 CELLSPACING=0 WIDTH=100%>
- <TR><TD CLASS=block VALIGN=MIDDLE WIDTH=100% BGCOLOR="#cccccc">
- <STRONG><P CLASS=block> HTML::TokeParser - Alternative HTML::Parser interface</P></STRONG>
- </TD></TR>
- </TABLE>
-
- <A NAME="__index__"></A>
- <!-- INDEX BEGIN -->
-
- <UL>
-
- <LI><A HREF="#name">NAME</A></LI><LI><A HREF="#supportedplatforms">SUPPORTED PLATFORMS</A></LI>
-
- <LI><A HREF="#synopsis">SYNOPSIS</A></LI>
- <LI><A HREF="#description">DESCRIPTION</A></LI>
- <LI><A HREF="#examples">EXAMPLES</A></LI>
- <LI><A HREF="#see also">SEE ALSO</A></LI>
- <LI><A HREF="#copyright">COPYRIGHT</A></LI>
- </UL>
- <!-- INDEX END -->
-
- <HR>
- <P>
- <H1><A NAME="name">NAME</A></H1>
- <P>HTML::TokeParser - Alternative HTML::Parser interface</P>
- <P>
- <HR>
- <H1><A NAME="supportedplatforms">SUPPORTED PLATFORMS</A></H1>
- <UL>
- <LI>Linux</LI>
- <LI>Solaris</LI>
- <LI>Windows</LI>
- </UL>
- <HR>
- <H1><A NAME="synopsis">SYNOPSIS</A></H1>
- <PRE>
- require HTML::TokeParser;
- $p = HTML::TokeParser->new("index.html") || die "Can't open: $!";
- while (my $token = $p->get_token) {
- #...
- }</PRE>
- <P>
- <HR>
- <H1><A NAME="description">DESCRIPTION</A></H1>
- <P>HTML::TokeParser is an alternative interface to the HTML::Parser class.
- It essentially turns HTML::Parser inside out. You associate a file
- (or any IO::Handle object or string) with the parser at construction time and
- then repeatedly call $parser->get_token to obtain the tags and text
- found in the parsed document. There is no need to write a subclass to make the
- parser do anything.</P>
- <P>Calling the methods defined by the HTML::Parser base class will be
- confusing, so don't do that. Use the following methods instead:</P>
- <DL>
- <DT><STRONG><A NAME="item_new">$p = HTML::TokeParser->new( $file_or_doc );</A></STRONG><BR>
- <DD>
- The object constructor argument is either a file name, a file handle
- object, or the complete document to be parsed.
- <P>If the argument is a plain scalar, then it is taken as the name of a
- file to be opened and parsed. If the file can't be opened for
- reading, then the constructor will return an undefined value and $!
- will tell you why it failed.</P>
- <P>If the argument is a reference to a plain scalar, then this scalar is
- taken to be the document to parse.</P>
- <P>Otherwise the argument is taken to be some object that the
- <CODE>HTML::TokeParser</CODE> can <A HREF="../../../lib/Pod/perlfunc.html#item_read"><CODE>read()</CODE></A> from when it needs more data. Typically
- it will be a filehandle of some kind. The stream will be <A HREF="../../../lib/Pod/perlfunc.html#item_read"><CODE>read()</CODE></A> until
- EOF, but not closed.</P>
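- <P>As a rough illustration of the three argument forms, the sketch below
- passes the same document to the constructor as a scalar reference and as a
- filehandle, then shows the failure case for a file name. The file name
- no-such-file.html (assumed not to exist) and all variable names are
- assumptions made for this example, and an in-memory filehandle stands in for
- a real file.</P>

```perl
use strict;
use warnings;
use HTML::TokeParser;

my $html = "<p>Hello</p>";

# Reference to a plain scalar: the scalar is the document itself.
my $from_doc = HTML::TokeParser->new(\$html)
    or die "Can't parse document: $!";

# Filehandle: the parser reads from it until EOF but does not close it.
# (An in-memory handle is used so the sketch needs no file on disk.)
open(my $fh, "<", \$html) or die "Can't open in-memory handle: $!";
my $from_fh = HTML::TokeParser->new($fh)
    or die "Can't parse handle: $!";

# Both parsers should report the same first token, the <p> start tag.
my $tag_doc = $from_doc->get_token->[1];
my $tag_fh  = $from_fh->get_token->[1];
print "first tag: $tag_doc / $tag_fh\n";
close($fh);    # closing the handle is the caller's job

# Plain scalar: taken as a file name. For a file that can't be opened
# (hypothetical name) the constructor returns undef and sets $!.
my $missing = HTML::TokeParser->new("no-such-file.html");
print "open failed: $!\n" unless defined $missing;
```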
- <P></P>
- <DT><STRONG><A NAME="item_get_token">$p->get_token</A></STRONG><BR>
- <DD>
- This method will return the next <EM>token</EM> found in the HTML document,
- or <A HREF="../../../lib/Pod/perlfunc.html#item_undef"><CODE>undef</CODE></A> at the end of the document. The token is returned as an
- array reference. The first element of the array will be a single-character
- string denoting the type of this token: ``S'' for start tag,
- ``E'' for end tag, ``T'' for text, ``C'' for comment, and ``D'' for
- declaration. The rest of the array is the same as the arguments
- passed to the corresponding HTML::Parser callbacks (see
- <A HREF="../../../site/lib/HTML/Parser.html">the HTML::Parser manpage</A>). This summarizes the tokens that can occur
- ($attr is a reference to a hash of attributes and $attrseq is a reference to an
- array of the attribute names in their original order):
- <PRE>
- ["S", $tag, $attr, $attrseq, $origtext]
- ["E", $tag, $origtext]
- ["T", $text]
- ["C", $text]
- ["D", $text]</PRE>
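- <P>A minimal sketch of a token loop that dispatches on the type code may make
- the array layout more concrete. The sample document and the @seen list are
- invented for this example. It assumes that the attribute slots of a start
- token hold a hash reference and an array reference, which is how the link
- example later in this document uses them.</P>

```perl
use strict;
use warnings;
use HTML::TokeParser;

my $html = '<!-- note --><p class="x">Hi</p>';
my $p = HTML::TokeParser->new(\$html) or die "Can't parse: $!";

my @seen;
while (my $token = $p->get_token) {
    my ($type, @rest) = @$token;
    if ($type eq "S") {
        # $attr is a hash reference, $attrseq an array reference
        # listing the attribute names in document order.
        my ($tag, $attr, $attrseq) = @rest;
        my $attrs = join(",", map { "$_=$attr->{$_}" } @$attrseq);
        push @seen, "start $tag ($attrs)";
    }
    elsif ($type eq "E") { push @seen, "end $rest[0]" }
    elsif ($type eq "T") { push @seen, "text $rest[0]" }
    elsif ($type eq "C") { push @seen, "comment" }
    elsif ($type eq "D") { push @seen, "declaration" }
}
print "$_\n" for @seen;
```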
- <P></P>
- <DT><STRONG><A NAME="item_unget_token">$p-><CODE>unget_token($token,...)</CODE></A></STRONG><BR>
- <DD>
- If you find out you have read too many tokens you can push them back,
- so that they are returned the next time $p->get_token is called.
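- <P>One possible use is peeking at the next token without consuming it. The
- following sketch, with an invented sample document and variable names, reads
- a token, pushes it back, and reads it again.</P>

```perl
use strict;
use warnings;
use HTML::TokeParser;

my $html = "<h1>Title</h1><p>Body</p>";
my $p = HTML::TokeParser->new(\$html) or die "Can't parse: $!";

# Read one token to inspect it, then push it back.
my $peeked = $p->get_token;
$p->unget_token($peeked);

# The next call hands out the pushed-back token again.
my $next = $p->get_token;
print "next token is still the $next->[1] start tag\n";
```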
- <P></P>
- <DT><STRONG><A NAME="item_get_tag">$p->get_tag( [$tag] )</A></STRONG><BR>
- <DD>
- This method returns the next start or end tag (skipping any other
- tokens), or <A HREF="../../../lib/Pod/perlfunc.html#item_undef"><CODE>undef</CODE></A> if there are no more tags in the document. If an
- argument is given, then tokens are skipped until the specified tag is
- found. A tag is returned as an array reference of the same form as
- for $p->get_token above, but the type code (first element) is missing
- and the names of end tags are prefixed with ``/''. This means that the
- tags returned look like this:
- <PRE>
- [$tag, $attr, $attrseq, $origtext]
- ["/$tag", $origtext]</PRE>
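- <P>The sketch below contrasts the two calling styles: without an argument
- every start and end tag is returned in turn, while with an argument the
- parser skips ahead to the named tag. The sample document and the @names and
- @hrefs lists are assumptions made for this example.</P>

```perl
use strict;
use warnings;
use HTML::TokeParser;

my $html = '<ul><li><a href="a.html">A</a></li>'
         . '<li><a href="b.html">B</a></li></ul>';

# No argument: collect the name of every start and end tag.
my $p = HTML::TokeParser->new(\$html) or die "Can't parse: $!";
my @names;
while (my $tag = $p->get_tag) {
    push @names, $tag->[0];          # end tags arrive as "/name"
}
print "@names\n";

# With an argument: skip everything until the next <a> start tag.
my $q = HTML::TokeParser->new(\$html) or die "Can't parse: $!";
my @hrefs;
while (my $tag = $q->get_tag("a")) {
    push @hrefs, $tag->[1]{href};    # second element is the attribute hash
}
print "@hrefs\n";
```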
- <P></P>
- <DT><STRONG><A NAME="item_get_text">$p->get_text( [$endtag] )</A></STRONG><BR>
- <DD>
- This method returns all text found at the current position. It will
- return a zero-length string if the next token is not text. The
- optional $endtag argument specifies that any text occurring before the
- given tag is to be returned. Any entities will be expanded to their
- corresponding characters.
- <P>The $p->{textify} attribute is a hash that defines how certain tags can
- be treated as text. If the name of a start tag matches a key in this
- hash, then this tag is converted to text. The hash value is used to
- specify which tag attribute to obtain the text from. If this tag
- attribute is missing, then the upper case name of the tag enclosed in
- brackets is returned, e.g. ``[IMG]''. The hash value can also be a
- subroutine reference. In this case the routine is called with the
- start tag token content as arguments and the return value is treated
- as the text.</P>
- <P>The default $p->{textify} value is:</P>
- <PRE>
- {img => "alt", applet => "alt"}</PRE>
- <P>This means that <IMG> and <APPLET> tags are treated as text, and that
- the text to substitute is taken from the ALT attribute.</P>
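- <P>To make the textify rules concrete, here is a small sketch that exercises
- all three cases: an image with an ALT attribute, an image without one, and a
- subroutine reference. The br rule is a hypothetical addition for this
- example, not a default, and the sample document is invented.</P>

```perl
use strict;
use warnings;
use HTML::TokeParser;

my $html = '<p>See <img src="logo.png" alt="our logo"> and '
         . '<img src="x.png"><br>more</p>';
my $p = HTML::TokeParser->new(\$html) or die "Can't parse: $!";

# Hypothetical extra rule: turn each <br> into a newline.
# The subroutine receives the start tag token content as arguments.
$p->{textify}{br} = sub { "\n" };

$p->get_tag("p");                  # position the parser just after <p>
my $text = $p->get_text("/p");     # collect text up to the closing </p>
print $text, "\n";
```

- <P>The first image contributes its ALT text, the second falls back to
- the bracketed tag name because it has no ALT attribute, and the br tag is
- replaced by whatever the subroutine returns.</P>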
- <P></P>
- <DT><STRONG><A NAME="item_get_trimmed_text">$p->get_trimmed_text( [$endtag] )</A></STRONG><BR>
- <DD>
- Same as $p->get_text above, but will collapse any sequence of white
- space to a single space character. Leading and trailing white space is
- removed.
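- <P>The difference from $p->get_text is easiest to see side by side. This
- sketch parses the same invented fragment twice, once with each method. The
- two parser objects and the variable names are assumptions for this
- example.</P>

```perl
use strict;
use warnings;
use HTML::TokeParser;

my $html = "<p>\n  Fish   &amp;\n chips </p>";

# get_text keeps the white space exactly as it appears in the source.
my $raw_parser = HTML::TokeParser->new(\$html) or die "Can't parse: $!";
$raw_parser->get_tag("p");
my $raw = $raw_parser->get_text("/p");

# get_trimmed_text collapses runs of white space and trims both ends.
my $trim_parser = HTML::TokeParser->new(\$html) or die "Can't parse: $!";
$trim_parser->get_tag("p");
my $trimmed = $trim_parser->get_trimmed_text("/p");

printf "raw is %d characters, trimmed is '%s'\n", length($raw), $trimmed;
```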
- <P></P></DL>
- <P>
- <HR>
- <H1><A NAME="examples">EXAMPLES</A></H1>
- <P>This example extracts all links from a document. It will print one
- line for each link, containing the URL and the textual description
- between the <A>...</A> tags:</P>
- <PRE>
- use HTML::TokeParser;
- # Parse the file named on the command line, or index.html by default.
- $p = HTML::TokeParser->new(shift||"index.html") || die "Can't open: $!";
-
- while (my $token = $p->get_tag("a")) {
-     my $url = $token->[1]{href} || "-";
-     my $text = $p->get_trimmed_text("/a");
-     print "$url\t$text\n";
- }</PRE>
- <P>This example extracts the <TITLE> from the document:</P>
- <PRE>
- use HTML::TokeParser;
- $p = HTML::TokeParser->new(shift||"index.html") || die "Can't open: $!";
- if ($p->get_tag("title")) {
-     my $title = $p->get_trimmed_text;
-     print "Title: $title\n";
- }</PRE>
- <P>
- <HR>
- <H1><A NAME="see also">SEE ALSO</A></H1>
- <P><A HREF="../../../site/lib/HTML/Parser.html">the HTML::Parser manpage</A></P>
- <P>
- <HR>
- <H1><A NAME="copyright">COPYRIGHT</A></H1>
- <P>Copyright 1998-1999 Gisle Aas.</P>
- <P>This library is free software; you can redistribute it and/or
- modify it under the same terms as Perl itself.</P>
-
- </BODY>
-
- </HTML>
-