- <HTML>
- <HEAD>
- <TITLE>HTML::LinkExtor - Extract links from an HTML document</TITLE>
- <LINK REL="stylesheet" HREF="../../../Active.css" TYPE="text/css">
- <LINK REV="made" HREF="mailto:">
- </HEAD>
- <BODY>
- <STRONG><P CLASS=block> HTML::LinkExtor - Extract links from an HTML document</P></STRONG>
- </TD></TR>
- </TABLE>
- <A NAME="__index__"></A>
- <!-- INDEX BEGIN -->
- <UL>
- <LI><A HREF="#name">NAME</A></LI>
- <LI><A HREF="#supportedplatforms">SUPPORTED PLATFORMS</A></LI>
- <LI><A HREF="#synopsis">SYNOPSIS</A></LI>
- <LI><A HREF="#description">DESCRIPTION</A></LI>
- <LI><A HREF="#example">EXAMPLE</A></LI>
- <LI><A HREF="#see also">SEE ALSO</A></LI>
- <LI><A HREF="#copyright">COPYRIGHT</A></LI>
- </UL>
- <!-- INDEX END -->
- <HR>
- <P>
- <H1><A NAME="name">NAME</A></H1>
- <P>HTML::LinkExtor - Extract links from an HTML document</P>
- <P>
- <HR>
- <H1><A NAME="supportedplatforms">SUPPORTED PLATFORMS</A></H1>
- <UL>
- <LI>Linux</LI>
- <LI>Solaris</LI>
- <LI>Windows</LI>
- </UL>
- <HR>
- <H1><A NAME="synopsis">SYNOPSIS</A></H1>
- <PRE>
- require HTML::LinkExtor;
- $p = HTML::LinkExtor->new(\&cb, "http://www.sn.no/");
- sub cb {
- my($tag, %links) = @_;
- print "$tag @{[%links]}\n";
- }
- $p->parse_file("index.html");</PRE>
- <P>
- <HR>
- <H1><A NAME="description">DESCRIPTION</A></H1>
- <P><EM>HTML::LinkExtor</EM> is an HTML parser that extracts links from an
- HTML document. It is a subclass of <EM>HTML::Parser</EM>, which means
- that the document should be given to the parser by calling the
- $p-><CODE>parse()</CODE> or $p-><CODE>parse_file()</CODE> methods.</P>
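- <P>As a rough illustration of the <CODE>parse()</CODE> route, the sketch below feeds a document to the parser in pieces and then calls <CODE>eof()</CODE>, a method inherited from <EM>HTML::Parser</EM>, to signal the end of input. The HTML fragments and the <CODE>@found</CODE> array are made up for this illustration.</P>
- <PRE>
- use strict;
- use warnings;
- use HTML::LinkExtor;
- # Collect "tag href" strings as the parser reports links (illustrative only)
- my @found;
- my $p = HTML::LinkExtor->new(sub {
-     my ($tag, %links) = @_;
-     push @found, "$tag $links{href}" if exists $links{href};
- });
- # Chunks may end anywhere, even in the middle of a tag
- $p->parse('<a hr');
- $p->parse('ef="one.html">One</a> <a href="two.html">Two</a>');
- $p->eof;    # tell the parser that no more input will follow
- print "$_\n" for @found;</PRE>
- <P>Calling <CODE>parse()</CODE> repeatedly like this is what makes it possible to process a document while it is still arriving.</P>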
- <DL>
- <DT><STRONG><A NAME="item_new">$p = HTML::LinkExtor->new([$callback[, $base]])</A></STRONG><BR>
- <DD>
- The constructor takes two optional arguments. The first is a reference
- to a callback routine, which is called as links are found. If no
- callback is provided, links are accumulated internally and can be
- retrieved by calling the $p-><A HREF="#item_links"><CODE>links()</CODE></A> method.
- <P>The second argument, $base, is an optional base URL used to make all
- URLs found absolute. You need to have the <EM>URI::URL</EM> module
- installed if you provide $base.</P>
- <P>The callback is called with the lowercase tag name as first argument,
- and then all link attributes as separate key/value pairs. All
- non-link attributes are removed.</P>
- <P></P>
- <DT><STRONG><A NAME="item_links">$p->links</A></STRONG><BR>
- <DD>
- Returns a list of all links found in the document. The returned
- values are anonymous arrays with the following elements:
- <PRE>
- [$tag, $attr => $url1, $attr2 => $url2,...]</PRE>
- <P>The $p->links method will also truncate the internal link list. This
- means that if the method is called twice without any parsing in
- between then the second call will return an empty list.</P>
- <P>Also note that $p->links will always be empty if a callback routine
- was provided when the <EM>HTML::LinkExtor</EM> was created.</P>
- <P></P></DL>
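- <P>The accumulate-and-drain behaviour of $p->links described above can be sketched as follows, assuming a parser created without a callback. The HTML fragment and the variable names <CODE>@first</CODE> and <CODE>@second</CODE> are hypothetical.</P>
- <PRE>
- use strict;
- use warnings;
- use HTML::LinkExtor;
- # No callback given, so links are stored inside the parser object
- my $p = HTML::LinkExtor->new;
- $p->parse('<a href="a.html">A</a><img src="pic.png">');
- $p->eof;
- my @first  = $p->links;   # returns the stored links and empties the list
- my @second = $p->links;   # nothing was parsed in between, so this is empty
- # Each element is an array reference: [$tag, $attr => $url, ...]
- for my $link (@first) {
-     my ($tag, %attr) = @$link;
-     print "$tag: $_ => $attr{$_}\n" for sort keys %attr;
- }
- print "second call returned ", scalar(@second), " links\n";</PRE>
- <P>Because the list is emptied on each call, the result should be stored in a variable if it is needed more than once.</P>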
- <P>
- <HR>
- <H1><A NAME="example">EXAMPLE</A></H1>
- <P>This is an example showing how you can extract links from a document
- received using LWP:</P>
- <PRE>
- use LWP::UserAgent;
- use HTML::LinkExtor;
- use URI::URL;</PRE>
- <PRE>
- $url = "http://www.sn.no/"; # for instance
- $ua = LWP::UserAgent->new;</PRE>
- <PRE>
- # Set up a callback that collects image links
- my @imgs = ();
- sub callback {
- my($tag, %attr) = @_;
- return if $tag ne 'img'; # we only look closer at <img ...>
- push(@imgs, values %attr);
- }</PRE>
- <PRE>
- # Make the parser. Unfortunately, we don't know the base yet
- # (it might be different from $url)
- $p = HTML::LinkExtor->new(\&callback);</PRE>
- <PRE>
- # Request document and parse it as it arrives
- $res = $ua->request(HTTP::Request->new(GET => $url),
- sub {$p->parse($_[0])});</PRE>
- <PRE>
- # Expand all image URLs to absolute ones
- my $base = $res->base;
- @imgs = map { url($_, $base)->abs } @imgs;</PRE>
- <PRE>
- # Print them out
- print join("\n", @imgs), "\n";</PRE>
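- <P>When the base URL is already known before parsing, passing it as the second constructor argument may be simpler than expanding the URLs afterwards. The following is a minimal sketch under that assumption. The inline HTML fragment stands in for a fetched document, and the values are stringified because they may be URL objects rather than plain strings.</P>
- <PRE>
- use strict;
- use warnings;
- use HTML::LinkExtor;
- my $base = "http://www.sn.no/";   # assumed to be known in advance
- my @imgs;
- # Same image-collecting callback as above, but the parser does the expansion
- my $p = HTML::LinkExtor->new(sub {
-     my ($tag, %attr) = @_;
-     return if $tag ne 'img';
-     push @imgs, map { "$_" } values %attr;   # force plain strings
- }, $base);
- $p->parse('<img src="logo.gif"><a href="x.html">x</a>');
- $p->eof;
- print join("\n", @imgs), "\n";</PRE>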
- <P>
- <HR>
- <H1><A NAME="see also">SEE ALSO</A></H1>
- <P><A HREF="../../../site/lib/HTML/Parser.html">the HTML::Parser manpage</A>, <A HREF="../../../site/lib/LWP.html">the LWP manpage</A>, <A HREF="../../../site/lib/URI/URL.html">the URI::URL manpage</A></P>
- <P>
- <HR>
- <H1><A NAME="copyright">COPYRIGHT</A></H1>
- <P>Copyright 1996-1998 Gisle Aas.</P>
- <P>This library is free software; you can redistribute it and/or
- modify it under the same terms as Perl itself.</P>
- </TD></TR>
- </TABLE>
- </BODY>
- </HTML>