
<HTML>
<HEAD>
<TITLE>WWW::RobotRules - Parse robots.txt files</TITLE>
<LINK REL="stylesheet" HREF="../../../Active.css" TYPE="text/css">
</HEAD>

<BODY>
<TABLE BORDER=0 CELLPADDING=0 CELLSPACING=0 WIDTH=100%>
<TR><TD CLASS=block VALIGN=MIDDLE WIDTH=100% BGCOLOR="#cccccc">
<STRONG><P CLASS=block> WWW::RobotRules - Parse robots.txt files</P></STRONG>
</TD></TR>
</TABLE>

<A NAME="__index__"></A>
<!-- INDEX BEGIN -->

<UL>

<LI><A HREF="#name">NAME</A></LI>
<LI><A HREF="#supportedplatforms">SUPPORTED PLATFORMS</A></LI>

<LI><A HREF="#synopsis">SYNOPSIS</A></LI>
<LI><A HREF="#description">DESCRIPTION</A></LI>
<LI><A HREF="#robots.txt">ROBOTS.TXT</A></LI>
<LI><A HREF="#robots.txt examples">ROBOTS.TXT EXAMPLES</A></LI>
<LI><A HREF="#see also">SEE ALSO</A></LI>
</UL>
<!-- INDEX END -->

<HR>
<P>
- <H1><A NAME="name">NAME</A></H1>
- <P>WWW::RobotsRules - Parse robots.txt files</P>
- <P>
- <HR>
- <H1><A NAME="supportedplatforms">SUPPORTED PLATFORMS</A></H1>
- <UL>
- <LI>Linux</LI>
- <LI>Solaris</LI>
- <LI>Windows</LI>
- </UL>
- <HR>
- <H1><A NAME="synopsis">SYNOPSIS</A></H1>
<PRE>
 use WWW::RobotRules;
 my $robotsrules = WWW::RobotRules->new('MOMspider/1.0');</PRE>
<PRE>
 use LWP::Simple qw(get);</PRE>
<PRE>
 $url = "http://some.place/robots.txt";
 my $robots_txt = get $url;
 $robotsrules->parse($url, $robots_txt);</PRE>
<PRE>
 $url = "http://some.other.place/robots.txt";
 my $robots_txt = get $url;
 $robotsrules->parse($url, $robots_txt);</PRE>
<PRE>
 # Now we are able to check if a URL is valid for those servers that
 # we have obtained and parsed "robots.txt" files for.
 if ($robotsrules->allowed($url)) {
     $c = get $url;
     ...
 }</PRE>
<P>
<HR>
<H1><A NAME="description">DESCRIPTION</A></H1>
<P>This module parses a <EM>/robots.txt</EM> file as specified in
``A Standard for Robot Exclusion'', described at
<A HREF="http://info.webcrawler.com/mak/projects/robots/norobots.html">http://info.webcrawler.com/mak/projects/robots/norobots.html</A>.
Webmasters can use the <EM>/robots.txt</EM> file to disallow conforming
robots access to parts of their web site.</P>
<P>The parsed file is kept in the WWW::RobotRules object, and this object
provides methods to check if access to a given URL is prohibited. The
same WWW::RobotRules object can parse multiple <EM>/robots.txt</EM> files.</P>
<P>The following methods are provided:</P>
<DL>
<DT><STRONG><A NAME="item_new">$rules = WWW::RobotRules-><CODE>new($robot_name)</CODE></A></STRONG><BR>
<DD>
This is the constructor for WWW::RobotRules objects. The first
argument given to <A HREF="#item_new"><CODE>new()</CODE></A> is the name of the robot.
<P></P>
<DT><STRONG><A NAME="item_parse">$rules->parse($url, $content, $fresh_until)</A></STRONG><BR>
<DD>
The <A HREF="#item_parse"><CODE>parse()</CODE></A> method takes as arguments the URL that was used to
retrieve the <EM>/robots.txt</EM> file, and the contents of the file. The
optional $fresh_until argument is a timestamp telling caching subclasses
(such as WWW::RobotRules::AnyDBM_File) how long the parsed rules may be
considered fresh.
<P></P>
<DT><STRONG><A NAME="item_allowed">$rules-><CODE>allowed($url)</CODE></A></STRONG><BR>
<DD>
Returns TRUE if this robot is allowed to retrieve this URL.
<P></P>
<DT><STRONG><A NAME="item_agent">$rules-><CODE>agent([$name])</CODE></A></STRONG><BR>
<DD>
Get/set the agent name. NOTE: Changing the agent name will clear the robots.txt
rules and expire times out of the cache.
<P></P></DL>
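<P>Putting these methods together, here is a minimal sketch of a complete
check (the robot name, host, and paths are placeholders, not values the
module itself defines):</P>
<PRE>
 use WWW::RobotRules;
 use LWP::Simple qw(get);

 # Name our robot; agent() returns the short name 'ExampleBot'
 my $rules = WWW::RobotRules->new('ExampleBot/0.1');

 # Fetch a server's robots.txt and feed it to the parser
 my $robots_url = 'http://www.example.com/robots.txt';
 my $robots_txt = get($robots_url);
 $rules->parse($robots_url, $robots_txt) if defined $robots_txt;

 # Consult the parsed rules before fetching a page on that server
 my $page_url = 'http://www.example.com/some/page.html';
 if ($rules->allowed($page_url)) {
     my $content = get($page_url);
     # ... process $content ...
 }</PRE>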
<P>
<HR>
<H1><A NAME="robots.txt">ROBOTS.TXT</A></H1>
<P>The format and semantics of the ``/robots.txt'' file are as follows
(this is an edited abstract of
<A HREF="http://info.webcrawler.com/mak/projects/robots/norobots.html">http://info.webcrawler.com/mak/projects/robots/norobots.html</A>):</P>
<P>The file consists of one or more records separated by one or more
blank lines. Each record contains lines of the form</P>
<PRE>
 &lt;field-name&gt;: &lt;value&gt;</PRE>
<P>The field name is case insensitive. Text after the '#' character on a
line is ignored during parsing. This is used for comments. The
following &lt;field-names&gt; can be used:</P>
<DL>
<DT><STRONG><A NAME="item_User%2DAgent">User-Agent</A></STRONG><BR>
<DD>
The value of this field is the name of the robot the record is
describing access policy for. If more than one <EM>User-Agent</EM> field is
present, the record describes an identical access policy for more than
one robot. At least one field needs to be present per record. If the
value is '*', the record describes the default access policy for any
robot that has not matched any of the other records.
<P></P>
<DT><STRONG><A NAME="item_Disallow">Disallow</A></STRONG><BR>
<DD>
The value of this field specifies a partial URL that is not to be
visited. This can be a full path or a partial path; any URL that
starts with this value will not be retrieved.
<P></P></DL>
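<P>To illustrate these semantics, the following sketch (robot name and
host are placeholders) parses a single record and shows that <EM>Disallow</EM>
matches by prefix:</P>
<PRE>
 use WWW::RobotRules;

 my $rules = WWW::RobotRules->new('ExampleBot/0.1');
 my $txt   = "User-Agent: *\nDisallow: /private/\n";
 $rules->parse('http://www.example.com/robots.txt', $txt);

 # Paths starting with /private/ are blocked ...
 print $rules->allowed('http://www.example.com/private/data.html')
     ? "allowed\n" : "disallowed\n";    # prints "disallowed"

 # ... while other paths on the same host remain allowed
 print $rules->allowed('http://www.example.com/public/index.html')
     ? "allowed\n" : "disallowed\n";    # prints "allowed"</PRE>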
<P>
<HR>
<H1><A NAME="robots.txt examples">ROBOTS.TXT EXAMPLES</A></H1>
<P>The following example ``/robots.txt'' file specifies that no robots
should visit any URL starting with ``/cyberworld/map/'' or ``/tmp/'':</P>
<PRE>
 User-agent: *
 Disallow: /cyberworld/map/ # This is an infinite virtual URL space
 Disallow: /tmp/ # these will soon disappear</PRE>
<P>This example ``/robots.txt'' file specifies that no robots should visit
any URL starting with ``/cyberworld/map/'', except the robot called
``cybermapper'':</P>
<PRE>
 User-agent: *
 Disallow: /cyberworld/map/ # This is an infinite virtual URL space</PRE>
<PRE>
 # Cybermapper knows where to go.
 User-agent: cybermapper
 Disallow:</PRE>
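<P>With both records above parsed, the specific ``cybermapper'' record
applies to that robot while the ``*'' record applies to everyone else.
A short sketch (host name is a placeholder):</P>
<PRE>
 use WWW::RobotRules;

 my $txt = "User-agent: *\n" .
           "Disallow: /cyberworld/map/\n" .
           "\n" .
           "# Cybermapper knows where to go.\n" .
           "User-agent: cybermapper\n" .
           "Disallow:\n";

 # The empty Disallow in its own record lets cybermapper go anywhere
 my $mapper = WWW::RobotRules->new('cybermapper/1.0');
 $mapper->parse('http://some.host/robots.txt', $txt);
 print $mapper->allowed('http://some.host/cyberworld/map/area.html')
     ? "allowed\n" : "disallowed\n";    # prints "allowed"

 # Any other robot falls under the '*' record and is blocked there
 my $other = WWW::RobotRules->new('OtherBot/2.0');
 $other->parse('http://some.host/robots.txt', $txt);
 print $other->allowed('http://some.host/cyberworld/map/area.html')
     ? "allowed\n" : "disallowed\n";    # prints "disallowed"</PRE>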
<P>This example indicates that no robots should visit this site further:</P>
<PRE>
 # go away
 User-agent: *
 Disallow: /</PRE>
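<P>Fed to the parser, this ``go away'' file blocks every URL on the host.
A quick check (host name is a placeholder):</P>
<PRE>
 use WWW::RobotRules;

 my $rules = WWW::RobotRules->new('AnyBot/1.0');
 $rules->parse('http://some.host/robots.txt',
               "# go away\nUser-agent: *\nDisallow: /\n");

 # Every path, including the root, is disallowed
 print $rules->allowed('http://some.host/')
     ? "allowed\n" : "disallowed\n";    # prints "disallowed"</PRE>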
<P>
<HR>
<H1><A NAME="see also">SEE ALSO</A></H1>
<P><A HREF="../../../site/lib/LWP/RobotUA.html">the LWP::RobotUA manpage</A>, <A HREF="../../../site/lib/WWW/RobotRules/AnyDBM_File.html">the WWW::RobotRules::AnyDBM_File manpage</A></P>
<TABLE BORDER=0 CELLPADDING=0 CELLSPACING=0 WIDTH=100%>
<TR><TD CLASS=block VALIGN=MIDDLE WIDTH=100% BGCOLOR="#cccccc">
<STRONG><P CLASS=block> WWW::RobotRules - Parse robots.txt files</P></STRONG>
</TD></TR>
</TABLE>

</BODY>

</HTML>
-