-
- <HTML>
- <HEAD>
- <TITLE>WWW::Search - Virtual base class for WWW searches</TITLE>
- <LINK REL="stylesheet" HREF="../../../Active.css" TYPE="text/css">
- <LINK REV="made" HREF="mailto:">
- </HEAD>
-
- <BODY>
- <TABLE BORDER=0 CELLPADDING=0 CELLSPACING=0 WIDTH=100%>
- <TR><TD CLASS=block VALIGN=MIDDLE WIDTH=100% BGCOLOR="#cccccc">
- <STRONG><P CLASS=block> WWW::Search - Virtual base class for WWW searches</P></STRONG>
- </TD></TR>
- </TABLE>
-
- <A NAME="__index__"></A>
- <!-- INDEX BEGIN -->
-
- <UL>
-
- <LI><A HREF="#name">NAME</A></LI><LI><A HREF="#supportedplatforms">SUPPORTED PLATFORMS</A></LI>
-
- <LI><A HREF="#synopsis">SYNOPSIS</A></LI>
- <LI><A HREF="#description">DESCRIPTION</A></LI>
- <UL>
-
- <LI><A HREF="#sample program">Sample program</A></LI>
- </UL>
-
- <LI><A HREF="#see also">SEE ALSO</A></LI>
- <LI><A HREF="#methods and functions">METHODS AND FUNCTIONS</A></LI>
- <UL>
-
- <LI><A HREF="#new">new</A></LI>
- <LI><A HREF="#reset_search (private)">reset_search (PRIVATE)</A></LI>
- <LI><A HREF="#version">version</A></LI>
- <LI><A HREF="#maintainer">maintainer</A></LI>
- <LI><A HREF="#gui_query">gui_query</A></LI>
- <LI><A HREF="#native_query">native_query</A></LI>
- <LI><A HREF="#approximate_result_count">approximate_result_count</A></LI>
- <LI><A HREF="#results">results</A></LI>
- <LI><A HREF="#next_result">next_result</A></LI>
- <LI><A HREF="#response">response</A></LI>
- <LI><A HREF="#seek_result($offset)"><CODE>seek_result($offset)</CODE></A></LI>
- <LI><A HREF="#maximum_to_retrieve">maximum_to_retrieve</A></LI>
- <LI><A HREF="#timeout">timeout</A></LI>
- <LI><A HREF="#opaque">opaque</A></LI>
- <LI><A HREF="#escape_query">escape_query</A></LI>
- <LI><A HREF="#unescape_query">unescape_query</A></LI>
- <LI><A HREF="#strip_tags">strip_tags</A></LI>
- <LI><A HREF="#hash_to_cgi_string (private)">hash_to_cgi_string (PRIVATE)</A></LI>
- <LI><A HREF="#http_proxy">http_proxy</A></LI>
- <LI><A HREF="#user_agent($non_robot) (private)"><CODE>user_agent($NON_ROBOT)</CODE> (PRIVATE)</A></LI>
- <LI><A HREF="#http_request($method, $url)">http_request($method, $url)</A></LI>
- <LI><A HREF="#split_lines (private)">split_lines (PRIVATE)</A></LI>
- <LI><A HREF="#generic_option (private)">generic_option (PRIVATE)</A></LI>
- <LI><A HREF="#setup_search (private)">setup_search (PRIVATE)</A></LI>
- <LI><A HREF="#user_agent_delay (private)">user_agent_delay (PRIVATE)</A></LI>
- <LI><A HREF="#absurl (private)">absurl (PRIVATE)</A></LI>
- <LI><A HREF="#retrieve_some (private)">retrieve_some (PRIVATE)</A></LI>
- <LI><A HREF="#test_cases">test_cases</A></LI>
- </UL>
-
- <LI><A HREF="#implementing new backends">IMPLEMENTING NEW BACKENDS</A></LI>
- <LI><A HREF="#bugs and desired features">BUGS AND DESIRED FEATURES</A></LI>
- <LI><A HREF="#author">AUTHOR</A></LI>
- <LI><A HREF="#copyright">COPYRIGHT</A></LI>
- </UL>
- <!-- INDEX END -->
-
- <HR>
- <P>
- <H1><A NAME="name">NAME</A></H1>
- <P>WWW::Search - Virtual base class for WWW searches</P>
- <P>
- <HR>
- <H1><A NAME="supportedplatforms">SUPPORTED PLATFORMS</A></H1>
- <UL>
- <LI>Linux</LI>
- <LI>Solaris</LI>
- <LI>Windows</LI>
- </UL>
- <HR>
- <H1><A NAME="synopsis">SYNOPSIS</A></H1>
- <PRE>
- require WWW::Search;
- $search_engine = "AltaVista";
- $search = new WWW::Search($search_engine);</PRE>
- <P>
- <HR>
- <H1><A NAME="description">DESCRIPTION</A></H1>
- <P>This class is the parent for all access methods supported by the
- <CODE>WWW::Search</CODE> library. This library implements a Perl API
- to web-based search engines.</P>
- <P>See README for a list of search engines currently supported.</P>
- <P>Search results can be limited, and there is a pause between each
- request to avoid overloading either the client or the server.</P>
- <P>
- <H2><A NAME="sample program">Sample program</A></H2>
- <P>Using the library should be straightforward.
- Here is a sample program:</P>
- <PRE>
- use WWW::Search;
- my $query = 'search terms';   # plain-text query; placeholder value
- my $search = WWW::Search->new('AltaVista');
- $search->native_query(WWW::Search::escape_query($query));
- while (my $result = $search->next_result()) {
- print $result->url, "\n";
- }</PRE>
- <P>Results are objects of type <CODE>WWW::SearchResult</CODE>
- (see <A HREF="../../../site/lib/WWW/SearchResult.html">the WWW::SearchResult manpage</A> for details).
- Note that different backends support different result fields.
- All backends are required to support title and url.</P>
- <P>
- <HR>
- <H1><A NAME="see also">SEE ALSO</A></H1>
- <P>For specific search engines, see <A HREF="../../../WWW/Search/TheEngineName.html">the WWW::Search::TheEngineName manpage</A>
- (replacing TheEngineName with a particular search engine).</P>
- <P>For details about the results of a search,
- see <A HREF="../../../site/lib/WWW/SearchResult.html">the WWW::SearchResult manpage</A>.</P>
- <P>
- <HR>
- <H1><A NAME="methods and functions">METHODS AND FUNCTIONS</A></H1>
- <P>
- <H2><A NAME="new">new</A></H2>
- <P>To create a new WWW::Search, call</P>
- <PRE>
- $search = new WWW::Search('SearchEngineName');</PRE>
- <P>where SearchEngineName is replaced with a particular search engine.
- For example:</P>
- <PRE>
- $search = new WWW::Search('Google');</PRE>
- <P>If no search engine is specified, a default (currently 'AltaVista')
- is chosen for you.
- The next step is usually:</P>
- <PRE>
- $search->native_query('search-engine-specific+query+string');</PRE>
- <P>
- <H2><A NAME="reset_search (private)">reset_search (PRIVATE)</A></H2>
- <P>Resets internal data structures to start over with a new search.</P>
- <P>
- <H2><A NAME="version">version</A></H2>
- <P>Returns the value of the $VERSION variable of the backend engine, or
- $WWW::Search::VERSION if the backend does not contain $VERSION.</P>
- <P>
- <H2><A NAME="maintainer">maintainer</A></H2>
- <P>Returns the value of the $MAINTAINER variable of the backend engine,
- or $WWW::Search::MAINTAINER if the backend does not contain
- $MAINTAINER.</P>
- <P>
- <H2><A NAME="gui_query">gui_query</A></H2>
- <P>Specify a query to the current search object;
- the query will be performed with the engine's default options,
- as if it were typed by a user in a browser window.</P>
- <P>The query must be escaped; call <A HREF="../../../site/lib/WWW/Search.html#escape_query">escape_query in the WWW::Search manpage</A> to escape
- a plain query. See <CODE>native_query</CODE> below for more information.</P>
- <P>Currently, this feature is supported by only a few backends;
- consult the documentation for each backend to see if it is implemented.</P>
- <P>
- <H2><A NAME="native_query">native_query</A></H2>
- <P>Specify a query (and optional options) to the current search object.
- Any previous query and its cached results are discarded.
- The option values and the query must be escaped; call <A HREF="#escape_query"><CODE>escape_query</CODE></A>
- to escape a string.
- The search is not actually begun until <CODE>results</CODE> or
- <CODE>next_result</CODE> is called (lazy!), so <CODE>native_query</CODE> does not return anything.</P>
- <P>Example:</P>
- <PRE>
- $search->native_query('search-engine-specific+escaped+query+string',
- { option1 => 'able', option2 => 'baker' } );</PRE>
- <P>The hash of options following the query string is optional.
- The query string is backend-specific.
- There are two kinds of options:
- options specific to the backend,
- and generic options applicable to multiple backends.</P>
- <P>Generic options all begin with 'search_'.
- Currently a few are supported:</P>
- <DL>
- <DT><STRONG><A NAME="item_search_url">search_url</A></STRONG><BR>
- <DD>
- Specifies the base URL for the search engine.
- <P></P>
- <DT><STRONG><A NAME="item_search_debug">search_debug</A></STRONG><BR>
- <DD>
- Enables backend debugging. The default is 0 (no debugging).
- <P></P>
- <DT><STRONG><A NAME="item_search_parse_debug">search_parse_debug</A></STRONG><BR>
- <DD>
- Enables backend parser debugging. The default is 0 (no debugging).
- <P></P>
- <DT><STRONG><A NAME="item_search_method">search_method</A></STRONG><BR>
- <DD>
- Specifies the HTTP method (<CODE>GET</CODE> or <CODE>POST</CODE>) for HTTP-based queries.
- The default is <CODE>GET</CODE>.
- <P></P>
- <DT><STRONG><A NAME="item_search_to_file_FILE">search_to_file FILE</A></STRONG><BR>
- <DD>
- Causes the search results to be saved in a set of files
- prefixed by FILE.
- (Used internally by the test-suite, not intended for general use.)
- <P></P>
- <DT><STRONG><A NAME="item_search_from_file_FILE">search_from_file FILE</A></STRONG><BR>
- <DD>
- Reads a search from a set of files prefixed by FILE.
- (Used internally by the test-suite, not intended for general use.)
- <P></P></DL>
- <P>Some backends may not implement these generic options,
- but any which do implement them must provide these semantics.</P>
- <P>Backend-specific options are described
- in the documentation for each backend.
- In most cases the options and their values are packed together to create the query portion of
- the final URL.</P>
- <P>Details about how the search string and option hash are interpreted
- might be found in the search-engine-specific manual pages
- (WWW::Search::SearchEngineName).</P>
- <P>After <CODE>native_query</CODE>, the next step is usually:</P>
- <PRE>
- @results = $search->results();</PRE>
- <P>or</P>
- <PRE>
- while ($result = $search->next_result()) {
- # do_something;
- }</PRE>
- <P>
- <H2><A NAME="approximate_result_count">approximate_result_count</A></H2>
- <P>Some backends indicate how many hits they have found.
- Typically this is an approximate value.</P>
- <P>
- <H2><A NAME="results">results</A></H2>
- <P>Returns all the results of a query as an array
- of <CODE>WWW::SearchResult</CODE> objects.
- Example:</P>
- <PRE>
- @results = $search->results();
- foreach $result (@results) {
- print $result->url(), "\n";
- };</PRE>
- <P>On error, <CODE>results()</CODE> will return undef and set <CODE>response()</CODE>
- to the HTTP response code.</P>
- <P>
- <H2><A NAME="next_result">next_result</A></H2>
- <P>Call this method repeatedly to return each result of a query as a
- SearchResult object. Example:</P>
- <PRE>
- while ($result = $search->next_result()) {
- print $result->url(), "\n";
- };</PRE>
- <P>On error, <CODE>next_result()</CODE> will return undef and set <CODE>response()</CODE>
- to the HTTP response code.</P>
- <P>
- <H2><A NAME="response">response</A></H2>
- <P>Return the HTTP Response code for the last query
- (see <A HREF="../../../site/lib/HTTP/Response.html">the HTTP::Response manpage</A>).
- If the query returns <A HREF="../../../lib/Pod/perlfunc.html#item_undef"><CODE>undef</CODE></A>,
- errors could be reported like this:</P>
- <PRE>
- my($response) = $search->response();
- if ($response->is_success) {
- print "normal end of result list\n";
- } else {
- print "error: " . $response->as_string() . "\n";
- };</PRE>
- <P>Note: even if the backend does not involve the web
- it should return HTTP::Response-style codes.</P>
- <P>
- <H2><A NAME="seek_result($offset)"><CODE>seek_result($offset)</CODE></A></H2>
- <P>Set which result <CODE>next_result</CODE> should return
- (like <CODE>lseek</CODE> in Unix).
- Results are zero-indexed.</P>
- <P>The only guaranteed valid offset is 0,
- which will replay the results from the beginning.
- In particular, seeking past the end of the current cached
- results probably will not do what you might think it should.</P>
- <P>Results are cached, so this does not re-issue the query
- or cause IO (unless you go off the end of the results).
- To re-do the query, create a new search object.</P>
- <P>Example:</P>
- <PRE>
- $search->seek_result(0);</PRE>
- <P>
- <H2><A NAME="maximum_to_retrieve">maximum_to_retrieve</A></H2>
- <P>The maximum number of hits to return.
- Queries resulting in more than this many hits will return
- the first hits, up to this limit.
- Although this specifies a maximum limit,
- search engines might return fewer than this number.</P>
- <P>Defaults to 500.</P>
- <P>Example:</P>
- <PRE>
- $max = $search->maximum_to_retrieve(100);</PRE>
- <P>
- <H2><A NAME="timeout">timeout</A></H2>
- <P>The maximum length of time any portion of the query should take,
- in seconds.</P>
- <P>Defaults to 60.</P>
- <P>Example:</P>
- <PRE>
- $search->timeout(120);</PRE>
- <P>
- <H2><A NAME="opaque">opaque</A></H2>
- <P>This function provides an application a place to store
- one opaque data element (or many via a Perl reference).
- This facility is useful, for example, to
- maintain client-specific information in each active query
- when you have multiple concurrent queries.</P>
- <P>
- <H2><A NAME="escape_query">escape_query</A></H2>
- <P>Escape a query.
- Before queries are made special characters must be escaped
- so that a proper URL can be formed.
- This is like escaping a URL,
- but all non-alphanumeric characters are escaped
- and spaces are converted to ``+''s.</P>
- <P>Example:</P>
- <PRE>
- $escaped = WWW::Search::escape_query('+hi +mom');
- # $escaped is now "%2Bhi+%2Bmom"</PRE>
- <P>See also <CODE>unescape_query</CODE>.
- NOTE that this is not a method, it is a plain function.</P>
- <P>
- <H2><A NAME="unescape_query">unescape_query</A></H2>
- <P>Unescape a query.
- See <CODE>escape_query</CODE> for details.</P>
- <P>Example:</P>
- <PRE>
- $unescaped = WWW::Search::unescape_query('%22hi+mom%22');
- # $unescaped is now '"hi mom"'</PRE>
- <P>NOTE that this is not a method, it is a plain function.</P>
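- <P>The escaping rule described above can be sketched in plain Perl. The two functions below are hypothetical stand-ins written for illustration, not the library's own code, and they assume the query is a string of single-byte characters.</P>
```perl
use strict;
use warnings;

# Hypothetical stand-in for WWW::Search::escape_query (illustration only).
sub sketch_escape_query {
    my ($text) = @_;
    # Percent-encode every character that is not an ASCII letter, digit, or space.
    $text =~ s/([^A-Za-z0-9 ])/sprintf('%%%02X', ord($1))/ge;
    # Spaces become plus signs.
    $text =~ tr/ /+/;
    return $text;
}

# Hypothetical stand-in for WWW::Search::unescape_query (illustration only).
sub sketch_unescape_query {
    my ($text) = @_;
    # Undo the two steps in reverse order: plus signs first, then %XX sequences.
    $text =~ tr/+/ /;
    $text =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $text;
}

print sketch_escape_query('+hi +mom'), "\n";        # %2Bhi+%2Bmom
print sketch_unescape_query('%22hi+mom%22'), "\n";  # "hi mom"
```
- <P>Unescaping converts plus signs before decoding percent sequences, so an escaped literal plus (<CODE>%2B</CODE>) survives the round trip.</P>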
- <P>
- <H2><A NAME="strip_tags">strip_tags</A></H2>
- <P>Given a string, returns a copy of that string with HTML tags removed.
- This should be used by each backend as they insert the title and
- description values into the SearchResults.</P>
- <P>
- <H2><A NAME="hash_to_cgi_string (private)">hash_to_cgi_string (PRIVATE)</A></H2>
- <P>Given a reference to a hash of string => string, constructs a CGI
- parameter string that looks like 'key1=value1&key2=value2'.</P>
- <P>Backends should use this function rather than piecing the URL together
- by hand, to ensure that URLs are identical across platforms and
- software versions.</P>
- <P>Example:</P>
- <PRE>
- $self->{_options} = {
- 'opt3' => 'val3',
- 'search_url' => '<A HREF="http://www.deja.com/dnquery.xp">http://www.deja.com/dnquery.xp</A>',
- 'opt1' => 'val1',
- 'QRY' => $native_query,
- 'opt2' => 'val2',
- };
- $self->{_next_url} = $self->{_options}{'search_url'} .'?'.
- $self->hash_to_cgi_string($self->{_options});</PRE>
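- <P>One way such a function could produce identical URLs everywhere is to emit the keys in sorted order. The sketch below is a hypothetical stand-alone version, not the library's implementation; sorting the keys, skipping generic <CODE>search_</CODE> options, and the placeholder URL are all assumptions made for illustration.</P>
```perl
use strict;
use warnings;

# Hypothetical sketch of a deterministic hash-to-CGI-string conversion.
sub sketch_hash_to_cgi_string {
    my ($options) = @_;
    my @pairs;
    # Sorting the keys (an assumption) yields the same string on every run,
    # independent of Perl's internal hash ordering.
    for my $key (sort keys %$options) {
        next if $key =~ /^search_/;            # assumed: generic options are not CGI parameters
        next unless defined $options->{$key};  # assumed: undefined values are dropped
        push @pairs, "$key=$options->{$key}";
    }
    return join '&', @pairs;
}

my %options = (
    opt3       => 'val3',
    search_url => 'http://www.example.com/query',  # placeholder URL
    opt1       => 'val1',
    QRY        => 'perl+search',                    # already-escaped query
    opt2       => 'val2',
);
my $url = $options{search_url} . '?' . sketch_hash_to_cgi_string(\%options);
print "$url\n";  # http://www.example.com/query?QRY=perl+search&opt1=val1&opt2=val2&opt3=val3
```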
- <P>
- <H2><A NAME="http_proxy">http_proxy</A></H2>
- <P>Set up an HTTP proxy
- (for connections from behind a firewall).</P>
- <P>This routine should be called before the first retrieval is attempted.</P>
- <P>Example:</P>
- <PRE>
- $search->http_proxy("http://gateway:8080");</PRE>
- <P>
- <H2><A NAME="user_agent($non_robot) (private)"><CODE>user_agent($NON_ROBOT)</CODE> (PRIVATE)</A></H2>
- <P>This internal routine creates a user-agent
- for derived classes that query the web.
- If <CODE>$NON_ROBOT</CODE> is true, a normal user-agent (rather than a robot-style user-agent)
- is used.</P>
- <P>Backends should use robot-style user-agents wherever possible.
- Also, backends should call <CODE>user_agent_delay</CODE> before every page retrieval
- to avoid swamping search engines.</P>
- <P>
- <H2><A NAME="http_request($method, $url)">http_request($method, $url)</A></H2>
- <P>Return the response from an http request,
- handling debugging. Requires that user_agent already be set up.
- For POST methods, query is split off of the URL and passed
- in the request body.</P>
- <P>
- <H2><A NAME="split_lines (private)">split_lines (PRIVATE)</A></H2>
- <P>This internal routine splits data (typically the result of the web
- page retrieval) into lines in a way that is OS independent.</P>
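- <P>A minimal sketch of OS-independent line splitting, assuming the data uses either LF or CRLF line endings. The function name is hypothetical and this is not necessarily how the library does it.</P>
```perl
use strict;
use warnings;

# Hypothetical sketch: treat both LF and CRLF as line terminators.
sub sketch_split_lines {
    my ($text) = @_;
    # \015 is carriage return, \012 is line feed; the CR is optional.
    return split /\015?\012/, $text;
}

my @lines = sketch_split_lines("first\015\012second\012third");
print scalar(@lines), " lines\n";   # 3 lines
```
- <P>Using the octal escapes rather than <CODE>\r</CODE> and <CODE>\n</CODE> keeps the pattern independent of the platform's own newline convention. Data that uses a bare carriage return as its line ending would not be split by this pattern.</P>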
- <P>
- <H2><A NAME="generic_option (private)">generic_option (PRIVATE)</A></H2>
- <P>This internal routine checks if an option
- is generic or backend specific.
- Currently all generic options begin with 'search_'.
- This routine is not a method.</P>
- <P>
- <H2><A NAME="setup_search (private)">setup_search (PRIVATE)</A></H2>
- <P>This internal routine does generic Search setup.
- It calls <CODE>native_setup_search</CODE> to do backend specific setup.</P>
- <P>
- <H2><A NAME="user_agent_delay (private)">user_agent_delay (PRIVATE)</A></H2>
- <P>Derived classes should call this between requests to remote
- servers to avoid overloading them with many, fast back-to-back requests.</P>
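- <P>The idea of pausing between requests can be sketched as follows. The helper name and the 0.25 second gap are assumptions chosen for illustration; they do not describe the library's actual delay.</P>
```perl
use strict;
use warnings;
use Time::HiRes qw(sleep time);

# Hypothetical helper: returns a closure that sleeps as needed so that
# successive calls are at least $min_gap seconds apart.
sub make_delay {
    my ($min_gap) = @_;
    my $last;   # time of the previous call, undefined before the first one
    return sub {
        if (defined $last) {
            my $wait = $min_gap - (time() - $last);
            sleep($wait) if $wait > 0;
        }
        $last = time();
        return;
    };
}

my $pause = make_delay(0.25);   # arbitrary illustrative gap
for my $page (1 .. 3) {
    $pause->();                 # the first call does not sleep
    print "fetching page $page\n";
}
```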
- <P>
- <H2><A NAME="absurl (private)">absurl (PRIVATE)</A></H2>
- <P>An internal routine to convert a relative URL into an absolute URL. It
- takes two arguments, the 'base' url (usually the search engine CGI
- URL) and the URL to be converted. Returns a URI::URL object.</P>
- <P>
- <H2><A NAME="retrieve_some (private)">retrieve_some (PRIVATE)</A></H2>
- <P>An internal routine to interface with <CODE>native_retrieve_some</CODE>.
- Checks for overflow.</P>
- <P>
- <H2><A NAME="test_cases">test_cases</A></H2>
- <P>Returns the value of the $TEST_CASES variable of the backend engine.
- All backends should set $TEST_CASES to a string containing Perl code
- which will be eval-ed during 'make test'.
- See Excite.pm for an example.</P>
- <P>
- <HR>
- <H1><A NAME="implementing new backends">IMPLEMENTING NEW BACKENDS</A></H1>
- <P><CODE>WWW::Search</CODE> uses a separate backend for each search engine.
- Each backend is implemented as a subclass of <CODE>WWW::Search</CODE>.
- <A HREF="../../../site/lib/WWW/Search/AltaVista.html">the WWW::Search::AltaVista manpage</A> provides a good sample backend.</P>
- <P>A backend must have the two routines
- <CODE>native_retrieve_some</CODE> and <CODE>native_setup_search</CODE>.</P>
- <P><CODE>native_retrieve_some</CODE> is the core of a backend.
- It will be called periodically to fetch URLs.
- It should retrieve several hits from the search service
- and add them to the cache. It should return the number
- of hits found, or undef when there are no more hits.</P>
- <P>Internally, <CODE>native_retrieve_some</CODE> typically sends an HTTP request to
- the search service, parses the HTML, extracts the links and
- descriptions, then saves the URL for the next page of results. See the
- code for the AltaVista implementation for an example.</P>
- <P><CODE>native_setup_search</CODE> is invoked before the search.
- It is passed a single argument: the escaped, native version
- of the query.</P>
- <P>The front- and backends share a single object (a hash).
- The backend can change any hash element beginning with underscore,
- and <CODE>{response}</CODE> (an <CODE>HTTP::Response</CODE> code) and <CODE>{cache}</CODE>
- (the array of <CODE>WWW::SearchResult</CODE> objects caching all results).
- Again, look at one of the existing web search backends as an example.</P>
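- <P>The contract between the two routines can be illustrated with a toy that runs without the library. Everything below is hypothetical: the <CODE>ToyBackend</CODE> package, its canned pages, and the driver loop standing in for the front end. A real backend would subclass <CODE>WWW::Search</CODE>, fetch pages over HTTP, and cache <CODE>WWW::SearchResult</CODE> objects rather than plain URL strings.</P>
```perl
use strict;
use warnings;

# Toy stand-in for a backend (illustration only, not a WWW::Search subclass).
package ToyBackend;

# Canned "result pages" keyed by a pretend URL, instead of real HTTP fetches.
my %PAGES = (
    'page1' => { hits => ['http://example.com/a', 'http://example.com/b'], next => 'page2' },
    'page2' => { hits => ['http://example.com/c'], next => undef },
);

sub new { return bless { cache => [] }, shift }

# Invoked once before the search with the escaped native query.
sub native_setup_search {
    my ($self, $native_query) = @_;
    $self->{_query}    = $native_query;   # underscore keys belong to the backend
    $self->{_next_url} = 'page1';
    return;
}

# Fetch one page of hits, append them to the cache, and remember the next page.
# Returns the number of hits found, or undef when there are no more.
sub native_retrieve_some {
    my ($self) = @_;
    my $url = $self->{_next_url};
    return undef unless defined $url;
    my $page = $PAGES{$url};
    push @{ $self->{cache} }, @{ $page->{hits} };
    $self->{_next_url} = $page->{next};
    return scalar @{ $page->{hits} };
}

package main;

my $backend = ToyBackend->new;
$backend->native_setup_search('perl+search');
# Stand-in for the front end: keep asking until the backend returns undef.
1 while defined $backend->native_retrieve_some;
print "$_\n" for @{ $backend->{cache} };
```
- <P>The toy prints the three cached URLs in page order. Keeping the next page's location in an underscore-prefixed element mirrors the rule that such elements are the backend's to change.</P>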
- <P>If you implement a new backend, please let the authors know.</P>
- <P>
- <HR>
- <H1><A NAME="bugs and desired features">BUGS AND DESIRED FEATURES</A></H1>
- <P>The bugs are there for you to find (some people call them Easter Eggs).</P>
- <P>Desired features:</P>
- <DL>
- <DT><STRONG><A NAME="item_A_portable_query_language%2E">A portable query language.</A></STRONG><BR>
- <DD>
- A portable language would allow you to move queries easily
- between different search engines. A query abstraction is non-trivial
- and unfortunately will not be done anytime soon by the current
- maintainers. If you want to take a shot at it, please let me know.
- <P></P></DL>
- <P>
- <HR>
- <H1><A NAME="author">AUTHOR</A></H1>
- <P><CODE>WWW::Search</CODE> was written by John Heidemann, &lt;<A HREF="mailto:johnh@isi.edu">johnh@isi.edu</A>&gt;.
- <CODE>WWW::Search</CODE> is currently maintained by Martin Thurn, &lt;<A HREF="mailto:MartinThurn@iname.com">MartinThurn@iname.com</A>&gt;.</P>
- <P>Backends and applications for WWW::Search were originally written by
- John Heidemann,
- Wm. L. Scheding,
- Cesare Feroldi de Rosa,
- and
- Glen Pringle.</P>
- <P>
- <HR>
- <H1><A NAME="copyright">COPYRIGHT</A></H1>
- <P>Copyright (c) 1996 University of Southern California.
- All rights reserved.
- </P>
- <PRE>
-
- Redistribution and use in source and binary forms are permitted
- provided that the above copyright notice and this paragraph are
- duplicated in all such forms and that any documentation, advertising
- materials, and other materials related to such distribution and use
- acknowledge that the software was developed by the University of
- Southern California, Information Sciences Institute. The name of the
- University may not be used to endorse or promote products derived from
- this software without specific prior written permission.</PRE>
- <P>THIS SOFTWARE IS PROVIDED ``AS IS'' AND WITHOUT ANY EXPRESS OR IMPLIED
- WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF
- MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE.</P>
-
- </BODY>
-
- </HTML>
-