- <TITLE>urllib -- Python library reference</TITLE>
- <H1>10.2. Standard Module <CODE>urllib</CODE></H1>
- This module provides a high-level interface for fetching data across
- the World-Wide Web. In particular, the <CODE>urlopen</CODE> function is
- similar to the built-in function <CODE>open</CODE>, but accepts URLs
- (Uniform Resource Locators) instead of filenames. Some restrictions
- apply: it can only open URLs for reading, and no seek operations
- are available.
- <P>
- It defines the following public functions:
- <P>
- <DL><DT><B>urlopen</B> (<VAR>url</VAR>) -- function of module urllib<DD>
- Open a network object denoted by a URL for reading. If the URL does
- not have a scheme identifier, or if it has `<SAMP>file:</SAMP>' as its scheme
- identifier, this opens a local file; otherwise it opens a socket to a
- server somewhere on the network. If the connection cannot be made, or
- if the server returns an error code, the <CODE>IOError</CODE> exception is
- raised. If all went well, a file-like object is returned. This
- supports the following methods: <CODE>read()</CODE>, <CODE>readline()</CODE>,
- <CODE>readlines()</CODE>, <CODE>fileno()</CODE>, <CODE>close()</CODE> and <CODE>info()</CODE>.
- Except for the last one, these methods have the same interface as for
- file objects; see the section on File Objects earlier in this
- manual. (It's not a built-in file object, however, so it can't be
- used at those few places where a true built-in file object is
- required.)
- <P>
- The <CODE>info()</CODE> method returns an instance of the class
- <CODE>rfc822.Message</CODE> containing the headers received from the server,
- if the protocol uses such headers (currently the only supported
- protocol that uses this is HTTP). See the description of the
- <CODE>rfc822</CODE> module.
- </DL>
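<P>
As a rough illustration of the interface described above, the following sketch writes a temporary local file and reads it back through <CODE>urlopen()</CODE> using a <CODE>file:</CODE> URL, so no network is needed. The helper name <CODE>read_local_via_url</CODE> is hypothetical. The fallback import assumes a current Python 3 installation, where these functions are found in <CODE>urllib.request</CODE> rather than in <CODE>urllib</CODE> itself; <CODE>pathname2url</CODE> is not documented in this section and is used here only to build the URL.

```python
import os
import tempfile

try:
    # Module layout as described in this section.
    from urllib import urlopen, pathname2url
except ImportError:
    # Assumption: Python 3, where the same names live in urllib.request.
    from urllib.request import urlopen, pathname2url


def read_local_via_url(text):
    """Write text to a temporary file, then read it back through urlopen().

    Hypothetical helper, for illustration only.
    """
    fd, path = tempfile.mkstemp(suffix=".txt")
    try:
        with os.fdopen(fd, "wb") as out:
            out.write(text.encode("ascii"))
        # A URL with the file: scheme opens a local file.
        obj = urlopen("file:" + pathname2url(path))
        try:
            first_line = obj.readline()   # same interface as a file object
            rest = obj.read()
            headers = obj.info()          # header object; contents vary by protocol
        finally:
            obj.close()
        return first_line, rest, headers
    finally:
        os.remove(path)
```

Under Python 3 the data comes back as bytes rather than as a text string.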
- <DL><DT><B>urlretrieve</B> (<VAR>url</VAR>) -- function of module urllib<DD>
- Copy a network object denoted by a URL to a local file, if necessary.
- If the URL points to a local file, or a valid cached copy of the
- object exists, the object is not copied. Return a tuple (<VAR>filename</VAR>,
- <VAR>headers</VAR>) where <VAR>filename</VAR> is the local file name under which
- the object can be found, and <VAR>headers</VAR> is either <CODE>None</CODE> (for
- a local object) or whatever the <CODE>info()</CODE> method of the object
- returned by <CODE>urlopen()</CODE> returned (for a remote object, possibly
- cached). Exceptions are the same as for <CODE>urlopen()</CODE>.
- </DL>
- <DL><DT><B>urlcleanup</B> () -- function of module urllib<DD>
- Clear the cache that may have been built up by previous calls to
- <CODE>urlretrieve()</CODE>.
- </DL>
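<P>
The two functions above might be used together roughly as follows. This is a sketch under the same assumptions as before: the helper name <CODE>retrieve_local</CODE> is hypothetical, the Python 3 fallback import is an assumption about the installed version, and a temporary local file stands in for a remote object so that the example runs without a network.

```python
import os
import tempfile

try:
    # Module layout as described in this section.
    from urllib import urlretrieve, urlcleanup, pathname2url
except ImportError:
    # Assumption: Python 3, where the same names live in urllib.request.
    from urllib.request import urlretrieve, urlcleanup, pathname2url


def retrieve_local(data):
    """Store data in a temporary file and fetch it with urlretrieve().

    Hypothetical helper, for illustration only.
    """
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as out:
            out.write(data)
        # urlretrieve() returns a (filename, headers) tuple.
        filename, headers = urlretrieve("file:" + pathname2url(path))
        # For a local object nothing is copied; filename names the existing file.
        with open(filename, "rb") as found:
            fetched = found.read()
        return filename, fetched
    finally:
        urlcleanup()      # clear whatever urlretrieve() may have left behind
        os.remove(path)
```

The value of <CODE>headers</CODE> is deliberately ignored here, since what is returned for a local object may differ between versions.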
- <DL><DT><B>quote</B> (<VAR>string</VAR>[, <VAR>addsafe</VAR>]) -- function of module urllib<DD>
- Replace special characters in <VAR>string</VAR> using the <CODE>%xx</CODE> escape.
- Letters, digits, and the characters ``<CODE>_,.-</CODE>'' are never quoted.
- The optional <VAR>addsafe</VAR> parameter specifies additional characters
- that should not be quoted; its default value is <CODE>'/'</CODE>.
- <P>
- Example: <CODE>quote('/~connolly/')</CODE> yields <CODE>'/%7econnolly/'</CODE>.
- </DL>
- <DL><DT><B>unquote</B> (<VAR>string</VAR>) -- function of module urllib<DD>
- Replace `<SAMP>%xx</SAMP>' escapes by their single-character equivalent.
- <P>
- Example: <CODE>unquote('/%7Econnolly/')</CODE> yields <CODE>'/~connolly/'</CODE>.
- </DL>
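<P>
A short round trip through <CODE>quote()</CODE> and <CODE>unquote()</CODE> might look like the sketch below. The path used is made up for illustration, and the fallback import assumes Python 3, where both functions are found in <CODE>urllib.parse</CODE>. The second argument is passed by position because its name differs between versions, and the exact set of characters left unquoted (and the case of the hex digits) may also differ from what is described above.

```python
try:
    # Module layout as described in this section.
    from urllib import quote, unquote
except ImportError:
    # Assumption: Python 3, where the same names live in urllib.parse.
    from urllib.parse import quote, unquote

path = "/docs/my file.html"      # hypothetical path containing a space

# With the default safe characters, '/' is left alone and the space is escaped.
escaped = quote(path)

# Passing an empty string as the second argument also escapes '/'.
fully_escaped = quote(path, "")

# unquote() reverses either form.
restored = unquote(escaped)
restored_full = unquote(fully_escaped)
```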
- Restrictions:
- <P>
- <UL>
- <LI>Currently, only the following protocols are supported: HTTP (versions
- 0.9 and 1.0), Gopher (but not Gopher+), FTP, and local files.
- <LI>The caching feature of <CODE>urlretrieve()</CODE> has been disabled until I
- find the time to hack proper processing of Expiration time headers.
- <P>
- <LI>There should be a function to query whether a particular URL is in
- the cache.
- <P>
- <LI>For backward compatibility, if a URL appears to point to a local file
- but the file can't be opened, the URL is re-interpreted using the FTP
- protocol. This can sometimes cause confusing error messages.
- <P>
- <LI>The <CODE>urlopen()</CODE> and <CODE>urlretrieve()</CODE> functions can cause
- arbitrarily long delays while waiting for a network connection to be
- set up. This means that it is difficult to build an interactive
- web client using these functions without using threads.
- <P>
- <LI>The data returned by <CODE>urlopen()</CODE> or <CODE>urlretrieve()</CODE> is the
- raw data returned by the server. This may be binary data (e.g. an
- image), plain text or (for example) HTML. The HTTP protocol provides
- type information in the reply header, which can be inspected by
- looking at the <CODE>Content-type</CODE> header. For the Gopher protocol,
- type information is encoded in the URL; there is currently no easy way
- to extract it. If the returned data is HTML, you can use the module
- <CODE>htmllib</CODE> to parse it.
- <LI>Although the <CODE>urllib</CODE> module contains (undocumented) routines to
- parse and unparse URL strings, the recommended interface for URL
- manipulation is in module <CODE>urlparse</CODE>.
- </UL>
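<P>
The remark above about inspecting the <CODE>Content-type</CODE> header can be sketched as follows. The helper names <CODE>content_type_of</CODE> and <CODE>demo</CODE> are hypothetical. The demonstration uses a temporary local <CODE>.html</CODE> file and relies on the assumption that a current Python 3 installation synthesizes a few headers for local files; with the module as described here, headers are only available for HTTP, so the helper allows for their absence.

```python
import os
import tempfile

try:
    # Module layout as described in this section.
    from urllib import urlopen, pathname2url
except ImportError:
    # Assumption: Python 3, where the same names live in urllib.request.
    from urllib.request import urlopen, pathname2url


def content_type_of(url):
    """Return the Content-type header reported for url, or None if absent.

    Hypothetical helper, for illustration only.
    """
    obj = urlopen(url)
    try:
        headers = obj.info()
        if headers is None:
            return None          # protocol supplied no headers
        return headers.get("Content-type")
    finally:
        obj.close()


def demo():
    """Report the content type of a small temporary HTML file."""
    fd, path = tempfile.mkstemp(suffix=".html")
    try:
        with os.fdopen(fd, "wb") as out:
            out.write(b"<TITLE>test</TITLE>")
        return content_type_of("file:" + pathname2url(path))
    finally:
        os.remove(path)
```

A caller could use the returned value to decide whether the data should be treated as HTML, plain text, or binary.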
-