- <TITLE>sgmllib -- Python library reference</TITLE>
- <H1>10.8. Standard Module <CODE>sgmllib</CODE></H1>
- This module defines a class <CODE>SGMLParser</CODE> which serves as the
- basis for parsing text files formatted in SGML (Standard Generalized
- Mark-up Language). In fact, it does not provide a full SGML parser
- --- it only parses SGML insofar as it is used by HTML, and the module
- only exists as a base for the <CODE>htmllib</CODE> module.
- In particular, the parser is hardcoded to recognize the following
- constructs:
- <P>
- <UL>
- <LI>• Opening and closing tags of the form
- ``<CODE><<VAR>tag</VAR> <VAR>attr</VAR>="<VAR>value</VAR>" ...></CODE>'' and
- ``<CODE></<VAR>tag</VAR>></CODE>'', respectively.
- <P>
- <LI>• Numeric character references of the form ``<CODE>&#<VAR>name</VAR>;</CODE>''.
- <P>
- <LI>• Entity references of the form ``<CODE>&<VAR>name</VAR>;</CODE>''.
- <P>
- <LI>• SGML comments of the form ``<CODE><!--<VAR>text</VAR>--></CODE>''. Note that
- spaces, tabs, and newlines are allowed between the trailing
- ``<CODE>></CODE>'' and the immediately preceding ``<CODE>--</CODE>''.
- <P>
- </UL>
- The <CODE>SGMLParser</CODE> class must be instantiated without arguments.
- It has the following interface methods:
- <P>
- <DL><DT><B>reset</B> () -- Method on SGMLParser<DD>
- Reset the instance. Loses all unprocessed data. This is called
- implicitly at instantiation time.
- </DL>
- <DL><DT><B>setnomoretags</B> () -- Method on SGMLParser<DD>
- Stop processing tags. Treat all following input as literal input
- (CDATA). (This is only provided so the HTML tag <CODE><PLAINTEXT></CODE>
- can be implemented.)
- </DL>
- <DL><DT><B>setliteral</B> () -- Method on SGMLParser<DD>
- Enter literal mode (CDATA mode).
- </DL>
- <DL><DT><B>feed</B> (<VAR>data</VAR>) -- Method on SGMLParser<DD>
- Feed some text to the parser. It is processed insofar as it consists
- of complete elements; incomplete data is buffered until more data is
- fed or <CODE>close()</CODE> is called.
- </DL>
- <DL><DT><B>close</B> () -- Method on SGMLParser<DD>
- Force processing of all buffered data as if it were followed by an
- end-of-file mark. This method may be redefined by a derived class to
- define additional processing at the end of the input, but the
- redefined version should always call <CODE>SGMLParser.close()</CODE>.
- </DL>
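- The buffering behaviour of <CODE>feed()</CODE> and <CODE>close()</CODE> can be
- pictured with a small standalone sketch. The class name
- <CODE>ChunkBuffer</CODE>, its <CODE>events</CODE> list, and the crude rule of
- splitting on ``<CODE><</CODE>'' and ``<CODE>></CODE>'' are assumptions made for
- illustration only; this is not the <CODE>sgmllib</CODE> source.

```python
class ChunkBuffer:
    """Toy model of feed()/close() buffering (hypothetical, not sgmllib)."""

    def __init__(self):
        self.rawdata = ''   # text received but not yet processed
        self.events = []    # ('data', text) or ('tag', text) tuples

    def feed(self, data):
        self.rawdata += data
        self._goahead(end=False)

    def close(self):
        # Treat whatever is still buffered as if end-of-file followed it.
        self._goahead(end=True)

    def _goahead(self, end):
        while self.rawdata:
            lt = self.rawdata.find('<')
            if lt == -1:
                # No markup at all: everything is plain data.
                self.events.append(('data', self.rawdata))
                self.rawdata = ''
                break
            if lt > 0:
                self.events.append(('data', self.rawdata[:lt]))
                self.rawdata = self.rawdata[lt:]
            gt = self.rawdata.find('>')
            if gt == -1:
                # Incomplete tag: keep it buffered unless input has ended.
                if end:
                    self.events.append(('data', self.rawdata))
                    self.rawdata = ''
                break
            self.events.append(('tag', self.rawdata[:gt + 1]))
            self.rawdata = self.rawdata[gt + 1:]
```

- Feeding <CODE>'ab<i'</CODE> reports only the data before the unfinished tag;
- the tag itself is reported once a later <CODE>feed()</CODE> supplies the
- closing bracket.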
- <DL><DT><B>handle_starttag</B> (<VAR>tag</VAR>, <VAR>method</VAR>, <VAR>attributes</VAR>) -- Method on SGMLParser<DD>
- This method is called to handle start tags for which either a
- <CODE>start_<VAR>tag</VAR>()</CODE> or <CODE>do_<VAR>tag</VAR>()</CODE> method has been
- defined. The <CODE>tag</CODE> argument is the name of the tag converted to
- lower case, and the <CODE>method</CODE> argument is the bound method which
- should be used to support semantic interpretation of the start tag.
- The <VAR>attributes</VAR> argument is a list of (<VAR>name</VAR>, <VAR>value</VAR>)
- pairs containing the attributes found inside the tag's <CODE><></CODE>
- brackets. The <VAR>name</VAR> has been translated to lower case and double
- quotes and backslashes in the <VAR>value</VAR> have been interpreted. For
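- instance, for the tag <CODE><A HREF="http://www.cwi.nl/"></CODE>, this
- method would be called as <CODE>handle_starttag('a', <VAR>method</VAR>, [('href',
- 'http://www.cwi.nl/')])</CODE>, where <VAR>method</VAR> is the bound
- <CODE>start_a()</CODE> or <CODE>do_a()</CODE> method. The base implementation simply calls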
- <CODE>method</CODE> with <CODE>attributes</CODE> as the only argument.
- </DL>
- <DL><DT><B>handle_endtag</B> (<VAR>tag</VAR>, <VAR>method</VAR>) -- Method on SGMLParser<DD>
- This method is called to handle end tags for which an
- <CODE>end_<VAR>tag</VAR>()</CODE> method has been defined. The <CODE>tag</CODE>
- argument is the name of the tag converted to lower case, and the
- <CODE>method</CODE> argument is the bound method which should be used to
- support semantic interpretation of the end tag. If no
- <CODE>end_<VAR>tag</VAR>()</CODE> method is defined for the closing element, this
- handler is not called. The base implementation simply calls
- <CODE>method</CODE>.
- </DL>
- <DL><DT><B>handle_data</B> (<VAR>data</VAR>) -- Method on SGMLParser<DD>
- This method is called to process arbitrary data. It is intended to be
- overridden by a derived class; the base class implementation does
- nothing.
- </DL>
- <DL><DT><B>handle_charref</B> (<VAR>ref</VAR>) -- Method on SGMLParser<DD>
- This method is called to process a character reference of the form
- ``<CODE>&#<VAR>ref</VAR>;</CODE>''. In the base implementation, <VAR>ref</VAR> must
- be a decimal number in the range 0-255. The method converts the number
- to the corresponding character and calls <CODE>handle_data()</CODE> with
- that character as argument. If <VAR>ref</VAR> is invalid or out of range,
- the method <CODE>unknown_charref(<VAR>ref</VAR>)</CODE> is called to handle the
- error. A subclass must override this method to provide support for
- named character entities.
- </DL>
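- The rule described for <CODE>handle_charref()</CODE> can be sketched as a
- standalone function. Passing <CODE>handle_data</CODE> and
- <CODE>unknown_charref</CODE> as callables is an assumption made so the
- example runs without the parser class; in the module they are methods
- of the instance.

```python
def handle_charref(ref, handle_data, unknown_charref):
    """Sketch of the documented rule: decimal, range 0-255, else unknown."""
    try:
        n = int(ref)
    except ValueError:
        # Not a decimal number at all, e.g. 'x41'.
        unknown_charref(ref)
        return
    if not 0 <= n <= 255:
        unknown_charref(ref)
        return
    handle_data(chr(n))


data, unknown = [], []
handle_charref('65', data.append, unknown.append)    # valid reference
handle_charref('256', data.append, unknown.append)   # out of range
handle_charref('x41', data.append, unknown.append)   # not decimal
```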
- <DL><DT><B>handle_entityref</B> (<VAR>ref</VAR>) -- Method on SGMLParser<DD>
- This method is called to process a general entity reference of the form
- ``<CODE>&<VAR>ref</VAR>;</CODE>'' where <VAR>ref</VAR> is the name of a general
- entity. It looks for <VAR>ref</VAR> in the instance (or class)
- variable <CODE>entitydefs</CODE> which should be a mapping from entity names
- to corresponding translations.
- If a translation is found, it calls the method <CODE>handle_data()</CODE>
- with the translation; otherwise, it calls the method
- <CODE>unknown_entityref(<VAR>ref</VAR>)</CODE>. The default <CODE>entitydefs</CODE>
- defines translations for <CODE>&amp;</CODE>, <CODE>&apos;</CODE>, <CODE>&gt;</CODE>,
- <CODE>&lt;</CODE>, and <CODE>&quot;</CODE>.
- </DL>
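- The lookup performed by <CODE>handle_entityref()</CODE> amounts to a
- dictionary access with a fallback. The sketch below is a standalone
- function rather than the module's method; the name
- <CODE>ENTITYDEFS</CODE> and the callable parameters are assumptions, and the
- mapping simply mirrors the five default translations listed above.

```python
# Assumed stand-in for the default entitydefs mapping.
ENTITYDEFS = {'lt': '<', 'gt': '>', 'amp': '&', 'quot': '"', 'apos': "'"}


def handle_entityref(ref, handle_data, unknown_entityref,
                     entitydefs=ENTITYDEFS):
    """Pass the translation of ref to handle_data, or report it unknown."""
    if ref in entitydefs:
        handle_data(entitydefs[ref])
    else:
        unknown_entityref(ref)


data, unknown = [], []
handle_entityref('amp', data.append, unknown.append)
handle_entityref('nbsp', data.append, unknown.append)  # not in the defaults
```

- A subclass that wants more entities would supply a larger mapping; here
- that corresponds to passing a different <CODE>entitydefs</CODE> argument.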
- <DL><DT><B>handle_comment</B> (<VAR>comment</VAR>) -- Method on SGMLParser<DD>
- This method is called when a comment is encountered. The
- <CODE>comment</CODE> argument is a string containing the text between the
- ``<CODE><!--</CODE>'' and ``<CODE>--></CODE>'' delimiters, but not the delimiters
- themselves. For example, the comment ``<CODE><!--text--></CODE>'' will
- cause this method to be called with the argument <CODE>'text'</CODE>. The
- default method does nothing.
- </DL>
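- As a rough illustration of what reaches <CODE>handle_comment()</CODE>, the
- following sketch extracts comment text with a regular expression that
- permits spaces, tabs, and newlines between ``<CODE>--</CODE>'' and the
- closing ``<CODE>></CODE>''. The pattern and the helper name
- <CODE>comment_texts</CODE> are assumptions for this example, not the
- parser's own code.

```python
import re

# '<!--', the comment text, '--', optional whitespace, then '>'.
COMMENT = re.compile(r'<!--(.*?)--[ \t\n]*>', re.DOTALL)


def comment_texts(data):
    """Return the text of each comment in data, without the delimiters."""
    return COMMENT.findall(data)
```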
- <DL><DT><B>report_unbalanced</B> (<VAR>tag</VAR>) -- Method on SGMLParser<DD>
- This method is called when an end tag is found which does not
- correspond to any open element.
- </DL>
- <DL><DT><B>unknown_starttag</B> (<VAR>tag</VAR>, <VAR>attributes</VAR>) -- Method on SGMLParser<DD>
- This method is called to process an unknown start tag. It is intended
- to be overridden by a derived class; the base class implementation
- does nothing.
- </DL>
- <DL><DT><B>unknown_endtag</B> (<VAR>tag</VAR>) -- Method on SGMLParser<DD>
- This method is called to process an unknown end tag. It is intended
- to be overridden by a derived class; the base class implementation
- does nothing.
- </DL>
- <DL><DT><B>unknown_charref</B> (<VAR>ref</VAR>) -- Method on SGMLParser<DD>
- This method is called to process unresolvable numeric character
- references. It is intended to be overridden by a derived class; the
- base class implementation does nothing.
- </DL>
- <DL><DT><B>unknown_entityref</B> (<VAR>ref</VAR>) -- Method on SGMLParser<DD>
- This method is called to process an unknown entity reference. It is
- intended to be overridden by a derived class; the base class
- implementation does nothing.
- </DL>
- Apart from overriding or extending the methods listed above, derived
- classes may also define methods of the following form to define
- processing of specific tags. Tag names in the input stream are case
- independent; the <VAR>tag</VAR> occurring in method names must be in lower
- case:
- <P>
- <DL><DT><B>start_<VAR>tag</VAR></B> (<VAR>attributes</VAR>) -- Method on SGMLParser<DD>
- This method is called to process an opening tag <VAR>tag</VAR>. It takes
- precedence over <CODE>do_<VAR>tag</VAR>()</CODE>. The <VAR>attributes</VAR> argument
- has the same meaning as described for <CODE>handle_starttag()</CODE> above.
- </DL>
- <DL><DT><B>do_<VAR>tag</VAR></B> (<VAR>attributes</VAR>) -- Method on SGMLParser<DD>
- This method is called to process an opening tag <VAR>tag</VAR> that does
- not come with a matching closing tag. The <VAR>attributes</VAR> argument
- has the same meaning as described for <CODE>handle_starttag()</CODE> above.
- </DL>
- <DL><DT><B>end_<VAR>tag</VAR></B> () -- Method on SGMLParser<DD>
- This method is called to process a closing tag <VAR>tag</VAR>.
- </DL>
- Note that the parser maintains a stack of open elements for which no
- end tag has been found yet. Only tags processed by
- <CODE>start_<VAR>tag</VAR>()</CODE> are pushed on this stack. Definition of an
- <CODE>end_<VAR>tag</VAR>()</CODE> method is optional for these tags. For tags
- processed by <CODE>do_<VAR>tag</VAR>()</CODE> or by <CODE>unknown_starttag()</CODE>, no
- <CODE>end_<VAR>tag</VAR>()</CODE> method needs to be defined; if defined, it will not
- be used. If both <CODE>start_<VAR>tag</VAR>()</CODE> and <CODE>do_<VAR>tag</VAR>()</CODE>
- methods exist for a tag, the <CODE>start_<VAR>tag</VAR>()</CODE> method takes
- precedence.
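- The lookup and stack rules above can be modelled in a few lines. The
- class below is a simplified, standalone sketch: the names
- <CODE>TagDispatcher</CODE>, <CODE>finish_starttag</CODE>,
- <CODE>finish_endtag</CODE>, and the <CODE>log</CODE> list are assumptions for
- illustration, and every end tag without an open element is reported
- through <CODE>report_unbalanced()</CODE>.

```python
class TagDispatcher:
    """Toy model of start_/do_/end_ dispatch (not the sgmllib source)."""

    def __init__(self):
        self.stack = []  # open elements that were handled by start_<tag>()
        self.log = []

    def finish_starttag(self, tag, attributes):
        tag = tag.lower()  # tag names are case independent
        method = getattr(self, 'start_' + tag, None)
        if method is not None:
            self.stack.append(tag)  # only start_<tag>() tags are stacked
        else:
            method = getattr(self, 'do_' + tag, None)
        if method is None:
            self.unknown_starttag(tag, attributes)
        else:
            self.handle_starttag(tag, method, attributes)

    def finish_endtag(self, tag):
        tag = tag.lower()
        if tag not in self.stack:
            self.report_unbalanced(tag)
            return
        # Close the element and anything still open inside it.
        while self.stack:
            open_tag = self.stack.pop()
            method = getattr(self, 'end_' + open_tag, None)
            if method is not None:  # end_<tag>() is optional
                self.handle_endtag(open_tag, method)
            if open_tag == tag:
                break

    def handle_starttag(self, tag, method, attributes):
        method(attributes)

    def handle_endtag(self, tag, method):
        method()

    def unknown_starttag(self, tag, attributes):
        self.log.append(('unknown', tag))

    def report_unbalanced(self, tag):
        self.log.append(('unbalanced', tag))


class Demo(TagDispatcher):
    def start_b(self, attributes):
        self.log.append(('start_b', attributes))

    def do_b(self, attributes):  # never used: start_b takes precedence
        self.log.append(('do_b', attributes))

    def end_b(self):
        self.log.append('end_b')

    def do_br(self, attributes):
        self.log.append(('do_br', attributes))


demo = Demo()
demo.finish_starttag('B', [])
demo.finish_starttag('BR', [])
demo.finish_starttag('i', [])
demo.finish_endtag('b')
demo.finish_endtag('b')  # no open <b> is left at this point
```

- In this run only <CODE>b</CODE> is ever pushed on the stack; <CODE>br</CODE> is
- handled by <CODE>do_br()</CODE> and <CODE>i</CODE> falls through to
- <CODE>unknown_starttag()</CODE>, so neither is tracked as an open element.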
-