Tcl XML Parsers
This document describes the Tcl interface for two XML document parsers,
TclExpat and TclXML. Both of these parsers implement the same Tcl
script-level API.
TclExpat is a Tcl interface to
James Clark's expat XML parser.
expat itself is written in C, and TclExpat builds as a loadable extension.
TclXML is a pure-Tcl implementation of an XML parser. No extensions
to Tcl are necessary to use this parser.
The major difference between the two parsers is performance.
TclXML has some additional functionality, as described below.
Table Of Contents
- Current Versions
- Packages and Namespaces
- Parser Creation
- Summary
- Parser Command Methods
- Callback Command Return Codes
- Entity Expansion
- References
Current Versions
This document describes TclExpat 1.1 and TclXML 1.1.
Packages and Namespaces
The TclExpat extension defines the package expat.
The command:
package require expat
is used to make TclExpat available to a Tcl script.
TclXML defines the packages xml and sgml. An application normally only uses
the xml package.
The command:
package require xml
is used to make the TclXML parser available to a Tcl script. This automatically
includes the sgml package.
TclXML defines the namespaces xml and sgml. TclExpat defines no namespaces of
its own.
Parser Creation
Both parsers use a similar method to create and use parser objects.
Each parser package defines a parser creation command. The application
uses this command to create an instance of a parser. Both packages
allow multiple parsers to be created and used simultaneously.
Both packages can be used within the same Tcl interpreter simultaneously.
TclExpat's parser creation command is expat.
TclXML's parser creation command is xml::parser.
Both creation commands accept an optional argument: the name of the parser
instance to create. If a name is not supplied then a unique name will
be automatically generated. The name of the newly created parser instance
is returned. Configuration options may also be given on the command line.
The command prototypes are:
expat ?name? ?configuration options...?
xml::parser ?name? ?configuration options...?
Summary
| | TclExpat | TclXML |
| Package(s) | expat | xml, sgml |
| Namespace(s) | (none) | xml, sgml |
| Command Prototype | expat ?name? ?configuration options...? | xml::parser ?name? ?configuration options...? |
Parser Command Methods
Both parsers accept the same command methods, and their operation is (almost) identical.
Valid parser methods are:
parser cget option
-
Queries a configuration option of the parser.
See below for valid configuration options.
parser configure option value ?option value ...?
-
Sets/queries configuration options for the parser. Valid options include:
-final boolean
-
This option indicates whether the document data next presented to the
parse method is the final part of the document. A value of "0" indicates
that more data is expected. A value of "1" indicates that no more is expected.
The default value is "1".
If this option is set to "0" then the parser will not report
certain errors if the XML data is not well-formed upon end of input,
such as unclosed or unbalanced start or end tags. Instead some data
may be saved by the parser until the next call to the parse method,
thus delaying the reporting of some of the data.
If this option is set to "1" then documents which are not
well-formed upon end of input will generate an error.
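The following is a minimal sketch of how an application might feed a document to a parser in several pieces using the -final option as described above. The procedure name ParseInChunks and the stubParser command are hypothetical; stubParser merely records the method calls it receives so that the sketch can run without either parser package loaded. With a real parser, pass the instance name returned by the creation command instead.

```tcl
# Feed a document to a parser in several pieces.
# "parser" is the name of a parser instance command.
proc ParseInChunks {parser chunks} {
    set last [expr {[llength $chunks] - 1}]
    set i 0
    foreach chunk $chunks {
        # Only the last chunk is marked as final, so that errors such as
        # unclosed tags are reported at the true end of input.
        $parser configure -final [expr {$i == $last}]
        $parser parse $chunk
        incr i
    }
}

# Hypothetical stand-in for a real parser instance (an assumption made
# for illustration): it only records the calls it receives.
set calls {}
proc stubParser {method args} {
    global calls
    lappend calls [linsert $args 0 $method]
}

ParseInChunks stubParser [list {<test>first half, } {second half</test>}]
foreach call $calls {
    puts $call
}
```

Every chunk except the last is presented with -final set to "0"; the last is presented with -final set to "1".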
-baseurl URL
-
Used to resolve relative URL references (not currently used).
-reportempty boolean
-
Specifies whether the parser should include extra arguments to the
invocation of the -elementstartcommand and -elementendcommand
callback commands to indicate that the element used the empty element
syntax, such as <Empty/>.
If this option is set and an element does use the shorthand syntax
then the start and end callback commands have the arguments
"-empty 1" added.
Without this option it is not possible to distinguish between an
empty element and an element which has no content
(a subtle difference which probably few, if any, applications will care about).
Example:
$parser configure -reportempty 1
$parser configure -elementstartcommand HandleStart
$parser configure -elementendcommand HandleEnd
$parser parse {<test/>}
This would result in the following commands being invoked:
HandleStart test {} -empty 1
HandleEnd test -empty 1
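Because the extra arguments are only sometimes present, a handler needs to accept a variable number of arguments. The sketch below shows one possible way to write such a start-tag handler; the log variable and the message wording are illustrative assumptions. The handler is driven directly with the invocation shown above rather than by a real parser, so it runs in a plain Tcl interpreter.

```tcl
set log {}

# Start-tag handler; "args" soaks up the optional "-empty 1" pair.
proc HandleStart {name attlist args} {
    global log
    # Default to "not empty", then overlay whatever extra arguments arrived.
    # This copes both with the arguments being absent and with an explicit
    # "-empty 0", should a parser pass that.
    array set opts {-empty 0}
    array set opts $args
    if {$opts(-empty)} {
        lappend log "$name uses the empty element syntax"
    } else {
        lappend log "$name has separate start and end tags"
    }
}

# Simulated callbacks: first for <test/>, then for <test></test>.
HandleStart test {} -empty 1
HandleStart test {}
puts [join $log \n]
```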
-elementstartcommand script
-
Specifies a Tcl command to associate with the start tag of an element. The actual command
consists of this option followed by at least two arguments: the element type name and the
attribute list. In addition, if the -reportempty option
is set then the command may be invoked with the additional -empty argument to
indicate whether it is an empty element. See the description of the
-reportempty option for an example.
The attribute list is a Tcl list consisting of name/value pairs, suitable for passing to the
array set Tcl command.
Example:
$parser configure -elementstartcommand HandleStart
proc HandleStart {name attlist} {
    puts stderr "Element start ==> $name has attributes $attlist"
}
$parser parse {<test id="123"></test>}
This would result in the following command being invoked:
HandleStart test {id 123}
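Since the attribute list is suitable for array set, a handler can unpack it into a local array and look attributes up by name. The sketch below illustrates this; the ids variable and the default value "unclassified" are assumptions made for the example. The handler is called directly here in the same form the parser would use.

```tcl
set ids {}

proc HandleStart {name attlist args} {
    global ids
    # Supply a default first; attributes present in the document override it.
    array set attr {class unclassified}
    array set attr $attlist
    if {[info exists attr(id)]} {
        lappend ids $attr(id)
    }
    puts "$name: class=$attr(class)"
}

# Simulated callbacks for <test id="123"> and <test id="456" class="demo">.
HandleStart test {id 123}
HandleStart test {id 456 class demo}
puts "ids seen: $ids"
```

Setting defaults before applying the attribute list avoids an error when an optional attribute is read but was not present in the document.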
-elementendcommand script
-
Specifies a Tcl command to associate with the end tag of an element. The actual command
consists of this option followed by at least one argument: the element type name.
In addition, if the -reportempty option
is set then the command may be invoked with the additional -empty argument to
indicate whether it is an empty element. See the description of the
-reportempty option for an example.
Example:
$parser configure -elementendcommand HandleEnd
proc HandleEnd {name} {
    puts stderr "Element end ==> $name"
}
$parser parse {<test id="123"></test>}
This would result in the following command being invoked:
HandleEnd test
-characterdatacommand script
-
Specifies a Tcl command to associate with character data in the document, i.e. text.
The actual command consists of this option followed by one argument: the text.
It is not guaranteed that character data will be passed to the application in a
single call to this command. That is, the application should be prepared to receive
multiple invocations of this callback with no intervening callbacks from other
features. This is especially true when using TclExpat, since expat itself appears to
break character data on line boundaries. See the section
Entity Expansion for more information.
Example:
$parser configure -characterdatacommand HandleText
proc HandleText {data} {
    puts stderr "Character data ==> $data"
}
$parser parse {<test>this is a test document</test>}
This would result in the following command being invoked:
HandleText {this is a test document}
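Because character data may arrive in several pieces, an application that needs the complete text of an element can accumulate it in a buffer and act on it when the element ends. The following is a minimal sketch under that assumption; the buffer and texts variables are hypothetical, and the approach as written only suits elements that contain text and no child elements. The callbacks are invoked directly to imitate a parser that splits the text in two.

```tcl
set buffer {}
set texts {}

# A new element starts: discard any text collected so far.
proc HandleStart {name attlist args} {
    global buffer
    set buffer {}
}

# May be called several times in a row for one run of text.
proc HandleText {data} {
    global buffer
    append buffer $data
}

# The element is complete: record its name and accumulated text.
proc HandleEnd {name args} {
    global buffer texts
    lappend texts [list $name $buffer]
    set buffer {}
}

# Simulated callbacks for <test>this is a test document</test>,
# with the character data delivered in two pieces.
HandleStart test {}
HandleText "this is a "
HandleText "test document"
HandleEnd test
puts $texts
```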
-processinginstructioncommand script
-
Specifies a Tcl command to associate with processing instructions in the document.
The actual command consists of this option followed by two arguments: the PI target
and the PI data.
Example:
$parser configure -processinginstructioncommand HandlePI
proc HandlePI {target data} {
    puts stderr "Processing instruction ==> $target $data"
}
$parser parse {<test><?special this is a processing instruction?></test>}
This would result in the following command being invoked:
HandlePI special {this is a processing instruction}
-xmldeclcommand script
-
Specifies a Tcl command to associate with the XML declaration part of the document.
The actual command consists of this option followed by three arguments: the XML version,
the document encoding and the standalone declaration.
TclExpat does not implement this option.
Example:
$parser configure -xmldeclcommand HandleXMLDecl
proc HandleXMLDecl {version encoding standalone} {
    puts stderr "XML Declaration ==> version $version encoding $encoding standalone $standalone"
}
$parser parse {<?xml version="1.0"?>
<test></test>}
This would result in the following command being invoked:
HandleXMLDecl 1.0 {} {}
-doctypecommand script
-
Specifies a Tcl command to associate with the document type declaration part of the
document.
The actual command consists of this option followed by four arguments:
the document element type, the public identifier, the system identifier and
the internal DTD subset.
TclExpat does not implement this option.
Example:
$parser configure -doctypecommand HandleDocType
proc HandleDocType {docelement publicID systemID internalDTD} {
    puts stderr "Document Type Declaration ==> document element $docelement, internal DTD subset $internalDTD"
}
$parser parse {<?xml version="1.0"?>
<!DOCTYPE test [
<!ELEMENT test ANY>
]>
<test></test>}
This would result in the following command being invoked:
HandleDocType test {} {} {
<!ELEMENT test ANY>
}
-externalentityrefcommand script
-
Specifies a Tcl command to associate with an external entity reference.
The actual command consists of this option followed by one argument:
the URI for the external entity.
TclXML does not implement this option.
-entityreferencecommand script
-
Specifies a Tcl command to associate with a general entity reference.
The actual command consists of this option followed by one argument:
the entity name.
When an entity reference is encountered the parser will automatically
handle parameter entities and character entities. If the general
entity has an entry defined in the -entityvariable array variable
then its substitution text will automatically be handled. Otherwise, if this
option is set then the parser will perform the callback.
If this option is not set then the entity reference is left untouched
and passed to the application as character data.
If the application wishes to handle all general entity references,
including those to the pre-defined entities, then it should set this
option and supply an empty array variable for the
-entityvariable option.
Note that this option only applies to general entities occurring in
character data sections. It is up to the application to handle
entity references in attribute values. See Section 4.4 of the
XML specification [1].
See the section Entity Expansion for more
information.
TclExpat does not implement this option.
-entityvariable varName
-
Specifies a Tcl array variable which contains the substitution text for
entities. The indices of the array are the entity references.
The parser will automatically perform the substitution
and pass the data to the application via the -characterdatacommand
callback command.
The default array variable has entries for the pre-defined XML entities:
lt | &lt; | < |
gt | &gt; | > |
amp | &amp; | & |
quot | &quot; | " |
apos | &apos; | ' |
Note that the current implementation of TclXML performs a separate
callback for the entity replacement text.
See the section Entity Expansion for more
information.
TclExpat does not implement this option.
-defaultcommand script
-
Specifies a Tcl command to associate with document features not otherwise
specified above.
The actual command consists of this option followed by one argument:
the details of the document feature.
TclXML does not implement this option.
-commentcommand script
-
Specifies a Tcl command to associate with comments in the document.
The actual command consists of this option followed by one argument:
the comment data.
Example:
$parser configure -commentcommand HandleComment
proc HandleComment {data} {
    puts stderr "Comment ==> $data"
}
$parser parse {<test><!-- this is <obviously> a comment --></test>}
This would result in the following command being invoked:
HandleComment { this is <obviously> a comment }
parser parse data
-
Parses XML data. Callbacks for various
document features, such as element start, element end,
character data, etc, will be invoked.
parser reset
-
Resets the parser in preparation for parsing another document.
See also the -final option.
Callback Command Return Codes
The script invoked for any of the parser callback commands, such as
-elementstartcommand, -elementendcommand, etc,
may return an error code other than "ok" or "error".
All callbacks may in addition return "break" or "continue".
If a callback script returns an "error" error code then processing
of the document is terminated and the error is propagated in the usual fashion.
If a callback script returns a "break" error code then all further
processing of the document data ceases, and the parser returns with a normal
status.
If a callback script returns a "continue" error code then processing
of the current element, and its children, ceases and processing continues with
the next (sibling) element.
Examples
Demonstration of break:
proc EStart {name attlist} {
    array set attr $attlist
    if {[info exists attr(class)]} {
        # Hand the return code named by the class attribute back to the parser.
        switch -- $attr(class) {
            break {
                return -code break
            }
            continue {
                return -code continue
            }
            error {
                return -code error {application invoked}
            }
        }
    }
    set id {}
    catch {set id " id $attr(id)"}
    puts stderr "Start element $name$id"
}
$parser configure -elementstartcommand EStart
$parser parse {<test>
<child class="break" id="1"><grandchild/></child>
<child class="break" id="2"><grandchild/></child>
</test>}
This script produces the output:
Start element test
Demonstration of continue:
$parser reset
$parser parse {<test>
<child class="continue" id="1"><grandchild id="3"/></child>
<child id="2"><grandchild id="4"/></child>
</test>}
This script produces the output:
Start element test
Start element child id 2
Start element grandchild id 4
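The return codes discussed in this section are ordinary Tcl return codes, which a procedure can produce with the -code option of the return command. The sketch below shows the code a caller (such as a parser) would observe for each case, using catch to capture it. The handler name EStartByStatus and the status attribute values are hypothetical and chosen only for illustration.

```tcl
proc EStartByStatus {name attlist args} {
    array set attr {status final}
    array set attr $attlist
    switch -- $attr(status) {
        draft   { return -code continue }
        stop    { return -code break }
        invalid { return -code error "invalid element $name" }
    }
    return
}

# catch returns the numeric return code: 0 ok, 1 error, 3 break, 4 continue.
set codes {}
foreach status {final draft stop invalid} {
    set code [catch {EStartByStatus child [list status $status]} msg]
    lappend codes $code
    puts "$status -> return code $code"
}
```

This prints return codes 0, 4, 3 and 1 respectively.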
Entity Expansion
TclXML has support for XML character and general entities. This support is controlled
by the -entityvariable and
-entityreferencecommand options.
TclXML is able to automatically substitute entity references for their replacement text.
The replacement text is also parsed, so further entities and elements may occur.
By default, TclXML will automatically perform substitution of character entities and
the five predefined general entities: &lt; (<), &gt; (>),
&quot; ("), &apos; (') and &amp; (&). Other general entities will
be included in the document's character data untouched.
Even though the replacement text for an entity may be only character data, it will be
passed to the application in a separate invocation of the character data callback.
For example, the script:
set p [xml::parser]
$p configure -characterdatacommand pcdata
$p parse {One general &lt; entity}
will result in these (and only these) Tcl commands being evaluated:
pcdata {One general }
pcdata <
pcdata { entity}
An application may provide its own replacement text for general entities by supplying
a Tcl array name with the -entityvariable option.
An application may be provided with callbacks for undefined general entities by using
the -entityreferencecommand option.
This option can be used to handle all entity references by giving an empty array for the
-entityvariable option.
It is possible to disable all entity expansion functionality by giving an empty array for the
-entityvariable option and an empty string for the
-entityreferencecommand option. In this case
character data will be delivered to the application in a single invocation with no expansion
occurring. For example, the script:
array set empty {}
set p [xml::parser]
$p configure -characterdatacommand pcdata -entityvariable empty -entityreferencecommand {}
$p parse {One general &lt; entity}
will result in this (and only this) Tcl command being evaluated:
pcdata {One general &lt; entity}
Note that TclXML does not currently support parsing XML DTDs and so it has no support for
XML parameter entities. If parameter entities occur in the document then they are left
untouched, since the XML Recommendation specifies that they be ignored in that situation.
References
[1] Extensible Markup Language (XML) 1.0. World Wide Web Consortium Recommendation.
http://www.w3.org/TR/REC-xml