4. Developers and Implementors
(including WebMasters and server operators)
Where's the spec?
Eve Maler has released the
DTD
and
documentation
used for the spec itself: this is a new version that was used to encode
the XML, XLink, XPointer, DOM, etc
specifications. Be aware that this version is no longer compatible with
the version that XML 1.0 uses; please send any comments or questions to
Eve.
What are these terms 'DTDless',
'valid', and 'well-formed'?
Full SGML uses a Document Type Definition (DTD) to describe the
markup (elements) available in any specific type of document. However,
designing and constructing a DTD can be a complex task, so XML has been
designed so that it can be used either with or without a DTD. DTDless
operation means you can invent markup without having to define it
formally, at the cost of losing automated control over the structuring
of additional documents of the same type.
To make this work, a DTDless file in effect defines
its own markup informally, by the simple existence and location of elements
where you create them. But when an XML application such as a browser
encounters a DTDless file, it needs to be able to understand the
document structure while it reads it, because it has no DTD to tell it what
to expect, so some changes have been made to the rules.
For example, HTML's <IMG>
element is defined as "EMPTY": it
doesn't have an end-tag. An XML application reading a
file without a DTD and encountering <IMG> would
have no way to know whether or not to expect an end-tag,
so the concept of 'well-formed' files has
become necessary. This makes the start and end of every element, and the
occurrence of EMPTY elements, completely unambiguous.
'Well-formed' files
All XML documents, both DTDless and valid, must be well-formed:
- if there is
no DTD in use, the document must start with a Standalone
Document Declaration (SDD) saying so:
<?xml version="1.0" standalone="yes"?>
<foo>
<bar>...<blort/>...</bar>
</foo>
David Brownell notes: "XML that's
'just' well-formed doesn't need to use a
Standalone Document Declaration at all. Such declarations are there to
permit certain speedups when processing documents while ignoring
external parameter entities -- basically, you can't rely on
external declarations in standalone documents. The types that are
relevant are entities and attributes. Standalone documents
must not require any kind of attribute value normalization or
defaulting, otherwise they are invalid."
- all tags must be balanced: that is, all elements which
may contain character data must have both start- and end-tags present
(omission is not allowed except for
empty elements, see below);
- all attribute values must be in quotes (the
single-quote character [the apostrophe] may be used if the value
contains a double-quote character, and vice versa):
if you need both, use &apos; or
&quot;, and declare them in the
internal subset;
- any EMPTY element tags (eg those with no end-tag, like HTML's
<IMG>, <HR>, and <BR> and others) must either end with "/>"
or be made to appear non-EMPTY by adding a real end-tag;
Example:
<BR> would become either
<BR/> or
<BR></BR>
- there must not be any isolated markup-start characters
(< or &) in your text data (ie they must be given as &lt; and
&amp;), and the sequence ]]> must be given as ]]&gt;
if it does not occur as the end of a CDATA marked
section;
- elements must nest inside each other properly (no
overlapping markup, same rule as for all SGML);
- Well-formed files with no DTD may use attributes on
any element, but the attributes must all be of type CDATA by default.
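The rules above can be tried out with any non-validating parser. A minimal sketch, assuming Python's standard-library ElementTree parser is an acceptable stand-in for "an XML application"; the helper name is_well_formed and the sample strings are illustrative, not part of any specification:

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if the parser accepts `text` as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# Each sample exercises one of the well-formedness rules listed above.
samples = {
    "unclosed empty element": "<p>one<BR>two</p>",
    "empty-element tag": "<p>one<BR/>two</p>",
    "start- and end-tag pair": "<p>one<BR></BR>two</p>",
    "unquoted attribute value": "<p align=left>text</p>",
    "overlapping elements": "<b><i>text</b></i>",
    "isolated ampersand": "<p>fish & chips</p>",
    "escaped ampersand": "<p>fish &amp; chips</p>",
}

for label, text in samples.items():
    print(f"{label}: {is_well_formed(text)}")
```

Only the samples that follow the rules are accepted; the parser needs no DTD to reach that decision.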
XML files with no DTD are
considered to have &lt;, &gt;, &apos;, &quot;, and &amp; predefined and thus available for use
even without a DTD. Valid XML files should, for interoperability, declare them explicitly if
they use them. If you want to use more than these five default
character entities, but you want to avoid having to write a full DTD,
it is possible to declare just character entities on their own in the
internal subset of a standalone XML file (thanks to Richard Lander for
this):
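A minimal sketch of such a declaration, wrapped in Python so that it can be checked with the standard-library parser. The entity names nbsp and copy and the element name doc are illustrative assumptions, not taken from Lander's example:

```python
import xml.etree.ElementTree as ET

# A standalone file whose internal subset declares only character entities.
document = """<?xml version="1.0" standalone="yes"?>
<!DOCTYPE doc [
<!ENTITY nbsp "&#160;">
<!ENTITY copy "&#169;">
]>
<doc>&copy;&nbsp;1998 &amp; onwards</doc>"""

root = ET.fromstring(document)

# The declared entities and the predefined &amp; are all expanded on parsing.
print(repr(root.text))
```

The two declared entities resolve alongside the predefined ones, without any element or attribute declarations being written.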
Valid XML
A valid file begins like any other SGML file with a Document Type
Declaration, but may have an optional XML Declaration
prepended:
<?xml version="1.0"?>
<!DOCTYPE advert SYSTEM "http://www.foo.org/ad.dtd">
<advert>
<headline>...<pic/>...</headline>
<text>...</text>
</advert>
The XML Specification defines an SGML Declaration for XML which is fixed
for all instances (the declaration has been removed from the text of the
Specification and is now in a separate document). An
XML version of the specified DTD must be
accessible to the XML processor, either by being available locally
(ie
the user already has a copy on disk), or by being retrievable via the
network. You can specify this by supplying the URL for the DTD in a
System Identifier (as in the example above). It is possible (many people
would say preferable) to supply a Formal Public Identifier as well; if
used, it must precede the System Identifier, which must still be given
(and only the PUBLIC keyword is used):
<!DOCTYPE advert PUBLIC "-//Foo, Inc//DTD Advertisements//EN"
"http://www.foo.org/ad.dtd">
The defaults for the other attributes of the XML Declaration are
version="1.0" and encoding="UTF-8".
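To see how a processor picks these identifiers out of the Document Type Declaration, here is a minimal sketch using Python's standard-library minidom. It is a non-validating parser, so it only records the identifiers and does not retrieve or apply the DTD; the document is the advert example above with the PUBLIC form of the declaration:

```python
from xml.dom import minidom

document = """<?xml version="1.0"?>
<!DOCTYPE advert PUBLIC "-//Foo, Inc//DTD Advertisements//EN"
  "http://www.foo.org/ad.dtd">
<advert>
<headline>...<pic/>...</headline>
<text>...</text>
</advert>"""

doctype = minidom.parseString(document).doctype

print(doctype.name)      # the document type name
print(doctype.publicId)  # the Formal Public Identifier
print(doctype.systemId)  # the System Identifier (URL of the DTD)
```

A validating processor would go on to fetch the DTD from the System Identifier (or resolve the Public Identifier locally) before checking the document against it.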
Which should I use in my DTD, attributes or elements?
There is no single answer to this: a lot depends on what
you are designing the document type for. The two extremes are best
illustrated with examples.
- 'Traditional' textual practice is to put the 'real' text (what would be
printed) as character data content, and keep the metadata (like line
numbers) in attributes, from where they can more easily be isolated
for analysis or special treatment like display in the margin or in a mouseover:
<l n="184"><sp>Portia</sp><text>The quality of mercy is not strain'd,</text></l>
- But from the systems point of view, there is nothing
'wrong' with storing the data the other way
round, especially where the volume of text data on each occasion is
relatively small:
<line speaker="Portia" text="The quality of mercy is not strain'd,">184</line>
A lot will depend on what you want to do with the
information and which bits of it are most easily accessed by each method.
A rule of thumb for conventional textual documents is that if the
markup were all stripped away, the bare text should still be readable
and usable, even if inconvenient. For database output, however, or
other machine-generated documents, 'reading' may not be meaningful, so
it is perfectly possible to have documents where all the data is in
attributes, and the document contains no character data in content
models at all. See
http://www.oasis-open.org/cover/elementsAndAttrs.html
for more.
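The rule of thumb can be tested directly on the two Portia examples. A minimal sketch using Python's standard-library ElementTree; the helper name strip_markup is an assumption made for illustration:

```python
import xml.etree.ElementTree as ET

# The 'traditional' textual encoding: text in content, metadata in attributes.
textual = ET.fromstring(
    '<l n="184"><sp>Portia</sp>'
    "<text>The quality of mercy is not strain'd,</text></l>"
)

# The data-oriented encoding: text in attributes, the line number in content.
data_style = ET.fromstring(
    '<line speaker="Portia" '
    'text="The quality of mercy is not strain\'d,">184</line>'
)


def strip_markup(element):
    """Return only the character data, as if all markup were removed."""
    pieces = (piece.strip() for piece in element.itertext())
    return " ".join(piece for piece in pieces if piece)


print(strip_markup(textual))      # the speech survives
print(strip_markup(data_style))   # only the line number survives
print(textual.get("n"), "/", data_style.get("text"))
```

Both encodings hold the same information; they differ only in what is left when the markup is ignored and in which access method (content or attribute lookup) reaches each item.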
What else has changed between SGML and XML?
The principal changes
are in what you can do in writing a Document Type Definition (DTD). To
simplify the syntax and make it easier to write processing software, a
large number of SGML markup declaration options have been suppressed
(see the
list of omitted
features).
An extra delimiter is permitted in Names
(the colon) for use in experiments with namespaces (enabling DTDs to
distinguish element source, ownership, or application). A colon may only
appear in mid-name, though, not at the start or the end. Work is ongoing
to define how these can be declared and referenced using element and
attribute markup.
What XML software can I use today?
For a detailed guide to examples of SGML and XML programs and the concepts behind
them, see the editor's book
Understanding SGML and XML
Tools (Kluwer, 1998, ISBN 0-7923-8169-6).
Information for developers of Chinese XML systems can be found
at the Chinese
XML Now! website of Academia
Sinica:
http://www.ascc.net/xml/
This site includes an FAQ and test files.
Do I have to change any of my server software to work
with XML?
Only to serve up .xml files as the correct MIME type
(application/xml, see RFC2376). To serve XML documents, all that is
needed is to edit the mime-types file (or its equivalent) and add the
line
application/xml xml XML
In some servers
(eg Apache),
users can change the MIME type for specific file types from their own
directories by using directives in a .htaccess file. The MIME
content-type text/xml must only be applied to
pure ASCII files (ISO 646 IRV) because of a character-set restriction
in the RFC: for all normal use,
application/xml is the one to go for.
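The same mapping can be sketched in code. This example uses the MIME table in Python's standard-library mimetypes module as a stand-in for a server's mime-types file; the filename report.xml is hypothetical, and a real server would be configured through its own file or directives instead:

```python
import mimetypes

# Register (or override) the mapping from the .xml extension,
# mirroring the mime-types line shown above.
mimetypes.add_type("application/xml", ".xml")

content_type, _encoding = mimetypes.guess_type("report.xml")
print(content_type)
```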
Since XML is designed to support
stylesheets and sophisticated hyperlinking, XML documents may be
accompanied by ancillary files in the same way that SGML files are:
DTDs, entity files, catalogs, stylesheets, etc, which may need other
MIME Content-Type entries, such as text/css
for CSS stylesheets. XUA (XML User Agent), which is one of the planned
deliverables of the XML WG, might provide a mechanism for packaging
XML documents and XSL styles into a single message.
If you run scripts that generate HTML and you want them to work with
XML, they will need to be modified to produce the relevant document
type.
Can I still use server-side
INCLUDEs?
Yes, so long as what they generate ends up as part of an
XML-conformant file (ie either
valid or just
well-formed).
Can I (and my authors) still use
client-side INCLUDEs?
The same rule applies as for
server-side INCLUDEs,
so you need to ensure that any embedded code which gets passed to a
third-party engine (eg
SDQL
enquiries,
Java
writes,
LiveWire requests,
streamed content,
etc) does not contain any characters
which might be misinterpreted as XML markup (ie
no angle brackets or ampersands): either use a
CDATA
marked section to avoid your XML application parsing the embedded code,
or use the standard &lt;, &gt;, and &amp; character entity references
instead.
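Both approaches can be sketched in a few lines. This is a minimal illustration using Python's standard library; the embedded script text and the element name script are assumptions made for the example:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Embedded code containing characters that would be read as XML markup.
script = "if (a < b && c > d) run();"

# Option 1: replace the markup characters with entity references.
escaped = f"<script>{escape(script)}</script>"

# Option 2: wrap the code in a CDATA marked section. This assumes the
# code never contains the sequence "]]>", which would end the section early.
cdata = f"<script><![CDATA[{script}]]></script>"

print(escaped)
print(cdata)

# Either way, an XML parser hands the original text back unchanged.
recovered_from_escaped = ET.fromstring(escaped).text
recovered_from_cdata = ET.fromstring(cdata).text
print(recovered_from_escaped == script, recovered_from_cdata == script)
```

The CDATA form keeps the embedded code readable in the source file; the entity form works everywhere character data is allowed, including attribute values.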
I'm trying to understand the XML
Spec: why does SGML (and XML) have such difficult terminology?
For implementation to succeed, the
terminology needs to be precise. Design goal 8 of the specification
tells us that "the design of XML shall be formal and
concise". To describe XML in formal terms, the specification
uses the concise language of Computer Science, which is often
confusing to non-CS people because it uses well-known English words in
a specialised sense which can be very different from their commonly
understood meanings -- for example,
'grammar', 'production',
'token', or
'terminal'.
The specification rarely explains these terms because of the other part
of this design goal: the specification should be concise. It doesn't
repeat explanations that are available elsewhere. In essence this
means that to grok the fullness of the spec, you need
foreknowledge of computer science and SGML.
Sloppy terminology in specifications causes misunderstandings, so
formal standards have to be phrased in formal terminology. This FAQ is not a
formal document, and the astute reader may already have noticed it
refers to 'element names' where 'element
type names' is more correct; but the former is more widely
understood.
Is there a Developer's API kit for
XML?
Several are available or under development. Details of these and
other XML software are held on the
SGML/XML Web pages.
The big conversion and application development engines like
Balise,
Omnimark,
and
SGMLC are all working on adding XML.
Details of SGML software of all kinds are on
the SGML Web pages.
How does XML fit with the DOM?
The Document Object Model (DOM)
(http://www.w3.org/TR/PR-DOM-Level-1)
provides an abstract API for constructing, accessing, and manipulating
XML and HTML documents. A "binding" of the DOM to a
particular programming language provides a concrete API.
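As an illustration of what a binding looks like, here is a minimal sketch using minidom from Python's standard library, which implements a subset of the DOM interfaces. The element names foo, bar, and blort are borrowed from the earlier well-formed example; the attribute n and the text content are assumptions:

```python
from xml.dom import minidom

# Constructing: build <foo><bar n="1">text</bar><blort/></foo> node by node.
doc = minidom.Document()
foo = doc.createElement("foo")
doc.appendChild(foo)

bar = doc.createElement("bar")
bar.setAttribute("n", "1")
bar.appendChild(doc.createTextNode("text"))
foo.appendChild(bar)
foo.appendChild(doc.createElement("blort"))

# Accessing: find an element by name and read its attribute and text.
found = doc.getElementsByTagName("bar")[0]
print(found.getAttribute("n"), found.firstChild.data)

# Manipulating: change the text, then serialise the tree.
found.firstChild.data = "changed"
print(foo.toxml())
```

The method names (createElement, appendChild, getElementsByTagName, and so on) come from the abstract DOM interfaces; a binding for another language would expose the same operations in that language's own idiom.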
Is there a
conformance test suite for XML processors?
James Clark has a collection of test cases for testing XML
parsers at
http://www.jclark.com/xml/
which includes a conformance test.
How do I include
one DTD (or fragment) in another?
This works exactly the same as for regular SGML. First you
declare the entity you want to include, and then you reference it by
name:
<!ENTITY % mylists PUBLIC
"-//Foo, Inc//ENTITIES Common list structures//EN"
"dtds/listfrag.ent">
...
%mylists;
Such declarations traditionally go all together towards the top
of the main DTD file, where they can be managed and maintained, but
this is not essential so long as they are declared before they are
used. You use Parameter Entity syntax for this (the percent sign)
because the file is to be included at DTD compile time, not when the
document instance itself is parsed.
Note that a URL is compulsory in XML for all external file
references: standard rules for dereferencing URLs apply (assume the same
method, server, and directory as the containing document). The URL can
be supplied either as a System Identifier alone:
<!ENTITY mydtd SYSTEM "http://www.foo.bar/~blort/my.dtd">
or as a second parameter to a formal Public Identifier, as in the
earlier example.
I've already
got SGML DTDs: how do I convert them for use with XML?
There are numerous projects being started to convert common or
popular SGML DTDs to XML format (for example Patrice Bonhomme is working
on an unofficial XML version of the TEI Lite DTD: details of that are
discussed on the TEI-L mailing list).
The following checklist comes courtesy of Sean McGrath
(author of XML By Example, Prentice Hall, 1998)
[my italics]:
- No equivalent of the SGML Declaration.
So keywords, character set
etc are essentially fixed;
- Tag minimization is not allowed, so <!ELEMENT x - O (A,B)>
becomes <!ELEMENT x (A,B)> and
<!ELEMENT x - O EMPTY>
becomes <!ELEMENT x EMPTY>;
- #PCDATA must only occur at the extreme left in an OR model,
eg <!ELEMENT x (A|B|#PCDATA|C)>
becomes <!ELEMENT x (#PCDATA|A|B|C)*>
(XML also requires the trailing asterisk on such a model)
and <!ELEMENT x (A,#PCDATA)>
is illegal;
- No CDATA, RCDATA
elements [declared content];
- Some SGML attribute types are not allowed in XML
eg NUTOKEN. Also
there are no NOTATION attributes (data attributes);
- Some SGML attribute defaults are not allowed in XML
eg CONREF;
- Comments cannot appear inside declarations
[they can in standard SGML], as in
<!ELEMENT x (A,B) -- this is an SGML comment in a
declaration -->
- A whole bunch of SGML optional features are not
present in XML:
- all forms of tag
minimization (OMITTAG, DATATAG,
SHORTREF, etc);
- Link Process
Definitions;
- Multiple
DTDs per document
and many more: see
the question on the bits of SGML that were
removed for XML for a reference to the complete list;
And last but not least, CONCUR!
There are some important differences between the internal and external
subset portions of a DTD in XML: marked sections can only occur in the
external subset, and in the internal subset Parameter Entities can only
be used to replace entire declarations, not parts of them, eg
the following is invalid XML:
<!DOCTYPE x [
<!ENTITY % modelx "(A|B)*">
<!ELEMENT x %modelx;>
]>
<x></x>
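This restriction can be observed with a parser. A minimal sketch using Python's standard-library ElementTree (built on Expat): the first document is the example above, and the second, with the hypothetical entity name declx, moves the whole declaration into the Parameter Entity. Expat reports the first as an error even without validating, since XML 1.0 states the restriction as a well-formedness constraint; other parsers may word the error differently:

```python
import xml.etree.ElementTree as ET

# Parameter Entity used for part of a declaration: not allowed
# in the internal subset.
partial = """<!DOCTYPE x [
<!ENTITY % modelx "(A|B)*">
<!ELEMENT x %modelx;>
]>
<x></x>"""

# Parameter Entity standing in for an entire declaration: allowed.
whole = """<!DOCTYPE x [
<!ENTITY % declx "<!ELEMENT x (A|B)*>">
%declx;
]>
<x></x>"""


def parses(text: str) -> bool:
    """Return True if the parser accepts the document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print("partial declaration:", parses(partial))
print("whole declaration:", parses(whole))
```

In the external subset both forms are permitted, which is why DTDs that rely on Parameter Entities inside content models are normally kept in a separate file.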
What's the story
on XML and EDI?
Electronic Data Interchange (EDI) has been used in e-commerce for many years
to exchange documents between commercial partners to a transaction. It
has required special proprietary software, but there are now moves to
enable EDI data to travel inside XML. Details of developments are at
http://www.xmledi.com/
and there is a guideline document at
http://www.geocities.com/WallStreet/Floor/5815/guide.htm