3. Authors of SGML (including writers of HTML)
Does XML replace HTML?
No. XML itself does not replace
HTML: instead, it provides an
alternative which allows you to define your own set of markup elements.
HTML is expected to remain in common use for some time to come, and
Document Type Definitions
for HTML will be available in XML versions as well as in original
SGML. XML is designed to make the writing of DTDs much simpler
than with full
SGML. (See
the
question on DTDs for what one is and why you'd want one.)
Work is going on to produce XML versions
of HTML and other popular existing DTDs, but this may not take off until more
stable software is available. Watch
comp.text.sgml, comp.text.xml,
XML-L, and xml-dev
for announcements.
What does an XML document look like
inside?
The basic structure is very similar to most other applications of
SGML, including HTML. XML documents can be very simple, with no document
type declaration, and straightforward nested markup of your own design:
<?xml version="1.0" standalone="yes"?>
<conversation>
<greeting>Hello, world!</greeting>
<response>Stop the planet, I want to get off!</response>
</conversation>
Or they can be more complicated, with a DTD specified (see
the question on DTDs), and maybe
an internal subset, and a more complex structure:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE titlepage SYSTEM "http://www.frisket.org/dtds/typo.dtd"
[<!ENTITY % active.links "INCLUDE">]>
<titlepage>
<white-space type="vertical" amount="36"/>
<title font="Baskerville" size="24/30"
alignment="centered">Hello, world!</title>
<white-space type="vertical" amount="12"/>
<!-- In some copies the following decoration is
hand-colored, presumably by the author -->
<image location="http://www.foo.bar/fleuron.eps" type="URL" alignment="centered"/>
<white-space type="vertical" amount="24"/>
<author font="Baskerville" size="18/22" style="italic">Vitam capias</author>
</titlepage>
Or they can be anywhere between: a lot will depend on how you
want to define your document type (or whose you use) and what it will be
used for. See the question on
valid and
well-formed files.
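As a rough illustration of how a processor sees the simple example above, the following sketch parses it with the xml.etree.ElementTree module from Python's standard library and lists the nested elements. The choice of Python and of this module is an assumption made for illustration only; any conforming XML parser should report the same structure.

```python
import xml.etree.ElementTree as ET

# The simple DTD-less document shown earlier, held as bytes so that the
# XML declaration is parsed along with the markup.
DOC = b"""<?xml version="1.0" standalone="yes"?>
<conversation>
<greeting>Hello, world!</greeting>
<response>Stop the planet, I want to get off!</response>
</conversation>"""

root = ET.fromstring(DOC)

# Collect each child element type together with its character data.
structure = [(child.tag, child.text) for child in root]

print(root.tag)
for name, text in structure:
    print(f"  {name}: {text}")
```

No DTD is needed for this to work: the element type names are simply whatever the document uses.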
How does XML handle white-space in my
documents?
The SGML rules regarding white-space have been changed for XML,
so all white-space, including linebreaks, TAB
characters, and regular spaces, even between elements where no text
can appear, is passed by the parser unchanged
to the application (browser, formatter, viewer,
etc). This means:
- 'insignificant'
white-space between structural elements (those which appear where only
element content is allowed, ie between
other elements, without text data) will get
passed to the application (under standard SGML
this white-space gets suppressed, which is why you can put all that
extra space in HTML documents and not worry about it. This is not so
in XML);
- 'significant' white-space
within elements which can contain text and markup
mixed together ("mixed content" or PCDATA [parsed
character data]) will still get passed to the application exactly as
under regular SGML.
<chapter>
<section>
<title>
My title for Section
1.
</title>
<p>
...
</p>
</section>
</chapter>
The parser must, however, still inform the application that
white-space has occurred in element content, if it can detect it. (Users of standard
SGML may recognize that this information was not in the
ESIS,
but it
is in the
grove.)
In the above example, the application will receive all the
pretty-printing linebreaks, TABs, and spaces between the elements
as
well as those embedded in the section title. It is the function of the
application (browser, formatter, viewer,
etc) to decide which type of white-space
to discard and which to retain.
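The behaviour described above can be observed with a small sketch, here using Python's standard xml.etree.ElementTree parser (an assumption chosen for illustration; the normalize function is a hypothetical application policy, not something XML prescribes). The parser hands over every linebreak and space, and the application then decides what to keep.

```python
import xml.etree.ElementTree as ET

DOC = """<chapter>
  <section>
    <title>
      My title for Section
      1.
    </title>
  </section>
</chapter>"""

chapter = ET.fromstring(DOC)
title = chapter.find("section").find("title")

# White-space between elements arrives as ordinary character data.
between_elements = chapter.text   # the linebreak and indent before <section>
inside_title = title.text         # the title text with its linebreaks intact

def normalize(text):
    """One possible application policy: collapse runs of white-space."""
    return " ".join(text.split())

print(repr(between_elements))
print(repr(inside_title))
print(normalize(inside_title))
```

A formatter might apply normalize to titles and paragraphs and discard text which is white-space only, while an editor might prefer to keep everything exactly as received.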
Which parts of an XML document are
case-sensitive?
All of it, both markup and
text. This is significantly different from HTML and most other SGML
document types. It was introduced to allow markup in non-Latin-alphabet
languages and to obviate problems with case-folding in scripts which are
caseless.
- Element type
names (used in start-tags and end-tags) are case-sensitive: you must
stick with whatever combination of upper- or lower-case you use to
define them (either by usage or in a DTD).
So you can't say <BODY>...</body>: upper- and lower-case must
match; thus
<IMG> and
<img> are
two different element types;
- For well-formed files with no
DTD, the first occurrence of an element type name
defines the casing;
- Attribute names are also case-sensitive, on a
per-element basis: for example
<PIC width="7in">
and <PIC WIDTH="6in">
in the same file exhibit two separate attributes,
because the different casings of width
and WIDTH distinguish them;
- Attribute values are also
case-sensitive. Character data values (eg
HRef="MyFile.SGML") always have
been, but ID and IDREF attributes are now case-sensitive as well and no longer
get folded to uppercase for comparisons;
- All entity names (Aacute),
and your data content (your text), are case-sensitive, exactly as
before.
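The rules above can be checked with a short sketch using Python's standard xml.etree.ElementTree parser (the choice of tool and the helper name is_well_formed are assumptions made for illustration).

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if text parses as well-formed XML."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# Start-tag and end-tag must match exactly, including case.
mismatched = is_well_formed("<BODY>text</body>")
matched = is_well_formed("<body>text</body>")

# <IMG> and <img> are two different element types.
doc = ET.fromstring("<doc><IMG/><img/></doc>")
tags = [child.tag for child in doc]

# width and WIDTH are two different attributes, so both may appear
# on the same element without being reported as a duplicate.
pic = ET.fromstring('<PIC width="7in" WIDTH="6in"/>')
attributes = dict(pic.attrib)

print(mismatched, matched, tags, attributes)
```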
How can I make my existing HTML files
work in XML?
Make them
well-formed DTD-less documents (see
below) and write a stylesheet. A
DTD (Document Type Definition) is
optional in XML, but HTML files converted to XML format currently have
to be DTDless because there are few working XML versions of the current
SGML-based HTML DTDs yet
(they need to be substantially edited to remove their dependence
on those features of SGML which are excluded from XML).
It is necessary to convert existing HTML
files to be well-formed because XML does not allow end-tag minimization
(missing </p>, etc)
which is allowed in most HTML DTDs. Many HTML authoring tools already
produce almost (but not quite) well-formed XML. As a preparation for
XML, the W3C's
HTML Tidy
program can clean up some of the formatting mess left behind by
inadequate HTML editors.
If you want to move your files out of HTML
into some other DTD entirely, there is a pilot site run by CommerceNet
(
http://www.xmlx.com/) for
the exchange of XML DTDs, and a pilot FPI server at
http://www.ucc.ie/cgi-bin/public
with several common SGML DTDs to start from.
To make an HTML file well-formed:
- replace the
DOCTYPE declaration and any internal subset
(basically everything within the first set of angled brackets <!DOCTYPE
HTML...>) with the XML Declaration <?xml
version="1.0" standalone="yes"?>
- change any EMPTY elements
(eg
<ISINDEX>, <BASE>,
<META>, <LINK>,
<NEXTID> and <RANGE>
in the header, and <IMG>,
<BR>, <HR>,
<FRAME>, <WBR>,
<BASEFONT>, <SPACER>,
<AUDIOSCOPE>,
<AREA>, <PARAM>,
<KEYGEN>, <COL>,
<LIMITTEXT>, <SPOT>,
<TAB>, <OVER>,
<RIGHT>, <LEFT>,
<CHOOSE>, <ATOP>,
and <OF> in the body) so that they end
with "/>", for example
<IMG SRC="mypic.gif"
alt="Picture"/>
- ensure there are
correctly-matched explicit end-tags for all non-empty elements;
eg every
<P> must have a
</P>, etc.
If your HTML was created by a conformant editor, this process can be
automated by a normalizer program like sgmlnorm
(part of SP)
or the sgml-normalize function in an editor
like Emacs/psgml;
- escape all <
and & non-markup
(ie literal) characters as
&lt; and &amp; respectively;
- ensure all attribute values are in quotes;
- ensure all occurrences of all element names in
start-tags and end-tags match with respect to
upper- and lower-case and that they are consistent throughout the file;
- ensure all attribute names are similarly in a
consistent case throughout the file.
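Several of the steps in the list above can be sketched as a small script. This is a minimal illustration, assuming Python's standard html.parser module as the reader; the class WellFormer, the function to_well_formed, and the shortened EMPTY set are hypothetical names invented for this example. It drops the DOCTYPE declaration, closes EMPTY elements with "/>", quotes attribute values, escapes literal & and < characters, and folds all names to one case. It does not supply missing end-tags such as </P>, so it is only a starting point for files which already have them.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET

# A few of the EMPTY element types listed above (lower-cased, because
# HTMLParser folds names to lower case); extend the set as needed.
EMPTY = {"isindex", "base", "meta", "link", "img", "br", "hr",
         "area", "param", "col"}

class WellFormer(HTMLParser):
    """Hypothetical helper: re-emit HTML markup as well-formed XML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def _attrs(self, attrs):
        # Quote every value; a bare attribute such as ISMAP is given
        # its own name as its value.
        return "".join(
            f" {name}={quoteattr(name if value is None else value)}"
            for name, value in attrs
        )

    def handle_starttag(self, tag, attrs):
        close = "/" if tag in EMPTY else ""
        self.out.append(f"<{tag}{self._attrs(attrs)}{close}>")

    def handle_startendtag(self, tag, attrs):
        self.out.append(f"<{tag}{self._attrs(attrs)}/>")

    def handle_endtag(self, tag):
        if tag not in EMPTY:           # EMPTY elements are already closed
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(escape(data))  # literal & and < become references

def to_well_formed(html):
    parser = WellFormer()
    parser.feed(html)
    parser.close()
    return '<?xml version="1.0" standalone="yes"?>\n' + "".join(parser.out)

html = ('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">'
        '<P>Fish &amp; chips<BR><IMG SRC=mypic.gif ALT="Picture"></P>')
xml_text = to_well_formed(html)
print(xml_text)

# The result can be read back by an XML parser.
root = ET.fromstring(xml_text)
```

Comments and processing instructions are silently dropped by this sketch, which may or may not be what you want.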
Be aware that many HTML browsers may not accept XML-style
EMPTY elements with the trailing slash, so the above
changes may not be backwards-compatible. An alternative is to add a dummy
end-tag to all EMPTY elements, so
<IMG src="foo.gif"> becomes
<IMG src="foo.gif"></IMG>.
If you have a lot of valid HTML files,
you could write a script to do this
in a programming language which understands SGML/XML markup (such as
Omnimark,
Balise,
SGMLC,
or a system using one of the SGML libraries for
Perl,
Python, or
Tcl), or you could even use editor macros if you know what you're
doing.
If your HTML files are invalid (HTML
created by most WYSIWYG editors is invalid) then they will almost
certainly have to be converted manually, although if the deformities are
regular and carefully constructed, the files may actually be almost
well-formed, and you could write a program or script to do as described
above. To test for invalidity and non-conformance, check the following:
- do the files
contain markup syntax errors? For example, are there any backslashes
instead of forward slashes on end-tags; or elements which nest
incorrectly (eg <B>an
element which starts <I>inside one element</B> but ends
outside it</I>)?
- do the files contain markup
which conflicts with the HTML DTDs, such as headings inside paragraphs,
list items outside list environments?
- do the files use elements which are not in any DTD?
Although this is easy to transform to a DTDless well-formed file
(because you don't have to define elements in advance) most
proprietary [browser-specific] extensions have never been formally
defined, so it is often impossible to work out where they can
meaningfully be used.
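The first of these checks (markup syntax errors) can be tried out mechanically. The sketch below assumes Python's standard xml.etree.ElementTree parser and a hypothetical helper called check; it reports only well-formedness errors, so it says nothing about conflicts with the HTML DTDs or about undefined elements.

```python
import xml.etree.ElementTree as ET

def check(text):
    """Return None if text is well-formed, else the parser's error message."""
    try:
        ET.fromstring(text)
        return None
    except ET.ParseError as err:
        return str(err)

samples = {
    "overlapping": "<P><B>an element which starts <I>inside one element"
                   "</B> but ends outside it</I></P>",
    "backslash end-tag": "<P>text<\\P>",
    "missing end-tag": "<UL><LI>one<LI>two</UL>",
    "well-formed": "<P><B>bold <I>and italic</I></B></P>",
}

results = {name: check(text) for name, text in samples.items()}
for name, problem in results.items():
    print(f"{name}: {problem or 'OK'}")
```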
Markup which is valid but which is meaningless or void may need
to be edited out before conversion (such as repeated empty paragraphs or
linebreaks, empty tables, invisible
'spacing' GIFs etc:
XML uses stylesheets, so you won't need any of these).
Is there an XML version of HTML?
There are XML versions of the HTML DTD in preparation but
none ready yet:
-
Ben Trafford is
developing an XML version of HTML 3.2
- I have started work on an XML version of HTML Pro, but
it's not easy, and I need convincing it's worth doing.
-
The Extensible HyperText Markup Language (XHTML) is a
W3C project:
"This specification defines XHTML 1.0, a reformulation of HTML 4.0
as an XML 1.0 application, and three DTDs
corresponding to the ones defined by HTML 4.0. The semantics of the
elements and their attributes are defined in
the W3C Recommendation for HTML 4.0. These semantics provide the
foundation for future extensibility of XHTML.
Compatibility with existing HTML user agents is possible by
following a small set of guidelines."
If XML is just a subset of SGML, can I
use XML files directly with SGML tools?
Yes, provided you use SGML software which knows about the new WebSGML
Adaptations to ISO 8879 (features needed to support XML, such as the
special form for EMPTY elements; some aspects of the
SGML Declaration such as NAMECASE GENERAL NO;
multiple attribute declarations, etc).
An alternative is to use an SGML DTD to
let you create an SGML file, but one which does not use empty elements; and
then remove the DocType Declaration so it becomes a well-formed
DTDless XML file.
At the moment there are few tools which handle XML files
unchanged because of the format of these
EMPTY
elements, but this is changing. The
nsgmls
parser has an XML conformance switch, introduced for use with
Jade, and the first XML-specific
editors and parsers are in use (see the question on
software).
I'm used to authoring and serving
HTML. Can I learn XML easily?
Yes, very easily, but at the moment there is still a need for
tutorials, simple tools, and more examples of XML documents.
Well-formed XML documents may look
similar to
HTML except for some small
but very important points of syntax.
The big practical difference is
that XML has to stick to the rules. HTML browsers let you create
broken HTML because they elide all the broken bits: with XML your
files have to be correct or they simply won't work.
Will XML be able to use non-Latin
characters?
Yes, the
XML
Specification explicitly says XML uses
ISO 10646, the international
standard 31-bit character repertoire which covers most human (and some
non-human) languages. This is currently congruent with Unicode.
The spec says (2.2): "All XML processors must accept the
UTF-8 and UTF-16 encodings of ISO 10646...". UTF-8 is an
encoding of Unicode into 8-bit characters: the first 128 are the same as
ASCII, the rest are used to encode the rest of Unicode into sequences of
between 2 and 6 bytes. UTF-8 in its single-octet form is therefore the
same as ISO 646 IRV (ASCII), so you can continue to use ASCII for
English or other unaccented languages using the Latin alphabet. Note
that UTF-8 is incompatible with ISO 8859-1 (ISO Latin-1) after code
point 127 decimal (the end of ASCII). UTF-16 encodes the first 64k
characters as single 16-bit units, with a scheme to represent the next 16
planes of 64k characters as pairs of 16-bit units.
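The differences between these encodings are easy to see by looking at the bytes. The following sketch uses Python's built-in codecs (an assumption made for illustration; the sample characters are arbitrary) to show that ASCII characters are unchanged in UTF-8, that an accented letter is encoded differently in UTF-8 and ISO Latin-1, and that a character outside the first 64k takes two 16-bit units in UTF-16.

```python
# Sample characters: an ASCII letter, an accented letter, the euro sign,
# and a musical symbol from beyond the first 64k characters.
samples = {
    "A": "A",
    "e-acute": "\u00e9",
    "euro": "\u20ac",
    "G clef": "\U0001d11e",
}

for label, ch in samples.items():
    utf8 = ch.encode("utf-8")
    utf16 = ch.encode("utf-16-be")   # big-endian, no byte-order mark
    print(f"{label}: U+{ord(ch):04X} utf-8={utf8.hex()} utf-16={utf16.hex()}")

# Single-octet UTF-8 is identical to ASCII...
ascii_same = "A".encode("utf-8") == "A".encode("ascii")
# ...but above the ASCII range UTF-8 and ISO Latin-1 disagree.
latin1_differs = "\u00e9".encode("utf-8") != "\u00e9".encode("latin-1")
```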
"...the mechanisms for signalling which of the two
are in use, and for bringing other encodings into play, are [...]
in the discussion of character encodings." The
XML Specification explains how to
specify in your XML file which coded character set you are using.
Use of UCS-4 can only legally be specified
in SGML or XML when the WebSGML Adaptations to ISO 8879 are implemented:
this enables numbers longer than eight digits to be used in the SGML
Declaration.
"Regardless of the specific encoding used, any character
in the ISO 10646 character set may be referred to by the decimal or
hexadecimal equivalent of its bit string": so no matter which
character set you personally use, you can still refer to specific
individual characters from elsewhere in the encoded repertoire by using
&#dddd; (decimal character code) or
&#xHHHH;
(hexadecimal character code; the x must be lowercase, but the digits may be in either case).
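A short sketch of character references in use, assuming Python's standard xml.etree.ElementTree parser (the sample element names are invented for the example). The same character is written as a decimal reference, as hexadecimal references with the digits in either case, and literally; the parser delivers the same character each time.

```python
import xml.etree.ElementTree as ET

# e-acute (U+00E9) written four ways, separated by | for easy splitting.
doc = ET.fromstring("<p>&#233;|&#xE9;|&#xe9;|\u00e9</p>")
forms = doc.text.split("|")
all_equal = len(set(forms)) == 1

# A file which uses only references for non-ASCII characters can be
# stored as pure ASCII and still carry the full repertoire.
ascii_doc = "<name>Se&#225;n</name>".encode("ascii")
name = ET.fromstring(ascii_doc).text

print(forms, name)
```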
The terminology can get confusing, as can the numbers: see the
ISO
10646 Concept Dictionary. Rick Jelliffe has
"XML-ized"
the ISO character entity sets.
What's a Document Type Definition
(DTD) and where do I get one?
A DTD is a file (or several files
to be used together), written in XML's declaration syntax, which contains a formal
definition of a
particular type of document. It sets out what names can be used for
element types, where they may occur, and how they all fit together.
For example, if you want a document type to be able to describe
<List>s which contain <Item>s, part of your DTD would contain
something like
<!ELEMENT List (Item)+>
<!ELEMENT Item (#PCDATA)>
This fragment defines
a list as an element type
containing one or more items (that's the plus sign),
and items as element types containing just text. XML is the formal
specification language which
processors read to automatically parse the DTD and then use that
information to
identify where every element type comes and how each relates to the
other, so that stylesheets, navigators, browsers, search engines,
databases, printing routines, and other applications can be
used. The above fragment lets you create lists which get stored
as:
<List><Item>Chocolate</Item><Item>Music</Item><Item>Surfing</Item></List>
How
the list appears in print or on the screen depends on your stylesheet:
you do not normally need to put anything in the XML to affect
formatting in the way that had to be done with HTML before stylesheets.
In effect, a DTD provides applications with
advance notice of what names and structures can be used in a particular
document type. Using a DTD means you can be certain that all documents
which belong to a particular type will be constructed and named in a
conformant manner.
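To make the idea of "advance notice" concrete, here is a hand-written check of the two declarations in the List and Item fragment. It is only a sketch under stated assumptions: it is not a DTD validator, the function name conforms_to_list_model is hypothetical, and the rules are coded by hand rather than read from a DTD (the parser in Python's standard library is non-validating). A validating parser would derive the same rules from the DTD itself.

```python
import xml.etree.ElementTree as ET

def conforms_to_list_model(text):
    """Check <!ELEMENT List (Item)+> and <!ELEMENT Item (#PCDATA)> by hand."""
    root = ET.fromstring(text)
    if root.tag != "List":
        return False
    items = list(root)
    if not items:                                # (Item)+ needs at least one
        return False
    for item in items:
        if item.tag != "Item" or len(item) > 0:  # (#PCDATA): no child elements
            return False
        if item.tail and item.tail.strip():      # no stray text between Items
            return False
    # No stray text before the first Item either.
    return not (root.text and root.text.strip())

good = "<List><Item>Chocolate</Item><Item>Music</Item><Item>Surfing</Item></List>"
print(conforms_to_list_model(good))
print(conforms_to_list_model("<List></List>"))
```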
There are thousands of
SGML DTDs already in existence in all kinds of areas (see the
SGML Web
pages
for examples). Many of them can be downloaded and used freely; or you
can write your own. As with any language, you need to learn it to do
this (see for example
Developing SGML DTDs by
Maler and el
Andaloussi, Prentice Hall, 1997, 0-13-309881-8): but XML is much
simpler than full SGML: see the
list of restrictions which shows
what has been cut out. Existing SGML DTDs do need to be converted to XML
for use with XML systems:
read the question
on converting SGML DTDs to XML, and expect to see announcements
of popular DTDs eventually becoming available in XML format.
I keep hearing about alternatives to DTDs. What's a schema?
Bob DuCharme writes: "Many XML developers are dissatisfied with the syntax of the markup
declarations described in the XML spec for two reasons. First, they feel
that if XML documents are so good at describing structured information,
then the description of a document type's structure (its 'schema')
should be in an XML document instead of written with its own special
syntax. In addition to being more consistent, this would make it easier
to edit and manipulate the schema with regular document manipulation
tools. Secondly, they feel that traditional DTD notation doesn't allow
schema designers the power to impose enough constraints on the data, for
example, the ability to say that a certain element type must always have
a positive integer value, that it may not be empty, or that it must be
one of a list of possible choices. This would ease the development of
software using that data because the developer would have less
error-checking code to write."
Users from a database or computer science background
should be aware that SGML systems -- and that includes XML --
are not database management systems: they are text markup systems.
While there are many similarities, such as the ones described here,
some of the concepts of one are simply non-existent in the other: XML
does not possess some database-like features in the same way that
DBMSs do not possess markup-like ones.
"Several groups have submitted proposals to the W3C for alternative ways
to express document type schemata. In addition to offering schema
constraints like data typing and the others described here, many take
advantage of other current trends in software development such as
object-oriented methodologies. The W3C Schema Working Group is currently
reviewing these proposals and developing their own proposal based on the
most useful features suggested by the existing proposals and the members
of the Working Group."
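The "error-checking code" mentioned in the quotation might look something like the sketch below, which tests by hand the three kinds of constraint given as examples (a positive integer, a non-empty value, and one of a list of choices). Everything specific here is an assumption invented for illustration: the order, quantity, customer, and size element types, the ALLOWED_SIZES list, and the function check_order. With a schema language able to express these constraints, a checking processor could do this work instead.

```python
import xml.etree.ElementTree as ET

ALLOWED_SIZES = {"small", "medium", "large"}   # hypothetical list of choices

def check_order(text):
    """Return a list of constraint violations for a hypothetical <order>."""
    order = ET.fromstring(text)
    problems = []

    quantity = (order.findtext("quantity") or "").strip()
    if not quantity.isdecimal() or int(quantity) <= 0:
        problems.append("quantity must be a positive integer")

    if not (order.findtext("customer") or "").strip():
        problems.append("customer may not be empty")

    if (order.findtext("size") or "").strip() not in ALLOWED_SIZES:
        problems.append("size must be one of " + ", ".join(sorted(ALLOWED_SIZES)))

    return problems

good = ("<order><quantity>3</quantity><customer>Ann</customer>"
        "<size>large</size></order>")
bad = ("<order><quantity>-2</quantity><customer> </customer>"
       "<size>huge</size></order>")
print(check_order(good))
print(check_order(bad))
```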
How will XML affect my document
links?
The linking
abilities of XML systems are much more powerful than those of
HTML, so you'll be able to do much more with them. Existing
HREF-style links will remain usable, but the new linking
technology is based on the lessons learned in the development of other
standards involving hypertext, such as
TEI and
HyTime,
which let you manage bidirectional and multi-way links, as well as links
to a span of text (within your own or other documents) rather than to a
single point. These features have been available to standard SGML
users in browsers like
DynaText,
Panorama
and
Multidoc Pro for many years, so there
is considerable experience and expertise available in using them.
The
XML Linking Specification
(XLink) and
XML
Extended Pointer Specification (XPointer) documents contain a
detailed draft specification. An XML link can be either a URL or a
TEI-style Extended Pointer,
or both. A URL on its own is assumed to be a resource (as with HTML); if
an XPointer follows it, it is assumed to be a sub-resource of that URL;
an XPointer on its own is assumed to apply to the current document.
An XPointer is always preceded by one of
#, ?, or |.
The # and ? mean the same as
in HTML applications; the | means the sub-resource
can be found by applying the XPointer to the resource, but the method of
doing this is left to the application.
The
TEI
Extended Pointer Notation (EPN) is much more powerful than the
'fragment address' on the end of some URLs, as
it allows you to specify the location of a link end using the structure
of the document as well as known, fixed points like
IDs.
For example, the linked second
occurrence of the word
'XPointer' two paragraphs back could be referred
to as http://www.ucc.ie/xml/faq.sgml#ID(faq-hypertext)CHILD(2,*)(6,*),
meaning the sixth child object within the second child object after the
element whose ID is faq-hypertext. Count
the objects from the start of this question in the
SGML version (which has the ID
"#faq-hypertext"):
the title of the question;
<section id="faq-hypertext">
<TITLE>How will XML affect my document links?</TITLE>
the second paragraph:
-
the character data from the start of the
paragraph to the first item of markup:
<para>The
- the markup item:
XML Linking
Specification (XLink)
- the text item:
and
- the markup item:
XML Extended Pointer Specification
(XPointer)
- the next stretch of character data:
documents contain a detailed specification. An XML link
can be either a URL or a TEI-style Extended Pointer (
- and the next markup item:
XPointer
If you view this file with Panorama
or
MultiDoc Pro you can click on the
highlighted cross-reference button at the start of the example sentence,
and it will display the locations in Extended Pointer Notation of all
the links to it, including the word "XPointer" mentioned.
(Doing this in an HTML browser is not meaningful, as they do not support
bidirectional linking or EPN.) David Megginson has produced an
additional function for Emacs/psgml which will deduce an XPointer for
any location in an SGML or XML file.
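The counting of child objects in the example above can be reproduced with a short sketch. It assumes Python's standard xml.dom.minidom module, a cut-down copy of this question's markup with no white-space between elements, and a made-up ref element type standing in for the real link markup; the helpers find_by_id and child are hypothetical. It illustrates the counting only and does not parse Extended Pointer Notation.

```python
from xml.dom import minidom

DOC = (
    '<section id="faq-hypertext">'
    "<TITLE>How will XML affect my document links?</TITLE>"
    "<para>The <ref>XML Linking Specification (XLink)</ref> and "
    "<ref>XML Extended Pointer Specification (XPointer)</ref> documents "
    "contain a detailed specification. An XML link can be either a URL or "
    "a TEI-style Extended Pointer (<ref>XPointer</ref>)</para>"
    "</section>"
)

def find_by_id(node, wanted):
    """Depth-first search for the element whose id attribute equals wanted."""
    if node.nodeType == node.ELEMENT_NODE and node.getAttribute("id") == wanted:
        return node
    for candidate in node.childNodes:
        found = find_by_id(candidate, wanted)
        if found is not None:
            return found
    return None

def child(node, n):
    """Return the n-th child object (1-based), counting text and elements alike."""
    return node.childNodes[n - 1]

document = minidom.parseString(DOC)
start = find_by_id(document, "faq-hypertext")   # ID(faq-hypertext)
target = child(child(start, 2), 6)              # CHILD(2,*)(6,*)
print(target.toxml())
```

Because text counts as an object too, any pretty-printing white-space between the elements would change the numbers, which is one reason to treat this as a sketch.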
Can I do mathematics using XML?
Yes, if the
document
type you use provides for math. The mathematics-using community
is developing software, and there is a
MathML proposal at the W3C,
which is a native XML application. It would also be possible to make XML
fragments from the long-expired HTML3,
HTML
Pro, or
ISO
12083 Math, or
OpenMath,
or one of your own making. Browsers which display some math embedded in
SGML already exist (eg
DynaText,
Panorama,
Multidoc Pro).
The sophistication could vary from math
expressions like
x_i
through simple inline equations such as E = mc^2
to complicated display equations. The
Techexplorer plugin from IBM
can be used with regular HTML browsers to render TeX math, and the
Amaya testbed browser
at the W3C has an
experimental MathML display.
How does XML handle metadata?
There are no predefined elements in XML, because it is an
architecture, not an application, so it is not part of XML's job to
specify how or if authors should or should not implement metadata. You
are therefore free to use any suitable method from simple attributes to
the embedding of entire Dublin Core/Warwick Framework metadata records.
Browser makers may also have their own architectural recommendations or
methods to propose.
Can I use Java, ActiveX, etc
in XML files?
This depends on what facilities the browser makers implement. XML
is about describing information; scripting languages and languages for
embedded functionality are software which enables the information to
be manipulated at the user's end.
XML itself provides a way to define the markup needed to
implement scripting languages: as a neutral standard it neither
encourages nor discourages their use, and does not favour one language
over another, so the field is wide open.
Can I use Java to create or manage XML files?
Yes, any programming language can be used to output data
from any source in XML format. There is a growing number of front-ends
and back-ends for programming environments and data management
environments to automate this.
-
Mark Watson writes in article
344c3443.4494773@news.infonex.net: "I posted the spec to a Java
toolkit for creating XML documents from relational database queries,
and for save/loading XML documents to local files, and for transport
via sockets, RMI, and CORBA IIOP. The spec is at: www.markwatson.com/XMLdb_0_1.htm."
-
There is a suite of
Java tutorials (with source code and explanation) available at http://developerlife.com. These
tutorials show the Java2 developer how to use the IBM, Sun and OpenXML
Java parsers to write Java programs that use XML.
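Since any programming language can be used, here is the same idea sketched in Python rather than Java, using the standard xml.etree.ElementTree module. The rows and the menu, item, title, and price element types are hypothetical, standing in for the result of a database query.

```python
import xml.etree.ElementTree as ET

# Hypothetical rows, standing in for the result of a database query.
rows = [
    {"id": "1", "title": "Fish & chips", "price": "4.50"},
    {"id": "2", "title": "Tea <hot>", "price": "1.20"},
]

menu = ET.Element("menu")
for row in rows:
    item = ET.SubElement(menu, "item", id=row["id"])
    ET.SubElement(item, "title").text = row["title"]
    ET.SubElement(item, "price").text = row["price"]

# The serializer escapes & and < in the data automatically.
xml_text = ET.tostring(menu, encoding="unicode")
print(xml_text)

# Reading the output back recovers the original data.
reparsed = ET.fromstring(xml_text)
```

Building the tree and letting the library serialize it avoids the escaping mistakes which are easy to make when XML is assembled by string concatenation.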
How do I control appearance?
The use of a stylesheet is required for XML. Some browsers may
provide simple default styles for popular elements like
<Para>, or <List>
containing <Item>, but in general a
stylesheet gives the author much better control of the layout. But as
with any system where files can be viewed at random by arbitrary users,
the author cannot know what resources (such as fonts) are on the user's
system, so care is needed.
Arbortext's experimental
XML Styler has details of how to use it with XSL. You will
also need the ActiveX controls and
XSL
codebase.
There are also many pre-existing proprietary
stylesheet systems and implementations, many of which are deeply
embedded in the technical documentation community (and thus heavily
supported by one or more products):
- Inso Corp's DynaText
and DynaWeb browser and server products
(their forerunner company, EBT, was where much of today's stylesheet
technology was invented);
- The Synex stylesheet DTD as
used in
Panorama and MultiDoc Pro;
- The US military standard FOSI
(Formatted Output Specification Instance) is implemented in Arbortext's
ADEPT*Editor (and elsewhere);
- SoftQuad's Author/Editor
uses stylesheets controllable by the user.
Most browser and editor vendors appear to
be committing to a move to XSL but with a large installed user base
for their existing systems this will probably not occur quickly.
How do I use
graphics in XML?
Graphics are just links which happen to
have a picture file at the end rather than another piece of text, so they can be
done in any way supported by the XLink and XPointer specifications (see
earlier question), including using
similar syntax to existing HTML images. They can also be done using XML's
built-in
NOTATION and
ENTITY
mechanism in a similar way to standard SGML. The linking specifications, however, give you much better
control over the traversal and activation of links, so an author can specify,
for example, whether or not to have an image appear when the page is
loaded, or on a click from the user, or in a separate window,
without
having to resort to scripting. Which graphic file formats will be
supported is a matter for the browser makers: XML itself neither
prescribes nor restricts them. GIF, JPG, TIFF, PNG, and CGM at a minimum would seem
to make sense: there are moves towards creating a networked vector
graphics standard (see next paragraph).
Peter Murray-Rust writes: "GIFs and
JPEGs cater for bitmaps (pixel representations of images). Vector
graphics (scaleable) are being addressed in the W3C's graphics
activity (see
http://www.w3.org/Graphics/Activity).
When a consensus is reached it will be possible to transmit the
graphics representation
within the XML file. For
many graphics objects this will mean greatly decreased download time
and scaling without loss of detail."
You cannot embed a raw graphics file (or any other
binary [non-text] data) directly into an XML file because any bytes
resembling markup would get misinterpreted: you must refer to it by
linking (see below).
Bob DuCharme adds: "All the data in an XML document entity must
be parseable XML. You can define an external entity as either a parsed
entity (parseable XML) or an unparsed entity (anything else). Unparsed
entities can be used for picture files, sound files, movie files, or
whatever you like. They can only be referenced from within a document as
the value of an attribute (much like a bitmap picture on an HTML Web
page is the value of the img element's
src attribute) and not part of the actual
document. In an XML document, this attribute must be declared to be of
type ENTITY, and the entity's declaration must
specify a declared NOTATION, because if the entity
isn't XML, the XML processor needs to know what it is. For example,
in the following document, the colliepic
entity is declared to have a JPEG notation, and it's used as the
value of the empty dog element's picfile
attribute."
<?xml version="1.0"?>
<!DOCTYPE dog [
<!NOTATION JPEG SYSTEM "Joint Photographic Experts Group">
<!ENTITY colliepic SYSTEM "lassie.jpg" NDATA JPEG>
<!ELEMENT dog EMPTY>
<!ATTLIST dog picfile ENTITY #REQUIRED>
]>
<dog picfile="colliepic"/>
"The XLink and XPointer linking
specifications describe other ways to point to a non-XML file such as a
graphic. These offer more sophisticated control over the external entity's
position, handling, and appearance within the XML document."
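An application can recover the file name and notation behind the picfile attribute from the declarations in the dog example. The sketch below assumes Python's standard xml.sax module, whose DTDHandler interface reports NOTATION and unparsed ENTITY declarations; the class name PictureCollector is hypothetical. What the application then does with lassie.jpg (display it, link to it, ignore it) is up to the application.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, DTDHandler

DOC = b"""<?xml version="1.0"?>
<!DOCTYPE dog [
<!NOTATION JPEG SYSTEM "Joint Photographic Experts Group">
<!ENTITY colliepic SYSTEM "lassie.jpg" NDATA JPEG>
<!ELEMENT dog EMPTY>
<!ATTLIST dog picfile ENTITY #REQUIRED>
]>
<dog picfile="colliepic"/>"""

class PictureCollector(ContentHandler, DTDHandler):
    """Record the declarations, then resolve each picfile attribute."""

    def __init__(self):
        super().__init__()
        self.notations = {}   # notation name -> system identifier
        self.entities = {}    # entity name -> (system identifier, notation)
        self.pictures = []    # resolved picfile attributes, in document order

    def notationDecl(self, name, publicId, systemId):
        self.notations[name] = systemId

    def unparsedEntityDecl(self, name, publicId, systemId, ndata):
        self.entities[name] = (systemId, ndata)

    def startElement(self, name, attrs):
        if name == "dog":
            # The attribute value is the entity name, not the file name.
            self.pictures.append(self.entities[attrs.getValue("picfile")])

collector = PictureCollector()
parser = xml.sax.make_parser()
parser.setContentHandler(collector)
parser.setDTDHandler(collector)
parser.parse(io.BytesIO(DOC))
print(collector.pictures)
```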
(It would, however, be possible to include a text-encoded
transformation of a binary file as a CDATA marked
section, using something like UUencode with the markup characters
] and > removed from
the map so that they could not occur and be misinterpreted.)
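A sketch of that approach, with one substitution stated plainly: it uses Base64 rather than UUencode, because the Base64 alphabet contains none of the characters ], >, < or &, so no map needs adjusting. It assumes Python's standard base64 and xml.etree.ElementTree modules; the image element type and its encoding attribute are hypothetical.

```python
import base64
import xml.etree.ElementTree as ET

# Hypothetical binary payload: the eight-byte PNG signature, which contains
# bytes that could not appear literally in an XML file.
binary = b"\x89PNG\r\n\x1a\n"

# Base64 output uses only A-Z, a-z, 0-9, +, / and =.
encoded = base64.b64encode(binary).decode("ascii")
xml_text = f'<image encoding="base64"><![CDATA[{encoded}]]></image>'
print(xml_text)

# Round trip: parse the XML and decode the text back to the original bytes.
decoded = base64.b64decode(ET.fromstring(xml_text).text)
```

The receiving application has to know that the content is encoded and how to decode it; XML itself treats it as ordinary text.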