
3. Authors of SGML (including writers of HTML)

Does XML replace HTML?
What does an XML document look like inside?
How does XML handle white-space in my documents?
Which parts of an XML document are case-sensitive?
How can I make my existing HTML files work in XML?
Is there an XML version of HTML?
If XML is just a subset of SGML, can I use XML files directly with SGML tools?
I'm used to authoring and serving HTML. Can I learn XML easily?
Will XML be able to use non-Latin characters?
What's a Document Type Definition (DTD) and where do I get one?
I keep hearing about alternatives to DTDs. What's a schema?
How will XML affect my document links?
Can I do mathematics using XML?
How does XML handle metadata?
Can I use Java, ActiveX, etc in XML files?
Can I use Java to create or manage XML files?
How do I control appearance?
How do I use graphics in XML?

Does XML replace HTML?

No. XML itself does not replace HTML: instead, it provides an alternative which allows you to define your own set of markup elements. HTML is expected to remain in common use for some time to come, and Document Type Definitions for HTML will be available in XML versions as well as in original SGML. XML is designed to make the writing of DTDs much simpler than with full SGML. (See the question on DTDs for what one is and why you'd want one.)
Work is going on to produce XML versions of HTML and other popular existing DTDs, but this may not take off until more stable software is available. Watch comp.text.sgml, comp.text.xml, XML-L, and xml-dev for announcements.

What does an XML document look like inside?

The basic structure is very similar to most other applications of SGML, including HTML. XML documents can be very simple, with no document type declaration, and straightforward nested markup of your own design:
<?xml version="1.0" standalone="yes"?>
<conversation>
  <greeting>Hello, world!</greeting>
  <response>Stop the planet, I want to get off!</response>
</conversation>
Or they can be more complicated, with a DTD specified (see the question on DTDs), and maybe an internal subset, and a more complex structure:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE titlepage SYSTEM "http://www.frisket.org/dtds/typo.dtd"
  [<!ENTITY % active.links "INCLUDE">]>
<titlepage>
  <white-space type="vertical" amount="36"/>
  <title font="Baskerville" size="24/30" alignment="centered">Hello, world!</title>
  <white-space type="vertical" amount="12"/>
  <!-- In some copies the following decoration is hand-colored, presumably by the author -->
  <image location="http://www.foo.bar/fleuron.eps" type="URL" alignment="centered"/>
  <white-space type="vertical" amount="24"/>
  <author font="Baskerville" size="18/22" style="italic">Vitam capias</author>
</titlepage>
Or they can be anywhere between: a lot will depend on how you want to define your document type (or whose you use) and what it will be used for. See the question on valid and well-formed files.

How does XML handle white-space in my documents?

The SGML rules regarding white-space have been changed for XML, so all white-space, including linebreaks, TAB characters, and regular spaces, even between elements where no text can appear, is passed by the parser unchanged to the application (browser, formatter, viewer, etc). For example:
<chapter>
  <section>
    <title>
      My title for Section 1.
    </title>
    <p>
      ...
    </p>
  </section>
</chapter>
The parser must, however, still inform the application that white-space has occurred in element content, if it can detect it. (Users of standard SGML may recognize that this information was not in the ESIS, but it is in the grove.) In the above example, the application will receive all the pretty-printing linebreaks, TABs, and spaces between the elements as well as those embedded in the section title. It is the function of the application (browser, formatter, viewer, etc) to decide which type of white-space to discard and which to retain.
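As a rough illustration of the behaviour described above, the sketch below uses Python's standard xml.etree.ElementTree module (an assumption made for illustration; this FAQ does not prescribe any particular parser) to show the pretty-printing white-space arriving at the application, which then decides what to keep. The helper name normalize is hypothetical.

```python
import xml.etree.ElementTree as ET

# A pretty-printed document: indentation and linebreaks are only there for readability.
DOC = """<chapter>
  <section>
    <title>
      My title for Section 1.
    </title>
    <p>...</p>
  </section>
</chapter>"""

root = ET.fromstring(DOC)
title = root.find("section/title")


def normalize(text):
    """Application-level choice: collapse runs of white-space to single spaces."""
    return " ".join(text.split())


# White-space between <chapter> and <section> is handed over untouched...
print(repr(root.text))
# ...as is the white-space embedded in the title.
print(repr(title.text))
# It is the application, not the parser, that decides to tidy it up.
print(normalize(title.text))
```

Here the parser discards nothing; collapsing the title's white-space is a decision taken by the application code.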

Which parts of an XML document are case-sensitive?

All of it, both markup and text. This is significantly different from HTML and most other SGML document types. It was introduced to allow markup in non-Latin-alphabet languages and to obviate problems with case-folding in scripts which are caseless.
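A small sketch of what case-sensitivity means in practice, again assuming Python's standard xml.etree.ElementTree parser purely for illustration (the function name is_well_formed is hypothetical): a start-tag and end-tag that differ only in case do not match, and attribute names differing only in case are distinct attributes.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# Matching case: accepted.
print(is_well_formed("<Title>Hello</Title>"))
# <Title> closed by </title>: the names differ, so the document is rejected.
print(is_well_formed("<Title>Hello</title>"))
# HREF and href are two different attribute names, so both may appear together.
print(ET.fromstring('<a HREF="x" href="y"/>').attrib)
```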

How can I make my existing HTML files work in XML?

Make them well-formed DTD-less documents (see below) and write a stylesheet. A DTD (Document Type Definition) is optional in XML, but HTML files converted to XML format currently have to be DTD-less because there are as yet few working XML versions of the current SGML-based HTML DTDs (they need to be substantially edited to remove their dependence on those features of SGML which are excluded from XML).
It is necessary to convert existing HTML files to be well-formed because XML does not allow end-tag minimization (missing </p>, etc) which is allowed in most HTML DTDs. Many HTML authoring tools already produce almost (but not quite) well-formed XML. As a preparation for XML, the W3C's HTML Tidy program can clean up some of the formatting mess left behind by inadequate HTML editors.
If you want to move your files out of HTML into some other DTD entirely, there is a pilot site run by CommerceNet (http://www.xmlx.com/) for the exchange of XML DTDs, and a pilot FPI server at http://www.ucc.ie/cgi-bin/public with several common SGML DTDs to start from.
If you have created your HTML files conforming to one of the several HTML Document Type Definitions (DTDs), and they validate OK, then they can be converted as follows:
Be aware that many HTML browsers may not accept XML-style EMPTY elements with the trailing slash, so the above changes may not be backwards-compatible. An alternative is to add a dummy end-tag to all EMPTY elements, so <IMG src="foo.gif"> becomes <IMG src="foo.gif"></IMG>.
If you have a lot of valid HTML files, you could write a script to do this in a programming language which understands SGML/XML markup (such as Omnimark, Balise, SGMLC, or a system using one of the SGML libraries for Perl, Python, or Tcl), or you could even use editor macros if you know what you're doing.
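As a rough sketch of the kind of script meant here, the following Python fragment rewrites a small, assumed sample of HTML EMPTY elements (IMG, BR, HR) into the XML form with a trailing slash. It uses a regular expression rather than a real SGML parser, so it is only an illustration: it will be fooled by a > inside an attribute value, and it does nothing about missing end-tags or unquoted attributes. The names EMPTY_ELEMENTS and close_empty_elements are hypothetical.

```python
import re

# Assumed sample only: a real script would list every EMPTY element in the DTD used.
EMPTY_ELEMENTS = ("IMG", "BR", "HR")

# Matches a start-tag of one of the listed elements, with or without a trailing slash.
_EMPTY_TAG = re.compile(
    r"<(%s)\b([^<>]*?)\s*/?>" % "|".join(EMPTY_ELEMENTS),
    re.IGNORECASE,
)


def close_empty_elements(html):
    """Rewrite <IMG ...> style tags as <IMG .../> (XML EMPTY-element form)."""
    return _EMPTY_TAG.sub(lambda m: "<%s%s/>" % (m.group(1), m.group(2)), html)


print(close_empty_elements('<P>See <IMG src="foo.gif"><BR>below.</P>'))
```

A parser-based tool remains the safer choice for anything beyond regular, well-behaved input.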
If your HTML files are invalid (HTML created by most WYSIWYG editors is invalid) then they will almost certainly have to be converted manually, although if the deformities are regular and carefully constructed, the files may actually be almost well-formed, and you could write a program or script to do as described above. To test for invalidity and non-conformance, check the following:
Markup which is valid but which is meaningless or void may need to be edited out before conversion (such as repeated empty paragraphs or linebreaks, empty tables, invisible 'spacing' GIFs etc: XML uses stylesheets, so you won't need any of these).
The rules for 'well-formed' XML files give details of what you need to check when converting to XML.

Is there an XML version of HTML?

There are XML versions of the HTML DTD in preparation but none ready yet:

If XML is just a subset of SGML, can I use XML files directly with SGML tools?

Yes, provided you use SGML software which knows about the new WebSGML Adaptations to ISO 8879 (features needed to support XML, such as the special form for EMPTY elements; some aspects of the SGML Declaration such as NAMECASE GENERAL NO; multiple attribute declarations, etc).
An alternative is to use an SGML DTD to let you create an SGML file, but one which does not use empty elements; and then remove the DocType Declaration so it becomes a well-formed DTDless XML file.
At the moment there are few tools which handle XML files unchanged because of the format of these EMPTY elements, but this is changing. The nsgmls parser has an XML conformance switch, introduced for use with Jade, and the first XML-specific editors and parsers are in use (see the question on software).

I'm used to authoring and serving HTML. Can I learn XML easily?

Yes, very easily, but at the moment there is still a need for tutorials, simple tools, and more examples of XML documents. Well-formed XML documents may look similar to HTML except for some small but very important points of syntax.
The big practical difference is that XML has to stick to the rules. HTML browsers let you create broken HTML because they elide all the broken bits: with XML your files have to be correct or they simply won't work.

Will XML be able to use non-Latin characters?

Yes, the XML Specification explicitly says XML uses ISO 10646, the international standard 31-bit character repertoire which covers most human (and some non-human) languages. This is currently congruent with Unicode.
The spec says (2.2): "All XML processors must accept the UTF-8 and UTF-16 encodings of ISO 10646...". UTF-8 is an encoding of Unicode into 8-bit units: the first 128 values are the same as ASCII, and the remaining values are used to encode the rest of Unicode into multi-byte sequences (between 2 and 6 bytes as originally defined; current UTF-8 uses at most 4). UTF-8 in its single-octet form is therefore the same as ISO 646 IRV (ASCII), so you can continue to use ASCII for English or other unaccented languages using the Latin alphabet. Note that UTF-8 is incompatible with ISO 8859-1 (ISO Latin-1) after code point 127 decimal (the end of ASCII). UTF-16 encodes characters as 16-bit units, with a scheme (surrogate pairs) to represent the next 16 planes of 64k characters as two 16-bit units.
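The byte counts above can be checked with a short Python sketch using only the built-in str.encode method. The function name encoded_lengths is hypothetical, and the sample characters are chosen here purely for illustration.

```python
def encoded_lengths(ch):
    """Return the number of bytes one character occupies in UTF-8 and UTF-16."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        # Big-endian form is used so that no byte-order mark is counted.
        "utf-16": len(ch.encode("utf-16-be")),
    }


# ASCII letter, accented Latin-1 letter, euro sign, and a character outside the first plane.
for ch in ("A", "\u00e9", "\u20ac", "\U0001D11E"):
    print("U+%04X" % ord(ch), encoded_lengths(ch))

# ASCII text is byte-for-byte identical in UTF-8...
print("A".encode("utf-8") == "A".encode("ascii"))
# ...but an accented letter is encoded differently in UTF-8 and ISO 8859-1.
print("\u00e9".encode("utf-8") == "\u00e9".encode("latin-1"))
```

The last character in the sample needs two 16-bit units in UTF-16, which is the surrogate-pair scheme mentioned above.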
"...the mechanisms for signalling which of the two are in use, and for bringing other encodings into play, are [...] in the discussion of character encodings." The XML Specification explains how to specify in your XML file which coded character set you are using.
Use of UCS-4 can only legally be specified in SGML or XML when the WebSGML Adaptations to ISO 8879 are implemented: this enables numbers longer than eight digits to be used in the SGML Declaration.
"Regardless of the specific encoding used, any character in the ISO 10646 character set may be referred to by the decimal or hexadecimal equivalent of its bit string": so no matter which character set you personally use, you can still refer to specific individual characters from elsewhere in the encoded repertoire by using &#dddd; (decimal character code) or &#xHHHH; (hexadecimal character code, in uppercase). The terminology can get confusing, as can the numbers: see the ISO 10646 Concept Dictionary. Rick Jelliffe has "XML-ized" the ISO character entity sets.

What's a Document Type Definition (DTD) and where do I get one?

A DTD is a file (or several files to be used together), written in XML's declaration syntax, which contains a formal definition of a particular type of document. It sets out what names can be used for element types, where they may occur, and how they all fit together. For example, if you want a document type to be able to describe <List>s which contain <Item>s, part of your DTD would contain something like
<!ELEMENT List (Item)+>
<!ELEMENT Item (#PCDATA)>
This fragment defines a list as an element type containing one or more items (that's the plus sign), and items as element types containing just text. XML's declaration syntax is the formal specification language which processors read to parse the DTD automatically; they then use that information to identify where every element type comes and how each relates to the others, so that stylesheets, navigators, browsers, search engines, databases, printing routines, and other applications can be used. The above fragment lets you create lists which get stored as:
<List><Item>Chocolate</Item><Item>Music</Item><Item>Surfing</Item></List>
How the list appears in print or on the screen depends on your stylesheet: you do not normally need to put anything in the XML to affect formatting in the way that had to be done with HTML before stylesheets.
In effect, a DTD provides applications with advance notice of what names and structures can be used in a particular document type. Using a DTD means you can be certain that all documents which belong to a particular type will be constructed and named in a conformant manner.
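To make the idea of "advance notice" concrete, here is a hand-rolled Python sketch that checks a document against the List/Item content model from the fragment above. This is not a validating parser (Python's standard library does not include one); it simply mimics what the two declarations require, and the function name matches_list_model is hypothetical.

```python
import xml.etree.ElementTree as ET


def matches_list_model(xml_text):
    """Check <!ELEMENT List (Item)+> and <!ELEMENT Item (#PCDATA)> by hand."""
    root = ET.fromstring(xml_text)
    if root.tag != "List":
        return False
    items = list(root)
    # (Item)+ means at least one Item is required.
    if not items:
        return False
    # Element content: only white-space may appear between the Items.
    if (root.text or "").strip():
        return False
    for item in items:
        # Every child must be an Item holding text only, with no child elements.
        if item.tag != "Item" or len(item) > 0:
            return False
        if (item.tail or "").strip():
            return False
    return True


print(matches_list_model(
    "<List><Item>Chocolate</Item><Item>Music</Item><Item>Surfing</Item></List>"
))
```

A validating parser derives checks like these from the DTD itself, which is why documents of one type can be relied on to share the same structure.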
There are thousands of SGML DTDs already in existence in all kinds of areas (see the SGML Web pages for examples). Many of them can be downloaded and used freely; or you can write your own. As with any language, you need to learn it to do this (see for example Developing SGML DTDs by Maler and el Andaloussi, Prentice Hall, 1997, 0-13-309881-8): but XML is much simpler than full SGML: see the list of restrictions which shows what has been cut out. Existing SGML DTDs do need to be converted to XML for use with XML systems: read the question on converting SGML DTDs to XML, and expect to see announcements of popular DTDs eventually becoming available in XML format.

I keep hearing about alternatives to DTDs. What's a schema?

Bob DuCharme writes: "Many XML developers are dissatisfied with the syntax of the markup declarations described in the XML spec for two reasons. First, they feel that if XML documents are so good at describing structured information, then the description of a document type's structure (its 'schema') should be in an XML document instead of written with its own special syntax. In addition to being more consistent, this would make it easier to edit and manipulate the schema with regular document manipulation tools. Secondly, they feel that traditional DTD notation doesn't allow schema designers the power to impose enough constraints on the data, for example, the ability to say that a certain element type must always have a positive integer value, that it may not be empty, or that it must be one of a list of possible choices. This would ease the development of software using that data because the developer would have less error-checking code to write."
Users from a database or computer science background should be aware that SGML systems -- and that includes XML -- are not database management systems: they are text markup systems. While there are many similarities, such as the ones described here, some of the concepts of one are simply non-existent in the other: XML does not possess some database-like features in the same way that DBMSs do not possess markup-like ones.
"Several groups have submitted proposals to the W3C for alternative ways to express document type schemata. In addition to offering schema constraints like data typing and the others described here, many take advantage of other current trends in software development such as object-oriented methodologies. The W3C Schema Working Group is currently reviewing these proposals and developing their own proposal based on the most useful features suggested by the existing proposals and the members of the Working Group."

How will XML affect my document links?

The linking abilities of XML systems are much more powerful than those of HTML, so you'll be able to do much more with them. Existing HREF-style links will remain usable, but the new linking technology is based on the lessons learned in the development of other standards involving hypertext, such as TEI and HyTime, which let you manage bidirectional and multi-way links, as well as links to a span of text (within your own or other documents) rather than to a single point. These features have been available to standard SGML users in browsers like DynaText, Panorama and Multidoc Pro for many years, so there is considerable experience and expertise available in using them.
The XML Linking Specification (XLink) and XML Extended Pointer Specification (XPointer) documents contain a detailed draft specification. An XML link can be either a URL or a TEI-style Extended Pointer, or both. A URL on its own is assumed to be a resource (as with HTML); if an XPointer follows it, it is assumed to be a sub-resource of that URL; an XPointer on its own is assumed to apply to the current document.
An XPointer is always preceded by one of #, ?, or |. The # and ? mean the same as in HTML applications; the | means the sub-resource can be found by applying the XPointer to the resource, but the method of doing this is left to the application.
The TEI Extended Pointer Notation (EPN) is much more powerful than the 'fragment address' on the end of some URLs, as it allows you to specify the location of a link end using the structure of the document instead of (or in addition to) known, fixed points like IDs. For example, the linked second occurrence of the word 'XPointer' two paragraphs back could be referred to as http://www.ucc.ie/xml/faq.sgml#ID(faq-hypertext)CHILD(2,*)(6,*), meaning the sixth child object within the second child object after the element whose ID is faq-hypertext. Count the objects from the start of this question in the SGML version (which has the ID "faq-hypertext"):
the title of the question; <section id="faq-hypertext"> <TITLE>How will XML affect my document links?</TITLE>
the second paragraph:
If you view this file with Panorama or MultiDoc Pro you can click on the highlighted cross-reference button at the start of the example sentence, and it will display the locations in Extended Pointer Notation of all the links to it, including the word "XPointer" mentioned. (Doing this in an HTML browser is not meaningful, as they do not support bidirectional linking or EPN.) David Megginson has produced an additional function for Emacs/psgml which will deduce an XPointer for any location in an SGML or XML file.
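The structural addressing described in this answer, an ID step followed by CHILD steps, can be sketched in a few lines of Python. This is only an approximation under stated assumptions: it treats * as "any element", ignores text nodes when counting, and works on a made-up miniature document rather than the real FAQ source. The function name resolve and the sample markup are hypothetical, and this is not an XPointer implementation.

```python
import xml.etree.ElementTree as ET


def resolve(root, id_value, child_steps):
    """Find the element with the given id, then follow 1-based child positions."""
    node = next((el for el in root.iter() if el.get("id") == id_value), None)
    if node is None:
        return None
    for position in child_steps:
        children = list(node)
        if position < 1 or position > len(children):
            return None
        node = children[position - 1]
    return node


# Made-up stand-in for the FAQ source: the second child of the section is a
# paragraph, and that paragraph's sixth child element is the link target.
SAMPLE = (
    '<faq><section id="faq-hypertext">'
    "<TITLE>How will XML affect my document links?</TITLE>"
    "<P><a/><b/><c/><d/><e/><xref>XPointer</xref></P>"
    "</section></faq>"
)

# Roughly ID(faq-hypertext)CHILD(2,*)(6,*) in the notation above.
target = resolve(ET.fromstring(SAMPLE), "faq-hypertext", [2, 6])
print(target.tag, target.text)
```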

Can I do mathematics using XML?

Yes, if the document type you use provides for math. The mathematics-using community is developing software, and there is a MathML proposal at the W3C, which is a native XML application. It would also be possible to make XML fragments from the long-expired HTML3, HTML Pro, or ISO 12083 Math, or OpenMath, or one of your own making. Browsers which display some math embedded in SGML already exist (eg DynaText, Panorama, Multidoc Pro).
The sophistication could vary from math expressions like $x_i$ through simple inline equations such as $E = mc^2$ to complicated display equations. The Techexplorer plugin from IBM can be used with regular HTML browsers to render TeX math, and the Amaya testbed browser at the W3C has an experimental MathML display.

How does XML handle metadata?

Because XML lets you define your own markup language, you can make full use of the extended hypertext features (see the question on Links) of XML to store or link to metadata in any format (eg ISO 11179, Dublin Core, Warwick Framework, Resource Description Framework (RDF), and Platform for Internet Content Selection (PICS)).
There are no predefined elements in XML, because it is an architecture, not an application, so it is not part of XML's job to specify how or if authors should or should not implement metadata. You are therefore free to use any suitable method from simple attributes to the embedding of entire Dublin Core/Warwick Framework metadata records. Browser makers may also have their own architectural recommendations or methods to propose.

Can I use Java, ActiveX, etc in XML files?

This depends on what facilities the browser makers implement. XML is about describing information; scripting languages and languages for embedded functionality are software which enables the information to be manipulated at the user's end.
XML itself provides a way to define the markup needed to implement scripting languages: as a neutral standard it neither encourages nor discourages their use, and does not favour one language over another, so the field is wide open.
Scripting languages are provided for in a proposal for an Extensible Style Language, XSL (see question on Stylesheets).

Can I use Java to create or manage XML files?

Yes, any programming language can be used to output data from any source in XML format. There is a growing number of front-ends and back-ends for programming environments and data management environments to automate this.
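The same is true of languages other than Java. As a minimal sketch, the Python fragment below turns a plain list of strings into the List/Item markup used as an example elsewhere in this FAQ, using the standard xml.etree.ElementTree module; the function name rows_to_xml is hypothetical. Letting a library write the markup means characters such as & and < in the data are escaped automatically.

```python
import xml.etree.ElementTree as ET


def rows_to_xml(rows):
    """Serialize a list of strings as <List><Item>...</Item>...</List>."""
    root = ET.Element("List")
    for row in rows:
        ET.SubElement(root, "Item").text = row
    # encoding="unicode" returns a str rather than encoded bytes.
    return ET.tostring(root, encoding="unicode")


print(rows_to_xml(["Chocolate", "Music", "Surfing"]))
print(rows_to_xml(["R&B <live>"]))
```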

How do I control appearance?

The use of a stylesheet is required for XML. Some browsers may possibly provide simple default styles for popular elements like <Para>, or <List> containing <Item>, but in general a stylesheet gives the author much better control of the layout. But as with any system where files can be viewed at random by arbitrary users, the author cannot know what resources (such as fonts) are on the user's system, so care is needed.
Arbortext's experimental XML Styler has details of how to use it with XSL. You will also need the ActiveX controls and XSL codebase.
There are also many pre-existing proprietary stylesheet systems and implementations, many of which are deeply embedded in the technical documentation community (and thus heavily supported by one or more products):
Most browser and editor vendors appear to be committing to a move to XSL but with a large installed user base for their existing systems this will probably not occur quickly.

How do I use graphics in XML?

Graphics are just links which happen to have a picture file at the end rather than another piece of text, so they can be done in any way supported by the XLink and XPointer specifications (see earlier question), including using similar syntax to existing HTML images. They can also be done using XML's built-in NOTATION and ENTITY mechanism in a similar way to standard SGML. The linking specifications, however, give you much better control over the traversal and activation of links, so an author can specify, for example, whether or not to have an image appear when the page is loaded, or on a click from the user, or in a separate window, without having to resort to scripting. Which graphic file formats will be supported is a matter for the browser makers: XML itself doesn't predict or restrict you. GIF, JPG, TIFF, PNG, and CGM at a minimum would seem to make sense: there are moves towards creating a networked vector graphics standard (see next paragraph).
Peter Murray-Rust writes: "GIFs and JPEGs cater for bitmaps (pixel representations of images). Vector graphics (scaleable) are being addressed in the W3C's graphics activity (see http://www.w3.org/Graphics/Activity). When a consensus is reached it will be possible to transmit the graphics representation within the XML file. For many graphics objects this will mean greatly decreased download time and scaling without loss of detail."
You cannot embed a raw graphics file (or any other binary [non-text] data) directly into an XML file because any bytes resembling markup would get misinterpreted: you must refer to it by linking (see below).
Bob DuCharme adds: "All the data in an XML document entity must be parseable XML. You can define an external entity as either a parsed entity (parseable XML) or an unparsed entity (anything else). Unparsed entities can be used for picture files, sound files, movie files, or whatever you like. They can only be referenced from within a document as the value of an attribute (much like a bitmap picture on an HTML Web page is the value of the img element's src attribute) and not part of the actual document. In an XML document, this attribute must be declared to be of type ENTITY, and the entity's declaration must specify a declared NOTATION, because if the entity isn't XML, the XML processor needs to know what it is. For example, in the following document, the colliepic entity is declared to have a JPEG notation, and it's used as the value of the empty dog element's picfile attribute."
<?xml version="1.0"?>
<!DOCTYPE dog [
  <!NOTATION JPEG SYSTEM "Joint Photographic Experts Group">
  <!ENTITY colliepic SYSTEM "lassie.jpg" NDATA JPEG>
  <!ELEMENT dog EMPTY>
  <!ATTLIST dog picfile ENTITY #REQUIRED>
]>
<dog picfile="colliepic"/>
"The XLink and XPointer linking specifications describe other ways to point to a non-XML file such as a graphic. These offer more sophisticated control over the external entity's position, handling, and appearance within the XML document."
(It would, however, be possible to include a text-encoded transformation of a binary file as a CDATA marked section, using something like UUencode with the markup characters ] and > removed from the map so that they could not occur and be misinterpreted.)
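A sketch of this text-encoding approach follows. Note the substitution: it uses Base64 rather than UUencode, because the standard Base64 alphabet already contains neither ] nor >, so the encoded data can never contain the ]]> sequence that would end the CDATA section. The element name payload and both function names are hypothetical.

```python
import base64
import xml.etree.ElementTree as ET


def wrap_binary(data, tag="payload"):
    """Encode bytes as Base64 text inside a CDATA marked section."""
    encoded = base64.b64encode(data).decode("ascii")
    return "<%s><![CDATA[%s]]></%s>" % (tag, encoded, tag)


def unwrap_binary(xml_text):
    """Recover the original bytes from a document produced by wrap_binary."""
    return base64.b64decode(ET.fromstring(xml_text).text)


# Bytes that would be misread as markup if embedded raw.
raw = b"\x00<tag>]]>&\xff"
document = wrap_binary(raw)
print(document)
print(unwrap_binary(document) == raw)
```

The cost is size: Base64 text is about a third larger than the binary data it represents.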
