2. Research
HTML: The Good Stuff
HTML is probably the most portable markup language in the world. It is supported by over 100 million Web browsers, and is becoming the de facto standard for transmitting information between people. HTML has many advantages:
- HTML browsers are cheap or free, and very powerful; with a combination of third-party add-ins and server-side content support, a vast range of information is being delivered to HTML browsers.
- HTML document browser interfaces are easy to build into existing products because of the simplicity of HTML.
- HTML is easy to learn because it is very simple. There are only a couple dozen tags, and fewer than half of them are used in most situations.
- After just a few years of working with HTML, users have found that the hypertext model really does work across systems that are otherwise unrelated. Any page can link to any other publicly accessible page simply by specifying its address.
- There are some specialized structures in HTML, but they are mostly used to achieve a particular formatting effect.
Because of the simplicity and low cost of HTML, a huge information base has been formed that makes HTML even more valuable.
HTML: The Bad Stuff
HTML's simplicity, while making it valuable as a basic way of delivering simply structured information, causes it to fall short of being a long-term method of delivering complex information types. HTML is a very weak presentation tool that lacks even the most fundamental page-oriented formatting capabilities, like hanging indents, white-space control, justification, kerning, and hyphenation. HTML does not handle multiple-column snaking very well, either.
However, because of HTML's nearly universal compatibility, web site designers are getting around these problems by using tables to fake multiple columns and indents, GIF graphics to create certain designs with type and white space, and other such machinations. In such cases, HTML itself has simply become a shell that contains the real markup.
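The table workaround described above can be sketched in a few lines of Python. This is only an illustration: the function name fake_columns, the fixed split of paragraphs between cells, and the attribute values are assumptions made for the example, not a recommended technique.

```python
from html import escape


def fake_columns(paragraphs, columns=2):
    """Lay paragraphs out in side-by-side table cells to imitate text columns."""
    per_column = -(-len(paragraphs) // columns)  # ceiling division
    cells = []
    for i in range(columns):
        chunk = paragraphs[i * per_column:(i + 1) * per_column]
        body = "".join(f"<p>{escape(text)}</p>" for text in chunk)
        cells.append(f'<td valign="top" width="{100 // columns}%">{body}</td>')
    # The table holds no tabular data at all; it exists purely for layout.
    return '<table border="0" cellspacing="10"><tr>' + "".join(cells) + "</tr></table>"


html = fake_columns(["First.", "Second.", "Third."])
print(html)
```

Note that the text does not "snake": the split between cells is fixed when the page is generated, so the columns cannot rebalance when the window or the type size changes.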
HTML is also a weak markup tool, because it does not allow for creating custom tags or presenting tags with different styles. There is no real modularity, and there are no hierarchical relationships between elements. This limits HTML to delivering page-oriented information instead of being a method to deliver intelligent information.
HTML provides linking capabilities, but the linking is rudimentary; it is only a one-to-one link, and requires an anchor on the target end in order to access anything within the document. This is fine for most purposes, but such simple linking capabilities will limit HTML's long-term viability.
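The dependence on a target-side anchor can be made concrete with a small Python sketch built on the standard html.parser module. The page content, the file name guide.html, and the names AnchorCollector and fragment_resolves are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect the named targets that one HTML page exposes to links."""

    def __init__(self):
        super().__init__()
        self.targets = set()  # names a link may point at inside this page

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # <a name="..."> is the classic target anchor; id attributes
        # (HTML 4) can serve the same purpose.
        if tag == "a" and attrs.get("name"):
            self.targets.add(attrs["name"])
        if attrs.get("id"):
            self.targets.add(attrs["id"])


def fragment_resolves(href, page_html):
    """Return True if the '#fragment' of href names a target in page_html."""
    _, _, fragment = href.partition("#")
    if not fragment:
        return True  # a link to the whole page needs no anchor
    collector = AnchorCollector()
    collector.feed(page_html)
    return fragment in collector.targets


# Hypothetical target page: only the first section carries an anchor.
PAGE = """
<h2><a name="install">Installing</a></h2>
<p>Run the installer.</p>
<h2>Troubleshooting</h2>
<p>Check the log file.</p>
"""

print(fragment_resolves("guide.html#install", PAGE))          # True
print(fragment_resolves("guide.html#troubleshooting", PAGE))  # False
print(fragment_resolves("guide.html", PAGE))                  # True
```

The second link fails even though the "Troubleshooting" section plainly exists, because the author of the target page never marked it. The linking author has no way to address it from the outside.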
Another major problem with HTML is its instability. First, there was HTML, then HTML+, then HTML 2, then a series of decimal-point specifications in the threes, and now a level-4 HTML. Plus, browser manufacturers have created extensions to the "standard" HTML, like the "blink" and "center" tags. This has caused other browser manufacturers to play "catch-up".
The combination of this instability and HTML's simplicity has led to a great deal of markup that breaks when displayed in some browsers. The presence of "Best Viewed with Microsoft Internet Explorer" or "Optimized for Netscape" banners attests to the fact that a page has been crafted to work best with a particular browser, at the expense of all of the others. This balkanization of HTML has left webmasters and users frustrated and looking for a better solution.
One way the keepers of HTML are trying to extend it is by allowing more customized styles while keeping the same markup. This is done with cascading style sheets (CSS), a technical recommendation of the W3C (World Wide Web Consortium). CSS separates structure (the HTML markup) from format (how it looks). While this is a good step, we are still stuck with the basic HTML tag set.
SGML: The Good Stuff
SGML is an international standard that is more than ten years old. It was originally designed to provide a way of describing text-based information so organizations could interchange information easily. Since then, SGML has become valuable in describing information sets so companies can get beyond the restrictions of paper-only publishing. SGML provides a way of creating markup languages customized for each document type, and separating the content from eventual formatting.
SGML is not tied to a particular operating system or application, and so it is portable from platform to platform. It is a standard maintained by the International Organization for Standardization (ISO). This means that it is very stable: while there are provisions for updating and changing ISO standards, the organization makes it very difficult to do so. This has turned out to be a real advantage, since it provides companies with a known syntax.
Because of its standing as a stable standard, there are many products available in every category, from editors to document management solutions, to typesetting and web delivery applications. Many vendors are providing tools and support in each category. These products range in price from free to very expensive. Since SGML is platform independent, these tools can be mixed and matched as desired, relieving companies of the risk of having their information locked into one vendor's product.
SGML does not provide a fixed set of tags, but, rather, a syntax for creating your own tags. Many industries have formed consortia for the purpose of creating common tag sets to interchange information using their terms and expressions.
SGML: The Bad Stuff
SGML is complicated to understand and difficult to integrate into an application. SGML requires a "parser", which is difficult to write and maintain. Since SGML was created in the early days of desktop computers, it is overly concerned with conserving limited memory and disk space, and it provides a complex set of "minimization" rules and exceptions to do so.
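One familiar form of minimization is the omitted end tag, which HTML inherits from SGML. The Python sketch below uses the standard xml.etree.ElementTree module as a stand-in for a parser that infers nothing; it is not an SGML parser, and the sample list and the function name parses_without_inference are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# End tags for <li> are omitted, as SGML-style minimization permits in HTML.
MINIMIZED = "<ul><li>editors<li>parsers<li>formatters</ul>"

# The same list with every end tag written out.
EXPLICIT = "<ul><li>editors</li><li>parsers</li><li>formatters</li></ul>"


def parses_without_inference(markup):
    """Return True if a parser that infers no missing tags can read the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


print(parses_without_inference(MINIMIZED))  # False
print(parses_without_inference(EXPLICIT))   # True
print([li.text for li in ET.fromstring(EXPLICIT)])
# ['editors', 'parsers', 'formatters']
```

To read the minimized form correctly, a parser has to know from the document type definition that one list item cannot contain another, and close the open item on the author's behalf. Supporting rules of this kind is a large part of what makes an SGML parser hard to write.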
This complexity results in SGML being more expensive than a simple tag set like HTML. Each document must have a "document type definition" created, which requires the owner of the document to perform a "document analysis" to discover its structure. There is still a relatively small market for SGML, so people who are experienced in document analysis and document type definition design are hard to find and expensive to keep.
Also, because of the complexity of the standard, and the smaller market for vendors, tools that support SGML are more expensive than those that support HTML.
PDF: The Good Stuff
PDF, the portable document format, was designed by Adobe Systems Inc., in order to provide a system-independent way of delivering page-based information. PDF files are created by printing to a PDF driver or by "distilling" a PostScript file. The resulting PDF file can be read using a tool from Adobe called "Acrobat Reader". This tool is available for free on most popular systems in use today.
PDF provides electronic pages with impressive page fidelity. Type, graphics, and color are all reproduced as they are on paper. Even hot links and other electronic object types, like movies and sounds, can be added to a PDF file. PDF files are cheap to create, and are used by many companies to deliver page-formatted information without the high cost of postage.
Since the end user gets something that looks very much like paper, training costs are low.
PDF: The Bad Stuff
However, PDF creates large files with little structural information. PDF files are not nearly as flexible as other electronic formats because the main goal is to recreate a paper page, and not to provide a way of delivering intelligent document structure to a user.
There is little support for searching, although Adobe has products that can index many different PDF files for cross-document searching and navigating. Within a document, navigation mostly imitates paper: "turning" pages, "flipping" from section to section, and "scanning" the page for text of interest. Other than that, navigation is limited.
Another problem is page fidelity. PDF pages are not necessarily pixel-by-pixel replicas of a page that might be printed by the owner of the document. This is partly because the fonts used to create the document originally might not be on the machine that eventually views the document.
We Need Something New...
Wouldn't it be nice to have a solution that has the low cost and simplicity of HTML, the power and flexibility of SGML, and the pleasing formatting capabilities of PDF? The proponents of XML, the Extensible Markup Language, say their standard fits the bill.
The goal was to provide a way of creating, processing, and presenting documents cheaply, quickly, and easily. The new standard needed to allow the creation of customized sets of tags, be compatible with HTML, and provide the power of SGML, without the unneeded complexity.
XML: Extensible Markup Language
When the designers of XML started, they had ten design goals:
- XML shall be straightforwardly usable over the Internet.
- XML shall support a wide variety of applications.
- XML shall be compatible with SGML.
- It shall be easy to write programs which process XML documents.
- The number of optional features in XML is to be kept to the absolute minimum, ideally zero.
- XML documents should be human-legible and reasonably clear.
- The XML design should be prepared quickly.
- The design of XML shall be formal and concise.
- XML documents shall be easy to create.
- Terseness in XML markup is of minimal importance.
These goals guided the development of XML, and helped the design committee stay focused on what was important. The result is XML, a technical recommendation of the W3C. XML is portable between systems because it uses SGML as its core technology. XML is easy to learn; the paper that describes the syntax of XML is only about 30 pages long, and can be understood by any web page designer.
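As a rough illustration of the goal that XML documents be easy to process, the Python sketch below reads a small document whose tags were invented for the occasion. The catalog vocabulary (catalog, part, name, price) and the function price_list are hypothetical; the parser is Python's standard xml.etree.ElementTree module.

```python
import xml.etree.ElementTree as ET

# Hypothetical document type: the tag names describe the content, not its look.
CATALOG = """<?xml version="1.0"?>
<catalog>
  <part number="A-100">
    <name>Hinge</name>
    <price currency="USD">4.50</price>
  </part>
  <part number="B-220">
    <name>Latch</name>
    <price currency="USD">7.25</price>
  </part>
</catalog>"""


def price_list(xml_text):
    """Map each part number to its price, using the custom tags directly."""
    root = ET.fromstring(xml_text)
    return {
        part.get("number"): float(part.findtext("price"))
        for part in root.iter("part")
    }


print(price_list(CATALOG))  # {'A-100': 4.5, 'B-220': 7.25}
```

Because every element is explicitly opened and closed, a general-purpose parser can read the document without knowing anything about the catalog vocabulary in advance. The program then works with names such as part and price instead of formatting tags.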