6. Why HTML Isn't Enough
HTML has a lot going for it, but it also has several limitations that become apparent in applications that are larger or more functional than home pages and small websites. The following paragraphs explain these limitations in detail.
Limited structure
Most of HTML's limitations can be traced to its fixed set of tags, which primarily serve to specify the formatting of documents delivered on the Web. In other words, HTML tags support only a fixed and trivially simple structure.
In this, HTML shares the limitations of other presentation-specific markup languages, such as RTF, which is designed for documents that are delivered in print. SGML was invented, in part, to separate information from formatting and so provide a powerful and extensible way to mark up information.
HTML's lack of structure creates significant barriers to using HTML for applications beyond simple browsing, such as reuse, interchange, and automation. Each of these is covered below.
Limited reuse
Many organizations publish the same information in multiple forms; it's very common to have both printed and Web forms of the same data. Information originally created in HTML can be reused for printing, and information originally created for printing can be reused for Web delivery.
However, achieving reuse requires conversion that's usually followed by manual intervention to fix up the appearance (i.e., the formatting) of the resulting document. And that means that each time the source information changes, the conversion and fix-up process must be repeated. This is an expensive, time-consuming, and labor-intensive process, and one of the reasons for the adoption of SGML by organizations with lots of data to distribute.
Limited interchange
Because the Internet is simple and ubiquitous, it provides an ideal medium for organizations that want to interchange data. However, HTML undermines interchange because its small, fixed set of tags primarily indicates only the appearance of an element of a document. HTML provides nothing to denote the data within a document, which cripples attempts to achieve reuse.
For example, a computer manufacturer may wish to capture semiconductor data from its suppliers and feed that data into its computer-aided design (CAD) systems. Its CAD systems require data such as the function, tolerances, and timing of each pin of an integrated circuit. HTML provides no way to tag such data unambiguously. In fact, even if the original source data contains the necessary tagging to eliminate uncertainty, which is likely to be the case if the source data is in SGML, the resulting down-translation to HTML strips all the intelligence away.
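To make the contrast concrete, here is a minimal sketch in Python. The element names (`component`, `pin`, `function`, `tolerance`, `timing`) and all sample values are hypothetical, invented for illustration, and XML syntax stands in for whatever structured markup a supplier might actually use. The same facts appear once as presentation-oriented HTML and once with content-specific tags; only the second form lets a program pick out each value by its meaning.

```python
import re
import xml.etree.ElementTree as ET

# Presentation-oriented HTML: the tags say how the text looks, not what it is.
HTML_SOURCE = "<p><b>Pin 1</b>: clock input, <i>5%</i>, 10 ns</p>"

# Content-specific markup (hypothetical element names and sample values).
STRUCTURED_SOURCE = """
<component name="example-ic">
  <pin number="1">
    <function>clock input</function>
    <tolerance unit="percent">5</tolerance>
    <timing unit="ns">10</timing>
  </pin>
  <pin number="2">
    <function>data output</function>
    <tolerance unit="percent">2</tolerance>
    <timing unit="ns">8</timing>
  </pin>
</component>
"""

def read_pins(source):
    """Return one dict per pin, keyed by the meaning of each value."""
    root = ET.fromstring(source)
    pins = []
    for pin in root.findall("pin"):
        pins.append({
            "number": int(pin.get("number")),
            "function": pin.findtext("function"),
            "tolerance_percent": float(pin.findtext("tolerance")),
            "timing_ns": float(pin.findtext("timing")),
        })
    return pins

def strip_html_tags(source):
    """Remove the tags; what remains is undifferentiated text."""
    return re.sub(r"<[^>]+>", "", source)

pins = read_pins(STRUCTURED_SOURCE)
html_text = strip_html_tags(HTML_SOURCE)
```

From the structured source, a CAD import routine can ask directly for `pins[0]["timing_ns"]`. From the HTML, all that survives is a string of text in which the program would have to guess which number is the tolerance and which is the timing.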
Limited automation
Automation saves labor, reduces costs, speeds delivery, and improves quality. There are many opportunities for adding automation to the use of the Web, particularly for intranets and extranets. Examples include almost any forms-based application, such as insurance enrollments, medical claims processing, and online banking.
However, HTML poses a significant barrier to achieving automation. All highly automated processes are built on a data format that's highly expressive and absolutely consistent. HTML lacks the necessary expressiveness, since it's limited to a fixed set of presentation-oriented tags, and it also lacks the absolute consistency, since there's no way to impose a rigorous data structure on top of those tags.
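The kind of rigorous structure that automation depends on can be pictured as a rule that every record must satisfy before a program accepts it. The sketch below is a hand-rolled check in Python, not a real validator; the `enrollment` record, its field names, and the sample values are all assumptions made up for illustration, and a production system would more likely rely on a DTD or similar formal definition.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule: an enrollment must contain exactly these children, in order.
REQUIRED_FIELDS = ["applicant", "policy", "start-date"]

def check_enrollment(source):
    """Return a list of problems; an empty list means the record conforms."""
    root = ET.fromstring(source)
    problems = []
    if root.tag != "enrollment":
        problems.append(f"unexpected root element: {root.tag}")
    found = [child.tag for child in root]
    if found != REQUIRED_FIELDS:
        problems.append(f"expected {REQUIRED_FIELDS}, found {found}")
    for child in root:
        # An element that is present but empty is also a structural error.
        if not (child.text or "").strip():
            problems.append(f"empty element: {child.tag}")
    return problems

good = (
    "<enrollment><applicant>A. Example</applicant>"
    "<policy>P-100</policy><start-date>1998-01-01</start-date></enrollment>"
)
bad = "<enrollment><applicant>A. Example</applicant><policy></policy></enrollment>"

good_problems = check_enrollment(good)
bad_problems = check_enrollment(bad)
```

Because the rule is explicit, a malformed record is rejected before any downstream processing starts. HTML offers no comparable hook: a form rendered with paragraph and bold tags gives a program nothing to check against.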
Searching produces too many "hits"
One of the most valuable capabilities of the Web is provided by search engines that allow a user to find everything on the Web related to an inquiry. As the volume of information available on the Web continues to skyrocket, however, the amount of data retrieved for a typical search has risen to unusable proportions. Searchers must choose between queries so narrow that relevant information may be omitted from the results, and queries so general that they produce far too many hits to be useful.
The reason that Web searches turn up too many hits is that we typically search all the content of every page. Although searches can be limited to titles, those searches are almost certain to exclude relevant hits.
One of the best ways to improve Web searching would be to provide content-specific elements. For example, the word "bonds" could be tagged as a name, or a chemical term, or a financial term. Then searches for content related to "bonds" could be limited to a specific domain of inquiry.
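The "bonds" example can be sketched in a few lines of Python. The three sample documents, the `term` element with its `domain` attribute, and the `name` element are hypothetical markup chosen for illustration; no existing search engine or tag set is implied.

```python
import xml.etree.ElementTree as ET

# Three tiny documents that all contain the word "bonds" (hypothetical markup).
DOCUMENTS = {
    "finance.xml": "<doc><p>Municipal <term domain='finance'>bonds</term> rose today.</p></doc>",
    "chemistry.xml": "<doc><p>Covalent <term domain='chemistry'>bonds</term> share electrons.</p></doc>",
    "sports.xml": "<doc><p><name>Bonds</name> hit a home run.</p></doc>",
}

def full_text_search(word):
    """Match the word anywhere in the text, ignoring all markup."""
    hits = []
    for name, source in DOCUMENTS.items():
        text = "".join(ET.fromstring(source).itertext())
        if word.lower() in text.lower():
            hits.append(name)
    return hits

def domain_search(word, domain):
    """Match the word only where it is tagged as a term in the given domain."""
    hits = []
    for name, source in DOCUMENTS.items():
        root = ET.fromstring(source)
        for term in root.iter("term"):
            same_word = (term.text or "").lower() == word.lower()
            if same_word and term.get("domain") == domain:
                hits.append(name)
                break
    return hits

all_hits = full_text_search("bonds")          # every document matches
finance_hits = domain_search("bonds", "finance")  # only the financial one
```

The full-text search returns all three documents, while the domain-limited search returns only the one in which "bonds" is marked as a financial term.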
Moving target: HTML 2.0 to 3.2 to 4.0 to ??
Since HTML is an evolving standard, its capabilities are continually being extended through the introduction of new tags. For those who are maintaining large amounts of information in HTML, the release of new revisions of HTML usually requires reviewing and retagging the existing data. In fact, many webmasters are relieved that Microsoft and Netscape have increased the intervals between new versions of their browsers from six months to one year, because that means that they don't have to retag their websites as often.
To avoid the retagging problem entirely, many organizations create their source information in SGML and down-translate to HTML. The level of effort for changing an SGML-to-HTML translator may be as little as a few hours, while the effort to retag hundreds or thousands of pages can stretch into many weeks.
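Why a translator is so much cheaper to change than the pages themselves can be seen in a minimal sketch. The Python below is a toy down-translator under stated assumptions: the source element names (`article`, `headline`, `para`, `product`) and the mapping table are hypothetical, XML syntax stands in for the SGML source, and a real translator would handle attributes, entities, and far more structure.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical mapping from content-specific source elements to HTML tags.
TAG_MAP = {"article": "div", "headline": "h1", "para": "p", "product": "b"}

def down_translate(element, tag_map):
    """Recursively rewrite a source element as HTML using the mapping."""
    html_tag = tag_map.get(element.tag, "span")  # unknown elements become <span>
    parts = [f"<{html_tag}>", escape(element.text or "")]
    for child in element:
        parts.append(down_translate(child, tag_map))
        parts.append(escape(child.tail or ""))  # text that follows the child
    parts.append(f"</{html_tag}>")
    return "".join(parts)

SOURCE = (
    "<article><headline>New release</headline>"
    "<para>The <product>Widget</product> ships today.</para></article>"
)

root = ET.fromstring(SOURCE)
html_old = down_translate(root, TAG_MAP)
# When the target HTML changes, only the mapping is edited; the source is untouched.
html_new = down_translate(root, {**TAG_MAP, "product": "strong"})
```

Changing one entry in the mapping regenerates every page with the new tag, which is the difference between editing a translator once and retagging each page by hand.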