The following sections discuss issues surrounding the structuring of text. Elements that present text (alignment elements, font elements, style sheets, etc.) are discussed elsewhere in the specification. For information about characters, please consult the section on the document character set.
The document character set includes a wide variety of white space characters. Many of these are typographic elements used in some applications to produce particular visual spacing effects. HTML considers only the following characters to be white space characters: ASCII space (&#x0020;), ASCII tab (&#x0009;), ASCII form feed (&#x000C;), and zero-width space (&#x200B;).
Line breaks are also considered to be white space characters. Note that although &#x2028; and &#x2029; are defined in [ISO10646] to unambiguously separate lines and paragraphs, respectively, these do not constitute line breaks in HTML, nor does this specification include them in the more general category of white space characters.
This specification does not indicate the behavior, rendering or otherwise, of space characters other than those explicitly identified here as white space characters.
For all HTML elements except PRE, any sequence of white space characters immediately following a start tag should be ignored by user agents, and any subsequent sequence of contiguous white space characters should be interpreted as inter-word space. Thus, the following two examples must be rendered identically:
<!-- No leading spaces -->
<P>Thomas is watching TV.</P>
<!-- Two leading spaces -->
<P>  Thomas is watching TV.</P>
Since the (abstract) notion of inter-word space varies from script (written language) to script, user agents should collapse sequences of white space characters in script-sensitive ways. For example, in Latin scripts, inter-word space is typically rendered as an ASCII space (&#x0020;), while in Thai it is a zero-width word separator (&#x200B;). In Japanese and Chinese, no inter-word space is typically rendered at all.
The PRE element is used for preformatted text, where white space is significant. The PRE element is described below.
Word space processing can and should be done even in the absence of language information specified by the lang attribute.
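The collapsing rule above can be sketched in a few lines of Python. This is an illustration only, not a normative algorithm: the function name is hypothetical, the white space set is assumed to be ASCII space, tab, form feed, zero-width space, and line breaks, and the script-sensitive rendering of inter-word space is reduced to a single ASCII space, as for Latin scripts. PRE content is outside its scope.

```python
import re

# Assumed white space set: space, tab, form feed, zero-width space, CR, LF.
HTML_WHITE_SPACE = " \t\f\u200b\r\n"
_WHITE_SPACE_RUN = re.compile(r"[ \t\f\u200b\r\n]+")


def collapse_white_space(content: str) -> str:
    """Collapse the text content that follows a start tag (not for PRE).

    White space immediately after the start tag is dropped; every later
    run of white space becomes one inter-word space.
    """
    without_leading = content.lstrip(HTML_WHITE_SPACE)
    return _WHITE_SPACE_RUN.sub(" ", without_leading)


no_leading = collapse_white_space("Thomas is watching TV.")
two_leading = collapse_white_space("  Thomas is\n watching\tTV.")
```

Both calls yield the same string, which mirrors the requirement that the two P examples be rendered identically. Characters that are not in the assumed set, such as the no-break space, pass through unchanged.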
<!ENTITY % phrase "EM | STRONG | DFN | CODE | SAMP | KBD | VAR | CITE | ABBR">
<!ELEMENT (%fontstyle;|%phrase;) - - (%inline;)*>
<!ATTLIST (%fontstyle;|%phrase;)
  %attrs;                              -- %coreattrs, %i18n, %events --
  >
Start tag: required, End tag: required
Attributes defined elsewhere
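Phrase elements add structural information to text fragments. The usual meanings of phrase elements are as follows: EM indicates emphasis; STRONG indicates stronger emphasis; DFN marks the defining instance of the enclosed term; CODE designates a fragment of computer code; SAMP designates sample output from programs, scripts, etc.; KBD indicates text to be entered by the user; VAR indicates an instance of a variable or program argument; CITE contains a citation or a reference to other sources; ABBR indicates an abbreviated form.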
EM and STRONG are used to indicate emphasis. The other phrase elements have particular significance in technical documents. These examples illustrate the rendering of some of the textual markup elements:
As <CITE>Harry S. Truman</CITE> said,
<Q lang="en-US">The buck stops here.</Q>

More information can be found in <CITE>[ISO-0000]</CITE>.

Please refer to the following reference number in future
correspondence: <STRONG>1-234-55</STRONG>
The presentation of phrase elements depends on the user agent. Generally, visual user agents present EM text in italics and STRONG text in bold font. Speech synthesizer user agents may change the synthesis parameters, such as volume, pitch and rate accordingly.
The ABBR element allows authors to clearly indicate abbreviated expressions of various kinds. Western languages make extensive use of acronyms or "initialisms" such as "GmbH", "NATO", and "F.B.I.", as well as abbreviations like "M.", "Inc.", "et al.", "etc.". Both Chinese and Japanese use analogous abbreviation mechanisms, wherein a long name is referred to subsequently with a subset of the Han characters from the original occurrence. All of these expressions can be tagged with ABBR, providing useful information to user agents and tools such as spell checkers, speech synthesizers, translation systems and search-engine indexers. The content of the ABBR element specifies the abbreviated expression itself, as it would normally appear in running text. The title attribute on ABBR may be used to provide the full or expanded form of the expression.
Here are some sample uses of ABBR:
<ABBR title="World Wide Web">WWW</ABBR>
<ABBR lang="fr" title="Société Nationale de Chemins de Fer"> SNCF </ABBR>
<ABBR lang="es" title="Doña"> Dña </ABBR>
<ABBR title="Abbreviation">abbr.</ABBR>
Note that abbreviated forms often have idiosyncratic pronunciations. For example, while "IRS" and "BBC" are typically pronounced letter by letter, "NATO" and "UNESCO" are pronounced phonetically. Still other abbreviated forms (e.g., "URL" and "SQL") are spelled out by some people and pronounced as words by other people. If necessary, authors should use style sheets to specify how a specific abbreviated form is to be pronounced.
Note. In earlier versions of HTML and earlier drafts of HTML 4.0, this element was called ACRONYM.
<!ELEMENT BLOCKQUOTE - - (%flow;)*     -- long quotation -->
<!ATTLIST BLOCKQUOTE
  %attrs;                              -- %coreattrs, %i18n, %events --
  cite        %URL;          #IMPLIED  -- URL for source document or msg --
  >
<!ELEMENT Q - - (%inline;)*            -- short inline quotation -->
<!ATTLIST Q
  %attrs;                              -- %coreattrs, %i18n, %events --
  cite        %URL;          #IMPLIED  -- URL for source document or msg --
  >
Start tag: required, End tag: required
Attribute definitions
Attributes defined elsewhere
These two elements designate quoted text. BLOCKQUOTE is for long quotations (block-level content) and Q is intended for short quotations (inline content) that don't require paragraph breaks.
This example formats an excerpt from "The Two Towers", by J.R.R. Tolkien, as a blockquote.
<BLOCKQUOTE cite="http://www.mycom.com/tolkien/twotowers.html">
<P>They went in single file, running like hounds on a strong scent,
and an eager light was in their eyes. Nearly due west the broad
swath of the marching Orcs tramped its ugly slot; the sweet grass
of Rohan had been bruised and blackened as they passed.</P>
</BLOCKQUOTE>
Visual user agents generally render BLOCKQUOTE as an indented block.
Visual user agents must add delimiting quotation marks when rendering Q; authors must not put delimiting quotation marks inside a Q element. Furthermore, user agents should add quotation marks in a language-sensitive manner (see the lang attribute). Many languages use different quotation styles for outer and inner quotations, which should be respected by user agents implementing this element.
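As an informal illustration of language-sensitive quotation marks, the following Python sketch picks delimiters from the language code and the nesting depth. The table of marks, the function name, and the fallback to English are all assumptions made for this example; a real user agent would draw on much richer locale data.

```python
# Hypothetical table: (open, close) pairs for outer, then inner quotations.
QUOTE_MARKS = {
    "en": [("\u201c", "\u201d"), ("\u2018", "\u2019")],
    "fr": [("\u00ab\u00a0", "\u00a0\u00bb"), ("\u201c", "\u201d")],
    "de": [("\u201e", "\u201c"), ("\u201a", "\u2018")],
}


def render_q(content: str, lang: str = "en", depth: int = 0) -> str:
    """Wrap the content of a Q element in marks chosen by language and depth."""
    primary = lang.split("-")[0].lower()           # "en-US" -> "en"
    pairs = QUOTE_MARKS.get(primary, QUOTE_MARKS["en"])
    open_mark, close_mark = pairs[depth % len(pairs)]
    return f"{open_mark}{content}{close_mark}"


outer = render_q("The buck stops here.", "en-US")
nested = render_q("He said " + render_q("no", "en", depth=1), "en", depth=0)
```

Because the marks are supplied at rendering time, the markup itself stays free of quotation marks, as required of authors above.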
Quotation marks
We recommend that style sheet implementations provide a mechanism for inserting quotation marks before and after a quotation delimited by BLOCKQUOTE in a manner appropriate to the current language context and the degree of nesting of quotations.
However, as some authors have used BLOCKQUOTE merely as a mechanism to indent text, in order to preserve the intention of the authors, user agents should not insert quotation marks in the default style.
The usage of BLOCKQUOTE to indent text is deprecated in favor of style sheets.
<!ELEMENT (SUB|SUP) - - (%inline;)*    -- subscript, superscript -->
<!ATTLIST (SUB|SUP)
  %attrs;                              -- %coreattrs, %i18n, %events --
  >
Start tag: required, End tag: required
Attributes defined elsewhere
Many scripts (e.g., French) require superscripts or subscripts for proper rendering. The SUB and SUP elements should be used to mark up text in these cases.
H<sub>2</sub>O
E = mc<sup>2</sup>
<SPAN lang="fr">M<sup>lle</sup> Dupont</SPAN>
Authors traditionally divide their thoughts and arguments into sequences of paragraphs. The organization of information into paragraphs is not affected by how the paragraphs are presented: paragraphs that are double-justified contain the same thoughts as those that are left-justified.
The HTML markup for defining a paragraph is straightforward: the P element defines a paragraph.
The visual presentation of paragraphs is not so simple. A number of issues, both stylistic and technical, must be addressed:
We address these questions below. Paragraph alignment and floating objects are discussed later in this document.
<!ELEMENT P - O (%inline;)*            -- paragraph -->
<!ATTLIST P
  %attrs;                              -- %coreattrs, %i18n, %events --
  %align;                              -- align, text alignment --
  >
Start tag: required, End tag: optional
Attributes defined elsewhere
The P element represents a paragraph. It cannot contain block-level elements (including P itself). The end tag may be omitted, in which case it is implied by either the next block-level start tag or the end tag of the element that contains the P element, whichever comes first.
<P>This is the first paragraph.</P>
<P>This is the second paragraph.</P>
...a block element...
These two paragraphs may be rewritten without their end tags:
<P>This is the first paragraph.
<P>This is the second paragraph.
...a block element...
since both are implicitly ended by the block elements that follow them. Similarly, if a paragraph is enclosed by a block element, as in:
<DIV>
<P>This is the paragraph.
</DIV>
the end tag of the enclosing block element (here, DIV) implies the end tag of the P element.
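The implied end tag can be made explicit with a small Python sketch built on the standard library's html.parser module. It is a simplification under stated assumptions: the class and function names are hypothetical, P_TERMINATORS is a hand-picked set rather than the full block-level list from the DTD, and comments and declarations are dropped from the output.

```python
from html.parser import HTMLParser

# Assumption: elements whose start or end tag ends an open P.
P_TERMINATORS = {"p", "div", "blockquote", "pre", "ul", "ol", "dl", "table",
                 "h1", "h2", "h3", "h4", "h5", "h6", "hr", "form", "address",
                 "li", "dd", "td", "th", "body"}


class ImpliedParagraphEnd(HTMLParser):
    """Re-emit markup, inserting the </P> end tags that were omitted."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self.p_open = False

    def close_p(self):
        if self.p_open:
            self.out.append("</P>")
            self.p_open = False

    def handle_starttag(self, tag, attrs):
        if tag in P_TERMINATORS:        # a new block ends the open paragraph
            self.close_p()
        if tag == "p":
            self.p_open = True
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in P_TERMINATORS:        # so does the end of an enclosing block
            self.close_p()
        if tag != "p":                  # an explicit </P> was emitted by close_p
            self.out.append(f"</{tag.upper()}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def add_p_end_tags(markup: str) -> str:
    parser = ImpliedParagraphEnd()
    parser.feed(markup)
    parser.close()
    parser.close_p()                    # a paragraph still open at the end
    return "".join(parser.out)


explicit = add_p_end_tags("<DIV> <P>This is the paragraph. </DIV>")
```

The sketch closes a paragraph at whichever terminator arrives first, which corresponds to the "whichever comes first" rule for implied end tags.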
We discourage authors from using empty P elements. User agents should ignore empty P elements.
The SGML specification distinguishes record start characters and record end characters, which, in HTML, are defined to be "line feed" (&#x000A;) and "carriage return" (&#x000D;), respectively.
On the Internet, some platforms use carriage return/line feed pairs to represent line breaks, some use just line feeds, and others just carriage returns. HTML user agents should interpret each carriage return, line feed, and carriage return/line feed pair as a single line break. Elsewhere in this specification, the term line break refers to any of the three.
All line breaks constitute white space.
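The three accepted forms of line break can be normalized with a single regular expression. The following Python sketch is one possible approach, with a hypothetical function name; the important detail is that the carriage return/line feed pair is tried before the single characters, so a pair counts as one line break rather than two.

```python
import re

# CR LF must come first in the alternation so the pair is matched as a unit.
_LINE_BREAK = re.compile(r"\r\n|\r|\n")


def normalize_line_breaks(text: str) -> str:
    """Replace each CR LF pair, lone CR, or lone LF with a single line feed."""
    return _LINE_BREAK.sub("\n", text)


normalized = normalize_line_breaks("one\r\ntwo\rthree\nfour")
```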
For more information about SGML's specification of line breaks, please consult the notes on line breaks in the appendix.
<!ELEMENT BR - O EMPTY                 -- forced line break -->
<!ATTLIST BR
  %coreattrs;                          -- id, class, style, title --
  clear       (left|all|right|none) none -- control of text flow --
  >
Start tag: required, End tag: forbidden
Attributes defined elsewhere
The BR element forcibly breaks (ends) the current line of text.
For visual user agents, the clear attribute can be used to determine whether markup following the BR element flows around images and other objects floated to the left or right margin, or whether it starts after the bottom of such objects. Further details are given in the section on alignment and floating objects. Authors are advised to use style sheets to control text flow around floating images and other objects.
With respect to bidirectional formatting, the BR element should be treated by user agents in the same way as an [ISO10646] LINE SEPARATOR character.
Sometimes authors may want to prevent a line break from occurring between two words. The &nbsp; entity (&#160; or &#xA0;) acts as a space where user agents should not cause a line break.
In HTML, there are two types of hyphens: the plain hyphen and the soft hyphen. The plain hyphen should be interpreted by a user agent as just another character. The soft hyphen tells the user agent where a line break can occur.
Those browsers that interpret soft hyphens must observe the following semantics: If a line is broken at a soft hyphen, a hyphen character must be displayed at the end of the first line. If a line is not broken at a soft hyphen, the user agent must not display a hyphen character. For operations such as searching and sorting, the soft hyphen should always be ignored.
In HTML, the plain hyphen is represented by the "-" character (&#45; or &#x2D;). The soft hyphen is represented by the character entity reference &shy; (&#173; or &#xAD;).
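The soft hyphen semantics can be illustrated with a short Python sketch. Both function names are hypothetical, and the width model (one column per character) is an assumption made to keep the example small; real user agents measure rendered glyphs.

```python
from typing import Tuple

SOFT_HYPHEN = "\u00ad"


def searchable(text: str) -> str:
    """Form used for searching and sorting: soft hyphens are ignored."""
    return text.replace(SOFT_HYPHEN, "")


def break_word(word: str, room: int) -> Tuple[str, str]:
    """Break a word at the last soft hyphen that fits in `room` columns.

    Returns (head, tail). The head ends with a visible hyphen; the tail
    keeps its remaining soft hyphens. If no break fits, the head is empty
    and the word is returned whole.
    """
    parts = word.split(SOFT_HYPHEN)
    for i in range(len(parts) - 1, 0, -1):
        head = "".join(parts[:i])
        if len(head) + 1 <= room:       # +1 for the displayed hyphen
            return head + "-", SOFT_HYPHEN.join(parts[i:])
    return "", word


word = "hy\u00adphen\u00adation"
head, tail = break_word(word, room=8)
```

When the word is not broken, a renderer following these rules would display searchable(word), so no hyphen appears.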
<!ENTITY % pre.exclusion "IMG|OBJECT|APPLET|BIG|SMALL|SUB|SUP|FONT|BASEFONT">
<!ELEMENT PRE - - (%inline;)* -(%pre.exclusion;) -- preformatted text -->
<!ATTLIST PRE
  %attrs;                              -- %coreattrs, %i18n, %events --
  width       NUMBER         #IMPLIED
  >
Start tag: required, End tag: required
Attribute definitions
Attributes defined elsewhere
The PRE element tells visual user agents that the enclosed text is "preformatted". Visual user agents must treat preformatted text as follows:
Non-visual user agents may ignore the spacing and line breaks in this element's content.
Note that the SGML standard requires that the parser remove a newline immediately following the start tag or immediately preceding the end tag of the PRE.
The DTD fragment above indicates which elements may not appear within a PRE declaration. This is the same as in HTML 3.2, and is intended to preserve constant line spacing and column alignment for text rendered in a fixed pitch font. Authors are discouraged from altering this behavior through style sheets.
The following example shows a preformatted verse from Shelley's poem To a Skylark:
<PRE>
       Higher still and higher
         From the earth thou springest
       Like a cloud of fire;
         The blue deep thou wingest,
And singing still dost soar, and soaring ever singest.
</PRE>
Here is the same verse as rendered by your user agent:
       Higher still and higher
         From the earth thou springest
       Like a cloud of fire;
         The blue deep thou wingest,
And singing still dost soar, and soaring ever singest.
The horizontal tab character
The horizontal tab character (decimal 9 in [ISO10646] and [ISO88591]) is usually interpreted by visual user agents as the smallest non-zero number of spaces necessary to line characters up along tab stops that are every 8 characters. We strongly discourage using horizontal tabs in preformatted text since it is common practice, when editing, to set the tab-spacing to other values, leading to misaligned documents.
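The usual tab interpretation described above can be written out as a short Python sketch. The function name and the tab_stop parameter are hypothetical; Python's built-in str.expandtabs applies the same rule and is used here only as a cross-check.

```python
def expand_tabs(line: str, tab_stop: int = 8) -> str:
    """Replace each tab with the smallest non-zero run of spaces that
    reaches the next tab stop."""
    out = []
    column = 0
    for ch in line:
        if ch == "\t":
            spaces = tab_stop - (column % tab_stop)   # always 1..tab_stop
            out.append(" " * spaces)
            column += spaces
        else:
            out.append(ch)
            column += 1
    return "".join(out)


at_eight = expand_tabs("ab\tc")               # tab stops every 8 columns
at_four = expand_tabs("ab\tc", tab_stop=4)    # an editor set to 4 columns
```

The two results differ in width, which is the misalignment the paragraph above warns about when an editor uses a tab spacing other than 8.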
Note. The following section is an informative description of the behavior of some current visual user agents when formatting paragraphs. Style sheets allow better control of paragraph formatting.
How paragraphs are rendered visually depends on the user agent. Paragraphs are usually rendered flush left with a ragged right margin. Other defaults are appropriate for right-to-left scripts.
HTML user agents have traditionally rendered paragraphs with white space before and after, e.g.,
At the same time, there began to take form a system of numbering, the calendar, hieroglyphic writing, and a technically advanced art, all of which later influenced other peoples.

Within the framework of this gradual evolution or cultural progress the Preclassic horizon has been divided into Lower, Middle and Upper periods, to which can be added a transitional or Protoclassic period with several features that would later distinguish the emerging civilizations of Mesoamerica.
This contrasts with the style used in novels which indents the first line of the paragraph and uses the regular line spacing between the final line of the current paragraph and the first line of the next, e.g.,
    At the same time, there began to take form a system of numbering, the calendar, hieroglyphic writing, and a technically advanced art, all of which later influenced other peoples.
    Within the framework of this gradual evolution or cultural progress the Preclassic horizon has been divided into Lower, Middle and Upper periods, to which can be added a transitional or Protoclassic period with several features that would later distinguish the emerging civilizations of Mesoamerica.
Following the precedent set by the NCSA Mosaic browser in 1993, user agents generally don't justify both margins, in part because it's hard to do this effectively without sophisticated hyphenation routines. The advent of style sheets and of anti-aliased fonts with subpixel positioning promises to offer HTML authors richer choices than were previously possible.
Style sheets provide rich control over the size and style of a font, the margins, space before and after a paragraph, the first line indent, justification and many other details. The user agent's default style sheet renders P elements in a familiar form, as illustrated above. One could, in principle, override this to render paragraphs without the breaks that conventionally distinguish successive paragraphs. In general, since this may confuse readers, we discourage this practice.
By convention, visual HTML user agents wrap text lines to fit within the available margins. Wrapping algorithms depend on the script being formatted.
In Western scripts, for example, text should only be wrapped at white space. Early user agents incorrectly wrapped lines at the beginning (or end) of elements, which resulted in dangling punctuation. For example, consider this sentence:
A statue of the <a href="cih78">Cihuateteus</a>, who are patron ...
Wrapping the line at the end of the anchor tag causes the comma to be stranded at the beginning of the next line:
A statue of the Cihuateteus
, who are patron ...
This is an error, since there was no white space at that point in the markup.
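A wrapping routine avoids this error if it looks only at the text content and breaks only at white space. The Python sketch below shows the idea with the standard textwrap module; the function name is hypothetical, the regular expression is a crude stand-in for real tag handling, and width is counted in characters rather than rendered units.

```python
import re
import textwrap


def wrap_inline_text(markup: str, width: int) -> list:
    """Wrap the text content of inline markup, breaking only at white space."""
    text = re.sub(r"<[^>]+>", "", markup)     # illustration only: drop tags
    return textwrap.wrap(text, width=width,
                         break_long_words=False, break_on_hyphens=False)


lines = wrap_inline_text(
    'A statue of the <a href="cih78">Cihuateteus</a>, who are patron', 27)
```

Because the end of the anchor is not a break opportunity, the comma stays attached to the word it follows and moves to the next line together with it.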
<!-- INS/DEL are handled by inclusion on BODY -->
<!ELEMENT (INS|DEL) - - (%flow;)*      -- inserted text, deleted text -->
<!ATTLIST (INS|DEL)
  %attrs;                              -- %coreattrs, %i18n, %events --
  cite        %URL;          #IMPLIED  -- info on reason for change --
  datetime    %Datetime;     #IMPLIED  -- date and time of change --
  >
Start tag: required, End tag: required
Attribute definitions
Attributes defined elsewhere
INS and DEL are used to mark up sections of the document that have been inserted or deleted with respect to a different version of a document (e.g., in draft legislation where lawmakers need to view the changes).
These two elements are unusual for HTML in that they may serve as either block-level or inline elements (but not both). They may contain one or more words within a paragraph or contain one or more block-level elements such as paragraphs, lists and tables.
This example could be from a bill to change the legislation for how many deputies a County Sheriff can employ from 3 to 5.
<P> A Sheriff can employ <DEL>3</DEL><INS>5</INS> deputies. </P>
The INS and DEL elements must not contain block-level content when these elements behave as inline elements.
ILLEGAL EXAMPLE:
The following is not considered legal HTML.
<P> <INS><DIV>...block-level content...</DIV></INS> </P>
User agents should render inserted and deleted text in ways that make the change obvious. For instance, inserted text may appear in a special font, deleted text may not be shown at all or be shown as struck-through or with special markings, etc.
Both of the following examples correspond to November 5, 1994, 8:15:30 am, US Eastern Standard Time.
1994-11-05T13:15:30Z
1994-11-05T08:15:30-05:00
Used with INS, this gives:
<INS datetime="1994-11-05T08:15:30-05:00"
     cite="http://www.foo.org/mydoc/comments.html">
Furthermore, the latest figures from the marketing department
suggest that such practice is on the rise.
</INS>
The document "http://www.foo.org/mydoc/comments.html" would contain comments about why information was inserted into the document.
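As a quick check that the two datetime values given earlier denote the same instant, they can be parsed with Python's standard datetime module. The helper name is hypothetical, and the format string covers only the full form with seconds and a time zone designator used in these examples.

```python
from datetime import datetime, timedelta


def parse_datetime(value: str) -> datetime:
    """Parse YYYY-MM-DDThh:mm:ss followed by Z or a +hh:mm / -hh:mm offset."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S%z")


utc_form = parse_datetime("1994-11-05T13:15:30Z")
est_form = parse_datetime("1994-11-05T08:15:30-05:00")
same_instant = utc_form == est_form
```

Time zone aware values compare by the instant they denote, so the comparison holds even though the two strings show different local times.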
Authors may also make comments about inserted or deleted text by means of the title attribute for the INS and DEL elements. User agents may present this information to the user (e.g., as a popup note). For example:
<INS datetime="1994-11-05T08:15:30-05:00"
     title="Changed as a result of Steve B's comments in meeting.">
Furthermore, the latest figures from the marketing department
suggest that such practice is on the rise.
</INS>