10 Text

The following sections discuss issues surrounding the structuring of text. Elements that format text (alignment elements, font elements, style sheets, etc.) are discussed in later sections of the specification. Please consult the section on SGML for information concerning character syntax.

10.1 White space

The SGML specification distinguishes between record start characters (line feeds) and record end characters (carriage returns). On the Internet, some platforms use carriage return/line feed pairs for line breaks, some use just line feeds, and others just carriage returns. As a result, HTML user agents should consider single carriage returns, single line feeds, and carriage return/line feed pairs to be a single line break. Throughout this specification, the term "line break" will refer to a single line break produced by any combination of carriage returns and line feeds.
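The normalization described above can be sketched as follows. This is a minimal illustration rather than a required algorithm, and the function name `normalize_line_breaks` is an assumption made for this example.

```python
import re

# CR LF is listed first so that the pair counts as one line break, not two.
_LINE_BREAK = re.compile(r"\r\n|\r|\n")


def normalize_line_breaks(text):
    """Replace each CR LF pair, lone CR, or lone LF with a single LF."""
    return _LINE_BREAK.sub("\n", text)
```

After this step, later processing only needs to recognize one line break character, whichever platform produced the document.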

A line break occurring immediately following a start tag must be ignored, as must a line break occurring immediately before an end tag. This applies to all HTML elements without exceptions. In addition, for all elements except PRE, leading white space characters, such as spaces, horizontal tabs, form feeds and line breaks, following the start tag must be ignored, and any subsequent sequence of contiguous white space characters must be replaced by a single word space.

The following three examples must be rendered identically:

<P>Thomas is watching TV.</P>

<P>
Thomas is watching TV.
</P>

<P>
  Thomas is watching TV.
</P>
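One way to read the rules above, for the text content of a single element other than PRE, is sketched below. The function name `collapse_white_space` is hypothetical, and the sketch assumes a Latin script, where the word space is an ordinary space character.

```python
import re


def collapse_white_space(content):
    """Apply the white space rules to the content of a non-PRE element."""
    # A line break immediately before the end tag is ignored.
    content = re.sub(r"(\r\n|\r|\n)\Z", "", content)
    # Leading white space after the start tag (including a line break) is ignored.
    content = content.lstrip(" \t\f\r\n")
    # Every remaining run of white space becomes a single word space.
    return re.sub(r"[ \t\f\r\n]+", " ", content)
```

Applied to the content of each of the three P elements above, the function returns the same string, which is why the three examples render identically.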

Since the notion of what a word space is varies from script (written language) to script, user agents should collapse white space in script-sensitive ways. For example, in Latin scripts, a single word space is just a space (ASCII decimal 32), while in Thai it is a zero-width word separator. In Japanese and Chinese, a word space is ignored entirely.

These rules allow authors to use white space to lay out their markup as desired, clarifying the source HTML with white space that will not be rendered by a user agent.

For instance, the following source HTML:

<P>
  This example shows a paragraph and a list.
</P>

<UL>
  <LI>
    This is the <EM>first</EM> item
  </LI>

  <LI>
    This is the <EM>second</EM> item
  </LI>
</UL>

may be "rewritten" by omitting end tags and using less white space:

<P>This example shows a paragraph and a list.

<UL>
  <LI>This is the <EM>first</EM> item
  <LI>This is the <EM>second</EM> item
</UL>

but should be rendered identically by a user agent.

The PRE element is used for preformatted text, where white space is significant. The PRE element is described below.

Word space processing can and should be done even in the absence of language information specified by the lang attribute.

10.2 Structured text

10.2.1 Phrasal elements: `EM`, `STRONG`, `DFN`, `CODE`, `SAMP`, `KBD`, `VAR`, `CITE`, and `ACRONYM`

<!ENTITY % phrase "EM | STRONG | DFN | CODE |
                   SAMP | KBD | VAR | CITE | ACRONYM">
<!ELEMENT (%fontstyle;|%phrase;) - - (%inline;)*>
<!ATTLIST (%fontstyle;|%phrase;)
  %attrs;                          -- %coreattrs, %i18n, %events --
  >

Start tag: required, End tag: required

Attributes defined elsewhere

id, class (document-wide identifiers)
lang (language information), dir (text direction)
title (element titles)
style (inline style information)
onclick, ondblclick, onmousedown, onmouseup, onmouseover, onmousemove, onmouseout, onkeypress, onkeydown, onkeyup (intrinsic events)

Phrasal elements add structural information to text fragments. The usual meanings of phrasal elements are as follows:

EM:: Indicates emphasis.
STRONG:: Indicates stronger emphasis.
CITE:: Cites a reference or other source.
DFN:: Indicates that this is the defining instance of the enclosed term.
CODE:: Designates a fragment of computer code.
SAMP:: Designates sample output from programs, scripts, etc.
KBD:: Indicates text to be entered by the user.
VAR:: Indicates an instance of a variable or program argument.
ACRONYM:: Indicates an acronym (e.g., WWW, HTTP, URL, etc.).

EM and STRONG are useful in general to indicate emphasis. The other phrasal elements have particular significance in technical documents. These examples illustrate the rendering of some of the textual markup elements:

"More information can be found in <CITE>[ISO-0000]</CITE>."

"Please refer to the following reference number in future
correspondence: <STRONG>1-234-55</STRONG>"

The presentation of phrasal elements depends on the user agent. Generally, visual user agents present EM text in italics and STRONG text in bold font. Speech synthesizer agents may change the synthesis parameters, such as volume, pitch and rate, accordingly.

The ACRONYM element allows authors to clearly indicate a sequence of characters that compose an acronym (e.g., "NATO", "WWW", "FNAC", "IRS", etc.). The ability to identify acronyms is useful to spell checkers, speech synthesizers, and other user agents and tools.

The content of the ACRONYM element specifies the acronym itself. The title attribute may be used to provide the text to which the acronym expands. Here are some sample acronym definitions:

<ACRONYM title="World Wide Web">WWW</ACRONYM>
<ACRONYM 
   lang="fr" 
   title="Soci&eacute;t&eacute; Nationale de Chemins de Fer">
   SNCF
</ACRONYM>

Note that some acronyms are pronounced letter-by-letter (such as "IRS" or "BBC"); others are pronounced as words (such as "NATO" or "UNESCO"); still others are spelled out by some people and pronounced as words by other people ("URL", "SQL"). Authors should use style sheets to specify how a specific acronym is to be pronounced.

10.2.2 Quotations: The `BLOCKQUOTE` and `Q` elements

<!ELEMENT BLOCKQUOTE - - (%block;)+ -- long quotation -->
<!ATTLIST BLOCKQUOTE
  %attrs;                          -- %coreattrs, %i18n, %events --
  cite        %URL;      #IMPLIED  -- URL for source document or msg --
  >
<!ELEMENT Q - - (%inline;)* -- short inline quotation -->
<!ATTLIST Q
  %attrs;                          -- %coreattrs, %i18n, %events --
  cite        %URL;      #IMPLIED  -- URL for source document or msg --
  >

Start tag: required, End tag: required

Attribute definitions

cite = url: The value of this attribute is a URL that designates a source document or message. This attribute is intended to give information about the source from which the quotation was borrowed.

Attributes defined elsewhere

id, class (document-wide identifiers)
lang (language information), dir (text direction)
title (element titles)
style (inline style information)
onclick, ondblclick, onmousedown, onmouseup, onmouseover, onmousemove, onmouseout, onkeypress, onkeydown, onkeyup (intrinsic events)

These two elements designate quoted text. BLOCKQUOTE is for long quotations and Q is intended for short quotations that don't require paragraph breaks.

This example formats an excerpt from "The Two Towers", by J.R.R. Tolkien, as a blockquote.

<BLOCKQUOTE cite="http://www.mycom.com/tolkien/twotowers.html">
They went in single file, running like hounds on a strong scent,
and an eager light was in their eyes. Nearly due west the broad
swath of the marching Orcs tramped its ugly slot; the sweet grass
of Rohan had been bruised and blackened as they passed.
</BLOCKQUOTE>

Visual user agents generally render BLOCKQUOTE as an indented block.

Quotation marks

We recommend that style sheet implementations provide a way to insert quotation marks before and after a quotation delimited by Q or BLOCKQUOTE in a manner appropriate to the current language context (see the lang attribute) and the degree of nesting of quotations.

However, as some authors have used BLOCKQUOTE merely as a mechanism to indent text, in order to preserve the intention of the authors, user agents should not insert quotation marks in the default style.

Furthermore, if authors include quotation marks in a Q or BLOCKQUOTE element, user agents should not insert additional quotation marks.
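As a rough sketch of the recommendation above, a style sheet implementation might select quotation marks from a table keyed by language and nesting depth. The table `QUOTE_MARKS`, its two entries, and the function `add_quotes` are assumptions made for this illustration only; they are not defined by this specification.

```python
# Illustrative table only: (opening, closing) marks, outermost level first.
QUOTE_MARKS = {
    "en": [("\u201c", "\u201d"), ("\u2018", "\u2019")],
    "fr": [("\u00ab\u00a0", "\u00a0\u00bb"), ("\u201c", "\u201d")],
}

# Characters that suggest the author has already supplied quotation marks.
_OPENERS = {'"', "'"} | {
    pair[0][0] for marks in QUOTE_MARKS.values() for pair in marks
}


def add_quotes(text, lang="en", depth=0):
    """Wrap text in the marks for lang at the given nesting depth."""
    if text[:1] in _OPENERS:
        return text  # author-supplied marks: do not insert additional ones
    marks = QUOTE_MARKS.get(lang, QUOTE_MARKS["en"])
    opening, closing = marks[depth % len(marks)]
    return opening + text + closing
```

The sketch alternates between the listed pairs as nesting deepens and falls back to the English pairs for languages missing from the table; a real implementation would need a fuller table and a less naive test for existing marks.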

The usage of BLOCKQUOTE to indent text is deprecated in favor of style sheets.

10.2.3 Subscripts and superscripts: the `SUB` and `SUP` elements

<!ELEMENT (SUB|SUP) - - (%inline;)* -- subscript, superscript -->
<!ATTLIST (SUB|SUP)
  %attrs;                          -- %coreattrs, %i18n, %events --
  >

Start tag: required, End tag: required

Attributes defined elsewhere

id, class (document-wide identifiers)
lang (language information), dir (text direction)
title (element titles)
style (inline style information)
onclick, ondblclick, onmousedown, onmouseup, onmouseover, onmousemove, onmouseout, onkeypress, onkeydown, onkeyup (intrinsic events)

Many scripts (e.g., French) require superscripts or subscripts for proper rendering. The SUB and SUP elements should be used to mark up text in these cases.

Here, we use SUP to raise the "lle" in the French "M^lle Dupont":

      M<sup>lle</sup> Dupont

10.3 Lines and Paragraphs

Authors traditionally divide their thoughts and arguments into sequences of paragraphs. The organization of information into paragraphs is not affected by how the paragraphs are presented: paragraphs that are double-justified contain the same thoughts as those that are left-justified.

The HTML markup for defining a paragraph is straightforward: the P element defines a paragraph.

The visual presentation of paragraphs is not so simple. A number of issues, both stylistic and technical, must be addressed:

Treatment of white space
Line breaking and word wrapping
Justification
Hyphenation
Written language conventions and text directionality
Formatting of paragraphs with respect to surrounding content

We address these questions below. Paragraph alignment and floating objects are discussed later in this document.

10.3.1 Paragraphs: the `P` element

<!ELEMENT P - O (%inline;)* -- paragraph -->
<!ATTLIST P
  %attrs;                          -- %coreattrs, %i18n, %events --
  %align;                          -- align, text alignment --
  >

Start tag: required, End tag: optional

Attributes defined elsewhere

id, class (document-wide identifiers)
lang (language information), dir (text direction)
title (element titles)
style (inline style information)
align (alignment)
onclick, ondblclick, onmousedown, onmouseup, onmouseover, onmousemove, onmouseout, onkeypress, onkeydown, onkeyup (intrinsic events)

The P element represents a paragraph. It cannot contain block-level elements (including P itself). The end tag may be omitted, in which case it is implied by either the next block-level start tag or the end tag of the element that contains the P element, whichever comes first.

For example, the following two paragraphs:

<P>This is the first paragraph.</P>
<P>This is the second paragraph.</P>
...a block element...

may be rewritten without their end tags:

<P>This is the first paragraph.
<P>This is the second paragraph.
...a block element...

since both are implicitly ended by the block elements that follow them. Similarly, if a paragraph is enclosed by a block element, as in:

<DIV>
<P>This is the paragraph.
</DIV>

the end tag of the enclosing block element (here, DIV) implies the end tag of the P element.
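The rule for implied end tags can be sketched with the `html.parser` module from the Python standard library. The class `ImpliedPEndTags`, the helper `make_p_end_tags_explicit`, and the abbreviated `BLOCK_LEVEL` set are assumptions made for this example; the sketch also does not re-escape character references in text, so it is an illustration rather than a conforming parser.

```python
from html.parser import HTMLParser

# Abbreviated, illustrative set of block-level element names.
BLOCK_LEVEL = {"p", "div", "ul", "ol", "pre", "blockquote", "table",
               "h1", "h2", "h3", "h4", "h5", "h6"}


class ImpliedPEndTags(HTMLParser):
    """Re-emit markup with omitted </P> end tags made explicit."""

    def __init__(self):
        super().__init__()
        self.out = []
        self.p_open = False

    def _close_p(self):
        if self.p_open:
            self.out.append("</P>")
            self.p_open = False

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_LEVEL:
            self._close_p()  # the next block-level start tag ends the P
        if tag == "p":
            self.p_open = True
        self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "p":
            self.p_open = False
        elif tag in BLOCK_LEVEL:
            self._close_p()  # the end tag of the containing element ends the P
        self.out.append("</%s>" % tag.upper())

    def handle_data(self, data):
        self.out.append(data)


def make_p_end_tags_explicit(markup):
    parser = ImpliedPEndTags()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)
```

For the DIV example above, the sketch inserts `</P>` immediately before `</DIV>`.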

We discourage authors from using empty P elements. User agents should ignore empty P elements.

10.3.2 Visual rendering of paragraphs

How paragraphs are rendered visually depends on the user agent. Paragraphs are usually rendered flush left with a ragged right margin. Other defaults are appropriate for right-to-left scripts.

HTML user agents have traditionally rendered paragraphs with white space before and after, e.g.,

  At the same time, there began to take form a system of numbering,
  the calendar, hieroglyphic writing, and a technically advanced
  art, all of which later influenced other peoples.

  Within the framework of this gradual evolution or cultural
  progress the Preclassic horizon has been divided into Lower,
  Middle and Upper periods, to which can be added a transitional
  or Protoclassic period with several features that would later
  distinguish the emerging civilizations of Mesoamerica.

This contrasts with the style used in novels, which indents the first line of the paragraph and uses the regular line spacing between the last line of one paragraph and the first line of the next, e.g.,

     At the same time, there began to take form a system of
  numbering, the calendar, hieroglyphic writing, and a technically
  advanced art, all of which later influenced other peoples.
     Within the framework of this gradual evolution or cultural
  progress the Preclassic horizon has been divided into Lower,
  Middle and Upper periods, to which can be added a transitional
  or Protoclassic period with several features that would later
  distinguish the emerging civilizations of Mesoamerica.

Following the precedent set by the NCSA Mosaic browser in 1993, user agents generally don't justify both margins, in part because it's hard to do this effectively without sophisticated hyphenation routines. The advent of style sheets and anti-aliased fonts with subpixel positioning promises to offer richer choices to HTML authors than previously possible.

Style sheets provide rich control over the size and style of a font, the margins, space before and after a paragraph, the first line indent, justification and many other details. The user agent's default style sheet renders P elements in a familiar form, as illustrated above. One could, in principle, override this to render paragraphs without the breaks that conventionally distinguish successive paragraphs. In general, since this may confuse readers, we discourage this practice.

By convention, visual HTML user agents wrap text lines to fit within the available margins. Wrapping algorithms depend on the script being formatted.

In Western scripts, for example, text should only be wrapped at white space. Early user agents incorrectly wrapped lines at the beginning (or end) of elements, which resulted in dangling punctuation. For example, consider this sentence:

   A statue of the <a href="cih78">Cihuateteus</a>, who are patron ...

Wrapping the line at the end of the anchor tag causes the comma to be stranded at the beginning of the next line:

  A statue of the Cihuateteus
  , who are patron ...

This is an error, since there was no white space at that point in the markup.
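The intended behavior can be illustrated on the text that remains once the markup has been removed. The sketch below relies on the `textwrap` module from the Python standard library, which breaks lines only at white space when configured as shown; the function name `wrap_at_white_space` and the width of 27 columns are assumptions chosen for the example.

```python
import textwrap


def wrap_at_white_space(text, width):
    """Wrap text to at most width columns, breaking only at white space."""
    return textwrap.wrap(text, width=width,
                         break_long_words=False, break_on_hyphens=False)


# "A statue of the Cihuateteus" is exactly 27 characters long, so a wrapper
# that allowed a break at the element boundary would strand the comma.
lines = wrap_at_white_space("A statue of the Cihuateteus, who are patron", 27)
```

Because no white space separates "Cihuateteus" from the comma, the two move to the next line together.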

10.3.3 Controlling line breaks

It is possible to force or forbid a line break in HTML.

Forcing a line break: the `BR` element

<!ELEMENT BR - O EMPTY            -- forced line break -->
<!ATTLIST BR
  %coreattrs;                      -- id, class, style, title --
  clear (left|all|right|none) none -- control of text flow --
  >

Start tag: required, End tag: forbidden

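Attributes defined elsewhere

id, class (document-wide identifiers)
title (element titles)
style (inline style information)
clear (alignment and floating objects)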

The BR element forcibly breaks (ends) the current line of text.

For visual user agents, the clear attribute can be used to determine whether markup following the BR element flows around images and other objects floated to the left or right margin, or whether it starts after the bottom of such objects. Further details are given in the section on alignment and floating objects. Authors are advised to use style sheets to control text flow around floating images and other objects.

With respect to bidirectional formatting, the BR element should be treated by user agents in the same way as a [UNICODE] LINE SEPARATOR character.

Prohibiting a line break

Sometimes authors may want to prevent a line break from occurring between two words. The &nbsp; entity (&#160; or &#xA0;) acts as a space where user agents should not cause a line break.
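The effect of the non-breaking space (U+00A0) on line breaking can be sketched with a simple greedy wrapper that treats only the ordinary space (U+0020) as a break opportunity. The function name `wrap_keeping_nbsp`, the sample sentence, and the width are assumptions made for this illustration.

```python
import html


def wrap_keeping_nbsp(text, width):
    """Greedy wrap that breaks at U+0020 only, never at U+00A0."""
    lines, current = [], ""
    for word in text.split(" "):  # split(" ") leaves U+00A0 inside words
        candidate = current + " " + word if current else word
        if current and len(candidate) > width:
            lines.append(current)
            current = word
        else:
            current = candidate
    if current:
        lines.append(current)
    return lines


# html.unescape turns the entity into the U+00A0 character.
text = html.unescape("Refer to section&nbsp;10.3 for details")
lines = wrap_keeping_nbsp(text, 16)
```

In the result, "section" and "10.3" stay on the same line because the character between them is never considered for a break.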

10.3.4 Hyphenation

In HTML, there are two types of hyphens: the plain hyphen and the soft hyphen. The plain hyphen should be interpreted by a user agent as just another character. The soft hyphen tells the user agent where a line break can occur.

Those browsers that interpret soft hyphens must observe the following semantics: If a line is broken at a soft hyphen, a hyphen character must be displayed at the end of the first line. If a line is not broken at a soft hyphen, the user agent must not display a hyphen character. For operations such as searching and sorting, the soft hyphen should always be ignored.

In HTML, the plain hyphen is represented by the "-" character (&#45; or &#x2D;). The soft hyphen is represented by the named character entity &shy; (&#173; or &#xAD;).
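The soft hyphen semantics can be sketched as two small helpers, given here as an illustration under stated assumptions rather than a prescribed algorithm. The names `strip_soft_hyphens` and `break_word` are hypothetical, and `room` stands for the number of columns left on the current line.

```python
SOFT_HYPHEN = "\u00ad"


def strip_soft_hyphens(text):
    """Form of the text used for searching, sorting, and unbroken display."""
    return text.replace(SOFT_HYPHEN, "")


def break_word(word, room):
    """Break word at the last soft hyphen whose first part fits in room.

    Returns (first_part, remainder). The first part ends with a visible
    hyphen; it is "" when no soft hyphen allows a break that fits.
    """
    pieces = word.split(SOFT_HYPHEN)
    for i in range(len(pieces) - 1, 0, -1):
        head = "".join(pieces[:i]) + "-"
        if len(head) <= room:
            return head, SOFT_HYPHEN.join(pieces[i:])
    return "", word
```

A visible hyphen appears only where a break is actually taken; every soft hyphen that is not used for a break is dropped from the displayed text.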

10.3.5 Preformatted text: The `PRE` element

<!ENTITY % pre.exclusion "IMG|OBJECT|APPLET|BIG|SMALL|SUB|SUP|FONT|BASEFONT">

<!ELEMENT PRE - - (%inline;)* -(%pre.exclusion;) -- preformatted text -->
<!ATTLIST PRE
  %attrs;                          -- %coreattrs, %i18n, %events --
  width       NUMBER     #IMPLIED
  >

Start tag: required, End tag: required

Attribute definitions

width = integer: This attribute provides a hint to visual user agents about the desired width of the formatted block. The user agent can use this information to select an appropriate font size or to indent the content appropriately. The desired width is expressed in number of characters. This attribute is not widely supported currently.

Attributes defined elsewhere

id, class (document-wide identifiers)
lang (language information), dir (text direction)
title (element titles)
style (inline style information)
onclick, ondblclick, onmousedown, onmouseup, onmouseover, onmousemove, onmouseout, onkeypress, onkeydown, onkeyup (intrinsic events)

The PRE element tells visual user agents that the enclosed text is "preformatted". Visual user agents must treat preformatted text as follows:

They may leave white space intact.
They may render text with a fixed-pitch font.
They may disable automatic word wrap.
They must not disable bidirectional processing.

Note that the SGML standard requires that the parser remove a newline immediately following the start tag or immediately preceding the end tag of the PRE.

The DTD fragment above indicates which elements may not appear within a PRE declaration. This is the same as in HTML 3.2, and is intended to preserve constant line spacing and column alignment for text rendered in a fixed pitch font. Authors are discouraged from altering this behavior through style sheets.

The following example shows a preformatted verse from Shelley's poem To a Skylark:

<PRE>
       Higher still and higher
         From the earth thou springest
       Like a cloud of fire;
         The blue deep thou wingest,
And singing still dost soar, and soaring ever singest.
</PRE>

Here is the same verse as rendered by your user agent:

       Higher still and higher
         From the earth thou springest
       Like a cloud of fire;
         The blue deep thou wingest,
And singing still dost soar, and soaring ever singest.

The horizontal tab character
The horizontal tab character (encoded in [UNICODE], US ASCII, and [ISO88591] as decimal 9) is usually interpreted by visual user agents as the smallest non-zero number of spaces necessary to line characters up along tab stops that are every 8 characters. We strongly discourage using horizontal tabs in preformatted text since it is common practice, when editing, to set the tab-spacing to other values, leading to misaligned documents.
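The usual interpretation of the tab character can be sketched as follows. The function name `expand_tabs` and its `tab_width` parameter are assumptions made for this example; Python's built-in `str.expandtabs` follows the same rule.

```python
def expand_tabs(line, tab_width=8):
    """Replace each tab with the spaces needed to reach the next tab stop."""
    out = []
    column = 0
    for ch in line:
        if ch == "\t":
            # Always at least one space, even when already on a tab stop.
            spaces = tab_width - (column % tab_width)
            out.append(" " * spaces)
            column += spaces
        else:
            out.append(ch)
            column += 1
    return "".join(out)
```

Calling the function on the same line with `tab_width=4` instead of 8 generally yields different column positions, which is the misalignment the paragraph above warns about.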

10.4 Marking document changes: The INS and DEL elements

<!-- INS/DEL are handled by inclusion on BODY -->
<!ELEMENT (INS|DEL) - - (%inline;)* -- inserted text, deleted text -->
<!ATTLIST (INS|DEL)
  %attrs;                          -- %coreattrs, %i18n, %events --
  cite        %URL;     #IMPLIED   -- info on reason for change --
  datetime    CDATA     #IMPLIED   -- when changed: ISO date format --
  >

Start tag: required, End tag: required

Attribute definitions

cite = url: The value of this attribute is a URL that designates a source document or message. This attribute is intended to point to information explaining why a document was changed.
datetime = cdata: The value of this attribute specifies the date and time when the change was made. This value must have a format as specified in [ISO8601] and limited by the profile defined in the section below on dates and times.

Attributes defined elsewhere

id, class (document-wide identifiers)
lang (language information), dir (text direction)
title (element titles)
style (inline style information)
onclick, ondblclick, onmousedown, onmouseup, onmouseover, onmousemove, onmouseout, onkeypress, onkeydown, onkeyup (intrinsic events)

INS and DEL are used to mark up sections of the document that have been inserted or deleted with respect to a different version of a document (e.g., in draft legislation where lawmakers need to view the changes).

These two elements are unusual for HTML in that they are neither block-level nor inline elements. They may contain one or more words within a paragraph or contain one or more block-level elements such as paragraphs, lists and tables.

User agents may render inserted and deleted text in ways that make the change obvious. For instance, inserted text may appear in a special font, deleted text may not be shown at all or be shown as struck-through or with special markings, etc.

User agents that do not recognize the DEL element must render that element's content nonetheless.

10.4.1 Date and time format

[ISO8601] allows many options and variations in the representation of dates and times. This specification uses one of those formats in its definition of legal values of the datetime attribute.

The format is:

  YYYY-MM-DDThh:mm:ssTZD

where:

     YYYY = four-digit year
     MM   = two-digit month (01=January, etc.)
     DD   = two-digit day of month (01 through 31)
     hh   = two digits of hour (00 through 23) (am/pm NOT allowed)
     mm   = two digits of minute (00 through 59)
     ss   = two digits of second (00 through 59)
     TZD  = time zone designator

The time zone designator is one of:

Z: indicates UTC (Coordinated Universal Time).
+hh:mm: indicates that the time is a local time which is hh hours and mm minutes ahead of UTC.
-hh:mm: indicates that the time is a local time which is hh hours and mm minutes behind UTC.

Exactly the components shown here must be present, with exactly this punctuation. Note that the "T" appears literally in the string, to indicate the beginning of the time element, as specified in [ISO8601].

If a generating application does not know the time to the second, it may use the value "00" for the seconds (and minutes and hours if necessary).
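A value in this format can be checked and converted mechanically. The sketch below is one possible approach using the Python standard library; the function name `parse_html_datetime` is an assumption made for this example, and the sketch is not a normative validator.

```python
import re
from datetime import datetime, timedelta, timezone

# YYYY-MM-DDThh:mm:ssTZD, where TZD is "Z", "+hh:mm", or "-hh:mm".
_DATETIME = re.compile(
    r"([0-9]{4})-([0-9]{2})-([0-9]{2})"
    r"T([0-9]{2}):([0-9]{2}):([0-9]{2})"
    r"(Z|[+-][0-9]{2}:[0-9]{2})"
)


def parse_html_datetime(value):
    """Parse a datetime attribute value into a time zone aware datetime."""
    match = _DATETIME.fullmatch(value)
    if match is None:
        raise ValueError("not in YYYY-MM-DDThh:mm:ssTZD form: %r" % value)
    year, month, day, hour, minute, second = (
        int(part) for part in match.groups()[:6]
    )
    tzd = match.group(7)
    if tzd == "Z":
        offset = timedelta(0)
    else:
        sign = 1 if tzd[0] == "+" else -1
        offset = sign * timedelta(hours=int(tzd[1:3]), minutes=int(tzd[4:6]))
    # datetime() rejects out-of-range components such as month 13 or hour 24.
    return datetime(year, month, day, hour, minute, second,
                    tzinfo=timezone(offset))
```

Two values that name the same instant in different time zones compare equal after parsing, so `parse_html_datetime("1994-11-05T13:15:30Z")` equals `parse_html_datetime("1994-11-05T08:15:30-05:00")`.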

Both of the following examples correspond to November 5, 1994, 8:15:30 am, US Eastern Standard Time.

     1994-11-05T13:15:30Z
     1994-11-05T08:15:30-05:00

Used with INS, this gives:

<INS datetime="1994-11-05T08:15:30-05:00"
        cite="http://www.foo.org/mydoc/comments.html">
Furthermore, the latest figures from the marketing department
suggest that such practice is on the rise.
</INS>

The document "http://www.foo.org/mydoc/comments.html" would contain comments about why information was inserted into the document.

Authors may also make comments about inserted or deleted text by means of the title attribute for the INS and DEL elements. User agents may present this information to the user (e.g., as a popup note). For example:

<INS datetime="1994-11-05T08:15:30-05:00"
        title="Changed as a result of Steve B's comments in meeting.">
Furthermore, the latest figures from the marketing department
suggest that such practice is on the rise.
</INS>
