This specification describes an abstract data set containing the information available from an XML document.
This is a W3C Working Draft for review by members of the W3C and other interested parties in the general public. Because it is the first public release, it contains many queries and open issues, all of which are clearly indicated in the document.
While it is a Working Draft or a Proposed Recommendation, it is subject to change. It may be updated, replaced or rendered obsolete by other W3C documents at any time. It is inappropriate to use W3C Working Drafts as reference material or to cite them as other than "work in progress."
This work is part of the W3C XML Activity.
Please review and send comments to email@example.com, which is a publicly-archived mailing list.
This document specifies an abstract data set called the XML information set (Infoset), a description of the information available in a well-formed XML document [XML].
An XML document's information set consists of two or more information items (the information set for any well-formed XML document will contain at least the document information item and one element information item). An information item is an abstract representation of some component of an XML document: each information item has a set of associated properties, some of which are required to be available through the information set, and some of which are optionally available.
The XML information set does not require or favor a specific interface or class of interfaces. This specification presents the information set as a tree for the sake of clarity and simplicity, but there is no requirement that the XML information set be made available through a tree structure; other types of interfaces, including (but not limited to) event-based and query-based interfaces, are also capable of providing information that conforms to the information set. As long as the information in the information set is made available to XML applications in one way or another, the requirements of this document are satisfied.
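The following Python sketch illustrates the point about interfaces. It reads one small document through a tree interface (`xml.dom.minidom`) and through an event-based interface (`xml.sax`) and collects the same element names and character data from both. The sample document and the names `via_tree`, `via_events` and `EventCollector` are assumptions made for this illustration; they are not part of this specification.

```python
import xml.sax
from xml.dom import minidom

# Hypothetical sample document used only for this illustration.
DOC = b"<note><to>Ann</to><body>Phone home!</body></note>"


def via_tree(data):
    """Walk a DOM tree and list (kind, value) pairs in document order."""
    items = []

    def walk(node):
        if node.nodeType == node.ELEMENT_NODE:
            items.append(("element", node.tagName))
            for child in node.childNodes:
                walk(child)
        elif node.nodeType == node.TEXT_NODE:
            items.append(("characters", node.data))

    walk(minidom.parseString(data).documentElement)
    return items


class EventCollector(xml.sax.ContentHandler):
    """Collect the same (kind, value) pairs from SAX events."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._in_text = False

    def startElement(self, name, attrs):
        self.items.append(("element", name))
        self._in_text = False

    def endElement(self, name):
        self._in_text = False

    def characters(self, content):
        # A parser may deliver one run of text in several chunks; merge them.
        if self._in_text:
            kind, text = self.items[-1]
            self.items[-1] = (kind, text + content)
        else:
            self.items.append(("characters", content))
            self._in_text = True


def via_events(data):
    handler = EventCollector()
    xml.sax.parseString(data, handler)
    return handler.items
```

Both functions return the same list for the sample document, which is the sense in which either interface can make the same information available.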
Note: In this document, the words "must", "should", and "may" assume the meanings specified in RFC 2119 [RFC2119], except that the words do not appear in upper case.
Note: To the best of the editors' knowledge and belief, the information set scheme described in this document satisfies the requirements of the XPointer-Information Set Liaison Statement [XPointer-Liason].
Note: To the best of the editors' knowledge and belief, the interface specified by the Document Object Model, level one core Recommendation [DOM] conforms to the XML Information Set as currently specified.
The XML information set can contain eleven different types of information items (in the following list, read "required" as "required if present in the original XML document"; see also Processor Limitations, below):
XML Definition: document (Section 2, Documents)
XML Syntax:  Document (Section 2.1, Well-Formed XML Documents)
There is always one document information item in the information set, and all other information items are related to the document information item, either directly or indirectly.
The document information item must have the following properties available in some form:
Query: Should comments and the document type declaration be required rather than optional?
The document information item may also have the following properties available in some form:
XML Definition: element (Section 3, Logical Structures)
XML Syntax:  Element (Section 3, Logical Structures)
There is one element information item for each element appearing in the XML document. Exactly one of the element information items corresponds to the document element (the root of the element tree), and all other element information items are contained within the document element, either directly or indirectly.
An element information item must have the following properties available in some form:
Query: Should comments be required rather than optional?
Query: When Namespace processing is being performed, should the original prefix also be available?
Query: Should attributes starting with "xmlns" be included even when performing Namespace processing?
xmlns will be excluded from the set; if Namespace processing is not being performed, attributes with names beginning with xmlns are included in the set. If there are no non-#IMPLIED attributes specified or defaulted for the element, this set will be empty.
An element information item may also have the following properties available in some form:
XML Definition: attribute (Section 3.1, Start-Tags, End-Tags, and Empty-Element Tags)
XML Syntax:  Attribute (Section 3.1, Start-Tags, End-Tags, and Empty-Element Tags)
There is one attribute information item for each attribute (specified or defaulted) for each element in the document instance; when Namespace processing is being performed, attributes with names beginning with "xmlns" will not have corresponding information items.
Query: Should xml:lang and xml:space also be excluded and modeled as character properties instead?
Attributes declared in the DTD with a default value of #IMPLIED and not specified in the element's start tag are not represented by attribute information items.
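As a rough illustration of the distinction between defaulted and #IMPLIED attributes, the Python sketch below uses the standard `xml.parsers.expat` module, which by default reports attributes defaulted from the internal DTD subset along with specified ones. The sample document and the function name `attribute_items` are assumptions made for this example only.

```python
from xml.parsers import expat

# Hypothetical document: attribute "a" has a declared default value,
# attribute "b" is declared #IMPLIED and is not specified on <r/>.
DOC = """<!DOCTYPE r [
  <!ATTLIST r a CDATA "dflt"
              b CDATA #IMPLIED>
]>
<r/>"""


def attribute_items(xml_text):
    """Return [(element name, sorted (attribute name, value) pairs)]."""
    reported = []
    parser = expat.ParserCreate()

    def start(name, attrs):
        # attrs is a dict of the specified and defaulted attributes.
        reported.append((name, sorted(attrs.items())))

    parser.StartElementHandler = start
    parser.Parse(xml_text, True)
    return reported
```

For the sample document only `a` is reported; `b` appears only when the start tag actually specifies it.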
An attribute information item must have the following properties available in some form:
Query: When Namespace processing is being performed, should the original prefix also be available?
In addition, for each attribute information item, the following property may optionally be available in some form:
XML Definition: processing instruction (Section 2.6, Processing Instructions)
XML Syntax:  PI (Section 2.6, Processing Instructions)
There is one processing instruction information item for every processing instruction in the document. The XML declaration and text declarations for external parsed entities are not considered processing instructions.
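The Python sketch below shows one way to observe this distinction using `xml.dom.minidom`: the XML declaration does not show up as a processing instruction node, while the other two instructions do. The sample document, its targets `render` and `note`, and the function name are hypothetical.

```python
from xml.dom import minidom

# Hypothetical document: one XML declaration and two processing instructions.
DOC = '<?xml version="1.0"?><?render mode="fast"?><doc><?note check?></doc>'


def processing_instructions(xml_text):
    """Return (target, data) for every processing instruction, in order."""
    found = []

    def walk(node):
        if node.nodeType == node.PROCESSING_INSTRUCTION_NODE:
            found.append((node.target, node.data))
        for child in node.childNodes:
            walk(child)

    walk(minidom.parseString(xml_text))
    return found
```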
A processing instruction information item must have the following properties available in some form:
A processing instruction information item may also have the following properties available in some form:
XML Definition: Section 4.4.3, Included If Validating
There is one reference to unknown entity information item for each reference to an entity not included by a non-validating XML processor, either because the processor has not read the declaration or because the processor does not include external parsed entities.
A validating XML processor will never generate reference to unknown entity information items for a valid XML document.
A reference to unknown entity information item must have the following information available in some form:
A reference to unknown entity information item may also have the following properties available in some form:
XML Definition: characters (Section 2.2, Characters)
XML Syntax:  Char (Section 2.2, Characters)
There is one character information item for each non-markup character that appears within the document element, either literally, as a character reference, or within a CDATA section. There is also one character information item for each character that appears in a normalized attribute value.
Note, however, that a CR (#xD) character that is followed by a LF (#xA) character is not represented by any information item. Furthermore, a CR character that is not followed by a LF character is treated as a LF character. This rule does not apply to CR characters created by character references such as
Each character is a logically-separate information item, but processing software is free to chunk characters into larger groups as necessary.
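The line-end rule and the freedom to chunk characters can be sketched in Python as follows. This is a minimal sketch that applies only to literal text (not to characters produced by character references); the function names are assumptions made for this example.

```python
def normalize_line_ends(raw):
    """Drop a CR that precedes a LF, then treat any remaining CR as a LF."""
    return raw.replace("\r\n", "\n").replace("\r", "\n")


def character_items(text):
    """One logical character information item per character."""
    return list(text)


def chunk(items, size):
    """Group character items into larger strings, as software is free to do."""
    return ["".join(items[i:i + size]) for i in range(0, len(items), size)]
```

Chunking changes only how the characters are delivered: joining the chunks back together yields the same sequence of character items.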
A character information item must have the following properties available in some form:
A character information item may also have the following properties available in some form:
Query: Should the inherited values of xml:lang and xml:space also be modeled as optional character properties?
XML Definition: comment (Section 2.5, Comments)
XML Syntax:  Comment (Section 2.5, Comments)
Query: Should comment information items be required?
The optional comment information item corresponds to a single XML comment in the original document.
Query: Should the contents of the comment be optional, so that only its position may be reported?
If a comment information item is included, the following properties must be available:
A comment information item may also have the following properties available in some form:
XML Definition: document type declaration (Section 2.8, Prolog and Document Type Declaration)
XML Syntax:  doctypedecl (Section 2.8, Prolog and Document Type Declaration)
If the XML document has a document type declaration, then the information set may optionally contain a single document type declaration information item.
A document type declaration information item may have the following properties available in some form:
XML Definition: entity (Section 4, Physical Structures)
XML Syntax:  EntityDecl (Section 4.2, Entity Declarations)
Entity information items are optional, except for information items representing unparsed external (NDATA) entities, which are required to appear in the information set.
There is at most one entity information item for each entity, internal or external, declared in the DTD: when the same entity is declared more than once, only the first declaration is used. There is also at most one entity information item for the document instance, and at most one for the DTD external subset (if there is one).
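The "first declaration is used" rule can be sketched in a few lines of Python. The input format (a sequence of name and replacement-text pairs in DTD order) and the function name are assumptions made for this illustration.

```python
def entity_items(declarations):
    """Keep at most one item per entity name; the first declaration wins.

    `declarations` is an iterable of (name, replacement_text) pairs in the
    order in which they appear in the DTD.
    """
    items = {}
    for name, replacement_text in declarations:
        # setdefault leaves an existing entry untouched, so later
        # declarations of the same entity are ignored.
        items.setdefault(name, replacement_text)
    return items
```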
Query: Is it confusing to represent the external DTD subset with an entity information item? (The XML Recommendation treats the external subset essentially as an external parameter entity, except that it does not have an entity name.)
The entity information item, if included, must have the following information available in some form:
An entity information item may also have the following information available in some form:
Query: Should the information from the XML declaration or text declaration also be optionally available?
XML Definition: notation (Section 4.7, Notation Declarations)
XML Syntax:  NotationDecl (Section 4.7, Notation Declarations)
There is one notation information item for each notation declared in the DTD.
A notation information item must have the following properties available:
XML Definition: attribute declaration (Section 3.3, Attribute-List Declarations)
XML Syntax:  AttDef (Section 3.3, Attribute-List Declarations)
Attribute declaration information items are an optional part of the information set. There is at most one attribute declaration information item for each attribute declared in an ATTLIST declaration within the DTD: if an attribute is declared more than once for the same element, only the first declaration is used.
An attribute declaration information item, if present, must have the following properties available in some form:
If an attribute declaration information item is provided for an XML document, the following properties may be available in some form:
Query: Should any of this information be required?
Namespace processing [Namespaces] represents a virtual transformation of an XML document, where elements and attributes acquire new, two-part names based on declarations made with specially-named attributes. As a result, for any single XML document there are two possible instantiations of the XML Information Set: one without Namespace processing, and one with.
Query: Is it best for the Information Set to explicitly allow for a document without Namespace processing?
The XML Information Set provides a single model that is capable of describing a document either without or with Namespace processing, at user option: element and attribute names have both URI parts and local parts, and the URI parts will simply be null when Namespace processing is not in force.
Consider the following example:
<?xml version="1.0"?>
<msg:message dc:date="19990421"
  xmlns:dc="http://purl.org/metadata/dublin_core#"
  xmlns:msg="http://www.message.net/"
>Phone home!</msg:message>
Without Namespace processing, the Information Set for this document will contain the following items in some form (for simplicity's sake, some properties have been omitted):
With Namespace processing, the Information Set for the same XML document will contain the following items in some form:
http://www.message.net/" and the local part "
If an XML document contains no names that include colons and no attribute names that begin with the letters "xmlns", then the XML Information Set will be instantiated identically with or without Namespace processing.
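The two instantiations can be observed with an existing parser. The Python sketch below reads the example document twice through `xml.sax`, once with the SAX `feature_namespaces` feature switched off and once with it switched on, and records names as (URI part, local part) pairs with a null URI part when Namespace processing is not in force. The class name `Collector` and the function `read` are assumptions made for this illustration; the behavior shown is that of Python's SAX driver, not a requirement of this specification.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces

# The example document from this section.
DOC = b"""<?xml version="1.0"?>
<msg:message dc:date="19990421"
  xmlns:dc="http://purl.org/metadata/dublin_core#"
  xmlns:msg="http://www.message.net/"
>Phone home!</msg:message>"""


class Collector(ContentHandler):
    def __init__(self):
        super().__init__()
        self.elements = []    # (uri_part, local_part)
        self.attributes = []  # (uri_part, local_part, value)

    # Used when Namespace processing is off: names are plain strings.
    def startElement(self, name, attrs):
        self.elements.append((None, name))
        for attr_name in attrs.getNames():
            self.attributes.append((None, attr_name, attrs.getValue(attr_name)))

    # Used when Namespace processing is on: names are (uri, local) pairs.
    def startElementNS(self, name, qname, attrs):
        self.elements.append(tuple(name))
        for uri, local in attrs.getNames():
            self.attributes.append((uri, local, attrs.getValue((uri, local))))


def read(namespace_processing):
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, namespace_processing)
    handler = Collector()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(DOC))
    return handler
```

Without Namespace processing, the element is reported under the single name `msg:message` and the two `xmlns:` attributes appear alongside `dc:date`. With Namespace processing, the element and the remaining attribute carry two-part names and the `xmlns:` attributes are no longer reported as attributes.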
An XML processor conforms to the XML Information Set if it provides all the required information items and all required associated information. For instance, attributes are required information items, and an XML processor that does not report the existence of attributes, as well as their names (and URI parts if Namespace processing is being performed) and values, does not conform to the XML Information Set.
Some information items are optional, and some required information items have optional information associated with them. If a processor is required to or chooses to report an information item, then it is required to supply at least what the XML Information Set defines for that item in order to conform. For instance, if a processor chooses to supply entity information items, which are optional, then it is required to supply names for the entities, since the XML Information Set specifies that entity information items are required to make knowledge available about entity names. However, since entity information items are optional, a processor which does not supply them at all also conforms to the XML Information Set.
XML processors may optionally provide additional information not found in the XML Information Set; for instance, the XML Information Set excludes whitespace that occurs between attributes, but an XML processor that provides this information conforms as long as it also provides the information that is required by the XML Information Set.
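The conformance rules in this section can be approximated by a small check, sketched below in Python. It encodes only the two examples given in the text (attributes need names and values; entities, if supplied, need names); the dictionary-based item representation, the table `REQUIRED_PROPERTIES` and both function names are assumptions made for this illustration, and URI parts are left out for brevity.

```python
# Required properties taken from the examples in this section only;
# this table is illustrative and deliberately incomplete.
REQUIRED_PROPERTIES = {
    "attribute": {"name", "value"},
    "entity": {"name"},
}


def missing_properties(item):
    """Return the required properties that a reported item fails to supply."""
    required = REQUIRED_PROPERTIES.get(item["type"], set())
    return sorted(prop for prop in required if item.get(prop) is None)


def conforms(reported_items, required_types_present):
    """Check a processor's report against the rules sketched above.

    `required_types_present` names the required item types that occur in
    the document; each must be reported. Optional item types may be absent,
    but any item that is reported must supply its required properties.
    Properties beyond the required ones are ignored.
    """
    reported_types = {item["type"] for item in reported_items}
    if not required_types_present <= reported_types:
        return False
    return all(not missing_properties(item) for item in reported_items)
```

A report that omits entity items entirely passes this check, while one that supplies an entity item without a name does not.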
The information set for an XML document can contain only information that a processor has actually read.
The XML 1.0 Recommendation [XML] explicitly allows non-validating XML processors to omit parsing the external DTD subset and external entities (both parsed general entities and parameter entities). As a result, it is possible that a non-validating processor will omit reading attribute and entity declarations or actual markup that will affect the quantity and quality of information included in the information set.
Wherever this specification designates information as required, it is important to note that the information is required only if the processor actually reads the part of the XML document in which the information appears. Validating processors must report all required information; non-validating processors may omit information that appears outside of the top-level document entity (either in the external DTD subset or in an external text entity) if they do not read the other entities.
Although the XML 1.0 Recommendation [XML] is primarily concerned with XML syntax, it also includes some specific reporting requirements for processors.
The reporting requirements include errors, which are outside the scope of this specification, and document information; all of the XML 1.0 requirements for document information reporting have been integrated into the XML information set specification (numbers in parentheses refer to sections of the Recommendation):
The following information is not represented in the current version of the XML Information Set:
Query: Should any of this information be included?
Furthermore, the XML Infoset does not provide any method of assigning a single series of numbers to all child nodes of an element or of the document that is guaranteed to be reliable regardless of the underlying XML processor. Although such a method would be desirable, it is considered unachievable, due to the difficulties produced by references to unknown entities and optional information items.
In other words, there is no reliable way to specify something like "the second child of this element" without restricting both the type of processor and the types of children being counted. For more information, see the section on processor limitations.
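A small Python sketch makes the numbering problem concrete. It simulates two processors reading the same element, one that reports the optional comment information items and one that does not, and shows that "the second child" names a different item in each case. The sample document and the function name are hypothetical.

```python
from xml.dom import minidom

# Hypothetical document: a comment followed by two child elements.
DOC = "<list><!-- note --><a/><b/></list>"


def child_names(xml_text, report_comments):
    """List the children of the document element as one processor sees them."""
    root = minidom.parseString(xml_text).documentElement
    names = []
    for child in root.childNodes:
        if child.nodeType == child.COMMENT_NODE:
            if report_comments:
                names.append("#comment")
        elif child.nodeType == child.ELEMENT_NODE:
            names.append(child.tagName)
    return names
```

With comments reported, the child at index 1 is `a`; without them, it is `b`, even though both processors conform.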