Current Status

From MicroXML Community Group

Syntax

This grammar adds back to the Initial Status the things that appear to have attracted consensus, specifically:

  • comments;
  • empty element syntax;
  • Unicode characters in names.

To match the grammar in the 2012-09-08 Editor's Draft, it also adds back processing instructions, before the root element only, although there is no consensus on PIs yet. Note that the production numbers will remain stable until there is consensus.

# Documents
[1] document ::= (comment | pi | s)* element (comment | s)*

# Elements
[4] element ::= startTag content endTag
              | emptyElementTag
[5] content ::= (element | comment | pi | dataChar | charRef)*
[6] startTag ::= '<' name (s+ attribute)* s* '>'
[7] emptyElementTag ::= '<' name (s+ attribute)* s* '/>'
[8] endTag ::= '</' name s* '>'

# Attributes
[9] attribute ::= attributeName s* '=' s* attributeValue
[10] attributeValue ::= '"' ((attributeValueChar - '"') | charRef)* '"'
                      | "'" ((attributeValueChar - "'") | charRef)* "'"
[11] attributeValueChar ::= char - ('<'|'>'|'&')
[12] attributeName ::= name - 'xmlns'

# Data characters
[13] dataChar ::= char - ('<'|'&'|'>')

# Character references
[14] charRef ::= hexCharRef | namedCharRef
[16] hexCharRef ::= '&#x' [0-9a-fA-F]+ ';'
[17] namedCharRef ::= '&' charName ';'
[18] charName ::= 'amp' | 'lt' | 'gt' | 'quot' | 'apos'

# Comments
[19] comment ::= '<!--' ((char - '-') | ('-' (char - '-')))* '-->'

# Processing Instructions
[22] pi ::= '<?' target (s+ attribute)* s* '?>'
[23] target ::= name - (('X' | 'x') ('M' | 'm') ('L' | 'l'))

# Names
[24] name ::= nameStartChar nameChar*
[25] nameStartChar ::= [A-Z] | [a-z] | "_" | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | [#x370-#x37D]
                     | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF]
                     | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
[26] nameChar ::= nameStartChar | [0-9] | "-" | "." | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]

# White space
[27] s ::= #x9 | #xA | #x20

# Characters
[28] char ::= s | ([#x21-#x10FFFF] - forbiddenChar)
[29] forbiddenChar ::= [#x7F-#x9F] | surrogateCodePoint
                     | [#xFDD0-#xFDEF] | [#xFFFE-#xFFFF] | [#x1FFFE-#x1FFFF]
                     | [#x2FFFE-#x2FFFF] | [#x3FFFE-#x3FFFF] | [#x4FFFE-#x4FFFF]
                     | [#x5FFFE-#x5FFFF] | [#x6FFFE-#x6FFFF] | [#x7FFFE-#x7FFFF]
                     | [#x8FFFE-#x8FFFF] | [#x9FFFE-#x9FFFF] | [#xAFFFE-#xAFFFF]
                     | [#xBFFFE-#xBFFFF] | [#xCFFFE-#xCFFFF] | [#xDFFFE-#xDFFFF]
                     | [#xEFFFE-#xEFFFF] | [#xFFFFE-#xFFFFF] | [#x10FFFE-#x10FFFF]
[30] surrogateCodePoint ::= [#xD800-#xDFFF]
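
The character and name productions above can be checked mechanically. The following is a minimal sketch in Python, assuming code points are passed as integers and names as Python strings; every function and constant name here is illustrative and does not come from the draft. It transcribes productions [12] and [23] through [30] literally and does not attempt to cover the rest of the grammar.

```python
import re

# Ranges copied from productions [25] and [26]. All names are illustrative.
NAME_START_RANGES = [
    (0x41, 0x5A), (0x61, 0x7A), (0x5F, 0x5F),
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF), (0x370, 0x37D),
    (0x37F, 0x1FFF), (0x200C, 0x200D), (0x2070, 0x218F), (0x2C00, 0x2FEF),
    (0x3001, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]
# Characters allowed in a name only after the first position.
NAME_ONLY_RANGES = [
    (0x30, 0x39), (0x2D, 0x2D), (0x2E, 0x2E), (0xB7, 0xB7),
    (0x300, 0x36F), (0x203F, 0x2040),
]


def _in_ranges(cp, ranges):
    return any(lo <= cp <= hi for lo, hi in ranges)


def is_forbidden_char(cp):
    """Production [29], with surrogates from [30]."""
    return (
        0x7F <= cp <= 0x9F
        or 0xD800 <= cp <= 0xDFFF
        or 0xFDD0 <= cp <= 0xFDEF
        or (cp & 0xFFFF) >= 0xFFFE  # last two code points of each plane
    )


def is_char(cp):
    """Production [28]: white space, or #x21-#x10FFFF minus forbiddenChar."""
    if cp in (0x9, 0xA, 0x20):
        return True
    return 0x21 <= cp <= 0x10FFFF and not is_forbidden_char(cp)


def is_name_start_char(cp):
    return _in_ranges(cp, NAME_START_RANGES)


def is_name_char(cp):
    return is_name_start_char(cp) or _in_ranges(cp, NAME_ONLY_RANGES)


def is_name(text):
    """Production [24]: one nameStartChar followed by any number of nameChar."""
    return (
        bool(text)
        and is_name_start_char(ord(text[0]))
        and all(is_name_char(ord(c)) for c in text[1:])
    )


def is_attribute_name(text):
    """Production [12]: any name except 'xmlns'."""
    return is_name(text) and text != "xmlns"


def is_pi_target(text):
    """Production [23]: any name except 'xml' in any letter case."""
    return is_name(text) and re.fullmatch("[Xx][Mm][Ll]", text) is None
```

Because ':' is not a nameChar in this grammar, `is_name("xml:lang")` is false. Carriage return (#xD) is also not a `char`, so the sketch expects input in which line ends have already been normalized.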

Data Model

The data model for MicroXML is defined as a grammar over a particular kind of tree. These trees have one atomic type, the character (equivalent to a Unicode code point), and two composite types, arrays and maps. In the following, [...] denotes an array and {...} denotes a map:

document ::= [element, pi*]
element ::= [name, attributes, content]
pi ::= [name, attributes]
attributes ::= { (name => attributeValue)* }
attributeValue ::= [ char* ]
content ::= [ (char | element)* ]
name ::= [ nameStartChar, nameChar* ]
char, nameStartChar, nameChar ::= <single character as in the grammar for the syntax>

Note that comments are not in the data model.
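
To make the shape of these trees concrete, here is a small sketch in Python. It assumes that a Python list stands in for an array, a dict for a map, and a str for an array of characters (so names and attribute values are plain strings, while content holds one-character strings and nested elements). The helper names `element` and `text_content` are hypothetical and exist only for this illustration.

```python
def element(name, attributes=None, content=None):
    """Build an element as [name, attributes, content]."""
    return [name, dict(attributes or {}), list(content or [])]


def text_content(el):
    """Concatenate the characters found anywhere inside an element."""
    _name, _attributes, content = el
    parts = []
    for item in content:
        if isinstance(item, str):
            parts.append(item)                # a single character
        else:
            parts.append(text_content(item))  # a nested element
    return "".join(parts)


# Data model for: <p class="x">Hi <b>there</b></p>
root = element(
    "p",
    {"class": "x"},
    ["H", "i", " ", element("b", content=list("there"))],
)

# document ::= [element, pi*], here with no processing instructions.
document = [root]
```

With these definitions, `text_content(root)` returns `"Hi there"`. A comment inside the source text would leave no trace in `document`.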

Parsing

These points appear to have consensus:

  • UTF-8 only
  • Newline normalization as in XML
  • No attribute value normalization: literal newlines in attribute values are preserved
  • No requirement for draconian error handling
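
The first two points can be sketched as a pre-processing step. This is only an illustration: it assumes "as in XML" means the XML 1.0 end-of-line rule (each CR LF pair and each lone CR becomes a single LF), and the function name `decode_and_normalize` is made up for this example.

```python
def decode_and_normalize(data: bytes) -> str:
    """Decode UTF-8 input and normalize line ends to #xA."""
    # Strict decoding: malformed UTF-8 raises UnicodeDecodeError.
    text = data.decode("utf-8")
    # Replace CR LF first so that a pair does not become two newlines.
    return text.replace("\r\n", "\n").replace("\r", "\n")
```

For example, `decode_and_normalize(b"a\r\nb\rc\n")` returns `"a\nb\nc\n"`. Since there is no attribute value normalization, a newline that survives this step inside an attribute value would stay a newline rather than being turned into a space.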