java.lang.Object
   |
   +----XMLTokenizer
           |
           +----XMLParser
The grammar used in this parser is different from the one in the first XML draft. This is the grammar (lowercase = nonterminal, uppercase = terminal):
    document: prolog element misc*
    prolog: encodingdecl? misc* [doctypedecl misc*]? [dtdsummary misc*]?
    misc: COMMENT | PI
    doctypedecl: DOCTYPE NAME extid? GT
    attribute: NAME EQ LITERAL
    etag: ETAGO NAME GT
    content: [element | PCDATA | ms | PI | COMMENT]*
    element: LT NAME attribute* [GT content etag | EMPTY]
    dtdsummary: [idinfo | defaultinfo]+
    encodingdecl: ENCODING EQ qencoding ENDPI
    extid: LITERAL
    ms: MSSTART MSDATA MSEND
    qencoding: LITERAL
    idinfo: IDINFO NAME EQ quotedpairs NAME EQ quotedpairs ENDPI
    quotedpairs: LITERAL
    defaultinfo: DEFAULT NAME [NAME EQ LITERAL]* ENDPI
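To make the shape of the grammar concrete, here is a minimal recursive-descent sketch of the element, attribute, content and etag rules. It is not the source of XMLParser. The Tok enum and the ElementRecognizer class are hypothetical names, the reading of LT as "<", EMPTY as "/>" and ETAGO as "</" is an assumption, and the ms rule is left out. Because the tokens carry no text, the sketch cannot check that an end tag names the same element as its start tag.

```java
import java.util.List;

// Hypothetical token kinds, named after a subset of the grammar's terminals.
enum Tok { LT, NAME, EQ, LITERAL, GT, EMPTY, ETAGO, PCDATA, COMMENT, PI, EOF }

// Recognizer only: it accepts or rejects a token list and builds no tree.
class ElementRecognizer {
    private final List<Tok> tokens;
    private int pos = 0;

    ElementRecognizer(List<Tok> tokens) { this.tokens = tokens; }

    private Tok peek() { return pos < tokens.size() ? tokens.get(pos) : Tok.EOF; }

    private void expect(Tok expected) {
        if (peek() != expected) {
            throw new IllegalStateException(
                "expected " + expected + " but found " + peek() + " at token " + pos);
        }
        pos++;
    }

    // element: LT NAME attribute* [GT content etag | EMPTY]
    void element() {
        expect(Tok.LT);
        expect(Tok.NAME);
        while (peek() == Tok.NAME) attribute();
        if (peek() == Tok.EMPTY) {
            expect(Tok.EMPTY);
        } else {
            expect(Tok.GT);
            content();
            etag();
        }
    }

    // attribute: NAME EQ LITERAL
    void attribute() {
        expect(Tok.NAME);
        expect(Tok.EQ);
        expect(Tok.LITERAL);
    }

    // content: [element | PCDATA | ms | PI | COMMENT]*   (ms omitted in this sketch)
    void content() {
        while (true) {
            switch (peek()) {
                case LT: element(); break;
                case PCDATA: case PI: case COMMENT: pos++; break;
                default: return; // anything else ends the content
            }
        }
    }

    // etag: ETAGO NAME GT
    void etag() {
        expect(Tok.ETAGO);
        expect(Tok.NAME);
        expect(Tok.GT);
    }

    // True if the whole token list is exactly one element.
    static boolean accepts(List<Tok> tokens) {
        ElementRecognizer r = new ElementRecognizer(tokens);
        try {
            r.element();
            return r.peek() == Tok.EOF;
        } catch (IllegalStateException syntaxError) {
            return false;
        }
    }
}
```

Each nonterminal becomes one method, and a single token of lookahead is enough to choose between the alternatives, which is what makes a hand-written parser practical for this grammar.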
Some of the differences are:
White space is handled at the lexical level (see XMLTokenizer).
No internal DTD subset
No "required markup declaration": a parser should never need the DTD for parsing, only for generating documents.
The DTD summary is reduced to ID info and default info (ID info is not used by this parser, but it would be easy to add).
Entities other than character entities are not permitted. Character entities are handled invisibly by the tokenizer and are not reported to the parser.
Character entities are permitted in element content.
The parser doesn't validate.
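As an illustration of how a tokenizer can resolve character entities before the parser ever sees them, here is a small sketch. It is not the code of XMLTokenizer. CharRefDecoder is a hypothetical name, and the sketch assumes the XML 1.0 notation (&#NNN; for decimal, &#xHHHH; for hexadecimal), which may differ from the notation this tokenizer accepts.

```java
// Hypothetical helper: replaces numeric character references with the
// characters they denote and leaves all other text untouched.
class CharRefDecoder {
    static String decode(String text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            // Position of the terminating ';' if a reference starts here, else -1.
            int end = text.startsWith("&#", i) ? text.indexOf(';', i + 2) : -1;
            if (end > i + 2) {
                String digits = text.substring(i + 2, end);
                try {
                    int codePoint = digits.startsWith("x")
                            ? Integer.parseInt(digits.substring(1), 16)
                            : Integer.parseInt(digits, 10);
                    out.appendCodePoint(codePoint);
                    i = end + 1;
                    continue;
                } catch (IllegalArgumentException malformed) {
                    // Not a valid reference: fall through and copy the '&' literally.
                }
            }
            out.append(text.charAt(i));
            i++;
        }
        return out.toString();
    }
}
```

A real tokenizer would more likely report a malformed reference as an error than copy it through. The sketch copies it only to stay short.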
The error messages are also not very helpful. The parser is written by hand, so it was easier to use only insert() and never delete(). A tool should be used to generate the director sets needed to resynchronize after a syntax error.
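The following sketch shows one way insertion-only repair can work, applied to the etag rule (ETAGO NAME GT). It is an assumption about the general technique, not the actual insert() of this parser, whose signature is not documented here. EndTagTok and InsertOnlyRecovery are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical token kinds for the etag rule only.
enum EndTagTok { ETAGO, NAME, GT, EOF }

class InsertOnlyRecovery {
    private final List<EndTagTok> input;
    private int pos = 0;
    final List<String> errors = new ArrayList<>();

    InsertOnlyRecovery(List<EndTagTok> input) { this.input = input; }

    private EndTagTok peek() { return pos < input.size() ? input.get(pos) : EndTagTok.EOF; }

    // Insertion-style repair: on a mismatch, report the error and carry on as
    // if the expected token had been present. Input is never skipped.
    private void expect(EndTagTok expected) {
        if (peek() == expected) {
            pos++;
        } else {
            errors.add("inserted missing " + expected + " before " + peek());
        }
    }

    // etag: ETAGO NAME GT
    void etag() {
        expect(EndTagTok.ETAGO);
        expect(EndTagTok.NAME);
        expect(EndTagTok.GT);
    }

    // Convenience wrapper: parse one end tag and return the reported errors.
    static List<String> parseEtag(EndTagTok... tokens) {
        InsertOnlyRecovery p = new InsertOnlyRecovery(Arrays.asList(tokens));
        p.etag();
        return p.errors;
    }
}
```

Calling parseEtag(EndTagTok.ETAGO, EndTagTok.NAME) reports a single error for the missing GT and still completes the rule. A stray extra token is never discarded by this scheme, so one mistake can trigger a run of follow-on messages.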
public XMLParser(InputStream aStream, XMLParserUser aUser, int nrerrors[], XMLNode tree[]) throws IOException, UnknownEncoding
public XMLParser(InputStream aStream, XMLParserUser aUser, String encoding, int nrerrors[], XMLNode tree[]) throws IOException, UnknownEncoding