java.lang.Object
   |
   +----XMLTokenizer
The XMLTokens interface gives the list of tokens that are recognized. Compared to the terminal symbols in the XML draft, they are generally somewhat larger: on average a single token represents more characters than a token in the XML draft. This divides the work better between the tokenizer and the parser, and it allows the parser to work without any lookahead beyond the current token, running strictly as an LL(1) parser.
Another difference is in the way white space is handled. White space is never returned as a token in this parser. It is either ignored or passed as PCDATA, depending on the context. There are three contexts: content mode, markup mode and marked-section mode, and each has its own rules. The rules are simple enough that the tokenizer can deal with them, which in turn means that the parser has an even easier job.
The rules for ignoring white space are not the same as in the XML draft, though. In markup, white space only serves to separate other tokens, so ignoring it there is not an important change. In marked sections white space is never ignored, which is the same as in the XML draft. But in content the rule is different.
In content, there is a simple rule applied by the tokenizer: a newline that immediately precedes a `<' is ignored, as is a newline that immediately follows a `>'. No other white space is ever ignored. A `newline' in this context is either a line feed (U+000A), carriage return (U+000D) or a carriage return followed by a line feed. For example, the following XML code (_ represents a normal space):
is exactly equivalent to:
before_ <dada> some_text_here </dada> _after
That means that to get a real newline just before or after some markup, you will need to insert two newlines. That seems like a small price to pay. Otherwise we would need different rules for the white space outside the root element, since it is hard to create files that don't end with a newline.
Ignoring more than one newline, or ignoring other white space, would make it impossible to end an element with white space.
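The content-mode rule can be illustrated with a small stand-alone sketch. This is not the tokenizer's own code: the class name ContentNewlines and its methods are hypothetical, and the sketch treats its whole input as content, so it does not know about markup mode or marked sections.

```java
/** Hypothetical illustration of the content-mode newline rule; not the tokenizer's code. */
final class ContentNewlines {
    /** Length of the newline starting at index i: 2 for CR LF, 1 for LF or CR, 0 if none. */
    static int newlineLength(CharSequence s, int i) {
        if (i >= s.length()) return 0;
        char c = s.charAt(i);
        if (c == '\n') return 1;
        if (c == '\r') {
            return (i + 1 < s.length() && s.charAt(i + 1) == '\n') ? 2 : 1;
        }
        return 0;
    }

    /** Drops one newline directly after '>' and one directly before '<'; keeps everything else. */
    static String normalize(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            int n = newlineLength(s, i);
            if (n == 0) {
                out.append(s.charAt(i));
                i++;
                continue;
            }
            boolean followsGt = i > 0 && s.charAt(i - 1) == '>';
            boolean precedesLt = i + n < s.length() && s.charAt(i + n) == '<';
            if (!followsGt && !precedesLt) {
                out.append(s, i, i + n); // an ordinary newline: keep it
            }
            i += n;
        }
        return out.toString();
    }
}
```

With this sketch, ContentNewlines.normalize("a\n\n<b>") keeps one of the two newlines, which matches the remark that two newlines are needed to get a real one next to markup.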
Hexadecimal numeric entities don't have to have exactly four digits in this implementation.
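As a rough sketch of what accepting any number of hexadecimal digits could look like: the code below assumes the `&#x...;` notation, which is not spelled out in this documentation, and the class HexCharRef is hypothetical rather than part of this package.

```java
/** Hypothetical decoder for hexadecimal references; the "&#x...;" syntax is an assumption. */
final class HexCharRef {
    /** Decodes a reference such as "&#x41;" with one or more hex digits. */
    static String decode(String ref) {
        if (ref.length() < 5 || !ref.startsWith("&#x") || !ref.endsWith(";")) {
            throw new IllegalArgumentException("not a hexadecimal reference: " + ref);
        }
        int codePoint = 0;
        for (int i = 3; i < ref.length() - 1; i++) {
            char c = ref.charAt(i);
            int digit;
            if (c >= '0' && c <= '9') digit = c - '0';
            else if (c >= 'a' && c <= 'f') digit = c - 'a' + 10;
            else if (c >= 'A' && c <= 'F') digit = c - 'A' + 10;
            else throw new IllegalArgumentException("bad hex digit: " + c);
            codePoint = codePoint * 16 + digit;
            // Checking after every digit keeps the running value well within int range.
            if (codePoint > Character.MAX_CODE_POINT) {
                throw new IllegalArgumentException("value out of range: " + ref);
            }
        }
        return new String(Character.toChars(codePoint));
    }
}
```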
Comments may not contain occurrences of `--' (two dashes) according to the XML draft. This tokenizer doesn't check for that; it only ends a comment at `-->'. (It would be easy enough to add, though.)
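One possible shape for the missing check is sketched below. The class CommentCheck and both method names are hypothetical; the sketch works on a plain string and takes the index of the first character after `<!--` as a parameter, which is an assumption about how a caller would use it.

```java
/** Hypothetical sketch of a `--' check for comments; not part of this tokenizer. */
final class CommentCheck {
    /** Index of the "-->" that ends the comment, or -1 if the comment is unterminated. */
    static int endOfComment(String input, int bodyStart) {
        return input.indexOf("-->", bodyStart);
    }

    /**
     * True if the comment is terminated but some "--" occurs before the closing "-->".
     * The first "--" at or after bodyStart must be the one that opens the closing "-->".
     */
    static boolean hasStrayDoubleDash(String input, int bodyStart) {
        int end = endOfComment(input, bodyStart);
        int firstDashes = input.indexOf("--", bodyStart);
        return end != -1 && firstDashes < end;
    }
}
```

Comparing the position of the first `--' with the position of `-->' also catches a comment that ends in `--->', which a plain search of the comment body would miss.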
Processing instructions, other than the predefined XML ones, are not tokenized, but returned as a single string. It is left to the application to look for the `target' (the first word in the string).
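An application could pick out the target along the following lines. This is only a sketch: the class PiTarget is hypothetical, and it assumes that the string handed to the application holds the instruction's text without the surrounding `<?' and `?>'.

```java
/** Hypothetical helper for applications; splits a processing instruction string. */
final class PiTarget {
    private static boolean isSpace(char c) {
        return c == ' ' || c == '\t' || c == '\r' || c == '\n';
    }

    /** Returns { target, rest }: the first word and whatever follows the white space after it. */
    static String[] split(String pi) {
        int start = 0;
        while (start < pi.length() && isSpace(pi.charAt(start))) start++;
        int end = start;
        while (end < pi.length() && !isSpace(pi.charAt(end))) end++;
        int rest = end;
        while (rest < pi.length() && isSpace(pi.charAt(rest))) rest++;
        return new String[] { pi.substring(start, end), pi.substring(rest) };
    }
}
```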
public XMLTokenizer(InputStream aStream) throws UnknownEncoding
public XMLTokenizer(InputStream aStream, String encoding) throws UnknownEncoding
public void setEncoding(String encoding) throws UnknownEncoding
public int getLineno()
public int getColno()
public int next() throws IOException
public String value()
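The methods above suggest a simple read loop, sketched here under several assumptions. TokenSource is a hypothetical stand-in interface that copies the four signatures so the example compiles on its own, and FakeSource is an in-memory fake used only to exercise the loop. The token code that signals end of input is not stated in this section, so the sketch takes it as a parameter. Whether getLineno() and getColno() describe the token just returned is also an assumption.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in mirroring next(), value(), getLineno() and getColno(). */
interface TokenSource {
    int next() throws IOException;
    String value();
    int getLineno();
    int getColno();
}

final class TokenDump {
    /** Reads tokens until endToken is returned; formats each as "line:col code value". */
    static List<String> dump(TokenSource in, int endToken) throws IOException {
        List<String> lines = new ArrayList<>();
        for (int tok = in.next(); tok != endToken; tok = in.next()) {
            lines.add(in.getLineno() + ":" + in.getColno() + " " + tok + " " + in.value());
        }
        return lines;
    }
}

/** In-memory fake for trying out the loop; not part of the documented class. */
final class FakeSource implements TokenSource {
    private final int[] codes;
    private final String[] values;
    private int pos = -1;

    FakeSource(int[] codes, String[] values) {
        this.codes = codes;
        this.values = values;
    }

    public int next() { pos++; return codes[pos]; }
    public String value() { return values[pos]; }
    public int getLineno() { return 1; }        // the fake pretends everything is on line 1
    public int getColno() { return pos + 1; }   // and uses the token index as the column
}
```

With the real class, the TokenSource argument would be replaced by an XMLTokenizer built from an InputStream, and endToken by whichever constant in XMLTokens marks the end of input.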