Class XMLTokenizer

java.lang.Object
   |
   +----XMLTokenizer

public class XMLTokenizer
extends Object
implements XMLTokens

The XMLTokenizer class is created with a stream of bytes as argument. Every call to next causes it to read a few more bytes and return the token they represent. The encoding of the byte stream is assumed to be UTF8 by default, but other encodings can be used as well.

Differences with XML draft

The XMLTokens interface gives the list of tokens that are recognized. Compared to the terminal symbols in the XML draft, they are generally somewhat larger: on average a single token represents more characters than a token in the XML draft. This divides the work between the tokenizer and the parser better, and it lets the parser decide on the current token alone, without further lookahead, running strictly as an LL(1) parser.

Another difference is in the way white space is handled. White space is never returned as a token in this parser. It is either ignored or passed as PCDATA, depending on the context. There are three contexts: content mode, markup mode and marked-section mode, and each has its own rules. The rules are simple enough that the tokenizer can deal with them, which in turn means that the parser has an even easier job.

The rules for ignoring white space are not the same as in the XML draft, though. In markup, white space only serves to separate other tokens, so ignoring it there is not an important change. In marked sections white space is never ignored, which is the same as in the XML draft. But in content the rule is different.

In content, there is a simple rule applied by the tokenizer: a newline that immediately precedes a `<' is ignored, as is a newline that immediately follows a `>'. No other white space is ever ignored. A `newline' in this context is either a line feed (U+000A), carriage return (U+000D) or a carriage return followed by a line feed. For example, the following XML code (_ represents a normal space):

 before_<dada>some_text_here</dada>_after

is exactly equivalent to:

 before_
 <dada>
 some_text_here
 </dada>
 _after

That means that to get a real newline just before or after some markup, you will need to insert two newlines. That seems like a small price to pay. Otherwise we would need different rules for the whitespace outside the root element, since it is hard to create files that don't end with a newline.

Ignoring more than one newline, or ignoring other whitespace, would make it impossible to end an element with white space.
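
The content-mode rule above can be sketched as a small string filter. This is only an illustration under stated assumptions, not the tokenizer's actual code: the class and method names are hypothetical, and the sketch treats its whole input as content, without tracking markup or marked-section mode.

```java
final class ContentNewlines {
    /** Length of the newline (LF, CR, or CR LF) starting at index i, or 0 if none. */
    static int newlineLength(String s, int i) {
        if (i >= s.length()) return 0;
        char c = s.charAt(i);
        if (c == '\n') return 1;
        if (c == '\r') {
            // CR followed by LF counts as one newline
            return (i + 1 < s.length() && s.charAt(i + 1) == '\n') ? 2 : 1;
        }
        return 0;
    }

    /** Drops a newline that immediately follows '>' or immediately precedes '<'. */
    static String strip(String content) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < content.length()) {
            int n = newlineLength(content, i);
            if (n == 0) {
                out.append(content.charAt(i));
                i++;
                continue;
            }
            boolean afterGt = i > 0 && content.charAt(i - 1) == '>';
            boolean beforeLt = i + n < content.length() && content.charAt(i + n) == '<';
            if (!afterGt && !beforeLt) {
                out.append(content, i, i + n); // every other newline is kept
            }
            i += n;
        }
        return out.toString();
    }
}
```

Applied to the five-line form of the example, `strip` yields the one-line form; applied to `<a>` followed by two newlines and text, it keeps exactly one newline.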

Hexadecimal numeric entities don't have to have exactly four digits in this implementation.

Comments may not contain occurrences of `--' (two dashes) according to the XML draft. This tokenizer doesn't check for that; it only ends a comment at `-->'. (It would be easy enough to add, though.)
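
As a rough idea of what the missing check might look like, here is a sketch that finds the end of a comment at `-->` and reports whether a `--' occurs earlier. The helper names are hypothetical, and the sketch assumes `start` is the index just past the opening `<!--`; it does not reflect the tokenizer's internals.

```java
final class CommentScan {
    /** Index just past the closing "-->", or -1 if the comment is unterminated. */
    static int endOfComment(String s, int start) {
        int close = s.indexOf("-->", start);
        return close < 0 ? -1 : close + 3;
    }

    /** True if "--" occurs inside the comment before the closing "-->". */
    static boolean hasInnerDoubleDash(String s, int start) {
        int close = s.indexOf("-->", start);
        if (close < 0) return false; // unterminated: nothing to compare against
        // In a well-formed comment the first "--" is the one that begins "-->".
        return s.indexOf("--", start) != close;
    }
}
```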

Processing instructions, other than the predefined XML ones, are not tokenized, but returned as a single string. It is left to the application to look for the `target' (the first word in the string).
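
An application could pick out the target along these lines. This is a sketch with hypothetical helper names; it assumes the string handed back by the tokenizer no longer contains the surrounding delimiters, which the description above does not state.

```java
final class PiTarget {
    /** The first white-space-delimited word of the processing instruction. */
    static String target(String pi) {
        String t = pi.trim();
        int i = 0;
        while (i < t.length() && !Character.isWhitespace(t.charAt(i))) {
            i++;
        }
        return t.substring(0, i);
    }

    /** Everything after the target, with surrounding white space removed. */
    static String data(String pi) {
        String t = pi.trim();
        return t.substring(target(pi).length()).trim();
    }
}
```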

XMLTokenizer(InputStream): Constructor: creates a tokenizer, given a raw byte stream.
XMLTokenizer(InputStream, String): Constructor: creates a tokenizer, given a raw byte stream and its encoding.

getColno(): Return current column number
getLineno(): Return current line number
next(): Return next token, storing any associated information in a String that is available by calling value().
setEncoding(String): Set the encoding of the input stream.
value(): Return the String associated with the current token.

XMLTokenizer

 public XMLTokenizer(InputStream aStream) throws UnknownEncoding

Constructor: creates a tokenizer, given a raw byte stream. The stream is assumed to be in UTF8.

Parameters: aStream - raw byte stream
Throws: UnknownEncoding: if the encoding isn't either UTF8 or ISO8859-1

XMLTokenizer

 public XMLTokenizer(InputStream aStream,
                     String encoding) throws UnknownEncoding

Constructor: creates a tokenizer, given a raw byte stream. The stream is assumed to have the given encoding, which can be "UTF8", "ISO8859-1",... (case-insensitive).

Parameters: aStream - raw byte stream; encoding - the "charset" of the stream
Throws: UnknownEncoding: if the encoding isn't either UTF8 or ISO8859-1

setEncoding

 public void setEncoding(String encoding) throws UnknownEncoding

Set the encoding of the input stream. This changes the way the bytes are interpreted. The encoding may be changed in the middle of parsing, but a few bytes may already have been read into a buffer and will not be interpreted again. Switching between charsets that encode the XML delimiters identically (UTF8, ISO8859-*) should normally work. Currently limited to UTF8 and ISO8859-1.

Parameters: encoding - a string like "utf8", "iso8859-1", etc.
Throws: UnknownEncoding: if the encoding isn't either UTF8 or ISO8859-1
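
The remark about delimiters being encoded the same can be checked with the standard Java charsets. The sketch below does not use the tokenizer at all; the class and method names are hypothetical, and "UTF-8" and "ISO-8859-1" are the JDK's charset names rather than the strings this class accepts.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class DelimiterBytes {
    /** True if the text has the same byte sequence in UTF-8 and ISO-8859-1. */
    static boolean sameBytes(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return Arrays.equals(utf8, latin1);
    }
}
```

ASCII markup such as `<dada>` has the same bytes in both charsets, so a switch that happens after a few buffered delimiter bytes does no harm; a character such as U+00E9 does not, which is why buffered content bytes may end up interpreted under the old encoding.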

getLineno

 public int getLineno()

Return current line number

getColno

 public int getColno()

Return current column number

next

 public int next() throws IOException

Return next token, storing any associated information in a String that is available by calling value(). The string is only available until the next call to next(). Not all tokens store information in value().

Returns: the recognized token (an int)

value

 public String value()

Return the String associated with the current token. Not all tokens set the value to something meaningful.

Returns: the contents of the current token
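
The interplay of next() and value() suggests a read loop of roughly the following shape. This is a sketch only: `TokenSource` is a stand-in that copies the two documented method signatures, `Replay` is a toy implementation so the loop can run, and `END` is a hypothetical end-of-input code. The real token codes come from XMLTokens and are not listed on this page.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class TokenLoopSketch {
    /** Stand-in with the two documented method signatures; NOT the real class. */
    interface TokenSource {
        int next() throws IOException;
        String value();
    }

    /** Hypothetical end-of-input code; an assumption made for this sketch. */
    static final int END = -1;

    /** Toy source that replays fixed (token, value) pairs. */
    static final class Replay implements TokenSource {
        private final int[] tokens;
        private final String[] values;
        private int pos = -1;

        Replay(int[] tokens, String[] values) {
            this.tokens = tokens;
            this.values = values;
        }

        public int next() {
            pos++;
            return pos < tokens.length ? tokens[pos] : END;
        }

        public String value() {
            return pos >= 0 && pos < values.length ? values[pos] : null;
        }
    }

    /** Reads tokens until END, copying each value before next() is called again. */
    static List<String> drain(TokenSource src) {
        List<String> seen = new ArrayList<>();
        try {
            int token;
            while ((token = src.next()) != END) {
                String v = src.value(); // only valid until the next call to next()
                seen.add(token + ":" + v);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return seen;
    }
}
```

The point of the loop is the ordering: the value is read right after the token is returned, since a later call to next() may replace it, and not every token carries a meaningful value.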