
Class w3c.xmlOnline.parser.XMLTokenizer

java.lang.Object
   |
   +----w3c.xmlOnline.parser.XMLTokenizer

public class XMLTokenizer
extends Object
implements ParserDef, Scanner

An XMLTokenizer is created with a stream of bytes as its argument. Every call to next() causes it to read a few more bytes and return the token they represent. The byte stream is assumed to be encoded in UTF8 by default, but another encoding can be selected (currently only ISO8859-1).

The corresponding parser is generated from the grammar in Parser.ll1.

Differences with XML draft

The XMLTokens interface lists the tokens that are recognized. Compared to the terminal symbols in the XML draft, they are generally somewhat larger: on average a single token represents more characters than a token in the XML draft. Larger tokens divide the work between the tokenizer and the parser better, and they allow the parser to run strictly as an LL(1) parser, deciding on the current token alone without any further lookahead.

Another difference is in the way white space is handled. White space is never returned as a token in this parser. It is either ignored or passed as PCDATA, depending on the context. There are three contexts: content mode, markup mode and marked-section mode, and each has its own rules. The rules are simple enough that the tokenizer can deal with them, which in turn means that the parser has an even easier job.

The rules for ignoring white space are not the same as in the XML draft, though. In markup, white space only serves to separate other tokens, so ignoring it there is not an important change. In marked sections white space is never ignored, which is the same as in the XML draft. But in content the rule is different.

In content, there is a simple rule applied by the tokenizer: a newline that immediately precedes a `<' is ignored, as is a newline that immediately follows a `>'. No other white space is ever ignored. A `newline' in this context is either a line feed (U+000A), carriage return (U+000D) or a carriage return followed by a line feed. For example, the following XML code (_ represents a normal space):

 before_<dada>some_text_here</dada>_after

is exactly equivalent to:

 before_
 <dada>
 some_text_here
 </dada>
 _after

That means that to get a real newline just before or after some markup, you will need to insert two newlines. That seems like a small price to pay. Otherwise we would need different rules for the white space outside the root element, since it is hard to create files that don't end with a newline.

Ignoring more than one newline, or ignoring other white space, would make it impossible to end an element with white space.
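
The content-mode newline rule can be sketched in a few lines of Java. This is only an illustration of the rule as stated here, not the tokenizer's actual code: the class and method names are made up, and the sketch assumes that every `<' and `>' in the input is a markup delimiter (it knows nothing about comments, marked sections or attribute values).

```java
/** Illustrative sketch of the content-mode newline rule; not the tokenizer's own code. */
final class ContentNewlines {
    /**
     * Drops one newline (LF, CR, or CR LF) directly before a '<'
     * and one newline directly after a '>'. All other white space is kept.
     * Assumption: every '<' and '>' in the input delimits markup.
     */
    static String strip(String content) {
        StringBuilder out = new StringBuilder();
        int n = content.length();
        int i = 0;
        while (i < n) {
            int len = newlineLength(content, i);
            if (len > 0 && i + len < n && content.charAt(i + len) == '<') {
                i += len;                        // newline immediately before '<': skip it
                continue;
            }
            char c = content.charAt(i);
            out.append(c);
            i++;
            if (c == '>') {
                i += newlineLength(content, i);  // newline immediately after '>': skip it
            }
        }
        return out.toString();
    }

    /** Length of the newline starting at index i: 2 for CR LF, 1 for CR or LF, otherwise 0. */
    static int newlineLength(String s, int i) {
        if (i >= s.length()) {
            return 0;
        }
        char c = s.charAt(i);
        if (c == '\r') {
            return (i + 1 < s.length() && s.charAt(i + 1) == '\n') ? 2 : 1;
        }
        return c == '\n' ? 1 : 0;
    }
}
```

Applied to the five-line example above, strip() gives back the one-line form, and of two newlines in front of a `<' only the second one is dropped.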

Hexadecimal numeric entities don't have to have exactly four digits in this implementation.
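
As a rough illustration, accepting any number of hexadecimal digits amounts to parsing the digit run as a base-16 integer. The helper below is hypothetical and only handles the digits themselves; the delimiters around them and the way the tokenizer really does this are not shown on this page.

```java
/** Illustrative sketch: decode a run of hex digits of any length; not the tokenizer's own code. */
final class HexCharRef {
    /** Converts hexadecimal digits (without any delimiters) to the character they name. */
    static String decode(String hexDigits) {
        // Integer.parseInt throws NumberFormatException if the input is not hexadecimal.
        int codePoint = Integer.parseInt(hexDigits, 16);
        if (!Character.isValidCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a Unicode code point: " + hexDigits);
        }
        return new String(Character.toChars(codePoint));
    }
}
```

With this sketch, "41", "0041" and "000041" all decode to the same character.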

Comments may not contain occurrences of `--' (two dashes) according to the XML draft. This tokenizer doesn't check for that; it only ends a comment at `-->'. (It would be easy enough to add, though.)
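
A minimal sketch of the difference, using made-up helper names: the first method ends the comment at the first `-->', which is the behaviour described here, and the second is the kind of extra check the draft would require. It assumes the input starts just after the opening `<!--'.

```java
/** Illustrative sketch of comment scanning; not the tokenizer's own code. */
final class CommentScan {
    /**
     * Given text that starts just after "<!--", returns the comment body,
     * i.e. everything up to the first "-->", or null if no "-->" follows.
     */
    static String body(String afterOpen) {
        int end = afterOpen.indexOf("-->");
        return end < 0 ? null : afterOpen.substring(0, end);
    }

    /** The additional draft rule: a comment body must not contain "--". */
    static boolean isDraftConformant(String body) {
        return !body.contains("--");
    }
}
```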

Processing instructions, other than the predefined XML ones, are not tokenized, but returned as a single string. It is left to the application to look for the `target' (the first word in the string).
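
An application could split off the target along these lines. The sketch assumes that the string handed to the application is the text between the processing-instruction delimiters; whether the tokenizer includes the delimiters is not stated here, and the class and method names are hypothetical.

```java
/** Illustrative sketch: split a processing-instruction string into target and data. */
final class PiTarget {
    /** Returns the first white-space-delimited word, or "" if there is none. */
    static String target(String pi) {
        String trimmed = pi.trim();
        if (trimmed.isEmpty()) {
            return "";
        }
        return trimmed.split("\\s+", 2)[0];
    }

    /** Returns whatever follows the target, or "" if nothing does. */
    static String data(String pi) {
        String[] parts = pi.trim().split("\\s+", 2);
        return parts.length > 1 ? parts[1] : "";
    }
}
```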

Structured attributes are supported. They look like

 <*attribname>... attribute content... <>

As an experiment, a + is allowed instead of the * as well (to see if it looks better). Similarly, end-tags may use : instead of /.

Three types of short end-tags are supported: </>, <:> and <>
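
The alternative spellings can be summarized as simple predicates. These helpers are hypothetical and only classify the opening characters of a piece of markup; they say nothing about how the tokenizer itself reports these constructs.

```java
/** Illustrative sketch of the experimental tag spellings; not the tokenizer's own code. */
final class TagForms {
    /** True for the three short end-tags: "</>", "<:>" and "<>". */
    static boolean isShortEndTag(String tag) {
        return tag.equals("</>") || tag.equals("<:>") || tag.equals("<>");
    }

    /** True if the text opens an end-tag, written with either '/' or ':'. */
    static boolean opensEndTag(String text) {
        return text.length() >= 2 && text.charAt(0) == '<'
                && (text.charAt(1) == '/' || text.charAt(1) == ':');
    }

    /** True if the text opens a structured attribute, written with either '*' or '+'. */
    static boolean opensStructuredAttribute(String text) {
        return text.length() >= 2 && text.charAt(0) == '<'
                && (text.charAt(1) == '*' || text.charAt(1) == '+');
    }
}
```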

Version: $Id: w3c.xmlOnline.parser.XMLTokenizer.html,v 1.7 1997/06/09 22:25:28 bbos Exp $
Author: Bert Bos





  


   
Constructor Index

XMLTokenizer(InputStream)
   Constructor: creates a tokenizer, given a raw byte stream.
XMLTokenizer(InputStream, String)
   Constructor: creates a tokenizer, given a raw byte stream.

Method Index

getColno()
   Return current column number
getLineno()
   Return current line number
next()
   Return next token, storing any associated information in a String that is available by calling value().
setEncoding(String)
   Set the encoding of the input stream.
value()
   Return the String associated with the current token.



  



XMLTokenizer
 public XMLTokenizer(InputStream aStream) throws UnknownEncoding


   Constructor: creates a tokenizer, given a raw byte stream.
   The stream is assumed to be in UTF8.

   Parameters:
      aStream - raw byte stream
   Throws: UnknownEncoding
      if the encoding isn't either UTF8 or ISO8859-1
  


XMLTokenizer
 public XMLTokenizer(InputStream aStream,
                     String encoding) throws UnknownEncoding


   Constructor: creates a tokenizer, given a raw byte stream.
   The stream is assumed to have the given encoding, which
   can be "UTF8", "ISO8859-1",... (case-insensitive).

   Parameters:
      aStream - raw byte stream
      encoding - the "charset" of the stream
   Throws: UnknownEncoding
      if the encoding isn't either UTF8 or ISO8859-1
  



  


setEncoding
 public void setEncoding(String encoding) throws UnknownEncoding


   Set the encoding of the input stream. This changes the way the
   bytes are interpreted. The encoding may be changed in the
   middle of parsing, but a few bytes may already have been
   read into a buffer and will not be interpreted again.
   Changing among charsets in which the XML delimiters are
   encoded identically should normally work (UTF8, ISO8859-*).
   Currently limited to UTF8 and ISO8859-1.

   Parameters:
      encoding - a string like "utf8", "iso8859-1", etc.
   Throws: UnknownEncoding
      if the encoding isn't either UTF8 or ISO8859-1
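
The following sketch shows one way the case-insensitive name check and the remark about delimiters could look in plain Java. It is not the tokenizer's implementation: EncodingNames and UnknownEncodingSketch are stand-ins invented for this example (the real UnknownEncoding class is not shown on this page), and the list of delimiter characters is an assumption.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Locale;

/** Illustrative sketch of encoding-name handling; not the tokenizer's own code. */
final class EncodingNames {
    /** Hypothetical stand-in for the UnknownEncoding exception named in the signatures. */
    static class UnknownEncodingSketch extends Exception {
        UnknownEncodingSketch(String name) {
            super("unknown encoding: " + name);
        }
    }

    /** Maps "UTF8" or "ISO8859-1" (in any letter case) to a Charset. */
    static Charset lookup(String name) throws UnknownEncodingSketch {
        String key = name.toUpperCase(Locale.ROOT);
        if (key.equals("UTF8")) {
            return StandardCharsets.UTF_8;
        }
        if (key.equals("ISO8859-1")) {
            return StandardCharsets.ISO_8859_1;
        }
        throw new UnknownEncodingSketch(name);
    }

    /** True if a sample of XML delimiter characters has the same bytes in both charsets. */
    static boolean delimitersAgree(Charset a, Charset b) {
        String delimiters = "<>&;/\"'=";   // assumed sample of delimiter characters
        return Arrays.equals(delimiters.getBytes(a), delimiters.getBytes(b));
    }
}
```

Under these assumptions, delimitersAgree() holds for UTF8 and ISO8859-1, which is why switching between them in mid-stream does not disturb markup that was already buffered, while a charset such as UTF-16 would not pass the check.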
  


getLineno
 public int getLineno()


   Return current line number



getColno
 public int getColno()


   Return current column number



next
 public int next() throws IOException


   Return next token, storing any associated information
   in a String that is available by calling value(). The
   string is only available until the next call to next().
   Not all tokens store information in value().

   Returns:
      the recognized token (an int)
  


value
 public String value()


   Return the String associated with the current token.
   Not all tokens set the value to something meaningful.

   Returns:
      the contents of the current token
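
The interplay of next() and value() suggests a loop in which value() is read right after each call to next(), before the next token replaces it. The sketch below uses a hypothetical TokenSource interface as a stand-in for this class so that it compiles on its own, and the EOF constant is an assumption: the real token codes are defined in the XMLTokens interface, which is not shown on this page.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the next()/value() pair documented above. */
interface TokenSource {
    int next() throws IOException;
    String value();
}

/** Illustrative token loop; the EOF value is an assumption, not taken from XMLTokens. */
final class TokenLoop {
    static final int EOF = -1;   // assumed end-of-input token

    /** Collects "token:value" strings until the assumed end-of-input token is returned. */
    static List<String> collectValues(TokenSource source) throws IOException {
        List<String> values = new ArrayList<String>();
        for (int token = source.next(); token != EOF; token = source.next()) {
            // Read value() now: it is only valid until the next call to next().
            values.add(token + ":" + source.value());
        }
        return values;
    }
}
```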
  

