<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<!--NewPage-->
<html>
<head>
<!-- Generated by javadoc on Mon Jun 09 23:45:36 GMT+03:30 1997 -->
<title>
  Class w3c.xmlOnline.parser.XMLTokenizer
</title>
</head>
<body>
<a name="_top_"></a>
<pre>
<a href="packages.html">All Packages</a>  <a href="tree.html">Class Hierarchy</a>  <a href="Package-w3c.xmlOnline.parser.html">This Package</a>  <a href="w3c.xmlOnline.parser.UTF8PrintStream.html#_top_">Previous</a>  <a href="Package-w3c.xmlOnline.parser.html">Next</a>  <a href="AllNames.html">Index</a></pre>
<hr>
<h1>
  Class w3c.xmlOnline.parser.XMLTokenizer
</h1>
<pre>
java.lang.Object
   |
   +----w3c.xmlOnline.parser.XMLTokenizer
</pre>
<hr>
<dl>
  <dt> public class <b>XMLTokenizer</b>
  <dt> extends Object
  <dt> implements <a href="w3c.xmlOnline.parser.ParserDef.html#_top_">ParserDef</a>, <a href="w3c.xmlOnline.parser.Scanner.html#_top_">Scanner</a>
</dl>
The XMLTokenizer class is created with a stream of bytes as
 argument. Every call to next causes it to read a few more bytes and
 return the token they represent. The encoding of the byte stream is
 assumed to be UTF8 by default, but other encodings can be used as
 well.
 <p>The corresponding parser is generated from the grammar in
 <a href="Parser.ll1">Parser.ll1</a>.
 <h2>Differences with XML draft</h2>
 <p>The XMLTokens interface gives the list of tokens that are
 recognized. Compared to the terminal symbols in <a
 href="http://www.w3.org/pub/WWW/TR/WD-xml-961114.html">the XML
 draft,</a> they are generally somewhat larger: on average a single
 token represents more characters than a token in the XML draft. The
 reason is that larger tokens divide the work between the tokenizer
 and the parser better, and they allow the parser to work without
 extra lookahead, running strictly as an LL(1) parser.
 <p>Another difference is in the way white space is handled. White
 space is never returned as a token in this parser. It is either
 ignored or passed as PCDATA, depending on the context. There are
 three contexts: content mode, markup mode and marked-section mode
 and each has its own rules. The rules are simple enough that the
 tokenizer can deal with them, which in turn means that the parser
 has an even easier job.
 <p>The rules for ignoring white space are not the same as in the
 XML draft, though. In markup, white space only serves to separate
 other tokens, so ignoring it there is not an important change. In
 marked sections white space is never ignored, which is the same as
 in the XML draft. But in content the rule is different.
 <p>In content, there is a simple rule applied by the tokenizer: a
 newline that immediately precedes a `&lt;' is ignored, as is a
 newline that immediately follows a `&gt;'. No other white space is
 ever ignored. A `newline' in this context is either a line feed
 (U+000A), carriage return (U+000D) or a carriage return followed by
 a line feed. For example, the following XML code (_ represents a
 normal space):
 <pre>
 before_&lt;dada&gt;some_text_here&lt;/dada&gt;_after
 </pre>
 <p>is exactly equivalent to:
 <pre>
 before_
 &lt;dada&gt;
 some_text_here
 &lt;/dada&gt;
 _after
 </pre>
 <p>That means that to get a real newline just before or after some
 markup, you will need to insert <em>two</em> newlines. That seems
 like a small price to pay. Otherwise we would need different rules
 for the whitespace outside the root element, since it is hard to
 create files that don't end with a newline.
 <p>Ignoring <em>more</em> than one newline, or ignoring other
 whitespace, would make it impossible to end an element with white
 space.
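 <p>The content-mode newline rule can be sketched as a plain string
 transformation. This is only an illustration, not the tokenizer's own
 code: the class and method names are hypothetical, and the sketch
 treats every `&lt;' and `&gt;' as markup, whereas the real tokenizer
 also keeps track of markup mode and marked-section mode.

```java
import java.util.regex.Pattern;

final class NewlineRule {
    // A newline is CR LF, a lone CR or a lone LF. CR LF is listed first
    // so that the pair is matched (and dropped) as a single newline.
    private static final String NL = "(?:\\r\\n|\\r|\\n)";

    // Exactly one newline directly after '>' or directly before '<'.
    private static final Pattern IGNORED =
        Pattern.compile("(?<=>)" + NL + "|" + NL + "(?=<)");

    /** Removes the newlines that the content-mode rule ignores. */
    static String stripIgnoredNewlines(String content) {
        return IGNORED.matcher(content).replaceAll("");
    }

    public static void main(String[] args) {
        String oneLine = "before <dada>some text here</dada> after";
        String multiLine = "before \n<dada>\nsome text here\n</dada>\n after";
        // Both spellings of the example reduce to the same text.
        System.out.println(stripIgnoredNewlines(multiLine).equals(oneLine));
        // Two newlines before markup leave one real newline behind.
        System.out.println(stripIgnoredNewlines("a\n\n<b>").equals("a\n<b>"));
    }
}
```

 <p>Because only a single newline is dropped on each side of the
 markup, doubling the newline is enough to keep one.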
 <p>Hexadecimal numeric entities don't have to have exactly four
 digits in this implementation.
 <p>Comments may not contain occurrences of `--' (two dashes)
 according to the XML draft. This tokenizer doesn't check for that;
 it only ends a comment at `--&gt;'. (It would be easy enough to
 add, though.)
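 <p>As a rough sketch of what such a check could look like (the class
 and method names are hypothetical and do not come from this package),
 the following finds the end of a comment at `--&gt;' the way the text
 above describes, and separately reports whether the text before the
 terminator contains `--':

```java
final class CommentScan {
    /**
     * Returns the index just past the closing "-->", or -1 if there is none.
     * 'start' is the index of the first character after the opening "<!--".
     */
    static int endOfComment(String input, int start) {
        int close = input.indexOf("-->", start);
        return close < 0 ? -1 : close + 3;
    }

    /**
     * The extra check: true if the text between 'start' and the closing
     * "-->" contains "--". A body that ends in a single '-' (as in "--->")
     * is not flagged by this simple version.
     */
    static boolean hasDoubleDash(String input, int start) {
        int close = input.indexOf("-->", start);
        if (close < 0) {
            return false; // unterminated comment: nothing to check here
        }
        return input.substring(start, close).contains("--");
    }

    public static void main(String[] args) {
        String bad = "<!-- a -- b -->rest";
        String good = "<!-- fine -->rest";
        System.out.println(bad.substring(endOfComment(bad, 4)));
        System.out.println(hasDoubleDash(bad, 4));
        System.out.println(hasDoubleDash(good, 4));
    }
}
```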
 <p>Processing instructions, other than the predefined XML ones, are
 not tokenized, but returned as a single string. It is left to the
 application to look for the `target' (the first word in the
 string).
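 <p>An application might separate the target from the rest of the
 string along these lines. This is a minimal sketch: it assumes the
 string holds only the text of the processing instruction, without its
 delimiters, and the class name, method name and sample text are
 invented for the illustration.

```java
final class PiTarget {
    /**
     * Splits the text of a processing instruction into { target, rest }.
     * The target is the first whitespace-delimited word; rest is the
     * remaining text, trimmed, or "" if there is none.
     */
    static String[] split(String pi) {
        String s = pi.trim();
        int i = 0;
        while (i < s.length() && !Character.isWhitespace(s.charAt(i))) {
            i++;
        }
        return new String[] { s.substring(0, i), s.substring(i).trim() };
    }

    public static void main(String[] args) {
        String[] parts = split("myapp do-something now");
        System.out.println(parts[0]); // myapp
        System.out.println(parts[1]); // do-something now
    }
}
```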
 <p>Structured attributes are supported. They look like
 <pre>
 &lt;*attribname&gt;... attribute content... &lt;&gt;
 </pre>
 <p>As an experiment, a + is allowed instead of the * as well
 (to see if it looks better). Similarly, end-tags
 may use : instead of /.
 <p>Three types of short end-tags are supported: <code>&lt;/&gt;</code>,
 <code>&lt;:&gt;</code> and <code>&lt;&gt;</code>.
<p>
<dl>
  <dt> <b>Version:</b>
  <dd> $Id: w3c.xmlOnline.parser.XMLTokenizer.html,v 1.7 1997/06/09 22:25:28 bbos Exp $
  <dt> <b>Author:</b>
  <dd> Bert Bos
</dl>
<hr>
<a name="index"></a>
<h2>
  <img src="images/constructor-index.gif" width=275 height=38 alt="Constructor Index">
</h2>
<dl>
  <dt> <img src="images/yellow-ball-small.gif" width=6 height=6 alt=" o ">
	<a href="#XMLTokenizer(java.io.InputStream)"><b>XMLTokenizer</b></a>(InputStream)
  <dd>  Constructor: creates a tokenizer, given a raw byte stream.
  <dt> <img src="images/yellow-ball-small.gif" width=6 height=6 alt=" o ">
	<a href="#XMLTokenizer(java.io.InputStream, java.lang.String)"><b>XMLTokenizer</b></a>(InputStream, String)
  <dd>  Constructor: creates a tokenizer, given a raw byte stream and its encoding.
</dl>
<h2>
  <img src="images/method-index.gif" width=207 height=38 alt="Method Index">
</h2>
<dl>
  <dt> <img src="images/red-ball-small.gif" width=6 height=6 alt=" o ">
	<a href="#getColno()"><b>getColno</b></a>()
  <dd>  Return the current column number.

  <dt> <img src="images/red-ball-small.gif" width=6 height=6 alt=" o ">
	<a href="#getLineno()"><b>getLineno</b></a>()
  <dd>  Return the current line number.

  <dt> <img src="images/red-ball-small.gif" width=6 height=6 alt=" o ">
	<a href="#next()"><b>next</b></a>()
  <dd>  Return next token, storing any associated information
 in a String that is available by calling value().
  <dt> <img src="images/red-ball-small.gif" width=6 height=6 alt=" o ">
	<a href="#setEncoding(java.lang.String)"><b>setEncoding</b></a>(String)
  <dd>  Set the encoding of the input stream.
  <dt> <img src="images/red-ball-small.gif" width=6 height=6 alt=" o ">
	<a href="#value()"><b>value</b></a>()
  <dd>  Return the String associated with the current token.
</dl>
<a name="constructors"></a>
<h2>
  <img src="images/constructors.gif" width=231 height=38 alt="Constructors">
</h2>
<a name="XMLTokenizer"></a>
<a name="XMLTokenizer(java.io.InputStream)"><img src="images/yellow-ball.gif" width=12 height=12 alt=" o "></a>
<b>XMLTokenizer</b>
<pre>
 public XMLTokenizer(InputStream aStream) throws <a href="w3c.xmlOnline.parser.UnknownEncoding.html#_top_">UnknownEncoding</a>
</pre>
<dl>
  <dd> Constructor: creates a tokenizer, given a raw byte stream.
 The stream is assumed to be in UTF8.
<p>
  <dd><dl>
    <dt> <b>Parameters:</b>
    <dd> aStream - raw byte stream
    <dt> <b>Throws:</b> <a href="w3c.xmlOnline.parser.UnknownEncoding.html#_top_">UnknownEncoding</a>
    <dd> if the encoding is neither UTF8 nor ISO8859-1
  </dl></dd>
</dl>
<a name="XMLTokenizer(java.io.InputStream, java.lang.String)"><img src="images/yellow-ball.gif" width=12 height=12 alt=" o "></a>
<b>XMLTokenizer</b>
<pre>
 public XMLTokenizer(InputStream aStream,
                     String encoding) throws <a href="w3c.xmlOnline.parser.UnknownEncoding.html#_top_">UnknownEncoding</a>
</pre>
<dl>
  <dd> Constructor: creates a tokenizer, given a raw byte stream.
 The stream is assumed to have the given encoding, which
 can be "UTF8", "ISO8859-1",... (case-insensitive).
<p>
  <dd><dl>
    <dt> <b>Parameters:</b>
    <dd> aStream - raw byte stream
    <dd> encoding - the "charset" of the stream
    <dt> <b>Throws:</b> <a href="w3c.xmlOnline.parser.UnknownEncoding.html#_top_">UnknownEncoding</a>
    <dd> if the encoding is neither UTF8 nor ISO8859-1
  </dl></dd>
</dl>
<a name="methods"></a>
<h2>
  <img src="images/methods.gif" width=151 height=38 alt="Methods">
</h2>
<a name="setEncoding(java.lang.String)"><img src="images/red-ball.gif" width=12 height=12 alt=" o "></a>
<a name="setEncoding"><b>setEncoding</b></a>
<pre>
 public void setEncoding(String encoding) throws <a href="w3c.xmlOnline.parser.UnknownEncoding.html#_top_">UnknownEncoding</a>
</pre>
<dl>
  <dd> Set the encoding of the input stream. This changes the way the
 bytes are interpreted. The encoding may be changed in the
 middle of parsing, but a few bytes may already have been
 read into a buffer and will not be interpreted again.
 Changing among charsets in which XML delimiters are
 encoded the same should normally work (UTF8, ISO8859-*).
 Currently limited to UTF8 and ISO8859-1.
<p>
  <dd><dl>
    <dt> <b>Parameters:</b>
    <dd> encoding - a string like "utf8", "iso8859-1", etc.
    <dt> <b>Throws:</b> <a href="w3c.xmlOnline.parser.UnknownEncoding.html#_top_">UnknownEncoding</a>
    <dd> if the encoding is neither UTF8 nor ISO8859-1
  </dl></dd>
</dl>
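<p>The remark about delimiters being encoded the same can be checked
 directly. The sketch below is not part of this package: the class
 name, the stand-in exception and the exact set of accepted spellings
 are assumptions for the illustration. It resolves an encoding name
 case-insensitively and compares the bytes that UTF8 and ISO8859-1
 produce for the same text.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Locale;

final class EncodingSketch {
    /** Hypothetical stand-in for the documented UnknownEncoding exception. */
    static final class UnknownEncodingSketch extends Exception {
        UnknownEncodingSketch(String name) {
            super("unknown encoding: " + name);
        }
    }

    /** Case-insensitive lookup, limited to the two documented encodings. */
    static Charset lookup(String name) throws UnknownEncodingSketch {
        switch (name.toUpperCase(Locale.ROOT)) {
            case "UTF8":
                return StandardCharsets.UTF_8;
            case "ISO8859-1":
                return StandardCharsets.ISO_8859_1;
            default:
                throw new UnknownEncodingSketch(name);
        }
    }

    /** True if the text has the same byte sequence in both encodings. */
    static boolean sameBytes(String text) {
        return Arrays.equals(text.getBytes(StandardCharsets.UTF_8),
                             text.getBytes(StandardCharsets.ISO_8859_1));
    }

    public static void main(String[] args) throws UnknownEncodingSketch {
        System.out.println(lookup("utf8"));
        System.out.println(sameBytes("<>&/\"'")); // delimiters: identical bytes
        System.out.println(sameBytes("\u00e9"));  // e-acute: bytes differ
    }
}
```

<p>Since the delimiters are plain ASCII, bytes that were already
 buffered as markup mean the same thing under either encoding; only
 characters outside ASCII are affected by a switch.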
<a name="getLineno()"><img src="images/red-ball.gif" width=12 height=12 alt=" o "></a>
<a name="getLineno"><b>getLineno</b></a>
<pre>
 public int getLineno()
</pre>
<dl>
  <dd> Return the current line number.
<p>
</dl>
<a name="getColno()"><img src="images/red-ball.gif" width=12 height=12 alt=" o "></a>
<a name="getColno"><b>getColno</b></a>
<pre>
 public int getColno()
</pre>
<dl>
  <dd> Return the current column number.
<p>
</dl>
<a name="next()"><img src="images/red-ball.gif" width=12 height=12 alt=" o "></a>
<a name="next"><b>next</b></a>
<pre>
 public int next() throws IOException
</pre>
<dl>
  <dd> Return next token, storing any associated information
 in a String that is available by calling value(). The
 string is only available until the next call to next().
 Not all tokens store information in value().
<p>
  <dd><dl>
    <dt> <b>Returns:</b>
    <dd> the recognized token (an int)
  </dl></dd>
</dl>
<a name="value()"><img src="images/red-ball.gif" width=12 height=12 alt=" o "></a>
<a name="value"><b>value</b></a>
<pre>
 public String value()
</pre>
<dl>
  <dd> Return the String associated with the current token.
 Not all tokens set the value to something meaningful.
<p>
  <dd><dl>
    <dt> <b>Returns:</b>
    <dd> the contents of the current token
  </dl></dd>
</dl>
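<p>The way next() and value() work together suggests a simple pull
 loop. The sketch below does not use XMLTokenizer itself, so that it
 can run on its own: TokenSource is a hypothetical interface that
 mirrors only the two documented methods, WordSource is a toy
 stand-in that splits a byte stream into words, and the token codes
 EOF and WORD are invented, since the real codes are defined elsewhere
 in this package.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class TokenLoopSketch {
    /** Hypothetical interface with the two documented methods. */
    interface TokenSource {
        int next() throws IOException;
        String value();
    }

    static final int EOF = 0;  // assumed end-of-input code
    static final int WORD = 1; // assumed token code for the toy source

    /** Toy stand-in: one token per whitespace-separated ASCII word. */
    static final class WordSource implements TokenSource {
        private final InputStream in;
        private String value;

        WordSource(InputStream in) {
            this.in = in;
        }

        public int next() throws IOException {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = in.read()) != -1) {
                if (Character.isWhitespace(c)) {
                    if (sb.length() > 0) {
                        break; // white space ends the current word
                    }
                } else {
                    sb.append((char) c);
                }
            }
            if (sb.length() == 0) {
                value = null;
                return EOF;
            }
            value = sb.toString();
            return WORD;
        }

        public String value() {
            return value;
        }
    }

    /** Pulls tokens until EOF, reading value() before the next call to next(). */
    static List<String> collect(TokenSource src) throws IOException {
        List<String> out = new ArrayList<>();
        for (int tok = src.next(); tok != EOF; tok = src.next()) {
            out.add(src.value());
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = "one two  three".getBytes(StandardCharsets.US_ASCII);
        System.out.println(collect(new WordSource(new ByteArrayInputStream(bytes))));
    }
}
```

<p>The point of the loop is the ordering: the value is read right
 after the token is returned, because the next call to next() may
 replace it.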
</body>
</html>
