WD-DOM-Level-2-19990719


6. Document Object Model Filters and Iterators

Editors
Mike Champion, Software AG
Joe Kesselman, IBM
Jonathan Robie, Software AG

Table of contents


6.1. Overview of the DOM Level 2 Iterator, Filter, and TreeWalker Interfaces

The DOM Level 2 Iterator, Filter, and TreeWalker interfaces extend the functionality of the DOM to allow simple and efficient traversal of document subtrees, node lists, or the results of queries.

This proposal contains Iterator, Filter, and TreeWalker interfaces, but no query interfaces. In the future, it is likely that a separate specification will be prepared for query interfaces, which may be query-language independent.

6.1.1. Iterators

An iterator allows the nodes of a data structure to be returned sequentially. When an iterator is first created, calling nextNode() returns the first node. When no more nodes are present, nextNode() returns a null. Since DOM structures may change as a document is loaded, if nextNode() finds no more nodes, it is still quite possible that further nodes may be added in the next instant. An iterator may be active while the data structure it navigates is being edited, so an iterator must behave gracefully in the face of change. Additions and deletions in the underlying data structure do not invalidate an iterator; in fact, an iterator is never invalidated.

Using ordered set semantics, the position of the iterator is determined by the relative position in the ordered set. There is no current node. When an iterator is created for a list, the position is set before the first element:

 ->A B C D E F G H I

Each call to next() returns a node and advances the position. For instance, if we start with the above position, the first call to next() returns "A" and advances the iterator:

 A ->B C D E F G H I

The relative position of the iterator remains valid when other nodes are added or deleted. For instance, if "A" is deleted, the position of the iterator is unchanged with respect to the remaining nodes:

->B C D E F G H I

Similarly, if the node ahead of the iterator is deleted or moved, the iterator "slides forward". Therefore, if "B" is deleted, the position remains unchanged with respect to the remaining nodes:

->C D E F G H I

For the same reason, moving the "C" node to the end of the set does not change the current position with respect to the remaining nodes:

->D E F G H I C

When nodes are added as children of the node to the left of the iterator, there is some difference of opinion as to what constitutes the most consistent behavior. Suppose the iterator is advanced past "D" using next(), then a new node is added as a child of "D" in the original tree. Since children of a node occur after the node in document order, one approach is to have the new child appear after the node, but before the current position of the iterator:

D a ->E F G H I C

The newly inserted node "a" occurs before the iterator, so it will not be encountered when the iterator is moved forward. This is convenient when an iterator is being used to add nodes to the tree, since the programmer does not need to skip over the newly inserted nodes. In this case, if the iterator were moved backward, this new node would be the first one encountered. A second consistent approach is to say that nodes added as children of the node to the left of the iterator appear after the current position of the iterator:

D ->a E F G H I C

Using this approach, the newly added nodes are encountered as the iterator moves forward. We believe either approach is justifiable, and we have not decided which of the two approaches is best.

Note that the relative position of the iterator is not the same as the absolute position within the set. The position of the iterator is relative to the node before it and the node after it, which is why the position floats gracefully when nodes are deleted or inserted before or after the position of the iterator.

Iterators are created using the createNodeIterator method found in Document. When an iterator is created, flags can be used to determine which nodes will be "visible" and which nodes will be "invisible" while traversing the tree. Nodes that are "invisible" are skipped over by the iterator as though they did not exist. These flags can be combined using OR:


NodeIterator iter=document.createNodeIterator(root, SHOW_ELEMENT | SHOW_PROCESSING_INSTRUCTION | SHOW_COMMENT | SHOW_ENTITY_REFERENCE, NULL);
                        

6.1.2. Filters

Filters allow the user to "filter out" nodes. Each filter contains a user-written function that looks at a node and determines whether or not it should be filtered out. To use a filter, you create an iterator that uses the filter. The iterator applies the filter to each node, and if the filter rejects the node, the iterator skips over the node as though it were not present in the document. Filters are easy to write, since they need not know how to navigate the structure on which they operate, and they can be reused for different kinds of iterators that operate on different data structures.

Consider a filter that finds the named anchors in an HTML document. In HTML, an HREF can refer to any <A> element that has a NAME attribute. Here is a filter that looks at a node and determines whether it is a named anchor:


class NamedAnchorFilter implements NodeFilter
{
    short acceptNode(Node n) {
        if (n instanceof Element) {
            Element e = n;
            if (e.getNodeName() != "A")
                return FILTER_SKIP;
            if (e.getNodeNameAttributeNode("NAME") != NULL) {
               return FILTER_ACCEPT;
            }
        }
        return FILTER_SKIP;
    }
}
          

To use this filter, the user would create an instance of the filter and create an iterator using it:


NamedAnchorFilter myFilter;
NodeIterator iter=document.creatNodeIterator(node, SHOW_ELEMENT, myFilter);

If SHOW_ENTITY_REFERENCE is not set, entities are expanded. If SHOW_ENTITY_REFERENCE is set, entity references will be encountered by the iterator. There is no setting that shows both the entity reference and its expansion.

6.1.3. TreeWalker

The TreeWalker interface provides many of the same benefits as the Iterator interface. The main difference between these two interfaces is that the TreeWalker presents a tree-oriented view of the nodes in a subtree, and an Iterator presents a list-oriented view. In other words, an Iterator allows you to move forward or back, but a TreeWalker allows you to move to the parent of a node, to one of its children, or to a sibling.

Using a TreeWalker is quite similar to navigation using the Node directly, and the navigation methods for the two interfaces are analogous. For instance, here is a function that processes the nodes of a subtree in document order using the Node navigation methods:

processMe(Node n) {

   doSomething(n);

   if (n.firstChild != null) {
     processMe(n.firstChild);
   }

   if (n.nextSibling != null) {
     processMe(n.nextSibling);
   }
}

Here is the code to do the same thing using a TreeWalker:

processMe(TreeWalker tw) {

   doSomething(tw.current());

   if (tw.firstChild() != null) {
     processMe(tw);
   }

   if (tw.nextSibling() != null) {
     processMe(tw);
   }
     
   tw.parent();
}

The main difference between these two functions is that the TreeWalker version must take into account the fact that changing the internal position of the TreeWalker will also affect any calling function that continues to use the TreeWalker. Therefore, a function that uses a TreeWalker should be careful about the position after the function is finished.

The advantage of using a TreeWalker instead of direct Node navigation is that the TreeWalker allows the user to choose an appropriate view of the tree. Flags may be used to show or hide comments or processing instructions, entities may be expanded or left as entity references, and sequences of text nodes may be merged into a single virtual text node. In addition, Filters may be used to present a custom view of the tree. Suppose a program needs a view of a document that shows which tables occur in each chapter, listed by chapter. In this view, only the chapter elements and the tables that they contain are seen. The first step is to write an appropriate filter:

class TablesInChapters implements NodeFilter {

      short acceptNode(Node n) {
              if (n instanceof Element) {
                      Element e = n;
                      
                      if (e.nodeName == "CHAPTER")
                              return FILTER_ACCEPT;

                      if (e.nodeName == "TABLE")
                              return FILTER_ACCEPT;

                      if (e.nodeName == "SECT1"
                       || e.nodeName == "SECT2"
                       || e.nodeName == "SECT3"
                       || e.nodeName == "SECT4"
                       || e.nodeName == "SECT5"
                       || e.nodeName == "SECT6"
                       || e.nodeName == "SECT7")
                              return FILTER_SKIP;

              }

              return FILTER_REJECT;
      }
}

Now the program can create an instance of this Filter, create a TreeWalker that uses it, and pass this TreeWalker to our ProcessMe() function:


TablesInChapters tablesInChapters;
TreeWalker tw(root, SHOW_ELEMENT, TablesInChapters);
ProcessMe(tw);

Without making any changes to the above ProcessMe() function, it now processes only the <CHAPTER> and <TABLE> elements. The programmer can write other filters or set other flags to choose different sets of nodes; if functions use TreeWalker to navigate, they will support any view of the document defined with a TreeWalker.

6.2. Formal Interface Definition

Interface NodeIterator

NodeIterators are used to step through a set of nodes, e.g. the set of nodes in a NodeList, the document subtree governed by a particular node, the results of a query, or any other set of nodes. The set of nodes to be iterated is determined by the factory that creates the iterator.

Any iterator that returns nodes may implement the NodeIterator interface. Users and vendor libraries may also choose to create iterators that implement the NodeIterator interface.

IDL Definition

interface NodeIterator {
  readonly attribute long             whatToShow;
  // Constants for whatToShow
  const unsigned long       SHOW_ALL                       = 0xFFFF;
  const unsigned long       SHOW_ELEMENT                   = 0x00000001;
  const unsigned long       SHOW_ATTRIBUTE                 = 0x00000002;
  const unsigned long       SHOW_TEXT                      = 0x00000004;
  const unsigned long       SHOW_CDATA_SECTION             = 0x00000008;
  const unsigned long       SHOW_ENTITY_REFERENCE          = 0x00000010;
  const unsigned long       SHOW_ENTITY                    = 0x00000020;
  const unsigned long       SHOW_PROCESSING_INSTRUCTION    = 0x00000040;
  const unsigned long       SHOW_COMMENT                   = 0x00000080;
  const unsigned long       SHOW_DOCUMENT                  = 0x00000100;
  const unsigned long       SHOW_DOCUMENT_TYPE             = 0x00000200;
  const unsigned long       SHOW_DOCUMENT_FRAGMENT         = 0x00000400;
  const unsigned long       SHOW_NOTATION                  = 0x00000800;

  readonly attribute NodeFilter       filter;
  Node               nextNode();
  Node               previousNode();
};

Attributes
whatToShow
This attribute determines whether entities are expanded, and whether comments, processing instructions, or text are presented via the iterator.
Definition group Constants for whatToShow

These are the available values for the whatToShow parameter. They are the same as the set of possible types for Node, and their values are derived by using a bit position corresponding to the value of NodeType for the equivalent node type.

Defined Constants
SHOW_ALL

Show all nodes.

SHOW_ELEMENT

Show element nodes.

SHOW_ATTRIBUTE

Show attribute nodes. This is meaningful only when creating an iterator with an attribute node as its root; in this case, it means that the attribute node will appear in the first position of the iteration. Since attributes are not part of the document tree, they do not appear when iterating over the document tree.

SHOW_TEXT

Show text nodes.

SHOW_CDATA_SECTION

Show CDATASection nodes.

SHOW_ENTITY_REFERENCE

Show Entity Reference nodes.

SHOW_ENTITY

Show Entity nodes. This currently has no effect.

SHOW_PROCESSING_INSTRUCTION

Show ProcessingInstruction nodes.

SHOW_COMMENT

Show Comment nodes.

SHOW_DOCUMENT

Show Document nodes.

SHOW_DOCUMENT_TYPE

Show DocumentType nodes.

SHOW_DOCUMENT_FRAGMENT

Show DocumentFragment nodes.

SHOW_NOTATION

Show Notation nodes. This currently has no effect.

filter
The filter used to screen nodes.
Methods
nextNode
Returns the next node in the set and advances the position of the iterator in the set. After a NodeIterator is created, the first call to nextNode() returns the first node in the set.
Return Value
The next Node in the set being iterated over, or NULL if there are no more members in that set.

This method has no parameters.
This method raises no exceptions.
previousNode
Returns the previous node in the set and moves the position of the iterator backwards in the set.
Return Value
The previous Node in the set being iterated over, or NULL if there are no more members in that set.

This method has no parameters.
This method raises no exceptions.
Interface NodeFilter

Filters are simply objects that know how to "filter out" nodes. If an iterator is given a filter, before it returns the next node, it applies the filter. If the filter says to accept the node, the iterator returns it; otherwise, the iterator looks for the next node and pretends that the node that was rejected was not there.

The DOM does not provide any filters. Filter is just an interface that users can implement to provide their own filters. The introduction to this chapter gives an example of how a user can implement a filter to perform a specific function.

Filters do not need to know how to iterate, nor do they need to know anything about the data structure that is being iterated. This makes it very easy to write filters, since the only thing they have to know how to do is evaluate a single node. One filter may be used with a number of different kinds of iterators, encouraging code reuse.

If a filter is installed for a TreeWalker or Iterator, the system may use that filter for various tasks, especially during fix-up. Filters should make no assumptions about how frequently they will be called.

IDL Definition

interface NodeFilter {
  // Constants returned by acceptNode
  const short               FILTER_ACCEPT                  = 1;
  const short               FILTER_REJECT                  = 2;
  const short               FILTER_SKIP                    = 3;

  short              acceptNode(in Node n);
};

Definition group Constants returned by acceptNode

The following constants are returned by the acceptNode() method:

Defined Constants
FILTER_ACCEPT

Accept the node. Navigation methods defined for Iterator or TreeWalker will return this node.

FILTER_REJECT

Reject the node. Navigation methods defined for Iterator or TreeWalker will not return this node. For TreeWalker, the children of this node will also be rejected. Iterators treat this as a synonym for FILTER_SKIP.

FILTER_SKIP

Reject the node. Navigation methods defined for Iterator or TreeWalker will not return this node. For both Iterator and Treewalker, the children of this node will still be considered.

Methods
acceptNode
Parameters
n

The node to check to see if it passes the filter or not.

Return Value
Returns a constant to determine whether the node is accepted, rejected, or skipped, as defined above. Note: If an exception is thrown in this method, the results are unspecified.

This method raises no exceptions.
Interface TreeWalker

TreeWalkers are used to navigate a document tree or subtree using the view of the document defined by its whatToShow flags and any filters that are defined for the TreeWalker. Any function which performs navigation using a TreeWalker will automatically support any view defined by a TreeWalker.

IDL Definition

interface TreeWalker {
  readonly attribute long             whatToShow;
  // Constants for whatToShow
  const unsigned long       SHOW_ALL                       = 0xFFFF;
  const unsigned long       SHOW_ELEMENT                   = 0x00000001;
  const unsigned long       SHOW_ATTRIBUTE                 = 0x00000002;
  const unsigned long       SHOW_TEXT                      = 0x00000004;
  const unsigned long       SHOW_CDATA_SECTION             = 0x00000008;
  const unsigned long       SHOW_ENTITY_REFERENCE          = 0x00000010;
  const unsigned long       SHOW_ENTITY                    = 0x00000020;
  const unsigned long       SHOW_PROCESSING_INSTRUCTION    = 0x00000040;
  const unsigned long       SHOW_COMMENT                   = 0x00000080;
  const unsigned long       SHOW_DOCUMENT                  = 0x00000100;
  const unsigned long       SHOW_DOCUMENT_TYPE             = 0x00000200;
  const unsigned long       SHOW_DOCUMENT_FRAGMENT         = 0x00000400;
  const unsigned long       SHOW_NOTATION                  = 0x00000800;

  readonly attribute NodeFilter       filter;
  Node               current();
  Node               parentNode();
  Node               firstChild();
  Node               lastChild();
  Node               previousSibling();
  Node               nextSibling();
};

Attributes
whatToShow
This attribute determines whether entities are expanded, and whether comments, processing instructions, or text are presented via the iterator.
Definition group Constants for whatToShow

These are the available values for the whatToShow parameter. They are the same as the set of possible types for Node, and their values are derived by using a bit position corresponding to the value of NodeType for the equivalent node type.

Defined Constants
SHOW_ALL

Show all nodes.

SHOW_ELEMENT

Show element nodes.

SHOW_ATTRIBUTE

Show attribute nodes. This is meaningful only when creating an TreeWalker with an attribute node as its root; in this case, it means that the attribute node will appear in the first position of the iteration. Since attributes are not part of the document tree, they do not appear when iterating over the document tree.

SHOW_TEXT

Show text nodes.

SHOW_CDATA_SECTION

Show CDATASection nodes.

SHOW_ENTITY_REFERENCE

Show Entity Reference nodes.

SHOW_ENTITY

Show Entity nodes. This currently has no effect.

SHOW_PROCESSING_INSTRUCTION

Show ProcessingInstruction nodes.

SHOW_COMMENT

Show Comment nodes.

SHOW_DOCUMENT

Show Document nodes.

SHOW_DOCUMENT_TYPE

Show DocumentType nodes.

SHOW_DOCUMENT_FRAGMENT

Show DocumentFragment nodes.

SHOW_NOTATION

Show Notation nodes. This currently has no effect.

filter
The filter used to screen nodes.
Methods
current
Returns the current node without changing position.
Return Value
The current node.

This method has no parameters.
This method raises no exceptions.
parentNode
Moves to the parent node. This method will never position beyond the root of the subtree for which the TreeWalker was created.
Return Value
The new node. If the current node is the root of the subtree for which the TreeWalker was created, returns null, and retains the current node.

This method has no parameters.
This method raises no exceptions.
firstChild
Moves the TreeWalker to the first child of the current node, and returns the new node. If the current node has no children, returns null, and retains the current node.
Return Value
The new node, or null if the current node has no children.

This method has no parameters.
This method raises no exceptions.
lastChild
Moves the TreeWalker to the last child of the current node, and returns the new node. If the current node has no children, returns null, and retains the current node.
Return Value
The new node, or null if the current node has no children.

This method has no parameters.
This method raises no exceptions.
previousSibling
Moves the TreeWalker to the previous sibling of the current node, and returns the new node. If the current node has no previous sibling, returns null, and retains the current node.
Return Value
The new node, or null if the current node has no previous sibling.

This method has no parameters.
This method raises no exceptions.
nextSibling
Moves the TreeWalker to the next sibling of the current node, and returns the new node. If the current node has no next sibling, returns null, and retains the current node.
Return Value
The new node, or null if the current node has no next sibling.

This method has no parameters.
This method raises no exceptions.
Interface DocumentIF

Document contains methods that creates iterators to traverse a node and its children in document order (depth first, pre-order traversal, which is equivalent to the order in which the start tags occur in the text representation of the document).

IDL Definition

interface DocumentIF {
  short              createNodeIterator(in Node root, 
                                        in short whatToShow, 
                                        in NodeFilter filter);
};

Methods
createNodeIterator
Parameters
root

The node which will be iterated together with its children.

whatToShow

This flag determines whether entities are expanded, and whether comments, processing instructions, or text are presented via the iterator. See the description of Iterator for the set of possible values.

These flags can be combined using OR:


NodeIterator iter=doc.createNodeIterator(root, SHOW_ELEMENT | SHOW_PROCESSING_INSTRUCTION | SHOW_COMMENT | SHOW_ENTITY_REFERENCE, myFilter);

If SHOW_ENTITY_REFERENCE is not set, entities are expanded. If SHOW_ENTITY_REFERENCE is set, entity references will be encountered by the iterator. There is no setting that shows both the entity reference and its expansion.

(ED: Several people have suggested that the functionality of whatToShow be implemented using filters. We feel that it is better to implement them using iterators, since it makes it possible to provide a more efficient implementation. A filter must examine each node individually; an iterator can make use of internal data structures to examine only those nodes that are desired.)
filter
Return Value
The newly created NodeIterator.

This method raises no exceptions.