
Team Comment on the Data Extraction Language Submission

W3C is pleased to receive the Data Extraction Language (DEL) Submission from Republica Corp.

The Data Extraction Language is an XML vocabulary for describing a set of rules to transform structured text data to XML. DEL defines markup to extract relevant information from a text file and to construct an XML file.

There are many languages and systems designed to assist in extracting information from text, including Perl (the "e" in the original acronym stood for "extraction") and the wide field of lexical scanners and parser generators (such as lex and yacc). Also related is XSLT, an XML transformation language which can, to a certain extent, parse text files to generate XML output. Although XSLT 1.0 does not define functions to parse text using regular expressions, it allows the definition of extensions to do so. A few implementations make this mechanism available, making XSLT an appropriate standard to compare DEL to. XSLT 2.0 is expected to support regular expressions through XPath 2.0 (see item 3 in the XPath 2.0 requirements).

Not all of DEL's features are covered by XSLT: examples are the ability to generate CDATA sections and the ability to control how the result XML file is output with the "Document Ready" function. On the other hand, XSLT provides many useful functions, such as sorting and numbering, that DEL lacks. One XSLT feature that DEL could have borrowed, and that would have made the language simpler, is the way the result tree is built. While XSLT uses namespaces to allow the result tree to be instantiated directly, DEL takes the more complicated route of using constructor elements (<map>) together with a 'cursor' that navigates through the output tree and adds new XML constructs.

It is unfortunate that the text of the submission leaves many questions unanswered. Examples are:

Submitter's Reply to the Team Comment

Below I will try to provide brief answers to the questions posed in the W3C Team comment.

* Where exactly do 'over' and 'upto' start or end in the input stream?
  Both 'over' and 'upto' start at the current position (the 'cursor' or 'offset') in the input stream. 'over' causes the DEL processor to advance through the input stream until a pattern/regexp is fully matched; that is, the cursor is set immediately after the last character matched by the pattern/regexp. 'upto' tells the DEL processor to advance the cursor and stop at the first character of the pattern/regexp match. That first character is therefore included in the search range of the next <extract> command(s).
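
The difference between the two cursor positions can be illustrated with a minimal Python sketch. It is not DEL syntax: the `advance` function and its `mode` argument are hypothetical names, and Python's `re` dialect stands in for whatever regexp dialect a DEL processor actually uses.

```python
import re


def advance(data: str, cursor: int, pattern: str, mode: str) -> int:
    """Return the new cursor after searching for `pattern` from `cursor`.

    Hypothetical helper: 'over' stops just past the match,
    'upto' stops on the first character of the match.
    """
    match = re.compile(pattern).search(data, cursor)
    if match is None:
        raise LookupError(f"no match for {pattern!r} from offset {cursor}")
    if mode == "over":
        return match.end()    # immediately after the last matched character
    if mode == "upto":
        return match.start()  # on the first matched character
    raise ValueError(f"unknown mode: {mode!r}")


data = "name: Alice; age: 30"
after_over = advance(data, 0, r"name: ", "over")
at_upto = advance(data, after_over, r";", "upto")

print(data[after_over:at_upto])  # Alice
print(data[at_upto])             # ;
```

After the 'upto' step the cursor still points at the ";", so a following search would see that character again, which matches the behaviour described in the answer.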

* What is a dataStreamError?
  dataStreamError is a status of the DEL processor that is set only by the <extract> command. It means that the processor could not find a match for the pattern/regexp given in an <extract> command and is therefore unable to continue execution. The dataStreamError status causes the DEL processor to break out of the nearest parent <repeat> loop, that is, to continue with the next command on the same nesting level. If the <extract> command that set the status is not a direct child of a <repeat> command, the processor steps up through the parent elements until it encounters a <repeat> element to break out of. If there is no <repeat> command among the parent elements, the entire DEL script fails with dataStreamError.
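
One way to picture this control flow is as an exception that propagates up to the nearest enclosing loop. The Python sketch below assumes that model; `DataStreamError`, `extract`, `repeat`, and `body` are illustrative names and not part of DEL, and the cursor handling on failure is a simplification.

```python
import re


class DataStreamError(Exception):
    """Hypothetical stand-in for the dataStreamError status."""


def extract(data, cursor, pattern):
    """Return (matched text, new cursor); raise DataStreamError if no match."""
    match = re.compile(pattern).search(data, cursor)
    if match is None:
        raise DataStreamError(pattern)
    return match.group(0), match.end()


def repeat(body, data, cursor):
    """Run `body` until an extract inside it fails, then leave the loop."""
    results = []
    while True:
        try:
            value, cursor = body(data, cursor)
        except DataStreamError:
            break  # break out of the nearest enclosing repeat
        results.append(value)
    return results, cursor


def body(data, cursor):
    # Two nested extract steps; a failure in either one ends the loop.
    _, cursor = extract(data, cursor, r"item=")
    return extract(data, cursor, r"\d+")


items, cursor = repeat(body, "item=1 item=22 item=333 end", 0)
print(items)  # ['1', '22', '333']
```

Calling `extract` outside any `repeat`, for example `extract("abc", 0, r"\d")`, lets the exception reach the top level, which corresponds to the whole script failing with dataStreamError.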

* What does the 'text' function compare exactly? In the absence of a data model, one would imagine that it is string values. But having all examples use numbers can be misleading: it could be assumed that "1.0" and "1" compare as equal. Moreover, the content of "value" attributes is not defined. The examples show that it can be either a number or a variable name, but that is not stated explicitly.
  We assume the 'text' function refers to the <test> command. The content of the Value1 and Value2 attributes depends on the TestType attribute:

TestType     Value1    Value2    Operation performed by DEL processor
equal        string    string    string comparison
unequal      string    string    string comparison
lesser       integer   integer   integer comparison
greater      integer   integer   integer comparison
re_equal     regexp    string    regexp matching
re_unequal   regexp    string    regexp matching
contains     regexp    string    regexp matching

In all cases listed above, the Value1 and Value2 attributes can contain a reference to a register, in which case the register name is replaced with the current content of the register.
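
The string and integer rows of the table can be sketched in Python as follows. This is only an illustration under stated assumptions: `resolve` and `run_test` are hypothetical names, a register reference is assumed to be a value that exactly equals a register name, and 'lesser' is assumed to mean Value1 < Value2 (the operand order is not given above).

```python
def resolve(value: str, registers: dict) -> str:
    # Assumption: a value that names a register is replaced by its content.
    return registers.get(value, value)


def run_test(test_type: str, value1: str, value2: str, registers: dict) -> bool:
    a = resolve(value1, registers)
    b = resolve(value2, registers)
    if test_type == "equal":
        return a == b            # string comparison
    if test_type == "unequal":
        return a != b            # string comparison
    if test_type == "lesser":
        return int(a) < int(b)   # integer comparison (operand order assumed)
    if test_type == "greater":
        return int(a) > int(b)   # integer comparison (operand order assumed)
    raise ValueError(f"unsupported test type: {test_type!r}")


registers = {"count": "9"}
print(run_test("equal", "1.0", "1", registers))      # False
print(run_test("lesser", "count", "10", registers))  # True
```

Under string comparison "1.0" and "1" are not equal, which addresses the ambiguity raised in the question.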

* What exactly does "re_equal" test?
  "re_equal" tests whether a string (a literal or the content of a register) completely matches a regexp. It is similar to the <extract> function's exptype "content" in the sense that it requires the regexp match to begin immediately at the start of the string.
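
As a rough analogy, the distinction between a complete match and a match anywhere in the string corresponds to `re.fullmatch` versus `re.search` in Python. The sketch below assumes that reading of "re_equal" and "contains"; the function names simply mirror the TestType values, and Python's regexp dialect is a stand-in.

```python
import re


def re_equal(pattern: str, text: str) -> bool:
    # The whole string must match the regexp.
    return re.fullmatch(pattern, text) is not None


def re_unequal(pattern: str, text: str) -> bool:
    return not re_equal(pattern, text)


def contains(pattern: str, text: str) -> bool:
    # A match anywhere in the string is enough.
    return re.search(pattern, text) is not None


print(re_equal(r"\d+", "2024"))      # True
print(re_equal(r"\d+", "2024-01"))   # False
print(contains(r"\d+", "year 2024")) # True
```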

Next Steps

This submission will be brought to the attention of the XSL Working Group, as its use cases and parsing mechanism could serve as a starting point for the definition of regular expression matching in XPath 2.0.

Disclaimer: Placing a Submission on a Working Group agenda does not imply endorsement by either the W3C Team or the participants of the Working Group, nor does it guarantee that the Working Group will agree to take any specific action on a Submission.


Max Froumentin, Team Contact for the XSL Working Group <mf@w3.org>