W3C

DEL - Data Extraction Language

W3C Note 31 October 2001

This version:
http://www.w3.org/TR/2001/NOTE-data-extraction-20011031
Latest version:
http://www.w3.org/TR/data-extraction
Previous version:
Editors:
Eero Lempinen, Republica Corp.
Harri Saarikoski, Republica Corp.

Abstract

This document outlines the Data Extraction Language. DEL is an XML format for describing data conversion processes from other data formats to XML. A DEL script specifies how to locate and extract fragments from input data and where to insert them in the resulting XML format. The DEL processor executing the DEL script can use the extracted data to either create a new XML document or modify an existing XML document by creating new elements and attributes at locations specified with XPath expressions.

Status of this document

This document was submitted to the World Wide Web Consortium on June 21 2001 (see Submission Request, W3C Staff Comment) with the intention that the W3C use it as a basis for furthering the work on any-to-XML transformations. For a full list of all acknowledged Submissions, please see Acknowledged Submissions to W3C.

This document is a NOTE made available by the W3C for discussion only. Publication of this Note by W3C indicates no endorsement by W3C or the W3C Team, or any W3C Members. W3C has had no editorial control over the preparation of this Note. This document is a work in progress and may be updated, replaced, or rendered obsolete by other documents at any time.

A list of current W3C technical documents can be found at the Technical Reports page.


Table of contents

1 INTRODUCTION
2 DEFINITION OF DATA EXTRACTION LANGUAGE
2.1 wrapper
2.2 template
2.3 repeat
2.4 map
2.5 extract
2.6 test
2.7 runtemplate
2.8 set
2.9 charsetmap (values)
2.10 stringconvert (register)
2.11 count
3 BASIC EXAMPLE OF DATA EXTRACTION LANGUAGE
3.1 input data
3.2 DEL script
3.3 output XML
4 ADVANCED EXAMPLE WITH CHARACTER SET MAPPING AND OUTPUT TEMPLATE
4.1 input data
4.2 DEL script
4.3 XML template
4.4 output XML
APPENDIX 1. DATA EXTRACTION LANGUAGE DTD

 

1 INTRODUCTION

This document outlines the Data Extraction Language. DEL is an XML [XML] format for describing data conversion processes from other data formats to XML. A DEL script specifies how to locate and extract fragments from input data and where to insert them in the resulting XML format.

The DEL processor executing the DEL script can use the extracted data to either create a new XML document or modify an existing XML document by creating new elements and attributes at locations specified with XPath [XPath] expressions. A DEL script along with the source data are given to the DEL processor which performs the actual data conversion according to the script. The output from the DEL processor is a well-formed XML document containing the desired parts of the source data.

Data fragments are located in the input data by searching for literal patterns or for matches of regular expressions [REGEX]. The extracted fragments are first stored temporarily in the DEL processor's registers (or stack) so that they can be refined before output and, if needed, re-used as search patterns. The data is then read from the registers or stack and placed in its proper position in the DOM [DOM] tree of the resulting XML document. While data is placed in the XML, a cursor keeps track of the current position. The cursor position can be changed using XPath expressions.

 

2 DEFINITION OF DATA EXTRACTION LANGUAGE

The following sections describe the use of Data Extraction Language elements, their attributes and attribute values.

NOTE: Attribute values marked below as parsed are inspected before use. If the value is "stack" or "regX" (where X is a user-defined register name of at least one character), the actual value is taken from memory (the stack or the named register). Otherwise, the actual value is the attribute value as given.
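
The lookup rule in the NOTE above can be sketched in a few lines of Python. This is only an illustration and not part of DEL: the function name resolve, the choice to pop the stack on read, and the empty string for an unset register are all assumptions, since this Note does not define them.

```python
def resolve(value, registers, stack):
    """Return the actual value for a parsed DEL attribute value."""
    if value == "stack":
        # Assumption: reading "stack" pops the most recently saved fragment.
        return stack.pop() if stack else ""
    if value.startswith("reg") and len(value) > 3:
        # Assumption: an unset register reads as an empty string.
        return registers.get(value, "")
    # Anything else is used literally, as given in the attribute.
    return value


registers = {"reg1": "4"}
stack = ["Testing"]

print(resolve("reg1", registers, stack))     # value held in register reg1
print(resolve("stack", registers, stack))    # fragment taken from the stack
print(resolve("success", registers, stack))  # literal value, unchanged
```

Under this rule any literal value that begins with "reg" followed by at least one character is read as a register name, so script authors would need to avoid such literals.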

2.1 wrapper

Function: wrapper element is the container (root element) for DEL script rules.

NOTE: When creating a new output XML, the first map element should include maptype attribute with value "createDocument".

Contains: First template and map elements and then optionally one of the following elements: repeat, map, extract, test, set or runtemplate.

Attributes: No attributes.

Syntax:

<wrapper>

</wrapper>

2.2 template

Function: template element is used as a container for common sequences of other elements. runtemplate element (see 2.7) is then used to run the content of the template element.

Contains: Optionally one of the following elements:

repeat, map, extract, test, set or runtemplate.

Attributes: name (REQUIRED)

Syntax:

<template name="attribute_value">

</template>

Example:

<template name="MakeDate">

</template>

2.3 repeat

Function: repeat element repeats its content, i.e. the elements located under it.

Contains: Optionally one of the following elements:

repeat, map, extract, test, set or runtemplate.

Attributes: times (REQUIRED)

Times attribute:

NOTE: Inside a repeat loop, if an extract does not find the given (regular) expression where expected (at the beginning of the data under processing), the DEL processor sets the "dataStreamError" status, which stops the repeat loop. If the "dataStreamError" status is set outside a repeat loop, the whole wrapping process stops.

Syntax:

<repeat times="attribute_value">

</repeat>

<repeat times="*">

</repeat>

2.4 map

Function: map element inserts content to the output as XML node(s).
It moves the cursor in the output XML to specify the insertion point for the node(s). It also keeps track of the current element.

Contains: No elements

Attributes: The attributes and their possible values are:

Syntax:

<map maptype="attribute_value" node="attribute_value" content="value"/>

Example:

<map maptype="createDocument" node="root"/>

2.5 extract

Function: extract element gets data from the source data.

Contains: Optionally one of the following elements:

repeat, map, extract, test, set or runtemplate.

Attributes: exptype (REQUIRED), expression (REQUIRED, will be parsed), save (OPTIONAL, will be parsed)

Exptype and expression attributes: exptype attribute tells the DEL processor which part of the data should be extracted and where it is saved.

expression attribute supplies the value that the chosen exptype operates on (for example, the string to search for).

Possible exptype attribute values are:

Save attribute: save attribute tells where the extractable data is saved.

Possible save values are:

Syntax:

<extract exptype="attribute_value" expression="attribute_value" save="attribute_value"/>

Example:

<extract exptype="over" expression="&lt;table&gt;"/>

Tip!

In case of large source data, try a multi-level extraction script. In such a script, extract element contains other DEL elements (which in turn can contain more). This allows you to 'chop up' the data into more manageable chunks.

Consider an XML source file with <table> as root element containing <tr> elements (table rows) and those containing <td> (cells):

<table>

  <tr>

    <td>1</td>

    <td>4</td>

  </tr>

  <tr>

    <td>5</td>

  </tr>

  <tr>

    <td>74</td>

    <td>99</td>

    <td>100</td>

  </tr>

</table>

Now look at the script below. The first (top level) extract element moves the cursor to a larger section of the source data (the section within "<tr>" tags). The second extract element then puts the cursor within the "<td>" tags located in that larger section. The third extract element extracts the content within the "<td>" tags and saves it to a register, from where it will later be placed in the result XML:

<extract exptype="over" expression="&lt;/tr&gt;">

  <extract exptype="over" expression="&lt;td&gt;"/>

  <extract exptype="upto" expression="&lt;/td&gt;" save="reg1"/>

</extract>

Such a multi-level script can also be reused and edited more effectively than one where all extract elements are on the same level.
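
The chopping idea in this tip can be illustrated outside DEL with a short Python sketch. It is not a DEL processor. It assumes that "over" moves past the first occurrence of the expression and that "upto" returns the text before it, which is how the description above reads. The helper names over, upto and extract_rows are hypothetical, and the loops stand in for repeat elements that the short script above does not contain.

```python
def over(data, pos, expr):
    """Position just past the first occurrence of expr at or after pos, or None."""
    i = data.find(expr, pos)
    return None if i < 0 else i + len(expr)


def upto(data, pos, expr):
    """(text before expr, position of expr) searching from pos, or None."""
    i = data.find(expr, pos)
    return None if i < 0 else (data[pos:i], i)


def extract_rows(source):
    """Split the data into <tr> sections first, then read the cells of each."""
    rows, pos = [], 0
    while True:
        end = over(source, pos, "</tr>")      # top level: one row section
        if end is None:
            break
        section = source[pos:end]
        cells, p = [], 0
        while True:
            start = over(section, p, "<td>")  # second level: step into a cell
            if start is None:
                break
            found = upto(section, start, "</td>")  # third level: cell content
            if found is None:
                break
            text, p = found
            cells.append(text)
        rows.append(cells)
        pos = end
    return rows


SOURCE = ("<table><tr><td>1</td><td>4</td></tr><tr><td>5</td></tr>"
          "<tr><td>74</td><td>99</td><td>100</td></tr></table>")
rows = extract_rows(SOURCE)
print(rows)  # [['1', '4'], ['5'], ['74', '99', '100']]
```

Because the inner searches only ever see one row section, a missing or malformed cell cannot make them run on into the next row.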

2.6 test

Function: test element compares two values. The content of the test element will be processed when the test result is "true".

Contains: Optionally one of the following elements:

repeat, map, extract, test, set or runtemplate.

Attributes: testtype (REQUIRED), value1 (REQUIRED, parsed), value2 (OPTIONAL, parsed)

Testtype, value1 and value2: testtype attribute indicates what kind of test is processed.

Possible testtype values are:

NOTE: The contents of the attributes value1 and value2 are parsed before comparing.

Syntax:

<test testtype="attribute_value" value1="attribute_value" value2="attribute_value"></test>

Example:

<test testtype="equal" value1="4" value2="reg1"></test>

2.7 runtemplate

Function: runtemplate element runs elements from a predefined template (for defining the template, see 2.2).

Contains: No elements.

Attributes: nameref (REQUIRED)

nameref attribute:

Syntax:

<runtemplate nameref="attribute_value"/>

Example:

<runtemplate nameref="MakeDate"/>

2.8 set

Function: set element gives instructions to the DEL processor for processing the data.

Contains: No elements.

Attributes:

parameter attribute values (value1 and value2):

NOTE: The place for storing the combination ("append") must be a register (e.g. "regCombo"). Other values can either be registers or strings.

Syntax:

<set parameter="attribute_value" value1="attribute_value" value2="attribute_value"/>

Example:

<set parameter="memory" value1="stack" value2="Testing"/>

2.9 charsetmap (values)

Function: Creates a character set map where you can define which characters to replace with which characters.

For running the character set map, stringconvert needs to be defined (see below).

Attributes: name

Contains: values

Syntax:

<charsetmap name="value">

<values search="replaceable" replace="replacer"/>

</charsetmap>

Example:

<charsetmap name="MyMap">

<values search="AAABBB" replace="GISSE"/>

</charsetmap>

2.10 stringconvert (register)

Function: Loads and runs a character set map defined by charsetmap (see above).

Attributes:

Contains: register

Syntax:

<stringconvert use="mapname" overlapping="true|false">

<register nameref="registername"/>

</stringconvert>

Example:

<stringconvert use="MyMap" overlapping="true">

<register nameref="reg1"/>

<register nameref="reg2"/>

</stringconvert>
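
The effect of the overlapping attribute is not spelled out in this section, so the Python sketch below is an interpretation inferred from the example results in section 4.4, not a definition. It assumes that with overlapping="true" each values rule is applied in turn to the result of the previous one, and that with overlapping="false" every character of the original string is replaced at most once. The function name convert is hypothetical, and the non-overlapping branch only handles single-character search strings such as those of "map1" in section 4.2.

```python
# The rules of charsetmap "map1" from section 4.2, in document order.
MAP1 = [(",", "."), (".", "-"), ("a", "b"), ("b", "c")]


def convert(text, rules, overlapping):
    """Apply search/replace rules to text (interpretation, see lead-in)."""
    if overlapping:
        # Each rule sees the output of the previous rule.
        for search, replace in rules:
            text = text.replace(search, replace)
        return text
    # Each original character is looked up once; replacements are not rescanned.
    table = {}
    for search, replace in rules:
        table.setdefault(search, replace)
    return "".join(table.get(ch, ch) for ch in text)


print(convert("0,12.10", MAP1, overlapping=False))  # 0.12-10
print(convert("0,12.10", MAP1, overlapping=True))   # 0-12-10
print(convert("abba", MAP1, overlapping=True))      # cccc
```

With overlapping enabled, the order of the values elements matters: the comma becomes a dot under the first rule and that dot is then turned into a hyphen by the second.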

2.11 count

Function: A simple arithmetic calculator.

Contains: No elements.

Attributes: parameter (REQUIRED), value1 (REQUIRED), value2, value3

parameter attribute values:

Syntax:

<count parameter="function" value1="number|register" value2="number|register" value3="registername"/>

Example:

<count parameter="minus" value1="23" value2="20" value3="reg3"/>

<count parameter="decimal" value1="1"/>
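
As a rough illustration, the arithmetic of count and the "append" parameter of set (2.8) can be modelled in Python as below. This is a sketch under assumptions: the operation names come from the DTD in Appendix 1, the "decimal" parameter is left out because its meaning is not described here, and the function names and the use of Decimal for number handling are choices made for the example only. Operands are taken to be already resolved from registers.

```python
from decimal import Decimal

# Operation names as listed for the count element in Appendix 1.
OPERATIONS = {
    "plus": lambda a, b: a + b,
    "minus": lambda a, b: a - b,
    "multiply": lambda a, b: a * b,
    "divide": lambda a, b: a / b,
}


def count(parameter, value1, value2):
    """Combine two numeric strings and return the result as a string."""
    result = OPERATIONS[parameter](Decimal(value1), Decimal(value2))
    return str(result)


def append(value1, value2):
    """Concatenate two strings, as set parameter="append" does in section 4."""
    return value1 + value2


print(count("minus", "23", "20"))   # 3
print(count("plus", "3", "5"))      # 8
print(append("value_", "1"))        # value_1
print(append("amount", "contact"))  # amountcontact
```

In a real script the result would be written to the register named in value3 rather than returned.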

 

3 BASIC EXAMPLE OF DATA EXTRACTION LANGUAGE

Below are example rules (3.2) that extract data from normal HTML pages (3.1), producing a result XML file (3.3).

3.1 input data

Here is an example input data HTML file that will be processed below:

<html>

<body>

<p>Test Material</p>

<table>

<tr><td>Some numbers<td>Other numbers</tr>

<tr><td>1<td>2

<tr><td>3</td><td>4</tr>

</table>

</body>

</html>

3.2 DEL script

This example DEL script contains the following rules:

<wrapper>

<map maptype="createDocument" node="root"/>

<map maptype="createElement" node="description"/>

<map maptype="createTextNode" node="Extracted data"/>

<extract exptype="over" expression="&lt;table&gt;"/>

<extract exptype="upto" expression="&lt;/table&gt;">

<repeat times="*">

<extract exptype="over" expression="&lt;tr&gt;"/>

<extract exptype="over" expression="&lt;td&gt;"/>

<extract exptype="upto" expression="&lt;" save="stack"/>

<map maptype="moveCursor" node="/root"/>

<map maptype="createElement" node="row"/>

<map maptype="createElement" node="field1" content="stack"/>

<map maptype="moveCursor" node=".."/>

<extract exptype="over" expression="&lt;td&gt;"/>

<extract exptype="upto" expression="&lt;" save="reg1"/>

<map maptype="createElement" node="field2" content="reg1"/>

<test testtype="equal" value1="4" value2="reg1">

<map maptype="createAttribute" node="test" content="success"/>

</test>

</repeat>

</extract>

</wrapper>

3.3 output XML

The output XML from the above input data (3.1) and rules (3.2) is as follows:

<?xml version="1.0" encoding="ISO-8859-1"?>

<root>

<description>Extracted data</description>

<row>

<field1>Some numbers</field1>

<field2>Other numbers</field2>

</row>

<row>

<field1>1</field1>

<field2>2</field2>

</row>

<row>

<field1>3</field1>

<field2 test="success">4</field2>

</row>

</root>
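
To make the control flow of this example easier to follow, here is a Python sketch that walks through the same steps on the input data of 3.1. It is an informal model, not a DEL processor. The class and function names are hypothetical, extracted values are trimmed of surrounding whitespace (an assumption needed to obtain "2" in the second row), and a missing expression is modelled as an exception that ends the repeat loop, in the spirit of "dataStreamError".

```python
import xml.etree.ElementTree as ET

HTML = """<html>
<body>
<p>Test Material</p>
<table>
<tr><td>Some numbers<td>Other numbers</tr>
<tr><td>1<td>2
<tr><td>3</td><td>4</tr>
</table>
</body>
</html>
"""


class DataStreamError(Exception):
    """Raised when an expression is not found in the data."""


class Source:
    """Source data plus a read position (hypothetical helper)."""

    def __init__(self, data):
        self.data, self.pos = data, 0

    def over(self, expr):
        """Move the position just past the next occurrence of expr."""
        i = self.data.find(expr, self.pos)
        if i < 0:
            raise DataStreamError(expr)
        self.pos = i + len(expr)

    def upto(self, expr):
        """Return the trimmed text before expr and stop in front of expr."""
        i = self.data.find(expr, self.pos)
        if i < 0:
            raise DataStreamError(expr)
        text, self.pos = self.data[self.pos:i], i
        return text.strip()


def run(html):
    root = ET.Element("root")                         # createDocument
    ET.SubElement(root, "description").text = "Extracted data"
    src = Source(html)
    src.over("<table>")
    table = Source(src.upto("</table>"))              # data for the nested rules
    while True:                                       # repeat times="*"
        try:
            table.over("<tr>")
            table.over("<td>")
            first = table.upto("<")                   # save="stack"
            table.over("<td>")
            second = table.upto("<")                  # save="reg1"
        except DataStreamError:
            break                                     # no further rows
        row = ET.SubElement(root, "row")
        ET.SubElement(row, "field1").text = first
        field2 = ET.SubElement(row, "field2")
        field2.text = second
        if second == "4":                             # test testtype="equal"
            field2.set("test", "success")
    return root


result = ET.tostring(run(HTML), encoding="unicode")
print(result)
```

The printed document contains the same elements as the output in 3.3, serialized on one line and without the XML declaration.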

 

4 ADVANCED EXAMPLE WITH CHARACTER SET MAPPING AND OUTPUT TEMPLATE

Below is another example where a DEL script modifies an output XML template with source data. The character set mapping feature is used to search and replace characters.

4.1 input data

Input data can be complex and hard to extract, as follows:

<area id=#2202>

#0,12.10

#23,1.514

#1,4.444

#abba

<area id=#2203>

#amount

#contact

<area id=#2204>

#3

#5

<area id=#2205>

#end

4.2 DEL script

This example script makes a character set conversion using charsetmap and stringconvert elements (first with "false" overlapping and then with "true" overlapping). It produces two result XML files using "getTemplate". It also uses count to add register values and "append" to combine register values.

<wrapper>

<!-- Character set map "map1" defines the following replacements:

',' (comma) → '.' (dot)

'.' (dot) → '-' (hyphen)

'a' → 'b'

'b' → 'c'

Note that no data conversion is made yet. The actual conversion is made using the stringconvert element. -->

<charsetmap name="map1">

<values search="," replace="."/>

<values search="." replace="-"/>

<values search="a" replace="b"/>

<values search="b" replace="c"/>

</charsetmap>

<repeat times="1">

<map maptype="getTemplate" node="document"/>

<extract exptype="mark" expression="regBegin"/>

<extract exptype="re_over" expression="\r\n&lt;">

<map maptype="moveCursor" node="/document/charsetmap/original"/>

<set parameter="memory" value1="regC" value2="0"/>

<repeat times="*">

<extract exptype="re_over" expression="\r\n"/>

<extract exptype="over" expression="#"/>

<extract exptype="re_upto" expression="\r\n" save="regValue"/>

<count parameter="plus" value1="regC" value2="1" value3="regCC"/>

<set parameter="append" value1="value_" value2="regCC" value3="regRes"/>

<map maptype="createElement" node="regRes" content="regValue"/>

<map maptype="moveCursor" node=".."/>

</repeat>

</extract>

<extract exptype="set" expression="regBegin"/>

<extract exptype="re_over" expression="\r\n&lt;">

<map maptype="moveCursor" node="/document/charsetmap/overlappingfalse"/>

<set parameter="memory" value1="regC" value2="0"/>

<repeat times="*">

<extract exptype="re_over" expression="\r\n"/>

<extract exptype="over" expression="#"/>

<extract exptype="re_upto" expression="\r\n" save="regValue"/>

<stringconvert use="map1" overlapping="false">

<register nameref="regValue"/>

</stringconvert>

<count parameter="plus" value1="regC" value2="1" value3="regCC"/>

<set parameter="append" value1="value_" value2="regCC" value3="regRes"/>

<map maptype="createElement" node="regRes" content="regValue"/>

<map maptype="moveCursor" node=".."/>

</repeat>

</extract>

<extract exptype="set" expression="regBegin"/>

<extract exptype="re_over" expression="\r\n&lt;">

<map maptype="moveCursor" node="/document/charsetmap/overlappingtrue"/>

<set parameter="memory" value1="regC" value2="0"/>

<repeat times="*">

<extract exptype="re_over" expression="\r\n"/>

<extract exptype="over" expression="#"/>

<extract exptype="re_upto" expression="\r\n" save="regValue"/>

<stringconvert use="map1" overlapping="true">

<register nameref="regValue"/>

</stringconvert>

<count parameter="plus" value1="regC" value2="1" value3="regCC"/>

<set parameter="append" value1="value_" value2="regCC" value3="regRes"/>

<map maptype="createElement" node="regRes" content="regValue"/>

<map maptype="moveCursor" node=".."/>

</repeat>

</extract>

<extract exptype="mark" expression="regMiddle"/>

<extract exptype="over" expression="2204&gt;"/>

<extract exptype="mark" expression="regCount"/>

<extract exptype="re_over" expression="&lt;">

<map maptype="moveCursor" node="/document/count/original"/>

<repeat times="*">

<extract exptype="re_over" expression="\r\n"/>

<extract exptype="over" expression="#"/>

<extract exptype="re_upto" expression="\r\n" save="regValue"/>

<stringconvert use="map1" overlapping="false">

<register nameref="regValue"/>

</stringconvert>

<count parameter="plus" value1="regC" value2="1" value3="regCC"/>

<set parameter="append" value1="value_" value2="regCC" value3="regRes"/>

<map maptype="createElement" node="regRes" content="regValue"/>

<map maptype="moveCursor" node=".."/>

</repeat>

</extract>

<extract exptype="set" expression="regCount"/>

<map maptype="moveCursor" node="/document/count/result"/>

<set parameter="memory" value1="regC" value2="0"/>

<extract exptype="re_over" expression="\r\n"/>

<extract exptype="over" expression="#"/>

<extract exptype="re_upto" expression="\r\n" save="regValue"/>

<extract exptype="re_over" expression="\r\n"/>

<extract exptype="over" expression="#"/>

<extract exptype="re_upto" expression="\r\n" save="regValue2"/>

<extract exptype="re_over" expression="\r\n"/>

<count parameter="plus" value1="regValue" value2="regValue2" value3="regCC"/>

<map maptype="createTextNode" node="" content="regCC"/>

<extract exptype="set" expression="regMiddle"/>

<map maptype="moveCursor" node="/document/append/original"/>

<set parameter="memory" value1="regC" value2="0"/>

<extract exptype="re_over" expression="\r\n"/>

<extract exptype="over" expression="#"/>

<extract exptype="re_upto" expression="\r\n" save="regV1"/>

<extract exptype="re_over" expression="\r\n"/>

<extract exptype="over" expression="#"/>

<extract exptype="re_upto" expression="\r\n" save="regV2"/>

<map maptype="createElement" node="value1" content="regV1"/>

<map maptype="moveCursor" node=".."/>

<map maptype="createElement" node="value1" content="regV2"/>

<map maptype="moveCursor" node=".."/>

<map maptype="moveCursor" node="/document/append/result"/>

<set parameter="append" value1="regV1" value2="regV2" value3="regV3"/>

<map maptype="createTextNode" node="" content="regV3"/>

<map maptype="documentReady" />

</repeat>

</wrapper>

4.3 XML template

The DEL script (in 4.2) uses the following kind of XML template (using "getTemplate" to call this template):

<document>

<charsetmap>

<original/>

<overlappingfalse/>

<overlappingtrue/>

</charsetmap>

<count>

<original/>

<result/>

</count>

<append>

<original/>

<result/>

</append>

</document>

4.4 output XML

The output XML from the above input data and rules is as follows:

<?xml version="1.0" encoding="ISO-8859-1" ?>

<document>

<charsetmap>

<original>

<value_1>0,12.10</value_1>

<value_1>23,1.514</value_1>

<value_1>1,4.444</value_1>

<value_1>abba</value_1>

</original>

<overlappingfalse>

<value_1>0.12-10</value_1>

<value_1>23.1-514</value_1>

<value_1>1.4-444</value_1>

<value_1>bccb</value_1>

</overlappingfalse>

<overlappingtrue>

<value_1>0-12-10</value_1>

<value_1>23-1-514</value_1>

<value_1>1-4-444</value_1>

<value_1>cccc</value_1>

</overlappingtrue>

</charsetmap>

<count>

<original>

<value_1>3</value_1>

<value_1>5</value_1>

</original>

<result>8</result>

</count>

<append>

<original>

<value1>amount</value1>

<value1>contact</value1>

</original>

<result>amountcontact</result>

</append>

</document>

APPENDIX 1. DATA EXTRACTION LANGUAGE DTD

<!DOCTYPE wrapper [

<!ELEMENT wrapper (template*, map, (repeat | test | map | extract | set | runtemplate | charsetmap | stringconvert | count)*) >

<!ELEMENT template (repeat | test | map | extract | set | runtemplate | charsetmap | stringconvert | count)* >

<!ATTLIST template

name ID #REQUIRED >

<!ELEMENT repeat (repeat | test | map | extract | set | runtemplate | charsetmap | stringconvert | count)* >

<!ATTLIST repeat

times CDATA #REQUIRED >

<!ELEMENT test (repeat | test | map | extract | set | runtemplate | charsetmap | stringconvert | count)* >

<!ATTLIST test

testtype (equal | unequal | less | greater | re_equal | re_unequal | contains) #REQUIRED

value1 CDATA #REQUIRED

value2 CDATA #IMPLIED >

<!ELEMENT map EMPTY >

<!ATTLIST map

maptype (getTemplate | documentReady | moveCursor | createDocument | createElement | createElementBefore | createElementAfter | createAttribute | createTextNode | createComment | createCDATA | createProcessingInstructions) #REQUIRED

node CDATA #REQUIRED

content CDATA #IMPLIED >

<!ELEMENT extract (repeat | test | map | extract | set | runtemplate | charsetmap | stringconvert | count)* >

<!ATTLIST extract

exptype (length | upto | over | re_upto | re_over | content) #REQUIRED

expression CDATA #REQUIRED

save CDATA #IMPLIED >

<!ELEMENT set EMPTY >

<!ATTLIST set

parameter (doctype | encoding | memory | append | serializeOutput) #REQUIRED

value1 CDATA #REQUIRED

value2 CDATA #IMPLIED

value3 CDATA #IMPLIED >

<!ELEMENT runtemplate EMPTY >

<!ATTLIST runtemplate

nameref IDREF #REQUIRED >

<!ELEMENT charsetmap (values)*>

<!ATTLIST charsetmap

name CDATA #REQUIRED >

<!ELEMENT values EMPTY >

<!ATTLIST values

search CDATA #REQUIRED

replace CDATA #REQUIRED >

<!ELEMENT stringconvert (register)*>

<!ATTLIST stringconvert

use CDATA #REQUIRED

overlapping (true | false) #REQUIRED >

<!ELEMENT register EMPTY>

<!ATTLIST register

nameref CDATA #REQUIRED >

<!ELEMENT count EMPTY >

<!ATTLIST count

parameter (plus | minus | multiply | divide | decimal) #REQUIRED

value1 CDATA #REQUIRED

value2 CDATA #REQUIRED

value3 CDATA #IMPLIED >

]>


References

XML
World Wide Web Consortium. Extensible Markup Language (XML) 1.0. February 1998. <http://www.w3.org/TR/REC-xml>
XPath
World Wide Web Consortium. XML Path Language (XPath) 1.0. November 1999. <http://www.w3.org/TR/xpath>
DOM
World Wide Web Consortium. Document Object Model (DOM) Level 2 Core Specification 1.0. November 2000. <http://www.w3.org/TR/DOM-Level-2-Core>
REGEX
GNU Regular Expressions, Regex Edition 0.12a. September 1992. <http://sunsite.utk.edu/gnu/regex/regex_toc.html>