
XML to RDF Transformation processes using XSLT - Best Practices and Challenges

Introduction

Intention of the group

The goal of the “RDF and XML Interoperability Community Group” (in short, the “RAX group”) is to

  • identify application areas in which the combined processing of XML and RDF data and tooling is beneficial
  • identify issues that hinder the joint usage of the two technology stacks
  • formulate best practices to resolve the issues or propose standardization topics

The goal does not only take into account the data representation formats XML and RDF, but also all related technologies (e.g. for XML: XSLT and XQuery; for RDF: RDF Schema and SPARQL) and selected XML (e.g. OData) or RDF vocabularies.

Intention of this document

XML is widely used in many industries to represent textual information – and, most of the time, accompanying metadata as well. This metadata can be of a technical nature, like “last_modification_date”, or it can be domain driven, like “topic_of_this_section”. The main reason for keeping metadata in the XML itself is process efficiency – all relevant information stays within one file. In general, however, metadata has a lifecycle of its own and should therefore be represented in its own right, in a flexible and scalable format. We think that RDF is the format of choice here. So the question is how to enable metadata usage without sacrificing the advantages of XML. We propose to transform the metadata contained in the XML to RDF using XSLT. The main reason for choosing XSLT is that it is widely adopted for many purposes and a lot of expertise is available in the market. Unfortunately, there are a couple of challenges with this approach:

  • Resulting XSLT scripts are very complex and hard to maintain
  • Complex transformation processes can cause performance issues
  • When a round trip of data transformation is required, XSLT is not well suited for implementing the reverse RDF to XML transformation, so a different technology needs to be introduced for this task
  • Resulting RDF is not capable of preserving all information that is stored in XML

To illustrate the latter point: a set of RDF triples expresses a directed, labeled graph that can be rendered in different ways in XML. For instance, a resource can be represented by an <rdf:Description> element, or its type (e.g. skos:Concept) can be used as the element name. This is because the RDF/XML syntax is schemaless. As a consequence, transformation between RDF and XML using XSLT is not optimal, and XSLTs need to be tailored to different flavors of RDF/XML (which is costly).
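Concretely, the following two RDF/XML fragments (the concept URI is a hypothetical example) encode exactly the same triples, once with a generic <rdf:Description> element and once with a typed node element:

  <!-- Variant 1: generic rdf:Description element with an explicit rdf:type -->
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:skos="http://www.w3.org/2004/02/skos/core#">
    <rdf:Description rdf:about="http://example.org/concept/42">
      <rdf:type rdf:resource="http://www.w3.org/2004/02/skos/core#Concept"/>
      <skos:prefLabel>Example concept</skos:prefLabel>
    </rdf:Description>
  </rdf:RDF>

  <!-- Variant 2: typed node element; the same graph, but a different XML tree -->
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:skos="http://www.w3.org/2004/02/skos/core#">
    <skos:Concept rdf:about="http://example.org/concept/42">
      <skos:prefLabel>Example concept</skos:prefLabel>
    </skos:Concept>
  </rdf:RDF>

An XSLT written against the first shape will not match the second, even though both fragments carry identical information.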

This document intends to offer practical answers to these challenges, so that XML to RDF transformation can be adopted more widely.
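To make the proposed approach concrete, the following is a minimal sketch of an XSLT 2.0 stylesheet that lifts embedded metadata into RDF/XML. The input structure (article, metadata, topic, last_modification_date) and the output vocabulary prefixes (dct:, ex:) are hypothetical and serve only as an illustration:

  <xsl:stylesheet version="2.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
      xmlns:dct="http://purl.org/dc/terms/"
      xmlns:ex="http://example.org/vocab#">

    <xsl:output method="xml" indent="yes"/>

    <!-- Wrap all extracted statements in a single rdf:RDF element -->
    <xsl:template match="/">
      <rdf:RDF>
        <xsl:apply-templates select="//article"/>
      </rdf:RDF>
    </xsl:template>

    <!-- One resource per article; its URI is minted from the @id attribute -->
    <xsl:template match="article">
      <rdf:Description rdf:about="http://example.org/article/{@id}">
        <dct:modified>
          <xsl:value-of select="metadata/last_modification_date"/>
        </dct:modified>
        <!-- One triple per topic element -->
        <xsl:for-each select="metadata/topic">
          <ex:topic><xsl:value-of select="."/></ex:topic>
        </xsl:for-each>
      </rdf:Description>
    </xsl:template>

  </xsl:stylesheet>

Real-world stylesheets quickly grow far beyond such a sketch, which is exactly where the maintainability challenge listed above comes from.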

In general, there are a number of scenarios in which this transformation makes a lot of sense. We have identified the following main use cases:


Content enrichment in digital publishing

This use case is based on the fact that XML is widely used in the media and publishing industry. Content and metadata are held in XML files. But the obstacles of this approach become more severe the more digital publishing replaces the old print-oriented process as the master process. Metadata gains more and more importance, but this metadata is not always content-specific, so it should not be stored in the XML file. Also, relationships between metadata items are not well suited for representation in XML. Most important of all: with ever more business requests to add metadata in order to enable additional functionality, e.g. on search platforms, the traditional process of schema change simply breaks down. Changing a schema and adapting all subsequent editorial and production processes to it is costly, and since most of the time the content itself is not even touched, de-coupling content and metadata is the only way to go.

Content enrichment in localization and translation

This use case aims at using content enrichment and analysis services, such as the FREME Platform, to carry out tasks such as terminology spotting, entity spotting and machine translation during the process of translating content from one natural language to another. The FREME services express enrichments in RDF. The translation and localization processes in place already use the well-standardized XML vocabulary/application XML Localisation Interchange File Format (XLIFF), which is supported by all of the tool-sets within the workflow. The next step would be to embed RDF into XLIFF in ways that cause the least disruption to this existing process, but which maximize the ability to carry the enrichment through as much of the workflow as possible.

Content quality improvement

This use case addresses the fact that data quality issues naturally arise when complex content structures and metadata are stored in isolated XML files. Some of these issues can be controlled by an XML schema, but in general a lot of the semantic meaning and the relationships between entities in XML files cannot be controlled this way. Applying constraint checks and solving data quality issues in RDF, on the other hand, is a well-known and common practice. Executing these checks on the transformed RDF therefore helps to identify quality issues in the content, and it also ensures that new problems arising from, e.g., schema changes can be identified already in a test environment and not only after updating all XML files to the new schema version.

Linked data serialization

A process which generates Linked Data RDF will want to present that data in a number of different serializations, e.g. RDF/XML, Turtle, JSON-LD. This can be achieved by starting from an XML source and applying one of a number of XSLT transforms to generate the required serialization. While RDF/XML is a possible source for generating the other serializations, it is not very suitable for this task, despite being XML. Instead we propose using a simpler XML model which is as close as possible to the RDF graph model, sketched below. This should allow the development of generic XSLT transforms which can be applied with confidence to any RDF that is to be presented as Linked Data.
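As a sketch of what such a graph-near model could look like (the element names triples, t, s, p, o and the namespace are purely illustrative, not an existing standard), the model could simply list one element per triple:

  <triples xmlns="http://example.org/triple-xml#">
    <t>
      <s>http://example.org/article/42</s>
      <p>http://purl.org/dc/terms/title</p>
      <o type="literal">An example title</o>
    </t>
    <t>
      <s>http://example.org/article/42</s>
      <p>http://purl.org/dc/terms/creator</p>
      <o type="uri">http://example.org/person/7</o>
    </t>
  </triples>

Because every triple has the same fixed structure, a single generic XSLT can serialize such a document into RDF/XML, Turtle or JSON-LD without any knowledge of the vocabulary in use.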

As already explained, XSLT-based transformation is our preferred recommendation, but there are also alternatives that could be better suited in certain situations; examples such as XQuery and XSPARQL are touched upon later in this document.

Structure of this document

This document is structured as follows: section 1 is this introduction; section 2 is the main section, describing the transformation process with respect to the process itself as well as best practices, tools and standards; section 3 summarizes the results and gives some recommendations on the scope of future work.

XML to RDF via XSLT

Recommended process

Pre-processing of XML

As RDF includes the RDF/XML serialization, we first attempted to use XSLT directly on RDF/XML input. However, as explained in the introduction, the same set of triples can be rendered in many different ways in RDF/XML, so stylesheets have to be tailored to a particular flavor of RDF/XML, which is costly. One solution that we have used is to convert the RDF input (in any supported syntax) into a canonical XML form before applying XSLTs. The approach is supplemented by a library of Java functions that can be called from the XSLTs to process the graph. This approach has enabled us to migrate data across systems in several projects. However, we have seen issues with the performance of the transformation as well as increasing complexity; in other words, the stylesheets become difficult to maintain as the transformation requirements change.
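As an illustration only, reusing the hypothetical triple-oriented form sketched under “Linked data serialization” above (the real canonical form and the Java helper functions used in those projects are not specified here), a generic stylesheet over such a canonical document can, for example, emit Turtle without knowing anything about the vocabulary in use:

  <xsl:stylesheet version="2.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      xmlns:tx="http://example.org/triple-xml#"
      exclude-result-prefixes="tx">

    <xsl:output method="text"/>

    <!-- Emit one Turtle statement per canonical triple element;
         datatype and language-tag handling is omitted for brevity -->
    <xsl:template match="/tx:triples">
      <xsl:for-each select="tx:t">
        <xsl:text>&lt;</xsl:text><xsl:value-of select="tx:s"/><xsl:text>&gt; </xsl:text>
        <xsl:text>&lt;</xsl:text><xsl:value-of select="tx:p"/><xsl:text>&gt; </xsl:text>
        <xsl:choose>
          <xsl:when test="tx:o/@type = 'uri'">
            <xsl:text>&lt;</xsl:text><xsl:value-of select="tx:o"/><xsl:text>&gt;</xsl:text>
          </xsl:when>
          <xsl:otherwise>
            <xsl:text>"</xsl:text><xsl:value-of select="tx:o"/><xsl:text>"</xsl:text>
          </xsl:otherwise>
        </xsl:choose>
        <xsl:text> .&#10;</xsl:text>
      </xsl:for-each>
    </xsl:template>

  </xsl:stylesheet>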

Best practices

Create a documented XSLT repository for re-use.

Tools at hand

Krextor

Krextor is a library of high-level XSLT templates and functions for XML→RDF conversion. It enables the specification of mappings from XML-based formats to RDF at levels ranging from a declarative “schema to ontology” mapping (sufficient for many practical situations) to low-level XSLT (for full power). Krextor is not schema-aware; the mapping author is expected to know the schema of the XML input and the ontology of the RDF output, and has to write the mapping manually. Advantages over from-scratch one-off XSLT implementations include:

  • Krextor employs a high-level abstraction of RDF. Instead of just generating RDF/XML output (which is what most from-scratch one-off XSLTs for XML→RDF do), its basic actions are creating resources and adding properties to them. The rules for generating (“minting”) URIs are specified independently from the rules for mapping XML elements/attributes to RDF vocabulary terms.
  • Templates for many common tasks (e.g. generating URIs from ID/name attributes) are part of Krextor's library.
  • Convenient Java and command-line interfaces.

Krextor's most serious shortcomings are:

  • It is hard to specify your own XML→RDF extraction rules without a strong XSLT background. This is because when things go wrong you will receive low-level error messages from the XSLT processor.
  • Krextor is not optimized for performance (but for expressiveness of mappings and ease of implementing new mappings).
  • Krextor has only been tested with the Saxon XSLT processor, as Krextor's main developer (Christoph Lange) is not aware of any other free processor for XSLT ≥ 2.0. Also, as the current free version of Saxon does not support full XSLT 3.0, Krextor is still limited to XSLT 2.0.

XSLTdoc

What
XSLTdoc does for XSLT 1.0 and 2.0 what Javadoc does for Java.
How to write
All elements of an XSLT stylesheet can be documented by preceding them with XML elements from the XSLTdoc namespace. From these comments, XSLTdoc generates HTML documentation.
How to run
Manually, using Saxon or Java; there is also a command-line frontend based on Node.js.
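A minimal example of such documentation comments (the xd:* element names and the namespace URI follow XSLTdoc's conventions as we understand them and should be checked against the XSLTdoc manual; the template itself is hypothetical):

  <xsl:stylesheet version="2.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
      xmlns:xd="http://www.pnp-software.com/XSLTdoc">

    <!-- Documentation block preceding the template it describes -->
    <xd:doc>
      <xd:short>Converts a metadata element into an RDF property element.</xd:short>
      <xd:detail>Illustrative template only; the real transformation logic is omitted.</xd:detail>
    </xd:doc>
    <xsl:template match="metadata/*">
      <!-- ... transformation logic ... -->
    </xsl:template>

  </xsl:stylesheet>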

Standards available

XML

The eXtensible Markup Language (XML) is a formal language for describing tree-oriented data structures (“documents”) with possible links across the hierarchy. XML was first standardized by the W3C in 1998 as a profile, i.e. a sub-language, of the SGML standard, which had existed since 1986. XML is compatible with the Web in that XML documents, parts of them, and the vocabulary (“schema”) and datatypes used to describe the nodes of documents can all be identified by URIs. An XML document consists of elements, which can carry attributes and whose children can be further elements or text nodes. The most widely known XML vocabulary is XHTML, the XML serialization of HTML; more generally, XML is widely used as the basis of standard formats for exchanging documents and data.
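A minimal example document (the vocabulary is made up for illustration) showing an element with an attribute, a child element and text content:

  <article id="a42" xmlns="http://example.org/articles">
    <title>An example article</title>
    <metadata>
      <last_modification_date>2016-01-15</last_modification_date>
    </metadata>
  </article>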

XML Schema and related languages are used to define the grammar of XML documents, e.g. to specify the datatype of an attribute or element.
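For instance, a schema fragment for the hypothetical document above could constrain the modification date to the xs:date datatype:

  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
             targetNamespace="http://example.org/articles"
             elementFormDefault="qualified">
    <xs:element name="last_modification_date" type="xs:date"/>
  </xs:schema>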

RDF

The Resource Description Framework (RDF) is a formal language for describing graph-oriented data structures. The goal of RDF is to enable information systems to exchange data on the Web without losing their original meaning. Its first official specification, which provided a vocabulary to describe Web resources, was released in 1999 by the W3C. There is now a whole family of W3C standards around RDF.

RDF data models are composed of statements of the form subject–predicate–object, called triples. The subject represents the resource being described, while the predicate denotes the relationship between the subject and the object. A resource can be anything (a physical object, a document, an abstract concept, etc.) and is denoted by a URI, i.e. a Web-scale unique identifier. The predicate of a triple, also called a “property” of the resource, is likewise denoted by a URI. The object of a triple can be a resource or a data value (called a literal), e.g. a string or a number. Most commonly, the XML Schema datatypes are used as the datatypes of literals.
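For example, the following RDF/XML fragment (about a hypothetical article resource, using the Dublin Core terms vocabulary) states two triples: one whose object is another resource and one whose object is a typed literal:

  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:dct="http://purl.org/dc/terms/">
    <rdf:Description rdf:about="http://example.org/article/42">
      <dct:creator rdf:resource="http://example.org/person/7"/>
      <dct:modified rdf:datatype="http://www.w3.org/2001/XMLSchema#date">2016-01-15</dct:modified>
    </rdf:Description>
  </rdf:RDF>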

RDF Schema (RDFS) extends the RDF vocabulary with constructs that allow user-defined vocabularies to be semantically characterized. The semantics of predicates can be expressed by the rdfs:domain and rdfs:range properties: the first specifies the type(s) of the subject of any triple in which the respective property is used as the predicate, while the second specifies the type(s) of the object of such a triple. RDFS also enables the definition of a hierarchy of classes, which can serve as types of resources (<resource> rdf:type <class>).
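A small illustration in RDF/XML (the example.org vocabulary is hypothetical): an Article class is declared as a subclass of a Document class, and a topic property is declared to link Articles to skos:Concepts:

  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#">
    <rdfs:Class rdf:about="http://example.org/vocab#Article">
      <rdfs:subClassOf rdf:resource="http://example.org/vocab#Document"/>
    </rdfs:Class>
    <rdf:Property rdf:about="http://example.org/vocab#topic">
      <rdfs:domain rdf:resource="http://example.org/vocab#Article"/>
      <rdfs:range rdf:resource="http://www.w3.org/2004/02/skos/core#Concept"/>
    </rdf:Property>
  </rdf:RDF>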

RDFa

XSLT 1.0/2.0/3.0; XQuery

eXtensible Stylesheet Language Transformations (XSLT) is a formal language for specifying “tree to tree” transformations of XML documents into other XML documents or into plain text. XSLT was first standardized by the W3C in 1999; XSLT 1.0 is still the version that is most widely supported by (free) software. The more recent versions 2.0 and 3.0 are mainly supported by certain commercial processors.
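A minimal stylesheet illustrating this mode of operation (the element name last_modification_date is hypothetical, and the input elements are assumed to be in no namespace): an identity rule recurses over the whole input tree, and a second, more specific pattern renames one element while everything else is copied unchanged:

  <xsl:stylesheet version="1.0"
      xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <!-- Identity rule: copy every node and recurse into its children -->
    <xsl:template match="@*|node()">
      <xsl:copy>
        <xsl:apply-templates select="@*|node()"/>
      </xsl:copy>
    </xsl:template>

    <!-- Pattern-matching rule: rename one element -->
    <xsl:template match="last_modification_date">
      <modified>
        <xsl:apply-templates select="@*|node()"/>
      </modified>
    </xsl:template>

  </xsl:stylesheet>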

XSLT is closely related to XQuery, a W3C-standardized language for querying XML documents. The main difference is that, by default, an XSL Transformation recurses over the entire XML input document, whereas XQuery is better at extracting small answers to queries from XML documents. However, their expressiveness is equal, and each of these tasks can also be performed in the other language. Both XSLT and XQuery build on XPath to identify sets of nodes in XML documents and to perform simple computations on them.

XLIFF (most relevant in the localization and translation industry)

RDF to XML (not XSLT)

  • Best practices
    • XSPARQL

Summary

XSLT is widely used to implement translations from XML to RDF; it is probably the most widely used translation approach. Such translations are either implemented from scratch in basic XSLT, creating, e.g., RDF/XML output, or built on tools and libraries that offer high-level abstractions on top of XSLT.

Thanks to its Turing-complete expressiveness and its strong native ability to process XML, XSLT can do the job: it is able to implement any translation from XML to RDF.

There are, however, several obstacles. In summary, not all XML→RDF translations can easily be implemented in XSLT.

  • XSLT is complex. Using it requires a good understanding of the XML data model and of XSLT's mode of operation, i.e., recursing over the input tree and identifying nodes to process by pattern matching. XSLT is a functional programming language, which supports implementations composed of small, reusable modules, but which makes it hard to maintain and update a state throughout the translation process.
  • In contrast to, e.g., XQuery, XSLT's syntax is based on XML, which makes translations verbose and hard to read and write for humans.
  • XSLT cannot process all XML structures easily. For example, mixed content, i.e. elements containing both element and text children, is typical of document-oriented XML languages and can have quite a variable structure, which makes it hard to define general patterns.
    • Major maintenance issues of XSLT (TODO)
  • Recommendations for future progress