Draft Material

From RDF and XML Interoperability Community Group

Introduction

Purpose of the page

This page lists use cases and the technologies that can implement them. All use cases involve working with XML, RDF and JSON. Currently there is a strong focus on XML and RDF, but this may change in the future.

Structure of the page

There are two main subsections: use cases and technology solutions. Both are subdivided further. The use case section is divided into types of use cases. The technology solutions section groups solutions by aspect, for example: a proposed extension of the RDF or XML technology stack; a best practice; a tool; etc. The aim of this structure is to detect gaps that may benefit from further standardization, best practice description or tool development.

Use Cases

Data Enrichment in Digital Publishing

Owner: Christian, Quentin, Rob

What:

XML, JSON & Linked Data Challenge:

Data in Public Administration

Owner: Gerard

What:

XML, JSON & Linked Data Challenge:

Mapping of XML Dictionary Data into RDF

Owner: Timea

What: Lexical Dictionaries from XML format to RDF. See blog post: https://ldl4.com/2016/09/26/what-ive-learned-while-triplifying-a-real-dictionary/ See https://lists.w3.org/Archives/Public/public-rax/2016Jul/0010.html

Example: Language is very complex, and so are the dictionaries that contain rich data about headwords. The dictionary data is stored in XML, and its structure is very complex because it tries to retain all aspects of language in the XML tags. The task started by choosing a basic ontology/model for lexicography in general, such as Lemon and Ontolex, and extending it with information particular to the data. The solution was to create XSLTs that can handle the complex XML structure, but this is neither scalable nor maintainable.

XML, JSON & Linked Data Challenge: complex XML structure

Mapping of XML Book Content (z-bible) into RDF

Owner: Bernát

What:

XML, JSON & Linked Data Challenge:

Mapping of XML EAD into RDF

Owner: Gerard

What: See https://lists.w3.org/Archives/Public/public-rax/2016Jul/0025.html

XML, JSON & Linked Data Challenge:

Enrichment in Localization and Translation

Owner: Phil

What: Our Use Case is that we wish to use content enrichment and analysis services such as the FREME Platform to carry out such things as terminology spotting, entity spotting and machine translation during the process of translating content from one natural language to another. The FREME services express enrichments in RDF.

Our translation and localization process already uses the well-standardized XML vocabulary/application Extensible Localization Interchange File Format (XLIFF). It is supported by all of the tool-sets within our workflow. We therefore wish to embed RDF into XLIFF in ways that cause the least disruption to this existing process but that maximize the ability to carry the enrichment through as much of the workflow as possible.

Currently we are using the following methods to achieve this:

  1. Embedding RDF/XML within the XLIFF
<?xml version="1.0" encoding="UTF-8"?>
<xliff xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0" srcLang="de" trgLang="en"
    xmlns:its="http://www.w3.org/2005/11/its">
    <file id="f1">
        <unit id="u1" its:annotatorsRef="text-analysis|http://spotlight.dbpedia.org/">
            <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                xmlns:dbprop="http://dbpedia.org/property/">
                <rdf:Description rdf:about="http://dbpedia.org/resource/Berlin">
                    <dbprop:population rdf:datatype="http://www.w3.org/2001/XMLSchema#integer"
                        >3415091</dbprop:population>
                </rdf:Description>
            </rdf:RDF>
            <segment id="s1">
                <source>Willkommen in Berlin!</source>
                <target>Welcome to Berlin!</target>
            </segment>
        </unit>
    </file>
</xliff>
  2. Embedding JSON-LD within the XLIFF
<?xml version="1.0" encoding="UTF-8"?>
<xliff xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0" srcLang="de" trgLang="en"
    xmlns:its="http://www.w3.org/2005/11/its">
    <file id="f1">
        <unit id="u1" its:annotatorsRef="text-analysis|http://spotlight.dbpedia.org/">
            <ex:json-ld xmlns:ex="http://example.com"> { "@context": { "dbpedia":
                "http://dbpedia.org/resource/", "dbprop": "http://dbpedia.org/property/", "rdf":
                "http://www.w3.org/1999/02/22-rdf-syntax-ns#", "rdfs":
                "http://www.w3.org/2000/01/rdf-schema#", "xsd": "http://www.w3.org/2001/XMLSchema#"
                }, "@id": "dbpedia:Berlin", "dbprop:population": 3415091 }</ex:json-ld>
            <segment id="s1">
                <source>Willkommen in Berlin!</source>
                <target>Welcome to Berlin!</target>
            </segment>
        </unit>
    </file>
</xliff>
  3. Utilizing the Internationalization Tag Set 2.0 (ITS 2.0) Ontology
<?xml version="1.0" encoding="UTF-8"?>
<xliff xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0" srcLang="de" trgLang="en"
    xmlns:its="http://www.w3.org/2005/11/its">
    <file id="f1">
        <unit id="u1" its:annotatorsRef="text-analysis|http://spotlight.dbpedia.org/">
            <segment id="s1">
                <source>Willkommen in Berlin!</source>
                <target>Welcome to <mrk id="m1" type="its:any"
                        its:taIdentRef="http://dbpedia.org/resource/Berlin">Berlin</mrk>!</target>
            </segment>
        </unit>
    </file>
</xliff>
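
The enrichments carried by the second and third methods can be read back with ordinary XML tooling. The following is a minimal Python sketch, not part of the described workflow: the function names and the trimmed SAMPLE document are illustrative assumptions, the ex namespace is the placeholder used in the example above, and the JSON-LD is only parsed as JSON (no JSON-LD context expansion is performed).

```python
import json
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:2.0"
ITS_NS = "http://www.w3.org/2005/11/its"
EX_NS = "http://example.com"  # placeholder namespace from the example above

def embedded_json_ld(xliff_text):
    """Return the JSON objects found in ex:json-ld elements (hypothetical helper)."""
    root = ET.fromstring(xliff_text)
    return [json.loads(el.text) for el in root.iter(f"{{{EX_NS}}}json-ld")]

def entity_annotations(xliff_text):
    """Return (surface text, entity IRI) pairs from mrk elements with its:taIdentRef."""
    root = ET.fromstring(xliff_text)
    pairs = []
    for mrk in root.iter(f"{{{XLIFF_NS}}}mrk"):
        iri = mrk.get(f"{{{ITS_NS}}}taIdentRef")
        if iri:
            pairs.append(("".join(mrk.itertext()), iri))
    return pairs

# Trimmed, combined variant of the examples above (an assumption for illustration).
SAMPLE = """<xliff xmlns="urn:oasis:names:tc:xliff:document:2.0" version="2.0"
    srcLang="de" trgLang="en" xmlns:its="http://www.w3.org/2005/11/its">
  <file id="f1">
    <unit id="u1">
      <ex:json-ld xmlns:ex="http://example.com">{"@id": "http://dbpedia.org/resource/Berlin",
        "http://dbpedia.org/property/population": 3415091}</ex:json-ld>
      <segment id="s1">
        <source>Willkommen in Berlin!</source>
        <target>Welcome to <mrk id="m1" type="its:any"
            its:taIdentRef="http://dbpedia.org/resource/Berlin">Berlin</mrk>!</target>
      </segment>
    </unit>
  </file>
</xliff>"""

if __name__ == "__main__":
    print(embedded_json_ld(SAMPLE))
    print(entity_annotations(SAMPLE))
```

A real pipeline would likely hand the extracted JSON-LD to an RDF library instead of treating it as plain JSON.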

Data acquisition from job postings via GATE

Owner: Christoph (on behalf of his colleague Elisa M. Sibarani)

What: A GATE Embedded pipeline annotates raw job posting texts and outputs XML, from which RDF is then extracted. Implemented using Krextor.

XML, JSON & Linked Data Challenge: The structure of the GATE XML output (empty element nodes used to select ranges of text and link them to structured annotations) is hard to process using declarative XPath-based approaches.
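
To illustrate why this structure resists declarative processing, here is a hedged Python sketch that resolves such range markers imperatively. The sample document is a simplified assumption loosely modelled on GATE's XML serialization (empty Node elements inside the text, annotations pointing at them via StartNode and EndNode); the text content, annotation types and the function name are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Simplified, assumed input: empty <Node/> markers delimit ranges of text.
SAMPLE = """<GateDocument>
  <TextWithNodes><Node id="0"/>Java<Node id="4"/> developer in <Node id="18"/>Bonn<Node id="22"/></TextWithNodes>
  <AnnotationSet>
    <Annotation Id="1" Type="Skill" StartNode="0" EndNode="4"/>
    <Annotation Id="2" Type="Location" StartNode="18" EndNode="22"/>
  </AnnotationSet>
</GateDocument>"""

def annotated_spans(gate_xml):
    """Return (annotation type, covered text) pairs (hypothetical helper)."""
    root = ET.fromstring(gate_xml)
    text_el = root.find("TextWithNodes")
    # Rebuild the plain text and record the character offset of every marker.
    parts = [text_el.text or ""]
    offsets = {}
    for node in text_el:
        offsets[node.get("id")] = sum(len(p) for p in parts)
        parts.append(node.tail or "")
    text = "".join(parts)
    return [
        (a.get("Type"), text[offsets[a.get("StartNode")]:offsets[a.get("EndNode")]])
        for a in root.iter("Annotation")
    ]

if __name__ == "__main__":
    for kind, covered in annotated_spans(SAMPLE):
        print(kind, repr(covered))
```

The covered text is not the content of any single element, so the lookup needs state (running offsets) that a purely XPath-based mapping cannot express easily.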

AutomationML industry automation models integration

Owner: Christoph (together with his colleague Irlán Grangel-González)

What: integrating multiple views on industry automation settings modelled using AutomationML, by converting from AutomationML to RDF, and further from RDF to a deductive database, where conflict resolution rules are applied (paper). Implemented using Krextor.

XML, JSON & Linked Data Challenge: One challenge so far is that in AutomationML certain elements and attributes can occur in many different contexts (i.e. parent elements).

Solutions

General Aspects of RDF/XML Interoperability

Owner: Quentin

What: At Wolters Kluwer (WK), we have developed a lingua franca (in the form of an ontology) to express descriptive and structural metadata about our content (see Standardizing Legal Content with OWL and RDF). The original goal of the lingua franca was to implement a generic semantic content integration channel that delivers content stored in local content management systems to the WK global publishing platform. As part of this semantic content integration channel, we developed a validation and conversion tool that leverages both XHTML and RDF to express information about the content to be merged and transformed into the XML format supported by the publishing platform.

As RDF includes the RDF/XML serialization, we first attempted to use XSLTs to handle the conversion. However, a set of RDF triples expresses a directed, labeled graph that can be rendered in XML in different ways. For instance, a resource can be represented either by the <rdf:Description> element or by an element named after its type (e.g. skos:Concept). This is because the RDF/XML syntax is schemaless. As such, transformation between RDF and XML using XSLTs is not optimal, and XSLTs need to be tailored to different flavors of RDF/XML (which is costly). One solution that we have used is to convert the RDF input (in any supported syntax) into a canonical XML form prior to applying XSLTs. The approach is supplemented by a library of Java functions that can be called from XSLTs to process the graph. This approach has enabled us to migrate data across systems in several projects. However, we have seen issues with transformation performance as well as an increase in complexity. In other words, the approach is difficult to maintain as transformation requirements change.
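
The variability described above can be illustrated with a small normalization step. This Python sketch is not the WK tool; it only shows one possible canonicalization, under the assumption that typed node elements occur at the top level of rdf:RDF only. It rewrites each of them as rdf:Description with an explicit rdf:type property. The function names and the sample data are illustrative.

```python
import xml.etree.ElementTree as ET

RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"

def split_tag(tag):
    """Split an ElementTree tag of the form '{ns}local' into (ns, local)."""
    ns, _, local = tag[1:].partition("}")
    return ns, local

def normalize_typed_nodes(root):
    """Rewrite top-level typed nodes as rdf:Description plus rdf:type, in place."""
    for node in root:
        if node.tag == RDF + "Description":
            continue
        ns, local = split_tag(node.tag)
        # The element name of a typed node is the IRI of its class.
        type_el = ET.Element(RDF + "type", {RDF + "resource": ns + local})
        node.insert(0, type_el)
        node.tag = RDF + "Description"
    return root

SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="http://example.org/c1">
    <skos:prefLabel>Contract law</skos:prefLabel>
  </skos:Concept>
</rdf:RDF>"""

if __name__ == "__main__":
    root = normalize_typed_nodes(ET.fromstring(SAMPLE))
    print(ET.tostring(root, encoding="unicode"))
```

A complete canonical form would also have to handle nested node elements, property attributes and blank nodes, which this sketch leaves out.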

Based on our experience, we have observed that XML is great for representing content, whereas RDF is great for representing meta-information (e.g. title, date, etc.) about the content. The main aspect is to assign URIs to the abstract content and to its physical encoding (e.g. XML format). A few standards (e.g. RDFa) have already been defined to integrate meta-information in HTML.

Classification of Solutions

Owner: Felix and everybody

What: See https://lists.w3.org/Archives/Public/public-rax/2016Jul/0014.html

Solutions can be classified with regards to input and output.

  1. Going from RDF > XML (also called “lowering”)
  2. Going from XML > RDF (also called “lifting”)
  3. Doing round tripping (i.e. both (1) and (2))
  4. Embedding RDF in XML (already feasible with technologies like RDFa or JSON-LD)
  5. Embedding XML in RDF (e.g. rdf:XMLLiteral; works well when the RDF is serialized in XML, e.g., using RDF/XML)
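
The first three categories can be made concrete with a deliberately tiny Python sketch. It assumes a flat record structure and a hypothetical base IRI (http://example.org/); real lifting and lowering must deal with nesting, datatypes and ordering, which are ignored here. All names are illustrative.

```python
import xml.etree.ElementTree as ET

BASE = "http://example.org/"  # hypothetical base IRI used to mint subjects and predicates

def lift(xml_text):
    """XML -> RDF ("lifting"): one triple per child element of each record."""
    triples = []
    for rec in ET.fromstring(xml_text):
        subject = BASE + rec.get("id")
        for field in rec:
            triples.append((subject, BASE + field.tag, field.text))
    return triples

def lower(triples, root_tag="records", record_tag="record"):
    """RDF -> XML ("lowering"): group triples by subject, one record per subject."""
    root = ET.Element(root_tag)
    records = {}
    for s, p, o in triples:
        rec = records.get(s)
        if rec is None:
            rec = records[s] = ET.SubElement(root, record_tag, id=s[len(BASE):])
        ET.SubElement(rec, p[len(BASE):]).text = o
    return ET.tostring(root, encoding="unicode")

if __name__ == "__main__":
    src = '<records><record id="r1"><title>Faust</title><lang>de</lang></record></records>'
    triples = lift(src)
    print(triples)
    print(lower(triples))  # round trip works only for this flat shape
```

The round trip succeeds here only because the element names and the triple order are preserved; RDF itself does not guarantee either, which is one reason round tripping is listed as a category of its own.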

Best Practices

Conversion from XML to RDF on the Schema Level, without Round Tripping

Owner: Timea

What: See blog post: https://ldl4.com/2016/09/26/what-ive-learned-while-triplifying-a-real-dictionary/

Example:

Expected role of RAX Group:

Conversion from XML to RDF on the Schema Level, without Round Tripping

Owner: Jean-Pierre Evain

What: See https://lists.w3.org/Archives/Public/public-rax/2016Jul/0015.html

Example: Sport metadata and audiovisual content in its semantic context

Sport metadata is delivered in live data streams that evolve with time (e.g. results). Different providers use different data formats such as CSV, XML or e.g. PDF. In the case of XML, different schemas are being used.

Semantic technologies allow linking resources and associated metadata to contextual sport data (events, locations, athletes, results, etc.) and also to external linked resources. The first task consists of converting ingested data into RDF. The most convenient approach is to convert the data to XML (if it is not natively XML) and then to RDF.

In sport applications, a sport ontology is being used, which is not a structural conversion from XSD to RDF. An ontology is a model derived from the XML data (and associated schema). XML to RDF is therefore the transformation of XML instances into RDF individuals.

Equally, linked data is being mapped to the ontology.

RDF to XML is not required. Instead, the results of [SPARQL](https://www.w3.org/TR/sparql11-query/) queries are exported as Java beans or JSON depending on the application and development platforms.

Expected role of RAX Group:

RDF/XML "plain" profile suitable for XSLT transformations

Owner: Martynas

What: defining a constrained RDF/XML profile that is better suited for XSLT transformations (with RDF/XML as the input format), e.g. for rendering as XHTML.

Our use case is using XSLT 2.0 to transform RDF/XML mainly to XHTML, but possibly also other XML formats. RDF/XML has a reputation as being unsuitable for XSLT transformations because of a high degree of variability: multiple ways to specify rdf:type, multiple ways to insert blank nodes etc. We work around this problem by using the RDFFormat.RDFXML_PLAIN writer from Apache Jena that limits the variability by using a set of conventions:

  • properties are grouped into resource descriptions: there should only be one rdf:Description per resource
  • rdf:type is inlined as a simple property (is not used instead of rdf:Description)
  • blank nodes are top-level resources like all others and always carry rdf:nodeID attribute

Using this approach has multiple advantages. The RDF/XML structure is much more predictable which simplifies the XSLT transformations (match patterns, node selections, lookups etc.). No extension functions are necessary, resource lookups can be achieved using the built-in XSLT key() function, e.g. key('resources', @rdf:resource) where the key is defined as <xsl:key name="resources" match="*[*][@rdf:about] | *[*][@rdf:nodeID]" use="@rdf:about | @rdf:nodeID"/>. It is not quite so feasible with XSLT 1.0 however.
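
For readers less familiar with XSLT keys, the same lookup can be sketched in Python. This is only an analogy to the xsl:key definition quoted above, assuming input that already follows the "plain" conventions (one top-level description per resource); the function names and the sample data are illustrative.

```python
import xml.etree.ElementTree as ET

RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
FOAF = "{http://xmlns.com/foaf/0.1/}"

def build_resource_index(root):
    """Index described resources by rdf:about or rdf:nodeID, like the xsl:key."""
    index = {}
    for desc in root:
        key = desc.get(RDF + "about") or desc.get(RDF + "nodeID")
        if key is not None and len(desc) > 0:  # mirrors *[*]: at least one child
            index[key] = desc
    return index

def resolve(index, prop):
    """Follow a property's rdf:resource (or rdf:nodeID) to its description."""
    key = prop.get(RDF + "resource") or prop.get(RDF + "nodeID")
    return index.get(key)

SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <rdf:Description rdf:about="http://example.org/alice">
    <foaf:name>Alice</foaf:name>
    <foaf:knows rdf:resource="http://example.org/bob"/>
  </rdf:Description>
  <rdf:Description rdf:about="http://example.org/bob">
    <foaf:name>Bob</foaf:name>
  </rdf:Description>
</rdf:RDF>"""

if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    index = build_resource_index(root)
    knows = root[0].find(FOAF + "knows")
    print(resolve(index, knows).find(FOAF + "name").text)
```

Because every resource has exactly one description, a single dictionary lookup is enough; with unconstrained RDF/XML the same resource could be described in several places and nesting levels.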

Example XSLT: https://github.com/AtomGraph/Web-Client/tree/master/src/main/webapp/static/com/atomgraph/client/xsl/bootstrap/2.3.2

Expected role of RAX Group: we think the Group could standardize this profile, possibly under the name of "RDF/XML plain" following the Jena convention. Pre-processing stylesheets could be provided by the Group that transform ad hoc RDF/XML into RDF/XML plain.

Repository of GRDDL/XSLT converters

Owner: Martynas

What: collecting and reusing GRDDL/XSLT stylesheets that "lift" common XML formats to RDF (with RDF/XML as the output format)

GRDDL is an underused XSLT-based specification for extraction of RDF from XML. Basically, it specifies the use of XSLT stylesheets for transformation of XML formats into RDF/XML. GRDDL/XSLT is a much more reusable and platform-independent way of "lifting" XML to RDF compared to converters implemented in imperative languages. There are a number of well-defined XML formats with stable schemas (such as DocBook, DITA, CML etc.) as well as API response formats (e.g. Twitter), for which it would be possible to implement "standard" GRDDL converters, or at least "base" converters that are easily extended or customized. Unfortunately, such converters do not exist or are hard to find, which leads developers to write their own or look for other approaches/languages, resulting in duplicated effort, wasted resources, and slow uptake of RDF.

There is a collection of GRDDL stylesheets by OpenLink that may form a part of this effort: https://github.com/openlink/Virtuoso-RDFIzer-Mapper-Scripts/tree/master/xslt

Expected role of RAX Group: The situation could be improved by creating a GitHub repository (preferably under the W3C account) where group members and other users could contribute GRDDL stylesheets (under a permissive license) for consuming standard XML formats and producing RDF/XML. RDF/XML "plain" profile would be well-suited for use in these stylesheets. Some guidelines should also be drawn, for example: no use of extension functions unless absolutely necessary.

Roundtripping: Converting XML to RDF and back again

Owner: Felix et al.

What: See http://archive.xmlprague.cz/2016/files/xmlprague-2016-proceedings.pdf#page=133

By roundtripping we mean that XML is converted to RDF, some processing is done in RDF, and the output is embedded into the original XML again. The aforementioned paper describes several approaches to realize roundtripping:

  1. Embed Linked Data into XML via Structured Markup, e.g. RDFa
  2. Anchor Linked Data in XML Attributes
  3. Embed Linked Data in Metadata Sections of XML Files
  4. Anchor Linked Data via Annotations in XML Content

See the paper for details. The approaches do not need an extension of the RDF or XML technology stack, but tooling. A demo implementation is available at http://api-dev.freme-project.eu/doc/freme-showcase/xml-to-rdf.html

Expected role of RAX Group:

Existing Tools

XML > RDF Conversion Tooling: Krextor (XSLT-based)

Owner: Christoph

What: Krextor (homepage, old but still helpful paper) is a library of high-level XSLT templates and functions for XML→RDF conversion. Krextor enables the specification of mappings from XML-based formats to RDF at levels ranging from a declarative “schema to ontology” mapping (for many practical situations) to low-level XSLT (for full power). Krextor is not schema-aware; the mapping author is expected to know the schema of the XML input and the ontology of the RDF output and has to write the mapping manually.

Advantages over from-scratch one-off XSLT implementations include:

  • Krextor employs a high-level abstraction of RDF. Instead of just generating RDF/XML output (which is what most from-scratch one-off XSLTs for XML→RDF do), its basic actions are creating resources and adding properties to them. The rules for generating URIs are specified independently from the rules for mapping XML elements/attributes to RDF vocabulary terms.
  • Templates for many common tasks (e.g. generating URIs from ID/name attributes) are part of Krextor's library.
  • convenient Java and command line interfaces

Krextor's most serious shortcomings are:

  • It is hard to specify your own XML→RDF extraction rules without a strong XSLT background. This is because when things go wrong you will receive low-level error messages from the XSLT processor.
  • Krextor is not optimized for performance (but for expressiveness of mappings and ease of implementing new mappings) – so maybe it could be employed for rapid prototyping even in performance-critical settings.
  • Krextor has only been tested with the Saxon XSLT processor, as Krextor's main developer (Christoph) is not aware of any other free processor for XSLT ≥ 2.0. Also, as the current free version of Saxon does not support full XSLT 3.0, Krextor is still limited to XSLT 2.0.

Example: see https://github.com/EIS-Bonn/krextor/wiki/Documentation

Expected role of RAX Group:

  • generating feedback on common XML→RDF mapping primitives that Krextor does not yet support
  • coming up with XML schemas, instance documents and RDF output vocabularies that serve as interesting use cases because the respective XML→RDF translation involves structures and patterns not yet (well) supported by Krextor
    • Christoph will be able to comment on these and possibly advise on how to implement them but will most likely not have the resources to actually implement them.
  • plus, to the extent possible:

Extensions of the RDF and / or XML Technology Stack

RDF Conversion through Shape Expressions

Owner: Jose

What: Shape Expressions (ShEx) can be used to validate and transform RDF. Although the primary goal of ShEx is to describe and validate RDF data, it can also be extended with semantic actions, which are similar to the semantic actions of parsing grammars. In that way, it can be used to transform the validated RDF into XML or other languages like JSON.

[1] describes ShEx and includes an example of how ShEx can be used to generate XML from RDF.

[1] Shape Expressions: An RDF validation and transformation language, Eric Prud'hommeaux, Jose Emilio Labra Gayo, Harold Solbrig, 10th International Conference on Semantic Systems, Sept. 2014, Leipzig, Germany, PDF paper, slides

Example:

Expected role of RAX Group:

XSPARQL

Owner: Markus

What: Language combines XQuery and SPARQL; suitable for round-tripping. See https://www.w3.org/Submission/xsparql-language-specification/

XSPARQL is specifically suited for queries over XML or RDF or both at the same time, rather than for transforming whole RDF graphs or whole XML documents (which requires more effort than, say, in XSLT, because of the absence of implicit recursion). (The preceding sentence reflects Christoph's view; feel free to improve and/or discuss)

Example: Simple example for lowering and lifting in the specification. A complex example for lifting here: https://github.com/dbpedia/Cmdi-DataID-mappings/blob/master/CmdProfiles/OLAC-DcmiTerms/OLAC-DCTERMS-query.xs

Compiling is done with a fork of the original XSPARQL library https://github.com/AKSW/xsparql (cleaning up some shortcomings, needs further testing...)

Expected role of RAX Group:

XML formats for RDF datasets (quads)

Owner: Martynas

What: currently there is a gap in RDF standardization: a standard XML format for RDF datasets (with named graphs, aka quads) is lacking. RDF/XML only covers RDF graphs, not datasets. TriX supports datasets, but is not a standard.

Named graph support for RDF/XML was abandoned by the RDF 1.1 WG.

TriX is well-defined with a simple XML schema, although there has been some confusion as there is an alternative (older?) schema version by Nokia, which defines the root element as <TriX> vs. <trix> by W3C.

TriX is not an adequate alternative for RDF/XML+quads however, as it is a raw "dump" of triples in XML and does not follow the parent/child structure normally found in XML formats and supported by XML processing tools such as XSLT. RDF/XML follows the parent/child convention (to the extent possible when graphs are folded into trees). Another way to look at this: RDF/XML is the XML alternative of Turtle, TriX is the XML alternative of N-Triples, while the XML alternative of TriG is missing.

It should be possible to extend RDF/XML syntax with quads. Some approaches suggested here: https://lists.w3.org/Archives/Public/semantic-web/2016Jun/0022.html

This is a TriG-like example:

<rdfx:Graph rdfx:name="https://www.w3.org/People/Berners-Lee/card">
  <rdf:Description rdf:about="https://www.w3.org/People/Berners-Lee/card#i">
     <foaf:givenName>Tim</foaf:givenName>
     <foaf:familyName>Berners-Lee</foaf:familyName>
  </rdf:Description>
</rdfx:Graph>
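
To show that such an extension would remain easy to process, the following Python sketch reads a document of this shape into quads. It is an illustration only: the rdfx namespace IRI is a hypothetical placeholder (the example above does not declare one), the namespace declarations in SAMPLE are added so that it parses, and only literal-valued properties are handled.

```python
import xml.etree.ElementTree as ET

RDF = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}"
RDFX = "{http://example.org/rdfx#}"  # hypothetical namespace, not defined by any spec

SAMPLE = """<rdfx:Graph xmlns:rdfx="http://example.org/rdfx#"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/"
    rdfx:name="https://www.w3.org/People/Berners-Lee/card">
  <rdf:Description rdf:about="https://www.w3.org/People/Berners-Lee/card#i">
     <foaf:givenName>Tim</foaf:givenName>
     <foaf:familyName>Berners-Lee</foaf:familyName>
  </rdf:Description>
</rdfx:Graph>"""

def quads(xml_text):
    """Return (subject, predicate, literal, graph name) tuples (hypothetical helper)."""
    out = []
    root = ET.fromstring(xml_text)
    for graph in root.iter(RDFX + "Graph"):
        name = graph.get(RDFX + "name")
        for desc in graph.findall(RDF + "Description"):
            subject = desc.get(RDF + "about")
            for prop in desc:
                # '{ns}local' -> 'nslocal', i.e. the full predicate IRI
                predicate = prop.tag[1:].replace("}", "", 1)
                out.append((subject, predicate, prop.text, name))
    return out

if __name__ == "__main__":
    for q in quads(SAMPLE):
        print(q)
```

The parent/child structure is preserved: each graph wraps ordinary RDF/XML descriptions, so existing RDF/XML handling code could be reused inside the graph element.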

Expected role of RAX Group:

  • define and standardize RDF/XML for RDF datasets, either as a backwards-compatible extension or a "new" format
  • standardize TriX based on the W3C schema

Create XML Documents from RDF Data

Owner: Bernát

What: See https://lists.w3.org/Archives/Public/public-rax/2016Jul/0023.html

Example:

Expected role of RAX Group:

XML that does not let RDF introduce Namespaces

Owner: Liam

What: See https://lists.w3.org/Archives/Public/public-rax/2016Jul/0018.html

Example:

Expected role of RAX Group:

STTL for RDF to XML

STTL (paper, specification) is an approach that combines SPARQL and XML templates. It is implemented in the Corese library.

Old Material (Please do not edit): Collection of known solutions & ideas

If you want to use material from below please integrate it where it fits in the above structure.

Other ideas:

  1. Maestro edition of TopBraid Composer
  2. MarkLogic