RDFa 1.1 Distiller and Parser

Warning: This version implements RDFa 1.1 Core, including the handling of the Role Attribute. The distiller can also run in XHTML+RDFa 1.0 mode (if the incoming XHTML content uses the RDFa 1.0 DTD and/or sets the version attribute). The package is available for download, although it may be slightly out of sync with the code running this service.

Distill by URI
 

Distill by File Upload
 

Distill by direct input

If you intend to use this service regularly on a large scale, consider downloading the package and using it locally. Storing a (conceptually) “cached” version of the generated RDF, instead of referring to the live service, is another way to avoid overloading this server.
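The “cached” approach mentioned above could look roughly like the following minimal sketch. The function names, the cache layout (one file per source URI, named by a hash), and the `.ttl` extension are assumptions made for illustration; they are not part of the service or of pyRdfa.

```python
import hashlib
from pathlib import Path
from urllib.parse import urlencode
from urllib.request import urlopen

DISTILLER = "http://www.w3.org/2012/pyRdfa/extract"


def fetch_from_service(uri):
    """Ask the live distiller for the default (Turtle) output of `uri`."""
    with urlopen(DISTILLER + "?" + urlencode({"uri": uri})) as response:
        return response.read().decode("utf-8")


def cached_distill(uri, cache_dir, fetch=fetch_from_service):
    """Return the distilled RDF for `uri`, calling the service at most once.

    Hypothetical helper: the result is stored in `cache_dir` and reused on
    later calls, so the live service is not contacted again.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    name = hashlib.sha256(uri.encode("utf-8")).hexdigest() + ".ttl"
    path = cache_dir / name
    if path.exists():
        return path.read_text(encoding="utf-8")
    turtle = fetch(uri)
    path.write_text(turtle, encoding="utf-8")
    return turtle
```

The `fetch` parameter is there only so that the network call can be replaced, for example in tests; this sketch never expires its cached files.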

What is it?

RDFa 1.1 is a specification for attributes to be used with XML languages or with HTML5 to express structured data. The rendered, hypertext data of XML or HTML is reused by the RDFa markup, so that publishers don’t need to repeat significant data in the document content. The underlying abstract representation is RDF, which lets publishers build their own vocabulary, extend others, and evolve their vocabulary with maximal interoperability over time.

pyRdfa is a distiller that extracts RDF triples from an XML or HTML5 file annotated with RDFa and serializes them in one of several RDF serialization formats. It can be used either directly from the command line or via a CGI service. It corresponds to the RDFa 1.1 Core, XHTML+RDFa, and HTML+RDFa specifications, as well as to the SVG Tiny 1.2 Recommendation for the SVG version.

The forms above can be used to start the service installed at this site. To learn more about RDFa, please consult the RDFa 1.1 Core document. See below for ways to download the package.

As installed, this service is a server-side implementation of RDFa. This also means that pages that generate their (X)HTML content dynamically (e.g., using AJAX) will not be properly processed by this distiller.

Distiller options

Output format (option: format; values: turtle, xml, json, nt; default: turtle)
The default output format is Turtle. Alternative formats are RDF/XML, JSON-LD, and N-Triples.
Warnings for non RDFa 1.1 Lite usage (option: rdfa_lite; values: true, false; default: false)
If set to true, a warning is issued whenever RDFa 1.1 Core attributes that are not part of the RDFa 1.1 Lite specification are used. The separate rdfagraph option should be used to make these warnings visible.
Host language (option: host_language; values: xhtml, html, svg, atom, xml; default: html)
For RDFa files downloaded via a URI, the host language is determined based on the content type (see below for further details). When the content is uploaded or input directly, the host language can be set explicitly.
Returned content (option: rdfagraph; values: output, processor, processor,output; default: output)
By default, the generated triples are returned without warning or error triples. If processor is included in the value, the triples of the processor graph (warnings and errors) are returned, too. See the RDFa 1.1 Core document for further details.
Perform vocabulary expansion (option: vocab_expansion; values: true, false; default: false)
RDFa 1.1 provides the possibility to “expand” the vocabulary provided by the vocab attribute, i.e., to retrieve the corresponding RDF file and follow the possible subclass and subproperty relationships. See the RDFa 1.1 Core document for further details.
Include embedded Turtle or RDF/XML in the output (option: embedded_rdf; values: true, false; default: true)
The Turtle specification provides a syntax to add Turtle content to any HTML file via the <script> element. Alternatively, some XML applications (e.g., SVG) allow adding RDF/XML content to their regular content. The distiller may extract and add these graphs to the output graph. Note that this graph is completely independent from the rest of the RDFa content, e.g., prefix declarations in RDFa are not valid for the embedded Turtle.
Whitespace preservation in literals (option: space_preserve; values: true, false; default: true)
The RDFa syntax specifies that whitespace characters in the original XHTML must be preserved in the literal output. Setting this option to false instructs the distiller to “normalize” the whitespace instead. The default is not to normalize.
Use caching for vocabulary expansion (option: vocab_cache; values: true, false; default: true)
In case vocabulary expansion is set, a built-in caching mechanism is used to store the vocabulary information locally to the processor.
Report on vocabulary caching (option: vocab_cache_report; values: true, false; default: false)
Additional informational triples are generated on vocabulary caching. These triples are added to the “processor” graph, and the option automatically sets the rdfagraph option to processor or processor,output (depending on the original setting of rdfagraph).
Bypass date checks on vocabularies (option: vocab_cache_refresh; values: true, false; default: false)
By default, vocabulary caches are set to be valid until the date value returned in the “Expires” HTTP response header or, if not given, the cache is set to be valid for a day. If this option is set, the cache date is not checked and the cache is re-generated every time a vocab URI is met. This option may be useful when vocabularies are developed and tested.
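The cache validity rule described for vocab_cache_refresh can be sketched as follows. This is only an illustration of the rule as stated above, not the code pyRdfa actually uses; the function names and the treatment of an unparsable “Expires” value (falling back to one day) are assumptions.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

ONE_DAY = timedelta(days=1)


def cache_valid_until(fetched_at, expires_header=None):
    """Return the moment until which a cached vocabulary counts as valid.

    Uses the "Expires" header value if it is present and parsable;
    otherwise the cache is valid for one day after it was fetched.
    """
    if expires_header:
        try:
            expires = parsedate_to_datetime(expires_header)
        except (TypeError, ValueError):
            expires = None  # assumption: an unparsable header is ignored
        if expires is not None:
            if expires.tzinfo is None:
                expires = expires.replace(tzinfo=timezone.utc)
            return expires
    return fetched_at + ONE_DAY


def must_refetch(now, fetched_at, expires_header=None, vocab_cache_refresh=False):
    """Decide whether the vocabulary has to be retrieved again."""
    if vocab_cache_refresh:
        # The date check is bypassed: regenerate the cache every time.
        return True
    return now >= cache_valid_until(fetched_at, expires_header)
```

All datetime arguments are expected to be timezone-aware, for example `datetime.now(timezone.utc)`.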

Determination of host language type

When the RDFa resource is accessed through HTTP, the host language is determined from the content type given in the HTTP response header, as follows:

text/html:
HTML5+RDFa.
application/xhtml+xml and the file uses the right XHTML+RDFa DTD:
XHTML+RDFa. Note that, depending on the DTD, the distiller uses an RDFa 1.0 or an RDFa 1.1 processing mode.
application/xhtml+xml and the file does not use any DTD (or uses an unknown one):
XHTML5.
application/svg+xml:
SVG Tiny 1.2.
Note that there is no explicit SVG+RDFa specification (in terms of an RDFa 1.1 “Host Language”), i.e., an SVG file is treated as XML+RDFa with the additional feature that RDF/XML added directly to the SVG document via a metadata element is also extracted and added to the output.
application/atom+xml:
Atom (this host language implements the draft specification defined by Toby Inkster).
all other cases:
XML+RDFa
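The decision table above can be written down as a small function. This is a sketch of the mapping as described here, not the distiller's own code: the function name, the returned labels, and the way the DTD is detected (by comparing the DOCTYPE public identifier) are assumptions, and the version attribute mentioned in the warning at the top of this page is not taken into account.

```python
# Public identifiers of the XHTML+RDFa DTDs, mapped to the RDFa processing mode.
XHTML_RDFA_DTDS = {
    "-//W3C//DTD XHTML+RDFa 1.0//EN": "1.0",
    "-//W3C//DTD XHTML+RDFa 1.1//EN": "1.1",
}


def host_language(content_type, doctype_public_id=None):
    """Return (host language, RDFa processing mode) for an HTTP response.

    `content_type` is the Content-Type header value; parameters such as
    "; charset=utf-8" are ignored. `doctype_public_id` is the public
    identifier of the document's DOCTYPE, or None if there is no DTD.
    """
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type == "text/html":
        return ("HTML5+RDFa", "1.1")
    if media_type == "application/xhtml+xml":
        version = XHTML_RDFA_DTDS.get(doctype_public_id)
        if version is not None:
            return ("XHTML+RDFa", version)
        # No DTD, or an unknown one.
        return ("XHTML5", "1.1")
    if media_type == "application/svg+xml":
        return ("SVG Tiny 1.2", "1.1")
    if media_type == "application/atom+xml":
        return ("Atom", "1.1")
    return ("XML+RDFa", "1.1")
```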

Alternative access to the Distiller

If you use Firefox, Safari, Chrome, or Opera, you can also drag the following bookmarklets to your browser bar and use them to distill the current page: “RDFa it (Turtle)!”, “RDFa it (RDF/XML)!”, “RDFa it (N triples)!”.

When using the distiller URI directly, options left at their default values can be omitted. Some examples:

Extract the RDF from http://www.example.com/rdfa.html, with whitespace preservation and without warnings, serialized in Turtle:
http://www.w3.org/2012/pyRdfa/extract?uri=http://www.example.com/rdfa.html
Extract the RDF from http://www.example.com/rdfa.html, with whitespace preservation and without warnings, serialized in RDF/XML:
http://www.w3.org/2012/pyRdfa/extract?format=xml&uri=http://www.example.com/rdfa.html
Extract the RDF from http://www.example.com/rdfa.html, with whitespace preservation and including warnings, serialized in Turtle:
http://www.w3.org/2012/pyRdfa/extract?rdfagraph=processor,output&uri=http://www.example.com/rdfa.html
Use a fixed, pseudo URI to extract the RDF from the current page without specifying its URI (with default options); this can be used, say, as a link for a button on the page:
http://www.w3.org/2012/pyRdfa/extract?uri=referer
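Request URIs of this form can also be assembled programmatically. The following is a minimal sketch: the option names and defaults are copied from the “Distiller options” section, while the helper name `distiller_url` and the choice to leave `:`, `/`, and `,` unescaped (to match the examples above) are assumptions.

```python
from urllib.parse import urlencode

SERVICE = "http://www.w3.org/2012/pyRdfa/extract"

# Option names and default values as listed under "Distiller options".
DEFAULTS = {
    "format": "turtle",
    "rdfa_lite": "false",
    "host_language": "html",
    "rdfagraph": "output",
    "vocab_expansion": "false",
    "embedded_rdf": "true",
    "space_preserve": "true",
    "vocab_cache": "true",
    "vocab_cache_report": "false",
    "vocab_cache_refresh": "false",
}


def distiller_url(uri, **options):
    """Build a distiller request URI, omitting options left at their default."""
    unknown = set(options) - set(DEFAULTS)
    if unknown:
        raise ValueError("unknown option(s): " + ", ".join(sorted(unknown)))
    params = [(name, value) for name, value in options.items()
              if value != DEFAULTS[name]]
    params.append(("uri", uri))
    # Keep ':', '/' and ',' readable; other reserved characters are escaped.
    return SERVICE + "?" + urlencode(params, safe=":/,")
```

For instance, `distiller_url("http://www.example.com/rdfa.html", format="xml")` yields the RDF/XML example URI listed above.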

Error reporting

The distiller adds error, warning, or information triples to the processor graph. Some of those are defined by the RDFa Core document; some additional messages are generated by the distiller. The latter category includes, e.g., HTTP 404 errors; these are reported using the same error structure as the ones defined by the standard.

Distribution

The underlying package, pyRdfa, is implemented in Python and is available for download from GitHub. The package is based on the standard Python 2.x.y distribution, where 'x' should be 5 or higher. (It has been tested on version 2.7.2, which is the highest, and probably the last, stable release in Python 2.x; if possible, use that one.) The module does not (yet) run on the Python 3.x family. The documentation of the package can be consulted online (and is also part of the distribution).

The core package relies on the RDFLib package. It has been tested with RDFLib 3.1.0, but it also runs with the RDFLib 2.x versions. RDFLib 3.x is preferred: the serialization modules are superior in quality. (Note, however, that the JSON serialization does not run on RDFLib 2.x versions!) The Python HTML5 parser is used to process HTML5. The general package also relies on a slightly modified version of Deron Meranda’s httpheader module. Finally, for reasons that I do not really understand, in some cases the RDFLib distribution generates an import error on a module called isodate, which then has to be installed manually. (The HTML5 parser, the httpheader, and the isodate modules are included in the distribution to make installation easier.)

For the JSON-LD serialization, two more external packages are used: Armin Ronacher’s Ordered Dictionary (odict) package, as well as Bob Ippolito’s simplejson package. odict is needed unless Python 2.7.x is used (an ordered dictionary module has been added to the standard distribution of Python 2.7.x); simplejson is needed for Python 2.5 (json has been added to the standard Python 2.6.x distribution).
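The version-dependent choice between the standard modules and the external packages is commonly expressed with guarded imports, as in the sketch below. This only illustrates the pattern; it is not copied from pyRdfa, and it assumes that the odict module exposes a class named `odict`.

```python
# Prefer the standard library; fall back to the external packages on
# older Python versions that lack these modules.
try:
    from collections import OrderedDict  # standard since Python 2.7
except ImportError:
    # Assumption: the external odict module provides a class called `odict`.
    from odict import odict as OrderedDict

try:
    import json  # standard since Python 2.6
except ImportError:
    import simplejson as json  # needed on Python 2.5


def to_json(pairs):
    """Serialize (key, value) pairs to JSON, keeping the given key order."""
    return json.dumps(OrderedDict(pairs))
```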

To install the package, download the distribution file from GitHub and either move the pyRdfa directory to your PYTHONPATH or modify your PYTHONPATH to include that directory. Alternatively, you can use the standard 'setup.py' script. The odict and httpheader modules (each consisting of a single Python file) have been added to the pyRdfa package under ‘extras’; you do not have to do anything special to install these. The HTML5 parser must be installed independently; to make this step easier, the compressed tar file has been added to the pyRdfa distribution file. The same is true for the simplejson package although, if you run Python 2.6.x or higher, that module can be ignored.


Ivan Herman, (ivan@w3.org)
Last revised: 2013-01-27 16:48:17

This software is available for use under the W3C® SOFTWARE NOTICE AND LICENSE
