RDFa 1.1 Distiller and Parser

Warning: This version accompanies the development of RDFa 1.1 Core. As that document is not final yet, this service, and the underlying code, will change frequently until the development of RDFa 1.1 is finalized. The implementation may actually run ahead of the “official” version and implement the version already in the editors’ draft. Also, the package available for download may be out of sync with the code running this service.

 

Distill by File Upload
 

Distill by direct input

If you intend to use this service regularly on a large scale, consider downloading the package and using it locally. Storing a (conceptually) “cached” version of the generated RDF, instead of referring to the live service, is another way to avoid overloading this server.
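
The “cached” alternative mentioned above could look roughly like the following sketch. It is only an illustration, not part of pyRdfa: the function names, the cache directory, and the hashed file naming are hypothetical choices, and the code assumes Python 3 and the service URI given further down this page.

```python
import hashlib
from pathlib import Path
from urllib.parse import urlencode
from urllib.request import urlopen

DISTILLER = "http://www.w3.org/2012/pyRdfa/extract"  # service URI from this page


def fetch_from_service(uri):
    """Ask the live distiller for the (default) Turtle serialization of `uri`."""
    with urlopen(DISTILLER + "?" + urlencode({"uri": uri})) as response:
        return response.read().decode("utf-8")


def cached_distill(uri, cache_dir, fetch=fetch_from_service):
    """Return the distilled RDF for `uri`, calling the service only on a cache miss.

    `cache_dir` and the SHA-256 based file name are hypothetical choices.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    digest = hashlib.sha256(uri.encode("utf-8")).hexdigest()
    cache_file = cache_dir / (digest + ".ttl")
    if cache_file.exists():
        # Cache hit: the live service is not contacted at all.
        return cache_file.read_text(encoding="utf-8")
    turtle = fetch(uri)
    cache_file.write_text(turtle, encoding="utf-8")
    return turtle
```

The sketch never invalidates its cache; a real deployment would probably want to delete or refresh the stored files when the source page changes.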

What is it?

RDFa is a specification for attributes to be used with XML languages or with HTML5 to express structured data. The RDFa markup reuses the rendered hypertext data of XML or HTML, so that publishers don’t need to repeat significant data in the document content. The underlying abstract representation is RDF, which lets publishers build their own vocabulary, extend others, and evolve their vocabulary with maximal interoperability over time. pyRdfa is a distiller that generates the RDF triples, in various RDF serialization formats, from an XML or HTML5 file annotated with RDFa. It can be used either directly from the command line or via a CGI service. It corresponds to the RDFa 1.1 Core, XHTML+RDFa, and HTML+RDFa specifications, as well as to the SVG Tiny 1.2 Recommendation for the SVG version. The forms above can be used to start the service installed at this site. To learn more about RDFa, please consult the RDFa 1.1 Core document. See also below for the options for downloading the package.

As installed in this service, pyRdfa is a server-side implementation of RDFa. This means that pages that generate their (X)HTML content dynamically (e.g., using AJAX) will not be properly processed by this distiller.

Distiller options

Output format (option: format; values: turtle, xml, json, nt; default: turtle)
The default output format is Turtle. Alternative formats are RDF/XML, JSON-LD, and N-triples.
Warnings for non RDFa 1.1 Lite usage (option: rdfa-lite; values: true, false; default: false)
If set to true, a warning is issued whenever RDFa 1.1 Core attributes that are not part of the RDFa 1.1 Lite specification are used. The separate graph option should be used to make these warnings visible.
Host language (option: host-language; values: xhtml, html, svg, atom, xml; default: html)
For RDFa files downloaded via a URI, the host language is determined based on the content type (see below for further details). When the content is uploaded or input directly, the host language can be set explicitly.
Returned content (option: graph; values: output, processor, processor,output; default: output)
By default, the generated triples are returned without warning or error triples. If the processor value is set, then the warning and error triples are returned, too. See the RDFa 1.1 Core document for further details.
Perform vocabulary expansion (option: vocab-expansion; values: true, false; default: false)
RDFa 1.1 provides the possibility to “expand” the vocabulary provided by the vocab attribute, i.e., to retrieve the corresponding RDF file and follow the possible subclass and subproperty relationships. See the RDFa 1.1 Core document for further details.
Include embedded turtle in the output (option: embedded-turtle; values: true, false; default: true)
The Turtle specification provides a syntax to add Turtle content to any HTML (or SVG) file via the <script> element. The distiller may extract and add the graph serialized by this Turtle content to the output graph. Note that this graph is completely independent from the rest of the RDFa content, i.e., prefix declarations in RDFa are not valid for the embedded Turtle.
Whitespace preservation in literals (option: space-preserve; values: true, false; default: true)
The RDFa syntax specifies that whitespace characters in the original XHTML must be preserved in the literal output. Setting this option to false instructs the distiller to “normalize” the whitespace instead. The default is not to normalize.
Use caching for vocabulary expansion (option: vocab-cache; values: true, false; default: true)
If vocabulary expansion is enabled, a built-in caching mechanism is used to store the vocabulary information locally to the processor.
Report on vocabulary caching (option: vocab-cache-report; values: true, false; default: false)
Additional informational triples are generated on vocabulary caching. These triples are added to the “processor” graph, and the option automatically sets the graph option to processor or processor,output (depending on the original setting of graph).
Bypass date checks on vocabularies (option: vocab-cache-refresh; values: true, false; default: false)
By default, a vocabulary cache is valid until the date returned in the “Expires” HTTP response header or, if that header is not given, for one day. If this option is set, the cache date is not checked and the cache is re-generated every time a vocab URI is encountered. This option may be useful when vocabularies are being developed and tested.
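
The cache validity rule described for vocab-cache-refresh can be sketched as follows. This is an illustrative reading of the text above, not pyRdfa’s actual code: the function names are hypothetical, the code assumes Python 3, and treating an unparsable “Expires” value like a missing one is an assumption.

```python
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

DEFAULT_LIFETIME = timedelta(days=1)  # fallback stated above: valid for a day


def cache_expiry(fetched_at, expires_header=None):
    """Return the moment at which a cached vocabulary stops being valid."""
    if expires_header:
        try:
            expiry = parsedate_to_datetime(expires_header)
        except (TypeError, ValueError):
            expiry = None  # assumption: an unparsable header counts as missing
        if expiry is not None:
            if expiry.tzinfo is None:
                # HTTP dates are in GMT; make naive results timezone-aware.
                expiry = expiry.replace(tzinfo=timezone.utc)
            return expiry
    return fetched_at + DEFAULT_LIFETIME


def must_refetch(now, fetched_at, expires_header=None, vocab_cache_refresh=False):
    """Decide whether the vocabulary has to be retrieved again."""
    if vocab_cache_refresh:
        return True  # date checks are bypassed entirely
    return now >= cache_expiry(fetched_at, expires_header)
```

Both `now` and `fetched_at` are expected to be timezone-aware datetime values, so that they can be compared with the parsed header date.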

Determination of host language type

When the RDFa resource is accessed through HTTP, the host language is determined based on the content type of the return header as follows:

text/html:
HTML5+RDFa.
application/xhtml+xml:
XHTML+RDFa.
application/svg+xml:
SVG Tiny 1.2.
Note that there is no explicit SVG+RDFa specification (in terms of an RDFa 1.1 “Host Language”), i.e., an SVG file is treated as XML+RDFa, with the additional feature that RDF/XML added directly to the SVG document via the metadata element is also extracted and added to the output.
application/atom+xml:
Atom; this is an experimental feature (the RDFa WG may define an Atom+RDFa host language)
all other cases:
XML+RDFa
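
The mapping above can be summarized in a few lines of Python. This is a sketch rather than the distiller’s own code: the function name is hypothetical, the returned strings are the values of the host-language option listed earlier, and ignoring parameters such as “; charset=utf-8” is an assumption.

```python
# Media types named on this page, mapped to values of the host-language option.
HOST_LANGUAGE_BY_MEDIA_TYPE = {
    "text/html": "html",               # HTML5+RDFa
    "application/xhtml+xml": "xhtml",  # XHTML+RDFa
    "application/svg+xml": "svg",      # SVG Tiny 1.2
    "application/atom+xml": "atom",    # Atom (experimental)
}


def host_language(content_type):
    """Pick the host language from the value of a Content-Type header."""
    # Drop parameters like "; charset=utf-8" and normalize case (assumption).
    media_type = (content_type or "").split(";", 1)[0].strip().lower()
    # All other cases fall back to generic XML+RDFa.
    return HOST_LANGUAGE_BY_MEDIA_TYPE.get(media_type, "xml")
```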

Alternative access to the Distiller

If you use Firefox, Safari, Chrome, or Opera, you can also drag the following bookmarklets to your browser bar and use them to distill the current page: “RDFa it (Turtle)!”, “RDFa it (RDF/XML)!”, “RDFa it (N triples)!”.

When using the distiller URI directly, options that keep their default values can be omitted. Some examples:

Extract the RDF from http://www.example.com/rdfa.html, with whitespace preservation and without warnings, serialized in Turtle:
http://www.w3.org/2012/pyRdfa/extract?uri=http://www.example.com/rdfa.html
Extract the RDF from http://www.example.com/rdfa.html, with whitespace preservation and without warnings, serialized in RDF/XML:
http://www.w3.org/2012/pyRdfa/extract?format=xml&uri=http://www.example.com/rdfa.html
Extract the RDF from http://www.example.com/rdfa.html, with whitespace preservation and including warnings, serialized in Turtle:
http://www.w3.org/2012/pyRdfa/extract?graph=default,processor&uri=http://www.example.com/rdfa.html
Use a fixed, pseudo URI to extract the RDF from the current page without specifying its URI (with default options); this can be used, say, as a link for a button on the page:
http://www.w3.org/2012/pyRdfa/extract?uri=referer
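
Request URIs like the ones above can also be assembled programmatically. The following is a minimal sketch assuming Python 3; the helper name is hypothetical, and the table of defaults is copied from the “Distiller options” section of this page, not read from the service itself.

```python
from urllib.parse import urlencode

DISTILLER = "http://www.w3.org/2012/pyRdfa/extract"

# Option names and default values as documented on this page.
DEFAULTS = {
    "format": "turtle",
    "rdfa-lite": "false",
    "host-language": "html",
    "graph": "output",
    "vocab-expansion": "false",
    "embedded-turtle": "true",
    "space-preserve": "true",
    "vocab-cache": "true",
    "vocab-cache-report": "false",
    "vocab-cache-refresh": "false",
}


def distiller_url(uri, **options):
    """Build a distiller request URI, leaving out options left at their default.

    Keyword names use "_" where the option name has "-",
    e.g. vocab_expansion="true".
    """
    params = {}
    for name, value in options.items():
        key = name.replace("_", "-")
        if key not in DEFAULTS:
            raise ValueError("unknown distiller option: " + key)
        if value != DEFAULTS[key]:
            params[key] = value
    params["uri"] = uri
    # Keep ":", "/" and "," readable, as in the examples on this page;
    # "&", "?" and "#" inside the target URI are still percent-encoded.
    return DISTILLER + "?" + urlencode(params, safe=":/,")
```

For instance, distiller_url("http://www.example.com/rdfa.html", format="xml") yields the second example URI above, while passing format="turtle" adds nothing to the query because it is the default.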

Distribution

The underlying Python package, called pyRdfa, is available for download. The package is based on the standard Python 2.x distribution. (It has been tested on version 2.7.2, which is the highest, and probably the last, stable release in Python 2.x.) The module does not run on the Python 3.x family.

The core package relies on the RDFLib package. It has been tested with RDFLib 3.1.0, but it also runs with the RDFLib 2.x versions. RDFLib 3.x is preferred: its serialization modules are superior in quality. (Note, however, that the JSON serialization does not run on RDFLib 2.x versions!) The Python HTML5 parser is used to process HTML5. The general package also relies on a slightly modified version of Deron Meranda’s httpheader module. (Both the HTML5 Parser and httpheader are included in the distribution.)

For the JSON-LD serialization, two more external packages are used: Armin Ronacher’s Ordered Dictionary (odict) package, as well as Bob Ippolito’s simplejson package. odict is needed unless Python 2.7.x is used (an ordered dictionary module has been added to the standard distribution of Python 2.7.x); simplejson is needed for Python 2.5 or lower (json has been added to the standard Python 2.6.x distribution).

To install the package, download the distribution file (it is a compressed tar file) and either move the pyRdfa directory to your PYTHONPATH or modify your PYTHONPATH to include that directory. The odict and httpheader modules (each consisting of a single Python file) have been added to the pyRdfa package under ‘extras’; you do not have to do anything special to install these. The HTML5 parser must be installed independently; to make this step easier, its compressed tar file has been added to the pyRdfa distribution file. The same is true for the simplejson package although, if you run Python 2.6.x or higher, that module can be ignored.


Ivan Herman, (ivan@w3.org)
Last revised: $Date: 2012/01/26 14:09:18 $ (see in RDF)

This software is available for use under the W3C® SOFTWARE NOTICE AND LICENSE
