This service is now discontinued and the underlying software is no longer maintained. The software is publicly available should anyone be interested in re-establishing the service elsewhere.

HTML Structured Data Extractor to RDF (Closed)

This service is a common interface for extracting structured data from HTML files as RDF. The structured data can be expressed in microdata, RDFa, or Turtle embedded in HTML. The software relies on other services (see below for details); some of those may still be in development, so this service should not be considered final yet.

What is it?

Structured data in HTML can be expressed using microdata, RDFa, or Turtle embedded in HTML. While RDFa and Turtle are both RDF serialization syntaxes, microdata is not; it is simply a specification for attributes to be used with HTML5 to express structured data. A separate Semantic Web Interest Group Note defines a mapping from HTML5+Microdata to RDF.

There is a separate (and more complete) distiller for RDFa 1.1; that distiller can also extract embedded Turtle content. A separate microdata distiller is also available. The current service is simply an interface to both: it runs both services internally and merges the resulting RDF graphs. Note that this service handles HTML5 only; for the other Host Languages that RDFa 1.1 allows, the user should use the RDFa 1.1 distiller directly.
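The merging step can be illustrated with a minimal sketch. How the service merged its graphs internally is not documented here, so this only shows the general idea of an RDF graph merge: take the union of the triples while keeping blank nodes from different sources apart. The representation (triples as tuples of N-Triples-style strings), the function names, and the schema.org sample data are all assumptions made for illustration.

```python
def relabel(triples, prefix):
    """Prefix blank node labels so nodes from different graphs never collide."""
    def fix(term):
        # Blank nodes are written "_:label"; IRIs and literals are left alone.
        return "_:" + prefix + term[2:] if term.startswith("_:") else term
    return {(fix(s), p, fix(o)) for (s, p, o) in triples}


def merge_graphs(**graphs):
    """Union the triples of several named graphs (hypothetical helper)."""
    merged = set()
    for name, triples in graphs.items():
        merged |= relabel(triples, name + "_")
    return merged


# Hypothetical output of the two extractors for the same page.
rdfa = {("_:b0", "<http://schema.org/name>", '"Alice"')}
microdata = {
    ("_:b0", "<http://schema.org/name>", '"Bob"'),
    ("<http://example.com/a>", "<http://schema.org/name>", '"A"'),
}

merged = merge_graphs(rdfa=rdfa, microdata=microdata)
for triple in sorted(merged):
    print(" ".join(triple), ".")
```

Without the relabelling, the two unrelated `_:b0` nodes would be conflated into a single node carrying both names.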

As installed, this service is a server-side implementation of the conversion. This means that pages generating their (X)HTML content dynamically (e.g., using AJAX) will not be processed properly by this distiller.

Distiller options

Source (option: source; values: hturtle, rdfa, microdata)
Defines the source of the structured data. Default is none (when calling the service).
Output format (option: format; values: turtle, xml, json, nt; default: turtle)
The default output format is Turtle. Alternative formats are RDF/XML, JSON-LD, and N-triples.
Perform vocabulary expansion (option: vocab_expansion; values: true, false; default: false)
RDFa 1.1 provides the possibility to “expand” the vocabulary provided by the vocab attribute, i.e., to retrieve the corresponding RDF file and follow the possible subclass and subproperty relationships. See the RDFa 1.1 Core document for further details. (Note that there are discussions about adding this feature to the microdata conversion, too.)
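As a rough sketch of how a client might check these options before building a request: the function and variable names below are hypothetical, and the microdata source value is assumed to be `microdata`, as used in the request examples on this page.

```python
# Allowed values and defaults as listed in the option descriptions above.
ALLOWED = {
    "source": {"hturtle", "rdfa", "microdata"},
    "format": {"turtle", "xml", "json", "nt"},
    "vocab_expansion": {"true", "false"},
}
DEFAULTS = {"format": "turtle", "vocab_expansion": "false"}


def normalize_options(sources=(), fmt=None, vocab_expansion=None):
    """Validate option values and fill in the documented defaults."""
    options = {
        "source": list(sources),  # no default: may legitimately be empty
        "format": fmt if fmt is not None else DEFAULTS["format"],
        "vocab_expansion": (
            vocab_expansion
            if vocab_expansion is not None
            else DEFAULTS["vocab_expansion"]
        ),
    }
    for value in options["source"]:
        if value not in ALLOWED["source"]:
            raise ValueError(f"unknown source: {value!r}")
    for name in ("format", "vocab_expansion"):
        if options[name] not in ALLOWED[name]:
            raise ValueError(f"unknown {name}: {options[name]!r}")
    return options


print(normalize_options(["rdfa", "microdata"], fmt="xml"))
```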

Alternative access to the Distiller

When using the distiller URI directly, options left at their default values can be omitted. Some examples:

Extract the RDF from http://www.example.com/md.html, serialized in Turtle, and source in microdata only:
http://www.w3.org/2012/sde/extract?uri=http://www.example.com/md.html&source=microdata
Extract the RDF from http://www.example.com/md.html, serialized in RDF/XML, source in microdata, RDFa, and Turtle:
http://www.w3.org/2012/sde/extract?format=xml&uri=http://www.example.com/md.html&source=rdfa&source=microdata&source=hturtle
Use a fixed, pseudo URI to extract the RDF from the current page without specifying its URI (with default options); this can be used, say, as a link for a button on the page:
http://www.w3.org/2012/sde/extract?uri=referer&source=microdata
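The request URIs above can also be assembled programmatically. The following is a minimal sketch using only the Python standard library; the helper name `build_extract_url` is hypothetical, and since the service is discontinued the code only builds the URI and does not send a request. Unlike the hand-written examples, `urlencode` percent-encodes the target URI, which is the safer form for a query parameter.

```python
from urllib.parse import urlencode

# Endpoint taken from the examples above; the service itself is closed.
BASE = "http://www.w3.org/2012/sde/extract"


def build_extract_url(uri, sources, fmt=None, vocab_expansion=None):
    """Build a distiller request URI, omitting options left at their defaults."""
    params = []
    if fmt is not None:
        params.append(("format", fmt))
    params.append(("uri", uri))
    # The source option may be repeated, once per syntax to extract.
    for source in sources:
        params.append(("source", source))
    if vocab_expansion is not None:
        params.append(("vocab_expansion", "true" if vocab_expansion else "false"))
    return BASE + "?" + urlencode(params)


print(build_extract_url("http://www.example.com/md.html", ["microdata"]))
print(build_extract_url(
    "http://www.example.com/md.html",
    ["rdfa", "microdata", "hturtle"],
    fmt="xml",
))
```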


Ivan Herman, (ivan@w3.org)
Last revised: $Date: 2012/04/24 10:21:33 $ (see in RDF)

This software is available for use under the W3C® SOFTWARE NOTICE AND LICENSE
