Re: htmldata-ISSUE-1 (Microdata Vocabulary): Vocabulary specific parsing for Microdata from Jeni Tennison on 2011-10-21 (public-html-data-tf@w3.org from October 2011)

From: Jeni Tennison <jeni@jenitennison.com>
Date: Fri, 21 Oct 2011 15:02:40 +0100
To: HTML Data Task Force WG <public-html-data-tf@w3.org>
Cc: Gregg Kellogg <gregg@kellogg-assoc.com>
Message-Id: <C615DEBC-0F17-4441-8A7E-9564F97F2F28@jenitennison.com>

Gregg, all,

Getting a direction on this issue is clearly the most challenging aspect of the microdata/RDF mapping.

Let's remind ourselves of the use cases for wanting this mapping in the first place. To me, they are:

1. Semantic search engines such as Sindice use RDF as their backend data model. They want to gather information expressed using microdata alongside information expressed in RDF-based formats and make it available to others to use, as a service. In these cases, the ultimate consumer, who will need to understand the vocabularies used within the microdata, is the program or person who pulls out data from Sindice. Sindice needs to retain the distinctions in the original microdata (eg ordering of items) and might not have built-in knowledge about the vocabulary of interest to the ultimate consumer. In this case, the ultimate consumer is likely to have to map/validate/handle errors in the data they get from Sindice.

2. A consumer such as openelectiondata.org wants to support microdata-based markup of their vocabulary as well as RDFa-based markup, both going into an RDF-based data store. They want to use an off-the-shelf tool to extract the microdata. They want to configure the tool to give them the RDF that is appropriate for their known vocabulary.

3. A browser plugin that captures data for the user uses an RDF model as its backend store. Any time it encounters microdata on a page, it wants to pull that microdata into the store on the fly.

Are there other use cases for mapping microdata to RDF?

We should also remind ourselves that our goal within the TF is not to produce a finished specification but to provide something that others could take forward to Recommendation. We should make sure that we capture options, rationale and recommendations, but we do not have to make any final decisions.

We have a range of different options about how processors determine what mapping to use:

1. all processors use the same (default) mapping for all vocabularies
2. all processors use a default mapping for unknown vocabularies and a customised mapping for known vocabularies where the known vocabulary mappings are:
   a. a pre-defined set of popular vocabularies
   b. drawn from a registry
   c. determined by resolving the vocabulary's schema
3. different processors have different sets of mappings and must specify how they are set
   a. all processors have the same default mapping for unknown vocabularies
   b. processors must also specify what default mapping they use
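
To make the difference between these options concrete, here is a minimal Python sketch of the lookup a processor might perform. Everything in it is hypothetical: the function name, the dictionary keys and the placeholder default values are illustrations only and do not come from any spec or from a decision of the TF.

```python
# Hypothetical sketch: names, keys and values are illustrative only.
# The default values are placeholders; no default mapping has been agreed.
DEFAULT_MAPPING = {
    "property_uri": "placeholder-uri-method",
    "multi_valued": "placeholder-multi-value-method",
}


def select_mapping(item_type, known_mappings, default=DEFAULT_MAPPING):
    """Return the mapping for an item type, or the default if it is unknown.

    known_mappings maps a vocabulary URI prefix to its customised mapping.
    Where that table comes from is what separates the options:
      option 1:  the table is always empty
      option 2a: a fixed, pre-defined table
      option 2b: a table loaded from a registry
      option 2c: a table built by resolving the vocabulary's schema
      option 3:  each processor supplies its own table (and, under 3b,
                 its own default as well)
    """
    matches = [prefix for prefix in known_mappings if item_type.startswith(prefix)]
    if not matches:
        return default
    # Prefer the most specific (longest) matching vocabulary prefix.
    return known_mappings[max(matches, key=len)]
```

With an empty table every item type falls through to the default, which is option 1; the other options differ only in who fills the table and who sets the default.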

I propose that we document these possibilities as an editorial note within the document and have a straw poll about which method we recommend and rule out. I will start a separate thread to do that.

Whichever option is chosen, we need to have a complete list of the things about a vocabulary that a processor needs to know (ie an environment) in order to generate "natural RDF" from microdata; this includes the ranges of properties, how property URIs are derived from the type + defined property name, and error-handling behaviour. It would help if the microdata->RDF spec were written in terms of using environment properties.
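
As a rough illustration of what such an environment could look like, the sketch below collects the three things named above into one Python structure. The class and field names are hypothetical, and the real list of environment properties is still to be drawn up.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class VocabularyEnvironment:
    """Hypothetical container for what a processor needs to know about a vocabulary.

    The field names are illustrative; the complete list is not yet defined.
    """

    # How property URIs are derived from the type + defined property name.
    property_uri_method: str
    # What the processor should do when it meets an error.
    error_handling: str
    # Range of each property, keyed by defined property name.
    property_ranges: Dict[str, str] = field(default_factory=dict)
```

A spec written in terms of these properties would then describe each processing step as a function of the environment rather than of a particular vocabulary.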

There are some items on that list where we need to specify the options in detail. For example:

* three(?) possible methods of generating a URI from a type + defined property name:
  * the "natural RDF" mapping currently defined in Gregg's spec
  * the type#property mapping
  * a variant on Hixie's mapping

* four(?) possible ways of handling multi-valued properties:
  * separate triples
  * rdf:List values
  * rdf:Seq values
  * Ordered Lists
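
Some of these options can be sketched in a few lines of Python. The sketch below covers only the ones whose behaviour is clear from their names: one reading of the type#property mapping (the type URI, a '#', then the property name), and the separate triples, rdf:List and rdf:Seq treatments of a multi-valued property. The function names are hypothetical, triples are plain tuples of strings with no distinction between URIs and literals, and the "natural RDF" mapping, the Hixie variant and "Ordered Lists" are left out because their details are not given here.

```python
import itertools

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
_bnode_ids = itertools.count()


def new_bnode():
    """Return a fresh blank node label."""
    return f"_:b{next(_bnode_ids)}"


def type_hash_property(item_type, name):
    """One reading of type#property: type URI + '#' + defined property name.

    Assumes the type URI has no fragment of its own.
    """
    return f"{item_type}#{name}"


def separate_triples(subject, predicate, values):
    """One triple per value; the order of the values is not recorded."""
    return [(subject, predicate, value) for value in values]


def as_rdf_list(subject, predicate, values):
    """A single rdf:List (rdf:first/rdf:rest chain ending in rdf:nil)."""
    if not values:
        return [(subject, predicate, RDF + "nil")]
    nodes = [new_bnode() for _ in values]
    triples = [(subject, predicate, nodes[0])]
    for i, (node, value) in enumerate(zip(nodes, values)):
        rest = nodes[i + 1] if i + 1 < len(nodes) else RDF + "nil"
        triples.append((node, RDF + "first", value))
        triples.append((node, RDF + "rest", rest))
    return triples


def as_rdf_seq(subject, predicate, values):
    """A single rdf:Seq container with members rdf:_1, rdf:_2, ..."""
    seq = new_bnode()
    triples = [(subject, predicate, seq), (seq, RDF + "type", RDF + "Seq")]
    for i, value in enumerate(values, start=1):
        triples.append((seq, f"{RDF}_{i}", value))
    return triples
```

The trade-off is visible in the output: separate triples are the simplest to query but drop the ordering of the values, while rdf:List and rdf:Seq keep the order at the cost of an extra layer of nodes.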

If we go for 1, 2 or 3a for how processors determine how to map items of a particular type, we also have to define what the default mapping should be. We can't do that until we have a complete list of the things that a mapping needs to define and what the options are for each of those, so we'll defer this until we have that list.

If we go for 2 or 3 then we probably will want two other resources:

* as Gregg suggested, the mappings for a set of popular vocabularies (probably those whose prefixes are built-in to RDFa), probably in a separate document, wiki or registry

* as Martin suggested, a small vocabulary that enables vocabulary owners to describe the mapping for their vocabulary within an RDFS schema or OWL ontology

Whether these need to be done within this TF, I'm not sure. They are both probably useful anyway as examples of (a) what vocabulary mappings need to look like and (b) how to express them in a machine-readable way.

Gregg,

Does that give you a way forward? Could you write up either in the spec or on the wiki:

* the list of the things that need to be known to map microdata to RDF
* the options for the type+property name and list value handling

so that we have a clear record of them to refer to? (I'm sure if anyone wants to help Gregg he'd appreciate it.)

I will create a separate thread on the straw poll.

Thank you,

Jeni
--
Jeni Tennison
http://www.jenitennison.com

Received on Friday, 21 October 2011 14:03:08 UTC