Re: htmldata-ISSUE-1 (Microdata Vocabulary): Vocabulary specific parsing for Microdata from Ivan Herman on 2011-10-19 (public-html-data-tf@w3.org from October 2011)

From: Ivan Herman <ivan@w3.org>
Date: Wed, 19 Oct 2011 12:26:34 +0200
To: Gregg Kellogg <gregg@kellogg-assoc.com>
Cc: HTML Data Task Force WG <public-html-data-tf@w3.org>
Message-Id: <0E1A7D90-3EAD-40C4-A2E9-DFB90459FA7B@w3.org>
Gregg,

before reflecting on the issues, can somebody tell me where those mystical processing rules are defined? I looked at the microdata spec, e.g.,

http://dev.w3.org/html5/md/#selecting-names-when-defining-vocabularies

which does not tell me what these are. Although I give my comments below, I am a little bit bothered by the fact that we are talking about something that is a bit unspecified.

On Oct 18, 2011, at 20:29 , HTML Data Task Force Issue Tracker wrote:

> 
> htmldata-ISSUE-1 (Microdata Vocabulary): Vocabulary specific parsing for Microdata
> 
> http://www.w3.org/2011/htmldata/track/issues/1
> 

[snip]

> 
> There are different directions the specification could take:
> 
> * The specification requires that each vocabulary used within a document have a documented
>  vocabulary, and the processor has vocabulary-specific rules built into the processor. Documents
>  using unrecognized vocabularies fall back to a base-level processing, similar to that currently defined.
>  This includes vocabulary specific rules for multi-valued properties, property URI generation and
>  property datatype coercion.
> 
> * Processors extract the vocabulary from @itemtype and attempt to load a RDFS/OWL definition.
>  Property URIs are created by looking for appropriate predicates defined (or referenced) from within
>  this document. Values are coerced to rdf:List only if the predicate has a range of rdf:List. Value
>  datatypes are coerced to the appropriate datatype based on lexical value matching if there is more
>  than one, or by using the specific datatype if only one is listed.
> 
> * We use a generic mapping of Microdata to RDF that does not depend on vocabulary-specific rules.
>  This may include using RDF Collections for multiple values and vocabulary-relative property URI naming,
>  or not as we decide. The processor then may be at odds with defined property generation and
>  value ordering rules from HTML5 Microdata.
> 

Let me concentrate first on probably the most important issue, namely the choice of the URI for the predicate terms.

I believe that, for microdata, we have actually three major sources of data out there.

1. schema.org
2. existing RDF vocabularies that use the microdata syntax to encode data
3. microformats vocabularies that use the microdata syntax to encode data

(Obviously, #1-#3 can be mixed within the same file.)

The RDF/OWL mapping of schema.org does exist, though I am not sure it is considered final. But even if it is, it is one set of rules (I presume) for the full schema.org hierarchy. It affects the base URI for the predicate URI generation, and a processor may just know that. I.e., this falls under your first option, unless the OWL mapping is adapted (after all, that is still in flux), in which case the third option may also work.

For most of the RDF vocabularies, I think that our current approach to the mapping (i.e., the third option, or the 'base' of the first option) would actually work fairly well. I have the gut feeling that it would cover a vast majority of vocabularies.

For microformats, I am, first of all, not sure that there is already a 'standard' on how a specific microformat vocabulary is mapped on microdata. But, as far as I know, there isn't any standard on a microformat->RDF generation either. What this means is that a generic decision may as well work right away, without further ado; we do not have some sort of a backward compatibility issue.

What I am getting at is that defining a generic mapping as of now, and allowing the processors to have knowledge about the specificities of a particular vocabulary, may not be as dramatic as it sounds. I.e., I believe that most of the vocabularies will work just fine with the generic mapping, and there will be only a few cases (say, schema.org) that would require extra knowledge. We may simply say that an update of that extra knowledge is published by the W3C once every, say, 6 months, a bit like the default prefixes in RDFa are handled. That may even be machine readable, with a very simple vocabulary, without going into the complexities of OWL. In other words, a mixture of your first and last option may just work in practice.
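To make the "mixture of the first and last option" concrete, here is a minimal sketch of how a processor might combine a generic property URI rule with a small, published table of vocabulary-specific knowledge. Everything in it is an assumption for illustration: the generic rule (replace the last path segment or the fragment of the @itemtype with the property name), the table format, and the example.org vocabulary are all hypothetical, not something defined by the microdata spec or by this task force.

```python
from urllib.parse import urldefrag

# Hypothetical, machine-readable "extra knowledge" table that could be
# republished periodically. Keys are itemtype prefixes; the entry is made up.
VOCAB_OVERRIDES = {
    "http://example.org/types/": {"property_base": "http://example.org/props/"},
}


def generic_property_uri(itemtype, name):
    """Generic rule (an assumption): keep the vocabulary part of the
    itemtype and replace its fragment or last path segment with the name."""
    base, fragment = urldefrag(itemtype)
    if fragment:
        return base + "#" + name
    return itemtype.rsplit("/", 1)[0] + "/" + name


def property_uri(itemtype, name):
    """Use vocabulary-specific knowledge when present, else the generic rule."""
    if "://" in name:
        # The property name is already an absolute URI; use it unchanged.
        return name
    for prefix, rules in VOCAB_OVERRIDES.items():
        if itemtype.startswith(prefix):
            return rules["property_base"] + name
    return generic_property_uri(itemtype, name)
```

With this shape, a vocabulary that is not in the table still gets usable predicates from the generic rule, and adding knowledge about a new vocabulary means adding a table entry rather than changing the processor.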

For the datatype: well, the fact is that microdata does not have datatypes. I think we should just accept that. The resulting RDF has a vocabulary URI; for specific vocabularies these may refer to an RDFS or OWL file, and RDF processors may want to pick those up and massage the RDF generated by the microdata conversion. That is outside the conversion itself, and is in the realm of the 'usual' management of RDF data. And if a specific vocabulary is very dependent on datatypes, well, then, sorry, do not use microdata in the first place! It seems Google will eventually understand schema.org in RDFa 1.1, too, so for those cases RDFa 1.1 is also at the user's disposal without further issues.
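As a rough illustration of that kind of post-processing, the sketch below adds datatypes to already-converted triples using range information that an RDF processor might have read from the vocabulary's RDFS or OWL file. The triple representation (plain tuples with string literals as objects), the range table, and the example.org property URIs are hypothetical simplifications; a real processor would use an RDF library and the actual schema.

```python
XSD = "http://www.w3.org/2001/XMLSchema#"

# Hypothetical ranges, as they might be declared in a vocabulary's RDFS/OWL file.
PROPERTY_RANGES = {
    "http://example.org/vocab/birthDate": XSD + "date",
    "http://example.org/vocab/age": XSD + "integer",
}


def add_datatypes(triples, ranges=PROPERTY_RANGES):
    """Pair each literal object with the datatype declared for its predicate.

    Assumes every object is a plain literal string. The datatype is None
    when no range is known for the predicate, so the literal stays plain.
    """
    typed = []
    for subject, predicate, literal in triples:
        typed.append((subject, predicate, (literal, ranges.get(predicate))))
    return typed
```

The point of the separation is that the microdata-to-RDF conversion stays vocabulary-agnostic, and any datatype handling happens afterwards on the generated RDF.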

I am not yet sure what to do about the list handling issue, I must admit. I think I would prefer not to generate lists by default.

Ok, now shoot at me:-)

Ivan



> [1] http://lists.w3.org/Archives/Public/public-html-data-tf/2011Oct/0085.html
> [2] http://lists.w3.org/Archives/Public/public-html-data-tf/2011Oct/0118.html
> 
> 
> 


----
Ivan Herman, W3C Semantic Web Activity Lead
Home: http://www.w3.org/People/Ivan/
mobile: +31-641044153
PGP Key: http://www.ivan-herman.net/pgpkey.html
FOAF: http://www.ivan-herman.net/foaf.rdf
Received on Wednesday, 19 October 2011 10:25:01 UTC