Re: htmldata-ISSUE-1 (Microdata Vocabulary): Vocabulary specific parsing for Microdata

On Oct 19, 2011, at 3:24 AM, "Ivan Herman" <ivan@w3.org> wrote:

> Greg,
> 
> before reflecting on the issues, can somebody tell me where those mystical processing rules are defined? I looked at the microdata spec, eg, 
> 
> http://dev.w3.org/html5/md/#selecting-names-when-defining-vocabularies
> 
> which does not tell me what these are. Although I give my comments below, I am a little bit bothered by the fact that we are talking about something that is a bit unspecified.

The current WD of HTML Microdata describes a means of constructing URIs for non-URI property names [3]. It's rather complex, and I think that the generated URIs might not even be legal. The editor's draft has since withdrawn this definition, so unless we resurrect or revise it, it will eventually just go away.

The main issue with this algorithm is that URIs for properties are scoped to a combination of the relevant itemtype and property path, making them incompatible with typical property URIs from other RDF vocabularies.
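
To make the incompatibility concrete, here is a purely illustrative sketch (the scoping scheme below is hypothetical, not the WD's actual construction): a predicate scoped to the itemtype and property path comes out different for every type it appears under, so it can never line up with a single vocabulary-defined property URI.

    from urllib.parse import quote

    def scoped_uri(itemtype: str, prop_path: str) -> str:
        # Hypothetical scoping scheme: embed the full itemtype and property
        # path in the predicate, which is what the WD does in spirit.
        return ("http://example.org/md?type=" + quote(itemtype, safe="")
                + "&prop=" + quote(prop_path, safe=""))

    # The same 'name' property under two types yields two unrelated predicates,
    # neither of which matches a vocabulary URI like http://example.com/name:
    #   scoped_uri("http://example.com/Person", "name")
    #   scoped_uri("http://example.com/Organization", "name")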


> On Oct 18, 2011, at 20:29 , HTML Data Task Force Issue Tracker wrote:
> 
>> 
>> htmldata-ISSUE-1 (Microdata Vocabulary): Vocabulary specific parsing for Microdata
>> 
>> http://www.w3.org/2011/htmldata/track/issues/1
>> 
> 
> [snip]
> 
>> 
>> There are different directions the specification could take:
>> 
>> * The specification requires that each vocabulary used within a document have a documented
>> vocabulary, and the processor has vocabulary-specific rules built into the processor. Documents
>> using unrecognized vocabularies fall back to a base-level processing, similar to that currently defined.
>> This includes vocabulary specific rules for multi-valued properties, property URI generation and
>> property datatype coercion.
>> 
>> * Processors extract the vocabulary from @itemtype and attempt to load a RDFS/OWL definition.
>> Property URIs are created by looking for appropriate predicates defined (or referenced) from within
>> this document. Values are coerced to rdf:List only if the predicate has a range of rdf:List. Value
>> datatypes are coerced to the appropriate datatype based on lexical value matching if there is more
>> than one, or by using the specific datatype if only one is listed.
>> 
>> * We use a generic mapping of Microdata to RDF that does not depend on vocabulary-specific rules.
>> This may include using RDF Collections for multiple values and vocabulary-relative property URI naming,
>> or not as we decide. The processor then may be at odds with defined property generation and
>> value ordering rules from HTML5 Microdata.
>> 
> 
> Let me concentrate first on probably the most important issue, namely the choice of the URI for the predicate terms.
> 
> I believe that, for microdata, we have actually three major sources of data out there.
> 
> 1. schema.org
> 2. existing RDF vocabularies that use the microdata syntax to encode data
> 3. microformats vocabularies that use the microdata syntax to encode data
> 
> (Obviously, #1-#3 can be mixed within the same file.)
> 
> The RDF/OWL mapping of schema.org does exist, though I am not sure it is considered final. But even if it is, it is one set of rules (I presume) for the full schema.org hierarchy. It affects the base URI for predicate URI generation, and a processor may just know that. I.e., this falls under your first option, unless the OWL mapping is adapted (after all, that is still in flux), in which case the third option may also work.

This could fall under any of the options. If the ontology defined rdfs:domain for all properties, the property URI could be determined by matching the name against the predicates whose domain includes the associated type; otherwise, the rule would be to use the type URI as the basis, truncated after the last '/' or '#'.

A processor would only need to provide explicit vocabulary support when the URI pattern falls outside the default URI generation algorithm, and schema.org does fall under this pattern (as do most other standard OWL/RDFS vocabularies).
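
As a rough sketch of that default generation rule (the type and property names below are only examples):

    def property_uri(itemtype: str, name: str) -> str:
        # Use the @itemtype URI as the base: keep everything up to and
        # including the last '#' (or, failing that, '/') and append the name.
        for sep in ('#', '/'):
            if sep in itemtype:
                return itemtype[:itemtype.rindex(sep) + 1] + name
        return itemtype + '#' + name  # fallback if neither separator appears

    # property_uri("http://schema.org/Person", "name")
    #   -> "http://schema.org/name"
    # property_uri("http://example.org/vocab#Event", "startDate")
    #   -> "http://example.org/vocab#startDate"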

> For most of the RDF vocabularies, I actually think that our current approach (ie, third option, or the 'base' of the first option) on mapping would actually work fairly well. I have the gut feeling that it would cover a vast majority of vocabularies.

Except for those defined in the HTML spec, yes.

> For microformats, I am, first of all, not sure that there is already a 'standard' on how a specific microformat vocabulary is mapped on microdata. But, as far as I know, there isn't any standard on a microformat->RDF generation either. What this means is that a generic decision may as well work right away, without further ado; we do not have some sort of a backward compatibility issue.

The HTML5 spec defines vocabularies for vCard, vEvent, and licensing using Hixie's criteria. Otherwise, I don't think there's a standard in place, although schema.org or data-vocabulary.org could be an effective replacement.

> What I am getting at is that defining a generic mapping as of now, and allowing the processors to have knowledge about the specificities of a particular vocabulary may not be so dramatic as it sounds. Ie, I believe that most of the vocabularies will work just fine with the generic mapping, and there will be only a few cases (say, schema.org) that would require knowledge. We may simply say that an update of those extra knowledge is published by the W3C once every, say, 6 months, a bit like the default prefixes in RDFa are handled. That may even be machine readable, with a very simple vocabulary, without going into the complexities of OWL. In other words, a mixture of your first and last option may just work in practice.

My preference would be option 3, where there is a single way to parse without reference to vocabulary specifics. This would be in direct contrast with the HTML spec, but more in line with the needs of RDF tool chains.

> For the datatype: well, the fact is that microdata does not have datatypes. I think we should just accept that. The resulting RDF has a vocabulary URI; for specific vocabularies these may refer to an RDFS or OWL file, and RDF processors may want to pick those up and massage the RDF generated by the microdata conversion. That is outside the conversion itself, and is in the realm of the 'usual' management of RDF data. And if a specific vocabulary is very dependent on datatypes, well, then, sorry, do not use microdata in the first place! As it seems, Google will eventually understand schema.org in RDFa 1.1, too, so for those cases RDFa 1.1 is also at the user's disposal without further issues.

I've suggested elsewhere that this type of massaging could be done in a post-processing stage, much like RDFa vocabulary entailment. However, it could also take advantage of datatype range information, which would mean replacing literals, so it may be a step that generates a new graph.
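
A minimal sketch of that kind of post-processing step, assuming the vocabulary graph declares xsd datatypes via rdfs:range (the function and graph names here are made up):

    from rdflib import Graph, Literal
    from rdflib.namespace import RDFS, XSD

    def coerce_datatypes(data: Graph, vocab: Graph) -> Graph:
        # Build a new graph, re-typing plain literals whose predicate has an
        # xsd datatype declared as its rdfs:range in the vocabulary graph.
        out = Graph()
        for s, p, o in data:
            rng = vocab.value(p, RDFS.range)
            if (isinstance(o, Literal) and o.datatype is None
                    and rng is not None and str(rng).startswith(str(XSD))):
                o = Literal(str(o), datatype=rng)
            out.add((s, p, o))
        return out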

> I am not yet sure what to do about the list handling issue, I must admit. I think I would prefer not to generate lists by default.

I'd rather it wasn't there either.
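
For comparison, a small rdflib sketch (vocabulary and values are made up) of the two shapes under discussion for a multi-valued property: repeated triples versus a single ordered rdf:List.

    from rdflib import BNode, Graph, Literal, Namespace
    from rdflib.collection import Collection

    EX = Namespace("http://example.com/")
    item = BNode()

    repeated = Graph()
    for tag in ("rdf", "microdata", "html"):
        repeated.add((item, EX.keyword, Literal(tag)))   # order not preserved

    listed = Graph()
    head = BNode()
    Collection(listed, head, [Literal(t) for t in ("rdf", "microdata", "html")])
    listed.add((item, EX.keyword, head))                 # one ordered rdf:List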

> Ok, now shoot at me:-)
> 
> Ivan

Gregg

>> [1] http://lists.w3.org/Archives/Public/public-html-data-tf/2011Oct/0085.html
>> [2] http://lists.w3.org/Archives/Public/public-html-data-tf/2011Oct/0118.htm
[3] http://www.w3.org/TR/2011/WD-microdata-20110525/#rdf

> ----
> Ivan Herman, W3C Semantic Web Activity Lead
> Home: http://www.w3.org/People/Ivan/
> mobile: +31-641044153
> PGP Key: http://www.ivan-herman.net/pgpkey.html
> FOAF: http://www.ivan-herman.net/foaf.rdf

Received on Wednesday, 19 October 2011 19:32:45 UTC