Re: htmldata-ISSUE-1 (Microdata Vocabulary): Vocabulary specific parsing for Microdata from Gregg Kellogg on 2011-10-20 (public-html-data-tf@w3.org from October 2011)

From: Gregg Kellogg <gregg@kellogg-assoc.com>
Date: Thu, 20 Oct 2011 14:02:58 -0400
To: Ivan Herman <ivan@w3.org>
CC: Gregg Kellogg <gregg@kellogg-assoc.com>, Martin Hepp <martin.hepp@ebusiness-unibw.org>, Jeni Tennison <jeni@jenitennison.com>, HTML Data Task Force WG <public-html-data-tf@w3.org>
Message-ID: <3F8BAE9F-2E5B-4019-9101-CE8F29296B1B@greggkellogg.net>
On Oct 20, 2011, at 12:23 AM, Ivan Herman wrote:

[snip]
For most of the RDF vocabularies, I think that our current approach (i.e., the third option, or the 'base' of the first option) on mapping would work fairly well. I have the gut feeling that it would cover the vast majority of vocabularies.

Except for those defined in the HTML spec, yes.

Sure, but those are documented and fixed, so that is fine.

Well, they're not referenced in the W3C version of the spec, but the vocabularies are defined in the WHATWG version [4], and Hixie has made it clear that non-URI property names are distinct from the URI versions of those names, and distinct from the same names used in a sub-item lacking an @itemtype. Our base-level URI generation would create the same URI for all three cases. For example, consider the following:

<div itemscope itemtype="http://microformats.org/profile/hcard">
  <span itemprop="fn">Gregg Kellogg</span>
  <span itemprop="http://microformats.org/profile/fn">Ivan Herman</span>
  <span itemprop="foo" itemscope>
    <span itemprop="fn">Martin Hepp</span>
  </span>
</div>

The Microdata to RDF algorithm would generate the following:

<> <http://www.w3.org/1999/xhtml/microdata#item> [ a <http://microformats.org/profile/hcard>;
     <http://microformats.org/profile/fn> ("Gregg Kellogg" "Ivan Herman");
     <http://microformats.org/profile/foo> [ <http://microformats.org/profile/fn> "Martin Hepp" ]] .
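
The conflation can be illustrated with a rough Python sketch. It assumes the generic rule is "use an absolute URI as-is; otherwise strip the last fragment or path segment from the @itemtype and append the property name", and that a sub-item lacking an @itemtype inherits the type of its parent. Both assumptions are inferred from the output above rather than taken from any spec text, and the function names are hypothetical.

```python
from urllib.parse import urlparse


def vocabulary_base(item_type: str) -> str:
    """Strip the last fragment or path segment from an item type URI."""
    if "#" in item_type:
        return item_type[: item_type.rindex("#") + 1]
    return item_type[: item_type.rindex("/") + 1]


def generic_property_uri(name: str, item_type: str) -> str:
    """Predicate URI for an itemprop under the assumed generic mapping."""
    if urlparse(name).scheme:  # already an absolute URI: use it unchanged
        return name
    return vocabulary_base(item_type) + name


HCARD = "http://microformats.org/profile/hcard"

# Token form on the typed item.
token_uri = generic_property_uri("fn", HCARD)
# Absolute URI form on the typed item.
absolute_uri = generic_property_uri("http://microformats.org/profile/fn", HCARD)
# Token form inside the untyped "foo" sub-item (assumed to inherit HCARD).
nested_uri = generic_property_uri("fn", HCARD)
```

All three calls return the same predicate URI, which is the conflation at issue.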

However, as Hixie has pointed out [1], using "fn" as a token rather than as a URI is intended to have a different meaning, as is its use within another item:

Incidentally, note that you can't just take, say, an RDF vocabulary, or a
Microformats vocabulary, and just use it in microdata directly. A
microdata vocabulary has to define processing rules that are often not
provided for RDF and Microformats vocabularies, and has to use the terms
defined in the HTML specification to describe how the terms work. You can
see examples of how to define vocabularies in the HTML standard

This was the motivation behind the rules in [3]. Note that, with this interpretation, three different URIs would be created for the same property:

<> <http://www.w3.org/1999/xhtml/microdata#item> [ a <http://microformats.org/profile/hcard>;
     <http://www.w3.org/1999/xhtml/microdata#http://microformats.org/profile/hcard%23:fn> "Gregg Kellogg";
     <http://microformats.org/profile/fn> "Ivan Herman";
     <http://microformats.org/profile/foo> [ <http://www.w3.org/1999/xhtml/microdata#http://microformats.org/profile/hcard%23foo%20fn> "Martin Hepp" ]] .

Note that I don't think the specifics of URI generation were important, just that they generated different (predictable) URIs, as the intention in HTML Microdata was that these, in fact, be treated as different properties.
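
To make the "different but predictable" point concrete, here is a minimal Python sketch of a context-sensitive scheme. It does not reproduce the exact escaping rules of [3]; the function name, the `path` parameter, and the choice to join the type and the property path with "#" and spaces are assumptions made for illustration only.

```python
from urllib.parse import quote, urlparse

MICRODATA_NS = "http://www.w3.org/1999/xhtml/microdata#"


def contextual_property_uri(name, item_type, path=()):
    """Predicate URI that depends on the type and on the nesting path.

    `path` lists the property names leading from the typed item down to
    the untyped sub-item that carries `name` (hypothetical parameter).
    """
    if urlparse(name).scheme:  # absolute URIs are used unchanged
        return name
    context = item_type + "#" + " ".join((*path, name))
    # Escape "#" and spaces so the context fits inside one fragment.
    return MICRODATA_NS + quote(context, safe=":/")


HCARD = "http://microformats.org/profile/hcard"

token_uri = contextual_property_uri("fn", HCARD)
absolute_uri = contextual_property_uri("http://microformats.org/profile/fn", HCARD)
nested_uri = contextual_property_uri("fn", HCARD, path=("foo",))
```

Under these assumptions the three uses of "fn" yield three distinct URIs, each derivable from the markup alone.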

The processing rules in [5] conflict with this: the property URIs are conflated. Hixie has pointed out this conflict, and I asked, on a separate thread, whether it is reasonable for us to do this, or whether we risk a formal objection preventing us from going to REC. Ivan/Jeni, do you have an opinion?

For microformats, I am, first of all, not sure that there is already a 'standard' on how a specific microformat vocabulary is mapped on microdata. But, as far as I know, there isn't any standard on a microformat->RDF generation either. What this means is that a generic decision may as well work right away, without further ado; we do not have some sort of a backward compatibility issue.

The html5 spec defines vocabularies for vCard, vEvent and license using Hixie's criteria.

Does it? I am looking at

http://dev.w3.org/html5/md/

and I do not find anything. Is there any other HTML5 document that does it?

In the WHATWG editor's draft [4].

Gregg

Otherwise, I don't think there's a standard in place, although schema.org or data-vocabulary.org could be an effective replacement.


Yes, well, this is a different discussion that is not the topic of this task force. The bottom line is that the microformat->RDF is fully open, no real practice...


What I am getting at is that defining a generic mapping as of now, and allowing the processors to have knowledge about the specificities of a particular vocabulary, may not be as dramatic as it sounds. I.e., I believe that most of the vocabularies will work just fine with the generic mapping, and there will be only a few cases (say, schema.org) that would require extra knowledge. We may simply say that an update of that extra knowledge is published by the W3C once every, say, 6 months, a bit like the default prefixes in RDFa are handled. That may even be machine readable, with a very simple vocabulary, without going into the complexities of OWL. In other words, a mixture of your first and last option may just work in practice.

My preference would be option 3, where there is a single way to parse, without reference to vocabulary specifics. This would be in direct contrast with the HTML spec, but more in line with the needs of RDF tool chains.


Yes, I see. I would like to see a use case where this does not work (and, off the top of my head, I do not see any).

Except that (hence my explicit cc to Martin): let us suppose that, eventually, the GR terms will be incorporated into schema.org. What this means is that all GR terms will be in the http://schema.org/ namespace, but that also means that a microdata->RDF mapping would produce the GR terms as

http://schema.org/Blah

which is different from the current terms. Martin, what are your plans for the old URIs in that situation?


For the datatype: well, the fact is that microdata does not have datatypes. I think we should just accept that. The resulting RDF has a vocabulary URI; for specific vocabularies these may refer to an RDFS or OWL file, and RDF processors may want to pick those up and massage the RDF generated by the microdata conversion. That is outside the conversion itself, and is in the realm of the 'usual' management of RDF data. And if a specific vocabulary is very dependent on datatypes, well, then, sorry, do not use microdata in the first place! It seems Google will, eventually, understand schema.org in RDFa 1.1 too, so for those cases RDFa 1.1 is also at the user's disposal without further issues.

I've suggested elsewhere that this type of massaging may be done through a post-processing stage, much like RDFa vocab entailment. However, it could also take advantage of datatype range information, which would cause the replacement of literals; thus it may be a step which generates a new graph.

The problem is that this would make it different than RDFa @vocab. RDFa @vocab considers subproperties and subclasses; if applied, and if the @vocab contains, say,

<a> rdfs:subPropertyOf <b> .

then, if the RDFa contains something with <a>, the final graph would include

<x> <a> <y> .
<x> <b> <y> .

which is perfectly o.k. But if we apply the same approach for datatypes and range, then we would get something like

<w> <q> "123", "123"^^xsd:int .

in case the range of <q> is set to xsd:int. But this is not entirely kosher, what we want to do is to have

<w> <q> "123"^^xsd:int .

only. In other words, the management of the vocabulary should be incorporated into the core processing of the microdata->RDF mapping, and not as some sort of an optional post-processing step, which is the RDFa @vocab case.
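
The difference between the two steps can be sketched in Python. This is only an illustration under simplifying assumptions: triples are plain tuples, the `Literal` class and both function names are hypothetical, and the subproperty expansion handles a single level rather than a full closure.

```python
from dataclasses import dataclass
from typing import Optional

XSD_INT = "http://www.w3.org/2001/XMLSchema#int"


@dataclass(frozen=True)
class Literal:
    """A literal with an optional datatype URI (hypothetical helper)."""
    lexical: str
    datatype: Optional[str] = None


def expand_subproperties(triples, sub_property_of):
    """Entailment-style step: only ever adds triples to the graph."""
    added = {
        (s, sub_property_of[p], o)
        for (s, p, o) in triples
        if p in sub_property_of
    }
    return triples | added


def apply_ranges(triples, ranges):
    """Datatype step: replaces plain literals, so it yields a new graph."""
    result = set()
    for s, p, o in triples:
        if p in ranges and isinstance(o, Literal) and o.datatype is None:
            o = Literal(o.lexical, ranges[p])
        result.add((s, p, o))
    return result


expanded = expand_subproperties({("x", "a", "y")}, {"a": "b"})
typed = apply_ranges({("w", "q", Literal("123"))}, {"q": XSD_INT})
```

The first function returns a superset of its input, as an optional entailment step can; the second drops the plain literal in favour of the typed one, which is why it fits better inside the core mapping.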

Ivan


I am not yet sure what to do about the list handling issue, I must admit. I think I would prefer not to generate lists by default.

I'd rather it wasn't there either.

Ok, now shoot at me:-)

Ivan

Gregg

[1] http://lists.w3.org/Archives/Public/public-html-data-tf/2011Oct/0085.html
[2] http://lists.w3.org/Archives/Public/public-html-data-tf/2011Oct/0118.htm
[3] http://www.w3.org/TR/2011/WD-microdata-20110525/#rdf
[4] http://www.whatwg.org/specs/web-apps/current-work/multipage/microdata.html#mdvocabs
[5] https://dvcs.w3.org/hg/htmldata/raw-file/24af1cde0da1/microdata-rdf/index.html

----
Ivan Herman, W3C Semantic Web Activity Lead
Home: http://www.w3.org/People/Ivan/
mobile: +31-641044153
PGP Key: http://www.ivan-herman.net/pgpkey.html
FOAF: http://www.ivan-herman.net/foaf.rdf

Received on Thursday, 20 October 2011 18:04:03 UTC