(This blog post should have gone out about a week ago. Because of an unlucky clash in my agenda, the trip to Mountain View was immediately followed by another trip, which made it difficult to publish this in a really timely manner…)
Structured data is picking up in the search world. The example that made the headlines in the blogosphere (and beyond) this summer was schema.org, an initiative launched jointly by Google, Microsoft, and Yahoo! On 21 September, the three initiators of schema.org co-organized a workshop held on the Microsoft Campus in Mountain View.
Since the original announcement of schema.org in June this year there has been quite a lot of discussion in the blogosphere and elsewhere about the role, importance, and future of the initiative. There was also a lot of miscommunication on all sides, which is rarely helpful. It is important to find common ground in a community which, after all, shares the general goal of helping the evolution of structured data on the Web. Clearly, the search engines have a major role to play here. The workshop gathered around 70 to 80 people from all across the community, including producers of structured data, experts in the field, and representatives of search engines like Baidu or Yandex, which don’t (yet?) participate in schema.org. It was therefore an important step toward building broader buy-in and, eventually, consensus.
The event schedule included presentations, break-out groups on specific technical issues, and a panel. The technical topics focused on two issues: the structure and role of the schema.org vocabulary, and the syntax used to express that vocabulary in HTML.
Vocabularies. Schema.org, as it stands today, provides a fairly detailed but high-level vocabulary on specific subjects that can be used in an HTML page. Mark-up in that vocabulary can be consumed directly by a dedicated application (the search engines, for instance), or it can be integrated with other data, e.g., by converting the structured data into RDF and using the usual data integration approaches provided by Semantic Web techniques. (Note that schema.org published an OWL version of the vocabulary quite some time ago.) The main issue for the future, however, is how this vocabulary will evolve, how it will relate to other vocabularies on the Web, and how schema.org will relate to other efforts to develop such vocabularies.
At present, the schema.org vocabulary can readily be used in HTML content mixed with other terms and classes, as long as those are identified by their full URIs (the details depend on the exact syntax). In particular, an HTML page using a mix of schema.org and any other terms is accepted by the search engines. The presence of non-schema.org classes or properties does not have any negative side effects. Although this might sound obvious now, it was one of the issues that weren’t clear when schema.org was initially announced; this led some to believe that such HTML content would be rejected as invalid. The opposite is the case. At the workshop, Martin Hepp presented practical examples of how schema.org vocabularies can be mixed with outside vocabularies, using the GoodRelations ontology as his case.
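To make the mixing concrete, here is a minimal sketch in Python. The page fragment is invented for illustration (it is not one of the examples shown at the workshop), and it assumes microdata, where a property from another vocabulary is written as a full URI in `itemprop`. Resolving the short names against the schema.org namespace is a simplification of mine, and the helper names `collect_properties` and `split_by_vocabulary` are hypothetical.

```python
from html.parser import HTMLParser

# Illustrative fragment (an assumption, not workshop material): a schema.org
# Offer that also carries one GoodRelations property, written as a full URI.
PAGE = """
<div itemscope itemtype="http://schema.org/Offer">
  <span itemprop="name">Second-hand bicycle</span>
  <span itemprop="price">120.00</span>
  <link itemprop="http://purl.org/goodrelations/v1#hasBusinessFunction"
        href="http://purl.org/goodrelations/v1#Sell">
</div>
"""


class PropertyCollector(HTMLParser):
    """Collect itemprop names, expanding short names with the item type's prefix."""

    def __init__(self):
        super().__init__()
        self.vocabulary = ""   # e.g. "http://schema.org/"
        self.properties = []   # property URIs in document order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemtype" in attrs:
            # Simplification: prefix = item type URI up to and including the last "/".
            self.vocabulary = attrs["itemtype"].rsplit("/", 1)[0] + "/"
        if "itemprop" in attrs:
            for name in attrs["itemprop"].split():
                is_full_uri = "://" in name
                self.properties.append(name if is_full_uri else self.vocabulary + name)


def collect_properties(html):
    collector = PropertyCollector()
    collector.feed(html)
    return collector.properties


def split_by_vocabulary(properties, prefix="http://schema.org/"):
    """Separate schema.org properties from properties of any other vocabulary."""
    inside = [p for p in properties if p.startswith(prefix)]
    outside = [p for p in properties if not p.startswith(prefix)]
    return inside, outside


if __name__ == "__main__":
    inside, outside = split_by_vocabulary(collect_properties(PAGE))
    print("schema.org terms:", inside)
    print("other terms:", outside)
```

A consumer that only knows schema.org can work with the first list and simply skip the second, which is the behaviour described above: the foreign terms do no harm.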
Schema.org also provides an extension mechanism within its own vocabulary. Existing schema.org URIs can be extended by appending a “/” character followed by a user-defined string. While simple, this mechanism raises a number of questions that were discussed at the workshop. For example, how would one find out what a specific new term means (the new URI cannot be dereferenced, as it is still on the schema.org domain)? How would term equivalences be secured? As one of the presenters put it, the current extension mechanism is hardly more than a specialized tagging syntax, and we all know the issues around similar but non-identical tags on sites like Delicious or Flickr. One interesting approach that surfaced during the discussion was that search engines will have crawl data available on these “tags” and, maybe, by publishing those data, some of the widely used extensions may converge into something more stable: some sort of crowd-sourced term definition mechanism. The future will tell whether this is a viable option, but it certainly is an interesting direction.
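The following Python sketch illustrates both sides of that discussion: how a consumer might split an extended URI back into its core type and the user-defined part, and how usage counts over crawl data could expose near-duplicate "tags". The `CORE_TYPES` set, the function names, and the sample URIs are all hypothetical; nothing here reflects what the search engines actually do.

```python
from collections import Counter

# Hypothetical subset of core types; the real list is published on schema.org.
CORE_TYPES = {"Person", "Product", "Event"}
BASE = "http://schema.org/"


def split_extension(type_uri):
    """Return (core_type, extension_path) for a schema.org type URI.

    extension_path holds the user-defined strings after the core type and is
    empty for a plain core type.
    """
    if not type_uri.startswith(BASE):
        raise ValueError("not a schema.org URI: " + type_uri)
    core, *extension = type_uri[len(BASE):].strip("/").split("/")
    if core not in CORE_TYPES:
        raise ValueError("unknown core type: " + core)
    return core, extension


def fallback_type(type_uri):
    """Core type URI a consumer can use when it does not know the extension."""
    core, _ = split_extension(type_uri)
    return BASE + core


def extension_usage(type_uris):
    """Count occurrences of each extension, keyed by (core type, extension path)."""
    counts = Counter()
    for uri in type_uris:
        core, extension = split_extension(uri)
        if extension:
            counts[(core, "/".join(extension))] += 1
    return counts


if __name__ == "__main__":
    crawl_sample = [  # made-up data standing in for published crawl statistics
        "http://schema.org/Person/Engineer",
        "http://schema.org/Person/Engineer",
        "http://schema.org/Person/engineer",
        "http://schema.org/Person",
    ]
    for (core, path), count in extension_usage(crawl_sample).most_common():
        print(core, path, count)
```

In the sample, `Engineer` and `engineer` are counted as two different extensions, which is exactly the tagging problem raised at the workshop; published counts of this kind are what might let authors settle on the more common spelling.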
In some cases the schema.org initiative might choose to adopt vocabularies developed by other communities and include them in the “official” schema.org vocabulary hierarchy. Evan Sandhaus and Andreas Gebhard told us about the process that is expected to lead to the adoption of the rNews vocabulary by schema.org. (This has since been officially announced by IPTC.) The journey to achieve that was interesting and raised some questions, too. For example, vocabularies are not simply included; instead, each and every term had to be discussed with the schema.org owners, possibly leading to some changes and even exclusions. The resulting issues, which include the general process and public accountability, among others, still need to be clarified and possibly developed further. Additionally, there are questions around the exact names, that is, the URIs used to identify terms: will there be a term in the rNews namespace (http://www.iptc.org/ns/1.0/) and a sibling with a similar role in the schema.org namespace? Will the vocabulary owner publish separate vocabulary files containing, for example, statements for those terms? While this might not be a problem for the relatively new rNews (they may decide to simply adopt schema.org URIs), it is likely to become an issue for more established vocabularies.
In general, there was agreement in the room that extensibility of vocabularies is important, and that more work is needed beyond what is available today. While the schema.org vocabulary will play a hugely important role in the future, it cannot (and, as the schema.org partners emphasized, does not intend to!) cover all areas of structured data. As announced earlier, the new W3C SWIG Task Force on Web Schemas (led by R.V. Guha, from Google) will become the main forum for discussing those questions.
Syntax. The other major thread of discussion in the past few months was the syntax to be used to include structured data in conjunction with schema.org. Should it be only microdata (as suggested by the schema.org site)? Should it also be based on microformats or on RDFa? All of the above?
In a separate break-out discussion Ben Adida made a presentation on RDFa 1.1, showing how this upcoming version of RDFa provides a level of simplicity that may make it suitable for the purposes of schema.org, too. The discussion that followed, beyond some technical issues (e.g., whether RDFa should retain the duality of […]), concentrated on the more general question of whether search engines should accept multiple syntaxes or whether there should be only one. After some discussion it was felt that by not allowing multiple syntaxes a number of long tail applications could be excluded. Indeed, some applications may need more complex expressions than what microdata provides today, but those applications may also want to use schema.org vocabularies for consumption by search engines. The consensus in the room was that multiple syntaxes ought to be accepted, although it would also be necessary to look at some technical issues around, for example, RDFa 1.1 to possibly simplify it and make it (or a subset thereof) more suitable for average Web authors. The W3C SWIG Task Force on HTML Data, led by Jeni Tennison, will play a major role in fleshing out those technical issues. Although there was no clear commitment (yet?) from the search engines that RDFa would be accepted alongside microdata, the feeling was that there is indeed a high probability that this may happen at some point. The syntax sessions felt like an important breakthrough after what had been a period of contentious discussions.
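For simple cases the two syntaxes can carry the same information, which is part of why accepting both is plausible. The Python sketch below parses one flat item written once in microdata and once in RDFa 1.1 (using `vocab`, `typeof`, and `property`) and yields the same result for both. Both fragments are invented for illustration, and `FlatItemParser` is a hypothetical toy that handles only a single, non-nested item with text values; it is in no way a conformant microdata or RDFa processor.

```python
from html.parser import HTMLParser

# The same made-up person, once in microdata and once in RDFa 1.1.
MICRODATA = """
<div itemscope itemtype="http://schema.org/Person">
  <span itemprop="name">Jane Doe</span>
  <span itemprop="jobTitle">Librarian</span>
</div>
"""

RDFA = """
<div vocab="http://schema.org/" typeof="Person">
  <span property="name">Jane Doe</span>
  <span property="jobTitle">Librarian</span>
</div>
"""


class FlatItemParser(HTMLParser):
    """Extract the type and the property/text pairs of one flat item."""

    def __init__(self):
        super().__init__()
        self.item_type = None
        self.values = {}
        self._current = None  # property whose text content is expected next

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemtype" in attrs:      # microdata: the type is already a full URI
            self.item_type = attrs["itemtype"]
        elif "typeof" in attrs:      # RDFa 1.1: the type is relative to @vocab
            self.item_type = attrs.get("vocab", "") + attrs["typeof"]
        self._current = attrs.get("itemprop") or attrs.get("property")

    def handle_data(self, data):
        if self._current and data.strip():
            self.values[self._current] = data.strip()
            self._current = None


def parse_item(html):
    parser = FlatItemParser()
    parser.feed(html)
    return parser.item_type, parser.values


if __name__ == "__main__":
    print(parse_item(MICRODATA))
    print(parse_item(RDFA))
    print("same data:", parse_item(MICRODATA) == parse_item(RDFA))
```

The differences between the syntaxes only start to matter for the more complex expressions mentioned above, which this toy deliberately ignores.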
All in all, it was a good day, which may be remembered in the future as an important milestone in the evolution of Structured Data on Web sites and, as a consequence, in the future of the Semantic Web. Thanks to all the organizers, and also to R.V. Guha, Jeni Tennison, and Dan Brickley for helping to set up the SWIG Task Forces!