W3C

Impressions on the Schema.org Workshop

(This blog post should have gone out about a week ago. Because of an unlucky clash in my agenda, the trip to Mountain View was immediately followed by another trip, which made it difficult to publish this in a really timely manner…)

Structured data is picking up in the search world. The example that made the headlines in the blogosphere (and beyond) this summer was schema.org, jointly initiated by Google, Microsoft, and Yahoo! On 21 September, the three initiators of schema.org co-organized a workshop held on the Microsoft Campus in Mountain View.

Since the original announcement of schema.org in June this year, there has been quite a lot of discussion in the blogosphere and elsewhere about the role, importance, and future of the initiative. There was also a lot of miscommunication on all sides, which is rarely helpful. It is important to find the common ground in a community that, after all, shares the general goal of helping the evolution of structured data on the Web. Clearly, the search engines have a major role to play here. The workshop gathered around 70-80 people from all across the community, including producers of structured data, experts in the field, and representatives from search engines like Baidu or Yandex, which don’t (yet?) participate in schema.org. It was therefore an important step toward building broader buy-in and, eventually, consensus.

The event schedule included presentations, break-out groups on specific technical issues, and a panel. The technical topics focused on two issues: the structure and role of the schema.org vocabulary, and the syntax used to express its terms in HTML.

Vocabularies. Schema.org, as it stands today, provides a fairly detailed but high-level vocabulary on specific subjects that can be used in an HTML page. Mark-up in that vocabulary can be consumed directly by a dedicated application (such as the search engines themselves), or it can be integrated with other data, e.g., by converting the structured data into RDF and using the usual data integration approaches provided by Semantic Web techniques. (Note that schema.org published an OWL version of the vocabulary quite some time ago.) The main issues for the future, however, are how this vocabulary will evolve, how it will relate to other vocabularies on the Web, and how schema.org will relate to other efforts to develop such vocabularies.

At present, the schema.org vocabulary can readily be used in HTML content mixed with other terms and classes, as long as those are identified by their full URIs (the details depend on the exact syntax). In particular, an HTML page using a mix of schema.org and any other terms is accepted by the search engines. The presence of non-schema.org classes or properties does not have any negative side effects. Although this might sound obvious now, it was one of the issues that weren’t clear when schema.org was initially announced; this led some to believe that such HTML content would be rejected as invalid. The opposite is the case. At the workshop, Martin Hepp presented practical examples of how the schema.org vocabulary can be mixed with outside vocabularies, using the GoodRelations ontology as the case in point.
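
To make the mixing rule concrete, here is a minimal sketch in Python. It is an illustration only, not how any search engine actually processes pages. The page fragment is hypothetical, and the GoodRelations property URI is used purely as an example of a term identified by its full URI. The script collects the itemprop names from a microdata fragment and separates the short names (interpreted against the schema.org item type) from the full URIs (terms from another vocabulary).

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical page fragment: a schema.org Product that also carries one
# property identified by a full URI from another vocabulary (illustrative).
SNIPPET = """
<div itemscope itemtype="http://schema.org/Product">
  <span itemprop="name">Example widget</span>
  <span itemprop="http://purl.org/goodrelations/v1#condition">new</span>
</div>
"""


class ItempropCollector(HTMLParser):
    """Collects every itemprop name found in the parsed mark-up."""

    def __init__(self):
        super().__init__()
        self.properties = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "itemprop" and value:
                # itemprop may hold several space-separated names
                self.properties.extend(value.split())


def split_by_vocabulary(html):
    """Return (short names, full-URI names) used as itemprop values."""
    collector = ItempropCollector()
    collector.feed(html)
    local, external = [], []
    for prop in collector.properties:
        # An absolute URI signals a term from outside the item's vocabulary.
        if urlparse(prop).scheme in ("http", "https"):
            external.append(prop)
        else:
            local.append(prop)
    return local, external


local_terms, external_terms = split_by_vocabulary(SNIPPET)
print("schema.org terms:", local_terms)
print("other vocabularies:", external_terms)
```

A consumer that only understands schema.org can simply ignore the second list, which is consistent with the point above that extra terms have no negative side effects.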

Schema.org also provides an extension mechanism within its own vocabulary. Existing schema.org URIs can be extended by appending a “/” character followed by user-defined strings. While simple, this mechanism raises a number of questions that were discussed at the workshop. For example, how would one find out what a specific new term means (the new URI cannot be dereferenced, as it is still on the schema.org domain)? How would term equivalences be secured? As one of the presenters put it, the current extension mechanism is hardly more than a specialized tagging syntax, and we all know the issues around similar but non-identical tags on sites like Delicious or Flickr. One interesting approach that surfaced during the discussion was that search engines will have crawl data available on these “tags” and, maybe, by publishing those data, some of the widely used extensions may converge into something more stable: some sort of crowd-sourced term definition mechanism. The future will tell whether this is a viable option, but it certainly is an interesting direction.
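
The “/” mechanism and the tagging problem can be sketched in a few lines of Python. This is a rough illustration under stated assumptions: the extended URIs and the "crawl" list are made up, and the function names are hypothetical. The first function splits an extended URI into its base schema.org term and the user-defined segments; the second counts how often each extension appears, which is roughly the kind of usage data that could be published.

```python
from collections import Counter

SCHEMA_ORG = "http://schema.org/"


def split_extension(uri):
    """Return (base term URI, list of user-defined extension segments)."""
    if not uri.startswith(SCHEMA_ORG):
        raise ValueError("not a schema.org URI: " + uri)
    segments = [s for s in uri[len(SCHEMA_ORG):].split("/") if s]
    if not segments:
        raise ValueError("no term in URI: " + uri)
    return SCHEMA_ORG + segments[0], segments[1:]


def extension_counts(uris):
    """Count occurrences of each (base term, extension path) pair."""
    counts = Counter()
    for uri in uris:
        base, extension = split_extension(uri)
        if extension:
            counts[(base, "/".join(extension))] += 1
    return counts


# Hypothetical crawl data: two spellings of what is probably the same idea.
CRAWLED = [
    "http://schema.org/Person/Engineer",
    "http://schema.org/Person/Engineer",
    "http://schema.org/Person/engineer",
    "http://schema.org/Person",
]

for (base, extension), n in extension_counts(CRAWLED).most_common():
    print(base, extension, n)
```

The two spellings are counted as distinct extensions, which is exactly the similar-but-non-identical tag problem; published counts would at least show publishers which variant is dominant.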

In some cases the schema.org initiative might choose to adopt vocabularies developed by other communities and include them in the “official” schema.org vocabulary hierarchy. Evan Sandhaus and Andreas Gebhard told us about the process that is expected to lead to the adoption of the rNews vocabulary by schema.org. (IPTC has since announced this officially.) The journey to achieve that was interesting and did raise some questions, too. For example, vocabularies are not simply included; instead, each and every term had to be discussed with the schema.org owners, possibly leading to some changes and even exclusions. The resulting issues, such as the general process and public accountability, still need to be clarified and possibly developed further. Additionally, there are questions around the exact names, that is, the URIs used to identify terms: will there be a term in the rNews name space (i.e., http://www.iptc.org/ns/1.0/) and a sibling with a similar role in the schema.org name space? Will the vocabulary owner publish a separate vocabulary file containing, for example, owl:equivalentProperty statements for those terms? While this might not be a problem for the relatively new rNews (they may decide to simply adopt schema.org URIs), it is likely to become an issue for more established vocabularies.
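
As a sketch of what such a separate mapping file could contain, the Python snippet below prints owl:equivalentProperty statements in N-Triples syntax. The term pairs are hypothetical placeholders chosen for illustration, not an actual rNews-to-schema.org mapping, and the rNews name space is simply the one quoted above.

```python
OWL_EQUIVALENT_PROPERTY = "http://www.w3.org/2002/07/owl#equivalentProperty"
RNEWS_NS = "http://www.iptc.org/ns/1.0/"  # name space as quoted in the text
SCHEMA_NS = "http://schema.org/"

# Hypothetical (rNews term, schema.org term) pairs, for illustration only.
TERM_PAIRS = [
    ("headline", "headline"),
    ("dateline", "dateline"),
]


def equivalence_triples(pairs):
    """Return one N-Triples line per pair of equivalent properties."""
    lines = []
    for rnews_term, schema_term in pairs:
        lines.append(
            "<%s%s> <%s> <%s%s> ."
            % (RNEWS_NS, rnews_term, OWL_EQUIVALENT_PROPERTY, SCHEMA_NS, schema_term)
        )
    return lines


for line in equivalence_triples(TERM_PAIRS):
    print(line)
```

An RDF-aware consumer could load such a file alongside the page data and treat the sibling terms as interchangeable; a consumer that ignores it would still see two unrelated URIs, which is why the question of who publishes and maintains the mapping matters.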

In general, there was agreement in the room that extensibility of vocabularies is important and that more work is needed. While the schema.org vocabulary will play a hugely important role in the future, it cannot (and, as the schema.org partners emphasized, does not intend to!) cover all areas of structured data. As announced earlier, the new W3C SWIG Task Force on Web Schemas (led by R.V. Guha, from Google) will become the main forum for discussing those questions.

Syntax. The other major thread of discussion in the past few months was the syntax to be used to include structured data in conjunction with schema.org. Should it be only microdata (as suggested by the schema.org site)? Should it also be based on microformats or on RDFa? All of the above?

In a separate break-out discussion Ben Adida made a presentation on RDFa 1.1, showing how this upcoming version of RDFa provides a level of simplicity that may make it suitable for the purposes of schema.org, too. The discussion that followed, beyond some technical issues (e.g., whether RDFa should retain the duality of using @rel and @property attributes), concentrated on the more general question of whether search engines should accept multiple syntaxes or whether there should be only one. After some discussion it was felt that by not allowing multiple syntaxes a number of long tail applications could be excluded. Indeed, some applications may need more complex expressions than what microdata provides today, but those applications may also want to use schema.org vocabularies for consumption by search engines. The consensus in the room was that multiple syntaxes ought to be accepted, although it would also be necessary to look at some technical issues around, for example, RDFa 1.1 to possibly simplify it and make it (or a subset thereof) more suitable for average Web authors. The W3C SWIG Task Force on HTML Data, led by Jeni Tennison, will play a major role in fleshing out those technical issues. Although there was no clear commitment (yet?) from the search engines that RDFa would be accepted alongside microdata, the feeling was that there is indeed a high probability that this may happen at some point. The syntax sessions felt like an important breakthrough after what had been a period of contentious discussions.

All in all, it was a good day, which may be remembered in the future as an important milestone in the evolution of Structured Data on Web sites, and, as a consequence, in the future of the Semantic Web. Thanks to all the organizers, and also to R.V. Guha, Jeni Tennison, and Dan Brickley for helping to set up the SWIG Task Forces!

About Ivan Herman

Ivan Herman is the leader of the Digital Publishing Activity at W3C. For more details, see http://www.w3.org/People/Ivan/

3 thoughts on “Impressions on the Schema.org Workshop”

  1. Thanks for this summary Ivan. I’m happy to see that the issue of extensibility was discussed in some depth, as the lack of a formal extension methodology obviously represents a barrier to the wider adoption and use of schema.org microdata. While an extension mechanism exists, the uncertainty surrounding the adoption of extended properties, classes or items hardly encourages publishers to extend the vocabulary (“… maybe, by publishing those data, some of the widely used extensions may converge into something more stable.”)

    I think the adoption of rNews was a great step forward. While there are, as you outline, issues surrounding the adoption of existing vocabularies, it seems to me to make more sense to incorporate well-established vocabularies into schema.org than to invent them all over again.

  2. Accessibility also plays into how easy it is for the average web author to develop structured data.

    This is why it is important that only one standard syntax exists. This standard should be developed by the W3C since the syntax will need to work within other W3C standards.

    Vocabularies are simply parts of an overall semantically-based web dictionary, where organizations have elected to select a few terms and define them. This is good, but coordination and leadership are necessary to control overlap.

    Outside agencies should ONLY concern themselves with defining and stabilizing their chosen terms. Leave the business of syntax to the W3C and RDF/RDFa. Dublin Core can deal with terms regarding provenance, FOAF can handle terms for social relationships, SIOC can deal with review type terms, and schema.org can handle terms dealing with searching.

    The W3C can add overall leadership by providing coordination services to these agencies to ensure terms are not overlapping and to determine whether a particular agency needs to handle additional terms.

    Having different agencies prescribing their own syntax simply undermines the perceived authority the W3C brand holds over web standards. Fragmentation is not the answer. Acknowledged leadership and coordination is the solution.
