This is a proposal to improve and clarify schema.org's handling of identity issues, in particular for the common case where diverse sites provide information about the same real-world entity.
It adds a property to schema.org, 'sameThingAs', that can be used to indicate when a single real-world entity is being described.
Schema.org's data model is of linked entities and relationships, with an emphasis on their description using structured data within ordinary HTML Web pages.
Both HTML5 Microdata and RDFa Lite provide attributes ('itemid' and 'resource', respectively) whose values are identifiers for 'the thing itself'. In Microdata terms, 'itemid' gives us a 'global identifier', whose meaning is contextual, and based on the vocabulary being used. For example, a vocabulary defining a type 'Book' might use an itemid like 'urn:isbn:0-330-34032-8'. Similarly, in RDF, the word 'resource' is effectively a synonym for 'thing', and RDFa Lite's 'resource' attribute allows URI identifiers to be given for each thing being described.
When structured data is deployed within linked HTML pages, property values may also be URLs/URIs. For simplicity and usability, it is common for "identifiers for a page" and "identifiers for the main thing described by a page" to be conflated.
For example, here is what we see currently on the IMDB site, when looking at a page for a particular work (markup fixed for readability):
<a href="/name/nm0010930/" itemprop="actor">Douglas Adams</a>,
<a href="/name/nm0048982/" itemprop="actor">Tom Baker</a>
and <a href="/name/nm3035100/" itemprop="actor">Hans Peter Brondmo</a>
Here, our markup describes a CreativeWork, the 1990 documentary Hyperland. The cast list includes a link (typed 'actor') to a page about the actor Tom Baker. Wikipedia also has entries about the same documentary, about the actor Tom Baker, and about the writer and co-star Douglas Adams.
While some linked data sources try to carefully maintain the distinction between 'things' and 'pages that stand for those things', this is not always easy for many of the environments where schema.org markup (whether Microdata or RDFa Lite) is deployed.
One schema.org strategy for dealing with this is the 'url' property. From the getting started guide:
Using the url property. Some web pages are about a specific item. For example, you may have a web page about a single person, which you could mark up using the Person item type. Other pages have a collection of items described on them. For example, your company site could have a page listing employees, with a link to a profile page for each person. For pages like this with a collection of items, you should mark up each item separately (in this case as a series of Persons) and add the url property to the link to the corresponding page for each item, like this:
<div itemscope itemtype="http://schema.org/Person">
<a href="alice.html" itemprop="url">Alice Jones</a>
</div>
<div itemscope itemtype="http://schema.org/Person">
<a href="bob.html" itemprop="url">Bob Smith</a>
</div>
The central challenge here is to allow simplicity for authors and publishers, while making it possible to reconstitute a useful entity-relationship data graph from markup. It is also important to be able to indicate when two different pages are talking about the same underlying real-world entity.
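To make the consumer's side of this concrete, here is a minimal sketch of how the 'url' markup from the getting started guide might be turned back into a list of entities. It uses only Python's standard html.parser module. The class name ItemUrlExtractor and the "label" key are hypothetical names chosen for illustration, and the parser deliberately ignores nested items and every property other than 'url'; a real Microdata consumer would need to follow the full Microdata extraction rules.

```python
from html.parser import HTMLParser


class ItemUrlExtractor(HTMLParser):
    """Collects one record per itemscope element (hypothetical helper).

    Simplifying assumptions: items are not nested, and only the 'url'
    property is of interest. The link text is kept as a display label.
    """

    def __init__(self):
        super().__init__()
        self.items = []            # one dict per item found so far
        self._current = None       # the item currently being filled in
        self._in_url_link = False  # True while inside an itemprop="url" link

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "itemscope" in attrs:
            # A new item starts; remember its declared type.
            self._current = {"type": attrs.get("itemtype"), "url": None, "label": ""}
            self.items.append(self._current)
        elif self._current is not None and attrs.get("itemprop") == "url":
            # The 'url' property points at the page that is 'about' the item.
            self._current["url"] = attrs.get("href")
            self._in_url_link = True

    def handle_data(self, data):
        if self._in_url_link:
            self._current["label"] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_url_link = False


MARKUP = """
<div itemscope itemtype="http://schema.org/Person">
<a href="alice.html" itemprop="url">Alice Jones</a>
</div>
<div itemscope itemtype="http://schema.org/Person">
<a href="bob.html" itemprop="url">Bob Smith</a>
</div>
"""

extractor = ItemUrlExtractor()
extractor.feed(MARKUP)
extractor.close()
for item in extractor.items:
    print(item["type"], item["url"], item["label"])
```

Note that the result is two Person records whose only identifier is a page URL, which is exactly the conflation of "page" and "thing" discussed above.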
No single solution will work for all parties. The goal of this proposal is to add a simple construct that works alongside the 'url' property. While 'url' points from something to a page/record that's mostly about it and is in some sense 'its' page, sameThingAs can be used more freely wherever we have useful identifiers (direct or via-some-page) for an entity of interest.
- Jeni Tennison's detailed discussion from the W3C TAG group.
- The Nature of Connectedness on the Web by Mike Bergman, listing 45 (!) properties attempting "approximateness".
Schema.org Identity Clarifications
1. In various notations, it is possible to distinguish identifiers for the underlying real-world entity, from the record or page identifiers used for publications about that entity. For example, in Microdata, the 'itemid' attribute is available; in RDFa Lite, a comparable 'resource' attribute is available. We confirm explicitly that such identifiers are welcome and encouraged in schema.org markup, although we cannot advise at this stage on exactly which identifiers to use.
2. We add a property to the Thing type, called 'sameThingAs'.
The value of 'sameThingAs' can be another Thing (really, the same thing; there's only one underlying entity). This is used with the kinds of direct entity identifiers we see in (Microdata) 'itemid' and (RDFa Lite) 'resource' attributes. It can also be a document. For example, we might link from a description of Tom Baker to the page on Wikipedia about him.
3. We clarify that the schema.org 'url' property isn't directly applicable in this case, since there is no strong association between Tom Baker and the Wikipedia page, beyond the relationship by topic. We keep 'url' for the stronger case where the page is in some sense 'his'; roughly the notion of a 'homepage'.
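As a rough illustration of how points 2 and 3 might look in published markup, the sketch below generates a Microdata description of a Person. The function person_markup is a hypothetical helper, 'sameThingAs' is the property proposed here rather than an established schema.org term, and the Wikipedia URL for Tom Baker is assumed for the example. The 'url' property is emitted only when the publisher has a page that is in some sense the person's own.

```python
from html import escape


def person_markup(name, homepage=None, same_thing_as=()):
    """Render a Person item as Microdata (hypothetical helper).

    homepage      -> emitted as 'url', reserved for a page that is 'his' or 'hers'
    same_thing_as -> emitted as the proposed 'sameThingAs' property, one link
                     per identifier or page that indicates the same entity
    """
    lines = ['<div itemscope itemtype="http://schema.org/Person">']
    lines.append(f'  <span itemprop="name">{escape(name)}</span>')
    if homepage is not None:
        lines.append(f'  <link itemprop="url" href="{escape(homepage)}">')
    for target in same_thing_as:
        lines.append(f'  <link itemprop="sameThingAs" href="{escape(target)}">')
    lines.append("</div>")
    return "\n".join(lines)


# Tom Baker has no page here that is 'his', so no 'url' is given;
# the Wikipedia page (URL assumed for illustration) is linked by topic only.
markup = person_markup(
    "Tom Baker",
    same_thing_as=["http://en.wikipedia.org/wiki/Tom_Baker"],
)
print(markup)
```

Keeping the two properties as separate arguments mirrors the design choice in point 3: a topical link never needs to be promoted to 'url' just to be expressed.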
Take the case of the actor, director and writer Douglas Adams.
There are pages about him:
- Rotten Tomatoes, http://www.rottentomatoes.com/celebrity/douglas_adams/
- IMDB, http://www.imdb.com/name/nm0010930/
- Wikipedia, http://en.wikipedia.org/wiki/Douglas_Adams
- Freebase, http://www.freebase.com/view/en/douglas_adams
Clearly enough, we have four of something (pages) and one of something (the person). The schema:sameThingAs relationship holds between any pair here (or between any of these and Douglas Adams himself).
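One way a consumer might act on such links is to group identifiers into clusters, one cluster per underlying entity. The sketch below does this with a small union-find structure over the four URLs listed above. It assumes that sameThingAs may be treated as symmetric and transitive; the proposal states only that the relationship holds between any pair in this example, so that treatment is an assumption of the sketch, and cluster_same_thing is a hypothetical function name.

```python
def cluster_same_thing(pairs):
    """Group identifiers connected by sameThingAs links (hypothetical helper).

    pairs: iterable of (uri_a, uri_b) tuples, each one sameThingAs assertion.
    Returns a list of sets, one set of URIs per inferred entity.
    Assumption: links are treated as symmetric and transitive.
    """
    parent = {}

    def find(x):
        # Locate the representative of x's cluster, shortening paths as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for uri in list(parent):
        clusters.setdefault(find(uri), set()).add(uri)
    return list(clusters.values())


ROTTEN = "http://www.rottentomatoes.com/celebrity/douglas_adams/"
IMDB = "http://www.imdb.com/name/nm0010930/"
WIKIPEDIA = "http://en.wikipedia.org/wiki/Douglas_Adams"
FREEBASE = "http://www.freebase.com/view/en/douglas_adams"

# Three assertions, perhaps published by three different sites, are
# enough to connect all four pages.
links = [(IMDB, WIKIPEDIA), (WIKIPEDIA, FREEBASE), (ROTTEN, IMDB)]
clusters = cluster_same_thing(links)
print(len(clusters), "entity,", len(clusters[0]), "pages")
```

The cluster says only that the four pages indicate one entity; it does not say the pages themselves are identical, which is the distinction from owl:sameAs drawn below.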
The existing W3C 'owl:sameAs' property asserts strong, absolute identity.
If we said 'http://en.wikipedia.org/wiki/Douglas_Adams owl:sameAs http://www.rottentomatoes.com/celebrity/douglas_adams/', we would be saying that these are two identifiers for the same thing.
What we want to say with schema:sameThingAs is a little different. We're saying that there is one underlying real world entity, but allowing the relationship type to be used also between documents that indirectly indicate that entity.
- Q: Why didn't you use owl:sameAs? A: We would have (rightly) been accused of over-using it.
- Q: Why not just use 'url' from schema.org? A: This was a possibility, but the current design keeps 'url's meaning more restricted.
- Q: When we get two URIs related via sameThingAs, how do we know if each link is 'the thing' or 'a page about the thing'? A: This is somewhat heuristic, but the point is that existing data will already be mixing these together...
- IRC chat with Ed Summers, Dan Brickley, Mo McRoberts.