Linking Data

From Spatial Data on the Web Working Group

Best Practice theme: Linking Data

(Summary of the email discussion threads on Linked-data. Theme identified in initial work on the BP Consolidated Narratives)

Summary of discussion

  1. There are different levels of abstraction: dataset, feature (Thing) and representation of the feature (abstractions such as geometry and topology); links occur between all of these resource types
  2. The primary resource type (from SDW perspective) is the feature (see question below); datasets are being dealt with by DWBP, geometry and topology are treated as attributes of the feature (just like other data about the feature)
  3. Spatial datasets are different in that they will (usually) have an "extra level of structure and granularity that is reflected in the data" e.g. the use of geometry and spatial relationships
  4. Representations may be provided with varying degrees of authority and currency, for differing scales and purposes
  5. Links are first-class citizens - it is 'connectedness' that makes for "Data on the Web"; allowing one to discover new information by traversing those links
  6. Links should be explicit and discoverable - not only at the dataset level, but also for the resources that those datasets describe (and preferably without having to download & parse the entire dataset) ... we want to be able to express links between resources in different datasets
  7. The need to express Links places requirements on our recommended choice of encodings (see the related discussion about JSON being a link-poor format (email thread))
  8. Datasets need to be identified
  9. Features (Things) need to be identified
  10. Representations (geometry, topology) should be identified - and must be identified if they are managed elsewhere (i.e. in another dataset)
  11. Identifiers should be URLs (HTTP URIs) and must be globally unique - see Architecture of the World Wide Web; Identification
  12. In the absence of a URI for the feature (e.g. when only a thematic identifier is used, such as hotel name / address combination), create one (& reconcile later if necessary) ... avoid the use of blank-nodes because their behaviour is inconsistent across implementations
  13. (from the subject of identifying sub-sets of Coverages) Identifiers for the resource should be separate from the URL of the service-end point(s) that can be used to access that resource ... we need a "protocol independent" way to identify [coverage] dataset subsets
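
Points 11 to 13 can be illustrated with a minimal Python sketch. The namespace `http://example.org/id/hotel/`, the service end-point and the helper names are all hypothetical; none of them come from the discussion. The sketch derives a repeatable URI from a thematic identifier (name plus address) so that the feature is not left as a blank node, and keeps that identifier separate from the URL of the service used to fetch data about it.

```python
import hashlib
import re
from urllib.parse import urlencode

# Hypothetical namespace and service end-point, for illustration only.
ID_BASE = "http://example.org/id/hotel/"
SERVICE_ENDPOINT = "http://example.org/api/features"


def normalise(text: str) -> str:
    """Collapse whitespace and ignore case so trivial variants match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def mint_feature_uri(name: str, address: str) -> str:
    """Derive a repeatable URI from a thematic identifier."""
    key = normalise(name) + "|" + normalise(address)
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
    return ID_BASE + digest


def access_url(feature_uri: str) -> str:
    """Build a request URL; the feature URI itself stays the same
    whichever service end-point is used to reach it."""
    return SERVICE_ENDPOINT + "?" + urlencode({"subject": feature_uri})


uri = mint_feature_uri("Grand Hotel", "1 High Street, Anytown")
print(uri)
print(access_url(uri))
```

Hashing the normalised name and address is only one possible minting scheme; it makes the URI repeatable, but the URI would still need to be reconciled later if the same hotel is also described under another identifier.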

A few other questions occur:

  1. In our discussion, we are falling back on an RDF-centric approach; are we really going to say “if you’re starting from scratch, use RDF (because of its expressibility) - but if you’ve already got data or a tool-chain in place, do it like this?"
  2. Is hypermedia beyond our scope? Erik Wilde says: “‘webby data’ is a necessary but not sufficient condition to have hypermedia; hypermedia is not (only) about linking data (i.e., using "web data"), it's also about providing navigational affordances to get things done with that data. this means that the links are about *services* (or whatever you might call this).”
  3. We expect Content Negotiation to provide access to different encodings
  4. We need a Glossary
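
On point 3, a rough sketch of how a server might pick an encoding from an HTTP `Accept` header is given below. It is a simplified reading of content negotiation (exact media type first, then `type/*`, then `*/*`, using the `q` weights); the function names and the list of offered encodings are assumptions made for illustration.

```python
from __future__ import annotations


def parse_accept(header: str) -> list[tuple[str, float]]:
    """Split an Accept header into (media range, q-value) pairs."""
    prefs = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        if not fields[0]:
            continue
        q = 1.0
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0  # treat a malformed weight as "not acceptable"
        prefs.append((fields[0].lower(), q))
    return prefs


def quality(media_type: str, prefs: list[tuple[str, float]]) -> float:
    """Return the q-value of the most specific matching media range."""
    main_type = media_type.split("/")[0]
    for candidate in (media_type, main_type + "/*", "*/*"):
        for media_range, q in prefs:
            if media_range == candidate:
                return q
    return 0.0


def negotiate(accept_header: str, available: list[str]) -> str | None:
    """Pick the offered encoding the client prefers (server order breaks ties)."""
    prefs = parse_accept(accept_header)
    best, best_q = None, 0.0
    for media_type in available:
        q = quality(media_type, prefs)
        if q > best_q:
            best, best_q = media_type, q
    return best


# Hypothetical set of encodings offered for one feature.
OFFERED = ["text/turtle", "application/ld+json", "text/html"]
print(negotiate("application/ld+json;q=0.9, text/html;q=0.5", OFFERED))
```

The same feature URI is used in every case; only the representation returned differs.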

Question: is a Feature the Real World Thing?

ISO 19101 -- Geographic information - Reference model states:

  • [4.11] feature: abstraction of real world phenomena
  • [4.12] feature attribute: characteristic of a feature ...
    • EXAMPLE 2 A feature attribute named ‘length’ may have an attribute value ’82.4’ which belongs to the data type ‘real’.

The definition of feature attribute is clear- it's a piece of information about the feature.

feature is not quite so clear. In this context, what does abstraction mean?

Typically, the Linked Data community refers to Real-world ‘Things’ (see Designing URI sets for the UK public sector); real-world Things (or just Things) "are the physical and abstract ‘Things’ that may be referred to in statements". Examples include a school, a road, a person (physical); a government sector, an ethnic group, an event (abstract).

A commonly used example is Manchester Piccadilly Railway Station. A URI for Manchester Piccadilly Railway Station would refer to the real station, constructed from steel and concrete with thousands of people passing through it each day. Clearly one cannot expect an HTTP request to return the real railway station (!); it returns an information object about the railway station.

W3C URLs in Data (FPWD) discusses the need to differentiate between the real Thing and the information resource that describes it. The Publishing Data section provides three strategies for doing so.

In the Geographic Community, the Feature is seen as an information resource - which is, in some way, related to the real-world Thing. INSPIRE (Generic Conceptual Model) refers to these resources as Spatial Objects: "abstract representation of a real-world phenomenon related to a specific location or geographical area". It notes that the term is "synonymous with "(geographic) feature" as used in the ISO 19100 series" and, later, talks about versioning the Spatial Objects. Clearly, you can only version the record of information held about a real world Thing, not the Thing itself?

So the question remains: are we identifying real-world Things (both physical and abstract) or information objects that describe them? Once that's decided, we need to get our terminology clear and stick to it!

What are the challenges people face for 'linking data'?

Linking data is all about making the data “webby”. @dret provides some useful insight in his Overview of Web Data principles (also see the hypermedia discussion on the WG-Comments email list)

Let's make some assumptions

  • We're working with structured data- not things like PDF
    • Spatial data can often be found in the content of HTML pages- typically in non-structured (textual) elements. HTML will continue to be one of the main mechanisms by which people will publish spatial data- but we want our best practice to provide clarity on how to do that in a structured way (e.g. using RDFa, microdata or embedded JSON-LD - a la schema.org)
    • [eparsons] Spatial data?? ( less than the commonly quoted 80%)
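
As an illustration of the "embedded JSON-LD a la schema.org" option mentioned above, the following minimal sketch builds a schema.org `Place` description and wraps it in a `script` element for inclusion in an HTML page. The place URI, name and coordinates are invented, and the helper names are hypothetical.

```python
import json


def place_jsonld(uri: str, name: str, lat: float, lon: float) -> dict:
    """Describe a place using schema.org terms (Place, GeoCoordinates)."""
    return {
        "@context": "https://schema.org",
        "@type": "Place",
        "@id": uri,
        "name": name,
        "geo": {
            "@type": "GeoCoordinates",
            "latitude": lat,
            "longitude": lon,
        },
    }


def embed_in_html(doc: dict) -> str:
    """Wrap a JSON-LD document in a script element.

    "</" is escaped as "<\\/" (still valid JSON) so that text inside
    the data cannot close the script element early.
    """
    payload = json.dumps(doc, indent=2).replace("</", "<\\/")
    return '<script type="application/ld+json">\n' + payload + "\n</script>"


# Invented example values.
doc = place_jsonld("http://example.org/id/place/1", "Example Square", 51.0, -1.0)
print(embed_in_html(doc))
```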

Where do links commonly occur (in spatial data)?

  • statistical data (& other ‘aspatial’ data) to geographic features (e.g. administrative regions, roads etc.)
  • provenance
  • spatial and temporal reasoning- including containment hierarchies of one ‘place’ within another (and other topological/mereological relationships); specific need for “Mutually Exclusive Collectively Exhaustive” (MECE) set
  • relating a real-world resource to one or more geometry objects (information resource) that describe it (also see "What should I link to?")

Often, the 'spatial' and 'aspatial' data are published in different datasets (possibly by different providers). For example, a set of places might be used as the basis for an air-quality dataset. Here we can see the spatial dataset being used as a 'dimension' with which we can query into the air quality dataset [...]
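
A minimal sketch of the "spatial dataset as a dimension" idea follows. Both datasets, the `refArea` and `no2` keys and all values are invented for illustration; the only point being made is that the air-quality observations refer to places by URI, so the two datasets can be published separately and still be joined.

```python
# Spatial dataset: places identified by (hypothetical) URIs.
places = {
    "http://example.org/id/place/A": {"name": "Area A"},
    "http://example.org/id/place/B": {"name": "Area B"},
}

# Aspatial dataset: observations that link to a place by its URI.
# All values are invented.
observations = [
    {"refArea": "http://example.org/id/place/A", "period": "2015-01", "no2": 40.0},
    {"refArea": "http://example.org/id/place/A", "period": "2015-02", "no2": 44.0},
    {"refArea": "http://example.org/id/place/B", "period": "2015-01", "no2": 30.0},
    {"refArea": "http://example.org/id/place/C", "period": "2015-01", "no2": 99.0},
]


def observations_for(place_uri, observations):
    """Follow the link 'backwards': all observations about one place."""
    return [o for o in observations if o["refArea"] == place_uri]


def mean_by_place(places, observations, measure):
    """Use the spatial dataset as the dimension for a simple aggregate."""
    result = {}
    for uri, place in places.items():
        values = [o[measure] for o in observations_for(uri, observations)]
        if values:
            result[place["name"]] = sum(values) / len(values)
    return result


print(mean_by_place(places, observations, "no2"))
```

The observation about `.../place/C` is silently ignored here because that URI is not in the spatial dataset; a real application would have to decide how to handle such dangling links.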

What should I link to? (or link between)

  • Information resources (document centric) - or the things that the information resources describe (real-world entities)
    • DEPENDENCY: assignment of identifiers
  • how to link when all you have is a thematic reference- such as a place name rather than an explicit identifier

We chose to apply a "linked data perspective"; this is different to trad-GIS. We see the geometry / representation, e.g. "the line-string as centre-line of the road", as an attribute of the real-world thing.

(JonBlower and Ed note that we are interested in the real-world thing as a primary concern- that is what we should link to)

We're interested in making durable connections between Things - whether they are physical things in the real-world, concepts or abstractions (such as geometry). Generally, real-world things in the physical world are more durable than the information that describes them ... Certainly, _we_ live in the real world so are more likely to want to use real-world things as our frame of reference. I'm interested in, say, this particular road segment where road-traffic accidents occur, not the geometry object that was created and published the last time the road was surveyed. That said, I'm interested in seeing that geometry as an _attribute_ of the road segment so that I can fix its place and shape in the real world.
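
The road-segment example can be sketched as a small data structure in which the real-world thing carries the durable identifier and each surveyed geometry is just an attribute of it. The class names, URIs, WKT strings and dates below are all invented for illustration.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Geometry:
    uri: str        # geometries may be identified too
    wkt: str        # e.g. the centre-line as a LINESTRING
    surveyed: date  # currency of this representation


@dataclass
class RoadSegment:
    uri: str        # durable identifier for the real-world thing
    name: str
    geometries: list = field(default_factory=list)

    def latest_geometry(self):
        """Return the most recently surveyed geometry, if any."""
        return max(self.geometries, key=lambda g: g.surveyed, default=None)


# Invented example: links target segment.uri, which does not change
# when the road is re-surveyed and a new geometry is added.
segment = RoadSegment("http://example.org/id/road-segment/42", "High Street (north)")
segment.geometries.append(
    Geometry("http://example.org/id/geometry/42-2010", "LINESTRING (0 0, 1 1)", date(2010, 5, 1))
)
segment.geometries.append(
    Geometry("http://example.org/id/geometry/42-2015", "LINESTRING (0 0, 1 1.1)", date(2015, 3, 1))
)
print(segment.latest_geometry().uri)
```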

(JonBlower notes the use of Twitter hashtags as (spontaneous) identifiers, or "anchors", in the absence of anything better (such as persistent URLs). LarsG, Kerry and JonBlower note that this practice is far from perfect- hashtags are non-durable labels rather than globally unique persistent identifiers. Even so, people are actually using hashtags to meet their needs in the absence of anything better. Kerry suggests that finding an identifier already in [wide-spread] use, such as from Geonames, should be the first aim and, failing that, users should mint their own URLs. That said, many people are discouraged from minting identifiers as they may lack the authority to control use of a domain-name or may lack the commitment to maintain resolvable URLs. [I think that this topic is more closely aligned with assignment of identifiers])

(AHarth notes that it is common, say, in schema.org data to refer to hotels without assigning unique identifiers ... "90GB of hotel data" ... in which case each hotel is treated like a blank node)

(AHarth suggests looking at schema.org for inspiration re linking without identifiers)

Where resources are not assigned unique identifiers, e.g. they are treated like blank nodes, they may be reconciled later by establishing that two, or more, nodes refer to the same resource ... Reconciliation may be done using such things as OpenRefine
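
To make the idea of later reconciliation concrete, here is a deliberately naive sketch. It is not how OpenRefine works; it simply treats two identifier-less records as the same resource when their names are very similar and their coordinates are within a short distance. The thresholds, field names and records are assumptions chosen for illustration.

```python
import math
from difflib import SequenceMatcher


def distance_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres (haversine, spherical Earth)."""
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def same_resource(a, b, min_similarity=0.85, max_distance_m=100.0):
    """Guess whether two records describe the same real-world thing."""
    name_a = a["name"].strip().lower()
    name_b = b["name"].strip().lower()
    similarity = SequenceMatcher(None, name_a, name_b).ratio()
    near = distance_m(a["lat"], a["lon"], b["lat"], b["lon"]) <= max_distance_m
    return similarity >= min_similarity and near


def reconcile(records):
    """Group records into clusters that appear to share a subject."""
    clusters = []
    for rec in records:
        for cluster in clusters:
            if same_resource(cluster[0], rec):
                cluster.append(rec)
                break
        else:
            clusters.append([rec])
    return clusters


# Invented records, each of which arrived without an identifier.
records = [
    {"name": "The Grand Hotel", "lat": 53.4770, "lon": -2.2300},
    {"name": "the grand hotel ", "lat": 53.4771, "lon": -2.2301},
    {"name": "Station Cafe", "lat": 53.4770, "lon": -2.2300},
]
print([len(c) for c in reconcile(records)])
```

Each resulting cluster is a candidate for a single minted URI; borderline matches would still need human review.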

(Ed notes that Google Knowledge Graph is underpinned by Wikidata; [it] keeps count of references about what it considers to be the same thing and eventually asserts that the "thing" must be worth assigning an identifier to and creating a new node in the Knowledge Graph)

(BillRoberts says that we should encourage "semantic precision" (use of explicit identifiers) - but recognises that this is unrealistic for many cases. We need to work with the real world content which is less precise)

[Linda] Something about automatically creating links (alignment), see http://www.pilod.nl/wiki/Boek/Gueret-Linking for an overview of the topic. This is not especially spatial, though.

[jtandy] notes the use of Silk in the MELODIES project to automatically "generate links between related data items within different Linked Data sources"

What do the links mean?

Link semantics is a complex topic ...

To make use of links, you need to understand what those relationships _mean_

What are the common vocabularies that should be used to describe links?

Are the link types particular to a given domain? Is it possible to specify general purpose definitions? Can domain-specific vocabularies be mapped to common vocabularies (& if so, how)?

From the WG email discussion on hypermedia:

  • SimonCox: “Linked data relies first on (i) stable, resolvable URIs, (ii) open formats, and (iii) hyperlinks, so let's make sure that message gets across first and is not buried in premature focus on semantics”
  • RobAtkinson: "I think a star that matters is missing - which is to make the meaning of hyperlinks explicit and discoverable - this is far more useful than putting the data into RDF per se, but one could argue thats the underlying intent of using RDF, in that such links have URIs for link predicates - and there is an implication regarding what those URIs should resolve to."
  • Kerry: "In linked data, the meaning of links is always explicit and discoverable […]. What we *can* do in this group is to advise on using linking vocabulary that is well-defined and, if we cannot find such vocabulary already, to create and define whatever is missing in the spatial space"
  • [... a few detailed posts ...]
  • SimonCox: "[…] I’m finding it still necessary to establish the more basic principles (fine-grained well-managed URIs, hypertext). Mention of RDF and semantic web technologies too esoteric for most web developers, who only know JSON"

So when considering "link semantics" we need to square concerns such as:

  • helping people choose the right link-type ... owl:sameAs is much over-used and usually wrong, and when is it OK just to use SKOS?
  • how to relate our link-types to upper ontologies - or should we even try?
  • how we can use ontologies behind the scenes to make sure we don't get an infinite number of incompatible JSON encodings for the same data
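
As a small illustration of the first concern, the sketch below forces the publisher to state what kind of relationship they mean before a link is written, rather than defaulting to owl:sameAs. The predicate URIs are the real OWL and SKOS ones; the mapping from relationship labels to predicates is a hypothetical policy invented here, not a recommendation from the group. Note that the SKOS mapping properties are defined for relating concepts.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
SKOS_EXACT_MATCH = "http://www.w3.org/2004/02/skos/core#exactMatch"
SKOS_CLOSE_MATCH = "http://www.w3.org/2004/02/skos/core#closeMatch"

# Hypothetical policy: which predicate to use for which stated relationship.
LINK_TYPES = {
    "identical": OWL_SAME_AS,               # the very same thing in every respect
    "equivalent-concept": SKOS_EXACT_MATCH,
    "similar-concept": SKOS_CLOSE_MATCH,
}


def typed_link(subject: str, relationship: str, target: str) -> str:
    """Serialise one typed link as an N-Triples statement."""
    if relationship not in LINK_TYPES:
        raise ValueError("unknown relationship: " + relationship)
    return "<{}> <{}> <{}> .".format(subject, LINK_TYPES[relationship], target)


print(typed_link(
    "http://example.org/id/region/north",
    "similar-concept",
    "http://example.com/def/area/northern",
))
```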

...

@dret's Principles of web data (using a slightly different 5-star rating than the one we're used to) suggest that "web data" should be:

  • (1-star) Linkable
  • (4-star) Linked

Issues arising from this include:

… use unique, global, _durable_ identifiers
… need to be able to discover resources so that you can link to them
… how to assert and maintain links between large sets of resources? (people are lazy)
… what about ‘back-links’ … finding them, dealing with too many
… how are links expressed? (in a doc, in a link header)
… how are links characterised? (e.g. typing)
… how are hints about the target resource conveyed?
… links should be typed (implicitly or explicitly) so that client applications can decide which links to follow when traversing a web of interlinked resources to reach application goals
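
One of the options above, expressing typed links in an HTTP `Link` header, can be sketched as follows. The link targets are invented; `describedby` and `alternate` are registered link relation types, and the `type` parameter is used as a hint about the target resource. The helper names are hypothetical.

```python
def link_header(links):
    """Serialise typed links as the value of an HTTP Link header."""
    parts = []
    for link in links:
        params = ['rel="{}"'.format(link["rel"])]
        if "type" in link:
            # Hint about the media type of the target resource.
            params.append('type="{}"'.format(link["type"]))
        parts.append("<{}>; {}".format(link["href"], "; ".join(params)))
    return ", ".join(parts)


def links_with_rel(links, rel):
    """Let a client decide which links to follow, based on link type."""
    return [link["href"] for link in links if rel in link["rel"].split()]


# Invented links for a hypothetical road-segment resource.
links = [
    {"href": "http://example.org/doc/road-segment/42", "rel": "describedby", "type": "text/turtle"},
    {"href": "http://example.org/doc/road-segment/42.json", "rel": "alternate", "type": "application/json"},
]
print("Link: " + link_header(links))
print(links_with_rel(links, "describedby"))
```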

How can I describe links in my format of choice?

  • RDF (& concrete encodings: TTL, JSON-LD etc.)
  • (plain-old) JSON ... and variants such as GeoJSON
  • XML ... and variants such as GML, KML
  • HTML ... using structured mark-up (RDFa, microdata or embedded JSON-LD)
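
To show the difference between plain-old JSON and JSON-LD for expressing links, the sketch below takes a plain JSON record in which a link is just a string value, then adds a JSON-LD `@context` that says which key is a link and what it means. The record and its URIs are invented; `sfWithin` is a GeoSPARQL property. The `extract_links` helper is a hand-written reading of the context for illustration only, not a JSON-LD processor.

```python
import json

# Plain JSON: nothing tells a client that "within" holds a link.
record = {
    "id": "http://example.org/id/place/1",
    "name": "Example Square",
    "within": "http://example.org/id/district/9",
}

# A JSON-LD context that makes the link and its meaning explicit.
CONTEXT = {
    "id": "@id",
    "name": "http://schema.org/name",
    "within": {
        "@id": "http://www.opengis.net/ont/geosparql#sfWithin",
        "@type": "@id",  # the value is a URI, not a literal string
    },
}


def add_context(plain: dict) -> dict:
    """Turn the plain record into JSON-LD without changing its keys."""
    return {"@context": CONTEXT, **plain}


def extract_links(doc: dict) -> list:
    """List (link type, target) pairs for keys the context marks as links."""
    context = doc.get("@context", {})
    links = []
    for key, value in doc.items():
        term = context.get(key)
        if isinstance(term, dict) and term.get("@type") == "@id":
            targets = value if isinstance(value, list) else [value]
            links.extend((term["@id"], target) for target in targets)
    return links


print(json.dumps(add_context(record), indent=2))
print(extract_links(add_context(record)))
```

Existing JSON tool-chains can keep producing the same keys; only the context is added.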

... other options:

Given that the concept of "spatial data on the web" is predicated on the idea of linking data together, it is essential that we cover as much of the community publishing spatial content as possible- we want their data to be "on the web" (e.g. linked to other stuff - not "using the web as a glorified USB-stick" to quote @phila). We want people creating spatial data to continue to use their existing tool chains in so far as is possible- else they will take one look at this BP and go elsewhere thinking that it is not relevant to them. Over time, they may see the value in adopting more sophisticated formats because they enable more value to be accrued from their content.

What else links to my subject of interest?

BillRoberts sees this problem occur routinely. Statistical data (expressed in RDF Data Cubes for example) often relates to geographic and administrative areas. Many statistical datasets may use the same "spatial" dataset. A common use case is to look for other statistics relating to the subject administrative area.

Beware that there is no limit to how many resources may link to (refer to) a given subject area ... there may be thousands of 'back-links'. How do we prioritise which ones to show/use?

(BillRoberts says that the situation is quite easy if you have all the necessary information in your triple-store - or even if you can federate your query across a number of known SPARQL end-points. However, what about the "unknowns" - how do you discover what else is "out there" that refers to your subject? How do you find third party content? This is a discoverability concern [see related theme "enabling discovery"])
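
For the "easy" case where the links are held in a known SPARQL end-point, a back-link query might look like the sketch below. The end-point URL and subject URI are hypothetical. Rather than listing every back-link, the query counts them per link type, which is one possible way to start prioritising when there are thousands. No request is actually sent.

```python
from urllib.parse import urlencode

ENDPOINT = "http://example.org/sparql"  # hypothetical SPARQL end-point


def backlink_query(subject_uri: str, limit: int = 20) -> str:
    """Build a SPARQL query that counts back-links by predicate."""
    forbidden = '<>"{}|^`\\'
    if any(ch in forbidden or ch.isspace() for ch in subject_uri):
        raise ValueError("not a safe IRI: " + repr(subject_uri))
    return (
        "SELECT ?p (COUNT(?s) AS ?n)\n"
        "WHERE { ?s ?p <" + subject_uri + "> }\n"
        "GROUP BY ?p\n"
        "ORDER BY DESC(?n)\n"
        "LIMIT " + str(int(limit))
    )


def request_url(query: str, endpoint: str = ENDPOINT) -> str:
    """Encode the query as a SPARQL protocol GET request URL."""
    return endpoint + "?" + urlencode({"query": query})


query = backlink_query("http://example.org/id/area/1")
print(query)
print(request_url(query))
```

This does nothing for the "unknowns": it only finds back-links that the chosen end-point already holds.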

Is there a place for (domain-specific) catalogues where one can register assertions about / references to a subject? For example sameas.org provides a place where owl:sameAs assertions can be registered ... perhaps, say, data.gov.uk could allow open data publishers to register information about resources that are identified in URISets in the data.gov.uk domain???

(Ed notes that "there be dragons" when you're trying to reconcile different statistical data that assert that they are (apparently) talking about the same location or area ... )

How can I manage links between sets of things (so that I don't have to manage individual links)?

How can third parties assert links and relationships (and should I trust their assertions?)