
TF-Graphs-UC



This page lists use cases collected by the TF-Graphs task force related to the work item “Support for Multiple Graphs and Graph Stores” (see RDF Next Steps workshop results: Graph Identification and Metadata).

See Why Graphs for a simplified list of representative use cases.


1 Storage Use Cases

When storing RDF information in a graph store, we would like to organize related information into separate graphs. Each graph must be identified with a URI to facilitate retrieval.

1.1 (A PRIORITY) Slicing datasets according to multiple dimensions

Within the BBC, we want to slice large RDF datasets according to multiple dimensions: statements about individual programmes, access control, 'ownership' of the data (what product owns/maintains what set of triples), versioning, etc. All of these graphs potentially overlap or contain one another. These issues are very common in large organisations using a single, centralised triple store.
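For illustration, a minimal TriG-style sketch, assuming a hypothetical ex: vocabulary for the dimensions (the graph names and properties are invented, not the BBC's actual terms):

:programmeGraph1 {
    <http://example.org/programmes/p1> dc:title "Example Programme" .
}
:programmeGraph1 ex:ownedBy :productA ;       # 'ownership' dimension
    ex:accessPolicy :staffOnly ;              # access-control dimension
    ex:version "3" .                          # versioning dimension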

1.2 (C priority) Permissions

Another purpose in storing RDF content in different graphs is to enforce a permissions model so that sensitive information is not accessed by unauthorized users.

1.3 (A PRIORITY) Graph Changes Over Time

When storing graph information retrieved from a URL external to an application, it becomes important to store snapshots of the location over time. When these graph snapshots are taken, it is useful to annotate each snapshot with information such as retrieval time, HTTP Headers used, HTTP Response returned, and other such items that may have affected the contents of the graph snapshot.
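A minimal sketch of such an annotated snapshot, assuming a hypothetical ex: vocabulary for the retrieval metadata:

:snapshot42 {
    <http://example.org/doc#it> dc:title "Example Document" .
}
:snapshot42 ex:retrievedFrom <http://example.org/doc> ;
    ex:retrievalTime "2011-03-12T09:00:00Z"^^xsd:dateTime ;
    ex:httpStatus "200" ;                     # HTTP response code
    ex:httpEtag "\"abc123\"" .                # a response header that may affect the contents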

1.4 (B priority) Tracing inference results

By identifying the graphs that were consumed and produced by an inference engine, and keeping track of their relationships, one can trace an inferred statement back to its premises. One can also more easily undo some reasoning, for instance when the store is updated.

:G1 { :Tom ex:manage :ACompany }

:G2 { :Tom rdf:type ex:Manager }

:G2 ex:deducedFrom :G1

1.5 (A PRIORITY) Exchanging the contents of RDF stores

Frequently, the entire contents of a SPARQL store have to be “dumped”, for purposes such as backup, replication, migration and archival. One could resort to multipart files or multi-file archives such as tar, or otherwise use a file format that can serialize an RDF Dataset.

As well as whole-store dumps, data may be added to the store as a collection of graphs to be added or merged (where a graph with that IRI already exists in the store). This can be achieved with multiple files, but a single file format can be more convenient. Such additional information may be human-authored, so it is not just a machine format.

There is a relationship to subsets of SPARQL 1.1 Update to be explored. SPARQL Update provides INSERT DATA and DELETE DATA.
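For example, adding triples to one named graph with SPARQL 1.1 Update (an illustrative sketch; the graph and resource IRIs are invented):

INSERT DATA {
  GRAPH <http://example.org/graph1> {
    <http://example.org/book1> dc:title "A new book" .
  }
}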

1.6 (C priority) Versioning in SDMX and DDI

SDMX and DDI have strong notions of ownership and versioning. Artefacts such as code lists, question banks and data structure definitions are managed by an authority (“maintenance agency”) and have, besides an authority-assigned “local” identifier, also a numeric version identifier. The identity of an artefact consists of its agency, artefact type, local identifier, version number, and (in special cases) the identity of a parent artefact. Different versions of an otherwise same artefact can exist side-by-side in the same containing XML instance. It is important to retain this versioning and maintenance scheme when expressing these standards in RDF.
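A sketch of how two versions of a code list might sit side by side as named graphs (the graph names and ex: properties are invented for illustration):

:CL_AREA_v1_0 { :CL_AREA skos:prefLabel "Area code list (version 1.0)" . }
:CL_AREA_v1_1 { :CL_AREA skos:prefLabel "Area code list (version 1.1)" . }

:CL_AREA_v1_0 ex:maintenanceAgency :Eurostat ; ex:version "1.0" .
:CL_AREA_v1_1 ex:maintenanceAgency :Eurostat ; ex:version "1.1" .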

1.7 (B priority) Graph Store Management

One way to manage a graph store is to use named graphs. The identification of subsections of the data enables easy management of the data by adding, deleting or replacing whole subsections by actions on the named graphs in the store. This works well in conjunction with the default graph being the union of the named graphs in the graph store.

1.7.1 (B priority) SPARQL 1.1 Graph Store HTTP Protocol

An example of this is the SPARQL Graph Store Protocol which describes the use of GET/PUT/DELETE/POST to manage a collection of graphs.

Graph identification includes direct and indirect cases. "Direct identification" covers the case where the graph URI is based on the graph store URI (same server, same URI path), while "indirect naming" covers the case where the graph URI is not assumed to share the graph store's path.

The indirect form uses the ?graph= query string parameter to address graphs.

  GET /rdf-graphs/service?graph=http%3A//www.example.com/other/graph HTTP/1.1
  Host: example.com
  Accept: text/turtle;charset=utf-8

The default graph is addressed with ?default.

Command line tools, such as curl and wget, can be used to manage the store.
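For instance, replacing the contents of a named graph with curl (an illustrative sketch; the endpoint and file name are invented):

  curl -X PUT -H "Content-Type: text/turtle" \
       --data-binary @graph.ttl \
       "http://example.com/rdf-graphs/service?graph=http%3A//www.example.com/other/graph"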

2 Query Use Cases

While query services are not explicitly addressed in the RDF spec, SPARQL does make use of graph IRIs and we should ensure that the semantics of graph identifiers are compatible with the way in which RDF datasets are defined by SPARQL.

2.1 Find Information In a Graph

When a query service processes a query containing a graph identifier, it must resolve the graph identifier to some collection of materialized RDF content that will be returned in the result set.

2.2 (B priority) Computed Graphs

Often, graphs exposed by a query service are not present in any sort of physical storage, but rather their contents are computed at query time. Examples include:

  • A federated query service may define a graph URI to be the union of graphs accessible through other query services.
  • A service that does RDB to RDF mapping via R2RML may dynamically compute RDF results based on SQL results at query time.

2.3 (C priority) Graph URIs as Locations

In the situation where a query service is presented with a graph identifier that is not present in local storage, the query service may wish to resolve the graph URI as a URL and make a request to that URL (possibly with content negotiation) for a document that serializes the content of that graph.

NB: It is important to consider what the linked data "Follow your nose" approach means for identified graphs.

2.4 Contextual constraints in queries

In e-Science projects we can use identified graphs to represent and query contextual metadata. For instance, evidence-based reasoning requires being able to differentiate assertions considered universally true from assertions that are competing hypotheses or interpretations. One can use identified graphs when annotating experiments (e.g. in biology) or analyses (e.g. in geology). Identified graphs are used to represent different contexts within which alternative metadata can be described.

Identifying the graphs also allows us to hierarchically organize the RDF datasets, based on RDFS entailment. When considering RDF datasets as contexts, the root of the hierarchy contains the triples that are true in any context below it, i.e. any other node of the hierarchy entails it. The other nodes of the hierarchy represent specific contexts; each one recursively inherits and adds to the triples of its ancestors. Each node then provides a different context for querying and reasoning. When a hypothesis is tested (as a SPARQL query), the context of the test is specified by the identifier of the graph to be used.
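For instance, testing a hypothesis only within one context graph (a sketch; the graph name and vocabulary are invented):

SELECT ?sample WHERE {
  GRAPH <http://example.org/contexts/hypothesisA> {
    ?sample <http://example.org/vocab#indicates> <http://example.org/vocab#VolcanicOrigin> .
  }
}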

A special case is the introduction of temporal or geographical aspects in querying and reasoning over the triple store: a query may be solved considering only the assertions that are true in a specific range of time or geographical area.

A way to address this family of scenarios is to allow a basic algebra of sets over the identified graphs, for instance allowing one to assert inclusion:

:G1 { <http://dbpedia.org/page/Nice> geo:lat 43.703392 ;
                                     geo:long 7.266274 . }

:G2 ex:includes :G1
:G2 { <http://dbpedia.org/page/Nice> ex:belongsTo <http://dbpedia.org/page/France> }

:G3 ex:includes :G1
:G3 { <http://dbpedia.org/page/Nice> ex:belongsTo <http://dbpedia.org/page/Italy> }

3 Publishing Use Cases

3.1 (C priority) Wikidata

(From Denny Vrandecic)

Wikidata will be a website that allows the collaborative creation, maintenance, and curation of facts. For simplicity, in this use case we assume that every fact is a triple, e.g. "Paris locatedIn France". But, like Wikipedia, Wikidata is not meant to hold original research; it is only meant as a secondary (or, preferably, even tertiary) source. So for each such triple it needs to give references, e.g. "Paris locatedIn France" hasReference EncyclopediaBritannica. Using named graphs with plain RDF means that Wikidata would need to deal with far too many HTTP calls, which is far too inefficient. That is, the normal Paris graph may only include:

ParisFact1 hasReference EncyclopediaBritannica.

and there would be a further graph called ParisFact1 with the actual content:

Paris locatedIn France .

This leads to at least (number of references + 1) HTTP requests for a single entity, which can easily be in the hundreds. This is not acceptable.

Some standard syntax that would allow us to state the following would be really useful, but is not currently available:

(Paris locatedIn France) hasReference EncyclopediaBritannica, Wikipedia, Brockhaus .
(Paris hasPopulation "7000000"^^xsd:int) hasReference Wikipedia

3.2 Marking published artefact as strongly versioned

DDI and SDMX impose a policy that an artefact can be marked as “published”, and once published it MUST NOT be changed unless a new version number is assigned. A standard method of signalling that a publisher intends to adhere to such a policy would be good.

3.3 Composition of a logical artefact from multiple published artefacts

An actual SDMX or DDI instance is often composed from various artefacts that can be maintained by different agencies. For example, an SDMX dataset that reports national statistics may use a data structure definition maintained by a supranational statistics organization such as Eurostat, and may use code lists defined by various standards bodies. Each of these artefacts is independently versioned. When referencing an artefact, the specific version referenced has to be part of the reference. The exception is “late binding”, where the latest available version is assumed, leading to a setup that is easier to manage but more brittle.

The XML specifications contain special elements that agencies can use to publish collections of re-usable artefacts.

When an actual SDMX dataset or DDI instance is processed, the processor must be able to retrieve the correct versions of all referenced artefacts and build a complete representation.

4 Provenance Use Cases

One advantage of the RDF data model is the ease with which data from different sources can be combined. Aggregating data in a single place for exploration, querying, and visualization can be as easy as loading everything into a single model. Nevertheless, it is often important to retain provenance information. What came from where? How were the different data produced or obtained?

4.1 FOAF Use Case

A large number of FOAF files have been published around the Web. FOAF aggregators collect these files and attempt to answer questions that require information from multiple files, such as: “How old is Dan, according to people who are his colleagues?” This requires:

  • partitioning triples by source,
  • querying triples while taking source information into account (see the sketch after this list),
  • verifying a document publisher's identity (web of trust, digital signatures, FOAF+SSL/WebID, etc.),
  • serializing entire repositories of relevant information, including who-said-what metadata, to some standard form, such that they can be reconstituted elsewhere.
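A SPARQL sketch of the second point, assuming each collected FOAF file is stored as a named graph under its own URI and linked to its primary topic via foaf:isPrimaryTopicOf (ex:dan and ex:colleagueOf are invented):

SELECT ?age ?doc WHERE {
  ex:dan ex:colleagueOf ?colleague .
  ?colleague foaf:isPrimaryTopicOf ?doc .
  GRAPH ?doc {                          # only what the colleague's own file asserts
    ex:dan foaf:age ?age .
  }
}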

Full text: TF-Graphs-UC/FOAF Use Case

4.2 (C priority) Web crawling

Two issues arise when crawling RDF documents from the Web:

  • Statements come from untrusted sources; therefore, it is important to know which statement came from which source (site/domain)
  • When re-crawling documents in order to update their contents, one needs to know which triples originally came from which document

One way of approaching this is to store the triples parsed from each document in a separate named graph. The simplest scheme uses the document URI as the graph name. The source is recorded, as the DNS domain is part of the document URI. Re-crawling can be implemented by simply replacing the graph's contents.
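For example, re-crawling a single document with SPARQL 1.1 Update (a sketch; the document URI is invented):

# Discard the previous snapshot of this document, if any
DROP SILENT GRAPH <http://site.example/data/doc1> ;
# Re-fetch the document and store its triples under its own URI
LOAD <http://site.example/data/doc1> INTO GRAPH <http://site.example/data/doc1>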

There's a related, but slightly different case where the documents being crawled aren't RDF, but the application is generating triples based on the content of the documents. Thus the graph is not a direct representation (in the webarch sense) of the crawled resource.


4.3 Sharing collections of RDF documents

The contents of an RDF store are often collected from a number of different sources. Creating such a collection can in itself be a major effort, and is hard to reproduce. Examples include large-scale web crawls of RDF documents, and extractions of structured data from large numbers of text documents.

The creators of such collections sometimes want to share the collection with third parties. Keeping provenance (which triple came from which source URL?) intact is very important.

This is for instance done for the Billion Triple Challenge, which in fact does not provide a billion triples but a billion quadruples, so that sources are identified.


4.4 (C priority) Backup / Restore of Triplestores

In a similar vein to the use case "Sharing collections of RDF documents" above, the ability to back up and restore dumps of triplestores needs to be considered. There may be subtleties related to the correct restoration of bNodes to their scoped graphs, and these should be addressed by any serialisation of quads.

Furthermore, from a parsing point of view it would be good to know from the outset whether or not a file contains quads or triples, so that one would not need to parse the entirety of a file before finding out that it has both quads and triples.

4.5 Digital Signatures on Graphs

There are a number of ways to create digital signatures on RDF graphs. Often, you do not want to co-mingle the signature information and the graph. Co-mingling signature information in a graph requires the software to use an algorithm to clean the graph in order to generate the signature hash for verification purposes. It also makes it very difficult to sign a graph that itself contains a digital signature at the top-most level. In order to express a digital signature on a graph of information, the idea of a Graph Literal becomes useful. As an example, one might want to digitally sign a graph such as the following (a minimal sketch, assuming a hypothetical ex: signature vocabulary):
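:G1 {
    <http://example.org/contract/42> ex:amount "10000.00"^^xsd:decimal ;
        ex:approvedBy <http://example.org/people/alice> .
}
:G1 ex:signature [
        ex:signatureAlgorithm ex:GraphSHA256withRSA ;   # invented algorithm name
        ex:signatureValue "OGQzNGVkMzVm..."             # illustrative, truncated value
    ] .

Keeping the signature triples outside :G1 means the graph can be hashed as-is, without first stripping the signature from it.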

4.6 Capture elements of the production context

A graph may be produced through a variety of means and in very different contexts. For instance, it could be the result of natural language processing or other extraction techniques. An identified graph may be linked to the context in which it was produced (source, properties, etc.).

:G1 {
    <http://dbpedia.org/resource/Antibes>
            geo:lat 43.580833;
            geo:long 7.123889 .
}
:G1 ex:extractedFrom <http://en.wikipedia.org/wiki/Antibes>;
        dc:date "2010-11-12"^^xsd:date .

Note that :G1 could also be named <http://dbpedia.org/data/Antibes>.

4.7 (to drop) Applying Named Graphs to a Terminology Service Prototype

LexRDF is a prototype implementation of LexEVS on an RDF triple store.

Among its assertions is an expression of definition source using reification to resolve a non-preferred term definition. Named Graphs could eliminate some of the overhead associated with reification. An example assertion is expressed as follows:

FAO:0000025 a owl:Class;
    skos:prefLabel "mid reproductive";
    skos:altLabel "principal growth stages 6.1-6.3" .

_:A1 a rdf:Statement;
    rdf:subject FAO:0000025;
    rdf:predicate skos:definition;
    rdf:object "middle stages of reproductive phase.";
    dc:source TAIR:lr .

Using Named Graphs:

LexRDF:Graph1 {
    FAO:0000025 a owl:Class;
        skos:prefLabel "mid reproductive";
        skos:altLabel "principal growth stages 6.1-6.3" .
}
LexRDF:Graph2 {
    FAO:0000025 skos:definition "middle stages of reproductive phase." .
}
LexRDF:Graph3 {
    LexRDF:Graph2 dc:source TAIR:lr .
}

4.8 (C priority) Provenance Information and Data Retention

At Garlik we use named graphs to track the provenance of personal information held in our triplestores. These named graphs have timestamped information associated with them upon import, which is subsequently used to determine how long the data is held within our systems. This is used to comply with Data Protection laws in the UK, whereby information about individuals who are not our customers is only stored and indexed within our systems for 90 days. Furthermore, other information regarding the source of the data and the software used to process it is associated with the named graph; this information is also used to track the provenance of data that may be presented to an end user.

4.9 (A PRIORITY) Trust Web Opinions

Alice wants to find a good, local seafood restaurant. She has many ways to find restaurant reviews in RDF -- some embedded in people's blogs, some exported from sites which help people author reviews, some exported from sites which extract and aggregate reviews from other sites -- and she'd like to know which sources she can trust. Actually, she'd like the computer to do that for her, and to analyze just the trustworthy data. Is there a way the Web can convey metadata about those reviews that lets software assess the relative reliability of the different sources? (Full version provided by Sandro)

5 Providing a standard foundation for W3C specs

The 2004 recommendation set is focused on individual RDF graphs: It defines an RDF graph as a simple set of triples. It defines the semantics of such an individual graph. It defines syntaxes for serializing such an individual graph into a text. Later W3C groups, however, had requirements that went beyond this, raising questions about interactions between multiple graphs, about mutable graphs, and about the persistent identity of graphs beyond the mathematical set. They treated these topics in an ad hoc fashion, defining their own concepts and terminology that are not aligned with one another.

5.1 (B priority) SPARQL's “RDF Dataset” and “Graph store”

SPARQL 1.0 defines the concept of an RDF Dataset, a collection of graphs containing one default graph and zero or more named graphs (pairs of URI and graph). SPARQL queries are evaluated against an RDF dataset.

RDF Dataset: Discussion, formal definition (in SPARQL 1.0)

A SPARQL query is executed against an RDF Dataset which represents a collection of graphs. An RDF Dataset comprises one graph, the default graph, which does not have a name, and zero or more named graphs, where each named graph is identified by an IRI. A SPARQL query can match different parts of the query pattern against different graphs as described in section 8.3 Querying the Dataset.

An RDF Dataset may contain zero named graphs; an RDF Dataset always contains one default graph. A query does not need to involve matching the default graph; the query can just involve matching named graphs.

SPARQL 1.1 adds the concept of a Graph Store, essentially a mutable RDF Dataset. The contents of the graphs can be modified, and graphs can be added or removed. SPARQL 1.1 also adds additional discussion about the default graph: Depending on implementation, it could be a separate graph, or a union of all or some of the named graphs.
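A sketch of such Graph Store manipulation in SPARQL 1.1 Update (the graph and resource IRIs are invented):

CREATE GRAPH <http://example.org/g1> ;
INSERT DATA {
  GRAPH <http://example.org/g1> { <http://example.org/s> <http://example.org/p> "o" }
} ;
DROP GRAPH <http://example.org/g1>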

Graph Store: Discussion, formal definition (in SPARQL 1.1 Update)

A Graph Store is a repository of RDF graphs managed by a single service. Like an RDF Dataset operated on by the SPARQL 1.1 Query Language, a Graph Store contains one unnamed graph and zero or more named graphs. Operations may specify graphs to work with, or they may rely on a default graph for that operation. Unless overridden (for instance, by the SPARQL protocol), then the unnamed graph for the store will be the default graph for any operations on that store. Depending on implementation, the unnamed graph may refer to a separate graph, or it could be a representation of a union of other graphs.

Unlike an RDF dataset, named graphs can be added to or deleted from a graph store. A Graph Store need not be authoritative for the graphs it contains, i.e. the graph URIs do not need to be in the same pay-level domain as the endpoint. That means a Graph Store can keep local copies of RDF graphs defined elsewhere on the Web and modify those copies independently of the original graph.

In the simple case, where there is one unnamed graph and no named graphs, SPARQL 1.1 Update is a language for the update of a single graph.

These concepts are applicable beyond SPARQL.

5.2 (A PRIORITY) OWL's “Ontology Documents”

An OWL ontology can be serialized as RDF and stored in a graph store. The general convention is to store each ontology in a separate graph, which is in turn identified with a graph URI. The OWL spec uses the notion of an Ontology Document as the means of organizing ontologies. An Ontology Document has an IRI, but it is left open-ended what that IRI represents (a graph in a graph store? a file on a file system? a web resource?). Can the document IRI and the graph IRI that stores the ontology be the same?

This is especially relevant in the case of an ontology that imports another ontology. If an ontology is serialized as RDF and stored in a graph store, then there may be an owl:imports predicate whose object is an IRI identifying the ontology to import. How do we resolve this IRI to determine which graph contains that ontology, so that we can parse and process its contents?
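A TriG sketch of the situation, with an ontology stored in a named graph whose name matches the ontology IRI (the IRIs are invented):

<http://example.org/ont/a> {
    <http://example.org/ont/a> a owl:Ontology ;
        owl:imports <http://example.org/ont/b> .    # which graph holds the imported ontology?
}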

5.3 Scope of blank nodes in RDF Semantics

RDF Concepts defines an RDF graph as an abstract set of triples. It also defines blank nodes as being scoped to the graph. But clearly a blank node can be part of multiple graphs at the same time, e.g., of a graph and its subgraphs. So, what stops a blank node from occurring in two completely unrelated graphs?

In RDF Semantics this is handled by introducing the operation of merging graphs, which is not the same as a simple set union. Merging graphs involves some rather complicated mechanics for “standardizing apart” blank nodes.

If new concepts beyond the graph as a set of triples were defined, then this would create an opportunity to explain the semantics of RDF in a more compact and intuitive way, removing some of the confusion and stigma that surround blank nodes (if not the implementation challenges).

5.4 VoID and metadata for RDF datasets

Describing Linked Datasets with the VoID Vocabulary is a SWIG Note on a metadata vocabulary for RDF datasets. It fuzzily defines the concept of a “dataset” as a “meaningful collection of RDF triples that are published, maintained or aggregated by a single provider.” It states that datasets can be published in different ways, including as RDF dumps, via SPARQL endpoints, or as collections of multiple RDF documents, but it relies on the reader's intuition instead of solid definitions for all these terms. A proper formal definition of VoID is currently lacking and would require concepts for collections of mutable graphs.

5.5 (C priority) Supporting formal alignment of Linked Data principles with RDF and AWWW

The Web of Data, as the macro entity that emerges from the micro principles of Linked Data, can be seen as an RDF Dataset. It contains a potentially unlimited number of named graphs. The contents of graph <u> are given by the function get-and-parse, which performs an HTTP GET on <u> and parses the result with an RDF parser. An actual implementation would of course deviate from this idealized description in many ways (it would likely contain only a subset of the graphs, some of them older snapshots; there are questions of how media types are handled, what parsers are used, content-type sniffing and tag-soup parsing, redirects, etc.). But the basic intuition of this model appears to be widespread in the Linked Data community, and is implemented in a number of tools (see the Web Crawling use case above).

A formal definition of concepts such as “RDF Dataset”, “Set of Named Graphs”, and the g-box/g-snap/g-text distinction in a core RDF spec would make it easier to formally define such a model, tying together the Linked Data principles and practices, the Architecture of the World Wide Web, and the REST model of information resources and representations.

6 Advanced Annotations Use Cases

6.1 (C priority) Separate Ontology Use Case

This use case is derived from a proposal to have OWL annotations that can be collected together into a separate ontology (and that might even be able to affect the main ontology). The proposal itself can be seen at http://www.w3.org/2007/OWL/wiki/Annotation_System; however, this "use case" is somewhat of a modification of the suggestions in the proposal.

The basic need is to be able to generate multiple ontologies (http://www.w3.org/TR/2009/REC-owl2-syntax-20091027/#Ontologies, i.e., the basic OWL construct that holds OWL axioms and other stuff) from a single OWL document. One ontology is the ontology that corresponds to the main information in the document. The other ontology (or ontologies) would sit alongside the main ontology. These secondary ontologies might be used to store and reason about things like provenance or certainty.


The secondary ontologies would not necessarily contain information about the domain of the ontology (and thus need not share axioms with the main ontology) but could refer to the syntactic bits (axioms, annotations, etc.) of the main ontology. Note that this does *not* directly require reflection, as the referenced syntactic bits don't have their semantic import in the secondary ontologies. Any semantic relationship between the main ontology and secondary ontologies is mediated by relationships outside the formalism semantics, again so that there is no need for reflection or reification or ....

So far this is about (OWL) ontologies, not graphs, but it can easily be turned into a use case for named graphs. An RDF document that encodes multiple OWL ontologies would contain named graphs that encode the secondary ontologies and the main graph of the document would encode the main ontology. Uses of the names of the named graphs would encode the links between the main ontology and the secondary ontologies. How to encode the links from the secondary ontology into the main ontology remains an unsettled issue, however.

6.2 (B priority) Reasoning over annotations

We want to support reasoning based on annotations, using a generic approach as defined in [1][2][3]. Annotations include, but are not limited to:

  • temporal annotations (when is a statement valid?)
  • provenance annotations (where did a statement originate?)
  • uncertainty annotations (how likely is a statement to be true?)

For instance, if temporal annotations exist in a dataset, one can ask when a triple holds (e.g., "who was a GoogleEmployee and when?"). If:

 ex:chadhurley  rdf:type  ex:YoutubeEmployee . [2005,2010]
 ex:YoutubeEmployee  rdfs:subClassOf  ex:GoogleEmployee . [2006,2011]

then:

 ex:chadhurley  rdf:type  ex:GoogleEmployee . [2006,2010]

With provenance annotation, if:

 foaf:Person  rdfs:subClassOf  foaf:Agent . foaf:
 ex:chadhurley  rdf:type  foaf:Person . dbpedia:

then one can infer:

 ex:chadhurley  rdf:type  foaf:Agent . foaf: ∧ dbpedia:

Following the same generic framework, it is also possible to deal with fuzzy, probabilistic and uncertain information, e.g.:

 ex:pictureAreaXYZ  rdf:type  ex:HumanFace . 0.82
 ex:HumanFace  rdfs:subClassOf  ex:Ellipse . 0.75

Fuzzy-annotated RDF is likely to be produced automatically by tools relying on statistical data or heuristic-based algorithms. Terminological statements with uncertainty are very common outputs of ontology matching algorithms.

In all these situations, it is necessary to identify the triples or graphs to which the annotations attach.

7 References

  1. U. Straccia, N. Lopes, G. Lukacsy, A. Polleres. A General Framework for Representing and Reasoning with Annotated Semantic Web Data. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI-10), AAAI Press, 2010. http://axel.deri.ie/publications/stra-etal-2010AAAI.pdf
  2. N. Lopes, A. Zimmermann, A. Hogan, G. Lukácsy, A. Polleres, U. Straccia, S. Decker. RDF Needs Annotations. In RDF Next Steps, June 2010. http://www.w3.org/2009/12/rdf-ws/papers/ws09
  3. N. Lopes, A. Polleres, U. Straccia, A. Zimmermann, AnQL: SPARQLing Up Annotated RDFS. In Proceedings of the International Semantic Web Conference (ISWC-10), no. 6496 in Lecture Notes in Computer Science, Springer-Verlag, 2010, pp.518–533. http://iswc2010.semanticweb.org/pdf/51.pdf