HCLSIG/NamedGraphs in Life Sciences

From W3C Wiki

NamedGraphs in Life Sciences

The motivation for these pages came out of discussions at the Amsterdam F2F and was nicely summarized by Chimezie's follow-up email. NamedGraphs, as defined by Carroll et al. [1], are a means to group a collection of RDF triples and give that collection a unique URI. This is a form of reification, but at a larger scale than a single triple, and hence is a powerful grouping mechanism for the life sciences. The origin of a formal logic for Named Graphs can probably be traced back to Ramanathan Guha's PhD thesis [11]. It is no coincidence that Guha's early work on MCF [12] was the immediate predecessor of (and primary motivation for) RDF.

Another important point is that in the life sciences most of what is stated is not fact. Rather, it is interpretation based on limited evidence and knowledge, or hypotheses that can never be proven true, only proven false. Both kinds of statements call for the modal logic KD45 rather than S5 (see the Amsterdam F2F slide, [8]). The fundamental difference between the two is that S5 requires every statement that is 'known' to also be 'true' ( know(phi) => phi ), whereas KD45 drops this axiom and substitutes the axiom that statements can be 'believed' and possibly false, but a statement and its negation cannot both be 'believed' ( believe(phi) => ¬ believe(¬phi), where ¬ means NOT ). This requirement demands the compartmentalization of statements and logic, which NamedGraphs could support.
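
For reference, the two systems can be set side by side. The listing below is the usual textbook axiomatization, not something taken from the cited slide; writing K for 'know' and B for 'believe' is a notational assumption made here.

```latex
\begin{align*}
\textbf{S5 (knowledge):}\quad
  & K(\varphi \to \psi) \to (K\varphi \to K\psi)            && \text{(K)} \\
  & K\varphi \to \varphi                                    && \text{(T: what is known is true)} \\
  & \neg K\varphi \to K\neg K\varphi                        && \text{(5)} \\[6pt]
\textbf{KD45 (belief):}\quad
  & B(\varphi \to \psi) \to (B\varphi \to B\psi)            && \text{(K)} \\
  & B\varphi \to \neg B\neg\varphi                          && \text{(D: beliefs are consistent)} \\
  & B\varphi \to BB\varphi                                  && \text{(4)} \\
  & \neg B\varphi \to B\neg B\varphi                        && \text{(5)}
\end{align*}
```

The only axiom that separates the two is T versus D: KD45 keeps beliefs internally consistent without requiring them to be true.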

What follows are some useful examples that make the case for NamedGraphs, or a comparable form for RDF compartmentalization and tracking.

Life Science usages

  • A collection of RDF statements that form someone's hypothesis or data interpretation. The provenance of the NamedGraph (NG) lets others judge how far they believe the overall set of RDF statements, and each recipient can then decide how much weight to give them. This is equivalent to scoping assertions by context [4]. How else can we represent hundreds of different and potentially inconsistent hypotheses?
  • As an appropriate boundary for RDF statements associated with a particular patient record.
  • A means to bundle a group of statements and hash them, so that the authored set is non-repudiable; useful for implementing trust networks and guaranteeing who said what. This has far-reaching potential for clinical study data management and integrity.
  • A means to bundle a group of statements in order to grant or deny access for users or groups within a content management system.
  • Annotations that contain simple or complex RDF graphs could be represented effectively as NamedGraphs. This would allow building networks of annotations that discretize knowledge based on prior facts and/or annotations, and working assumptions. A community could readily use these to track down origins of ideas, consequences, compiled evidence, and branch points of alternative explanations.
  • NLP can allow us to build knowledge from scientific publications, but the graph should be limited to only the context of that paper, its authors, their evidence, and their working assumptions.
  • A survey of relationships between contexts [5]: many logical relationships are only useful in a given context, e.g. "In individuals that are ill with disease X, the Jnk Pathway can be modulated by inhibitors of Jnk". There is no need to pollute the entire RDF space with such context-narrow facts, since the stated relations may not all hold when the context is different (not suffering from disease X).
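
The "bundle and hash" usage in the list above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a real canonicalization scheme: it assumes ground triples only (no blank nodes), terms without embedded spaces, and the `ex:` names and the `graph_digest` helper are hypothetical. A digest alone only detects tampering; for non-repudiation the author would still have to sign the digest with a private key.

```python
import hashlib

def graph_digest(triples):
    """Return a SHA-256 hex digest of a set of ground triples.

    Triples are serialized one per line and sorted, so the digest does not
    depend on the order in which the statements were supplied.
    """
    lines = sorted(" ".join(t) + " ." for t in set(triples))
    return hashlib.sha256("\n".join(lines).encode("utf-8")).hexdigest()

# Hypothetical statements making up one author's hypothesis.
hypothesis = [
    ("ex:WNT_Pathway", "ex:is_involved_in", "ex:Hematopoiesis_Stem_Cell_Differentiation"),
    ("ex:Jnk_Inhibitor", "ex:modulates", "ex:Jnk_Pathway"),
]

digest = graph_digest(hypothesis)

# Same statements in another order give the same digest.
print(digest == graph_digest(reversed(hypothesis)))  # True

# Adding (or removing) a statement changes the digest.
tampered = hypothesis + [("ex:Jnk_Inhibitor", "ex:modulates", "ex:WNT_Pathway")]
print(digest == graph_digest(tampered))  # False
```

The digest could then be attached to the graph's URI as provenance, so that a recipient can check that the set of statements is the one the author published.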

Some explicit NamedGraph examples

  1. Joanne believes { Alan states { LSIDs are insufficient } . Alan in Amsterdam_F2F }
  2. <doi:10.1038/ni1006-1021> describes { WNT_Pathway is_involved_in Hematopoiesis_Stem_Cell_Differentiation }
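
One way to make the nesting in these examples concrete is to flatten them into quads (graph name, subject, predicate, object). The sketch below is only illustrative: the graph names `ex:g1`, `ex:g2`, `ex:g3`, the `ex:` prefix, and the `attribution_chain` helper are all hypothetical, and the walk assumes that graphs do not reference each other in a cycle.

```python
from collections import defaultdict

DEFAULT = None  # key used here for the unnamed default graph

# The two examples above, with each { ... } block given a hypothetical name.
quads = [
    (DEFAULT, "ex:Joanne", "ex:believes", "ex:g1"),
    ("ex:g1", "ex:Alan", "ex:states", "ex:g2"),
    ("ex:g1", "ex:Alan", "ex:in", "ex:Amsterdam_F2F"),
    ("ex:g2", "ex:LSIDs", "ex:are", "ex:insufficient"),
    (DEFAULT, "doi:10.1038/ni1006-1021", "ex:describes", "ex:g3"),
    ("ex:g3", "ex:WNT_Pathway", "ex:is_involved_in",
     "ex:Hematopoiesis_Stem_Cell_Differentiation"),
]

# Group the triples by the graph they belong to.
graphs = defaultdict(set)
for g, s, p, o in quads:
    graphs[g].add((s, p, o))

def attribution_chain(graphs, target, start=DEFAULT):
    """Return the (agent, relation) pairs that lead from `start` to `target`.

    Follows triples whose object is itself a graph name; returns None if
    `target` is not reachable from `start`.
    """
    for s, p, o in sorted(graphs.get(start, ())):
        if o == target:
            return [(s, p)]
        if o in graphs:
            rest = attribution_chain(graphs, target, o)
            if rest is not None:
                return [(s, p)] + rest
    return None

# Who stands behind the claim held in ex:g2 ("LSIDs are insufficient")?
print(attribution_chain(graphs, "ex:g2"))
# [('ex:Joanne', 'ex:believes'), ('ex:Alan', 'ex:states')]
```

The chain is what a recipient would inspect before deciding how much weight to give the innermost statement.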

SPARQL and Named Graphs

The SPARQL specification's section on the RDF Dataset [9] represents the current critical-mass consensus on how the RDF data model can be logically composed of named graphs. It defines an RDF Dataset as having one 'default' graph (which has no name) and zero or more 'named' graphs, each identified by an IRI.
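
The shape of an RDF Dataset can be mirrored in a small in-memory structure. The following is a sketch for illustration only, not an implementation of SPARQL: the `Dataset` class, its `match_named` method, and all `ex:` names are hypothetical. The lookup is loosely analogous to a query of the form `SELECT ?g WHERE { GRAPH ?g { ex:Jnk_Inhibitor ?p ?o } }`.

```python
from dataclasses import dataclass, field

@dataclass
class Dataset:
    """One unnamed default graph plus zero or more IRI-named graphs."""
    default_graph: set = field(default_factory=set)
    named_graphs: dict = field(default_factory=dict)  # IRI -> set of triples

    def add(self, triple, graph=None):
        """Add a triple to the named graph `graph`, or to the default graph."""
        if graph is None:
            self.default_graph.add(triple)
        else:
            self.named_graphs.setdefault(graph, set()).add(triple)

    def match_named(self, s=None, p=None, o=None):
        """Yield (graph_iri, triple) for matches in named graphs; None is a wildcard."""
        for iri in sorted(self.named_graphs):
            for t in sorted(self.named_graphs[iri]):
                if all(want is None or want == got
                       for want, got in zip((s, p, o), t)):
                    yield iri, t

ds = Dataset()
# Two mutually inconsistent hypotheses, each kept in its own named graph.
ds.add(("ex:Jnk_Inhibitor", "ex:modulates", "ex:Jnk_Pathway"), graph="ex:hypothesisA")
ds.add(("ex:Jnk_Inhibitor", "ex:does_not_modulate", "ex:Jnk_Pathway"), graph="ex:hypothesisB")
# Provenance about the graphs themselves lives in the default graph.
ds.add(("ex:hypothesisA", "ex:assertedBy", "ex:Joanne"))
ds.add(("ex:hypothesisB", "ex:assertedBy", "ex:Alan"))

for graph_iri, triple in ds.match_named(s="ex:Jnk_Inhibitor"):
    print(graph_iri, triple)
```

Because each hypothesis sits in its own graph, the two conflicting statements never meet in a single set of triples, and the default graph records who asserted each one.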

Existential Contexts?

A recent thread [10] on public-sparql-dev identified an additional use case for named graphs: as a context for triples of 'unknown' origin. The conversation suggests the possible use of blank nodes to identify such contexts. This can be used to express assertions such as: 'There exists a collection of statements about relationships of cardiovascular anatomy'.

Papers and References

  1. Carroll et al., 2004
  2. Design Issues
  3. NamedGraphs using JENA
  4. Scoping assertions by context
  5. Relationships between contexts
  6. W3C NamedGraphs Activity
  7. RDFS NG extension
  8. Modal Epistemic Knowledge
  9. RDF Dataset
  10. SPARQL, named graphs and default graph
  11. Contexts: A Formalization and Some Applications
  12. Meta Content Framework Using XML