FilteredPush Position Paper

Perspectives

Robert A. Morris and Paul Morris are members of the FilteredPush development group at the Harvard University Herbaria. Our immediate perspective is the annotation of scientific data, but we have two related viewpoints. First, scientific funding agencies increasingly require that the data supporting publications be made publicly available in usable form. Given this, we expect that scientific e-publications and the data supporting them will ultimately require reciprocal annotation. Second, modern web documents are typically rendered at access time from data in back-end data stores. Our perspective is that the actual knowledge thus resides in those stores, with the web document serving as a captured-in-time view, and that this holds whether the back end is structured, semi-structured, or unstructured. Thus, even if the back end is not directly accessible from the document generation system, we believe that some document annotation pitfalls can be analyzed from this perspective.

System built

The FilteredPush project has built a configurable platform to support the semantic annotation of all forms of distributed data. We are presently deploying two instances of the platform, aimed at annotation supporting collaborative digitization and data quality control of georeferenced metadata for specimens in natural science collections in the U.S. Two similar projects are known in Australia and Europe.¹ ² Our annotations exploit a small extension of the Open Annotation Ontology (OA) central to academic scholarship, especially in the sciences. Of importance to us is the ability to model, within an annotation, assertions about the results of a query, including assertions that are independent of some of the query details. For data quality control we also require that annotations be actionable by annotation consumers, based on an expectation of the producers expressed in the annotation. This has led us to consider the requirements of "annotation conversations" surrounding the actions taken by a consuming agent, and the requirements for conveying the annotator's expectation about actions to be taken by consumers on their own datasets. In turn, the consumer must be able to launch an annotation informing interested parties what action it took in response to an annotation. Further discussion appears in Morris et al.³ In practice we support annotation production and consumption as web services that can be exploited by third-party data management tools available to the domain scientists. In addition, a semantic pub-sub component notifies interested parties when new annotations relevant to their interests are published.
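The "annotation conversation" and pub-sub ideas can be illustrated with a minimal Python sketch. It is not the FilteredPush implementation: the `Annotation` and `Hub` classes, their field names, and the identifiers are all hypothetical, and interests are plain Python predicates rather than semantic queries. A producer publishes an annotation carrying an expectation, a subscribed consumer acts on it, and the consumer then publishes a response annotation that points back to the original.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Annotation:
    id: str
    target: str                          # identifier of the annotated record
    body: Dict[str, str]                 # values asserted by the annotator
    expectation: Optional[str] = None    # action the producer hopes for
    in_reply_to: Optional[str] = None    # annotation this one responds to


Interest = Callable[[Annotation], bool]
Callback = Callable[[Annotation], None]


@dataclass
class Hub:
    """Toy pub-sub hub: each subscription pairs an interest with a callback."""
    subscriptions: List[Tuple[Interest, Callback]] = field(default_factory=list)

    def subscribe(self, interest: Interest, callback: Callback) -> None:
        self.subscriptions.append((interest, callback))

    def publish(self, annotation: Annotation) -> None:
        for interest, callback in self.subscriptions:
            if interest(annotation):
                callback(annotation)


hub = Hub()
received: List[str] = []
responses: List[Annotation] = []


def curator(annotation: Annotation) -> None:
    # The consumer acts on its own dataset (omitted), then reports what it did.
    received.append(annotation.id)
    hub.publish(Annotation(
        id=annotation.id + "-response",
        target=annotation.target,
        body={"action": "accepted"},
        in_reply_to=annotation.id,
    ))


# The curator of collection A wants new annotations on its own records.
hub.subscribe(
    lambda a: a.target.startswith("collection-A/") and a.in_reply_to is None,
    curator,
)
# The original annotator wants to hear what was done in response.
hub.subscribe(lambda a: a.in_reply_to is not None, responses.append)

hub.publish(Annotation(
    id="anno-1",
    target="collection-A/specimen-42",
    body={"country": "Brazil"},
    expectation="update",
))
```

After the final `publish`, `received` holds the id of the original annotation and `responses` holds the single response, whose `in_reply_to` field links the two halves of the conversation.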

Lessons learned

Annotation provenance is complex, particularly when the annotations are modeled in a graph language such as RDF. To prevent provenance distortion in a triple store, some annotation systems use one named graph per annotation and control writing to the "original" annotation. We chose instead to memorialize annotation documents in a document store (Fedora Commons) since we had other reasons to support it.
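The named-graph alternative mentioned above can be sketched in a few lines of Python. This is not our document-store approach, and it does not use a real triple store: the `AnnotationStore` class and the `ex:` names are hypothetical. It only shows why one graph per annotation, with write control, keeps the provenance statements of one annotation from mixing with those of another.

```python
from collections import defaultdict


class AnnotationStore:
    """Toy quad store: one named graph per annotation (hypothetical API)."""

    def __init__(self):
        self._graphs = defaultdict(set)   # graph name -> set of (s, p, o)
        self._sealed = set()              # graphs that no longer accept writes

    def add(self, graph, subject, predicate, obj):
        if graph in self._sealed:
            raise PermissionError(f"{graph} is sealed")
        self._graphs[graph].add((subject, predicate, obj))

    def seal(self, graph):
        """Stop further writes to the 'original' annotation."""
        self._sealed.add(graph)

    def triples(self, graph):
        return frozenset(self._graphs.get(graph, frozenset()))


store = AnnotationStore()
store.add("ex:anno-1", "ex:anno-1", "ex:annotatedBy", "ex:alice")
store.add("ex:anno-1", "ex:anno-1", "ex:target", "ex:specimen-42")
store.seal("ex:anno-1")

# A later annotation on the same specimen lives in its own graph.
store.add("ex:anno-2", "ex:anno-2", "ex:annotatedBy", "ex:bob")
store.add("ex:anno-2", "ex:anno-2", "ex:target", "ex:specimen-42")

# Attempts to rewrite the provenance of the original are rejected.
try:
    store.add("ex:anno-1", "ex:anno-1", "ex:annotatedBy", "ex:bob")
    rejected = False
except PermissionError:
    rejected = True
```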

Lessons yet to be learned

For a particular instance of our collaborative platform, we fix domain ontologies. We are unsure what domain ontology mapping is required in order to exchange annotations with other annotation systems in the same domain. For scientific data annotation this is important, because many assertions about data have impacts on the scientific conclusions supported by the data. Almost certainly this problem also exists for annotating documents.
We do not yet know what scaling issues we may face in add-on services such as the annotation pub-sub facility.
In some related projects on the annotation of wikis, we are concerned about what the actual resource to be annotated is, and about whether consumers can reproduce the assertions of producers. For example, many web documents serve HTML that transcludes, or otherwise depends on dereferencing, resources that may change without changing the provenance assertions of the document. In such a case, an annotation may refer to a specific HTML serialization that continues to carry unchanged provenance identifiers (e.g. persistent URLs), even though the HTML seen by an annotation consumer is not the same as that seen by the producer.
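One possible mitigation for the last concern, offered only as a sketch and not as something we have deployed, is for the producer to record a digest of the serialization it actually saw alongside the persistent URL. The function names, the dictionary keys, and the example URL below are hypothetical.

```python
import hashlib


def fingerprint(html: str) -> str:
    """Digest of the exact serialization, independent of its URL."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def make_target(url: str, html_seen_by_producer: str) -> dict:
    """Annotation target recording both the identifier and what was seen."""
    return {"source": url, "sha256": fingerprint(html_seen_by_producer)}


def same_serialization(target: dict, html_seen_by_consumer: str) -> bool:
    """True only if the consumer sees the bytes the producer annotated."""
    return target["sha256"] == fingerprint(html_seen_by_consumer)


url = "https://example.org/wiki/Some_Page"   # persistent URL, unchanged
produced = "<p>Collected in <span>Brazil</span></p>"
consumed = "<p>Collected in <span>Peru</span></p>"   # transcluded part changed

target = make_target(url, produced)
unchanged = same_serialization(target, produced)
drifted = not same_serialization(target, consumed)
```

A digest detects that the serialization changed but not what changed or whether the change affects the annotated assertion, so it addresses reproducibility only in part.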

What is missing from OA for data and other scientific web resource annotations

In science, assertion without evidence is merely hypothesis. This is also true for many other annotation venues. OA needs a clear way to signify whether the annotation body is claimed as hypothesis or as evidence-based assertion, and to identify the supporting evidence.
Actionable annotations need more thought. It may be that oa:Motivation is an adequate model, but for collaboration between producer and consumer we have found a need for more granularity about what is hoped for by the producer.
The outcome of queries as the subject of annotation. Even for documents, it seems useful to be able to express things like "A regular expression search based on '....' yields seven sentences asserting that Admiral Peary reached 93 deg North. All of them are wrong."
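The first and third points can be made concrete with a small Python sketch. It is an illustration under stated assumptions, not a proposed OA extension: the dictionary keys, the `annotate_query_outcome` function, the sample text, and the regular expression are all hypothetical. The annotation's target is the outcome of a query rather than a fixed document fragment, and the body is marked as hypothesis unless supporting evidence is supplied.

```python
import re


def annotate_query_outcome(text, pattern, assertion, evidence=None):
    """Build an annotation whose subject is the result of a regex query."""
    # Naive sentence splitter, sufficient for this illustration.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    hits = [s for s in sentences if re.search(pattern, s)]
    return {
        "target": {
            "query": {"type": "regex", "pattern": pattern},
            "result_count": len(hits),
            "results": hits,
        },
        "body": assertion,
        # Without evidence, the assertion is only a hypothesis.
        "status": "evidence-based" if evidence else "hypothesis",
        "evidence": list(evidence or []),
    }


sample = (
    "The expedition reached 93 deg North. Supplies ran low. "
    "A later account repeats that the party reached 93 deg North."
)

supported = annotate_query_outcome(
    sample,
    r"reached 93 deg North",
    "All matching sentences are wrong.",
    evidence=["Latitude cannot exceed 90 degrees."],
)
unsupported = annotate_query_outcome(
    sample,
    r"reached 93 deg North",
    "All matching sentences are wrong.",
)
```

Because the target records the query itself, a consumer can rerun it against its own copy of the text and compare the result count with the one the producer reported.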

Notes

[1] Chernich R, et al., Providing Annotation Services for the Atlas of Living Australia. Proceedings of TDWG; 2008.

[2] Tschöpe O, et al., Annotating biodiversity data via the Internet, Taxon 62(6), 20 December 2013, pp. 1248-1258.

[3] Morris RA, et al., Semantic Annotation of Mutable Data, PLoS ONE, November 2013.