Use Case Attribution for a Versioned Document

From XG Provenance Wiki


Attribution for a versioned document


Jim Myers

Provenance Dimensions

Primary: Attribution
Secondary: Versioning, Publication, Process

Background and Current Practice

When a document (e.g. a scientific paper) is created today, assignment of authorship (which maps to recognition in the community and ultimately to fame and fortune), as well as the ordering of author names, is often a judgement call by senior/primary authors. Contributions recognized by inclusion as an author include both direct authorship of text and contributions to the data and analysis reported in the paper. Mistakes can be made if primary authors do not fully comprehend the contributions of other project members.

Systems that can provide more complete information about the contributions of individuals to such an effort will change and potentially improve decisions about authorship.


Goal

Provide more complete information to lead authors about contributions to the text and work being presented in a paper to aid in authorship decisions with the goal of improving decisions and increasing the transparency of the process.

Use Case Scenario

Multiple users contribute to an artifact A, with some contributions made indirectly through contributions to other artifacts used in producing A (i.e. inputs or artifacts representing earlier versions of A). Users wish to explore the record of how A was created to understand who contributed and whether specific contributions truly affected the final result, in order to properly assign credit for (give attribution for) the creation of A. The specific motivating case from which this general scenario derives is as follows.

Alice and Bob take on the task of writing up their recent effort with Charlie, Doug, and Ellen to synthesize a new protein. The group has used a provenance tracking system while working and is using a provenance-aware version tracking system to create the text. Charlie has done the core work in the lab, with Bob doing the analysis. Charlie sends Alice some text via email to create the first paper version, which Bob then edits several times. Bob includes a reference to data created by Doug in the document as the source of Figure 2. Ellen writes a few paragraphs outlining a difficult step in the analysis and creates a new version, which Alice, after reading it, ultimately rejects and rewrites starting from an earlier version. After several months of hectic intermittent work, Alice realizes the deadline has come and quickly adds herself, Bob, and Charlie as co-authors. She does a quick check within her document editor and is reminded that Doug contributed data, so she adds him and fires the paper off. Ellen's name does not come up since she did not contribute to the versions of the document that survived...

Problems and Limitations

Fairly simple provenance systems could be used to provide the type of capability outlined here and reduce the type of mistakes that would have led to Doug being left off the author list in the scenario above - he was a direct contributor to an artifact that was included in the paper.
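As a minimal sketch of such a simple system, the scenario can be modelled as a derivation graph and the contributors found by walking back from the final version. The artifact names (`v1`, `figure2_data`, and so on), the record layout, and the collapsing of Bob's several edits into a single version are all assumptions made for illustration, not part of any particular provenance system.

```python
from collections import deque

# Hypothetical provenance records for the scenario. Each artifact maps to
# (agent who produced it, artifacts it was derived from). The names and the
# exact derivation links are illustrative assumptions.
PROVENANCE = {
    "lab_results": ("Charlie", []),
    "analysis": ("Bob", ["lab_results"]),
    "email_text": ("Charlie", ["lab_results"]),
    "figure2_data": ("Doug", []),
    "v1": ("Alice", ["email_text"]),
    "v2": ("Bob", ["v1", "analysis", "figure2_data"]),
    "v3": ("Ellen", ["v2"]),   # Ellen's version, later rejected
    "v4": ("Alice", ["v2"]),   # Alice rewrites starting from v2
}


def contributors(artifact, provenance):
    """Return every agent on some derivation path leading to `artifact`."""
    seen = set()
    agents = set()
    queue = deque([artifact])
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        agent, sources = provenance[current]
        agents.add(agent)
        queue.extend(sources)
    return agents


print(sorted(contributors("v4", PROVENANCE)))
```

Under these assumptions the walk from `v4` reaches Doug through `figure2_data`, so he is not forgotten, but it never reaches `v3`, so Ellen is not reported.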

However, the case with Ellen points out limitations of such a simple system - attribution is based on intellectual contribution, not physical causality, and though the two are often aligned, they are not always. The ideas Ellen expressed in her text did contribute to how Alice eventually explained that point, but the causal history of those ideas is not captured by the document versioning system. One could further imagine that Frank, another collaborator in the same group, contributed text about his work on a second protein to early versions of the paper, and that a subsequent decision to write that work up separately resulted in his text (and intellectual contribution) being removed in later versions. Alice might accidentally include him as an author given a simple provenance report that he contributed text.

Some of these issues could be solved by more sophistication related to recognizing that papers are not atomic artifacts, and 'editing' processes do not necessarily result in contributions to every byte of artifact state. Similarly, everyone involved in a high-level 'experiment' process may not have contributed to a specific data set and paper from the set of several that were produced. While managing composite artifacts and composite processes does not address the disconnect between physical causality and intellectual causality, it is a start.
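One way to picture the composite-artifact idea is to treat each paper version as a set of section revisions and to trace contributors per section rather than per file. The sketch below assumes a hypothetical layout (section names, `name@n` revision ids, and who wrote what are invented for illustration) in which Frank's second-protein section is dropped from the later version.

```python
# Hypothetical section-level provenance. Each section revision maps to
# (author, revision it was edited from, or None if written from scratch).
SECTION_REVISIONS = {
    "intro@1": ("Charlie", None),
    "intro@2": ("Bob", "intro@1"),
    "protein2@1": ("Frank", None),
    "methods@1": ("Alice", None),
}

# Each paper version is a composite: section name -> section revision id.
VERSIONS = {
    "v1": {"intro": "intro@1", "protein2": "protein2@1"},
    "v2": {"intro": "intro@2", "methods": "methods@1"},  # protein2 removed
}


def section_contributors(revision_id, revisions):
    """Collect the authors along one section's chain of revisions."""
    agents = set()
    while revision_id is not None:
        author, parent = revisions[revision_id]
        agents.add(author)
        revision_id = parent
    return agents


def paper_contributors(version, versions, revisions):
    """Collect authors of only those sections present in `version`."""
    agents = set()
    for revision_id in versions[version].values():
        agents |= section_contributors(revision_id, revisions)
    return agents


print(sorted(paper_contributors("v2", VERSIONS, SECTION_REVISIONS)))
```

With this finer granularity Frank is credited for `v1` but not for `v2`, whereas a whole-file history would still list him. It still says nothing about ideas that survived a rewrite, which is the gap between physical and intellectual causality noted above.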

Additional capability to recognize that version artifacts are all states of a logical paper, and that a paper is just one manifestation of an intellectual contribution defined by a proposal, talks, workflows, multiple papers, etc., would solve more issues. One could document that a co-PI contributed to the ideas in the paper through their contribution to the proposal, and that some editing operations refined the intellectual idea whereas others (e.g. editing for grammar) did not and thus would not result in credit as a paper author.
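A rough sketch of how such records might be queried follows. It assumes contributions are tagged with the manifestation they were made to and with a kind label; the labels "intellectual" and "editorial", the `Contribution` record, and the agent names "CoPI" and "Proofreader" are hypothetical and chosen only to mirror the examples in the paragraph above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contribution:
    agent: str
    manifestation: str  # e.g. "proposal", "paper", "talk"
    kind: str           # "intellectual" or "editorial" (hypothetical labels)


# Manifestations assumed to express the same intellectual contribution.
IDEA_MANIFESTATIONS = {"proposal", "paper"}

# Illustrative records only.
CONTRIBUTIONS = [
    Contribution("Alice", "paper", "intellectual"),
    Contribution("CoPI", "proposal", "intellectual"),
    Contribution("Proofreader", "paper", "editorial"),
]


def credited_agents(contributions, manifestations):
    """Agents with an intellectual contribution to any listed manifestation."""
    return {
        c.agent
        for c in contributions
        if c.manifestation in manifestations and c.kind == "intellectual"
    }


print(sorted(credited_agents(CONTRIBUTIONS, IDEA_MANIFESTATIONS)))
```

Here the co-PI is credited through the proposal even without touching the paper, while a grammar-only edit to the paper earns no authorship. How contributions would come to be labelled this way is left open.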

Unanticipated Uses (optional)

This use case is primarily about deciding authorship attribution. However, its core problems are general: some aspects of causality are not recorded (personal communications, A heard B's talk), and there are multiple process spaces (intellectual versus physical document editing here, but more generally physical, mathematical, intellectual, management, economic, and other spaces) that do not fully align (do not share common definitions for artifacts and processes). Provenance systems that recognize composite artifacts and processes, and mappings between provenance spaces, would solve a broad range of other problems. For example, who's responsible when a lot of money is spent with little result? One must follow the trail across economic, management, and physical work processes at least to understand whether fiscal controls, management decisions, lazy workers, bad parts, or other problems were involved. Or - when debugging, how would one know to check that a piece of software not only ran without error but also did a "Fourier transform" correctly, unless one can map the mathematical processing that was planned against the results of the physical/digital processing that occurred?

Existing Work (optional)

I've been involved in several discussions in the context of OPM about composite issues and some of the spaces issues. I'm also aware of some work in curation and text mining related to artifacts having several meanings (the string "John Doe" is both an instance of a name and a person and one can talk about the provenance of both...)