Use Case Human Executed Processes
Provenance in support of artifacts derived by human-executed processes.
Paulo Pinheiro da Silva, Jitin Arora
In many scientific scenarios, data is collected over a period of time and run through complicated processes to produce artifacts. Some scientific processes are fully or semi-automated, but many are executed by humans who use machines mainly to record their artifacts. If an artifact leads to an unexpected claim, a higher standard of acceptance may apply. It may then be necessary to produce all the datasets that contributed to the artifact, and possibly the detailed sequence of steps that led to its generation. During the process, additional input may have come from sources that are hard to capture, such as GUI or keyboard input by a scientist. The following scenario describes the need to capture provenance dynamically, going beyond the static template implied by the execution of an automated process, for a process that can last a long time and that may not be finished yet.
The goal is to enable scientists to justify their final conclusions by providing a detailed trace of the steps leading to the generation of an artifact, most notably the data sources involved in its generation. A further goal is to let them justify partial conclusions in the same way they justify final results, with the additional need to explain how these partial conclusions will lead to a final conclusion when the process has not been entirely executed.
Current Practice Scenario
We observe that most scientists are careful about capturing and recording provenance information. Currently, however, only a very limited amount of provenance can be embedded inside certain data file formats such as JPEG images or Excel spreadsheets; this embedded information may not be easily accessible and is not adequate to provide a complete trace. Another limited option is to capture provenance information and encode it into databases, which eventually become silos of provenance information that are hard to access and use.
Use Case Scenario
Geologist Janet uses a number of off-the-shelf tools to generate gravity maps of some regions from experimental data gathered on field visits. The process of generating the maps is fairly well known, and it is not complex enough for Janet to justify the creation and use of a workflow engine. Some areas of the maps have significantly different values than nearby areas. To lend credibility to her claim that such rapid variations do indeed exist, Janet must discover and make available the specific data sets that were used in generating those maps, as well as the parameters input to the process when it was executed.
Problems and Limitations
To support such a claim, a complete trace of the execution of the process is needed, including the source datasets, the identification of the tools that were used, and input parameters provided through channels such as keyboard input or shell environment variables. A more complex problem is justifying the existence of intermediate artifacts that are in the process of being used to generate a final conclusion (i.e., an artifact) that does not exist yet.
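As a rough illustration of what such a trace might record, the following Python sketch logs one entry per human-executed step: the tool used, a digest of each source dataset, the parameters the scientist entered, and selected shell environment variables. It is only a sketch under stated assumptions. The names StepRecord, record_step, and trace_to_json, the field layout, and the "done"/"pending" status values are all hypothetical and are not part of this use case or of any existing provenance tool. A step whose output does not exist yet is kept in the trace as "pending", which is one possible way to represent an intermediate artifact on its way to a final conclusion.

```python
import hashlib
import json
import os
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Optional


def sha256_of(path):
    """Return the SHA-256 hex digest of a file so the dataset can be identified later."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


@dataclass
class StepRecord:
    """One human-executed step. All field names are hypothetical."""
    tool: str                     # identification of the tool that was used
    inputs: dict                  # dataset path -> SHA-256 digest
    parameters: dict              # values typed or clicked by the scientist
    environment: dict             # selected shell environment variables
    output: Optional[str] = None  # None while the artifact does not exist yet
    status: str = "pending"       # "done" once an output has been produced
    recorded_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def record_step(tool, input_paths, parameters, env_names=(), output=None):
    """Build a record for one step, hashing inputs and copying the named variables."""
    return StepRecord(
        tool=tool,
        inputs={path: sha256_of(path) for path in input_paths},
        parameters=dict(parameters),
        environment={name: os.environ.get(name) for name in env_names},
        output=output,
        status="done" if output is not None else "pending",
    )


def trace_to_json(steps):
    """Serialize the ordered list of steps so the trace can be shared."""
    return json.dumps([asdict(step) for step in steps], indent=2)
```

In Janet's scenario, each run of an off-the-shelf tool would add one record, and the map-rendering step could be recorded as pending before the final map exists. Keyboard or GUI input still has to be passed in by hand through the parameters argument, since this sketch makes no attempt to intercept it.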