Analysis of use cases

From CSV on the Web Working Group Wiki

Are there use cases hidden in the CSV-LD proposal (Gregg Kellogg) and the XBRL-GL discussion (Eric Cohen)? (to be determined)

1 Automated Processing of CSV (Validation) (Adam Retter) Transfer of Records from government departments and other government agencies to The National Archives (TNA) in support of Digital Preservation. Records are supplemented with metadata to:

  • Describe Record provenance,
  • Determine Record integrity,
  • Determine legal qualification of Record (if any),
  • Package Record for end-user discovery and access (presentation, search etc.)

Metadata for a set of Records to be published may be machine extracted or provided by humans; e.g. “Transcriptions of Records”. Preferred metadata format is RDF, allowing direct ingest into the Records “Catalogue” (triple store implementation), but typical delivery is in CSV form.

The National Archives specifies mandatory metadata fields and formatting to enable machine processing and preparation for ingest into the Records Catalogue.

REQ: CSV file to reference externally managed/maintained data definition resource

REQ: Validate CSV content against conformance with data definition resource
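A minimal sketch of this kind of validation, assuming a hypothetical data definition resource reduced to a mapping of column name to "mandatory?" (a real resource would be externally managed and far richer — datatypes, formats, provenance rules):

```python
import csv
import io

# Hypothetical data definition: column name -> mandatory? In practice this
# would be fetched from an externally maintained resource, not hard-coded.
DATA_DEFINITION = {"identifier": True, "date": True, "description": False}

def validate(csv_text, definition):
    """Return a list of (line_number, message) problems; empty means valid."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = {c for c, required in definition.items() if required} - set(reader.fieldnames or [])
    if missing:
        return [(1, f"missing mandatory column(s): {sorted(missing)}")]
    problems = []
    for n, row in enumerate(reader, start=2):  # line 1 is the header
        for column, required in definition.items():
            if required and not row.get(column, "").strip():
                problems.append((n, f"empty mandatory field '{column}'"))
    return problems

sample = "identifier,date,description\nTNA-001,1944-06-06,example record\nTNA-002,,missing date\n"
print(validate(sample, DATA_DEFINITION))  # → [(3, "empty mandatory field 'date'")]
```

The point of the requirement is that `DATA_DEFINITION` should be dereferenced from a link carried with the CSV file, so supplier and archive validate against the same resource.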

REQ: Automated transformation of CSV to RDF

The burden of effort for transforming metadata from its “native form” is pushed to the supplier of Records. There is potential to reduce this effort through automated transformation if the semantics of columns and rows can be unambiguously specified.

REQ: machine-readable mechanism to associate (rich) semantics (e.g. RDF properties) with columns and rows for a given CSV file
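One way such a mechanism could drive a generic CSV-to-RDF transformation, sketched below. The column-to-property mapping, the base URI and the `identifier` column are all illustrative assumptions; the envisaged mechanism would supply them as machine-readable metadata alongside the CSV file:

```python
import csv
import io

# Assumed machine-readable metadata: CSV column -> RDF property URI.
COLUMN_PROPERTIES = {
    "title": "http://purl.org/dc/terms/title",
    "created": "http://purl.org/dc/terms/created",
}
BASE = "http://example.org/record/"  # illustrative subject namespace

def csv_to_ntriples(csv_text, id_column="identifier"):
    """Emit one N-Triples statement per mapped cell, one subject per row."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{BASE}{row[id_column]}>"
        for column, prop in COLUMN_PROPERTIES.items():
            if row.get(column):
                triples.append(f'{subject} <{prop}> "{row[column]}" .')
    return triples

sample = "identifier,title,created\n42,Oram papers,1944-06-06\n"
for triple in csv_to_ntriples(sample):
    print(triple)
```

With the mapping published alongside the CSV, the same generic transform serves every supplier, rather than bespoke code per dataset.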

NEEDS EXAMPLE TO ILLUSTRATE THE WORKFLOW; ORIGINAL RECORD, METADATA DESCRIPTION IN CSV, CSV DATA DEFINITION RESOURCE, METADATA DESCRIPTION TRANSFORMED INTO RDF … I RECALL THE USE OF AN EXAMPLE FROM A PAPER ABOUT THE TNA’S DIGITAL REPOSITORY CONCERNED WITH COLONEL A R ORAM (xml-london-rw.pdf, section 4.3 semantic search)

2 Relational Data and Row-Formats (Adam Retter) Within a “Transcription of Records” metadata CSV file, each Clause (terminology? [Sub-]Record?) within a Record is allocated a unique identifier (locally unique within the scope of a single CSV file). The CSV file includes the relationships between Clauses.

ASSUMPTION: a row within a CSV file describes a uniquely identifiable entity

REQ: each entity described in a CSV file shall be uniquely identifiable

REQ: it shall be possible to assert relationships between entities, each of which is described within an individual row of a CSV file

Each row in a “Transcription of Records” metadata CSV file may refer to a different type of Record.

REQ: it shall be possible to assert the type of each entity described in a CSV file

... noting that it is feasible in RDF for an entity to be declared as an instance of multiple types ...

Different types of Record have different information requirements, thus only some of the columns are necessary for each type of record.

REQ: validation of content defined within a row of a CSV file against conformance to a data definition resource may vary according to the type of entity described in that row
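A sketch of such per-type validation, assuming a hypothetical `type` column and invented Record types and field names (these are not TNA terminology):

```python
import csv
import io

# Hypothetical rules: which columns are mandatory for each entity type
# declared in the (assumed) "type" column.
REQUIRED_BY_TYPE = {
    "letter": {"sender", "recipient"},
    "photograph": {"location"},
}

def validate_rows(csv_text):
    """Check each row only against the columns its declared type requires."""
    problems = []
    for n, row in enumerate(csv.DictReader(io.StringIO(csv_text)), start=2):
        for column in REQUIRED_BY_TYPE.get(row.get("type", ""), set()):
            if not row.get(column, "").strip():
                problems.append((n, row.get("type"), column))
    return problems

sample = ("id,type,sender,recipient,location\n"
          "1,letter,A R Oram,War Office,\n"
          "2,photograph,,,\n")
print(validate_rows(sample))  # row 3 lacks the photograph's 'location'
```

Note that row 2 passes despite its empty `location` column, because that column is only mandatory for photographs.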

RECOMMEND MERGING THESE DETAILS WITH USE CASE #1 (ABOVE) USING A NARRATIVE AND REAL EXAMPLE(S)

IS VALIDATION IN SCOPE?

3 Publication of Statistics (Jeni Tennison) National statistics are routinely published in tabular form in the UK by the Office for National Statistics (ONS), including supplementary metadata (e.g. notes or annotations) about the tabular datasets themselves, particular columns, the measures used and individual cells.

Tabular data is typically published in both Excel and CSV formats.

REQ: “metadata rich” CSV files shall remain compatible with commonly used data processing and analysis tools, including Microsoft Excel – the “zero edit” requirement (Gregg Kellogg)

The Excel files embed metadata necessary for correct interpretation of the statistics and to provide references to externally published supplementary information (e.g. links to information on the ONS website). It is common practice for statistical tables to use codes to refer to geographic areas.

REQ: it shall be possible to incorporate, or provide reference to, supplementary metadata, such as notes or annotations, for tabular datasets, columns, rows and cells

REQ: it shall be possible to incorporate references to externally published information, including (but not limited to) (i) supporting documentation, (ii) column heading descriptions and (iii) cell values

Each Excel file describes multiple tabular datasets.

REQ: it shall be possible to group multiple tabular datasets within a single resource package

PROVISIONAL LIST OF COMMONLY USED TOOLS: MICROSOFT EXCEL, R, MATLAB, IDL – NOTING THAT DATA PROCESSING AND ANALYSIS IS OFTEN UNDERTAKEN WITH CUSTOM CODE WRITTEN IN PYTHON AND FORTRAN

USE A PARTICULAR TABULAR DATASET TO ILLUSTRATE THE TYPES OF ANNOTATION; E.G. “b” AND “f” QUALIFIERS FOR STATISTICS

4 Organogram Data (Jeni Tennison) The UK Government publishes information about public sector roles and salaries in CSV format. Information is provided for each public sector organisation as a pair of CSV files, each of which conforms to a data definition schema maintained by the Cabinet Office.

REQ: CSV file to reference externally managed/maintained data definition resource (repeat; #1)

REQ: Validate CSV content against conformance with data definition resource (repeat; #1)

The two CSV files are interrelated, with one CSV file referring to entities uniquely identified in its complementary file.

REQ: each entity described in a CSV file shall be uniquely identifiable (repeat; #2)

REQ: it shall be possible to assert relationships between entities, each of which is described within individual rows of separate CSV files
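A sketch of checking such cross-file references, loosely modelled on the organogram pair (one file of posts, one of roles referring to posts). Column names are illustrative, not the Cabinet Office schema:

```python
import csv
import io

# Two interrelated CSV files: roles refer to posts via a shared identifier.
posts_csv = "post_id,post_name\nP1,Permanent Secretary\nP2,Director\n"
roles_csv = "role_id,reports_to\nR1,P1\nR2,P2\nR3,P9\n"

def unresolved_references(referencing_csv, referenced_csv, key, ref):
    """Return cross-file reference values that have no target row."""
    known = {row[key] for row in csv.DictReader(io.StringIO(referenced_csv))}
    return [row[ref] for row in csv.DictReader(io.StringIO(referencing_csv))
            if row[ref] not in known]

print(unresolved_references(roles_csv, posts_csv, "post_id", "reports_to"))  # → ['P9']
```

Making the relationship between the two files machine-readable would let a generic validator perform this referential-integrity check without bespoke code.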

An RDF dataset is derived from each pair of CSV files using a bespoke software program, supplementing the content in the CSV with additional information such as RDF properties for specific columns. The need to create and maintain bespoke software incurs costs that may be avoided through use of a generic CSV-to-RDF transformation.

REQ: Automated transformation of CSV to RDF (repeat; #1)

REQ: machine-readable mechanism to associate (rich) semantics (e.g. RDF properties) with columns and rows for a given CSV file (repeat; #1)

5 Publication of weather observation time-series data as input into analysis or impact assessment (Jeremy Tandy)

MODIFY TO EXPRESS WORK FLOW FROM GLOBAL SURFACE TEMPERATURES CLIMATE SCIENCE INITIATIVE

“Creation of consolidated global land surface temperature climate databank”

6 Publication of Data by the UK Land Registry (Andy Seaborne) Information about property (real estate) transactions in England and Wales is recorded by the Land Registry. These transactions, comprising more than 18.5 million records captured since 1995, are published online (Price paid data). The data is collated into a CSV file on a monthly basis before being converted to RDF for storage within a database. These CSV files are provided for public consumption.

REQ: Automated transformation of CSV to RDF (repeat; #1)

REQ: machine-readable mechanism to associate (rich) semantics (e.g. RDF properties) with columns and rows for a given CSV file (repeat; #1)

Each monthly CSV file provides a set of changes to be applied to the database, including new transactions (additions; “A”), corrected transactions (“C”) and deleted transactions (“D”). Each transaction described within a CSV file is identified with a unique identifier (GUID) enabling corrections or deletions to be applied to the appropriate entity within the RDF database.
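The change-set semantics above can be sketched as follows; the column names (`guid`, `status`, `price`) are illustrative, not the Land Registry schema, and the store is a plain dictionary standing in for the RDF database:

```python
import csv
import io

def apply_changes(store, changes_csv):
    """Apply a monthly change file to a store keyed by GUID.
    Status codes follow the narrative above: "A" add, "C" correct, "D" delete."""
    for row in csv.DictReader(io.StringIO(changes_csv)):
        guid, status = row["guid"], row["status"]
        if status == "D":
            store.pop(guid, None)
        else:  # "A" and "C" both (re)write the record identified by the GUID
            store[guid] = {k: v for k, v in row.items() if k != "status"}
    return store

changes = ("guid,status,price\n"
           "g1,A,250000\n"
           "g2,A,180000\n"
           "g1,C,255000\n"
           "g2,D,\n")
print(apply_changes({}, changes))  # → {'g1': {'guid': 'g1', 'price': '255000'}}
```

The stable GUID is what makes the correction and deletion operations well-defined: without it, a "C" row could not be matched to the entity it corrects.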

REQ: each entity described in a CSV file shall be uniquely identifiable

In addition to the code-value used to assert the status of the transaction (“A”, “C” or “D”), code-values are used to classify the property type. These codes are published as external reference material.

REQ: it shall be possible to associate values within a given column of a CSV file with an externally published thesaurus or vocabulary ... (this requirement is twofold: (i) to refer from column to thesaurus, and (ii) to dereference the code-value used to a specific term within that thesaurus)
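Both halves of the requirement can be sketched with a hypothetical metadata entry linking a column to an external code list plus a URI template for dereferencing individual code values (the URIs below are invented, not the Land Registry's):

```python
# Hypothetical machine-readable link from a CSV column to an external code
# list, plus a template for turning a code value into a dereferenceable term.
COLUMN_CODELISTS = {
    "property_type": {
        "codelist": "http://example.org/def/property-type",          # (i) column -> thesaurus
        "term": "http://example.org/def/property-type/{code}",       # (ii) value -> term
    }
}

def dereference(column, code):
    """Return the term URI for a code value, or None if the column is uncoded."""
    entry = COLUMN_CODELISTS.get(column)
    return entry["term"].format(code=code) if entry else None

print(dereference("property_type", "D"))  # → http://example.org/def/property-type/D
```

A consumer can then follow the term URI to the published definition of, say, code “D”, rather than guessing its meaning from context.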

7 Processing search results from Solr (Alf Eaton) Apache Solr provides search results in several formats including CSV. Subsequent analysis of search results, using tools such as OpenRefine, requires additional effort to associate attributes with each column (e.g. data type, reference to property definition, unit of measure etc.) as these are not included within the CSV output.

REQ: machine-readable mechanism to associate (rich) semantics (e.g. RDF properties) with columns and rows for a given CSV file (repeat; #1)

REQ: it shall be possible to incorporate references to externally published information, including (but not limited to) (i) supporting documentation, (ii) column heading descriptions and (iii) cell values (repeat; #3)

Queries yielding large numbers of results are typically paginated, with each CSV file describing a “page” containing a convenient number of results. Considering the set of search results as a “dataset”, each CSV page can be seen as a subset of the larger results dataset. In many cases, a user does not need the entire dataset, only a subset of the whole; for example, requesting updates or new additions to the search results accumulated since the previous request. Furthermore, users are often interested in summary statistics for the entire dataset (e.g. total number of search results, maximum or minimum values).

REQ: specific subsets of a tabular dataset (e.g. rows 501-600) shall be unambiguously identifiable

REQ: it shall be possible to assert the relationship between a specific subset and the tabular dataset within which it is incorporated

REQ: it shall be possible to determine the structure of a tabular dataset and how (& where) a specific subset fits within that structure ... (requirement to be verified?)

REQ: it shall be possible to include, or reference, summary information about the containing tabular dataset from within a CSV file that describes a fragment of that dataset
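The requirements above can be illustrated with a sketch of the metadata a CSV page might carry to locate itself within the whole result set. The field names are invented; the group has not defined this vocabulary:

```python
def page_metadata(total_rows, page_size, page_number):
    """Describe the row range a given page covers (1-indexed, inclusive)."""
    first = (page_number - 1) * page_size + 1
    last = min(page_number * page_size, total_rows)
    return {
        "totalRows": total_rows,   # summary info about the containing dataset
        "firstRow": first,         # where this subset fits in the structure
        "lastRow": last,
        "isLastPage": last == total_rows,
    }

print(page_metadata(total_rows=1234, page_size=100, page_number=6))
# → {'totalRows': 1234, 'firstRow': 501, 'lastRow': 600, 'isLastPage': False}
```

Page 6 of a 100-row pagination is exactly the “rows 501-600” subset cited in the requirement above.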

RECOMMEND REWORKING THIS USE CASE TO FOLLOW A SPECIFIC NARRATIVE ILLUSTRATING HOW A USER QUERIES A (LARGE) DATASET (WHERE RESULTS ARE SPLIT ACROSS MULTIPLE PAGES) AND THEN USES OPEN-REFINE (OR OTHER TOOLSET) TO VISUALISE / ANALYSE THE RESULTS – ILLUSTRATING HOW THE COLUMN ATTRIBUTES (DATA TYPE, UNITS ETC.) PROVIDE UTILITY?

DOES THE SEARCH RESULT PAGINATION AND RESULT COUNT RELATE TO A MORE GENERAL CASE FOR OPENSEARCH?

I THINK THE “SEARCH” TOPIC IS MISLEADING HERE – WE’RE NOT TRYING TO DESIGN A SEARCH PROTOCOL (OR AT LEAST I DON’T THINK SO!), WHAT WE’RE TRYING TO DO IS ILLUSTRATE HOW LARGE DATASETS (REPRESENTED HERE BY A LARGE SET OF SEARCH RESULTS) ARE PAGINATED INTO SMALLER SUBSETS FOR PUBLICATION.

8 Reliability Analysis of Police Open Data (Davide Ceolin) Crime statistics are published in CSV format by the UK Home Office. The crime reports are divided into categories and aggregated by geographical (or administrative) area. Over time, the category set and aggregation method vary due to changes in policy. Assessment of long-term crime trends is complicated by the resulting variation in CSV format: changes in categorisation affect the number of columns whilst changes in aggregation method affect the number of rows. Furthermore, it is difficult to assess whether sets of statistics are comparable given these changes in categorisation or aggregation method, which can yield unintentional biases or errors in the crime trend analysis.

To determine whether sets of statistics may be used for trend analysis, a “reliability analysis” is undertaken to compare the meanings of the column headings within the CSV file and determine the geographic area over which the crime reports have been aggregated. Details of geographic areas are not included within the CSV files and are often published by third parties.
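A minimal sketch of one step of such a reliability analysis: two releases are directly comparable only if their category columns coincide. The column names below are invented examples, not actual Home Office categories:

```python
def comparable(headers_a, headers_b, key_columns=("area",)):
    """Report category columns added or dropped between two releases."""
    cats_a = set(headers_a) - set(key_columns)
    cats_b = set(headers_b) - set(key_columns)
    return {"added": sorted(cats_b - cats_a), "dropped": sorted(cats_a - cats_b)}

# Hypothetical yearly releases with a change of categorisation between them.
y2011 = ["area", "burglary", "robbery", "vehicle crime"]
y2012 = ["area", "burglary", "robbery", "bicycle theft"]
print(comparable(y2011, y2012))
# → {'added': ['bicycle theft'], 'dropped': ['vehicle crime']}
```

Comparing header strings alone is fragile; if each column carried a machine-readable reference to its category definition, the comparison could be made on meanings rather than labels.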

REQ: machine-readable mechanism to associate (rich) semantics (e.g. RDF properties) with columns and rows for a given CSV file (repeat; #1)

REQ: it shall be possible to incorporate references to externally published information, including (but not limited to) (i) supporting documentation, (ii) column heading descriptions and (iii) cell values (repeat; #3)

WOULD BE USEFUL TO INCLUDE A SET OF CSV FILES (SOME OF WHICH ARE “UNRELIABLE”) TO DEMONSTRATE HOW THE RELIABILITY ANALYSIS USES THE CATEGORIES AND GEOGRAPHIC AREAS

THE RDF DATA CUBE PROVIDES MUCH OF THE MECHANICS TO MEET THIS REQUIREMENT WHEN PUBLISHING DATA IN RDF – WE NEED SIMILAR MECHANISMS IN CSV

9 Analysis of Scientific Spreadsheets (Martine de Vos via Davide Ceolin) Microsoft Excel is one of the tools used by environmental scientists to analyse and manipulate research data. For example, an [analytical] model is constructed in Excel to process source datasets and create an interim dataset that is used within other software such as simulations. This workflow requires significant manual effort as the source data files:

  • Do not have clear descriptions for the column headings
  • Lack references from cell values to externally published information that supplies the meaning of coded terms.

REQ: machine-readable mechanism to associate (rich) semantics (e.g. RDF properties) with columns and rows for a given CSV file (repeat; #1)

REQ: it shall be possible to incorporate references to externally published information, including (but not limited to) (i) supporting documentation, (ii) column heading descriptions and (iii) cell values (repeat; #3)

PLEASE PROVIDE A PARTICULAR NARRATIVE AND SUPPORTING EXCEL FILES; e.g. HYDROLOGY & RIVER FLOW

10 Publishing the results of scientific experiments (Alf Eaton) The results of scientific experiments (e.g. the collated set(s) of measurement values) are often published in tabular form. It is crucial that scientific data be unambiguous to avoid misinterpretation. A single experiment may result in the production of multiple datasets which are published for convenience as a single package.

REQ: it shall be possible to publish multiple tabular datasets in a single package and, where appropriate, describe the interrelationships between those tabular datasets

Ambiguities often encountered by scientists when working with tabular data include:

  • Uncertainty regarding the meaning or type of data in a given column
  • Understanding the context associated with repeated results (e.g. was the same measurement repeated multiple times, repeated on multiple subjects or repeated in different circumstances?)
  • Determining details about the data collection / experiment method

REQ: machine-readable mechanism to associate (rich) semantics (e.g. RDF properties) with columns and rows for a given CSV file (repeat; #1)

REQ: it shall be possible to incorporate references to externally published information, including (but not limited to) (i) supporting documentation, (ii) column heading descriptions and (iii) cell values (repeat; #3)

RECOMMEND MERGING WITH CASE #9 (ABOVE) USING A SPECIFIC NARRATIVE … AGAIN, WITH REAL EXAMPLE DATA

11 Visualisation of time series data with annotations (Alf Eaton) Long duration instrumental climate records are the basis of climate research. These time series illustrate how the physical characteristics of the environment at a given location vary with time over a period of years or decades – or on occasions, centuries. However, they are usually affected by inhomogeneities (artificial shifts) due to changes in the measurement conditions (e.g. relocations, modification or recalibration of the instrument etc.). As these artificial shifts often have the same magnitude as the climate signal, such as long-term variations, trends or cycles, a direct analysis of the raw time-series data can lead to wrong conclusions about climate change.

In order to mitigate this problem, many statistical homogenisation procedures have been developed for the detection and correction of these inhomogeneities. Once detected, the raw time-series data can be annotated to indicate the presence of artificial shifts in the data and, where possible, the reason for each shift.
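A sketch of the annotation step, assuming the homogenisation procedure has already run and produced a set of detected shift points (the detection itself is out of scope here; all column names and values are invented):

```python
import csv
import io

def annotate(series_csv, shifts):
    """Add an 'annotation' column flagging rows where a shift was detected."""
    rows = list(csv.DictReader(io.StringIO(series_csv)))
    for row in rows:
        row["annotation"] = shifts.get(row["date"], "")
    return rows

# Hypothetical monthly temperature series and one detected inhomogeneity.
series = "date,temperature\n1961-01,4.1\n1961-02,4.3\n1961-03,6.8\n"
detected = {"1961-03": "station relocated"}
for row in annotate(series, detected):
    print(row["date"], row["temperature"], row["annotation"])
```

Carrying such annotations with the published table, rather than in a separate document, is what the requirement below asks for: a consumer plotting the raw series can then distinguish the artificial shift from a genuine climate signal.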

REQ: it shall be possible to incorporate, or provide reference to, supplementary metadata, such as notes or annotations, for tabular datasets, columns, rows and cells (repeat; #3)

I REWROTE THIS WITH A NARRATIVE BASED ON CLIMATE OBSERVATIONS (source ... FOR INFO, THIS PROJECT HAS DEVELOPED R MODULES TO PROCESS THE SOURCE DATA AND DETERMINE THE INHOMOGENEITIES) … COMPLEMENTARY DATASETS ARE REQUIRED TO SUPPORT THIS USE CASE; I DON’T KNOW IF YOU HAVE A BETTER EXAMPLE IN MIND?

FOR OTHER EXAMPLES SEE THE CHARME PROJECT (Charme project website; FP7-funded project looking to use open annotation and linked data to annotate climate data – so called “commentary information”)