Second-Round Use Cases

From Data on the Web Best Practices

Dear Data-lovers,


We are delighted to announce the Second Public Working Draft of the W3C DWBP Use Cases & Requirements (UCR) document http://www.w3.org/TR/dwbp-ucr/. The use-cases and requirements contained in UCR will form the basis for the following working-group deliverables:

  • Data on the Web Best Practices
  • Data Quality and Granularity Description Vocabulary
  • Data Usage Description Vocabulary


We would like to invite you all to review the UCR doc, to ensure we have accurately captured a wide range of requirements that address the challenges associated with ‘Data on the Web’. This can include Open Data, closed/restricted data, Linked Data, Big Data, government data, scientific data, private-sector data, messy data, etc.


In addition, if you feel a certain instance of Data on the Web is underrepresented, we would love to hear your new use-cases. Use-cases from all domains are welcome. The template for use-cases is below (completing only the relevant fields is fine).


All new use-cases will be posted on the W3C DWBP wiki here https://www.w3.org/2013/dwbp/wiki/Second-Round_Use_Cases and will be discussed at the upcoming W3C TPAC meeting http://www.w3.org/2014/11/TPAC/ (join us on Thurs 30th & Fri 31st if you’re attending). These use-cases may then be incorporated in the next version of UCR.


We look forward to receiving your feedback and new use-cases!


Kind regards,

Deirdre, Bernadette and Phil

W3C DWBP-UCR Editors


Use Case Template: Please Copy

Contributor:

City, country:

URL:

Overview:

Elements:

(Each element described in more detail at Use-Case Elements )

  • Domains:
  • Obligation/motivation:
  • Usage:
  • Quality:
  • Size:
  • Type/format:
  • Rate of change:
  • Data lifespan:
  • Potential audience:

Positive aspects:

Negative aspects:

Challenges:

Potential Requirements:

LusTRE: Linked Thesaurus fRamework for Environment

Contributor: Riccardo Albertoni (CNR-IMATI)

City, country: Genoa, Italy

URL: http://linkeddata.ge.imati.cnr.it/

Overview: LusTRE is a framework that combines existing environmental thesauri to support the management of environmental resources. It treats the heterogeneity in scope and level of abstraction of existing environmental thesauri as an asset when managing environmental data. It therefore aims to exploit the linked data best practices SKOS (Simple Knowledge Organization System) and RDF (Resource Description Framework) to provide a multi-thesauri solution for INSPIRE data themes related to nature conservation.

LusTRE is intended to support metadata compilation and data/service discovery according to ISO 19115/19119. The development of LusTRE includes (i) a review of existing environmental thesauri and their characteristics in terms of multilingualism, openness and quality; (ii) the publication of environmental thesauri as linked data; (iii) the creation of linksets among the published thesauri as well as well-known thesauri exposed as linked data by third parties; and (iv) the exploitation of these linksets to take advantage of thesaurus complementarities in terms of domain specificity and multilingualism.

The quality of thesauri and linksets is an issue that is not necessarily limited to the initial review of thesauri; it should be monitored and promptly documented.

In this respect, a standardised vocabulary for expressing dataset and linkset quality would be advisable, so that the quality assessment of the thesauri included in LusTRE can be made accessible. Given the importance of linkset quality for effective cross-walking among thesauri, further services for assessing the quality of linksets are going to be investigated. Such services might be developed by extending the measure proposed in Albertoni et al., 2013, so that linksets among thesauri can be assessed according to their potential when interlinks are exploited for thesaurus complementarities.

LusTRE is currently under development within the EU project eENVplus (CIP-ICT-PSP grant No. 325232). It extends the common thesaurus framework (De Martino et al., 2011) that resulted from the EU project NatureSDIplus (ECP-2007-GEO-317007).
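To make the idea of exploiting linksets for multilingual complementarity more concrete, here is a minimal sketch in plain Python. It is not LusTRE's implementation: the concept identifiers (`thesA:…`, `thesB:…`), the labels and the helper `merged_labels` are all hypothetical, and a real deployment would work on RDF data rather than Python tuples. Only the `skos:exactMatch` property URI is taken from the SKOS vocabulary.

```python
# SKOS property used to state that two concepts are equivalent.
SKOS_EXACT_MATCH = "http://www.w3.org/2004/02/skos/core#exactMatch"

# Hypothetical linkset between two thesauri (identifiers are made up).
linkset = [
    ("thesA:wetland", SKOS_EXACT_MATCH, "thesB:c_001"),
    ("thesA:habitat", SKOS_EXACT_MATCH, "thesB:c_002"),
]

# Hypothetical labels each thesaurus provides, keyed by language tag.
labels = {
    "thesA:wetland": {"en": "wetland"},
    "thesB:c_001": {"it": "zona umida", "en": "wetland"},
    "thesA:habitat": {"en": "habitat"},
    "thesB:c_002": {"de": "Lebensraum"},
}


def merged_labels(concept, linkset, labels):
    """Collect labels of a concept plus those of directly linked concepts."""
    linked = {concept}
    for subj, pred, obj in linkset:
        if pred != SKOS_EXACT_MATCH:
            continue
        if subj == concept:
            linked.add(obj)
        elif obj == concept:
            linked.add(subj)
    result = {}
    for c in sorted(linked):
        for lang, text in labels.get(c, {}).items():
            # Keep the first label found for each language.
            result.setdefault(lang, text)
    return result


wetland_labels = merged_labels("thesA:wetland", linkset, labels)
habitat_labels = merged_labels("thesB:c_002", linkset, labels)
```

In this toy example a concept that has only an English label in one thesaurus gains an Italian label through the link, which is the kind of complementarity the overview describes. A wrong link would propagate a wrong label in the same way, which is one reason linkset quality matters.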

Elements:

  • Domains: Geographic information. The thesauri and controlled vocabularies provided within LusTRE are meant to ease the management of geographical data and services.
  • Obligation/motivation: Activity foreseen in EU project which encourages the adoption of INSPIRE metadata implementation rules.
  • Usage: Data that is the basis for services to the public.
  • Quality: Largely variable.
  • Lineage: Thesauri and controlled vocabulary provided come from Third Parties.
  • Size: Small; most of the thesauri are smaller than 100 MB.
  • Type/format: LusTRE publishes SKOS/RDF, but the thesauri considered for inclusion in LusTRE are not necessarily in that format.
  • Rate of change: Depends on the thesaurus; on average the rate of change is low.
  • Data lifespan: The eENVplus project runs from 2013 to 2015; the framework is going to be maintained after the project is concluded.
  • Potential audience: Public administrations involved in the cataloguing of geographical information and Spatial Data Infrastructure. Decision makers searching in Spatial Data Infrastructure.
  • Certification/Governance: na.

Positive aspects: The use case includes publication as well as consumption of data.


Negative aspects:


Challenges:

  • Diversity and (sometimes) complexity of Licenses.
  • Issues pertaining to multilingualism.
  • Assessment and documentation of dataset and linkset quality with domain-dependent quality metrics.

Primary Requirements

  • R-FormatMachineRead, R-FormatStandardized, R-FormatOpen, R-VocabReference, R-VocabDocum, R-VocabOpen, R-VocabVersion, R-MetadataAvailable, R-MetadataMachineRead, R-MetadataStandardized, R-MetadataDocum, R-LicenseAvailable, R-AccessBulk, R-UniqueIdentifier, R-PersistentIdentification, R-QualityCompleteness, R-QualityComparable, R-QualityMetrics, R-IncorporateFeedback.
  • An additional requirement about quality, which is not yet included in UCR, could be grounded in LusTRE; it is discussed in http://lists.w3.org/Archives/Public/public-dwbp-comments/2014Oct/0002.html

Secondary Requirements

  • R-Citable, R-Archiving, R-QualityOpinions, R-ProvAvailable, R-TrackDataUsage, R-MultipleRepresentations, R-UsageFeedback, R-DatasetVersioning, R-DatasetEnrichment

ASO - Airborne Snow Observatory

Contributor: Lewis John McGibbney (NASA Jet Propulsion Laboratory/California Institute of Technology)

City, country: Pasadena, CA, U.S.

URL: http://aso.jpl.nasa.gov

Overview: The two most critical properties for understanding snowmelt runoff and timing are the spatial and temporal distributions of snow water equivalent (SWE) and snow albedo. Despite their importance in controlling volume and timing of runoff, snowpack albedo and SWE are still largely unquantified in the US and not at all in most of the globe, leaving runoff models poorly constrained. NASA/JPL, in partnership with the California Department of Water Resources, has developed the Airborne Snow Observatory (ASO), an imaging spectrometer and scanning lidar system, to quantify SWE and snow albedo, generate unprecedented knowledge of snow properties for cutting edge cryospheric science, and provide complete, robust inputs to water management models and systems of the future.

Elements:

  • Domains: Digital Earth Modeling, Digital Surface Modeling, Spatial Distribution Measurement, Snow Depth, Snow Water Equivalent, Snow Albedo.
  • Obligation/motivation: Funding provided by NASA Terrestrial Hydrology, NASA Applied Sciences, and California Department of Water Resources.
  • Usage: Example data usage includes a turnaround of less than 24 hours for flight data, which is passed on to numerous water resource managers, aiding water conservation, policy and decision-making processes. Accurate, weekly, spatially distributed SWE has never been produced before, and is highly informative to reservoir managers who must make tradeoffs between storing water for summer water supply versus using water before snowmelt recedes for generation of clean hydropower. Accurate SWE information, when coupled with runoff forecasting models, can also have ecological benefits through avoidance of late-spring high flows released from reservoirs that are not part of the natural seasonal variability.
  • Quality: Available in a number of scientific formats to customers and stakeholders based on customer requirements.
  • Lineage: All ASO data stems directly from on-board imaging spectrometer and scanning lidar system instruments.
  • Size: Many TB in total. Raw data acquisition depends on the basin/survey size. Recent individual flights generate on the order of 500 GB, which includes imaging spectrometer and lidar data. This does, however, shrink considerably if we consider only the data that we would distribute.
  • Type/format: Digital Elevation Model / binary image (not public atm), Lidar (Raw Point Clouds)/ las (not public atm), Raster Zonal Stats / text (not public atm), Snow Water Equivalent / tiff, Snow Albedo / tiff
  • Rate of change: Recent weekly flights have provided information on a scale and timing that has never occurred before. Distributed SWE increases after storms, and decreases during melt events in patterns that have never before been measured and will be studied by snow hydrologists for years to come. Once data is captured it is not updated; however, subsequent data is generated from the original data within processing pipelines, which include screening for data quality control and assurance.
  • Data lifespan: For immediate operational purposes, the last flight's data become obsolete when a new flight is made. However, the annual sequence of data sets will be leveraged by snow hydrologists and runoff forecasters during the next decade as they are used to improve models and understanding of the spatial nature of the mountain snowpack.
  • Potential audience: (snow) hydrologists, hydrologic modelers, runoff forecasters, and reservoir operators and reservoir managers.
  • Certification/Governance: na.
  • Dataset versioning and dataset replication: This use case is particularly suited to ISSUE-94: Dataset versioning and dataset replication as it is typical for missions of this nature to take such matters seriously. In particular ASO aims to have versions of data year after year (or less) to draw historical comparisons over time. Dataset versioning is therefore identified as being closely linked to adequate archival of data products generated and curated within the scope of the mission.

Positive aspects:

  • This use case provides insight into what a NASA funded demonstration mission looks like (from a data provenance, archival point of view).
  • It is an excellent opportunity to delve into an earth science mission which is actively addressing the global problem of water resource management. Recently senior officials have declared a statewide (CA) drought emergency and are asking all Californians to reduce their water use by 20 percent. California, and other U.S. states are experiencing a serious drought and the state will be challenged to meet its water needs in the upcoming year. Calendar year 2013 was the driest year in recorded history for many areas of California, and current conditions suggest no change is in sight for 2014. ASO is at the front line of cutting edge scientific research meaning that the data which back the mission, as well as the practices adopted within the project execution are extremely important to addressing this issue.
  • Project collaborators and stakeholders are sent data and information when it is produced and curated. For some stakeholders, the data they require (in an operational sense) is very small in size, and in such cases ASO emphasizes speed. For short-term turnaround information, this is more a sharing of information than the delivery of a product.

Negative aspects: Demonstration missions of this caliber also have downsides. With regard to data best practices, more work is required in the following areas:

  • Documentation of processes including data acquisition, provenance tracking, curation of data products such as bare earth digital earth models (DEM), full surface digital surface models (DSM), snow products, snow water equivalents (SWE), etc.
  • Currently data is not searchable; this makes retrieval of specific data difficult as data volumes grow to this size and nature.
  • There is no publicly available guidance regarding suggested tools which can be used to interact with the data sources
  • Quick turnarounds of operational data may be compromised when ASO moves beyond a demonstration mission and picks up new customers etc. This will most likely be due to the time needed to generate and distribute science-grade products.

Challenges:

  • Data volumes are large and will grow year on year. The volume of generated data grew by 50% between 2013 and 2014.
  • On many occasions we require a very quick turnaround on inferences which can be made from the data. This sometimes (but not always) comes at the cost of reducing the emphasis on best practices for the generation, storage and archival of project data.
  • The data takes the form of science-oriented representational formats. Such formats are not typical of the data many people publish on the web. A lot of thought needs to be put into how this data can be better accessed.

Primary Requirements

  • R-DesignatedUserExpertise, R-SoftwareDataUsage, R-UsageFeedback, R-Location, R-GranularityLevels, R-QualityMetrics, R-FormatMachineRead, R-QualityCompleteness

Secondary Requirements

  • R-DataIrreproducibility, R-LicenseLiability, R-MetadataAvailable, R-DataMissingIncomplete

Uses of Open Data Within Government for Innovation and Efficiency

The Share-PSI 2.0 network, co-funded by the European Commission, is running a series of workshops throughout 2014 and 2015 examining different aspects of how to share Public Sector Information (PSI). This is in the context of the revised European Directive on Public Sector Information. The network's focus is therefore narrower than that of the Data on the Web Best Practices Working Group; however, the overlap is substantial. There are more than 40 partners in the Share-PSI 2.0 network from 25 countries, including many government departments as well as academics, consultants, citizens' organizations and standards bodies involved directly with PSI provision.

The report from the first Share-PSI 2.0 workshop, held as part of the Samos Summit 30 June - 1 July 2014, summarizes the many papers and discussions held at that event. From it, we can derive a long list of requirements.

Contributor: Phil Archer

URL: http://www.w3.org/2013/share-psi/workshop/samos/report

Elements and challenges not included here as the report summarizes many use cases.

Requirements in the 14 October WD (http://www.w3.org/TR/2014/WD-dwbp-ucr-20141014/)

R-VocabReference

R-MetadataStandardized

R-MetadataDocum

R-ProvAvailable

R-IndustryReuse

R-SelectHighValue

R-SelectDemand

R-AccessRealTime

R-AccessUpToDate

R-SensitivePrivacy

R-SensitiveSecurity

R-Citable

R-CoreRegister

R-QualityComparable

R-IncorporateFeedback

Requirements not included in the 14 October WD

R-Location

Locations (countries, regions, administrative units) must be referred to consistently

Perhaps a new set of requirements around policy issues

R-Strategy

Data should be shared on the Web following a well-defined plan and process

R-PolicySupport

Any data sharing strategy, whether public or private sector, must have support at the highest level.

R-AccessLevels

A strategy is likely to need to define at least three levels of access: confidential (not shared), restricted (access subject to authentication), open.
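As a rough illustration of the proposed R-AccessLevels requirement, the three levels could be modelled as follows. This is a sketch only: the names `AccessLevel` and `may_access` are hypothetical, and a real strategy would define its own authentication and authorization rules.

```python
from enum import Enum


class AccessLevel(Enum):
    """The three access levels named in R-AccessLevels."""
    CONFIDENTIAL = "confidential"  # not shared
    RESTRICTED = "restricted"      # access subject to authentication
    OPEN = "open"                  # available to everyone


def may_access(level: AccessLevel, authenticated: bool) -> bool:
    """Decide whether a requester may access data at the given level."""
    if level is AccessLevel.OPEN:
        return True
    if level is AccessLevel.RESTRICTED:
        return authenticated
    # Confidential data is never shared, even with authenticated users.
    return False
```

Attaching such a level to each dataset's metadata would let a publishing tool enforce the strategy uniformly instead of deciding access case by case.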

UK Open Research Data Forum

Contributor: Phil A

City, country: UK

URL: http://www.researchinfonet.org/wp-content/uploads/2014/07/Joint-statement-of-principles-June-2014.pdf

Overview:

In 2013, the Royal Society led the formation of the UK Open Research Data Forum. This effort is a national reflection of a global trend towards the open publication of research data; see, for instance, the work of the Research Data Alliance, DataCite and the US National Institutes of Health as described in a talk by its Associate Director for Data Science, Philip Bourne. Following a workshop in April 2014, the UK Open Research Data Forum and US Committee on Coherence at Scale issued a joint statement (PDF) of the principles of open research data.

  1. The data that provide the evidence for the concepts in a published paper or its equivalent, together with the relevant metadata and computer code must be concurrently available for scrutiny and consistent with the criteria of “intelligent openness”. The data must be:
    • discoverable – readily found to exist by online search;
    • accessible – when discovered they can be interrogated;
    • intelligible – they can be understood;
    • assessable – e.g. the provenance and reliability of data;
    • reuseable – they can be reused and re-combined with other data.
  2. The data generated by publicly or charitably funded research that is not used as evidence for a published scientific concept should also be made intelligently open after a pre-specified period in which originators have exclusive access.
  3. Those who reuse data but were not their originators must formally acknowledge their originators.
  4. The cost of creating intelligently open data from a research project is an intrinsic part of the cost of research, and should not be considered as an optional extra.
  5. Although the default position for data generated by publicly or charitably funded research should be one of “intelligent openness”, there are justifiable limits to openness. These are where commercial exploitation is in the public interest and the sectoral business model requires limitations on openness; in preserving the privacy of individuals whose personal information is contained in databases; where data release would endanger safety (unintended accidents) or security (deliberate attack). However, these instances do not provide justification for blanket exceptions to the default position for those researchers or research institutions whose role is to disseminate openly their findings, and should be argued on a case-by-case basis.
  6. Existing processes, reward structures and norms of behavior that inhibit or prevent data sharing or new forms of open collaboration should, wherever possible, be reformed so that data sharing and collaboration are encouraged, facilitated and rewarded.

At the time of writing, these are undergoing review and refinement but the aims are clear. In the context of the Data on the Web Best Practices Working Group, many requirements stem from this list.

Challenges

Each principle listed here represents one or more challenges, with points 1, 3 and 5, being particularly relevant to Data on the Web Best Practices. Matters of policy and culture within any domain, whilst certainly challenging, are out of scope for the current work.

Elements:

(Each element described in more detail at Use-Case Elements )

  • Domains: Research data
  • Obligation/motivation: Cultural/professional obligation
  • Usage: Data that supports the scientific method
  • Quality: Variable - often empirical, often messy. Some of the data may not be repeatable
  • Size: Highly variable but it's noteworthy that research data can be very large (e.g. genomics)
  • Type/format: variable including some specialist formats, XML dialects etc. but often CSV
  • Rate of change: Usually the data is static
  • Data lifespan: Publication often associated with a journal publication that marks the end of the cycle.
  • Potential audience: Research peers

Requirements R-FormatMachineRead, R-FormatStandardized, R-FormatOpen, R-MetadataAvailable, R-MachineReadable, R-MetadataStandardized, R-MetadataDocum, R-LicenseAvailable, R-ProvAvailable, R-AccessBulk, R-SensitivePrivacy, R-SensitiveSecurity, R-UniqueIdentifier, R-Citable, R-PersistentIdentification, R-TrackDataUsage

Open Experimental Field Studies

Contributor: Eric Stephan

City, country: Richland, USA

URL: n/a

Overview:

In 2013 the United States White House published an executive order on Open Data to help make publicly available data understandable, accessible, and searchable. A number of historical and on-going atmospheric studies fall into this category but are not currently open. This use case describes characteristics of laboratory experiments and field studies that could be published as Open Data.

For measurements to be considered useful and comparable to other findings, scientists need to track every aspect of their laboratory and field experiments. This can include: background describing the purpose of the experiment, field site selected, instrumentation deployed, configuration settings, housekeeping data, types of measurements that need to be taken, work performed on field visits, processing the raw measurements, intermediate processing data, value added data products, quality assurance, problem reporting, and standards relied upon for disseminating the study results including selected data formats, quality control codes selected, engineering units selected, and metadata vocabularies relied upon for describing the measurements.

Traditionally knowledge and data about the studies have either been kept in separate local databases, file systems and spreadsheets, or in non-record-keeping systems. If kept electronically, the experiment in its entirety may be kept in bulk by way of archive files (tar, zip etc). Measurements from the study may be shared along with background information in the form of a summarized report or publication, content management system or wiki site, but the bulk of the knowledge is largely retained internally by data providers.

Elements:

(Each element described in more detail at Use-Case Elements )

  • Domains: Open scientific experimental research relying upon in situ and remote sensing instruments. E.g. wind studies that may use anemometers and LIDAR to study wind measurements.
  • Obligation/motivation: Answer scientific questions about the characteristics and behavior of the physical system being studied.
  • Usage: Data may be analyzed and visualized by applications, used in computational models or combined in larger data sets for larger studies. Data must also be discoverable. Part of discoverability is being able to convey data from different perspectives such as measurement (e.g. wind speed) and instrument (e.g. lidar, anemometer). If we are successful, our hope is that this data will be discoverable by other researchers throughout the world.
  • Quality: Housekeeping data, problem reporting, maintenance history, calibration history
  • Size: Dependent on the length of the study, measurement rate, and the size of each sample. Size can vary from kilobytes to tens of gigabytes daily for a single instrument.
  • Type/format: raw data is dictated by the instrument producing the measurements. Intermediate results and value added products can be in binary, delimited text file, NetCDF, or stored in other formats. Data may also be put into standardized formats and rely on existing community vocabularies to describe the dataset metadata where they are available.
  • Rate of change: Data will be measured, collected and made available in near real time.
  • Data lifespan: This may vary between scientific communities. For atmosphere field studies data cannot be reproduced and may be retained forever. If a laboratory experiment can be repeated, it may have a limited lifespan. In cases where data is cited even repeatable experiments will be available to back up the published research findings.
  • Potential audience: domain experts and scientific peers, science teachers and students. Other domains will use these results.
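The Usage element above asks for data to be discoverable from different perspectives, such as measurement and instrument. A minimal sketch of that idea is an index built per perspective over the same catalog entries. The dataset identifiers, the "aerosol backscatter" measurement and the helper `build_index` are hypothetical and only serve the illustration.

```python
from collections import defaultdict

# Hypothetical catalog entries; identifiers and values are illustrative only.
datasets = [
    {"id": "study-1/lidar-wind", "measurement": "wind speed", "instrument": "lidar"},
    {"id": "study-1/anemometer-wind", "measurement": "wind speed", "instrument": "anemometer"},
    {"id": "study-2/lidar-aerosol", "measurement": "aerosol backscatter", "instrument": "lidar"},
]


def build_index(entries, perspective):
    """Group dataset identifiers by the value of one metadata field."""
    index = defaultdict(list)
    for entry in entries:
        index[entry[perspective]].append(entry["id"])
    return dict(index)


# The same datasets, viewed from two perspectives.
by_measurement = build_index(datasets, "measurement")
by_instrument = build_index(datasets, "instrument")
```

A researcher interested in wind speed and one interested in lidar data would each find the first dataset, but through different entry points.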

Positive aspects: The Web of Things (instruments), Linked Services (processing software), and Linked Data communities offer an opportunity for field or laboratory experiments: coupling all the elements of the experiment into one composite product. By leveraging these technologies it is possible to construct a catalog that acts as a concierge to any collaborator, giving them perspectives on things, services, and data.

Negative aspects:

  • When data is published on the web there is no mechanism for users to rate and review data.
  • Data providers usually are unaware of new user communities using measurements.

Challenges:

  • Publishing experiments to publicly accessible web-based archives.
  • Advertising experiments in catalogs that include comprehensive information about the things and services used in the experiment.
  • Providing a composite experiment in such a way that it is useful to users who are not fellow collaborators.
  • Identifying new emerging target user communities.
  • Without specific best practices guidance data may not be published and irreproducible data risks being lost.
  • Policies need to be provided in the experimental design stating when it is acceptable to publish data and when to keep it initially private.


Potential Requirements:

  • R-FormatMachineRead
  • R-FormatStandardized
  • R-VocabReference
  • R-VocabOpen
  • R-AccessRealTime
  • R-MultipleRepresentations
  • R-DataLifecycleStage – Data should be identified by a designated lifecycle stage.
  • R-DataIrreproducibility – Data should be designated if it is irreproducible.
  • R-DesignatedUserExpertise – Data should be designated if either by virtue of its complexity or its nature is relevant to users with specific expertise.
  • R-DesignatedThingsServiceProviders – Data produced by things or services should be associated with complete things/services metadata descriptions.
  • R-SoftwareDataUsage – Data should be annotated with descriptions of software applications using the data.
  • R-UsageFeedback – Data consumers should have a way of sharing feedback and rating data.


Free Open Data SLAs for Open Data publishing

Contributor: Peter Hanečák (COMSODE consortium), Oskár Štoffan

City, country: Bratislava, Slovak Republic

URL: http://www.comsode.eu

Overview: Very few (if any) Open Data publishers define an SLA and make it publicly available to consumers. There is a need to define one or more standard “free/Open Data SLAs” in order to keep data consumers informed about the conditions under which they are provided with data (or an API service). Such an SLA should also provide sufficient information about the expected availability and quality of the service, its limitations, how overload is handled, etc. So far "best effort" is usually employed or, worse, merely assumed, which poses a risk of misunderstanding between publishers and consumers. It would be best to define such “free/Open Data SLAs” in a machine-readable format, as mentioned in another use case [1].

[1] http://www.w3.org/2013/dwbp/wiki/Use_Cases_Document_Outline, items "RSLA1: SLAs should be provided in a machine-readable format" and "RSLA1: Standard vocabularies should be used to describe SLA".

Elements:

  • Domains: Open Data in general
  • Obligation/motivation: to provide clearly defined services to consumers of Open Data
  • Usage: Open Data publishing in general
  • Potential audience: Open Data publishers

Challenges: Integrating such SLAs with real publishing tools, so that, say, a web server can be given such an SLA and will then enforce it.

Potential Requirements: As a result publishers should:

  • know how to create, from scratch, a versatile Open Data specific SLA ensuring fair conditions under which Open Data is published
  • or reuse some predefined standard SLA template
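To suggest what a machine-readable "free/Open Data SLA" might look like, the following sketch serializes a small SLA record to JSON. Everything in it is an assumption made for illustration: the class `OpenDataSLA`, its field names, the example URL and the values do not come from any standard SLA vocabulary, which this use case notes is still to be defined.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class OpenDataSLA:
    """Hypothetical SLA record; field names are illustrative, not a standard."""
    service_url: str
    availability_percent: float    # expected availability of the service
    max_requests_per_minute: int   # a stated limitation
    overload_policy: str           # how overload is handled
    support_level: str             # e.g. the commonly assumed "best-effort"


sla = OpenDataSLA(
    service_url="https://data.example.org/api",  # placeholder URL
    availability_percent=99.0,
    max_requests_per_minute=60,
    overload_policy="reject-with-http-429",
    support_level="best-effort",
)

# Serialize to JSON so that both people and tools can read the conditions.
sla_json = json.dumps(asdict(sla), indent=2, sort_keys=True)
```

A publishing tool or a consumer's client could read such a document before use, for instance to throttle its own requests to the stated limit.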

Dataset versioning and dataset replication

Contributor: Peter Hanečák (COMSODE consortium), Oskár Štoffan

City, country: Bratislava, Slovak Republic

URL: http://www.comsode.eu

Overview: It is useful to track changes in data during the Open Data publishing process (what was changed). It is also important to provide a sort of "paper trail" for published datasets (when changes occurred, by whom, why, how). Proper versioning of a published dataset can help keep and provide such information and functionality:

  • track changes in data
  • provide possibility to review the history of changes
  • provide audit trail
  • get access to any previous version of the data, not only the most recent version
  • get dataset updates more efficiently

These are important mainly for data consumers, chiefly to build trust in the data and to develop more knowledge about what the data is and how it is collected, maintained, etc. This in turn helps consumers make better use of the data. Versioning may also help data publishers publish the data more efficiently and spot and fix bugs in the publishing process.

Especially when changes are frequent and small compared to the size of a dataset, storage of and subsequent access to the data should be easy and efficient. Tracking the changes makes it possible to reconstruct a dataset at any required version (presuming that the first version is always stored as a whole dataset and all changes are properly tracked and versioned). This in turn reduces the amount of data that needs to be stored or transferred each time changes are made.

The approach described above is achievable using Git (optionally also GitHub) as a repository for data [1]. The solution comes with some advantages and some limitations as well (the file size of the dataset and the format of the data; line-oriented text is best for diffs). These limitations are compensated for by existing diff/merge tools. Git (and GitHub) also represents a more scalable hosting solution (especially for small municipalities). Replication of data is handled efficiently via “git pull” each time a data update is available (it is not required to transfer the whole dataset when data changes). Use of Git(Hub) and its tools by non-programmers and the general public might, however, be a bit cumbersome at the moment. The question then arises whether it would be feasible to develop a GitHub-like service dedicated to data, to avoid the limitations that come from the fact that Git and GitHub are today intended and used mainly as source code repository and management tools.

[1] Suggestion inspired by OKFN’s http://blog.okfn.org/2013/07/02/git-and-github-for-data/ .
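The idea of storing the first version whole and only the changes afterwards can be sketched with Python's standard `difflib` module. This is a simplified stand-in for illustration, not how Git stores data internally; the helper names `make_delta`, `apply_delta` and `checkout`, and the sample CSV rows, are hypothetical.

```python
from difflib import SequenceMatcher


def make_delta(old, new):
    """Record only the changed regions between two lists of lines."""
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    delta = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Lines old[i1:i2] are replaced by new[j1:j2] (either may be empty).
            delta.append((i1, i2, new[j1:j2]))
    return delta


def apply_delta(old, delta):
    """Rebuild the newer version from the older one plus its delta."""
    result = []
    position = 0
    for i1, i2, replacement in delta:
        result.extend(old[position:i1])  # unchanged lines before the change
        result.extend(replacement)       # inserted or replacing lines
        position = i2                    # skip deleted or replaced lines
    result.extend(old[position:])
    return result


def checkout(first_version, deltas, version):
    """Reconstruct version N (0 is the first) by replaying deltas in order."""
    lines = list(first_version)
    for delta in deltas[:version]:
        lines = apply_delta(lines, delta)
    return lines


# Made-up line-oriented dataset with one small change per version.
v0 = ["id,name,value", "1,alpha,10", "2,beta,20"]
v1 = ["id,name,value", "1,alpha,11", "2,beta,20"]
v2 = v1 + ["3,gamma,30"]

# Only v0 is stored whole; later versions are stored as deltas.
deltas = [make_delta(v0, v1), make_delta(v1, v2)]
```

Each delta here holds a single changed line rather than the whole file, which mirrors the storage and transfer savings described above and also shows why line-oriented text formats suit this approach.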

Elements:

  • Domain: Open Data in general
  • Obligation/motivation: to optimize the process of publishing regularly updated data and its availability to data consumers
  • Usage: Open Data publishing in general
  • Potential audience: Open Data publishers

Challenges: A Git/GitHub-like service dedicated to data?

Potential requirements: A set of best practices could be proposed for tracking changes in the Open Data publishing process, so that proper versioning and effective replication of requested data can be performed easily, along with review and auditing of changes.


Mass Spectrometry Imaging (MSI)

Contributor: Annette Greiner, Lawrence Berkeley National Laboratory

City, country: Berkeley, California, USA

URL: https://openmsi.nersc.gov

Overview: Mass spectrometry imaging (MSI) is widely applied to image complex samples for applications spanning health, microbial ecology, and high throughput screening of high-density arrays. MSI has emerged as a technique suited to resolving metabolism within complex cellular systems, where understanding the spatial variation of metabolism is vital for making a transformative impact on science. Unfortunately, the scale of MSI data and the complexity of analysis present an insurmountable barrier to scientists: a single 2D image may be many gigabytes, and comparison of multiple images is beyond the capabilities available to most scientists. The OpenMSI project will overcome these challenges, enabling broad use of MSI by researchers by providing a web-based gateway for management and storage of MSI data, the visualization of the hyper-dimensional contents of the data, and the statistical analysis.

Elements:

(Each element described in more detail at Use-Case Elements )

  • Domains: imaging mass spectrometry, life sciences, microscopy, analytical chemistry
  • Obligation/motivation: scientific analysis, reporting results, collaboration
  • Usage: Data sets can be contributed by researchers anywhere in the world and perused/analyzed by anyone. Users can share their data with individuals and the public using a familiar user/group permission scheme with view/edit/own rights. Once their dataset is in the system, a researcher can select subsets of the data for viewing as an image or spectrum. Researchers can perform statistical analysis of their data, e.g., via non-negative matrix factorization, while the API and online viewers enable users to interact with derived analytics in the same way as with raw data. Users can also download individual images. A REST API provides programmatic access to enable custom remote data analytics and retrieval of data subsets.
  • Quality: varies with mass spectrometry instrument used, preparation of sample
  • Size: Sizes typically range from 10-50 GB per sample (before compression). Larger images of 50-500 GB can already be generated today. Each lab with an OpenMSI account typically generates 2-5 samples per week.
  • Type/format: Multiscale, multimodal, and multidimensional data stored using the OpenMSI file format based on HDF5.
  • Rate of change: Underlying data for an experiment does not generally change, though new analyses and metadata will be added over time.
  • Data lifespan: years to decades
  • Potential audience: working scientists interested in obtaining spatially resolved chemical information about samples including scientists researching cancer, agriculture, and synthetic biology.
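
To put the Size element above in perspective, the quoted figures can be combined into a rough yearly volume per lab. This is a back-of-the-envelope sketch only; the 52 active weeks per year and the decimal definition of a terabyte are assumptions, not figures from the OpenMSI project.

```python
# Figures quoted in the Size element above.
SAMPLE_GB = (10, 50)        # typical size per sample, before compression
SAMPLES_PER_WEEK = (2, 5)   # typical samples per lab per week
WEEKS_PER_YEAR = 52         # assumption: data is generated year-round


def yearly_volume_tb(sample_gb, samples_per_week, weeks=WEEKS_PER_YEAR):
    """Return (low, high) uncompressed volume per lab per year in TB (1 TB = 1000 GB)."""
    low = sample_gb[0] * samples_per_week[0] * weeks / 1000
    high = sample_gb[1] * samples_per_week[1] * weeks / 1000
    return low, high


low_tb, high_tb = yearly_volume_tb(SAMPLE_GB, SAMPLES_PER_WEEK)
print(f"Per lab per year: {low_tb:.2f} to {high_tb:.2f} TB uncompressed")
```

Under these assumptions a single lab produces roughly 1 to 13 TB of uncompressed data per year, which is why access to subsets rather than whole files matters for this use case.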

Positive aspects: huge improvement in ease of analysis over traditional methods, ability to readily share results with other researchers, ability to download relevant subsets of data, metadata provided for each sample, self-describing data format, fast and flexible web API, interactive web-based exploration that enables users to view data that cannot be opened using standard MSI tools.

Negative aspects: submission of metadata should be easier and automated. As it scales, we'll need to facilitate discovery of datasets of interest via search.

Challenges: The project is largely unfunded, and resources are vitally needed for the project to succeed.

Potential Requirements: Enable users to access/download subsets of large datasets. Make data available through an API. Make data available in formats that users and programs can readily manipulate. Provide good metadata. For an API, provide documentation for usage as well as metadata. Provide a description for each field and units for all measures. Use self-describing data formats when possible. Optimize speed when delivering large data through an API. Provide preliminary analysis/visualization where demand justifies the effort. Limit access to non-public data to those who are authorized. For collaborative publication, allow users to readily contribute new data. Enable users to find datasets of interest within collections.

R-FormatStandardized, R-FormatMachineRead, R-FormatOpen, R-MetadataAvailable, R-MetadataMachineRead, R-MetadataDocum, optimize speed, R-SensitiveSecurity, allow contributions, searchable, data enrichment, data subsetting, available as API

(These added by PhilA 10 Nov, see issue-95 )

  • LargeDataSetAPIs - large datasets should be available via APIs to allow access to specific portions of the data ??
  • The previous implies API documentation, which might include URIs for slices of the data - I feel the ocean struggling towards boiling point
  • Access control is mentioned too

(Annette replies 28 Nov) AG: Re point 1, I would say that making data available via an API is a BP not just for large datasets. Re point 2, I think we shouldn't say more than that APIs need usage documentation; the rest is about how to make a good API, which I think is out of scope, or how to publish data on the web, which is the rest of the doc.

BBC ontology versioning and Metadata

Contributor: Ghislain Atemezing (EURECOM)

City: Sophia-Antipolis

Country: France

URL: http://www.bbc.co.uk/ontologies

Overview: The BBC provides a list of the ontologies it implements and uses for its Linked Data platform, available at http://www.bbc.co.uk/ontologies. The site provides access to the ontologies the BBC uses to support its audience through its applications, such as BBC Sport http://www.bbc.co.uk/sport or BBC Education http://www.bbc.co.uk/education. Each ontology is presented with a short description including metadata, an introduction, sample data, an ontology diagram and the terms used in the ontology. The metadata generally contains 6 fields: authors (as mailto links), created date, version (current version number), prior version (decimal), license (a link to the license) and a link for downloading the RDF version. For example, see the description of the “Core concepts ontology” at http://www.bbc.co.uk/ontologies/coreconcepts. However, the metadata available in the HTML page is NOT present in a machine-readable format; for example, it is absent from the ontology itself. Versioning: Each ontology uses a decimal notation for its versions, e.g. 1.9. The URL for accessing each version file of the ontology is constructed as {BASE-URI}/{ONTO-PREFIX}/{VERSION}.ttl, where {BASE-URI} is http://www.bbc.co.uk/ontologies/. Example: The file of version 1.9 of the “core concepts” ontology is located at http://www.bbc.co.uk/ontologies/coreconcepts/1.9.ttl. However, the URI of the ontology itself is the same across versions, of the form {BASE-URI}/{ONTO-PREFIX}/.
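
The URL pattern described in the overview can be expressed as a small helper. This is a minimal sketch based only on the pattern and the example given above; the function names are illustrative and not part of any BBC tooling.

```python
BASE_URI = "http://www.bbc.co.uk/ontologies/"


def ontology_uri(prefix, base=BASE_URI):
    """URI of the ontology itself, identical across versions."""
    return f"{base.rstrip('/')}/{prefix}/"


def version_file_url(prefix, version, base=BASE_URI):
    """URL of the Turtle file for one specific version of an ontology."""
    # Strip the trailing slash of the base so the pattern does not yield "//".
    return f"{base.rstrip('/')}/{prefix}/{version}.ttl"


print(ontology_uri("coreconcepts"))
print(version_file_url("coreconcepts", "1.9"))
```

The two functions make the asymmetry visible: the version number appears in the file URL but not in the ontology URI, so the URI alone does not tell a client which version it is getting.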

Elements

  • Domains: vocabulary catalogue, versioning, metadata
  • Obligation/motivation: Provide a unique point of vocabularies built within BBC
  • Usage: The site provides access to the ontologies the BBC uses to support its audience through its applications.
  • Quality: High level and domain vocabularies adapted to BBC applications.
  • Size: currently, there are 12 ontologies of different sizes, from 40 triples to 750 triples.
  • Type/format: RDF/TURTLE, and html pages describing each ontology
  • Rate of change: Depends on the vocabulary and may depend on the different versions, although no such metadata is available
  • Data lifespan: n/a
  • Potential audience: BBC applications and any user interested in the domains of the vocabularies (publishers, researchers or developers)


Challenges:

  • It would be nice, and more consistent, to systematically add the metadata provided in the HTML pages describing each BBC ontology to the RDF vocabulary itself.
  • How to dereference, from a unique URI, different versions of the ontology in different flavors of RDF (XML, Turtle, etc.)
  • Need to add the modified date along with the version of each ontology.
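
One way the first and third challenges might be addressed is to emit the HTML-page metadata as triples inside the ontology file. The sketch below renders such a Turtle header from Python. The choice of properties (owl:versionInfo, owl:priorVersion, dcterms:created, dcterms:modified, dcterms:license) is an assumption made for illustration, not the BBC's practice, and every value other than the version number 1.9 and the ontology URI is a placeholder.

```python
def ontology_metadata_turtle(ontology_uri, version, prior_version_uri,
                             created, modified, license_uri):
    """Render version metadata for an ontology as a Turtle snippet."""
    lines = [
        "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
        "@prefix dcterms: <http://purl.org/dc/terms/> .",
        "@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .",
        "",
        f"<{ontology_uri}> a owl:Ontology ;",
        f'    owl:versionInfo "{version}" ;',
        f"    owl:priorVersion <{prior_version_uri}> ;",
        f'    dcterms:created "{created}"^^xsd:date ;',
        f'    dcterms:modified "{modified}"^^xsd:date ;',
        f"    dcterms:license <{license_uri}> .",
    ]
    return "\n".join(lines)


turtle = ontology_metadata_turtle(
    ontology_uri="http://www.bbc.co.uk/ontologies/coreconcepts/",
    version="1.9",
    # Placeholder values below: hypothetical, not taken from the BBC site.
    prior_version_uri="http://www.bbc.co.uk/ontologies/coreconcepts/1.8.ttl",
    created="2014-01-01",
    modified="2014-06-01",
    license_uri="http://example.org/license",
)
print(turtle)
```

With the metadata carried in the RDF itself, a consumer that downloads only the .ttl file still sees the version, dates and license without scraping the HTML page.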

Potential Requirements: R-VocabVersion, R-MetadataMachineRead, R-MetadataDocum, R-MetadataStandardized, R-VersionURIDesign?

R-VersionURIDesign: “Data should have a canonical way to design URIs for different snapshots of the dataset.”

Web Observatory

Contributor: Adriano C. Machado Pereira, Adriano Veloso, Gisele Pappa, Wagner Meira Jr.

City, country: Belo Horizonte, Brazil

URL: http://observatorio.inweb.org.br/english.html

Overview: There are almost 65 million Brazilians connected to the Internet - 36% of the Brazilian population, according to Comitê Gestor da Internet no Brasil. As a consequence, events such as the Brazilian Election Running have become popular topics on the Web, mainly in Online Social Networks. Our goal is to understand this new reality and present new ways to watch facts, events and entities on the fly using the Web and user-generated content available in Online Social Networks and Blogs. The Web Observatory is a research project that is part of the Instituto Nacional de Ciência e Tecnologia para a Web (INWEB), sponsored by CNPq and Fapemig. There are over 30 experts involved in the project, from four different federal universities: Universidade Federal de Minas Gerais (UFMG), Centro Federal de Educação Tecnológica de Minas Gerais (CEFET-MG), Universidade Federal do Amazonas (UFAM) and Universidade Federal do Rio Grande do Sul (UFRGS). The INWEB researchers use a set of new techniques related to Information Retrieval, Data Mining and Data Visualization to understand and summarize what the media and users are talking about on the Web. That is the fundamental piece needed to evaluate the impact of the Olympic Campaigns and how users react to news and discussions. One new feature in this project is the ability to see the propagation of Tweets.


Elements:

  • Domains:

Different contexts or domains, related to data from the Web. For example: Health (for example, diseases); Tourism; Sports (for example, soccer championship and Olympic games); Politics; Finance; Etc.

  • Obligation/motivation: Data must be obtained from different public data sources from the Web.
  • Usage: Provide different data analyses, indicators or visualizations to allow a better understanding of a context.
  • Quality: Variable; depends on the data source, which can be structured or not.
  • Size: Variable; ranges from small data instances to a huge amount of data, depending on the context under investigation. In general, there is a huge amount of data.
  • Type/format: Diverse, like CSV, HTML, JSON, XML, etc.
  • Rate of change: Different rates of change, usually very dynamic.
  • Data lifespan: n/a
  • Potential audience: Diverse, different Web users.

Challenges:

  • Data volume;
  • Data velocity;
  • Data variety;
  • Data value;
  • Complexity.

Potential Requirements: R-MetadataMachineRead , R-MetadataStandardized , R-MetadataDocum , R-VocabReference , R-VocabDocum , R-VocabOpen , R-ProvAvailable, R-GranularityLevels

Potential Requirements (related to Data Enrichment):

  • R-DataCharacterization;
  • R-DataImputation;
  • R-DataSegmentation;
  • R-DataDisambiguation;
  • R-DataEntityRecognition;
  • R-DataFusion.