Warning:
This wiki has been archived and is now read-only.
Data on the Web Life Cycle
From Data on the Web Best Practices
Data on the Web
Data from diverse domains (ex: governmental data, cultural heritage, scientific data, cross domain) available on the Web in a machine-processable format.
Data on the Web Life Cycle
- A set of tasks or activities that take place during the process of publishing and using data on the Web.
- The process may go through several iterations and may be represented using a spiral model.
Figure: An Overview of the Data on the Web Life Cycle
Data collection
- Source selection: identification of data sources that may offer relevant data (ex: relational databases, XML files, Excel documents)
Data Generation
- 1st iteration: Dataset project
- Define the schema of the target dataset (structural metadata)
- Choose standard vocabularies
- Data (ex: FOAF, DC, SKOS, Data Cube)
- Dataset (ex: DCAT, PROV, VoID, Data Quality Vocab)
- Data Catalog (ex: DCAT)
- Choose data formats (machine processable data)
- Create new vocabularies
- …
- 2nd iteration: ETL process (Extract, Transform and Load)
- Extract data from the selected data sources, transform the data according to the decisions made during the dataset project, and load the data into the target dataset
- Metadata generation
- Produce (manually or automatically) structured metadata according to the metadata standards defined during the dataset project
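The ETL iteration above can be sketched as three small functions. This is a minimal, illustrative sketch: the source data, field names, and target schema (`cityId`, `label`, `population`) are hypothetical stand-ins for whatever the dataset project actually defines, and an in-memory CSV string stands in for a real data source.

```python
import csv
import io
import json

# Hypothetical source: a CSV export from one of the selected data sources.
SOURCE_CSV = """id,name,population
1,Lisbon,545923
2,Porto,231800
"""

def extract(raw_csv):
    """Extract: read raw rows from the source (here, an in-memory CSV)."""
    return list(csv.DictReader(io.StringIO(raw_csv)))

def transform(rows):
    """Transform: rename fields and cast types to match the target
    schema decided during the dataset project (illustrative names)."""
    return [
        {
            "cityId": int(row["id"]),
            "label": row["name"],
            "population": int(row["population"]),
        }
        for row in rows
    ]

def load(records):
    """Load: serialize the records into the target dataset
    (here, a JSON string standing in for the real data store)."""
    return json.dumps(records, indent=2)

dataset = load(transform(extract(SOURCE_CSV)))
print(dataset)
```

In a real pipeline the load step would write to a triple store or database, but the extract/transform/load separation stays the same.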
Data Distribution
- 1st iteration: Plan
- URIs project: Design URIs that will persist and will continue to mean the same thing in the long term
- Choose one or more solutions for data publishing (ex: data catalogue, API, SPARQL endpoint, dataset dump, …)
- 2nd iteration: Publish
- Publish data and metadata: Make data and metadata available on the Web
- 3rd iteration: Update
- Update data: Make a new version of the dataset available on the Web
- Update metadata: Make a new version of the metadata available on the Web
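The plan/publish/update iterations above come together in the dataset's descriptive metadata. As a sketch, the following builds a minimal DCAT description as JSON-LD using plain dictionaries; the base URI, dataset slug, and field values are hypothetical examples, while the property names (`dct:title`, `dct:issued`, `dct:hasVersion`, `dcat:Dataset`) come from the DCAT and Dublin Core vocabularies.

```python
import json

# Hypothetical base URI from the URIs project: persistent,
# implementation-neutral, and under the publisher's control.
BASE = "http://data.example.org/dataset/"

def dcat_description(slug, title, version, issued):
    """Build a minimal DCAT description of a dataset as JSON-LD.
    Property names follow DCAT and Dublin Core; values are illustrative."""
    return {
        "@context": {
            "dcat": "http://www.w3.org/ns/dcat#",
            "dct": "http://purl.org/dc/terms/",
        },
        "@id": BASE + slug,          # stable URI for the dataset
        "@type": "dcat:Dataset",
        "dct:title": title,
        "dct:hasVersion": version,   # bumped when a new version is published
        "dct:issued": issued,
    }

meta = dcat_description("city-population", "City Population", "1.0", "2014-01-01")
print(json.dumps(meta, indent=2))
```

Publishing an update then means making the new data dump available and re-issuing this description with a new version value, so the metadata on the Web always matches the current dataset.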
Data usage
- Explore data: Bring important aspects of the data into focus for further analysis
- Analyze data: Develop applications, build visualizations, …
- Give feedback: Provide useful information about the dataset (ex: dataset relevance, data quality,…)
- Provide data usage descriptions
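On the consumer side, the "explore data" step can be as simple as computing summary figures to bring important aspects of the data into focus. A minimal sketch, assuming the hypothetical city-population records from earlier have been downloaded:

```python
import statistics

# Hypothetical records retrieved from the published dataset.
records = [
    {"label": "Lisbon", "population": 545923},
    {"label": "Porto", "population": 231800},
    {"label": "Braga", "population": 136885},
]

# Simple summaries that guide further analysis or visualization.
populations = [r["population"] for r in records]
summary = {
    "count": len(populations),
    "mean": statistics.mean(populations),
    "max": max(populations),
}
print(summary)
```

Findings from this kind of exploration (ex: outliers, missing values) are exactly the material for the feedback and usage descriptions mentioned above.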
Data on the Web Life Cycle and Best Practices
- Best practices may be applied during the whole process of publishing and using data on the Web.
- Best practices may be defined according to the activities performed in each one of the quadrants (or tasks).
- For each best practice, guidance on how to implement it must be provided.
- Some best practices may have more than one possible implementation.
Figure: Examples of Best Practices
- Best practices - Data collection
- Have a catalogue to describe potential data sources, i.e., data sources that could provide data to be published on the Web
- …
- Best practices - Data Generation
- Document the process of data generation
- Use standard vocabularies to describe data
- Use standard vocabularies to describe datasets and data catalogues (ex: DCAT)
- Provide stable URIs
- Provide data in machine-processable formats
- Provide metadata to describe data
- …
- Best Practices - Data Distribution
- Use standard ways to distribute data (ex: data catalogues and APIs)
- Provide details about data access
- Provide details about data licence
- Provide details about dataset provenance and quality
- Provide a schedule of dataset updates
- Keep a dataset history
- Provide ways to collect feedback from data consumers
- Announce the publication of new datasets or new versions of existing datasets
- …
- Best Practices - Data usage
- Provide feedback about datasets
- Provide descriptions about the usage of the dataset
- …