Data on the Web Life Cycle

From Data on the Web Best Practices

Data on the Web

Data from diverse domains (e.g., governmental data, cultural heritage, scientific data, cross-domain data) available on the Web in a machine-processable format.

Data on the Web Life Cycle

  • A set of tasks or activities that take place during the process of publishing and using data on the Web.
  • The process may pass through some number of iterations and may be represented using a spiral model.

Figure

An Overview of the Data on the Web Life Cycle

Data collection
  • Source selection: identification of data sources that may offer relevant data (e.g., relational databases, XML files, Excel documents)
Data Generation
  • 1st iteration: Dataset project
    • Define the schema of the target dataset (structural metadata)
    • Choose standard vocabularies
      • Data (e.g., FOAF, DC, SKOS, Data Cube)
      • Dataset (e.g., DCAT, PROV, VoID, Data Quality Vocabulary)
      • Data catalogue (e.g., DCAT)
    • Choose data formats (machine-processable data)
    • Create new vocabularies
  • 2nd iteration: ETL process (Extract, Transform and Load)
    • Extract data from the selected data sources, transform it according to the decisions made during the dataset project, and load it into the target dataset
    • Metadata generation: Produce (manually or automatically) structured metadata according to the metadata standards defined during the dataset project
Data Distribution
  • 1st iteration: Plan
    • URI design: Design URIs that will persist and will continue to mean the same thing in the long term
    • Choose one or more solutions for publishing data (e.g., data catalogue, API, SPARQL endpoint, dataset dump, …)
  • 2nd iteration: Publish
    • Publish data and metadata: Make data and metadata available on the Web
  • 3rd iteration: Update
    • Update data: Make a new version of the dataset available on the Web
    • Update metadata: Make a new version of the metadata available on the Web
Data usage
  • Explore data: Bring important aspects of the data into focus for further analysis
  • Analyze data: Develop applications, build visualizations, …
  • Give feedback: Provide useful information about the dataset (e.g., dataset relevance, data quality, …)
  • Provide data usage descriptions
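
The ETL and metadata generation steps listed under Data Generation can be illustrated with a minimal sketch. It is not a prescribed implementation: the source CSV, the target field names in TARGET_SCHEMA, and the function names are all hypothetical, and a real pipeline would read from the selected data sources and follow the vocabularies chosen during the dataset project.

```python
import csv
import io
import json

# Hypothetical source data standing in for a selected data source (a CSV export).
SOURCE_CSV = """Station Name,Reading Date,Temp (C)
Central,2015-03-01,12.5
Harbour,2015-03-01,11.0
"""

# Assumed dataset project decision: source column -> (target field, converter).
TARGET_SCHEMA = {
    "Station Name": ("station", str),
    "Reading Date": ("date", str),
    "Temp (C)": ("temperature_celsius", float),
}


def extract(text):
    """Extract: read rows from the source as dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows, schema):
    """Transform: rename and convert each field according to the target schema."""
    return [
        {target: convert(row[source]) for source, (target, convert) in schema.items()}
        for row in rows
    ]


def load(records):
    """Load: serialize the records into the target dataset (a JSON string here)."""
    return json.dumps(records, indent=2)


def describe(schema):
    """Metadata generation: derive structural metadata from the target schema."""
    return [
        {"name": target, "type": convert.__name__}
        for target, convert in schema.values()
    ]


records = transform(extract(SOURCE_CSV), TARGET_SCHEMA)
dataset_json = load(records)
structural_metadata = describe(TARGET_SCHEMA)
print(dataset_json)
```

Keeping the schema decisions in one table means the same structure drives both the transformation and the generated structural metadata, so the two cannot drift apart.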

Data on the Web Life Cycle and Best Practices

  • Best practices may be applied during the whole process of publishing and using data on the Web.
  • Best practices may be defined according to the activities performed in each one of the quadrants (or tasks).
  • For each best practice, guidance on how to implement it must be provided.
  • Some best practices may have more than one possible implementation.

Figure

Examples of Best Practices
  • Best practices - Data collection
    • Maintain a catalogue describing potential data sources, i.e., data sources that could provide data to be published on the Web
  • Best practices - Data Generation
    • Document the process of data generation
    • Use standard vocabularies to describe data
    • Use standard vocabularies to describe datasets and data catalogues (e.g., DCAT)
    • Provide stable URIs
    • Provide data in machine-processable formats
    • Provide metadata to describe data
  • Best Practices - Data Distribution
    • Use standard ways to distribute data (e.g., data catalogues and APIs)
    • Provide details about data access
    • Provide details about the data licence
    • Provide details about dataset provenance and quality
    • Provide a schedule of dataset updates
    • Keep a dataset history
    • Provide ways to collect feedback from data consumers
    • Announce the publication of new datasets or new versions of existing datasets
  • Best Practices - Data usage
    • Provide feedback about datasets
    • Provide descriptions about the usage of the dataset
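
Several of the practices above (describing datasets with a standard vocabulary, giving access and licence details, and stating an update schedule) can be combined in one machine-readable description. The following is a sketch only, using DCAT and Dublin Core terms in a JSON-LD layout. The example.org URIs, the title, the dates, and the describe_dataset helper are hypothetical, and the frequency URI is assumed to come from the Dublin Core collection description frequency vocabulary.

```python
import json

# Prefixes for the DCAT and Dublin Core Terms vocabularies.
CONTEXT = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
}


def describe_dataset(uri, title, licence, frequency, issued, modified,
                     download_url, media_type):
    """Build a DCAT-style dataset description as a JSON-LD dictionary."""
    return {
        "@context": CONTEXT,
        "@id": uri,                                    # stable URI of the dataset
        "@type": "dcat:Dataset",
        "dct:title": title,
        "dct:license": {"@id": licence},               # licence details
        "dct:accrualPeriodicity": {"@id": frequency},  # update schedule
        "dct:issued": issued,                          # first publication date
        "dct:modified": modified,                      # latest update date
        "dcat:distribution": {
            "@type": "dcat:Distribution",
            "dcat:downloadURL": {"@id": download_url}, # data access details
            "dcat:mediaType": media_type,
        },
    }


# Hypothetical values for illustration only.
description = describe_dataset(
    uri="http://example.org/dataset/temperature-readings",
    title="Temperature readings",
    licence="http://creativecommons.org/licenses/by/4.0/",
    frequency="http://purl.org/cld/freq/monthly",
    issued="2015-01-15",
    modified="2015-03-15",
    download_url="http://example.org/dataset/temperature-readings.csv",
    media_type="text/csv",
)
print(json.dumps(description, indent=2))
```

A description like this could be published alongside the data or registered in a data catalogue. Provenance and quality information would need additional vocabularies (e.g., PROV), which this sketch leaves out.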