Warning:
This wiki has been archived and is now read-only.

Comments to be considered before publishing the last working draft

From Data on the Web Best Practices
No. Subject Comment Author Comment or Proposal Resolution and Implementation
1 General issues Possible approaches to implementation should not include the word "should". That implies normativeness. This is a general issue with implementation sections. We say in the Audience section that "The normative element of each best practice is the intended outcome." Annette Berna's proposal: To remove the word "should" from the intended outcome sections. To remove the sentence "The normative element of each best practice is the intended outcome" from the BP template and the Audience section. Resolved: "Should" was replaced by "can". There are some cases that still need a review. Data Access Section.

Implementation: https://github.com/w3c/dwbp/commit/c9c5cb7a188e5858fe53ca60ff4363cf0a294851

2 General issues Subtitles should all be written in the same mode. (Mine were written in imperative -- "do this, don't do that", but most are declarative -- "this should be done".) I think imperative is better, because it gets away from RFC2119 keywords, which we voted not to use. It becomes a call to action, which is our goal, right? Annette This needs to be discussed by the group. (see also: https://docs.google.com/spreadsheets/d/1eSTt3A6kTfXYTcVMt5VGDardLIk8b7FnsgENpCuNRBA/edit?usp=sharing) Resolved: Phil and Annette will rewrite the subtitles.

Implementation: https://github.com/w3c/dwbp/commit/047e2f78443fd917aee356eb2b405f6086efd732

3 provide metadata The intended outcome is "Human-readable metadata will enable humans to understand the metadata and machine-readable metadata will enable computer applications, notably user agents, to process the metadata." This is tautological. Metadata is necessary because, without it, the data will have no context or meaning. Annette Phil suggested "I'd write the intended outcome as simply: Humans and machines are able to understand the data." [Re: partial review] Resolved BP01: to keep Phil’s suggestion for the intended outcome.

Implementation: https://github.com/w3c/dwbp/commit/000593604a95bcb2d57ca15040e9fd9281edcc9b

4 provide metadata Also, I disagree that "If multiple formats are published separately, they should be served from the same URL using content negotiation." publishing multiple files is also reasonable, and it's even what we used in all our examples about metadata. (in BP2, the machine readable example gives the name of the distribution as bus-stops-2015-05-05.csv; in BP4, the entire URI is given, ending in .csv, etc.) Annette Phil's comment:

I think BP21 (#conneg) gets it right. You assign a URI to the dataset and use conneg to return whatever is the most appropriate version. However, you *also* provide direct URIs for each version, that bypass the conneg. (...) [Re: partial review]

Resolved: BP01: to rewrite "If multiple formats are published separately, they should be served from the same URL using content negotiation." New sentence:

If multiple formats are published separately, they should be served from the same URL using content negotiation and made available under separate URIs, distinguished by filename extension.

Implementation: https://github.com/w3c/dwbp/commit/b9f17696be1a39effdfecd35cf1470b1b3847e4e
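The resolved wording combines two access routes: one URL that uses content negotiation, plus separate URIs distinguished by filename extension. A minimal Python sketch of that behaviour follows. The paths, the set of formats, the CSV default, and the function names are assumptions made for illustration only; they are not taken from the BP document, and the Accept parsing is deliberately simplified (no wildcard subtypes).

```python
# Hypothetical mapping of filename extensions to media types (assumption).
EXTENSIONS = {".csv": "text/csv", ".json": "application/json", ".ttl": "text/turtle"}
AVAILABLE = set(EXTENSIONS.values())
DEFAULT_TYPE = "text/csv"  # assumed fallback when the client accepts anything


def parse_accept(header):
    """Return the media types of an Accept header, highest q-value first."""
    prefs = []
    for part in header.split(","):
        fields = part.strip().split(";")
        media_type = fields[0].strip()
        if not media_type:
            continue
        q = 1.0
        for param in fields[1:]:
            name, _, value = param.strip().partition("=")
            if name.strip() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        prefs.append((q, media_type))
    # sorted() is stable, so equal q-values keep the order the client sent.
    prefs = sorted(prefs, key=lambda pref: pref[0], reverse=True)
    return [media_type for q, media_type in prefs if q > 0]


def resolve(path, accept="*/*"):
    """Pick the media type to serve, or None when nothing acceptable exists."""
    # A direct URI with a filename extension bypasses content negotiation.
    for extension, media_type in EXTENSIONS.items():
        if path.endswith(extension):
            return media_type
    # The extension-less URI is negotiated against the Accept header.
    for media_type in parse_accept(accept):
        if media_type in AVAILABLE:
            return media_type
        if media_type == "*/*":
            return DEFAULT_TYPE
    return None  # a real server would answer 406 Not Acceptable here
```

With this sketch, `resolve("/dataset/bus-stops", "text/turtle, text/csv;q=0.5")` negotiates Turtle, while `resolve("/dataset/bus-stops.csv", "text/turtle")` returns CSV because the extension takes precedence.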

5 provide metadata There is an inconsistency between the suggestion that one should use content negotiation for different formats (csv vs. rdf) and the .:mobility and :themes are referred to as URIs, but they are not URIs. (I know DCAT did this, but I think it's a mistake; colons are not legal in the first segment of a relative URI.) Annette Phil's comment "I would word the intended outcome as: Humans and machines can discover the dataset; humans can understand the nature of the data." and there are more comments about this on the message [Re: partial review]

Editor's comment: Phil to slightly reword that section and take out the colons as they refer specifically to Turtle representation (http://w3c.github.io/dwbp/bp.html#DescriptiveMetadata).

Resolved: BP01: to rewrite "If multiple formats are published separately, they should be served from the same URL using content negotiation." New sentence:

If multiple formats are published separately, they should be served from the same URL using content negotiation and made available under separate URIs, distinguished by filename extension.

Implementation: https://github.com/w3c/dwbp/commit/b9f17696be1a39effdfecd35cf1470b1b3847e4e

6 locale parameters The human-readable example for the first three BPs is exactly the same. Can we make the examples more specific (maybe include them in the doc rather than link to one big external example)? The ttl in the machine-readable example could be trimmed to just the bold parts. Annette Berna's proposal: The doc is very long already. Instead of splitting the example, maybe we can link to specific parts of the page according to the BP.

Phil's comment: +1. All the data is in the HTML and TTL files, just highlight the relevant bits by including those and those only in the main doc. Incidentally, I expect to set up conneg between those two files, yes? [Re: partial review]

Resolved: include the relevant parts of the html example for each BP in the document itself. Newton will also see other ways to do this.

Implementation: https://github.com/w3c/dwbp/commit/52d136067f63a02b6a9dad24478a0fe237362abd

7 locale parameters I think the Why section is unnecessarily repetitive. A textual example might also clarify things a little. I suggest:

Providing <a href="#locale_parameter">locale</a> parameters helps humans and computer applications to work accurately with things like dates, currencies and numbers that may look similar but have different meanings in different locales. For example, the 'date' 4/7 can be read as 7th of April or the 4th of July depending on where the data was created. Similarly €2,000 is either two thousand Euros or an over-precise representation of two Euros. Making the locale and language explicit allows users to determine how readily they can work with the data and may enable automated translation services.

My wording for the intended outcome:

To enable humans and software agents accurately to interpret the meaning of strings representing dates, times, currencies and numbers etc.

Phil (answering on Annette's message [Re: partial review]) Implementation: https://github.com/w3c/dwbp/commit/ca49005021c7ffc36d7e90657cdad82303ee31b1
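The ambiguity described in the proposed Why text can be made concrete with a small Python sketch. The function names and the explicit `day_first`, `decimal_sep` and `group_sep` parameters are assumptions introduced here to stand in for locale metadata; they do not come from the BP document.

```python
from datetime import date
from decimal import Decimal


def interpret_date(text, year, day_first):
    """Read a 'd/m' or 'm/d' string; the field order must come from the locale."""
    first, second = (int(part) for part in text.split("/"))
    day, month = (first, second) if day_first else (second, first)
    return date(year, month, day)


def interpret_amount(text, decimal_sep, group_sep):
    """Read a number whose separators depend on the locale."""
    cleaned = text.replace(group_sep, "").replace(decimal_sep, ".")
    return Decimal(cleaned)


# The same string yields two different dates depending on the declared order.
fourth_of_july = interpret_date("4/7", 2015, day_first=True)
seventh_of_april = interpret_date("4/7", 2015, day_first=False)

# "2,000" is two thousand with a comma as grouping separator,
# and two (written to three decimal places) with a comma as decimal separator.
two_thousand = interpret_amount("2,000", decimal_sep=".", group_sep=",")
two = interpret_amount("2,000", decimal_sep=",", group_sep=".")
```

Without the locale, a consumer has no way to choose between the two readings, which is the point of the best practice.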
8 Licenses We say "the license of a dataset can be specified within the data". I think we mean within the *metadata*. Annette Phil's comment: +1 Suggested rewording:

The presence of license information is essential for data consumers to assess the usability of data. User agents may use the presence/absence of license information as a trigger for inclusion or exclusion of data presented to a potential consumer.

Resolved: BP was updated according to Annette's and Phil's proposals

Implementation: https://github.com/w3c/dwbp/commit/25f6ce052098a61fa1a6e8c17b998e37b512adf9

9 Provenance The "Why" is pretty sparse and essentially says the same thing as the intended outcome. I think we could make it stronger. "Provenance is one means by which consumers of a dataset judge its quality. Understanding its origin and history helps one determine whether to trust the data and provides important interpretive context." Annette Phil's comment: +1.

My suggested wording for the intended outcome is: To enable humans to know the origin or history of the dataset and to enable software agents to automatically process provenance information.

Resolved: BP was updated according to Phil's and Annette's comments

Implementation: https://github.com/w3c/dwbp/commit/11dc5ed7aa9a23b287b5037d3cc731e37b2a9e12

10 Provenance The example links to the metadata example page. It would be more helpful to put the provenance-specific info into the BP doc itself. Annette Berna's proposal: to keep the example as an external page. If we present just parts of the human-readable example it will be out of context. Resolved: include the relevant parts of the html example for each BP in the document itself. Newton will also see other ways to do this.

Implementation: https://github.com/w3c/dwbp/commit/52d136067f63a02b6a9dad24478a0fe237362abd

11 Quality We say "Data quality information will enable humans to know the quality of the dataset and its distributions, and software agents to automatically process quality information about the dataset and its distributions." That's rather tautological. We could say something about enabling humans to determine whether the dataset is suitable for their purposes. Annette Phil's comment: Annette and I are in agreement here. I'd phrase the intended outcome as:

To enable people and software to assess the quality and therefore suitability of a dataset for their application.

Resolved: BP was updated according to Phil's comment

Implementation: https://github.com/w3c/dwbp/commit/8610970ef47d1d7a983763797a9703d6b6053087

12 Quality We probably should refer to DQV as a finished thing, as it will be soon. The human-readable example links to the metadata one. Annette Berna's proposal: to include DQV as a finished document and fix the human-readable example. Phil's comment: +1. I suggest:

The machine readable version of the dataset quality metadata may be provided using the Data Quality Vocabulary developed by the DWBP working group VOCAB-DQV.

Resolved: include the relevant parts of the html example for each BP in the document itself. Newton will also see other ways to do this.

Implementation: https://github.com/w3c/dwbp/commit/52d136067f63a02b6a9dad24478a0fe237362abd


Resolved: BP was updated according to Phil's proposal Implementation: https://github.com/w3c/dwbp/commit/4991acf0eec2462d617330d17619058c00eab0f3

13 Versioning Of the four implementation bullets, only the last is really a possible approach. The first three belong in the intended outcome. Annette Editors' question: Why do the first three belong in the intended outcome? If they are intended outcomes, then the whole intended outcome section needs to be rewritten. In this case, would you like to make a proposal?

Phil's comment: Unusually, I disagree with Annette here. For me, intended outcomes are short "this is what will be possible." The implementation steps are how you make it so, which I think you have in this case.

Resolved: BP8 won’t change, just the subtitle. Subtitle should be more explicitly about what "has to be done".

Implementation: https://github.com/w3c/dwbp/commit/047e2f78443fd917aee356eb2b405f6086efd732

14 Versioning The human-readable example links to the metadata one. The version history there lists only 1.1, which is illogical. (1.0 must exist at least.) Annette Berna's proposal: to fix the link and the example page. Resolved: The example will be updated to be more detailed and part of the human-readable example will be included in the doc.
15 Version history The human-readable example links to the metadata one. The version history there lists only 1.1, which is illogical. (1.0 must exist at least.) This example doesn't meet the requirements of the BP. Neither the ttl version nor the Memento example provides a full version history, only a list of versions released. This BP is intended to be about providing the details of what changed. Annette In the machine-readable example of this BP there is a property rdfs:comment to show how the dataset was updated. If this is not enough, could you please tell us what else we should present?

Resolved: BP8 won’t change, just the subtitle. Subtitle should be more explicitly about what "has to be done". Implementation: https://github.com/w3c/dwbp/commit/047e2f78443fd917aee356eb2b405f6086efd732

Resolved: include the relevant parts of the html example for each BP in the document itself. Newton will also see other ways to do this. Implementation: https://github.com/w3c/dwbp/commit/52d136067f63a02b6a9dad24478a0fe237362abd

16 Identifiers Intro item 5 refers to an API which could be confusing, since we talk about APIs as web APIs elsewhere. Annette Phil's proposal: De-referencing a URI triggers a computer program to run on a server that may do something as simple as return a single, static file, or it may carry out complex processing. Precisely what processing is carried out, i.e. the software on the server, is completely independent of the URI itself. Resolved: To update the introduction of Identifiers Section according to Phil's proposal.

"De-referencing a URI triggers a computer program to run on a server that may do something as simple as return a single, static file, or it may carry out complex processing. Precisely what processing is carried out, i.e. the software on the server, is completely independent of the URI itself. "

Implementation: https://github.com/w3c/dwbp/commit/819d08a479682d5399b1bc89c69949a0631c6223

17 Persistent URIs as identifiers
  1. We say "This requires a different mindset to that used when creating a Web site designed for humans to navigate their way through." When creating a web site for humans to navigate, one should also consider persistence, so that sentence is not strictly accurate.
  2. The example uses the city domain instead of the transport agency's domain, which is not realistic for a large city. The agency domain is likely to persist as long as the information it makes available is relevant. Try Googling "transit agency" and see what comes up for domain names. The issue depends on how stable the transit service is. For a small town, the transit function might not be given over to a separate agency, and the guidance would be right, but for a big city, where the transit function is run by an independent agency, it's not realistic.
  3. We say "Ideally, the relevant Web site includes a description of the process..." I think we mean a controlled scheme.
Annette Phil's proposal item 1: delete that sentence so it's just: "To be persistent, URIs must be designed as such. A lot has been written on this topic, see, for example, the European Commission's Study on Persistent URIs [PURI] which in turn links to many other resources."

Proposal (item 2): Annette agreed to keep it as it is now.

Proposal (item 3): How to test section was updated.

Resolved: BP was updated according to proposals.

Implementation: https://github.com/w3c/dwbp/pull/373/commits/1941fc3fe7e360169b584e98235e0d2293065fdb

18 Persistent URIs within datasets The word "affordances" is misused. Affordances are how we know what something is intended to do, not what the thing does. Affordances do not act on things, they inform. Annette Phil's proposal: "These ideas are at the heart of the 5 Stars of Linked Data where one data point links to another, and of Hypermedia where links may be to further data or to services that can act on or relate to the data in some way." Resolved: To change the sentence according to Phil's proposal. "These ideas are at the heart of the 5 Stars of Linked Data where one data point links to another, and of Hypermedia where links may be to further data or to services that can act on or relate to the data in some way."

Implementation: https://github.com/w3c/dwbp/commit/819d08a479682d5399b1bc89c69949a0631c6223

19 Persistent URIs within datasets The intended outcome should be a free-standing piece of text. Starting with "that one item" is confusing. Annette Phil's proposal: to rewrite the sentence as follows: "One data item can be related to others across the Web, creating a global information space accessible to humans and machines alike." Resolved: to update the intended outcome according to Phil's proposal: "One data item can be related to others across the Web, creating a global information space accessible to humans and machines alike."

Implementation: https://github.com/w3c/dwbp/commit/e03313c72ef913aceb40d84ca3f57437f1fe01cb

20 Persistent URIs within datasets Much of the implementation section is about minting new URIs, which is the subject of the previous BP. It is off topic here. Everything from "If you can't find an existing set of identifiers that meet your needs, you'll need to create your own" down to the end of the example doesn't belong in a BP that is about using other people's identifiers. Annette Ask Phil to review Resolved: Approach to implementation didn't change. How to test section was modified as follows:

"Check that within the dataset, references to things that don't change or that change slowly, such as countries, regions, organizations and people, are referred to by URIs or by short identifiers that can be appended to a URI stub. Ideally the URIs should resolve, however, they have value as globally scoped variables whether they resolve or not."

21 Persistent URIs within datasets The last paragraph of the example is almost exactly the same as the last paragraph before the example. Annette Phil's comment: "Correct. I have deleted it in my native speaker review copy." Resolved and Implemented.
22 URIs for versions and series
  1. This BP is confusing two issues. One is the use of a shorter URI for the latest version of a dataset while also assigning a version-specific URI for it. The other issue is making a landing page for a collection of datasets. The initial intent was the former. I don't think this applies to time series. What we're talking about here is use of dates for version identifiers. The example is incomplete; it doesn't say what the latest version URI would be.
  2. The examples in the Why aren't series or groups except for the first item, yet they are introduced as examples of series or groups.
  3. How to Test says to check "that logical groups of datasets are also identifiable." That is vague. It should say "that a URI is also provided for the latest version or most recent real-time value."
Annette Phil's proposal (item 1): to change the example as follows:

Suppose that a new bus stop is created. To keep bus-stops-2015-05-05 up to date, a new version of the dataset (bus-stops-2015-12-17) is created. bus-stops-2015-12-17 includes all the data from bus-stops-2015-05-05 plus the data about the new bus stop. The two versions can be identified by the following URIs:

http://data.mycity.example.com/public-transport/road/bus/dataset/bus-stops-2015-05-05 is the versioned URI of the first version of the dataset

http://data.mycity.example.com/public-transport/road/bus/dataset/bus-stops-2015-12-17 is the versioned URI of the updated version of the dataset

http://data.mycity.example.com/public-transport/road/bus/dataset/bus-stops always resolves to the latest version, so it resolved to bus-stops-2015-05-05 until 17 December 2015, when the server configuration was updated to point that URL to bus-stops-2015-12-17.
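The relationship between the versioned URIs and the "latest version" URI in this example can be sketched in Python. The URIs and dates are those given in the example; the list of versions and the function names are assumptions for illustration, and a real deployment would more likely handle this in server configuration, as the example text says.

```python
from datetime import date

BASE = "http://data.mycity.example.com/public-transport/road/bus/dataset/"
LATEST_URI = BASE + "bus-stops"

# Release dates of the two versions named in the example.
VERSIONS = [date(2015, 5, 5), date(2015, 12, 17)]


def versioned_uri(released):
    """Build the version-specific URI from a release date."""
    return f"{BASE}bus-stops-{released.isoformat()}"


def resolve_latest(on):
    """Return the versioned URI that the 'latest' URI points to on a given day."""
    released = [version for version in VERSIONS if version <= on]
    if not released:
        return None  # nothing had been published yet
    return versioned_uri(max(released))
```

Under these assumptions, `resolve_latest(date(2015, 12, 16))` still yields the bus-stops-2015-05-05 URI, and from 17 December 2015 onward it yields the bus-stops-2015-12-17 URI.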

Phil's proposal (item 2): True, I offer this as a better alternative:

  • bus stops in my city (that change over time);
  • a list of elected officials in My City;
  • evolving versions of a document through to completion.
I suggest this sentence "In different circumstances, it will be appropriate to refer separately to each of these examples (and many like them)." is replaced with

In different circumstances, it will be appropriate to refer to the current situation (the current set of bus stops, the current elected officials etc.). In others, it may be appropriate to refer to the situation as it exists/existed at a specific time.

Annette proposes to use just existed rather than exists/existed.

Phil's proposal (item 3): Rewrite How to test as follows: "Check that each version of a dataset has its own URI, and that there is also a 'latest version' URI."

Resolved Item 1: to update the BP according to Phil's proposal Resolved Item 2: to update the BP according to Phil's proposal Resolved Item 3: to update the BP according to Phil's proposal

Implementation: https://github.com/w3c/dwbp/commit/4a2fd830baaa3d9780f0614ff1e30548246304d1

23 Introduction First paragraph, some examples have no clear relationship to the web; "this phenomenon" has no clear antecedent.

Needs a careful native-speaker edit.

Annette Editors asked Annette to be more specific about the examples. Resolved: Phil updated introduction.

Implementation: https://github.com/w3c/dwbp/commit/c1b89c386a0540b0b833af332175839a17d699be

24 Audience Remove "such as CSV, JSON and RDF." They are too specific; don't use examples here. Annette Editors' question: Why do you think that we shouldn't use examples? I'm OK with removing the examples, but I'd like to understand the reason for this. Implementation: "such as CSV, JSON and RDF." was removed from the Audience.
25 Context The word "mainly" needs to be removed here: "The DWBP document is mainly interested on the Identification principle that says that URIs should be used to identify resources." "Mainly" means that it is more important than other considerations, which isn't true and probably isn't what was meant. Annette Implementation: Phil made slight changes so that the sentence now reads:

"An important aspect of publishing and sharing data on the Web concerns the architectural basis of the Web WEBARCH. An important aspect of this is the identification principle that says that URIs should be used to identify resources."

26 Context I disagree with the statement that "multiple Dataset Access mechanisms should be available." Annette Editors' question: Could you please explain why you disagree with this statement? Maybe this is also a rewriting issue. Resolved: rewrite the sentence. New sentence: "multiple Dataset Access mechanisms can be available"

Implementation: https://github.com/w3c/dwbp/commit/19205e4f3fa92c695f660afde09b9c82a6d42dfb

27 Context The diagram is still confusing for me. I can't tell what it is trying to say. What is the relationship between the blue dataset and the green, yellow, and orange rectangles supposed to be showing? Why does a blue box refer to a dataset and then to distributions? Why is the grouping within blue boxes different after the arrow? What does the dotted line represent? What does the arrow represent?

This section wanders between discussion of basic definitions and an incomplete enumeration of the best practices themselves. It needs to be rewritten so that it has a clear purpose and adheres to it.

Annette Implementation: https://github.com/w3c/dwbp/commit/9e6ad2afea3379cc4bde17ba833cf4c00c680b61
28 Basic Example It should be about more modes of transit than just buses. We have some examples that use multiple modes. Annette Instead of changing the example description, the examples that mention multiple modes could be rewritten. If we mention multiple modes in the example description we might create big expectations on the public (just a few BP examples really consider this aspect). Resolved: to be more general in the example.

Implementation: https://github.com/w3c/dwbp/commit/2426e4285c3154e8fcc902ee99063a1a1185ea24

29 machine-readable standardized data formats There is no definition of 'machine readable', or of proprietary software. "computational tools typically available in the relevant domain" will surely include .docx and .xlsx, for example.

I looked at the Wikipedia page, which links to a doc from the US government: https://en.wikipedia.org/wiki/Machine-readable_data. From that I suggest the following:

Paragraph 1: There is an important distinction between formats that can be read and edited by humans using a computer and formats that are machine readable. The latter term implies that the data is readily extracted, transformed and processed by a computer. The following definition of machine readable is based on that provided by the US Office of Management and Budget's definition in their Preparation and Submission of Strategic Plans, Annual Performance Plans, and Annual Program Performance Reports OMB-A11

Paragraph 2: Machine readable: A format in a standard computer language (not natural language text) that can be read automatically by a computer system. Traditional word processing documents and portable document format (PDF) files are easily read by humans but typically are difficult for machines to interpret. Formats such as XML, JSON, NetCDF, RDF or spreadsheets with header columns that can be exported as CSV are machine readable formats.

Phil to include the first paragraph in the Why section of the BP and the second one in the glossary.

Annette's proposal: "machine-readable" is used differently here than in the metadata section. Technically, nothing on the web is not machine-readable. I think we could remove that phrase.

Resolved: use the following definition in the glossary

"Machine-readable data: Data in a standard format that can be read and processed automatically by a computing system. Traditional word processing documents and portable document format (PDF) files are easily read by humans but typically are difficult for machines to interpret and manipulate. Formats such as XML, JSON, HDF5, RDF and CSV are machine-readable data formats." adapted from [include the link proposed by Phil] Implementation: https://github.com/w3c/dwbp/commit/683492861826b8688de0401bc08e69124d42fbcc Implementation: https://github.com/w3c/dwbp/commit/cc779f06892a7d58ab91d682429868e994001911

30 machine-readable standardized data formats In the 'Why' para, consider adding 'open', 'well documented', 'RAND', etc to 'non-proprietary'. Chris Little We removed the sentence "The use of non-proprietary data formats should also be considered since it increases the possibilities for use and reuse of data". The focus of the BP is about machine-readable standardized data formats rather than recommending data formats with specific characteristics.

Implementation: https://github.com/w3c/dwbp/commit/fc884022a635c9d4051466258e062112df256001

31 Multiple formats Suggest that the intended outcome could be worded along the lines of:

"As many users as possible will be able to use the data without first having to transform it into their preferred format."

I have many similar comments on intended outcomes. I think they should be statements of the specific benefit that is gained, so "to enable X" rather than "Doing X will enable Y."

Phil to review BP considering Phil's proposal Resolved: BP was updated according to Phil's proposal

Implementation: https://github.com/w3c/dwbp/commit/9712ddfaa5de2149d4d2cdc17836c3de46fbac1c

32 Multiple formats I very much dislike the word 'intended' in the sentence: "Consider the data formats most likely to be needed by intended users, and consider alternatives that are likely to be useful in the future." The idea of making data available on the Web is that it's up to the user to decide what he/she intends to do with it, not the publisher.

Suggest simply making it "Consider the data formats most likely to be needed and consider alternatives that are likely to be useful in the future."

Phil Update approach to implementation to include: "Consider the data formats most likely to be needed and consider alternatives that are likely to be useful in the future." Resolved: BP was updated according to Phil's proposal

Implementation: https://github.com/w3c/dwbp/commit/9712ddfaa5de2149d4d2cdc17836c3de46fbac1c

33 Standardized terms Suggest rewording the intended outcome Phil New intended outcome: Enhanced interoperability and consensus among data publishers and consumers. (Ask for Antoine's feedback)

Resolved: Antoine will merge BP Use Standardized Terms and BP Reuse Vocabularies. Implementation: https://github.com/w3c/dwbp/commit/80451932600dfc4e26f96539753f3d3e9a919224

34 Reuse vocabularies Again, the intended outcome could be worded more succinctly I think. Phil follow Phil's proposal: To make datasets and metadata easier to compare and integrate by humans or machines. (I added 'and integrate', which I personally think is important but this is more than an editorial change). Resolved: Antoine will merge BP Use Standardized Terms and BP Reuse Vocabularies.

Implementation: https://github.com/w3c/dwbp/commit/80451932600dfc4e26f96539753f3d3e9a919224

35 Reuse vocabularies Please also clarify 'vocabularies' versus 'code lists', as code lists are used in BP16. The list in the second para. does not explain the generally agreed distinctions. Chris Little Resolved: Antoine will merge BP Use Standardized Terms and BP Reuse Vocabularies.

Implementation: https://github.com/w3c/dwbp/commit/80451932600dfc4e26f96539753f3d3e9a919224

36 Right formalization level I would word the intended outcome as:

The data supports a wide range of application cases but is not more complex to produce and reuse than necessary, or, to paraphrase Albert Einstein, "Everything should be made as simple as possible, but no simpler."

The Einstein line is often quoted but, like so many quotations, is probably a misquote.

And I'd say that the how to test line would be improved by using the word 'typical' rather than target:

For formal knowledge representation languages, applying an inference engine on top of the data that uses a given vocabulary does not produce too many statements that are unnecessary for typical applications.

Phil Antoine's suggestion: Higher levels of formalization make vocabularies and the data that uses them more difficult to produce and re-use. The data should support all application cases but should not be more complex to produce and reuse than necessary.

Resolved: Phil will rewrite the BP (How to test and Examples). Intended outcome: Higher levels of formalization make vocabularies and the data that uses them more difficult to produce and re-use. The data should support all application cases but should not be more complex to produce and reuse than necessary.

Implementation: https://github.com/w3c/dwbp/commit/a1887532129f16b983835afe7f84fcd9bf062c5c

37 Sensitive data Suggest rewording the intended outcome Phil follow Phil's proposal: "To enable data consumers to know that data that is referred to from the current dataset is unavailable or only available under different conditions." Resolved: BP was updated according to Phil's proposal

Implementation: https://github.com/w3c/dwbp/commit/8373f4da3d6032bd65973ec22f00156b52c79e03

38 Sensitive data Regarding Best Practice 18: Provide data unavailability reference. "data unavailability reference" is awkward and unclear. Annette Annette's proposal: Could we say "Provide an explanation for data that is not available"? Resolved: BP was updated according to Annette's proposal

Implementation: https://github.com/w3c/dwbp/commit/8373f4da3d6032bd65973ec22f00156b52c79e03

39 Sensitive data Best Practice 18: Provide data unavailability reference. The test should address machine-readability, saying that a legitimate HTTP response code in the 400 or 500 range should be returned. Annette Annette's proposal to How to test:

Where the dataset includes references to data that is no longer available or is not available to all users, check that an explanation of what is missing and instructions for obtaining access (if possible) are given. Check if a legitimate http response code in the 400 or 500 range is returned when trying to get unavailable data.

Resolved: to follow Annette's proposal.

Implementation: https://github.com/w3c/dwbp/commit/6fa8b6bce48b87a44b42b986584d6676e6dcb00e
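The two checks in the proposed How to test text, a status code in the 400 or 500 range and an accompanying explanation, could be automated along these lines. This is a hedged Python sketch: the function names and the idea of passing the explanation as a plain string are assumptions made here, and the BP does not prescribe any particular test harness.

```python
from http import HTTPStatus


def signals_unavailability(status_code):
    """True when the status code falls in the 400 or 500 range."""
    return 400 <= status_code <= 599


def check_unavailable_response(status_code, explanation):
    """Return a list of problems found in a response for unavailable data."""
    problems = []
    if not signals_unavailability(status_code):
        problems.append(f"status {int(status_code)} is outside the 400-599 range")
    if not explanation.strip():
        problems.append("no explanation of what is missing was given")
    return problems


# A withdrawn resource answered with 410 Gone and a short explanation passes.
ok = check_unavailable_response(HTTPStatus.GONE, "Dataset withdrawn; contact the publisher.")

# Answering 200 OK with an empty body fails both checks.
bad = check_unavailable_response(HTTPStatus.OK, "")
```

Which specific code is appropriate (for example 403, 404, 410 or 503) depends on why the data is unavailable; the sketch only tests the range named in the proposal.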

40 Bulk Access I don't think this should only refer to cases where data is spread across multiple locations. I think it should also cover the simple case of making a file available, as opposed to only providing an API. This is in addition to, not instead of what is written about multiple locations - which I think is very good.

I'd phrase the intended outcome as: "Bulk download enables developers to access the complete dataset for local processing without the need for further calls to the Web."

Phil I propose to complement the Why section to include "the simple case of making a file available".

Intended outcome: To enable consumers to access the complete dataset for local processing with a single request.

Resolved: Update BP Bulk Access as proposed below:

Intended outcome: To enable consumers to access the complete dataset for local processing with a single request.

Implementation: https://github.com/w3c/dwbp/commit/b373bef216bc0da7be4418d64ca17c06d4a6186b

41 Subsets The intended outcome section is too long IMO. All the content is valid, I just think some of it could be moved to the Why section.

Really not sure about including an example of making a set of PDFs available.

Phil Ask Annette's feedback Resolved: BP updated based on Annette's proposal.

Implementation: https://github.com/w3c/dwbp/commit/de513905142b813d3a7c5ae5a8d5305942136b7c

42 Conneg In tidying up the language of this BP I pretty much rewrote it. I hope without changing your meaning significantly.

I suggest the intended outcome could be phrased as: "To enable different representations of the same resource to be served from the same URI according to the request made by the client."

Phil Resolved and Implemented.
43 Access Real Time Rewrite intended outcome Phil follow Phil's proposal: "To enable applications to access time-critical data in real time or near real time, where real-time means a range from milliseconds to a few seconds after the data creation, and near real time is a predetermined delay for expected data delivery."

Resolved: BP22: to include a definition for near real time in the glossary (from wikipedia) and create a link in the outcome. Change the subtitle to use released instead of produced. Implementation: https://github.com/w3c/dwbp/commit/8168fc7a34be0fa665a97cb1d9f1f325f837e0c8 Implementation: https://github.com/w3c/dwbp/commit/a255b291331a98057033bd4abf0d2ca6212deebf

44 Access Up to Date I think this sentence: "The international date format is recommended to avoid any ambiguity <a href="https://www.w3.org/International/questions/qa-date-format">https://www.w3.org/International/questions/qa-date-format</a>."

Would be better as:

"Datestamps should be formatted using the XML Schema <a href="/TR/xmlschema11-2/#dateTimeStamp">dateTimeStamp</a> datatype xmlschema11-2."

Although I note that the NOAA example uses the horrible "Mar, 3rd 2016 at 9:03:07 pm PST" format which breaks this advice :-(

Phil Berna's proposal: to rewrite this BP Resolved: BP23: Annette and Bernadette will rewrite BP23 according to Phil's and Annette's suggestion.

Implementation: https://github.com/w3c/dwbp/commit/43ba83a77ecbcdd03d2e9066bedd5f7df482c2e4
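
To illustrate the difference Phil points out, here is a small sketch that writes the instant from the NOAA example in a form with an explicit time zone offset, as the dateTimeStamp datatype requires. It assumes that "PST" means a fixed offset of UTC-8; the variable names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Assumption: PST is treated as a fixed UTC-8 offset (no daylight saving).
PST = timezone(timedelta(hours=-8), "PST")

# The instant written in the NOAA example as "Mar, 3rd 2016 at 9:03:07 pm PST".
stamp = datetime(2016, 3, 3, 21, 3, 7, tzinfo=PST)

# isoformat() on a timezone-aware datetime includes the offset.
formatted = stamp.isoformat()
print(formatted)  # 2016-03-03T21:03:07-08:00
```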

45 document your API Rewrite intended outcome Phil follow Phil's proposal: "Developers can obtain detailed information about each call to the API, including the parameters it takes and what it is expected to return." Implementation: http://w3c.github.io/dwbp/bp.html#documentYourAPI
46 document your API This is very spatial, ideally we should have some non-spatial examples as well. I can tell this came from Linda and Jeremy et al :-) Phil Implementation: https://github.com/w3c/dwbp/commit/5879a077989310a409145bf476af57ff0a121342
47 Assess dataset coverage Rewrite intended outcome Phil follow Phil's proposal: "To enable data consumers to appreciate the coverage and external dependencies of a given dataset." Resolved: Phil will rewrite Data Preservation BPs

Implementation: https://github.com/w3c/dwbp/commit/a39037008231c32a6addb15c60dcac7638cd1560

48 Use a trusted serialization format Rewrite intended outcome Phil Phil's proposal: "To enable machines to process a dataset even if the original software that was used to create it is no longer available or supported." Resolved: Phil will rewrite Data Preservation BPs

Implementation: https://github.com/w3c/dwbp/commit/a39037008231c32a6addb15c60dcac7638cd1560

49 Provide structural metadata I think the why section could be stronger:

Providing information about the internal structure of a distribution is essential for others wishing to explore or query the dataset. It also helps people to understand the meaning of the data.

My intended outcome wording:

To enable humans to interpret the schema of a dataset and software agents to automatically process distributions.

NB, I removed the 2nd instance of the word schema in that sentence which I think was a mistake? [Re: partial review]

Phil Resolved: BP was updated according to Phil's proposal

Implementation: https://github.com/w3c/dwbp/commit/19aaecd5567f2c7280fc27ff560a6e29c4383d7d

50 Provide structural metadata Possible Approach how about adding a link to the RDF Data Cube. Chris Little Phil's proposal: I leave that to the editors although I'd be inclined not to. Yes, QB includes a lot of structural metadata but in the context of the BP, I'd say the examples given are sufficient. We included a link to the RDF Data Cube in the Possible Approach to Implementation section of BP4.

Implementation: https://github.com/w3c/dwbp/commit/fc884022a635c9d4051466258e062112df256001

51 Provenance I think the first paragraph of the intro section can be removed and the glossary link added to the 2nd, like:

The Web brings together business, engineering, and scientific communities creating collaborative opportunities that were previously unimaginable. The challenge in publishing data on the Web is providing an appropriate level of detail about its origin. The <a href="#data_producer">data producer</a> may not necessarily be the data provider and so collecting and conveying this corresponding metadata is particularly important. Without <a href="#data_provenance">provenance</a>, consumers have no inherent way to trust the integrity and credibility of the data being shared. Data publishers in turn need to be aware of the needs of prospective consumer communities to know how much provenance detail is appropriate.

Phil Resolved: BP was updated according to Phil's proposal

Implementation: https://github.com/w3c/dwbp/commit/0271027460fadf467e7175476387d969e132a103

52 Quality Slight rewording of the intro paragraph:

The quality of a dataset can have a big impact on the quality of applications that use it. As a consequence, the inclusion of <a href="#data_quality">data quality</a> information in data publishing and consumption pipelines is of primary importance. Usually, the assessment of quality involves different kinds of quality dimensions, each representing groups of characteristics that are relevant to publishers and consumers. The Data Quality Vocabulary defines concepts such as measures and metrics to assess the quality for each quality dimension VOCAB-DQV. There are heuristics designed to fit specific assessment situations that rely on quality indicators, namely, pieces of data content, pieces of data meta-information, and human ratings that give indications about the suitability of data for some intended use.

Phil Resolved: BP was updated according to Phil's proposal

Implementation: https://github.com/w3c/dwbp/commit/88d24996c7db6727552178652b660f846db75530

53 Versioning Looking at the intro material I think I could probably find people to argue that all three of those scenarios are simply corrections rather than new versions. But then, as you say, there is no consensus :-)

I would phrase the intended outcome as:

To enable humans and software agents to easily determine which version of a dataset they are working with.

Phil Resolved: BP was updated according to Phil's proposal

Implementation: https://github.com/w3c/dwbp/commit/12c52bf24c8bfad10045880e2df2cf01d74d8650

54 Assess dataset coverage BP28: 'Assess dataset Web context' is better than 'Assess dataset coverage'. Coverage could be confused with the specialised geospatial meaning. Change over paras. 'Scope' may be a useful word.

Could you use an example without mentioning 'triples'? It is a term that requires specialised knowledge or is too implementation specific.

Chris Little Resolved: Phil will rewrite Data Preservation BPs

Implementation: https://github.com/w3c/dwbp/commit/a39037008231c32a6addb15c60dcac7638cd1560

55 Bulk Access BP 19: Provide bulk download The intended outcome is focused on the wrong thing. It says "Bulk download will enable large file transfers (which would require more time than a typical user would consider reasonable) by dedicated file-transfer protocols." That's true, but it's not the point of the BP. The idea of allowing bulk download applies to datasets that are smaller as well as larger ones, and it need not involve alternative protocols. The outcome we are hoping for is that people will be able to easily download the data with a single request.

In the implementation section, the first bullet should clarify that it is about downloading. Making a request to one URI isn't unique to that bullet. (A bulk request to an API goes to one URI as well.) It should read "For datasets that exist initially as multiple files, preprocessing a copy of the data into a compressed archive format and making the data accessible for download from one URI."

The test should be about whether the full dataset can be retrieved with a single request, not whether the data is preprocessed. That test works for APIs as well as file downloads by humans.

Annette Proposal:

Intended outcome: To enable consumers to access the complete dataset for local processing with a single request.

Approach to implementation (1st bullet): For datasets that exist initially as multiple files, preprocessing a copy of the data into a single file and making the data accessible for download from one URI. For larger datasets, the file can also be compressed.

How to test: Check if the full dataset can be retrieved with a single request.

Resolved: Update BP Bulk Access as proposed below:

Intended outcome: To enable consumers to access the complete dataset for local processing with a single request.

Approach to implementation (1st bullet): For datasets that exist initially as multiple files, preprocessing a copy of the data into a single file and making the data accessible for download from one URI. For larger datasets, the file can also be compressed.

How to test: Check if the full dataset can be retrieved with a single request.

Implementation: https://github.com/w3c/dwbp/commit/b373bef216bc0da7be4418d64ca17c06d4a6186b
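
The first implementation bullet (preprocessing multiple files into a single, optionally compressed file) might look something like the sketch below. It assumes a ZIP archive is an acceptable single-file format; the function name and file names are hypothetical.

```python
import zipfile
from pathlib import Path


def build_bulk_archive(source_files, archive_path):
    """Bundle several files into one compressed archive for single-request download."""
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for source in source_files:
            source = Path(source)
            # Store each file under its bare name so the archive has a flat layout.
            archive.write(source, arcname=source.name)
    return Path(archive_path)
```

The resulting archive would then be published at one URI, so that the "How to test" check (retrieving the full dataset with a single request) can be applied to it.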

56 Subsets for Large Datasets BP 20: Provide Subsets for Large Datasets

Change "Static datasets that users in the domain would consider to be large will be downloadable in smaller pieces" to "Static datasets that take some time to download will be downloadable in smaller pieces" It's true that being large is dependent on what users in the domain consider to be large, but the issue here is time, not largeness.

Annette Annette's proposal: Both human users and applications should be able to access subsets of a dataset, rather than the entire thing, as needed. Available subsets should maximize the ratio of needed data to unneeded data in responses to consumer requests. Static file downloads should be kept to reasonable download times, and APIs should return results of appropriate granularity to suit the domain and Web application performance. Resolved: to follow Annette's proposal.

Implementation: https://github.com/w3c/dwbp/commit/de513905142b813d3a7c5ae5a8d5305942136b7c

57 Content negotiation BP 21: Content negotiation should be in the implementation section for BP 14, multiple formats, rather than its own BP. I think we've already agreed to change this, but I'll just reiterate that I'm not yet convinced that always using conneg is a best practice for serving multiple formats from an API. I like the use of file extensions, because they allow one to reference a resource as a URI instead of a URI plus required headers (plus a note explaining how to set headers). I also think it's good to allow tests of an API using a browser when possible. Since browsers don't let you set request headers, relying on conneg alone prevents that. Using both addresses most objections, but many people prefer conneg because it allows them to get file extensions out of URIs. Implementing both doesn't accomplish that. For file downloads, I think conneg is a worst practice, because browsers don't allow users to set headers. Anyway, we could argue a long time on this. There is still a lot of disagreement about this stuff. Annette Resolved: keep BP 21 (Content negotiation) and make a link from BP 14 to BP 21.

Implementation: https://github.com/w3c/dwbp/commit/0c0852efdff04fd54b07d5eb2b2bfc3936018d21
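
The "using both" option that Annette mentions could be sketched as below: a file extension in the URI path takes precedence, and the Accept header is consulted only when no extension is present. The mapping, the default, and the function name are all assumptions made for illustration, and the Accept parsing is deliberately simplified (it ignores q-values and wildcards).

```python
# Hypothetical mapping; the extensions and media types are illustrative only.
EXTENSIONS = {
    ".json": "application/json",
    ".csv": "text/csv",
    ".ttl": "text/turtle",
}
DEFAULT_MEDIA_TYPE = "application/json"


def choose_media_type(path: str, accept: str = "") -> str:
    """Pick a representation: URI extension first, then the Accept header."""
    # 1. An explicit extension in the URI wins, so the URI alone names the format.
    for extension, media_type in EXTENSIONS.items():
        if path.endswith(extension):
            return media_type
    # 2. Otherwise fall back to (simplified) content negotiation.
    supported = set(EXTENSIONS.values())
    for item in accept.split(","):
        media_type = item.split(";")[0].strip().lower()
        if media_type in supported:
            return media_type
    # 3. Nothing matched: serve the default representation.
    return DEFAULT_MEDIA_TYPE
```

With this arrangement a browser can request /bus/stops.csv directly, while a client that prefers extension-free URIs can request /bus/stops with an Accept header.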

58 access real-time BP 22: Subtitle: I still don't know what it means for data to be "produced in real time". The other day I posted some log data from a supercomputing system. That data is produced constantly, and it appears in the logs immediately when an event happens. That feels to me like real time, but I don't think it is appropriate to publish on the web in real time, because the purpose of posting is detailed analysis, not monitoring. On the other hand, preparing the log data for publishing is slow, so maybe that's the real measure. Maybe it should be "When data is released in real time . . ."

The intended outcome defines near real time as with a predetermined delay. The U.S. Census has a predetermined delay of 10 years, and that is not near real time. See https://en.wikipedia.org/wiki/Real-time_computing#Near_real-time for some help.

I don't understand the Push approach to implementation. I think the last word was intended to be publisher. "Disseminating" is vague and not particularly push-y, and making storage available is certainly not push-y. The last sentence of the implementation section is garbled. I think real-time data implementation is better broken into streaming or not streaming. It would be helpful to give some info about those alternatives.

The example doesn't use the transport agency, and it doesn't show how to implement real-time data. It would be more appropriate as an example of an API.

Mention of PROV-O in the test is unnecessary and off point. A more appropriate test might be to measure the refresh frequency and see that it matches the update frequency of the source data, and to measure the latency and see if it is in the real-time or near-real-time range.

Annette Resolved: BP22: to include a definition for near real time in the glossary (from wikipedia) and create a link in the outcome. Change the subtitle to use released instead of produced.

Implementation: https://github.com/w3c/dwbp/commit/8168fc7a34be0fa665a97cb1d9f1f325f837e0c8 Implementation: https://github.com/w3c/dwbp/commit/a255b291331a98057033bd4abf0d2ca6212deebf

Resolved BP22: Change How to test section to use Annette’s proposal: A more appropriate test might be to measure the refresh frequency and see that it matches the update frequency of the source data, and to measure the latency and see if it is in the real-time or near-real-time range. Implementation: https://github.com/w3c/dwbp/pull/396/commits/a9779b4fa568d11354fce82a92b6202a5a34643b
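
The latency part of this test could be sketched as follows. The 5-second limit is an assumed reading of "a few seconds", and the near-real-time delay is whatever the publisher has predetermined; neither value, nor the function name, comes from the BP itself.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "a few seconds" is read here as at most 5 seconds.
REAL_TIME_LIMIT = timedelta(seconds=5)


def classify_latency(created, published, near_real_time_delay):
    """Classify the delay between data creation and its availability on the Web."""
    latency = published - created
    if latency <= REAL_TIME_LIMIT:
        return "real time"
    if latency <= near_real_time_delay:
        return "near real time"
    return "delayed"


# Usage with made-up timestamps: data created at noon UTC, published 3 minutes later,
# checked against a predetermined delay of 5 minutes.
created = datetime(2016, 3, 3, 12, 0, 0, tzinfo=timezone.utc)
result = classify_latency(created, created + timedelta(minutes=3), timedelta(minutes=5))
print(result)  # near real time
```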

59 Access Up to Date BP 22: up to date

The Why text is unclear as to what type of coincidence is desired and what should coincide with what. Similar to the real-time BP, I think the issue here is that the publication frequency should match the release frequency. The first sentence of the test reads like a note to ourselves to write a test. One step is "publish an updated version of data." That is not something one can do whenever a test is needed. More importantly, that test only determines whether there is a difference between two versions of the data. What it should be testing is the timeliness of the most recent data.

Annette Annette's suggestion: the first sentence is not about why. It belongs in the intended outcome. The intended outcome should say "Data on the web should be updated in a timely manner so that the most recent data available online reflects the most recent data released. When new data is released via any channel, it should be made available on the Web as soon as possible thereafter."

We could use a transit example about real-time bus arrival predictions. The test could be to check that the update frequency is stated and that the most recently published copy on the Web is no older than the stated update frequency.

Resolved: BP23: Annette and Bernadette will rewrite BP23 according to Phil's and Annette's suggestion.

Implementation: https://github.com/w3c/dwbp/commit/43ba83a77ecbcdd03d2e9066bedd5f7df482c2e4
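
The test suggested above (the most recently published copy is no older than the stated update frequency) could be checked along these lines. The 30-second interval for bus arrival predictions is a made-up value, and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def is_up_to_date(last_published, update_interval, now):
    """True if the newest published copy is no older than the stated update interval."""
    return now - last_published <= update_interval


# Usage with made-up values: predictions stated to update every 30 seconds.
now = datetime(2016, 3, 3, 12, 0, 0, tzinfo=timezone.utc)
fresh = is_up_to_date(now - timedelta(seconds=20), timedelta(seconds=30), now)
stale = is_up_to_date(now - timedelta(minutes=10), timedelta(seconds=30), now)
print(fresh, stale)  # True False
```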

60 Make Data Available through an API Regarding the BP 24: Make Data Available through an API The test should say that a test client can simulate calls and the API returns the expected responses. (The test client doesn't simulate the responses.) Annette Proposal:

How to test: Check if a test client can simulate calls and the API returns the expected responses.

Resolved: Update the BP as described below:

How to test: Check if a test client can simulate calls and the API returns the expected responses.

Implementation: https://github.com/w3c/dwbp/commit/9e75bea228c7dd8261e1eb33991e07957905ca14

61 document your API Regarding the BP 26: Provide complete documentation for your API. The examples are all spatial data examples. None of them really makes sense in this context. We should probably offer examples for the transport agency.

Can we use a test like "time to first successful call"? That would require having volunteers to learn to use the API and timing them.

Annette Implementation: https://github.com/w3c/dwbp/commit/5879a077989310a409145bf476af57ff0a121342
62 Use a trusted serialization format Regarding BP 28: Use a trusted serialization format for preserved data dumps.

If we keep this, it should at least offer JSON as an acceptable example. JSON is the current overwhelming standard for APIs. This talks about "sending data dumps for long-term preservation" and "data depositors". Where are the data being sent? Is it on the Web? The bad example would pass the How to Test.

Annette Resolved: Phil will rewrite Data Preservation BPs

Implementation: https://github.com/w3c/dwbp/commit/a39037008231c32a6addb15c60dcac7638cd1560


63 Update the status of identifiers Regarding BP 29: Update the status of identifiers

It's not quite clear what we are suggesting get linked to what. The Why talks about linking preserved datasets with the original URI. Are we saying the original URI should continue to point to the preserved dataset? If that's the case, then what does preservation mean? There is also discussion of saving snapshots as versions, which seems to me is covered better under versioning.

We say "A link is maintained between the URI of a resource, the most up-to-date description available for it, and preserved descriptions." One link can only join two resources. Should people preserve old descriptions? Maybe descriptions of older versions are what was meant?

A 410 status only makes sense if there's nothing served at the URI, which isn't the case if the advice here is followed. 303 seems like a good option.

Annette Resolved: Phil will rewrite Data Preservation BPs

Implementation: https://github.com/w3c/dwbp/commit/a39037008231c32a6addb15c60dcac7638cd1560

64 Feedback In the Introduction I disagree with this sentence: "In order to quantify and analyze usage feedback, it should be recorded in a machine-readable format." I think using automated tools to gather feedback and store it in a searchable way is a good idea, but saying the feedback should be machine readable is misleading and insufficiently specific. If you have succeeded in posting feedback on the web, it is machine readable by definition. It sounds like we are telling people to publish their feedback as another dataset. You may want to store it in a machine-readable way for the purpose of displaying it to other humans, but there's no reason to *publish* the feedback with machines in mind. Annette Resolved: Remove the sentence from the introduction: "In order to quantify and analyze usage feedback, it should be recorded in a machine-readable format."

Implementation: https://github.com/w3c/dwbp/commit/1ad8d3b312eea660fe420cb7aee9b4e2595e575a

65 Feedback Regarding the BP 31: Gather feedback from data consumers.

This BP includes recommendations about making feedback public, but that's handled in the next BP. We should keep this BP focused on enabling feedback.

The first sentence of the Why needs rewriting. We should remove the word "providing" at the beginning. The BP is about collecting feedback, not providing it. It should address the value of setting up a specific way of collecting feedback (makes it easier for consumers to contribute).

Remove the mention of machine-readable formats and using a vocabulary for capturing the semantics of the feedback information. Instead, suggest using an automated feedback system, such as a bug tracker.

How to test, the first bullet is a note to us, I guess. The second is partially about the next BP. The third is again treating the feedback data as another published dataset. There's nothing wrong with publishing such a dataset, but that's not the idea here. A real test would be whether a consumer is able to find a way to provide feedback.

Annette Resolved: to update the BP as follows:

Why: Obtaining feedback helps publishers understand the needs of their data consumers and can help them improve the quality of their published data. It also enhances trust by showing consumers that the publisher cares about addressing their needs. Specifying a clear feedback mechanism removes the barrier of having to search for a way to provide feedback.

Approach to implementation: Provide data consumers with one or more feedback mechanisms including, but not limited to, a contact form, point and click data quality rating buttons, or a comment box. In order to make the most of feedback received from consumers, it's a good idea to collect the feedback with a tracking system that captures each item in a database, enabling quantification and analysis. It is also a good idea to capture the type of each item of feedback, i.e., its motivation (editing, classifying [rating], commenting or questioning), so that each item can be expressed using the Dataset Usage Vocabulary [VOCAB-DUV].

How to test: Check that at least one feedback mechanism is provided and readily discoverable by data consumers.

Implementation: https://github.com/w3c/dwbp/commit/c9d24d64701564ad40d74907d86e4eb22414a178

66 Feedback Regarding the BP 32: Make feedback available.

The Why should mention avoiding duplication and being transparent about the quality of the data.

The intended outcome is tautological. It should include the idea that consumers should be able to review issues already raised by others, saving them the trouble of filing duplicate bug reports. Publishing feedback also helps consumers understand any issues that may affect their ability to use the data.

The implementation section needs to be changed. We should not be telling people that they need to present their feedback in machine readable form.

The test is again about metadata for the feedback as a dataset. Publishing your feedback as a dataset is not a best practice.

Annette Resolved: Bernadette will update the BP according to Annette's proposal.

Implementation: https://github.com/w3c/dwbp/commit/e25d3263e24ba5e5ebd0202abf6a07c6884f1db4

67 Data Enrichment Regarding the BP 33: Enrich data by generating new data

The Why needs a few caveats. "Under some circumstances, missing values can be filled in, and ..." "Publishing more complete datasets can enhance trust, if done properly and ethically."

In the intended outcome, "should be enhanced if possible" is too strong. The first paragraph could be "Data that is unstructured should be given structure if possible. Additional derived measures or attributes should be added if they enhance utility. A dataset that has missing values can be enhanced to fill in those values if the addition does not distort analytical results, significance, or statistical power."

Annette Resolved: Update the BP as follows:

Why: Enrichment can greatly enhance processability, particularly for unstructured data. Under some circumstances, missing values can be filled in, and new attributes and measures can be added. Publishing more complete datasets can enhance trust, if done properly and ethically. Deriving additional values that are of general utility saves users time and encourages more kinds of reuse. There are many intelligent techniques that can be used to enrich data, making the dataset an even more valuable asset.

Intended Outcome: Data that is unstructured should be given structure if possible. In structured data, missing values should be added if they enhance utility, but only if the addition does not distort analytical results, significance, or statistical power. Values generated by inference-based techniques should be labeled as such, and it should be possible to retrieve any original values replaced by enrichment. Whenever licensing permits, the code used to enrich the data should be made available along with the dataset.

Implementation: https://github.com/w3c/dwbp/commit/66cf069a3b4d905aa79d9c50abd610898b4283bf

68 Glossary The definition of locale needs to mention geographic location.

The definition of machine readable data surprises me. I think proprietary formats are machine readable, too. If we want to steer people away from proprietary formats, we should do that explicitly.

Annette Resolved: Update the definition of locale as follows:

A set of parameters that clarifies aspects of the data that may be interpreted differently in different geographic locations, such as language and formatting used for numeric values or dates.

Implementation: https://github.com/w3c/dwbp/commit/defff64772d0ee99c03589bb397d7747eb83485c

69 licenses We say "Data license information can be provided as a link to a human-readable license or as a link/embedded machine-readable license." Since licensing info is part of metadata, and we tell people to provide metadata for both humans and machines, we should also require licensing info for both humans and machines. Annette Updated according to Annette's proposal

Implementation: https://github.com/w3c/dwbp/pull/388/commits/2508c71de31a9010c8157a5d1b3079d9102c5bd1

70 machine-readable standardized data formats In the possible approach to implementation for BP13, could we change NetCDF to HDF5? HDF5 is more general. NetCDF is based on HDF5, so using the latter covers both. Annette Resolved: to do the update proposed by Annette.

Implementation: https://github.com/w3c/dwbp/commit/46ac79a2b88a2f0021274214b8e378488cdbfdc5

71 Context "Data is published in different distributions, which is a specific physical form of a dataset." should/can be replaced by "Data is published in different distributions, which are specific physical form of a dataset." Riccardo Albertoni Resolved: the phrase in the Context section "Data is published in different distributions, which is a specific physical form of a dataset." will be replaced by "Data is published in different distributions, which are specific physical form of a dataset."

Implementation: https://github.com/w3c/dwbp/commit/579a8a5e9bdb8e8a1b7479435e91737186120b1a

72 Provide Metadata how to test?

About the sentence "Check if all provided metadata are coherent with the described resource." I am not sure I understand what kind of coherence we are referring to. Perhaps we should specify it. Otherwise, I would opt for suggesting "Check if human readable metadata is available"

Riccardo Albertoni Resolved: The phrase in the BP Provide Metadata in the how to test "Check if all provided metadata are coherent with the described resource." will be replaced by "Check if human readable metadata is available"

Implemented: http://w3c.github.io/dwbp/bp.html#ProvideMetadata

73 Provide descriptive Metadata how to test?

About the sentence "Check if the descriptive metadata is available in a valid machine-readable" I would add "description" or format at the end of it.

Riccardo Albertoni Resolved: in the how to test of the BP Provide descriptive Metadata, the word "format" will be added at the end of the sentence, giving "Check if the descriptive metadata is available in a valid machine-readable format"

Implementation: https://github.com/w3c/dwbp/commit/f2f2eabba300ed5f07bd21c173b8b7d21b20c25c

74 locale parameters how to test?

there is an extra ")" at the end of the first sentence.

Riccardo Albertoni Resolved: in the how to test of Locale Parameters, the extra ")" at the end of the first sentence will be removed.

Implementation: https://github.com/w3c/dwbp/commit/62b32187b82b7b6f0fb0881707db927a7d6f4130

75 locale parameters Use machine-readable standardized data formats: Possible Approach to implementation

"Make data available in a machine readable standardized data format that is easily parseable including but not limited to CSV, XML, Turtle, NetCDF, JSON and RDF." RDF is more a data model than a data format; it can be serialized in different serialization syntaxes such as turtle, JSON-LD and RDF/XML. I would replace the sentence above with "Make data available in a machine readable standardized data format that is easily parseable including but not limited to CSV, XML, Turtle, NetCDF, JSON and RDF/XML." or "Make data available in a machine readable standardized data format that is easily parseable including but not limited to CSV, XML, NetCDF, JSON and RDF serialization syntaxes like RDF/XML, JSON-LD, turtle."

Riccardo Albertoni Resolved: In the Possible Approach to implementation of the BP Use machine-readable standardized data formats the phrase "Make data available in a machine readable standardized data format that is easily parseable including but not limited to CSV, XML, Turtle, NetCDF, JSON and RDF." will be replaced by "Make data available in a machine readable standardized data format that is easily parseable including but not limited to CSV, XML, NetCDF, JSON and RDF serialization syntaxes like RDF/XML, JSON-LD, turtle."

Implementation: https://github.com/w3c/dwbp/commit/c888f0c9db2fd27b56c8037df9ea9781b2bdc29e

76 subsets How to test should say something about all the subsets adding up to the complete set. Didn't we have a test before that the entire dataset can be recovered by making a series of smaller requests? I think we had a note that coming up with use cases isn't deterministic enough. Annette Implementation:

https://github.com/w3c/dwbp/commit/5879a077989310a409145bf476af57ff0a121342 https://github.com/w3c/dwbp/commit/5fbf772df2f8f8e863afd6cb4ec98bc68a935316
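
The test Annette recalls (recovering the entire dataset through a series of smaller requests) could be sketched like this. The page function is a hypothetical stand-in for a real subset request; an actual test would call the publisher's API or download the published pieces instead.

```python
def page(records, offset, limit):
    """Stand-in for one subset request (hypothetical; a real test would call the API)."""
    return records[offset:offset + limit]


def reassemble(records, limit):
    """Collect every subset in order until an empty page is returned."""
    collected = []
    offset = 0
    while True:
        chunk = page(records, offset, limit)
        if not chunk:
            break
        collected.extend(chunk)
        offset += limit
    return collected


# The subsets "add up" if the reassembled records equal the complete dataset.
full_dataset = list(range(10))
print(reassemble(full_dataset, 3) == full_dataset)  # True
```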

77 identifiers

The example is rather redundant. It is data.mycity..., and yet /dataset also appears in the path. The path also contains /bus as well as /bus-stops. It's unlikely that the agency has so many transit modes that they need to be split between road and rail and water. The same info is conveyed as well by the much shorter http://data.mycitytransit.example.org/bus/stops . I think we could go with something like this: http://data.mycity.example.org/transport/ is a base for all the example URIs. Probably a real link would need to identify the dataset somehow rather than just say that it's a dataset. What do you think about this? http://data.mycity.example.org/transport/timetables/bus/stops/

Implementation:

https://github.com/w3c/dwbp/commit/152af52fda80a0ac43faf80acf806c9d9c4670ef https://github.com/w3c/dwbp/commit/50b9a9c47d05adf0e26bc34cc4e6429d20526b2f

78 data vocabularies The first paragraph seems to be suggesting that controlled vocabularies enable easy translation, but it's confusingly phrased. The last three sentences could be changed to read "Standardized vocabularies can also serve to improve the usability of datasets. Say a dataset contains a reference to a concept described in a vocabulary that has been translated into several languages. Such a reference allows applications to localize their display of the data depending on the language of the user."

The last paragraph refers to "the former kind of vocabulary". It's not clear what kind that is. It's not clear what the point of that paragraph is.

Annette Implementation:

First paragraph was removed. https://github.com/w3c/dwbp/commit/558e8d4c4555a65268fc6ac6b8d0c9c13a74bc93
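The localization point in the suggested wording can be sketched as below. Everything here is assumed for illustration: the vocabulary URI, the translated labels, and the `localized_label` helper are hypothetical and do not come from the document or from any real vocabulary.

```python
# Hypothetical vocabulary: one concept with labels in several languages.
LABELS = {
    "http://vocab.example.org/transport#BusStop": {
        "en": "Bus stop",
        "pt": "Ponto de ônibus",
        "fr": "Arrêt de bus",
    },
}


def localized_label(concept_uri, language, fallback="en"):
    """Return the label for the user's language, else the fallback label,
    else the concept URI itself when the concept is unknown."""
    labels = LABELS.get(concept_uri, {})
    return labels.get(language) or labels.get(fallback) or concept_uri


concept = "http://vocab.example.org/transport#BusStop"
french = localized_label(concept, "fr")
german = localized_label(concept, "de")  # no German label, falls back to English
```

Because the dataset stores only the reference to the concept, the application can choose the display language without any change to the data.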

79 multiple formats The example says John decided to use XML, but it shows ttl, and it shows metadata, not data.

The trend lately is toward doing a single format (json). Do we want to go against that trend? I note that the W3C's own API is json only.

Annette Implementation:

https://github.com/w3c/dwbp/commit/b85fa424e9c0ca1f7c795a7f367652279e36ecfc

80 standardized formats "machine-readable" is used differently here than in the metadata section. Technically, nothing on the web is not machine-readable. I think we could remove that phrase.

"adequate for its intended or potential use" doesn't really help in choosing. That's like saying "data on the web must be good."

The intended outcome should be more normative. "Data should be available in a standardized format that is easily parseable" belongs in the intended outcome. We could add that data should not be posted as an image unless the data itself encodes the image. (A jpeg file of a table is an image that encodes the data; RGB channel data from an imaging microscope is data that encodes an image.)

The example is metadata, not data. This BP is about formats for the data.

Annette We updated the examples and rewrote the intended outcome and the why sections.

Implementation: https://github.com/w3c/dwbp/pull/386/commits/f033ee73eb869ba70fddc07001d2e4703aff5b0c https://github.com/w3c/dwbp/pull/386/commits/b52e383f7fd61076c15244522105623e2b4f259f https://github.com/w3c/dwbp/commit/f3ed1fe24b3bb8835ae062c06aa0454ae0c9f5c5 https://github.com/w3c/dwbp/commit/e0c2f70d52674d700d80c16cce1de6acc6bf9157

81 Sensitive data The discussion of sensitive data still needs a disclaimer, and the text should be more general rather than focused only on personal privacy.

Annette The sensitive data section was removed, a BP about data unavailability was added to the Data Access section, and text about sensitive data was added to the introduction and the data enrichment section.

Implementation: https://github.com/w3c/dwbp/commit/dd752baa9227b587f981c71779fcc3a60206273b


82 Data Access Data Access Introduction: We say the web uses HTTP by default and then say that different approaches can be adopted, including bulk download and APIs. Bulk download and APIs, of tar files or anything else, both use HTTP!

Next there is discussion of packaging in bulk using non-proprietary file formats (e.g., tar files). This has nothing to do with being nonproprietary. The point, I think, is archiving a directory structure into a single file.

Paragraph 3 is tautological. Data that is already streaming to the web is already published in a manner that allows immediate access. I think we mean to say "For data that is generated in real time or near real time, data publishers should use an automated system to enable immediate access to time-sensitive data, such as emergency information, weather forecasting data, or system monitoring metrics. In general, APIs should be available to allow third parties to automatically search and retrieve such data." If you want to then talk a little about APIs for other kinds of data, you could add a paragraph that goes like this: "Aside from helping to automate real-time data pipelines, APIs are suitable for all kinds of data on the Web. Though they generally require more work than posting files for download, publishers are increasingly finding that delivering a well documented, standards-based, stable API is worth the effort."

Annette Implementation: https://github.com/w3c/dwbp/pull/387/commits/b3c0bb30ffa66a5afc20e5d2aaa5ef9b756d1439


84 web standard APIs We should have some references for REST.

Richardson, L. and Sam Ruby, RESTful Web Services, O'Reilly, 2007, http://restfulwebapis.org/rws.html.

Fielding, Roy T., "Representational State Transfer (REST)", Chapter 5 of Architectural Styles and the Design of Network-based Software Architectures, Ph.D. Dissertation, University of California, Irvine, 2000, https://www.ics.uci.edu/~fielding/pubs/dissertation/rest_arch_style.htm.

Annette Implementation: https://github.com/w3c/dwbp/commit/4a7e0e00f8afdef9d68f000ba9f9c6adedc2a66b
85 avoid breaking APIs In the implementation section, "home resource URI" gets used as a plural, but an API should only have one. Remove "home" in the first one and it makes more sense. "...by keeping resource URIs constant..."

The bit about announcing changes should go in the outcome section.

Annette

Implementation: https://github.com/w3c/dwbp/commit/8ff10821cfe75f52bc084f155bf312c9bb35ab78

86 cite source The first line of the example ("You can cite the original...") should replace the text above it ("You can use the Dataset Usage..."). The example citation should list the transit agency as the author. 'Data source: MyCity Transport Agency, "Bus Timetable of MyCity...'

Annette

Implementation: https://github.com/w3c/dwbp/commit/7ee799d6abbf7ba53616aa6c73a0d3f9cee572c6

87 challenges In the diagram, the challenge texts should be similar, either all statements or all questions. Suggestion for the reuse one: "How can I reuse responsibly?" (The current question sounds a little too self-serving.)

Annette Implementation: https://github.com/w3c/dwbp/commit/1aa0c50ab36b085006337ecc0629dafbab8b0d41