data quality vocabulary

Hi folks,
Following up on a comment I made last meeting and promised to put into writing, I think the accuracy/consistency/relevance metric needs to be broken up into separate items. A dataset could rank well on one of those three and poorly on another; they are pretty orthogonal. In particular, I think relevance can't be a ranking but rather a statement of who the data is relevant for. In that respect, it may not really be a measure of quality at all. (Credibility is similar: the decision of who is authoritative is subjective, and I wouldn't want to rate a good dataset poorly just because it was put out by a small business.)

I see consistency as crucial; that is really what I want as a user of data. What I want to know is: can I use the data readily in an analysis tool? Can I open the dataset in R and do some statistical manipulations? Can I open it in Tableau and make a visualization without doing a lot of cleaning?
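To make the separation concrete, here is a minimal sketch of what I mean (purely illustrative Python; the class and field names are my own invention, not anything from the draft vocabulary): accuracy and consistency are scored independently, while relevance is recorded as a statement of audience rather than a rank.

    # Purely illustrative sketch -- names are hypothetical, not from the draft vocabulary.
    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class QualityAnnotation:
        """Quality information a publisher might attach to a dataset."""
        # Ranked dimensions, each scored independently of the others.
        accuracy: Optional[float] = None     # e.g. 0.0 (poor) to 1.0 (excellent)
        consistency: Optional[float] = None  # can a tool like R or Tableau ingest it without cleaning?
        # Relevance is not a ranking but a statement of who the data is relevant for.
        relevant_for: list[str] = field(default_factory=list)

    # A dataset can rank well on one dimension and poorly on another:
    survey_data = QualityAnnotation(
        accuracy=0.9,        # carefully collected figures
        consistency=0.3,     # inconsistent column formats, needs cleaning first
        relevant_for=["transit planners", "local journalists"],
    )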

I also worry that we are defining some things that aren't really about data quality so much as about the best practices we already have in the BP doc. I'm thinking here of the mentions of machine readability and metadata.

I think we also need to consider how this vocabulary is expected to be used. If a data publisher provides quality information at publication time (which is what I've been thinking of as the main use of the vocabulary), then some items won't make sense to include. Information like why the data was removed from the web, for example, won't be available when the data is first published.

We may need to do some scoping to be sure we are targeting quality information. I would suggest that we avoid repeating what is in the BP doc.
-Annette
