RDF Validation Workshop -- 10 Sep 2013

Attendees

Present: +1.617.715.aaaa, dbs, DaveReynolds, aisaac, +1.510.435.aabb, Workshop_room, +1.510.435.aacc, +1.510.435.aadd, +1.510.435.aaee, kcoyle
Chair: Arnaud Le Hors and Harold Solbrig
Scribe: sandro, arthur

Topics

Introductions

Miguel Esteban Gutiérrez <mesteban>	Center for Open Middleware (Universidad Politecnica de Madrid)
Jose Labra presentation <labra>	My name is Jose Emilio Labra Gayo (University of Oviedo, Spain). I am interested in this workshop because we have a practical use case on the WebIndex and we have used a SPARQL-query-based tool called Computex to validate RDF. We are also interested in RDF profiles.
Graham Rong <GR>	PhD, from MIT; has been working on semantic web applications in the financial industry ... http://bit.ly/RxzPyr Linking XBRL to RDF: The Road To Extracting Financial Data For Business Value
Sandro Hawke <sandro>	W3C. Staff contact for RDF-WG, GLD-WG, and was for SPARQL, RIF, OWL, Prov
Roger Menday <roger>	Fujitsu Laboratories of Europe. Working on using Linked Data technologies in the Enterprise
Guoqian Jiang <guoqian>	Mayo Clinic, Rochester MN. I am a clinical informatics researcher. My research interests focus on clinical data standards and using semantic web tools for data validation and quality assurance in the health domain.
Harold Solbrig <hsolbri>	Mayo Clinic. Focus on Ontologies in clinical research and standardized ontology representation. Editor and author of OMG LQS specification, HL7/ISO Common Terminology Services (CTS) and OMG CTS2. Participant in ISO 11179 and XMDR projects, IHTSDO SNOMED CT, WHO ICD-11 project.
David Booth <DavidBooth>	KnowMED. Applying RDF and other semantic web technology to medical records and other healthcare information to facilitate better research and help measure quality of care.
Ashok <Ashok_Malhotra>	Oracle. Member of LDP WG. Worked on XML Schema for many, many years!
Martin G. Skjæveland <mSkjaeveland>	PhD student from University of Oslo, Norway. Will present work on validating incoming RDF data based on what is in the receiving dataset.
Arthur Ryman <arthur>	IBM Rational, developed OSLC Resource Shape spec to fill the void where XML Schema lived, for documenting and specifying REST APIs for Linked Data
Robert Beideman <rmb>	GS1: Leveraging RDF and LOD to facilitate availability of trusted, authentic data about Products, Companies, and Services on the Web
Mark Harrison <mgh>	Auto-ID Lab at the University of Cambridge. We have a close collaboration with GS1 in the development of technical standards for supply chain visibility, traceability and electronic pedigree and we've recently been involved in the GS1 Digital project, which is looking at ways to use Linked Open Data for products
Anamitra <Anamitra>	IBM/Maximo: RDF data introspection
Evren Sirin <evrensirin>	Clark & Parsia. We develop the Stardog RDF database, which provides RDF validation capabilities
Arnaud Le Hors <Arnaud>	("Arno Luh Oarss"), IBM Linked Data Standards Lead, chair of the LDP WG and of this workshop (former W3C Team member :-)
Tim Cole <timCole>	Univ of Illinois and W3C Open Annotation Community Group
Steve Speicher <SteveS_>	IBM SWG Rational: LDP Editor: OSLC community/standards, I work with arthur
Dave Reynolds <DaveReynolds>	Epimorphics Ltd. Part of GLD working group, co-editing Data Cube and Org specs. Among other things, work with UK public sector on use of Linked Data, which has raised a number of validation-like requirements.
Antoine Isaac <aisaac>	from Europeana: previously working on SKOS. Interested in getting good quality data from numerous, heterogeneous datasets
David Dolan <ddolan>	from Cape Mobile Tech's
Jim McCusker	from RPI: Biomedical Semantics; interests are data and provenance interoperability in the life sciences
Paul Davidson	Chief Information Officer, Sedgemoor District Council, UK
Bob Morris	from University of Massachusetts Boston/Harvard Herbaria
David Lowery	from Harvard University - Museum of Comparative Zoology
Phil Archer	from W3C. working with government linked data in the UK
Noah Mendelsohn	from Tufts University: XML Schema guru/historian

State of the Art

(slides, report summary)

<Ashok_Malhotra:> When we started RDF, folks said it was great BECAUSE it had no schema. Are we changing our mind?

<Arnaud:> Sounds like JSON :-)

<hsolbri:> The schema is there whether you write it down formally or not.

<DavidBooth:> There are lots of different schemas in RDF. The beauty of RDF is the ability to combine them.

<arthur> PDF version of my charts at http://www.w3.org/2001/sw/wiki/File:OSLC_Resource_Shapes.pdf

Presentation from Mark Harrison (U Cambridge)

(slides, report summary, ???)

Robert: GS1: we did bar codes. We work with the Auto-Id Labs (started here at MIT)
... GS1 digital, trying to leverage all the master data in the supply chain, business-to-consumer

Mark: (slide with iPhones, LOD for products, Pre-Sale)
... more informed choices, eg products with particular environmental impact
[on slide 9] ... do we want broken hyperlink checking?
... (can we validate offline)
... what is the scope/boundary of what we validate?
... When we have these huge code lists, the scale of validation queries might be problematic
... 3000 attributes, hundreds of which are code-list-driven

<hsolbri> Focus on markup and validation tools rather than the actual validation

<DaveReynolds> +1, publishing and inspecting the contract is at least as important as enforcing the contract

hsolbri: Happy to see these use cases. I think RDF "validation" is not the best framing. I think it's MORE important to publish the characteristics of what's in a store, rather than just validating.

arthur: This sounds a lot like what we've done at IBM. Can you describe....

mark: It's about making sure you can ...
... We need to make sure the two datasets are in sync with each other.
... You need to have confidence that these are the true values asserted by manufacturer.
... Maybe we could use digital signatures. There's liability to consider.

arthur: you're comparing published data with reference data. you don't need to compute a sig

mark: true, we could use prov as an alternative to sigs

<arthur> GS1 use cases very similar to OSLC, except for digital signatures

timcole: The issue cardinality, not validations. Value is correct... unit transformations. 600g = 1.2lbs or whatever. Are you encompassing that in validation?

mark: Yes.
... like in eric's example of reproducedOn date -- you want to do checking like that, with units conversion
... EU legislation says vitamins are expressed in certain units. Sanity checking on values -- to make sure we're not off by orders of magnitude

timCole: Does broaden the scope.

mark: Yes.

Robert: We used to have a closed network for this. To open it to millions of producers makes this more complex.

Ashok_Malhotra: If you want to test whether this date follows this other date, there are xquery functions to handle all of that stuff. So we can just pick them up. We don't have to invent them again

mark: We should leverage what we can, yes.
... And using qudt for conversion of units, and so on.
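
The unit-aware sanity checking Mark and Tim discussed might be sketched roughly as follows. This is only an illustration: the conversion table, function names and tolerance are assumptions made up for the example, and a real system would presumably take its conversion factors from a units vocabulary such as QUDT rather than hard-coding them.

```python
# Hypothetical conversion factors to grams; a real system would draw these
# from a units vocabulary such as QUDT rather than hard-coding them.
TO_GRAMS = {"g": 1.0, "kg": 1000.0, "mg": 0.001, "lb": 453.59237}


def to_grams(value, unit):
    """Normalise a mass to grams so values in different units can be compared."""
    return value * TO_GRAMS[unit]


def masses_agree(published, reference, rel_tolerance=0.01):
    """Return True if two (value, unit) pairs denote the same mass within tolerance."""
    a = to_grams(*published)
    b = to_grams(*reference)
    return abs(a - b) <= rel_tolerance * max(a, b)


def plausible_magnitude(value, unit, low_g, high_g):
    """Sanity check: flag values that are off by orders of magnitude."""
    return low_g <= to_grams(value, unit) <= high_g
```

Normalising to one base unit before comparing keeps the check independent of which unit the publisher happened to use; the range check is a separate, coarser test for order-of-magnitude mistakes.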

Requirements for RDF Validation - Harold Solbrig

(slides, report summary)

hsolbri: [re: ASN1] we had "strings" which were kind of like rdf graphs. a ptxt code was a sort of ontology

<guoqian> hsolbrig: from ptxt to ASN.1

hsolbri: RDF only guarantees triples, literals
... With SPARQL, you have to code EVERYTHING as optional!

<mgh> In SPARQL need to use OPTIONAL extensively for defensive coding in case value is not present

hsolbri: ... which is NP
... Side note: Dataset (identity is content), Triple store (Identity separate from content)

<guoqian> hsolbri: a definition about what is RDF store

hsolbri: We should focus on the invariants in an RDF store. The syntax MUST provide a way to state the invariants. What will always be true of this store, so when you're writing queries, you know what's optional, what can be in there, what can't be in there.
... We need a way for them to be published, and for them to be discovered.
... Future -- invariants will change over time.

<guoqian> hsolbri:RDF validation must provide a standard syntax and semantics for describing RDF invariants
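
As a rough illustration of what a published set of store invariants could let a client check, here is a minimal sketch. The invariant format, the `ex:` predicate names and the function name are all hypothetical; nothing here is a syntax proposed at the workshop.

```python
from collections import defaultdict

# Hypothetical published invariants for one kind of resource: which predicates
# are always present, which may be present, and whether anything else is allowed.
INVARIANTS = {
    "required": {"ex:lastName"},
    "optional": {"ex:firstName"},
    "closed": True,  # no predicates beyond required + optional
}


def check_invariants(triples, invariants):
    """Return a list of (subject, problem) pairs for subjects that break the invariants."""
    by_subject = defaultdict(set)
    for s, p, _o in triples:
        by_subject[s].add(p)

    allowed = invariants["required"] | invariants["optional"]
    problems = []
    for s in sorted(by_subject):
        preds = by_subject[s]
        for p in sorted(invariants["required"] - preds):
            problems.append((s, "missing required " + p))
        if invariants["closed"]:
            for p in sorted(preds - allowed):
                problems.append((s, "unexpected " + p))
    return problems
```

With invariants like these available, a query author would know that `ex:lastName` never needs to be wrapped in OPTIONAL, while `ex:firstName` does.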

hsolbri: Semantic Versioning. semver.org
... That was the MUST. Here's the SHOULD.
... representable in RDF, maybe also a DSL
... formally verifiable, consistent, maybe complete
... self-defining
... able to express a subset of UML 2 class and attribute assertions (and some OCL?)
... able to express XML Schema invariants
... implementable in existing tooling and infrastructure (RDF, SPARQL, REST, ...)

hsolbri: [slide 17] Example of allowed transitions -- you're allowed to add subjects, but not to add predicates.
... spectrum from read-only to write-any-triple.
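
The allowed-transition example from slide 17 (new subjects are fine, new predicates are not) could be checked along these lines. This is a sketch under stated assumptions: a store state is modelled as a plain set of (subject, predicate, object) tuples, the function names are invented, and removal of triples is not considered.

```python
def predicates(triples):
    """Collect the distinct predicates used in a store state."""
    return {p for _s, p, _o in triples}


def new_predicates(before, after):
    """Predicates that appear in the new state but not in the old one."""
    return predicates(after) - predicates(before)


def transition_allowed(before, after):
    """Allow a transition only if it introduces no previously unseen predicate."""
    return not new_predicates(before, after)
```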

<guoqian> hsolbri: LOD today OK for research but not for production systems

<guoqian> ... OK for relatively static stores but not for federation and evolution

<aisaac> Question for Harold: Just checking, when you say "All constraints of XML Schema", this includes sequences?

guoqian: You're offering another definition of "store". Is this different from existing defn of named graphs?

hsolbri: I'd have to go back and look at that. I think Named Graphs are local to quad store. And I'm focussing on having the identity of a store, but have the contents be constrained.

<sandro:> as i understand SPARQL11 terminology, a "graph store" can have multiple "states"
... so you're talking about constraining a particular graph store to only contain certain datasets

Arnaud: people use the term "graph" sometimes to mean something mutable or not, gboxes and gsnaps.

hsolbri: "magic box" was a term we once used.

aisaac: I heard Harold say he wants to represent all that's allowed by XML Schema. Does that include Sequence Information?

hsolbri: Great question. There are situations where people take advantage of order, but this may be a drawback. so, maybe MOST of XML schema. The challenge is how to get it back out in the right order....

Arnaud: We have on the agenda a presentation from Noah Mendelsohn, to talk about XML Schema, warning us against reproducing some of their mistakes.
... Some people will say 20/80 rule, but which 80?

Arthur: Your summary slide was a bit disappointing/negative.

hsolbri: I believe fixing this is necessary to make RDF able to be a primary source for content.

arthur: I consider your second negative to be a positive. It's why we've adopted RDF. Traditional data warehouses are very expensive because they completely enforce the schema. RDF allows more graceful evolution.

hsolbri: So, the flexibility of RDF is seen as a real advantage. A fellow at OMG used to distinguish between precise and detailed. We publish the invariants that are known, but it's important to be able to leave flexibility. If we make no assertion about firstname and lastname, then that's important to know, too.

evrensirin: Graceful evolution of data is an advantage of RDF. That's not about enforcement of schema, but about having the option to not have a schema.
... Clarification on post-conditions. State transitions, or states?

hsolbri: Closely related to reasoning. If you're doing anything beyond a basic PUT, adding a triple to a store may involve doing additional inferences, eg adding a firstname may result in the presence of a fullname in a store.
... what has to be true for this set of rules to fire; what is true if they do.

RDF Validation in a Linked Data World - Esteban-Gutiérrez

(slides)

[discussion of dynamics in validation not captured]

Linked Data Profiles - Paul Davidson

(video, report summary)

Paul wants a "Linked Data Profile" that describes the properties, values, etc., that should be used so that multiple councils in England can share data

<sandro> +1 Paul Davidson, make it easier to share municipal data

Forms to direct interaction with Linked Data Platform APIs - Roger Menday

(slides, report summary)

Roger: described use of REST APIs at Fujitsu

Roger: participating in LDP activity
... need to describe parameters to create resources (Progenitor)
... use case: enable robots to fill in forms
... proposed a vocab (f:parameterSet ...) to be included in an LDP container

Europeana and RDF data validation

(slides, slideshare, report summary)

Antoine: aggregates data from multiple sources (museums) and needs to enforce constraints
... described as table: property, occurrence, range
... using OWL now
... EDM is implemented as XML Schema (for RDF) with Schematron rules

<dbs> EDM = Europeana Data Model

Antoine: Also using Dublin Core Description Set
... OWL = hard, SPARQL = low-level

Thoughts on Validating RDF Healthcare Data

(slides, report summary)

<guoqian> -- Schema promiscuous: why RDF?

<aisaac> Bye folks. It was a great morning. Enjoy the rest of your day, and thx a lot for the slide moving!

dbooth: multiple schemas, multiple data sources
... ==> need multiple perspectives on validation of the same data
... wish list: build on SPARQL,
... use SPARQL UPDATE to build intermediate results (instead of one giant SPARQL query)
... check URI patterns
... must be incremental so you can do it continuously, e.g. like regression testing
... declarative is too awkward for complex rules ==> need operational (imperative): SPARQL UPDATE pipelines
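
The operational style on this wish list, where small steps build intermediate results instead of one giant query, can be illustrated with a plain-Python stand-in for a chain of SPARQL UPDATE operations. Everything named here is an assumption for the example: the `tmp:` predicates, the patient URI pattern and the step functions are hypothetical, and the pipeline works on a copy so the input data is left untouched.

```python
import re

# Hypothetical URI pattern that patient identifiers are expected to follow.
PATIENT_URI = re.compile(r"^http://example\.org/patient/\d+$")


def mark_patients(graph):
    """Step 1: record which subjects are typed as patients (intermediate result)."""
    return {(s, "tmp:isPatient", "true") for s, p, o in graph
            if p == "rdf:type" and o == "ex:Patient"}


def flag_bad_uris(graph):
    """Step 2: use the intermediate triples to flag patients with malformed URIs."""
    return {(s, "tmp:error", "bad URI pattern") for s, p, _o in graph
            if p == "tmp:isPatient" and not PATIENT_URI.match(s)}


def run_pipeline(graph, steps):
    """Apply each step in order, adding its output to a working copy of the graph."""
    working = set(graph)
    for step in steps:
        working |= step(working)
    return working


def errors(graph):
    """List the subjects that carry an error marker after the pipeline has run."""
    return sorted(s for s, p, _o in graph if p == "tmp:error")
```

Because each step only adds triples to the working copy, a later step can rely on what an earlier one derived, and the whole chain can be rerun whenever the data changes, much like a regression test.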

Validate requirements and approaches - Dave Reynolds

(slides, report summary)

DaveReynolds: currently working with UK gov: multiple vocabs, manual docs, each publisher validates their data
... need a shared validation approach: need to specify "shape" of data
... declarative rules are desirable
... understandable by "mortals"

<hsolbri> Interesting: does Reynolds's declarative requirement clash with Booth's procedural?

DaveReynolds cites W3C Datacube vocab

<DavidBooth> Harold, I think it depends on the complexity of the validation check. If it can be expressed in a simple declarative rule, then that is easiest. My point is that for more complex checks, operational is needed.

DaveReynolds: SPARQL used to express Datacube integrity constraints
... SPARQL queries hard to understand
... for irregular data, OWL is also too hard

<guoqian> need ability to validate against external services such as registries

DaveReynolds: need to specify controlled terms too

Requirements Discussion

(report summary)

Arnaud framing discussion: what do we need? What can we afford?

Harold: compare need for procedural steps versus declarative constraints
... must declarative description also be executable (for validation) e.g. by translation to SPARQL
... e.g. in many cases, the datastore content is already valid, so the missing capability is to advertise what's in a store

David: desirable to have high-level specification that is translatable to an executable language (SPARQL)

Arnaud: use the IRC queue system "q+" to get on queue

<DavidBooth> David: Want the best of both worlds: declarative when a constraint can be easily expressed that way, while allowing fall back to SPARQL when necessary. So to my mind the ideal would be declarative *within* the SPARQL framework.

<Zakim> ericP, you wanted to discuss XML Schema/RNG + schematron

Dave: SPARQL is too low level: need high-level description

Eric: uses multiple schema languages: XSD, RelaxNG, Schematron
... we'll probably have a high-level validation language that is extensible with low-level rules in SPARQL, JS, etc

<hsolbri> UML has Class, property and OCL (schematron equivalent)

Evren: SPARQL has extension points. Concern about SPARQL UPDATE since it changes data

David: didn't imply to actually change data

Tim: OWL wasn't developed for validation, SPARQL wasn't developed for validation: why not have a language without baggage

Harold: we should be informed by UML

Ashok: should split up problem, 1) state, 2) structure, 3) constraints

Arnaud: perspectives are 1) validation, 2) description

Eric: description should be translatable to SPARQL, SPIN, whatever

<Zakim> hsolbri, you wanted to say if it isn't compatible, I think we need a good justification as to why.

Eric: cites Stephan Decker proposal to translate description into SPARQL

Harold: cites project to translate UML -> Z -> SPARQL

<Zakim> evrensirin, you wanted to talk about what we can afford with sparql translation

<guoqian> hsolbri: working on translating from UML to Z to SPARQL

Evren: translation is good implementation strategy, but not for state transitions

<Zakim> ericP, you wanted to say that coverage of all triples may be tricky in SPARQL

David: use multiple graphs or datasets to describe pre/post conditions

<Zakim> labra, you wanted to talk about RDF profiles

Labra: describes work on RDF validation based on profiles
... like Schematron, using SPARQL instead of XPath

<Zakim> hsolbri, you wanted to say proposed requirement - invariants (and rules?) expressible in RDF

Harold: SPARQL not using RDF (unlike SPIN) - we should require an RDF representation

<guoqian> hsolbri:SPARQL should be able to be defined in RDF with meta data

Evren: SPIN is going to allow a literal string of SPARQL

Harold: don't want to parse another grammar

Evren: SPIN has both - RDF based and literal SPARQL string

<Zakim> ericP, you wanted to ask if the expressivity of SPIN in RDF is of operational value

Evren: what is the value of the RDF representation of SPARQL in SPIN? Is this just for query governance?

Harold: RDF is useful for impact analysis

<Zakim> DavidBooth, you wanted to say I think a main reason for the RDF-based SPIN syntax is the ability to change namespaces in the query

Steve: need to also see why validation fails

<guoqian> hsolbri: meta-repository may be an argument for RDF validation

<Zakim> DavidBooth, you wanted to say one thing I particularly like about SPIN CONSTRUCT rules is the ability to attach arbitrary data to a validation error

Harold: metadata merging is important so RDF is useful in that use case

David: SPIN CONSTRUCT rules allow attachment of other data

<SteveS> I'd like the validation results to not only provide a useful message that a tool could possibly recover from, but also the context such as the triples causing the problem and the rules that cause it (some guidance on how to become valid would be helpful)
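
A validation result that carries the kind of context Steve asks for (message, offending triples, the rule that fired, and a hint) might look something like this. The class, field and function names are hypothetical and only meant to make the request concrete.

```python
from dataclasses import dataclass, field


@dataclass
class Violation:
    """One validation failure, with enough context for a tool or person to act on it."""
    rule: str                                    # identifier of the rule that fired
    message: str                                 # human-readable explanation
    triples: list = field(default_factory=list)  # triples that provide the context
    hint: str = ""                               # optional guidance on how to become valid


def require_predicate(triples, subject, predicate, rule_id):
    """Hypothetical rule: the subject must have at least one value for the predicate."""
    about_subject = [t for t in triples if t[0] == subject]
    if any(t[1] == predicate for t in about_subject):
        return []
    return [Violation(
        rule=rule_id,
        message=f"{subject} has no {predicate}",
        triples=about_subject,
        hint=f"add a triple: {subject} {predicate} <value>",
    )]
```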

Arnaud: need to discuss what is affordable
... need to prioritize what we can do in a 2-year period
... experience shows that developing standards in charter groups can be brutal [laughs]

End

<arthur> Break for lunch courtesy of W3C

<arthur> check out this w3c spec that contains Z notation http://www.w3.org/TR/wsdl20/wsdl20-z.html

Guoqian Jiang presentation - Mayo Clinic

(slides, report summary)

<hsolbri> [at Slide 8: Architecture] Clinical Element Models converted to XML Schema, Instance data to XML then Schema to OWL and instance to RDF

Guoqian: Slide 11: Check constraints and validate
... Use SPARQL

Eric: Is SPIN generated from Schema?

Jiang: No, by hand ... perhaps in future

Guoqian: Slide 15: Reference Model picture

there is a SPARQL error on chart 16

Guoqian: Slide 16: Data values
... Slide 19: RDF Rendering of Domain Template
... using SPIN in an RDF Form
... Slide 20: Discussion Points
... RDF Validation against CIMI Models
... Challenging issues (data types, value set binding)
... XML Semantics Reuse Technology

<arthur> i don't understand XSD->OWL

<arthur> XSD = constraints, OWL = Inference

Slide 21: Picture showing Technologies and their Relationships

Overlay: BRIDGing Technology

Arthur: How can you translate XML Schema to OWL or UML to OWL?

Eric explains ... they are different but can be used in similar ways

Discussion on translation between UML and OWL, XML and OWL

scribe: constraints and reasoning are just different

<kcoyle> aadd is kcoyle

Q&A

Discussion of constraint checking vs. inference

Arnaud: Are you doing this mapping on Slide 10 or are you thinking of doing this?
... asks about validation at different levels

Harold: This is a vision ...
... MIF is an extension of UML with a higher degree of expressivity
... Effort to translate MIF to OWL [example resulting clinical data]

Simple Application-Specific Constraints for RDF Models - Shawn Simister

(slides, report summary)

ssimister: RDF Validation at Google
... we are triplifying the Web
... What approaches did we consider?
... Schematron, SchemaRama
... SPIN constraints
... nice to be able to have metadata on constraints, like for severity of violations
... OWL Integrity Constraints
... Our Solution ... path-based constraints
... What did we learn
... Most constraints are property paths. SPARQL handles the rest
... constraints describe the app, not the world it inhabits
... Constraints need to be app specific
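
To make the idea of path-based constraints more concrete, here is a minimal sketch in which a context selects the nodes to check and each constraint is a fixed sequence of predicates that must lead to at least one value. The vocabulary, function names and path notation are assumptions for illustration; this is not the syntax or implementation presented in the talk.

```python
def follow(triples, start, path):
    """Follow a sequence of predicates from start and return the set of nodes reached."""
    nodes = {start}
    for predicate in path:
        nodes = {o for s, p, o in triples if s in nodes and p == predicate}
    return nodes


def check(triples, context_type, required_paths):
    """For every node of the context type, report required paths that reach nothing."""
    focus = sorted(s for s, p, o in triples if p == "rdf:type" and o == context_type)
    return [(node, "/".join(path))
            for node in focus
            for path in required_paths
            if not follow(triples, node, path)]
```

Only required paths are checked, so data carrying extra properties the application does not use still passes.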

<arnaud:> how do the constraints get created? do you do it, does the developer?

ssimister: some of each. gmail team had their own internal software with their internal test cases, so it was easy to get them to generate stuff for us.

<guoqian> -- schema.org

<sandro:> surely an app has one set of property paths for what's needed to use the data at all, and another that it might be able to use.

ssimister: we only talk about the required stuff. for one thing, we're trying to not discourage people from providing information we don't happen to use yet.

<sandro:> It would be nice, probably, to still tell folks what data you can use if provided.

ssimister: good idea.

DBooth: Are the paths RDF property paths?

ssimister: No they are not ... very similar

Arthur: Why do you split into context and constraints when you can use a single SPARQL query?

ssimister: The design came from Schematron

<mgh> Seems like a constrained subset of the property paths that can be used in SPARQL 1.1 - not supporting *, + notation

Question about the parser

ssimister: Superset of RDF ...
... not public yet

Using SPARQL to validate Open Annotation RDF Graphs - Tim Cole

(slides, report summary)

Tim: Context: W3C Open Annotation CG
... has 102 members
... narrow and easy use case for RDF

Tim describes the OA data model

Tim: describes the OA Ontology
... LoreStore Annotation Repository
... store, search, query, display and validate annotations
... approach

Bob Morris on FilteredPush RDF Validation

Tim: rules are grouped into RuleSets. All rules in a set must be valid
... the OAD namespace has some extensions to the OA namespace
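
The grouping of rules into RuleSets, where a set passes only if every rule in it passes, could be sketched as below. In the work presented the rules are SPARQL queries; here plain Python functions stand in for them, and the two example rules and the set name are made up for illustration (only `oa:Annotation` and `oa:hasTarget` are taken from the OA vocabulary).

```python
def typed_as_annotation(triples):
    """Stand-in rule: at least one resource is typed as an oa:Annotation."""
    return any(p == "rdf:type" and o == "oa:Annotation" for _s, p, o in triples)


def has_target(triples):
    """Stand-in rule: at least one oa:hasTarget triple is present."""
    return any(p == "oa:hasTarget" for _s, p, _o in triples)


# Hypothetical rule set; a set is valid only if all of its rules hold.
RULE_SETS = {"core": [typed_as_annotation, has_target]}


def evaluate(triples, rule_sets):
    """Map each rule set name to (passed, names of failed rules)."""
    report = {}
    for name, rules in rule_sets.items():
        failed = [rule.__name__ for rule in rules if not rule(triples)]
        report[name] = (not failed, failed)
    return report
```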

[Q&A]

Tim: I was happy that most of these topics came up in the more complex cases as well

COFFEE BREAK for 15 Minutes

Requirements List

(discussion pirate pad, report summary)

The group collaborated on a PiratePad, with some extra coordination because PiratePad permits a maximum of 10 simultaneous users.

Minutes formatted by David Booth's scribe.perl version 1.138 (CVS log)
$Date: 2013/10/04 17:30:29 $
