A loose collection of SemWeb-related rules of thumb that I would propose as good practice.
Modelling a dataset in RDF
- Make sure that everything has an rdfs:label, either directly specified, or by using some property that is defined as a subproperty of rdfs:label
- Don't be overly concerned with ambiguous labels; just consider the resource in isolation. That's because labels cannot do the job of disambiguation anyway, and trying to do it results in artificial and awkward labels.
- On untyped literals, if the literal is likely to be understood only by speakers of a single language, then add a language tag. If it is likely to work for speakers of many languages, keep it without a language tag. If the file or dataset has only a few exceptions, then it is perhaps better to go for consistency and mark them the same way as the rest of the file.
- If you publish in multiple languages, then perhaps it's a good idea to include a plain literal in a “default language” without a language tag, to make SPARQLing easy.
- Avoid xsd:string. Just use a plain literal.
- For numbers, prefer xsd:decimal and xsd:integer because they are not restricted in accuracy/range.
- Avoid defining custom datatypes if you can. It is better to bake the literal semantics into properties.
- Issue: SKOS demands custom datatypes for skos:notation. Just ignore that?
- For units of measurement, prefer a pattern such as: ex:length [ ex:meter 5.21 ]
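The literal rules above can be sketched in code. The following Python snippet is only an illustration, not part of any RDF library: the helper name `to_turtle` is hypothetical, and it relies on Turtle's numeric shorthand, where a bare `5` is an xsd:integer and a bare `5.21` is an xsd:decimal. Floats are rejected on purpose, because they would map to xsd:double, which is limited in precision.

```python
from decimal import Decimal


def to_turtle(value, lang=None):
    """Render a Python value as a Turtle literal (hypothetical helper)."""
    if isinstance(value, (bool, float)):
        # Booleans are out of scope here; floats would become xsd:double.
        raise TypeError("use int, Decimal or str")
    if isinstance(value, int):
        return str(value)  # a bare integer is an xsd:integer in Turtle
    if isinstance(value, Decimal):
        if not value.is_finite():
            raise ValueError("xsd:decimal has no NaN or infinity")
        text = format(value, "f")  # fixed-point, never exponent notation
        # Turtle's decimal shorthand needs digits after the dot.
        return text if "." in text else text + ".0"
    if isinstance(value, str):
        escaped = (value.replace("\\", "\\\\").replace('"', '\\"')
                   .replace("\n", "\\n").replace("\r", "\\r"))
        # Plain literal unless a language tag is given; never xsd:string.
        return f'"{escaped}"@{lang}' if lang else f'"{escaped}"'
    raise TypeError(f"unsupported type: {type(value).__name__}")


print(to_turtle("Katze", lang="de"))  # likely understood by German speakers only
print(to_turtle("Berlin"))            # works across languages, so no tag
print(f"ex:length [ ex:meter {to_turtle(Decimal('5.21'))} ] .")
```

The last line reproduces the units-of-measurement pattern, with the unit carried by the (made-up) `ex:meter` property rather than by a custom datatype.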
Designing vocabularies and ontologies
@@@ Interesting advice from TimBL: http://lists.w3.org/Archives/Public/public-lod/2011Apr/0282.html
Naming of properties
- Properties that point to documents (information resources) should have names that announce this fact, e.g. userProfile, userPage, userList, eventRecord, eventForm
- Relationship nouns make good property names, e.g. “parent” is better than “hasParent” or “isParentOf” (as per TimBL)
Focus on one problem
- [from an email to rdf-schema-dev on 2008-04-05] Some random half-formed thoughts: It's good if the vocabulary covers all my needs for a given problem. It's good if the vocabulary doesn't contain much extra stuff that I don't need to solve my problem. It's good if the purpose and coverage of the vocabulary can be conveyed in a short term or phrase (e.g. “document metadata” or “issue tracking”). It's good if the level of abstraction is consistent throughout the vocabulary, e.g. don't mix high-level concepts like Service and Container into your down-to-earth photo annotation vocabulary.
Provide excellent documentation
- [from an email to rdf-schema-dev on 2008-04-05] Random thoughts again: Some introductory narrative. A bunch of good examples for typical usages of the vocabulary. A UML-style overview diagram if the vocab has more than a few classes. Some tutorial-style text. An excellent reference section with all terms and notes about how they are supposed to be used (including notes on what they are NOT supposed to be used for).
From danbri in an email to the DC Architecture list, 30 March 2010 09:40:47 IST:
My current preference / advice (for new work) is for the managers of each serious namespace to invest in a distinct domain name for it, and for us as a community to come up with social machinery for 'watching each other's backs' to ensure that the domains are kept in good working order, fees are paid, etc. Sometimes an additional level of indirection can add as much risk as it saves. Initially when I bought xmlns.com I have idea it could be a home for lots of namespaces, and then the more I thought about it, the less I liked that idea. Each new namespace added to the bucket brings some risk to the others using the domain, by adding to the complexity and burden for subsequent maintainers. So I think a proliferation of independent domain names, while painful in its own way, spreads the risk...
Later in the thread:
Rule of thumb - when wondering what info to include in a namespace URI, ... try to leave *out* as much as possible
And even more concise:
First rule of namespace URI design "you're more likely to regret things you included, than things you omitted".
For hash namespaces, the RDF document containing the vocabulary should be typed as owl:Ontology and should be the target of any rdfs:isDefinedBy statements. Note that the RDF document's URI is the namespace URI without the trailing hash.
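A minimal sketch of the relationship between term URI, namespace URI and document URI, using Python's standard `urllib.parse.urldefrag`. The `http://example.org/vocab#` namespace and both function names are made up for illustration.

```python
from urllib.parse import urldefrag


def ontology_document_uri(uri):
    """Return the document URI for a hash-namespace term or namespace URI."""
    if "#" not in uri:
        raise ValueError("not a hash URI: " + uri)
    # urldefrag splits off everything from the '#' onwards.
    return urldefrag(uri).url


def is_defined_by_triple(term_uri):
    """Build an N-Triples rdfs:isDefinedBy statement pointing at the document."""
    rdfs = "http://www.w3.org/2000/01/rdf-schema#"
    doc = ontology_document_uri(term_uri)
    return f"<{term_uri}> <{rdfs}isDefinedBy> <{doc}> ."


print(ontology_document_uri("http://example.org/vocab#"))
print(is_defined_by_triple("http://example.org/vocab#Person"))
```

Both the namespace URI and any term in it map to the same document URI, which is the resource that would carry the owl:Ontology type.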
Publishing RDF on the Web
Metadata in RDF documents
- Every RDF document has some relation to the thing(s) it talks about. It is useful to explicitly state what that relation is. For example, in my FOAF file I state that I'm both the foaf:maker and the foaf:primaryTopic of the file. Other useful properties are: rdfs:isDefinedBy, foaf:topic. A concrete benefit is that consumers can pick out the “important” things from the graph.
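To illustrate the consumer-side benefit, here is a rough Python sketch that treats a graph as a plain set of (subject, predicate, object) tuples and picks out the primary topic of a document. The document and person URIs are hypothetical, and a real consumer would use an RDF library rather than string tuples.

```python
FOAF = "http://xmlns.com/foaf/0.1/"

# Hypothetical document and resource URIs, for illustration only.
DOC = "http://example.org/people/alice.rdf"
ALICE = "http://example.org/people/alice.rdf#me"
BOB = "http://example.org/people/bob.rdf#me"

triples = {
    (DOC, FOAF + "primaryTopic", ALICE),
    (DOC, FOAF + "maker", ALICE),
    (ALICE, FOAF + "knows", BOB),
    (DOC, FOAF + "topic", BOB),
}


def primary_topics(graph, document_uri):
    """Return the resources the document declares as its primary topic."""
    return sorted(o for s, p, o in graph
                  if s == document_uri and p == FOAF + "primaryTopic")


print(primary_topics(triples, DOC))
```

Without the foaf:primaryTopic triple, a consumer would have to guess whether the file is mainly about Alice or about Bob.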
HTML descriptions of URI-denoted things
- To build trust in the stability, reliability and availability of a URI, its HTML description should explicitly state the URI, it should contain an explicit Statement of Purpose to the effect that the URI is intended to be used as an identifier for the thing, and it should contain a Publisher Identification. It must provide sufficient information to enable human users to know exactly what is being referred to. (This is inspired by Published Subjects.)
- The benefit of content negotiation (CN) is that all URIs also work in a standard Web browser, not just in RDF-enabled tools and browsers. Thus it's great for authoring and debugging and when your URIs are exposed to a lot of neophytes (e.g. DBpedia, FOAF, DC). On the other hand, content negotiation is very hard to get right, the devil is in the details and it has turned out to be quite an interop hassle in practice. So, CN should be thought of as icing on the cake, but not a requirement for publishing RDF.
- Rule of thumb: If your server solution does CN, then do CN. Otherwise, getting it right will be too much effort.
- Keep in mind the advice from On Linking Alternative Representations: Provide links between different variants to make them all accessible. This means that if HTML can be returned in response to RDF/HTML negotiation, there should be an RDF icon nearby that points to the RDF variant.
- Blank nodes should be avoided in general. Using a blank node is appropriate if the publisher thinks that no one should ever care about this resource except in the context of looking at another, identified, resource in the same RDF document, e.g. a geo:Point that exists solely to give the location of another resource. Another situation where a blank node would be appropriate is when used as an existential variable, but I've never seen them used that way in a Linked Data context.
- Have rdfs:label (or a subproperty thereof) on everything, always
- Have rdfs:label for the document URI, always
- Have a foaf:primaryTopic triple connecting document URI and main resource, always
- Have as much dc: metadata as possible on the document URI
- Think hard about possible external links, to other web pages and other RDF documents and entities. Provide as many as possible. These make all the difference.
- RDF URIs are always case sensitive, while from HTTP's point of view, some parts of the URI can change without changing any behaviour. So, be clear about the case of your URIs and stick to it once a decision is made. If in doubt, use as much lowercase as possible. (Story: L3S changed case of the domain name in their URIs, broke a Semantic Web Pipes demo.)
- From Harry Halpin: “If there is a URI that is used to identify a resource one would want to make logical statements about, and these statements do not apply to possible representations of that resource, then one should use the "hash" or 303 redirection to separate these URIs.”
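The case-sensitivity rule of thumb above can be illustrated with a small Python sketch. It contrasts RDF's character-by-character URI comparison with a simplified view of what HTTP treats as equivalent (scheme and host are case-insensitive, the path is not). The function names and example.org URIs are invented for this illustration, and the normalisation is deliberately incomplete.

```python
from urllib.parse import urlsplit


def same_for_rdf(a, b):
    """RDF compares URIs character by character."""
    return a == b


def same_for_http(a, b):
    """Compare URIs after lowercasing scheme and host.

    Simplification: assumes the authority part carries no userinfo.
    """
    def normalise(uri):
        parts = urlsplit(uri)
        return parts._replace(scheme=parts.scheme.lower(),
                              netloc=parts.netloc.lower()).geturl()
    return normalise(a) == normalise(b)


old = "http://Example.org/data/x"
new = "http://example.org/data/x"
print(same_for_http(old, new))  # both dereference to the same place
print(same_for_rdf(old, new))   # but they are two different RDF identifiers
```

Changing only the case of the host name keeps the links resolvable, yet every triple that used the old spelling now talks about a different resource.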