Using RDF for social information management
Axel Rauschmayer, LMU Munich, hypergraphs.de
W3C Workshop “RDF Next Steps”
1. Introduction
Social information management (SIM) is about sharing and
collaboratively editing heterogeneous information. Example applications
include
wikis, Facebook, Flickr and Google Calendar. RDF is ideally suited to
managing this kind of heterogeneous information, but as of now it is
rarely the foundation of SIM applications (RDFa is becoming more
popular and might become a foothold for RDF). Most of
SIM’s requirements are simple and use little of what RDF is capable of.
Yet other requirements, while also simple, are surprisingly difficult
to fulfill with RDF. This paper describes the latter kind of
requirement and highlights RDF features that complicate some SIM tasks.
2. Limits of RDF
The following sections describe
challenges one faces when using RDF for social information management.
2.1. Reification
If many users contribute data, provenance becomes important: For each
statement, one should record who created it and
when. Collaborative tagging is an example where provenance is
essential: Several people might
contribute the same tag. The most prevalent current solution is to
introduce an intermediate resource denoting the tagging instance, but
that seems clumsy. An alternative would be RDF reification, which has
its own set of problems. Topic Maps solve this problem well by allowing
associations (statements) to be seamlessly turned into topics
(resources).
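For illustration, the intermediate-resource pattern can be sketched
with plain Python tuples standing in for triples; the ex: vocabulary
and all resource names below are invented:

```python
# Intermediate-resource ("tagging instance") pattern: the tagging itself
# becomes a resource, so author and creation date can be attached to it.
# All ex: names are hypothetical.
triples = [
    ("ex:tagging1", "rdf:type",      "ex:Tagging"),
    ("ex:tagging1", "ex:taggedItem", "ex:photo42"),
    ("ex:tagging1", "ex:tag",        "ex:JavaScript"),
    ("ex:tagging1", "ex:author",     "ex:alice"),
    ("ex:tagging1", "ex:created",    "2010-06-01"),
]

def tags_of(resource, triples):
    """Collect the tags attached to a resource via tagging instances."""
    taggings = {s for (s, p, o) in triples
                if p == "ex:taggedItem" and o == resource}
    return {o for (s, p, o) in triples
            if s in taggings and p == "ex:tag"}
```

Five triples for what is conceptually a single (attributed) statement
illustrate why the pattern feels clumsy.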
2.2. Inference
For SIM, inference mainly has to be fast (as in “dynamic”). It only
needs a small subset of the inferences that OWL is capable of: class
hierarchies, sub-properties (to declare label predicates) and
equivalences (when using URIs for public entities). Therefore, if
SPARQL gets path queries, the need for most inferences is obviated. An
additional challenge is to
enable end users to find out what led to a given inference.
An example of a non-standard inference that is needed by SIM is to
infer tags. For example, to infer the tag “programming language”
wherever the tag “JavaScript” appears.
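This kind of tag inference is a simple transitive closure. A minimal
sketch, assuming a hypothetical broader-tag relation:

```python
# Hypothetical broader-tag relation: each tag implies its broader tags.
broader = {
    "JavaScript": {"programming language"},
    "programming language": {"computing"},
}

def infer_tags(tags):
    """Return the given tags plus all transitively implied broader tags."""
    result, todo = set(tags), list(tags)
    while todo:
        for t in broader.get(todo.pop(), ()):
            if t not in result:
                result.add(t)
                todo.append(t)
    return result
```

Recording which broader-tag edges fired would also answer the
explainability question raised above.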
2.3. Querying
The most common task when displaying repository contents to end users
is to list resources. Alas, the standard RDF query language SPARQL is
not helpful in this regard. For example, retrieving resources with
several labels and types requires manual grouping and queries such as
the following.
SELECT ?s ?p ?o WHERE {
    ?s ?p ?o .
    FILTER ( ?p = rdf:type || ?p = rdfs:label )
}
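The query returns one row per matching triple, so assembling one record
per resource is left to the application. A sketch of the manual
grouping step (result rows and names are made up):

```python
from collections import defaultdict

# Result rows of the query above: one (subject, predicate, object)
# row per matching triple. All names are hypothetical.
rows = [
    ("ex:doc1", "rdf:type",   "ex:Document"),
    ("ex:doc1", "rdfs:label", "Draft"),
    ("ex:doc1", "rdfs:label", "Entwurf"),
    ("ex:doc2", "rdf:type",   "ex:Image"),
]

def group_by_subject(rows):
    """Turn triple-shaped result rows into one record per resource."""
    resources = defaultdict(lambda: {"types": [], "labels": []})
    for s, p, o in rows:
        key = "types" if p == "rdf:type" else "labels"
        resources[s][key].append(o)
    return dict(resources)
```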
An even more
complicated scenario is to display a table with several columns and
order it by one of the columns. Currently, the best way to do this is
to query for the subject and the column value, and let SPARQL do the
sorting. Assuming a paged display (similar to Google’s search results),
this allows one to quickly determine what resources should be displayed
on the current page. The actual content of each table row can then be
assembled manually. Due to the preprocessing step, this has relatively
little impact on the overall performance. The preprocessing query looks
as follows.
SELECT ?subj ?colValue ?colValueLabel
WHERE {
    ?subj ex:colPred ?colValue .
    OPTIONAL { ?colValue rdfs:label ?colValueLabel }
}
ORDER BY ?colValueLabel ?colValue
Note that this approach also works for literal column values.
?colValue is included to make the result order deterministic and is
needed for some of the improvements below. The approach has several
drawbacks:
- Time data is not sorted properly, because literals are grouped by
datatype, with each group being sorted lexicographically. For example,
the query above would first list all xsd:date literals and then all xsd:dateTime literals,
even if the first xsd:dateTime
chronologically
came before the last xsd:date.
The solution is to use the str() function to normalize all literals to
plain literals (the second use of str() is more important than the
first one):
ORDER BY str(?colValueLabel) str(?colValue)
- ?colValueLabel is sorted case-sensitively: labels that start with an
uppercase letter come before labels that start with a lowercase letter.
Solution: SPARQL can be extended with custom functions, for example the
XPath functions. Thus, one would order by
fn:lower-case(str(?colValueLabel)). ?colValue would also profit from
case-insensitive ordering.
- Resources without a label come before resources with a label, because
unbound values sort before bound ones. This can be fixed by manually
concatenating the results of two queries: the first one demands that
the label be bound, the second one demands that it be unbound.
- Unlabeled blank nodes come before unlabeled URIs. This can be
fixed similarly to above.
- If one uses the local name of an
unlabeled URI to derive a label, this derived label should appear among
the explicitly assigned labels. Alas, neither the mixing of labels nor
the extraction of the local name is currently possible with SPARQL.
- Multiple column values lead to multiple table rows. There is no
easy fix. On the flip side, this allows one to sort columns with
multiple
values in a straightforward manner.
- Only
a single label predicate is supported. Multiple label predicates can be
enabled by filtering the predicate with a disjunction enumerating
all known label predicates. Alternatively, one can assign a marker
class to label predicates and query for direct instances of that class.
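Taken together, the fixes above amount to a custom sort key: normalize
to strings, compare case-insensitively, and put labeled resources
before unlabeled URIs and those before unlabeled blank nodes. A Python
sketch over (subject, label) pairs, with None standing for an unbound
label:

```python
def sort_key(row):
    """Emulates ORDER BY fn:lower-case(str(?label)), with the unbound
    cases moved to the end (unlabeled URIs before blank nodes)."""
    subject, label = row
    if label is not None:
        return (0, label.lower(), subject)   # labeled resources first
    if not subject.startswith("_:"):
        return (1, subject)                  # then unlabeled URIs
    return (2, subject)                      # then unlabeled blank nodes

rows = [("_:b1", None), ("ex:a", "zebra"), ("ex:b", None), ("ex:c", "Apple")]
rows.sort(key=sort_key)
```

In pure SPARQL, the same effect needs the concatenated queries
described above; the single key function suffices here only because the
ordering happens client-side.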
Displaying the results on several pages also introduces a slight
inconvenience: one has to finish iterating over the search results to
determine the complete number of pages. Solving all of these challenges
elegantly is difficult; a first step might be to support result rows
that are not in first normal form.
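The paging inconvenience can be made concrete: the total page count
only becomes available once all result rows have been counted. A
sketch:

```python
import math

def paged(rows, page, page_size=10):
    """Return one page of results plus the total page count. Computing
    the count requires counting (here: materializing) all rows first."""
    total_pages = max(1, math.ceil(len(rows) / page_size))
    start = page * page_size
    return rows[start:start + page_size], total_pages
```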
2.4. Labeling
Configurable labeling is well supported via RDFS inferencing, by making
a label predicate a subproperty of rdfs:label. If there is no
label, one can use the local name instead. One might decide against
inferencing for performance or usability reasons; then labeling
becomes more complicated (see above).
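The fallback just described (use an explicit label if one exists,
otherwise the local name of the URI) can be sketched as follows; the
example URIs are invented:

```python
def local_name(uri):
    """Derive a display label from a URI's local name: the part after
    the last '#' or, failing that, the last '/'."""
    for sep in ("#", "/"):
        if sep in uri:
            return uri.rsplit(sep, 1)[1] or uri
    return uri

def display_label(uri, labels):
    """Prefer an explicit rdfs:label; fall back to the local name."""
    return labels.get(uri) or local_name(uri)
```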
On the other hand, nice-looking input of URIs is supported by namespace
declarations. These declarations are usually stored per repository,
which eliminates some of the composability one achieves with named
graphs. An alternative would be to declare namespaces in RDF, inside
the graphs where they are used; this preserves composability.
2.5. Collections versus containers
The two different ways of storing ordered multi-sets of RDF nodes are
confusing, especially for newcomers. There should be a clear
explanation of when to use which. For example, the standard argument
that collections can be closed while containers cannot is not
convincing: one could easily add rdf:nil as a terminating container
element. Querying collections and containers should also be
better supported by SPARQL: Both returning a collection (or container)
as a single value and finding what collection an element is a part of
would be useful. Currently, one always ends up manually parsing a
collection (or container). A related question is whether
there should be a dedicated type of RDF node for ordered multi-sets.
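The manual parsing mentioned above means following the
rdf:first/rdf:rest chain oneself. A sketch over triples represented as
tuples:

```python
def read_collection(head, triples):
    """Manually parse an RDF collection by walking its
    rdf:first/rdf:rest chain until rdf:nil."""
    index = {(s, p): o for (s, p, o) in triples}
    items = []
    while head != "rdf:nil":
        items.append(index[(head, "rdf:first")])
        head = index[(head, "rdf:rest")]
    return items

# A two-element collection (ex:a ex:b), encoded as four triples.
triples = [
    ("_:l1", "rdf:first", "ex:a"), ("_:l1", "rdf:rest", "_:l2"),
    ("_:l2", "rdf:first", "ex:b"), ("_:l2", "rdf:rest", "rdf:nil"),
]
```

Four triples and a loop for a two-element list show why direct SPARQL
support would help.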
2.6. Blank nodes
The unstable IDs of blank nodes complicate many tasks such as
synchronizing RDF repositories or referring to resources in SPARQL. Is
there a
better solution? Can blank nodes have stable IDs? Can they be replaced
by automatically generated URIs?
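One conceivable answer to the last question is skolemization: replacing
each blank node with a deterministically generated URI. A sketch, where
the URI shape is an invented convention, not a standard:

```python
import hashlib

def skolemize(blank_id, graph_uri):
    """Replace a blank node ID with a stable, globally unique URI
    derived from the containing graph's URI. The '/genid/' path
    segment is an assumption made for this sketch."""
    digest = hashlib.sha1(f"{graph_uri}|{blank_id}".encode()).hexdigest()
    return f"{graph_uri}/genid/{digest}"
```

Because the result depends only on the graph URI and the local blank
node ID, two repositories synchronizing the same graph would agree on
the generated URIs.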
3. Conclusion
This paper outlined some of the problems one faces when using RDF for
simple tasks related to social information management. As this domain
experiences rapid growth and as RDF is ideally suited for it, it is
important to better solve these problems. The most immediate wishes of
the author are:
- Improvements for SPARQL (paths, non-first-normal-form results, etc.;
see 2.3).
- Dedicated nodes for sequences/lists and support for them in
SPARQL.
- Stable blank node IDs or the abolition of blank nodes.
- The ability to turn statements into true resources (similar to
Topic Maps).