Using RDF for social information management

Axel Rauschmayer, LMU Munich, hypergraphs.de
W3C Workshop “RDF Next Steps”

1. Introduction

Social information management (SIM) is about sharing and collaboratively editing heterogeneous information. Example applications include wikis, Facebook, Flickr and Google Calendar. RDF is ideally suited to managing this kind of heterogeneous information, but as of now it is rarely the foundation of SIM applications (RDFa is becoming more popular and might become a foothold for RDF). Most of SIM’s requirements are simple and use little of what RDF is capable of. Yet other requirements, while also simple, are surprisingly difficult to fulfill with RDF. This paper describes the latter kind of requirements and also highlights RDF features that complicate some SIM tasks.

2. Limits of RDF

The following sections describe challenges one faces when using RDF for social information management.

2.1. Reification

If many users contribute data, provenance becomes important: for each statement, one should record who created it and when. Collaborative tagging is an example where provenance is essential, because several people might contribute the same tag. The most prevalent current solution is to introduce an intermediate resource denoting the tagging instance, but that seems clumsy. An alternative would be RDF reification, which has its own set of problems. Topic Maps solve this problem well by allowing associations (statements) to be turned into topics (resources) seamlessly.
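For concreteness, here is a minimal Turtle sketch of the intermediate-resource pattern mentioned above; all terms in the ex: namespace (and the sample date) are made up for this example.
    @prefix ex:  <http://example.org/ns#> .
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

    # Instead of the plain statement "ex:photo1 ex:tag ex:JavaScript",
    # an intermediate tagging resource carries the provenance:
    ex:tagging1
        a                 ex:Tagging ;
        ex:taggedResource ex:photo1 ;
        ex:tag            ex:JavaScript ;
        ex:taggedBy       ex:alice ;
        ex:taggedOn       "2010-05-01T10:00:00Z"^^xsd:dateTime .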

2.2. Inference

For SIM, inference mainly has to be fast (as in “dynamic”). It only needs a small subset of the inferences that OWL is capable of: class hierarchies, sub-properties (to declare label predicates) and equivalences (when using URIs for public entities). Therefore, if SPARQL gets path queries, the need for most inferences is obviated. An additional challenge is to enable end users to find out what led to a given inference.
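Regarding path queries: assuming property paths along the lines of the SPARQL 1.1 drafts, sub-class inferences could be replaced by queries such as the following (standard prefixes assumed; ex:Person is a hypothetical class).
    # All resources that are direct or transitive instances of ex:Person.
    SELECT ?s
    WHERE {
        ?s rdf:type/rdfs:subClassOf* ex:Person .
    }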

An example of a non-standard inference needed by SIM is tag inference: for example, inferring the tag “programming language” wherever the tag “JavaScript” appears.
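Sketched as a SPARQL CONSTRUCT query (the ex: property and tag resources are hypothetical), such a rule might read:
    # Wherever something is tagged "JavaScript", additionally derive
    # the tag "programming language".
    CONSTRUCT { ?res ex:tag ex:programmingLanguage }
    WHERE     { ?res ex:tag ex:javaScript }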

2.3. Querying

The most common task when displaying repository contents to end users is to list resources. Alas, the standard RDF query language SPARQL is not helpful in this regard. For example, retrieving resources together with their (possibly several) labels and types requires queries such as the following, plus manual grouping of the result rows by subject.
    SELECT ?s ?p ?o
    WHERE {
        ?s ?p ?o .
        FILTER ( ?p = rdf:type || ?p = rdfs:label )
    }
An even more complicated scenario is to display a table with several columns and order it by one of the columns. Currently, the best way to do this is to query only for the subject and the column value and to let SPARQL do the sorting. Assuming a paged display (similar to Google’s search results), this allows one to quickly determine which resources should be displayed on the current page. The actual content of each table row can then be assembled manually; thanks to the preprocessing step, this has relatively little impact on the overall performance. The preprocessing query looks as follows.
    SELECT ?subj ?colValue ?colValueLabel
    WHERE {
        ?subj ex:colPred ?colValue .
        OPTIONAL { ?colValue rdfs:label ?colValueLabel }
    }
    ORDER BY ?colValueLabel ?colValue
Note that this approach also works for literal column values. ?colValue is included to make the result order deterministic and is needed for some of the improvements below. The approach has several drawbacks:
  1. Time data is not sorted properly, because literals are grouped by datatype, with each group being sorted lexicographically. For example, the query above would first list all xsd:date literals and then all xsd:dateTime literals, even if the first xsd:dateTime chronologically came before the last xsd:date. The solution is to use the str() function to normalize all literals to plain literals. (The second use of str() is more important than the first one.)
    ORDER BY str(?colValueLabel) str(?colValue)
  2. ?colValueLabel is sorted case-sensitively: labels that start with an uppercase letter come before labels that start with a lowercase letter. Solution: SPARQL can be extended with custom functions, for example the XPath functions. Thus, one would order by fn:lower-case(str(?colValueLabel)). ?colValue would also profit from case-insensitive ordering (see the sketch after this list).
  3. Resources without a label come before resources with a label, because unbound values come before bound values when sorting. This can be fixed by manually concatenating the results of two queries: the first one demands that the label be bound, the second one that it be unbound.
  4. Unlabeled blank nodes come before unlabeled URIs. This can be fixed similarly to above.
  5. If one uses the local name of an unlabeled URI as a derived label, this derived label should be sorted among the explicitly assigned labels. Alas, neither mixing derived and explicit labels nor extracting the local name is currently possible with SPARQL.
  6. Multiple column values lead to multiple table rows. There is no easy fix. On the flip side, this allows one to sort columns with multiple values in a straightforward manner.
  7. Only a single label predicate is supported. Multiple label predicates can be enabled by filtering the predicate with a disjunction enumerating all known label predicates. Alternatively, one can assign a marker class to label predicates and query for direct instances of that class.
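Putting improvements 1, 2 and 7 together, the preprocessing query might look as follows. This is only a sketch: fn:lower-case denotes the XPath function and its availability as a SPARQL extension function depends on the engine, and ex:name stands for an additional, hypothetical label predicate.
    SELECT ?subj ?colValue ?colValueLabel
    WHERE {
        ?subj ex:colPred ?colValue .
        OPTIONAL {
            ?colValue ?labelPred ?colValueLabel .
            # Accept any of the known label predicates (improvement 7).
            FILTER ( ?labelPred = rdfs:label || ?labelPred = ex:name )
        }
    }
    # Normalize to plain literals and ignore case (improvements 1 and 2).
    ORDER BY fn:lower-case(str(?colValueLabel)) fn:lower-case(str(?colValue))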

Displaying the results on several pages also introduces a slight inconvenience: one has to iterate over all search results just to determine the total number of pages. Solving all of these challenges elegantly is difficult; a first step might be to support result rows that are not in first normal form.
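For the page count specifically, an aggregate query would help; a sketch assuming COUNT support along the lines of the SPARQL 1.1 drafts:
    # Total number of table rows; the number of pages follows from it.
    SELECT (COUNT(?subj) AS ?rowCount)
    WHERE { ?subj ex:colPred ?colValue }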

2.4. Labeling

Configurable labeling is well supported via RDFS inferencing, by making each label predicate a subproperty of rdfs:label. If there is no label, one can use the local name instead. One might decide against inferencing for performance or usability reasons; then labeling becomes more complicated (see above).
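A minimal Turtle sketch of such a sub-property declaration, using foaf:name as an example label predicate:
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix foaf: <http://xmlns.com/foaf/0.1/> .

    # With RDFS inferencing, every foaf:name triple also yields an
    # rdfs:label triple, so displaying labels only needs rdfs:label.
    foaf:name rdfs:subPropertyOf rdfs:label .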

Nice-looking input of URIs, on the other hand, is supported by namespace declarations. These declarations are usually stored per repository, which eliminates some of the composability one achieves with named graphs. An alternative would be to declare namespaces in RDF, inside the graphs where they are used; this preserves composability.
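A sketch of such an in-graph declaration; the existing VANN vocabulary is used here for illustration, though whether it fits this purpose exactly is left open.
    @prefix vann: <http://purl.org/vocab/vann/> .

    # The graph itself records which prefix abbreviates which namespace.
    <http://example.org/ns#>
        vann:preferredNamespacePrefix "ex" ;
        vann:preferredNamespaceUri    "http://example.org/ns#" .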

2.5. Collections versus containers

The two ways of storing ordered multi-sets of RDF nodes, collections and containers, are confusing, especially for newcomers; there should be a clear explanation of when to use which. For example, the standard argument that collections can be closed while containers cannot is not convincing: one could easily add rdf:nil as a terminating container element. Querying collections and containers should also be better supported by SPARQL: both returning a collection (or container) as a single value and finding out which collection an element is part of would be useful. Currently, one always ends up parsing a collection (or container) manually. A related question is whether there should be a dedicated type of RDF node for ordered multi-sets.
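For example, assuming property paths (as in the SPARQL 1.1 drafts), at least the members of a collection could be retrieved without manual parsing; ex:list and ex:someResource are hypothetical names, and note that the order of the members is lost along the way.
    SELECT ?member
    WHERE {
        ex:someResource ex:list ?collection .
        # Walk the linked list formed by rdf:rest/rdf:first.
        ?collection rdf:rest*/rdf:first ?member .
    }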

2.6. Blank nodes

The unstable IDs of blank nodes complicate many tasks such as synchronizing RDF repositories or referring to resources in SPARQL. Is there a better solution? Can blank nodes have stable IDs? Can they be replaced by automatically generated URIs?
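A Turtle sketch of the last option; the generated URI and its namespace are made up for illustration.
    @prefix ex: <http://example.org/ns#> .

    # Before: a blank node whose ID _:b17 is stable only within one repository.
    ex:photo1 ex:depicts _:b17 .

    # After: the blank node is replaced by an automatically generated URI.
    ex:photo1 ex:depicts <http://example.org/genid/c6a1f2> .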

3. Conclusion

This paper outlined some of the problems one faces when using RDF for simple tasks related to social information management. As this domain is growing rapidly and RDF is ideally suited for it, it is important to solve these problems better. The most immediate wishes of the author are: