Vocabulary Search on the Semantic Web for RDFa Default Profiles

$Date: 2013-03-01 15:54:47 $

The set of vocabulary prefixes to be included in the RDFa 1.1 Default Profile is determined by the general usage of those vocabularies on the Semantic Web. This general usage is established using search crawl data, courtesy of Sindice and of Yahoo!. This page describes the methodology used during the crawls as well as the subsequent post-processing steps.

How Was the Data Collected?

The methodology used for the Sindice and the Yahoo! cases was essentially the same, namely:

  1. A crawl of the respective search engine produced a set of URI-s from the Semantic Web.
  2. The result of the crawl was subject to a number of processing steps, namely:
    1. The vocabulary URI-s were established using some simple heuristics and, in some cases, explicit processing rules.
    2. A number of vocabularies were eliminated as unsuitable for an RDFa profile.
    3. Some common mistakes in the datasets were handled, too. For example, a missing '#' or '/' at the end of a property yields, formally, a different property URI but, in many cases, it was fairly clear that these were mistakes and could be merged with the intended URI. Another, somewhat more controversial, case is when a known vocabulary has changed its official URI at some point (e.g., Facebook’s Open Graph Protocol); in that case all data were merged under the current, official URI.
  3. The resulting set of vocabularies is ordered by the number of effective second-level domains in which each entry appears. The Public Suffix List, maintained by the community at large, was used in both the Sindice and the Yahoo! cases to identify the highest domain (i.e., the second-level domain) that is directly below a top-level domain. Using this metric rather than, for example, the number of domains or graphs avoids artefacts caused by a few very important sites publishing a large number of triples or graphs with local vocabularies that would not be appropriate for a generic RDFa profile.
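The ranking in step 3 can be illustrated with a minimal Python sketch. It is not the code used by Sindice or Yahoo!. The names `PUBLIC_SUFFIXES`, `effective_sld`, and `rank_vocabularies` are hypothetical, the suffix set is a tiny stand-in for the real Public Suffix List, and the input format (pairs of page URL and vocabulary URI) is an assumption made for the example.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Assumption: a tiny stand-in for the Public Suffix List (the real list is far larger).
PUBLIC_SUFFIXES = {"com", "org", "co.uk"}


def effective_sld(host):
    """Return the public suffix plus one label, or None if no suffix matches."""
    labels = host.lower().split(".")
    # Scanning from the left finds the longest matching suffix first.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in PUBLIC_SUFFIXES:
            if i == 0:
                return None  # the host is itself a public suffix
            return ".".join(labels[i - 1:])
    return None


def rank_vocabularies(observations):
    """Order vocabularies by the number of distinct effective second-level
    domains that use them.

    observations: iterable of (page_url, vocabulary_uri) pairs (hypothetical format).
    """
    domains = defaultdict(set)
    for page_url, vocabulary in observations:
        host = urlparse(page_url).hostname
        if host is None:
            continue
        sld = effective_sld(host)
        if sld is not None:
            domains[vocabulary].add(sld)
    # Highest domain count first.
    return sorted(((v, len(d)) for v, d in domains.items()),
                  key=lambda item: -item[1])


observations = [
    ("http://a.example.com/x", "http://ogp.me/ns#"),
    ("http://b.example.com/y", "http://ogp.me/ns#"),
    ("http://shop.example.co.uk/", "http://ogp.me/ns#"),
    ("http://big.example.org/1", "http://example.org/local#"),
    ("http://big.example.org/2", "http://example.org/local#"),
    ("http://big.example.org/3", "http://example.org/local#"),
]
ranking = rank_vocabularies(observations)
```

In this made-up input, the local vocabulary appears on three pages but on a single effective second-level domain, so it ranks below the vocabulary that is used on two distinct domains. This is the effect the metric is meant to achieve.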

The most complex and possibly controversial step is 2.2 above. Here are the categories of vocabularies that were removed from the result set:

The rules used for the Sindice and the Yahoo! cases, respectively, are available for download. The final results of the two crawls and subsequent processing are also available; see the Sindice and the Yahoo! pages for further details.

Merging the Results and Establishing the Final Profile Content

Both crawl results have a relatively natural cut-off point separating the vocabularies that should be considered for a default profile from those that should not, taking also into account that the number of default prefixes should not be very high (in the range of 10, considering the fact that the list might grow as time goes by). For Yahoo!, a cut-off value of 10 seems to be a natural choice. It is slightly less clear for the Sindice case, though; at present, the value of 12 has been used.

However, the two data sets should be considered together; an entry from one dataset that scores very low on the other should not be added. Based on this, the following algorithm is used:

  1. The S and Y cut-off points are established (as said above, these are 12 and 10, respectively).
  2. The two data sets are considered in parallel from the top; each entry is checked to see whether it appears in the other list within twice that list's cut-off value. I.e., a top entry in the Sindice list (meaning that its index is under S) should be present in the Yahoo! list with an index of at most 2*Y; similarly for the Yahoo! entries. If such an entry is found, it is added to the final list.
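The two steps above can be sketched in Python as follows. This is an illustration written from the description, not the merge script itself: the function name `merge_rankings`, its parameters, and the choice to look at the Sindice entry before the Yahoo! entry at each rank are all assumptions. Indices are zero-based here, so "an index of at most 2*Y" becomes a rank strictly below `2 * y_cut`.

```python
def merge_rankings(sindice, yahoo, s_cut=12, y_cut=10):
    """Merge two ranked lists of vocabulary URIs (best entry first).

    An entry within its own list's cut-off is kept only if it also appears
    in the other list within twice that list's cut-off.
    """
    sindice_rank = {uri: i for i, uri in enumerate(sindice)}
    yahoo_rank = {uri: i for i, uri in enumerate(yahoo)}
    merged = []
    # Walk both lists in parallel from the top.
    for i in range(max(s_cut, y_cut)):
        candidates = []
        if i < min(s_cut, len(sindice)):
            candidates.append((sindice[i], yahoo_rank, 2 * y_cut))
        if i < min(y_cut, len(yahoo)):
            candidates.append((yahoo[i], sindice_rank, 2 * s_cut))
        for uri, other_rank, limit in candidates:
            # A missing entry is treated as ranked at the limit, i.e. too low.
            if uri not in merged and other_rank.get(uri, limit) < limit:
                merged.append(uri)
    return merged


# Toy data with short placeholder names instead of real vocabulary URIs.
sindice_list = ["a", "b", "c", "d", "e", "f", "g"]
yahoo_list = ["b", "z", "a", "q", "c", "r"]
final = merge_rankings(sindice_list, yahoo_list, s_cut=3, y_cut=2)
```

In the toy data, "z" is dropped because it is absent from the Sindice list, and "c" is dropped because its Yahoo! rank falls outside twice the Yahoo! cut-off.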

This means that the number of final entries is under max(S,Y). (The Python script executing the merge is also available.) The current results are:

    Vocabulary URI                          Effective second-level domains   Effective second-level domains
                                            in the Yahoo! dataset            in the Sindice dataset
1.  http://purl.org/dc/terms/               344545                           32848
2.  http://ogp.me/ns#                       177761                           18954
3.  http://creativecommons.org/ns#          37890                            743
4.  http://xmlns.com/foaf/0.1/              2545                             3630
5.  http://rdf.data-vocabulary.org/#        6083                             845
6.  http://rdfs.org/sioc/ns#                1633                             1305
7.  http://www.w3.org/2006/vcard/ns#        1349                             559
8.  http://purl.org/goodrelations/v1#       488                              390
9.  http://purl.org/stuff/rev#              369                              73
10. http://commontag.org/ns#                272                              168
11. http://www.w3.org/2002/12/cal/icaltzd#  62                               50

This list has been included in the RDFa 1.1 Default Profile (also available in Turtle and RDF/XML). In most cases the prefixes are well known and widely used (e.g., foaf); in other cases the prefix.cc service was used to establish the default prefix (e.g., ctag for the http://commontag.org/ns# vocabulary).

Ivan Herman, ivan@w3.org, W3C, Semantic Web Activity Lead, thanks to Giovanni Tummarello, Robert Fuller, Diego Ceccarelli, and Renaud Delbru, Sindice, and Péter Mika, Yahoo!