Vocabulary Search on the Semantic Web for RDFa Default Profiles
$Date: 2013/03/01 15:54:47 $
The set of vocabulary prefixes to be included in the RDFa
1.1 Default Profile is defined based on the general usage of those
vocabularies on the Semantic Web. This general usage is established using search
crawl data, courtesy of Sindice and of Yahoo!.
This page describes the methodology used during crawls as well as the possible
post-processing steps.
How Was the Data Collected?
The methodology used for the Sindice and the Yahoo! cases was essentially the
same, namely:
- A crawl of the respective search engine produced a set of URI-s from the
Semantic Web.
- In the case of Sindice, the crawl was on the Semantic Web, yielding around
10B triples.
- In the case of Yahoo!, the original generic crawl size was around 12B pages,
with 431M documents using RDF (excluding trivial RDFa markup, i.e., pages
containing triples in the xhtml namespace only). The measurement results are
based on the RDF extracted from these RDFa pages.
- The result of the crawl was subject to a number of processing steps, namely:
- Using some simple heuristics and, in some cases, explicit processing rules,
the vocabulary URI-s were established.
- A number of vocabularies were eliminated as unsuitable for an RDFa profile.
- Some common mistakes in the datasets were handled, too. For example, a
missing '#' or '/' at the end of a property URI formally yields a different
property but, in many cases, it was fairly clear that these were mistakes
and the data could be merged with the intended URI. Another, somewhat more
controversial, case is when a known vocabulary changed its official URI at
some point (e.g., Facebook's Open Graph Protocol); in such cases, all data
were merged under the current, official URI.
- The resulting set of vocabularies is ordered using the effective second-level
domains for each entry. The Public
Suffix List, maintained by the community at large, was used in both the
Sindice and the Yahoo! cases to identify the highest domain (i.e., the second-level
domain) directly below a top-level domain. Using this metric rather than,
for example, the number of domains or graphs avoided artefacts of a few very
important sites publishing a large number of triples or graphs with local
vocabularies that would not be appropriate for a generic RDFa profile.
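The eSLD-based ranking described above can be sketched in Python. This is an illustrative sketch only: the tiny suffix set stands in for the full Public Suffix List, and the `(vocabulary, page)` data shape is my own assumption about the crawl output, not the actual processing rules.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Stand-in for the full Public Suffix List used by the actual pipelines.
PUBLIC_SUFFIXES = {"com", "org", "net", "co.uk"}

def effective_sld(page_uri):
    """Return the effective second-level domain of a page URI: the
    highest domain directly below a known public suffix."""
    host = urlsplit(page_uri).hostname or ""
    labels = host.split(".")
    for i in range(len(labels) - 1):
        if ".".join(labels[i + 1:]) in PUBLIC_SUFFIXES:
            return ".".join(labels[i:])
    return host

def rank_vocabularies(usage):
    """usage: iterable of (vocabulary_uri, publishing_page_uri) pairs.
    Score each vocabulary by the number of distinct eSLDs using it, and
    return the vocabularies in decreasing order of that score."""
    slds = defaultdict(set)
    for vocab, page in usage:
        slds[vocab].add(effective_sld(page))
    return sorted(slds, key=lambda v: len(slds[v]), reverse=True)
```

Counting distinct eSLDs, rather than raw triples or graphs, is what keeps a single prolific site from dominating the ranking.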
The most complex, and possibly controversial, step is the vocabulary
elimination (step 2.2 above). Here are the
categories of vocabularies that were removed from the result set:
- Vocabularies defined through a W3C Recommendation or Working/Interest Group Note
(those are part of a default profile “ex officio”)
- Vocabularies whose URI-s are not publicly dereferenceable, or that do not resolve
to proper documentation (at the minimum, a commented RDF file)
- Vocabularies marked as “draft”, “experimental”
- Vocabularies used for a very specific and specialized purpose (e.g., major
ontologies used in medical or drug discovery applications). Note that this is not
a judgement on the quality or the usefulness of that vocabulary, simply a
reflection of the fact that the vocabulary should not be part of an RDFa profile
of general use.
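As a rough, hypothetical sketch of this elimination step (the metadata field names and predicates below are my own illustration, not the actual rules used):

```python
def keep_for_profile(vocab):
    """vocab: a dict of illustrative metadata gathered per vocabulary.
    Returns True if the vocabulary survives the elimination step."""
    if vocab.get("w3c_rec_track"):
        return False   # already in the default profile "ex officio"
    if not vocab.get("dereferenceable"):
        return False   # URI must resolve to proper documentation
    if vocab.get("status") in ("draft", "experimental"):
        return False
    if vocab.get("domain_specific"):
        return False   # e.g., specialized medical or drug-discovery ontologies
    return True
```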
The rules used for the Sindice and the Yahoo! cases, respectively, are available for
download. The final results of the two crawls and subsequent processing are also
available; see the Sindice and the Yahoo!
pages for further details.
Merging the results and establishing the final profile content
Both crawl results have a relatively natural cut-off point for the vocabularies
that should or should not be considered for a default profile, taking into
account that the number of default prefixes should not be very high (in the range of
10, considering that the list might grow as time goes by). For Yahoo!, the
cut value of 10 seems to be a natural choice. It is slightly less clear for the
Sindice case, though; at present, the value of 12 has been used.
However, the two data sets should be considered together; an entry from one dataset
that scores very low on the other should not be added. Based on this, the following
algorithm is used:
- The S and Y cut-off points are established (as said above, this is 12 and 10,
respectively)
- The two data sets are considered in parallel from the top; each entry is
checked for whether it appears within twice the cut-off value of the other list.
I.e., a top entry in the Sindice list (meaning that its index is under S) should
be present in the Yahoo! list with an index of at most 2*Y; similarly for the
Yahoo! entries. If such an entry is found, it is added to the final list.
This means that the number of final entries is at most S+Y. (The python script
executing the merge is also available.) The current results
are:
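The actual merging script is linked above; the rule itself can be sketched in Python as follows (the function name and the 0-based ranks are my own conventions):

```python
def merge_rankings(sindice, yahoo, s_cut=12, y_cut=10):
    """Merge two ranked vocabulary lists: a top entry of one list is
    accepted only if it also appears within twice the cut-off value of
    the other list. Ranks are 0-based here."""
    s_rank = {v: i for i, v in enumerate(sindice)}
    y_rank = {v: i for i, v in enumerate(yahoo)}
    final = []
    for v in sindice[:s_cut]:
        if y_rank.get(v, float("inf")) < 2 * y_cut and v not in final:
            final.append(v)
    for v in yahoo[:y_cut]:
        if s_rank.get(v, float("inf")) < 2 * s_cut and v not in final:
            final.append(v)
    return final
```

An entry like a purely local vocabulary that ranks high on one crawl but is absent (or very low) on the other is filtered out by this rule, which is the intent described above.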
This list has been included in the RDFa 1.1 Default
Profile (also available in Turtle and RDF/XML). In most cases the prefixes are
well known and widely used (e.g., foaf); in other cases the prefix.cc
service was used to establish the default prefix (e.g., ctag for the http://commontag.org/ns# vocabulary).
Ivan
Herman, ivan@w3.org,
W3C, Semantic Web
Activity Lead, thanks to Giovanni Tummarello, Robert
Fuller, Diego Ceccarelli, and Renaud Delbru, Sindice, and Péter Mika, Yahoo!