Vocabulary Search on the Semantic Web for RDFa Default Profiles
$Date: 2013/03/01 15:54:47 $
The set of vocabulary prefixes to be included in the RDFa
1.1 Default Profile is defined based on the general usage of those
vocabularies on the Semantic Web. This general usage is established using search
crawl data, courtesy of Sindice and of Yahoo!.
This page describes the methodology used during crawls as well as the possible
post-processing steps.
How Was the Data Collected?
The methodology used for the Sindice and the Yahoo! cases was essentially the
same, namely:
- A crawl of the respective search engine produced a set of URI-s from the
Semantic Web.
- In the case of Sindice, the crawl was on the Semantic Web, yielding around
10B triples.
- In the case of Yahoo!, the original generic crawl size was around 12B pages,
with 431M documents using RDF (excluding trivial RDFa markup, i.e., pages
containing triples in the xhtml namespace only). The measurement results are
based on the RDF extracted from these RDFa pages.
- The result of the crawl was subject to a number of processing steps, namely:
- Using some simple heuristics and, in some cases, explicit processing rules,
the vocabulary URI-s were established.
- A number of vocabularies were eliminated as unsuitable for an RDFa profile.
- Some common mistakes in the datasets were handled, too. For example, a
missing '#' or '/' at the end of a property URI formally yields a different
property but, in many cases, it was fairly clear that these were mistakes
and the data could be merged with the intended URI. Another, somewhat more
controversial, case is when a known vocabulary changed its official URI at
some point (e.g., Facebook's Open Graph Protocol); in such cases, all data
were merged under the current, official URI.
- The resulting set of vocabularies is ordered using the effective second-level
domains for each entry. The Public
Suffix List, maintained by the community at large, was used in both the
Sindice and the Yahoo! cases to identify the highest domain (i.e., the second-level
domain) directly below a top-level domain. Using this metric rather than,
for example, the number of domains or graphs avoided artefacts of a few very
important sites publishing a large number of triples or graphs with local
vocabularies that would not be appropriate for a generic RDFa profile.
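The eSLD-based ranking described above can be sketched in Python. This is an illustrative sketch only: the tiny suffix set stands in for the full Public Suffix List, and the `(vocabulary, page)` data shape is my own assumption about the crawl output, not the actual processing rules.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Stand-in for the full Public Suffix List used by the actual pipelines.
PUBLIC_SUFFIXES = {"com", "org", "net", "co.uk"}

def effective_sld(page_uri):
    """Return the effective second-level domain of a page URI: the
    highest domain directly below a known public suffix."""
    host = urlsplit(page_uri).hostname or ""
    labels = host.split(".")
    for i in range(len(labels) - 1):
        if ".".join(labels[i + 1:]) in PUBLIC_SUFFIXES:
            return ".".join(labels[i:])
    return host

def rank_vocabularies(usage):
    """usage: iterable of (vocabulary_uri, publishing_page_uri) pairs.
    Score each vocabulary by the number of distinct eSLDs using it, and
    return the vocabularies in decreasing order of that score."""
    slds = defaultdict(set)
    for vocab, page in usage:
        slds[vocab].add(effective_sld(page))
    return sorted(slds, key=lambda v: len(slds[v]), reverse=True)
```

Counting distinct eSLDs, rather than raw triples or graphs, is what keeps a single prolific site from dominating the ranking.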
The most complex, and possibly controversial, step is the vocabulary
elimination (step 2.2 above). Here are the
categories of vocabularies that were removed from the result set:
- Vocabularies defined through a W3C Recommendation or Working/Interest Group Note
(those are part of a default profile “ex officio”)
- Vocabularies whose URI-s are not publicly dereferenceable, or that do not resolve
to proper documentation (at the minimum, a commented RDF file)
- Vocabularies marked as “draft”, “experimental”
- Vocabularies used for a very specific and specialized purpose (e.g., major
ontologies used in medical or drug discovery applications). Note that this is not
a judgement on the quality or the usefulness of that vocabulary, simply a
reflection of the fact that the vocabulary should not be part of an RDFa profile
of general use.
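As a rough, hypothetical sketch of this elimination step (the metadata field names and predicates below are my own illustration, not the actual rules used):

```python
def keep_for_profile(vocab):
    """vocab: a dict of illustrative metadata gathered per vocabulary.
    Returns True if the vocabulary survives the elimination step."""
    if vocab.get("w3c_rec_track"):
        return False   # already in the default profile "ex officio"
    if not vocab.get("dereferenceable"):
        return False   # URI must resolve to proper documentation
    if vocab.get("status") in ("draft", "experimental"):
        return False
    if vocab.get("domain_specific"):
        return False   # e.g., specialized medical or drug-discovery ontologies
    return True
```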
The rules used for the Sindice and the Yahoo! cases, respectively, are available for
download. The final results of the two crawls and subsequent processing are also
available; see the Sindice and the Yahoo!
pages for further details.
Merging the results and establishing the final profile content
Both crawl results have a relatively natural cut-off point for the vocabularies
that should or should not be considered for a default profile, taking into
account that the number of default prefixes should not be very high (in the range of
10, considering that the list might grow as time goes by). For Yahoo!, the
cut value of 10 seems to be a natural choice. It is slightly less clear for the
Sindice case, though; at present, the value of 12 has been used.
However, the two data sets should be considered together; an entry from one dataset
that scores very low on the other should not be added. Based on this, the following
algorithm is used:
- The S and Y cut-off points are established (as said above, this is 12 and 10,
respectively)
- The two data sets are considered in parallel from the top; each entry is
checked for whether it appears within twice the cut-off value of the other list.
I.e., a top entry in the Sindice list (meaning that its index is under S) should
be present in the Yahoo! list with an index of at most 2*Y; similarly for the
Yahoo! entries. If such an entry is found, it is added to the final list.
This means that the number of final entries is at most S+Y. (The python script
executing the merge is also available.) The current results
are:
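The actual merging script is linked above; the rule itself can be sketched in Python as follows (the function name and the 0-based ranks are my own conventions):

```python
def merge_rankings(sindice, yahoo, s_cut=12, y_cut=10):
    """Merge two ranked vocabulary lists: a top entry of one list is
    accepted only if it also appears within twice the cut-off value of
    the other list. Ranks are 0-based here."""
    s_rank = {v: i for i, v in enumerate(sindice)}
    y_rank = {v: i for i, v in enumerate(yahoo)}
    final = []
    for v in sindice[:s_cut]:
        if y_rank.get(v, float("inf")) < 2 * y_cut and v not in final:
            final.append(v)
    for v in yahoo[:y_cut]:
        if s_rank.get(v, float("inf")) < 2 * s_cut and v not in final:
            final.append(v)
    return final
```

An entry like a purely local vocabulary that ranks high on one crawl but is absent (or very low) on the other is filtered out by this rule, which is the intent described above.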
This list has been included in the RDFa 1.1 Default
Profile (also available in Turtle and RDF/XML). In most cases the prefixes are
well known and widely used (e.g., foaf); in other cases the prefix.cc
service was used to establish the default prefix (e.g., ctag for the http://commontag.org/ns# vocabulary).
Ivan
Herman, ivan@w3.org,
W3C, Semantic Web
Activity Lead, thanks to Giovanni Tummarello, Robert
Fuller, Diego Ceccarelli, and Renaud Delbru, Sindice, and Péter Mika, Yahoo!