WARNING: this is still a draft, and some details are still under discussion.

Vocabulary Search on the Semantic Web for RDFa Default Profiles

The table below contains the top 100 entries from the complete search and processing for Default Profile vocabularies as performed by Sindice. See the separate section below that explains the methodology leading to this table.

The contents of the columns are:

  1. (Partial) URI of the vocabulary
  2. Number of Triples that use the vocabulary
  3. Number of Graphs that use the vocabulary
  4. Number of domains that use the vocabulary
  5. Number of 2nd level domains that use the vocabulary (i.e., http://a.b.c and http://q.b.c are considered as identical)
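
The difference between columns 4 and 5 can be illustrated with a minimal sketch. It assumes that a "2nd level domain" is simply the last two labels of the host name; the function name `second_level_domain` and the sample URLs are made up for illustration, and the rule actually applied by Sindice (for instance for suffixes such as co.uk) is not documented here.

```python
from urllib.parse import urlsplit


def second_level_domain(url: str) -> str:
    """Reduce a URL to the last two labels of its host name (naive rule)."""
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    # Hosts with fewer than two labels are returned unchanged.
    return ".".join(labels[-2:])


# Illustrative URLs only; they are not taken from the data set.
urls = ["http://a.b.c/data.rdf", "http://q.b.c/people", "http://x.y/z"]

domains = {urlsplit(u).hostname for u in urls}        # column 4: full host names
second_level = {second_level_domain(u) for u in urls}  # column 5: merged hosts

print(len(domains), len(second_level))
```

In this sketch http://a.b.c and http://q.b.c count as two domains but as a single 2nd level domain.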

The domain data is important for the purpose of a default profile: if a vocabulary is used by a large number of triples that all originate from only one or a few domains, the vocabulary is used only in a few, albeit possibly very large, datasets. Although these datasets may be very important resources, this would not warrant making the vocabulary part of a generic default profile. To be precise about the content of the table, it contains the top 100 entries in decreasing order of the 2nd level domain value. The full search results are also available in CSV format (beware, the data set is over 140M, containing more than 1,800,000 entries).
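
The ranking rule described above can be sketched as follows. The `Row` type, the `top_entries` helper, and all the numbers are hypothetical and only serve to show why the table is ordered by 2nd level domains rather than by triples.

```python
from typing import NamedTuple


class Row(NamedTuple):
    vocabulary: str
    triples: int
    graphs: int
    domains: int
    second_level_domains: int


def top_entries(rows, limit=100):
    """Sort by 2nd level domain count, largest first, and keep `limit` rows."""
    ordered = sorted(rows, key=lambda r: r.second_level_domains, reverse=True)
    return ordered[:limit]


# Invented values: one huge dataset on a single site versus a small but
# widely deployed vocabulary.
rows = [
    Row("example.org/big-dataset#", 50_000_000, 900_000, 2, 1),
    Row("example.org/widely-used/", 40_000, 9_000, 700, 650),
    Row("example.org/niche#", 1_200, 300, 12, 10),
]

ranked = top_entries(rows, limit=2)
for row in ranked:
    print(row.vocabulary, row.second_level_domains)
```

With this ordering the vocabulary with the most triples drops out of the top two, because it is used on a single 2nd level domain only.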

By default, the table is sorted on its last column, in decreasing order. Clicking on a column header reorders the table using that column, first in increasing order and, on clicking again, in decreasing order.

Vocabulary URI | Triples | Graphs | Domains | 2nd Level Domains
purl.org/dc/terms/3.81844672E81133080893395120232691
w3.org/2006/vcard/ns#2.15687398E952579593585781166510
xmlns.com/foaf/0.1/2.73884851E941645800293945183707
ogp.me/ns#1.62337136E8241622382757919424
purl.org/dc/elements/1.1/1.72281808E813324030237088318223
w3.org/2003/01/geo/wgs84_pos#4.6242568E7102158461056687989
creativecommons.org/ns#7827782.0165117882045450
w3.org/2002/12/cal/icaltzd#3.6544712E7215774860853895
purl.org/stuff/rev#1.8225048E7237286146642451
w3.org/2000/10/swap/pim/contact#2723938.068249288781350
rdfs.org/sioc/ns#1.7927096E764974426761313
rdf.data-vocabulary.org/#1.30246464E81005403117281030
purl.org/vocab/bio/0.1/788445.023811635921510
purl.org/goodrelations/v1#1.57271296E84953695403374
wellformedweb.org/CommentAPI/44282.01494226223
ramonantonio.net/doac/0.1/#7257748.0911699308190
commontag.org/ns#5257518.065337269180
rdfs.org/sioc/types#MicroblogPost1144228.068690373171
usefulinc.com/ns/doap#67050.07861163116
xmlns.com/wot/0.1/1480.0228124110
developers.facebook.com/schema/admins822554.0317936150109
developers.facebook.com/schema/app_id2800303.0918280135106
purl.org/vocab/relationship/74791.030808780
purl.org/stuff/rev#Review54401.083327871
purl.org/stuff/rev#hasReview27019.022766159
purl.org/ontology/bibo/700982.02376037156
xmlns.com/wordnet/1.6/Airport159.0846054
purl.org/stuff/rev#text38572.069525754
rdfs.org/sioc/types#Comment253000.0382166652
purl.org/dc/dcmitype/290338.0437306152
purl.org/stuff/rev#rating54185.091265651
rdfs.org/sioc/types#BlogPost24667.051746448
w3.org/ns/auth/rsa#1386.02235043
data.semanticweb.org/ns/swc/ontology#77595.041244942
geonames.org/ontology#7.7840128E7111348025341
w3.org/ns/auth/cert#972.02104640
developers.facebook.com/schema/page_id33067.0120724536
w3.org/2002/12/cal/ical#29016.018793532
open.vocab.org/terms/121108.0373163432
purl.org/stuff/rev#reviewer19146.010433230
http://dbpedia.org/ontology/1.3169637E711196933530
rdfs.org/sioc/types#Microblog575.03233029
purl.org/ontology/wo/438756.0189074229
smob.me/ns#Hub498.03352828
http://dbpedia.org/property/9.3231264E7117823463328
online-presence.net/opo/ns#1665.03062727
purl.org/ontology/mo/1214051.02347702826
purl.org/net/vocab/2004/07/visit#757.0382020
moat-project.org/ns#6395.05952220
purl.org/vocab/vann/3859.02931818
purl.org/stuff/rev#title1435.01702018
w3.org/2002/12/cal#1205.0781616
skype.com/49.0281513
purl.org/vocab/frbr/core#806407.0508861413
purl.org/net/provenance/ns#1.3858524E77609031313
openarchives.org/ore/terms/7141080.06432651513
_:node0126.01081313
ebusiness-unibw.org/ontologies/eclass/5.1.4/#C_AKJ315005-tax27.0181413
abmeta.org/ns#tags2494.04321313
trust.mindswap.org/ont/trust.owl#605.0181312
rdfs.org/sioc/types#BoardPost2536.09751312
purl.org/stuff/rev#type173.0201212
purl.org/NET/scovo#3426.03335612
purl.org/NET/c4dm/event.owl10470.08671212
purl.org/dc/dcam27946.070731312
holygoat.co.uk/owl/redwood/0.1/tags/taggedWithTag9408.07051212
abmeta.org/ns#Book335.01741212
w3.org/2004/03/trix/rdfg-1/Graph3217495.07612311111
abmeta.org/ns#link78.0501111
abmeta.org/ns#isbn127.0801111
abmeta.org/ns#description1100.05731111
xs:string1514.0711010
w3.org/2006/time#603.0461210
holygoat.co.uk/owl/redwood/0.1/tags/taggedBy4794.020271010
holygoat.co.uk/owl/redwood/0.1/tags/Tag27806.060721010
purl.org/net/pingback/20.0999
holygoat.co.uk/owl/redwood/0.1/tags/taggedResource1336.0191109
abmeta.org/ns#year111.06499
xri://$xrd*($v*2.0)Service27.01888
umbel.org/umbel#44058.012909118
redfoot.net/2005/session#hexdigest16.01488
rdfs.org/sioc/types#Weblog3303.01200108
mozilla.org/xblbindings9.0988
xri://$xrdsXRDS34.01777
xmlns.com/wordnet/1.6/Project16.0877
xmlns.com/wordnet/1.6/Person4908.034987
sw.deri.org/2005/08/conf/cfp#258.032107
purl.org/net/provenance/types#QueryResult2261588.089289377
holygoat.co.uk/owl/redwood/0.1/tags/name2474.045777
ebusiness-unibw.org/ontologies/eclass/5.1.4/#C_AKJ317003-tax12.0777
xs:boolean48.01966
xmlns.com/wordnet/1.6/Document97.06666
s.opencalais.com/1/type/lid/DefaultLangId127.02566
skype.com14.01176
rdfs.org/sioc/types#WikiArticle30965.021266
rdfs.org/sioc/types#Wiki1642.0445106
rdfs.org/sioc/types#MessageBoard226.015286
purl.org/vocab/psychometric-profile/46.01176
purl.org/net/schemas/quaffing/drankBeerWith55.0776
purl.org/net/provenance/types#DataCreatingService3180869.082134866

How Was the Data Collected?

The fundamental approach is to search the Semantic Web for vocabulary usage and process the results in order to derive possible vocabularies that are suitable for an RDFa Default Profile. The detailed steps are as follows.

  1. A crawl of the Sindice search engine produced a list of properties and classes, based on a sample of around 10B triples retrieved from the Semantic Web.
  2. The result of the crawl was subject to a number of processing steps, namely:
    1. Using some simple heuristics (and, in many cases, manual rules to overcome errors in the dataset), the property and class URIs were used to establish the vocabulary URIs. (Where the number of domains using a specific vocabulary was very low, i.e., 1 or 2, the manual rule set has no corresponding rule, because the end result would not be used for a default vocabulary anyway.)
    2. A number of vocabularies were eliminated as unsuitable for an RDFa profile.
    3. Some common mistakes in the datasets had to be handled, too. For example, a missing '#' or a '/' at the end of a property yields, formally, a different property URI but, in many cases, it was fairly clear that those were mistakes and could be merged with the intended URI. Another, somewhat more controversial, case is when a known vocabulary has changed its official URI at some point (e.g., Facebook’s Open Graph Protocol), and all data are merged into the current, official URI.
  3. The resulting set of vocabularies was categorized along different axes: the number of triples they appeared in, the number of graphs they appeared in, and the number of different domains they appeared in (both top level and second level domains).
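
The steps above can be sketched roughly as follows. This is not the code that was actually run: the "cut at the last '#' or '/'" heuristic, the `aliases` mapping standing in for the manual rules and URI merges, the function names, and the example.org URIs are all assumptions made for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def vocabulary_of(term_uri: str) -> str:
    """Assumed heuristic: keep everything up to the last '#' or '/'."""
    cut = max(term_uri.rfind("#"), term_uri.rfind("/"))
    return term_uri[: cut + 1] if cut >= 0 else term_uri


def second_level_domain(url: str) -> str:
    """Naive rule: the last two labels of the host name."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def summarize(observations, aliases=None):
    """Aggregate (term_uri, graph_url) pairs, one pair per triple."""
    aliases = aliases or {}
    stats = defaultdict(
        lambda: {"triples": 0, "graphs": set(), "domains": set(), "second_level": set()}
    )
    for term_uri, graph_url in observations:
        vocab = vocabulary_of(term_uri)
        vocab = aliases.get(vocab, vocab)  # merge outdated or mistyped URIs
        entry = stats[vocab]
        entry["triples"] += 1
        entry["graphs"].add(graph_url)
        entry["domains"].add(urlsplit(graph_url).hostname)
        entry["second_level"].add(second_level_domain(graph_url))
    return {
        vocab: {
            "triples": e["triples"],
            "graphs": len(e["graphs"]),
            "domains": len(e["domains"]),
            "second_level_domains": len(e["second_level"]),
        }
        for vocab, e in stats.items()
    }


# Hypothetical input: a vocabulary whose official URI changed at some point.
observations = [
    ("http://example.org/ns#title", "http://a.site.example/page1"),
    ("http://example.org/ns#title", "http://b.site.example/page2"),
    ("http://old.example.org/schema/title", "http://other.example/page"),
]
aliases = {"http://old.example.org/schema/": "http://example.org/ns#"}

result = summarize(observations, aliases)
print(result)
```

Here the three triples end up under a single vocabulary URI, used in three graphs on three domains but only two 2nd level domains.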

The most complex and possibly controversial step is 2.2 above. Here are the categories of vocabularies that were removed from the result set: