SWEO Community Project: Linking Open Data on the Semantic Web
Data sets
This page collects RDF data sets that are part of the emerging Web of Linked Data.
Please note: This page is outdated
For keeping the LOD cloud diagram up to date, the Linking Open Data community effort has started to collect meta-information about Linked datasets on CKAN, a registry of open data and content packages provided by the Open Knowledge Foundation.
The meta-information from CKAN (and not from this page) is used to draw the LOD cloud diagram and to maintain statistics about the size of the Web of Linked Data.
The list of Linked Datasets for which we have already collected meta-information on CKAN is found here:
Basic statistics about these datasets are provided at:
A guide on how to describe your dataset on CKAN is found here:
Thus, if you are publishing a Linked Dataset, please add meta-information about your dataset to CKAN.
Historic Version of this Page
To be part of the Linked Data Web, data has to be accessible as RDF over the HTTP protocol through at least one of the access methods listed below. The more methods the better (but avoid aliases). See also the tutorial on How to publish Linked Data on the Web.
Note: a related page exists for Tabular collation of RDF data set archives that is more amenable to discovery and loading of these data sets into RDF data stores (Quad or Triple stores).
The page is part of the SWEO Interest Group Community Projects effort. See also:
- Richard Cyganiak's clickable version of the LOD Cloud showing instance relationships among data sets which are linked, directly or indirectly, to DBpedia.
- An alternative clickable illustration of the DBpedia Cloud, presented by OpenLink, shows the interconnections (and gaps) somewhat more clearly.
- The UMBEL project provides a clickable diagram of the UMBEL LOD Constellation, an illustration of class linkages across external shared ontologies which are linked, directly or indirectly, to UMBEL.
- OpenLink's Virtuoso Sponger supports real-time RDF conversion ("RDFization") for this Dynamic Linked Data Cloud, including linkages to the DBpedia Linked Data Cloud, the UMBEL Class Constellation, the Bio2RDF Linked Data Cloud, and other data sets and data dictionaries.
- The Bio2RDF Linked Data Cloud is described below.
- Sindice's dataset by topic map
- The Comprehensive Knowledge Archive Network (CKAN), a registry of open knowledge 'packages', including plenty of open data.
How big is this Web of Linked Data?
- Statistics on Data sets that are available as Linked Data
- Statistics on links between Data sets that are available as Linked Data
Please note: The statistics on these pages will be updated in the future with data from CKAN. The statistics collected on CKAN are also used to update the LOD cloud diagram on the LOD start page. So if you publish a dataset yourself or if you know detailed statistics for datasets that you use, please add them to CKAN and we will include them in the next revision of the LOD cloud.
Data sets available with dereferenceable URIs - "Linked Data"
Here are some useful starting points for RDF browsing. Please include an example URI as an entry point for crawlers. (Alphabetical order, please.)
- Advogato is exporting its users' profiles using FOAF.
- BBC Music Data about Artists, Releases and Reviews. Largely based upon MusicBrainz and the Music Ontology. Developer Documentation.
- BBC Programmes Data about TV and Radio Programmes broadcast by the BBC. Interlinked with MusicBrainz and DBpedia. Developer Documentation.
- The Bio2RDF project, a Semantic web atlas of post-genomic knowledge about human and mouse, has published 27 biology-, gene- and medical-related data sets (altogether 2.3 billion triples, served up by Virtuoso instances). The data sets are available via Virtuoso's built-in SPARQL endpoints and as Linked Data. Bio2RDF SPARQL endpoint list, PubMed article and PubMed author viewed using the Marbles Linked Data browser. Falcons Search for KILLER CELL.
- CIA Factbook D2R Server publishing the CIA Factbook. Example thing: Botswana
- Craigslist as Linked Data. See for details.
- CrunchBase as Linked Data, with SPARQL endpoint. About 11K people, 10K companies, 1K products, regularly updated, growing. Examples: iPhone, Yahoo!, Sergey Brin.
- DailyMed publishes Linked Data of marketed drugs along with general background on the chemical structure of the compound and its therapeutic purpose, details on the compound's clinical pharmacology, indication and usage, contraindications, warnings, precautions, adverse reactions, overdosage, and patient counseling.
- data.gov official website of the US government making over 1000 US government datasets available as Linked Data (around 6.4 billion triples).
- data.gov.uk official website of the UK government making over 3600 UK government datasets available as Linked Data.
- Data-Gov Wiki provides more than 5 billion triples converted from datasets published at http://data.gov (US). The datasets cover a good number of topics including government budget, environmental statistics, housing and population statistics, medical cost, energy consumption, public library statistics, and labor statistics. Please go to the catalog to see the RDF data. Examples: a dataset's index, and a table entry from the dataset, National Science Foundation
- DBLP Bibliography Server Berlin: Provides bibliographic information about scientific papers. Size of the data set: 800,000 articles and 400,000 authors, approx. 15 million triples. Example thing: Tim Berners-Lee in the bibliography. The server provides the November 2006 version of the DBLP data set. As the Hannover DBLP Bibliography server is updated weekly, you should set RDF links there and not to the Berlin one.
- DBLP Bibliography Server Hannover: Derived from the FUBerlin server, but with more links between the publications (e.g., to conference series) and updated weekly. Unfortunately, there is no backward compatibility with regard to URIs (URIs for persons no longer include numbers). Example thing: Tim Berners-Lee
- DBTropes a tvtropes.org wrapper, providing data about 1700+ movies and 2500+ tropes/features.
- DBpedia: Linked Data version of Wikipedia. The DBpedia data set currently provides information about more than 1.95 million “things”, including at least 80,000 persons, 70,000 places, 35,000 music albums, 12,000 films. Provides descriptions in 12 different languages. Altogether, the DBpedia data set consists of (more than) 103 million RDF triples. The data set is interlinked with various other data sources. Example things: Paul McCartney, Berlin, Tetris
- dbpedia lite: A cut-down version of DBpedia, which does not attempt to extract data from Infoboxes. Data is loaded live from the Wikipedia API. Unlike DBpedia, dbpedia lite uses Wikipedia pageids, to help improve identifier stability. Example things: Paul McCartney, Berlin, Tetris
- dbtune provides linked data access for the Jamendo Creative Commons music platform, the Magnatune label, the BBC John Peel sessions, the MySpace data and the AudioScrobbler data. It also hosts a version of MusicBrainz powered by D2R, and interlinked with Lingvoj and DBpedia.
- Diseasome publishes Linked Data of 4,300 disorders and disease genes linked by known disorder-gene associations for exploring all known phenotype and disease gene associations, indicating the common genetic origin of many diseases.
- doapspace.org 43,000 DOAP profiles of Freshmeat projects, 15,000 SourceForge projects, 1,720 Python Package Index projects and hundreds of spidered DOAP.
- DrugBank publishes Linked Data of almost 5000 FDA-approved small molecule and biotech drugs. It contains detailed information about drugs including chemical, pharmacological and pharmaceutical data; along with comprehensive drug target data such as sequence, structure, and pathway information.
- ECS School Southampton Serves data about members, projects and seminars on the Web as Linked Data. Example person: Mischa Tuffield
- ESWC2006 Conference Data Set describes many aspects of ESWC2006, according to the ESWC2006 Conference Ontology describing authors, papers, session and workshops. Mostly available via dereferenceable URIs. The data might need checking over, and it's not a huge number of triples, but is also well complemented by similar data sets from ISWC2006.
- data.semanticweb.org: Metadata for several semantic web related conferences and workshops, including the most recent ISWC, ESWC and WWW events. All data is available via dereferenceable URIs, RDF dumps, and a SPARQL endpoint. The site has been overhauled for ISWC2008. Some older sub-datasets:
- ESWC2007 Conference Data Set describing authors, papers, session and workshops. Available as Linked Data, HTML and via a SPARQL endpoint.
- ESWC2008 Metadata interlinked with DBpedia, Revyu, and the SemWeb community wiki.
- Eurostat Countries and Regions D2R Server publishing statistical information about European countries and regions. Example thing: Leipzig. See also LOD Eurostat page and the alpha release from this project.
- Food and Agriculture Organization of the United Nations (FAO) geopolitical ontology: The FAO geopolitical ontology provides a master reference for geopolitical information. It manages names in multiple languages (English, French, Spanish, Arabic, Chinese, Russian and Italian); maps standard coding systems (UN, ISO, FAOSTAT, AGROVOC, etc); provides relations among territories (land borders, group membership, etc); and tracks historical changes. In addition, Web services and five modules of the geopolitical ontology are available at FAO Country profiles.
- flickr wrappr from Christian Becker pulls photos related to DBpedia resources from flickr and serves them as RDF. Example: Paris
- Freebase, an open-license database for all things in the world, has released a Linked Data interface (See release note). Example instances: Sean Young, Blade Runner.
- FOAF-enabled profiles on several community sites — see table at the FOAF wiki
- Gene Fruitfly Embryogenesis Images Chris Mungall (Berkeley Drosophila Genome Project) serves a database containing annotated images of gene expression in fruitfly embryogenesis.
- Gene Ontology Annotations Chris Mungall (Berkeley Drosophila Genome Project) serves 6 million annotations from Gene Ontology database
- German National Library PND and SWD The German National Library has published its person data (PND dataset describing 1.8 million people) and its subject headings (SWD, 164,000 headings) as Linked Data on the Web.
- Geonames Information about over 6 million places and geographic features. Example thing: Berlin
- GeoSpecies Knowledge Base Information on biological orders, families and species, as well as species occurrence records and related data, with links to Geonames, Bio2RDF, DBpedia, Freebase and UMBEL. See About Page
- GovTrack.us from Joshua Tauberer publishes linked data about members of the U.S. Congress, as well as bills, committees and votes. 12M triples. Example resources, announcement
- Hungarian National Library OPAC and Digital Library
- Iflexion Software provider in Austin, TX, supports the Local businesses with RDF issues.
- IS-Group@Freie Universität Berlin RDF data about the activities and members of the IS-Group at Freie Universität Berlin is available. Example thing: DOAP description of D2R Server project
- ISWC and ASWC 2007 Conference Data The data set contains data about tracks, papers, sessions, talks, workshops, tutorials, invited talks, panels, organizers, people, organizations and topics. The data is available as Linked Data, SPARQL endpoint and as RDF dumps.
- Itransition Software group provides Linked Data customization services for a wide range of domains.
- Jamendo Music server exposing Artist, albums, tracks, covers, lyrics, tags, P2P links (BitTorrent, ed2k)
- LastFM wrapper This service provides a live RDF representation of your last 10 tracks submitted to AudioScrobbler/Last.fm
- Lexvo.org provides language-related data for the Semantic Web, e.g. English 'school', Afrikaans language, Chinese character U+5A34
- Library of Congress Subject Headings as SKOS Linked Data (LOC webpage about Linked Data interface)
- Ligado nos Políticos provides data about Brazilian Politicians. Linked to DBpedia, GeoNames, FactBook, Freebase, UMBEL and YAGO. Example: Dilma Rousseff.
- lingvoj.org provides URIs and multilingual labels for hundreds of human languages. Example entries: French language, Chinese language.
- LinkedCT.org - Linked Data Source of Clinical Trials. Contains roughly 25 million triples (as of April 2011), about 106,000 clinical trials, with more than 167,000 links to external sources such as DBpedia, DailyMed, DrugBank, and Bio2RDF.org's PubMed. Refer to live stats page for up-to-date statistics about the entities and external links. Example instances: Breast Cancer, a Trial, Toronto.
- Linked Movie DataBase (LinkedMDB) aims at publishing the first open linked data dedicated to movies, with high quality and quantity of interlinks to other LOD data sources and movie-related websites. Refer to LinkedMDB Home Page for example URIs and to the Interlinking section for examples of the interlinks and the linkage methodology.
- Linked Brain Data is a linked open domain knowledge base on brain and neuroscience. Its recent release contains more than 3.5 million triples on various domain knowledge about the brain (integrated and extracted from various resources such as PubMed, Neurolex, Allen brain atlas, etc.), containing knowledge on multi-scale brain structures, associations among cognitive functions, brain diseases, and brain building blocks at multiple scales.
- Linked Sensor Data is the first open dataset for sensors and sensor observations, created at Kno.e.sis Center, and converted from weather data at Mesowest. Contains descriptions of 20 thousand weather stations and 160 million observations.
- Mannheim University Library Linked Data prototype for library catalog data, as well as additional data resulting from library research projects. Experimental, not (yet) open data.
- MindSwap RDF data about the activities and members of the MindSwap group at Maryland is available.
- MusicBrainz provides lots of data about artists and their albums. Served as Linked Data and via a SPARQL endpoint.
- MySpace wrapper This service provides a live RDF representation of MySpace users. If the user is also an artist, then the corresponding tracks in the streaming audio cache are included in the RDF.
- New York Times Linked Open Data - Beta. The NYT publishes Linked Data for 5,000 people subject headings under a CC BY license.
- News about the Semantic Web provided by the Semantic Web Company.
- Open Archives Demo showing how an OAI-PMH endpoint is exposed as Linked Data with OAI2LOD server.
- OpenCyc Semantic Web version of the Open Cyc ontology. Supports content negotiation on concept URIs. Example things: RetailStore, Dog. Concept Browser
- OpenGuides are a network of wiki-based city guides. Example Open Guide to Milton Keynes Each node has RDF/XML describing the thing the node is about, in addition to wiki versioning information. URIs might need tidying up, and don't currently support 303 redirects.
- Open Election Data a project to help local government open up their election results. See also lessons learned from the project.
- Ordnance Survey data from the UK published as Linked Data.
- overdogg.com Allows users to post needs and wants and expose them to the semantic web, and provides matchmaking with qualified providers. Scrapes Craigslist want ads for FOAF and TIWAN metadata (currently > 100K docs and users). Ads are exposed as linked data (RDF).
- Oxagile custom software development provides custom software and uses Linked Data in its processes.
- PDB2RDF Project making the Protein Data Bank available as Linked Data and via a SPARQL endpoint (approximately 14 billion triples).
- Project Gutenberg Catalog Linked Data version of the Project Gutenberg catalog, with a SPARQL endpoint. Interlinked with DBpedia. Example author: Ed Krol
- Pressemappe: 20th Century Press Archives published by ZBW, Kiel, Germany. Uses OAI-ORE.
- RAMEAU (French National Library book indexing vocabulary, with 150K subjects) as SKOS Linked Data. Connected to LCSH. Example subject: Birds
- RDF Book Mashup: Provides bibliographic information, reviews and sales offers for most books that have an ISBN. Maps data from Amazon and Google Base to RDF. Size of the data set: Unknown, billions of triples. Example thing: "Weaving the Web", the book
- RDFohloh, a service that provides RDF data from ohloh, with more than 135000 instances of sioc:User and more than 13000 instances of doap:Project (information retrieved on June 9, 2008).
- Revyu has reviews and ratings in RDF/XML available via dereferenceable URIs and a SPARQL endpoint. FOAF and Tag information is also available by the same mechanism.
- RKB Explorer Data 25 different domains, each with a separate data set. The data sets are focused on scientific research, and the larger ones include DBLP, Citeseer, CORDIS, NSF, EPSRC, RAE2001 as sources. The data is available as Linked Data, SPARQL endpoint and RDF dumps, and a simple browser is provided. Semantic Web Sitemaps provided.
- Robots.net is exporting its users' profiles using FOAF.
- Semantic Web Community Wiki Public Semantic MediaWiki featuring Linked Data views and a SPARQL endpoint.
- SemanticBible is a Linked Data Space for knowledge facts about the Bible. Original SemanticBible RDF (non-Linked Data) packages available from semanticbible.com. This is a project by Daniel Lewis and OpenLink Software
- SemanticWebCentral is a software development site for Open Source Semantic Web tools (think SourceForge for the Semantic Web). It publishes information about its projects and developers in RDF, using the GForge ontology.
- SIDER publishes Linked Data of almost 1,000 marketed drugs and their adverse effects. The information is extracted from public documents and package inserts.
- SKOS Data Zone
- Southampton Pubs This provides a smaller RDF data set of pubs in Southampton UK. The data is exposed as linked open data, and a typical URI is of the form: http://www.johngoodwin.me.uk/pubs/id/pub1 which is dereferenceable to http://www.johngoodwin.me.uk/pubs/description/pub1 on html requests and http://www.johngoodwin.me.uk/pubs/data/pub1 on RDF+XML requests. There is also an RDF dump of the data here.
- STW Thesaurus for Economics Thesaurus for economics and business economics in English and German, including a classification of subject categories. Maintained by the German National Library of Economics (ZBW). Published as RDFa pages and RDF/XML dataset, licensed under Creative Commons (by-nc-sa)
- TalkDigger is exporting its users' profiles using FOAF and the conversations data using SIOC (note: some problems should be resolved between the SIOC Users and the FOAF profiles).
- Telegraphis.net provides URIs for countries, continents, capitals, and currencies.
- Uberblic.org consolidation service, which republishes cleaned data from different sources as Linked Data and via a SPARQL endpoint. list of input data sources.
- UMBEL (Upper Mapping and Binding Exchange Layer) is a lightweight ontology for relating Web content and data to a standard set of 20,000 subject concepts. A further 1.5 million named entities have been extracted from Wikipedia and mapped to the UMBEL reference structure with cross-links to YAGO and DBpedia.
- UniProt provides a large life sciences data set with more than a billion (and growing) triples
- US Census RDF version of the 2000 US census data set. Consists of around 1 billion triples. Served as linked data and via a SPARQL endpoint. Example things: USA, New Jersey
- US Securities and Exchange Commission's EDGAR database available as Linked Data and via SPARQL endpoint.
- Wikicompany is a free, worldwide business directory that anyone can edit. OpenLink Software hosts a Linked Data version of the directory on the Virtuoso Universal Server, extracted by the DBpedia team. Example entries: Northwest Airlines, Apple Computer, OpenLink Software.
- Watchdog.net collects political data, currently US-only. Example: Nancy Pelosi (RDF/XML)
- WordNet is a large lexical database of English. Currently being RDFized by a W3C Best Practices Task Force. Details ... Example thing: the verb "read" in the first sense Linked Data version and SPARQL endpoint at RKBexplorer
- Wordnet 3.0 served by cs.vu.nl
- YAGO ontology available as Linked Data. The ontology should be interlinked with DBpedia shortly.
- Colibrary Web API: Provides bibliographic and social information about books. Social data are tags, users and reviews for most books that have an ISBN. Data are aggregated from Amazon, Anobii and LibraryThing to RDF. Example thing: "Barney's Version", the book by Mordecai Richler
- LinkedGeoData.org: LinkedGeoData is an effort to add a spatial dimension to the Web of Data. LinkedGeoData uses the information collected by the OpenStreetMap project, makes it available as RDF and interlinks this data with other knowledge bases of the Linking Open Data initiative.
See also: http://esw.w3.org/topic/AnRdfHarvesterStartingPoint
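Several entries above (for example Southampton Pubs and OpenCyc) rely on HTTP content negotiation: the same resource URI leads to an HTML page or to an RDF document depending on the Accept header sent by the client. The following is a minimal Python sketch of that idea, not a description of any particular server. It builds a request without sending it, and the `expected_document` helper is a hypothetical function that only mirrors the /id/, /description/ and /data/ URI pattern quoted in the Southampton Pubs entry; a real client would simply follow whatever redirect the server returns.

```python
from urllib.request import Request

# Media types commonly used when asking for RDF/XML vs. HTML.
RDF_XML = "application/rdf+xml"
HTML = "text/html"


def build_request(uri: str, media_type: str) -> Request:
    """Build (but do not send) an HTTP GET request asking for one media type."""
    return Request(uri, headers={"Accept": media_type})


def expected_document(resource_uri: str, media_type: str) -> str:
    """Hypothetical helper: model the URI pattern from the Southampton Pubs entry.

    RDF/XML requests map to the /data/ document, anything else to the
    /description/ (HTML) document. This is an illustration only.
    """
    if "/id/" not in resource_uri:
        raise ValueError("expected a resource URI containing '/id/'")
    target = "data" if media_type == RDF_XML else "description"
    return resource_uri.replace("/id/", f"/{target}/", 1)


pub = "http://www.johngoodwin.me.uk/pubs/id/pub1"
req = build_request(pub, RDF_XML)
print(req.get_header("Accept"))
print(expected_document(pub, RDF_XML))
print(expected_document(pub, HTML))
```

To actually dereference a URI, the request object could be passed to `urllib.request.urlopen`, which follows redirects by default; whether a given server from this list still responds is not something this sketch assumes.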
Data sets available via Dynamic RDFizers
Data sets available via SPARQL Endpoints
See Collection of SPARQL Endpoints
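As a rough illustration of how such an endpoint is typically queried, the SPARQL Protocol allows a query to be sent as the `query` parameter of an HTTP GET request. The Python sketch below only constructs such a URL; the endpoint address is a placeholder chosen for this example and does not refer to any data set on this page.

```python
from urllib.parse import urlencode

# Placeholder endpoint URL, used only for illustration.
ENDPOINT = "http://example.org/sparql"

# A generic query that asks for ten arbitrary triples.
QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10"


def sparql_get_url(endpoint: str, query: str) -> str:
    """Return a GET URL that carries the query in the 'query' parameter."""
    return endpoint + "?" + urlencode({"query": query})


url = sparql_get_url(ENDPOINT, QUERY)
print(url)
```

The result format is usually selected with an Accept header on the request; the media types a given endpoint supports vary, so check its documentation.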
Data sets you can RDFize yourself
If you have some data that needs to be RDFized, and wonder how, see this list of software projects that convert data to RDF
Data sets currently being RDFized
- Craigslist. See overdogg.com and TIWAN. Also planning for Myspace and Facebook want ads. Contact ShermanMonroe for details.
- GEMET. GEMET is the GEneral Multilingual Environmental Thesaurus of the European Environment Agency. Please ask Bernard Vatant for details.
- MusicBrainz. Please ask FrederickGiasson for details.
- US Government Data (at data.gov) A project being undertaken by the RPI Tetherless World research group. Contact Jim Hendler for details.
Data sets that would be nice to have on the Web of Data
Lots. Please feel free to add plenty :)
- Nutrition related data
- Economic data sets
- FreeDB. A database to look up CD information using the internet. Source
- GCIDE_XML. The GNU version of The Collaborative International Dictionary of English (Webster's). Available now as XML. Source
- IMDB Data. Not sure of the licensing terms. Source. Can be converted to MySQL using JMDB. Source
- Related: OMDB, The Open Media Database (Creative Commons)
- Internet Archive. Provides multiple interesting data sets.
- last.fm event data
- Open Library, a project that builds an open digital library that is supposed to contain all books that have been published. Simple data model, so wrapping it should be easy. See also Frederick's post on the open library and the BIBO ontology
- Open University Course Units. See LabSpace for an idea of what is available, currently in OU-specific XML wrapped in a zip file :(
- ReadWriteWeb: Where to find open data on the Web
- Peter Skomoroch: Some Datasets Available on the Web
- US Census Tiger/Line data on roads, zip code geography, places, etc. See also LOD Eurostat page (there is some overlap with Geonames)
- US government repositories
- US Library of Congress Catalog. Provides information about books and millions of other digital assets.
- WiserEarth.org. Community directory of organizations and individuals addressing sustainability issues.
Papers and Web Resources on serving Data on the Semantic Web
This section would benefit by being re-sorted into a logical read-order ... or by title ... for now, a quick sort by author.
- Alistair Miles et al.: Best Practice Recipes for Publishing RDF Vocabularies
- Christian Bizer, Richard Cyganiak, Tom Heath: How to publish Linked Data on the Web (Tutorial)
- Diego Berrueta, Sergio Fernández: Cooking HTTP content negotiation with Vapour
- ESW Wiki: DereferenceURI
- ESW Wiki: SparqlEndpointDescription
- Finin Ding: Characterizing the Semantic Web on the Web
- Francois Belleau: Bio2RDF: Towards A Mashup To Build Bioinformatics Knowledge System
- Frederick Giasson: Content negotiation: bad use cases I recently observed
- Frederick Giasson: Distribution of semantic web data
- Frederick Giasson: RDF dump vs. dereferenceable URIs
- Henry Story: I have a web 2.0 name ! together with Foaf enabling an enterprise and a discussion of the posts by Richard Cyganiak.
- Richard Cyganiak: Debugging Semantic Web sites with cURL
- Tim Berners-Lee: Linked Data
- Tim Berners-Lee: Browsable Data
No longer available
- Roller Blog Entries: There was a D2R Server running at http://roller.blogdns.net:2020/ which exported blog posts from a Roller Blog Server using the Atom OWL vocabulary. See SPARQLing Roller for details. The D2RQ mapping file should still be useful.