Tim Berners-Lee
Date: 2010-05-26, last change: $Date: 2023/06/05 17:35:26 $
Status: personal view only. Editing status: first draft. This was a talk I gave about Open Data at GovCon in 2010. When promoting to be a DesignIssue post, I looked for images on the web of bags of chips, and found recent ones which pretty much match the original. Added the abstract in 2023.



Linked Data is Like a Bag of Chips

The value of data is the insight which comes when different bits of data are joined together. For that process to provide value, the world must contain all kinds of information of different types, and it must be linked together. Linked Data involves using ontologies. But if you are a developer, how do you pick those ontologies? The art is to use several different ontologies in the same document, the same message. In a typical application, part of what you need to express will be a very common idea (like, say, the title of a document), while part of the information will be concepts shared within particular groups or domains (like, say, blood pressure). And some will be obscure data (like, say, blood pressure monitor calibration data) which is only understood by device engineers. Putting all this information together in a mixture of ontologies is the best thing to do. Some ontologies you will find, some you may work out with others toward consensus, and some you might make up that day for that project. Using each of those ontologies gets you the most total interoperability. A bag of chips has all kinds of information of different types, and each user (the customer, the checkout scanner, the celiac, the nutritionist) uses different bits and ignores the rest. With its mixture of ontologies and its rule of ignoring data you don't need or don't understand, the world of Linked Data is quite like a bag of chips.


Gov 2.0 Expo 2010: Tim Berners-Lee, "Open, Linked Data for a Global Community" May 26, 2010

Transcript - Bag of Chips talk

So I have a limited time so I'm just going to answer one frequently asked question about Linked Data itself. You know I've been going about saying that we should give people stars when they put government open data up.

Different levels of open data

We should give one huge star for putting anything up -- just for getting that data up there at all. We should, because there are so many social issues you have to get around to actually do it: issues with yourself and, maybe, with the bureaucracy around you, before you can get the data out there.

But once you have your one-star data out there, there's an extra star if it's in a machine-readable format. It's not a scan of a fax but a data table -- it's actually something like a spreadsheet.

You get an extra star if it's not encoded in a proprietary format but in an open standard: you get three stars for putting it in an open format. I'll include CSV files -- I'll give you three stars for a CSV file -- it's good, Comma Separated Values.

But for me to get the full value out of your data, you have to put it in Linked Data format. That means that each of the things you're talking about is going to get a URL -- one of those things starting with "http:".

Each thing, and also each of its properties, gets a URI. If the thing is a city, for example, its population and its area will each get a URL.

That's putting it in the Linked Data format. That means people can point to it and say "that city": when they say "Philadelphia" they mean the same thing as all these other people mean when they say "Philadelphia". So this is data about Philadelphia that allows people to link to it, which is very powerful. So if you do that, that gets you 4 stars.
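
A minimal sketch of what that looks like in practice, assuming a single table row about a city: each thing and each property is named by a URL, so the row becomes a set of (thing, property, value) statements. The `example.org` addresses, the column names and the figures are placeholders invented for illustration, not real identifiers or statistics.

```python
# Hypothetical namespaces: example.org is a placeholder, not a real data source.
CITY = "http://example.org/city/"
VOCAB = "http://example.org/vocab#"


def row_to_triples(row):
    """Turn one table row into (thing, property, value) statements named by URLs."""
    subject = CITY + row["name"].replace(" ", "_")
    return [
        (subject, VOCAB + column, value)
        for column, value in row.items()
        if column != "name"
    ]


# Illustrative figures only, not real statistics.
row = {"name": "Philadelphia", "population": 1500000, "area_km2": 350}
triples = row_to_triples(row)
for triple in triples:
    print(triple)
```

Because the subject is a URL rather than the bare string "Philadelphia", anybody else's data can point at exactly the same city.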

You get the fifth star by actually doing the work of linking your data to other data.

So this means looking at that data and realizing that you've got "population" and putting a note in your data to say "when I say population, it means the same thing as when these people say population". And that linking of the objects, properly, that's the way Linked Data works. When it's linked up then you can do all kinds of different things, and it's very exciting.
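
As a rough sketch of that fifth star, assuming data held as (thing, property, value) statements: a link is just one more statement. Here `owl:sameAs` (a real term from the OWL vocabulary) says that two URLs name the same city. Everything under `example.org` and `other.example` is hypothetical, and the helper follows such links only a single hop, which is a simplification.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Hypothetical URLs and values, for illustration only.
MY_CITY = "http://example.org/city/Philadelphia"
THEIR_CITY = "http://other.example/place/phl"

mine = [
    (MY_CITY, "http://example.org/vocab#population", 1500000),
    (MY_CITY, OWL_SAME_AS, THEIR_CITY),  # the link that earns the fifth star
]
theirs = [
    (THEIR_CITY, "http://other.example/terms#timeZone", "America/New_York"),
]


def facts_about(uri, *datasets):
    """Collect (property, value) pairs for uri, following sameAs links one hop."""
    triples = [t for dataset in datasets for t in dataset]
    names = {uri}
    for s, p, o in triples:
        if p == OWL_SAME_AS and uri in (s, o):
            names.update((s, o))
    return [(p, o) for s, p, o in triples if s in names and p != OWL_SAME_AS]


facts = facts_about(MY_CITY, mine, theirs)
for prop, value in facts:
    print(prop, value)
```

Without the link statement, the two files would each describe "a city" with no way for a program to know it is the same one.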

Ignore data you don't understand or you don't need

Now one of the frequently asked questions I get then is: "Ok, but surely it's just another syntax right? I've put my data up  but it still doesn't mean that you can understand it."

The whole point of Linked Data, unlike any other technology to date -- that's probably true, yeah -- is that your data can use terms that I don't understand. And that's OK.

Stepping back from data for a moment, what's interesting is that when anybody communicates any information at all, typically they are using a mixture of different vocabularies, lots and lots of different languages, and this is the key to how the world works, and how this Linked Data stuff works.

I was explaining this to somebody when it came up over lunch, and on the lunch table we actually happened to have this packet of chips.

The Front of the bag of chips

When you look at this packet of chips, at the label on this packet of chips, in fact what language is it in?

Well, it says "Kettle Classic Crunchy Potato Chips". That's clearly American English: if it was English English it would say potato crisps.

The Back of the bag of chips

But it's not just in English, though. When you turn it over you'll see that on the other side there are these nutrition facts. Now those nutrition facts are in a language which is understood by pretty much all Americans. Obviously the Food and Drug Administration has got a lot to do with it, as has the US food industry -- so this label actually won't allow you to sell this thing in Europe, because the language is still different in Europe.

But all those terms, such as the serving size and the number of calories per serving and the number of servings per packet, those are all things that everybody in the US looks for. When someone is looking for the number of calories, they are helped by this very standard packaging. That is understood easily in the USA because that's a little sub-language here understood by the US food industry.

Meanwhile, what about this thing down here, the UPC barcode? That UPC barcode is a language which is understood by all retailers globally, not just the food industry. It's a much smaller, simpler piece of information. It's just one number, but it's a number which you can scan in any till pretty much anywhere in the world and the till will understand that language.

Now actually there are other things on the bag: there's allergy information, maybe, which a lot of people in the audience here won't even understand. Sometimes I don't understand that allergy information. But some people really have to check that their allergy isn't on that list -- so people with serious allergies learn that code.

So here is a bag of chips and the bag for it has been printed and it contains all that information in these different languages! Look, at the bottom -- there's actually a small number which says 110#71400 and I don't understand what that means at all. It's probably something put on by the printer as a batch number. It's not about the chips at all, it's about the bag itself.

When I am wondering whether to take this bag of chips, do I pick it up and read the label and then say "I don't know what '110#71400' means"? No.

If I was an XML processor I might because you know I might think that that might violate the schema, if it was an XML schema.

If I was using object-oriented technology to do this it might say "don't read this, it isn't a valid object".

But in real life, with a real bag of chips, I just ignore what I don't understand and what I don't need.

With HTML tags, though, and with Linked Data, too, if my data model doesn't have a slot for that data, like the bag of chips, I can ignore all the stuff I don't understand.

I see that it says "Crunchy potato chips", I give a quick glance at the number of calories, and I open it and eat it.
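
A small sketch of that reading rule, assuming the label is held as (thing, property, value) statements: each reader names the properties it understands and silently skips everything else, rather than rejecting the whole label. All the URLs are hypothetical, and the calorie count and barcode digits are placeholder values; only the batch number comes from the bag itself.

```python
BAG = "http://example.org/product/chips"  # hypothetical URL for this bag

# Hypothetical vocabularies, one per "language" printed on the bag.
NAME = "http://food.example/terms#productName"
CALORIES = "http://nutrition.example/terms#caloriesPerServing"
UPC = "http://retail.example/terms#upc"
BATCH = "http://printer.example/terms#batch"

label = [
    (BAG, NAME, "Crunchy Potato Chips"),
    (BAG, CALORIES, 150),        # placeholder value
    (BAG, UPC, "000000000000"),  # placeholder value
    (BAG, BATCH, "110#71400"),
]


def read(triples, understood):
    """Keep statements whose property the reader understands; skip the rest without error."""
    return {p: o for _, p, o in triples if p in understood}


shopper = read(label, {NAME, CALORIES})
till = read(label, {UPC})
print(shopper)
print(till)
```

Neither reader ever looks at the printer's batch number, and neither fails because it is there.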

One message, many languages

So Linked Data works like that. When you design a project using Linked Data, you are sending data from one place to another. In every message that's sent in this application, or for example in every data file which you expose on the web by writing a little PHP script or little Python script that pulls data out of your system, every line may be taken out of a different vocabulary.

When you design the system, instead of writing a great big monolithic schema, you look at all the terms out there in systems people have already designed. The beauty of it is that with Linked Data you pick -- you cherry pick -- different sets of terms, different vocabularies, that are out there.

Suppose you're, you know, recording data about security incidents of people entering buildings. You'll find one vocabulary which has already got start-date and end-date, but doesn't have terms about people.

But there are sub-languages made by other departments already: one allows you to describe people, and in another there are terms about buildings. You may have to invent a term for the concept of a Security Incident, but the dates and people and buildings will all be described by different existing languages. The Linked Data technology allows you to just use those. And your data will be interoperable with those systems.
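
To make the security incident example concrete, here is a sketch under stated assumptions: three existing vocabularies (dates, people, buildings) are imagined at hypothetical `.example` addresses, and only the Security Incident class is newly invented. `rdf:type` is the one real term. The script counts how many terms come from each vocabulary.

```python
from collections import Counter

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical vocabularies standing in for ones other groups already publish.
TIME = "http://time.example/terms#"
PEOPLE = "http://people.example/terms#"
BUILDINGS = "http://buildings.example/terms#"
MINE = "http://security.example/terms#"  # the only vocabulary invented here

incident = "http://security.example/incident/42"
record = [
    (incident, RDF_TYPE, MINE + "SecurityIncident"),
    (incident, TIME + "startDate", "2010-05-26T09:00:00"),
    (incident, TIME + "endDate", "2010-05-26T09:05:00"),
    (incident, PEOPLE + "person", "http://people.example/id/visitor-7"),
    (incident, BUILDINGS + "building", "http://buildings.example/id/hq"),
]


def namespace(uri):
    """Return the vocabulary part of a term URI (up to the last '#' or '/')."""
    cut = max(uri.rfind("#"), uri.rfind("/"))
    return uri[: cut + 1]


# Terms used: every property, plus the class named by rdf:type.
terms = {p for _, p, _ in record} | {o for _, p, o in record if p == RDF_TYPE}
by_vocabulary = Counter(namespace(term) for term in terms)
for vocabulary, count in sorted(by_vocabulary.items()):
    print(count, vocabulary)
```

In this made-up record, one term out of six is new; the rest are borrowed.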

A very large number of the terms you're going to use in your new application will come from ontologies that other people have already developed. These sets of terms -- vocabularies -- we call them ontologies -- that people have already developed, are your friends.

Pick your battles

Some of these languages, like the little number at the bottom of the bag of chips you are printing, you'll throw out there and you won't have to discuss with anybody. You'll invent your own one and that's perfectly fine to do that. You can invent your own one for data you don't really need to share with anyone, where you don't really need interoperability with anyone.

The way this works may seem a technical thing, if you like, but it has a huge social impact. It means that you can actually get work done in a real government environment. When you're sitting there in some group, well buried within some huge bureaucracy, you can just create a data file which will be useful to all kinds of people to different extents.

You don't have to make a government standard for it. You make up some of the terms you are using yourself. Other terms you use are designed by people you are nearby. Other terms you use are national standards, some of them are international standards. Some are developed by special interest groups which are global but very very narrow in domain of interest.

That's how these vocabularies get developed: it's when a small number of people get together and define and work out these terms. It is hard work!

And then you just cherry-pick them and you use them. You can go and do that without starting a very big committee, without getting lots and lots of people together to define a new application. And the terms you define locally? That's OK.

Publish now, connect later

You can even define terms which you think other people ought to have defined. Suppose I wanted to actually put this data on the web in RDF. I go to the FDA website and I can't find any actual machine-readable lists of what these terms are, like calories and so on. So I might just look at the data, type it up -- type a little file there as a Linked Data ontology, stick it on the web, and write to the FDA and say "Well, I've made a set of terms for the nutrition facts. It would be really good if you did, but for the moment I'm gonna use my own". The great thing about Linked Data is that if later they do define their own terms and they put them up at fda.gov, and I've still got mine in my web space, I can actually say afterwards, retrospectively -- I can publish, in metadata -- that my Calories Per Serving means exactly the same thing as their Calories Per Serving. Then all my existing data will work when people query using their terms, because that's how the system works. The systems can pick all that metadata up and use it.
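
A sketch of how that retrospective statement could work, assuming a very simple query function: `owl:equivalentProperty` is a real OWL term, while both Calories Per Serving URLs are hypothetical (the agency address is a stand-in, not an actual FDA identifier) and the value is a placeholder. Real systems do fuller reasoning; this one follows a single equivalence.

```python
OWL_EQUIVALENT_PROPERTY = "http://www.w3.org/2002/07/owl#equivalentProperty"

# Hypothetical terms: mine, published first, and the agency's, published later.
MY_CALORIES = "http://my.example/nutrition#caloriesPerServing"
THEIR_CALORIES = "http://agency.example/terms#caloriesPerServing"

data = [
    ("http://my.example/product/chips", MY_CALORIES, 150),  # placeholder value
]


def query(triples, prop):
    """Return (thing, value) pairs for prop or any property declared equivalent to it."""
    wanted = {prop}
    for s, p, o in triples:
        if p == OWL_EQUIVALENT_PROPERTY and prop in (s, o):
            wanted.update((s, o))
    return [(s, o) for s, p, o in triples if p in wanted]


before = query(data, THEIR_CALORIES)

# Later, one statement of metadata is published; the old data is untouched.
data.append((MY_CALORIES, OWL_EQUIVALENT_PROPERTY, THEIR_CALORIES))
after = query(data, THEIR_CALORIES)

print(before)
print(after)
```

The existing data file never changes; only the one mapping statement is added.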

So it's not a top-down system.

Linked Data doesn't need everybody to agree on all the terms. Most of the terms and most of the data out there most people don't understand. But there's a certain number of terms like Start Time and End Time and probably Temperature and things which pretty much most systems understand. Latitude and longitude are great because, given those, most systems will be able to put things on a map. We've already got URIs for those.
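
As an illustration of why shared terms like these matter, here is a sketch assuming two unrelated datasets that both use the latitude and longitude terms of the W3C Basic Geo vocabulary (those two URIs are real). The datasets themselves, their other terms and the coordinates are invented for the example.

```python
GEO = "http://www.w3.org/2003/01/geo/wgs84_pos#"  # W3C Basic Geo vocabulary

# Two unrelated, hypothetical datasets that happen to share the geo terms.
buildings = [
    ("http://buildings.example/id/hq", GEO + "lat", "39.95"),
    ("http://buildings.example/id/hq", GEO + "long", "-75.17"),
    ("http://buildings.example/id/hq", "http://buildings.example/terms#floors", 12),
]
sensors = [
    ("http://sensors.example/id/7", GEO + "lat", "38.90"),
    ("http://sensors.example/id/7", GEO + "long", "-77.04"),
    ("http://sensors.example/id/7", "http://sensors.example/terms#calibration", "x9"),
]


def map_points(triples):
    """Return {thing: (lat, long)} for every thing that has both coordinates."""
    lat, lon = {}, {}
    for s, p, o in triples:
        if p == GEO + "lat":
            lat[s] = float(o)
        elif p == GEO + "long":
            lon[s] = float(o)
    return {s: (lat[s], lon[s]) for s in lat if s in lon}


points = map_points(buildings + sensors)
for thing, position in points.items():
    print(thing, position)
```

The mapping code knows nothing about floors or calibration, yet it can place both things.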

All these terms are defined by URIs so that you can look them up and see who owns them.

You can find them in search engines. When you are doing your cherry picking you can use search engines like Swoogle, or like Sindice [Now, try LOV] -- semantic indexes. There are various search engines out there where you could go.

You say "I'm looking for a term for Start Time, show me all the things to do with 'start time'" and you'll find a lot of different ontologies. You can pick the ones which come from a source you respect.

So there, that is the most critical thing to understand about why doing things with Linked Data is fundamentally different. It's why doing stuff with Linked Data is something that is ground up.

It is little pieces loosely joined.

It's not a top-down thing.

So now you'll know what to say if somebody comes after you since you're using Linked Data: "Oh, is that that Semantic Web thing? Surely you know you'll never get everybody in the whole world to agree about a great big taxonomy of everything! That'll never work!" You can say "Actually, no, it's not like that. It's like a bag of chips!"

Thanks



Tim BL