Semantic Web and microformats
Bert Bos (W3C)
PrivacyOS Wien 2009, Vienna, Austria
October 27, 2009

I'm “screen-scraping” (i.e., extracting, converting and combining) all kinds of information from various Web pages myself, and through my work I am actively trying to make that process easier and more widely used. It goes well beyond “mash-ups”: I'm using all the data, not just the APIs that the data providers made available. That makes you aware of what computers can do, and you become sensitive to how information is used even outside the Web…

On the way to the conference I saw at a train station a special promotion by the ÖBB, the Austrian railways, for people buying their tickets with their mobile phones. If you buy tickets by SMS instead of with cash, you get a 10% reduction. This is a temporary promotion to celebrate the 10 years that this has been possible, but could it be that privacy in Austria costs 10%? :-)

This is not an attempt to define fully what W3C means by the “semantic Web.” It has to do with increasing the amount of available information by increasing the number of contributing people, and with increasing the use people can make of that information by enabling re-use and automatic processing.

The “semantic Web” is not a specific kind of Web; it is a direction for development, just like “east” is a direction and not a specific place. It is the goal behind all of W3C's technologies, even if the path to that goal isn't always direct, as a result of (temporary) technical limitations, conflicting interests, and the need to maintain backwards compatibility.

The term “semantic Web,” to summarize the goal of the Web, was invented by Tim Berners-Lee in the late eighties. It is slowly getting better known, but another term is also being used at the moment which means more or less the same thing: linked data.

Two examples this year of what “linked data” can do are the site with collected data of the US government, data.gov, and a similar site by the UK government, data.gov.uk. With large amounts of data in common, open formats, they invite re-use and novel applications.

To be able to compare data from various sources, you need a common base to express the data in. RDF is W3C's answer to that. It is a (mathematical) model whose principle is that all data is normalized to (large numbers of) “triples”: the property P of object S has value V. E.g., the price (P) of this car (S) is € 20,000 (V). The model is complemented by sets of properties, each set describing a particular domain. Such a set is called an ontology.

You don't always need to actually convert data to RDF. The theoretical possibility is enough. You can often do the actual computations in a higher-level format. Low-level formats are good for theory and (mathematical) proofs, but not so much for humans. Even programmers make fewer mistakes if they work with data that they can read easily themselves. That's why we don't convert the Web to RDF. We keep using HTML, SVG, XML, etc., and even build new formats on top of those. They are at a more human level.

Microformats use a feature (categorizing elements with the rel and class attributes) that was built into HTML expressly for that purpose, although the name “microformat” and the idea that microformats themselves can be standardized were only invented much later. Tantek Çelik inspired that idea, and maybe also invented the name. The guiding principle is that machine-readable data is likely to contain errors and to be out of date, unless it is also human-readable.
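As a small sketch of that rel/class mechanism (the class name “phone” and the rel value “related” are invented here for illustration; they are not part of any agreed microformat), an ordinary HTML fragment can be labelled without changing how it looks to a reader:

    <p>
      You can reach us at
      <span class="phone">+43 1 234 56 78</span>, or read more on our
      <a rel="related" href="http://example.org/background">background page</a>.
    </p>

A program that has been told what “phone” means can now extract the number, while for a human reader the page is unchanged.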
The best is data that is made primarily for human consumption and is nevertheless machine-readable. Thus microformats are just some simple, agreed conventions for how to use HTML, because HTML is what people use anyway.

One of the microformats is called hCard; it is the Internet standard vCard expressed in HTML (and automatically convertible to vCard). vCard is an electronic business card. It encodes name, address, phone number, etc. in a machine-readable text file that can be attached to e-mail. hCard expresses the same, but in such a way that it can be included in a Web page. It works by annotating elements in an HTML document with keywords from the vCard vocabulary. The structure of the HTML file can nearly always remain unchanged, and thus stay optimized for humans. The annotation makes the relevant data available for automatic processing. In the example above, the upper half is the original, without the hCard mark-up; the lower half is the same, but with the hCard annotations added.

Microformats are designed to be simple and easy to use in many scenarios, from people editing HTML by hand to HTML being generated from content management systems, wikis, blogs, etc. They stress simplicity over generality. You can express very common things in them, but they are limited. More detailed or rarer kinds of information need other formats to make them machine-readable.

Microformats are not currently a standard. Maybe they will become standard one day; the community around them is interested in standardizing them eventually, maybe through W3C.

Two other methods of building on HTML for making information both human- and machine-readable are already W3C standards. They are much more complex, but they allow anything that can be expressed in RDF to be expressed in HTML as well. (Whether that is the best way to express data in all cases is of course a different question.)

GRDDL works by attaching a transformation (usually in XSLT) to an HTML file. The HTML can be structured in any way the author wants; the transformation takes care of creating RDF from it. Unless you can use a pre-existing GRDDL transformation, it means you need a programmer to write that transformation for you, so the technology is not accessible to everybody.

RDFa is an alternative syntax for RDF that allows the RDF to be included directly in XHTML. It is an extension to the XHTML format itself, which makes XHTML more complex, but it avoids the need for the external file that GRDDL requires. Whether RDFa will also be usable in HTML (the non-XML format) is still an open question.

There is only a small number of microformats so far, because the community around them is very careful to only accept formats that are widely useful. (And for specialized applications there are other ways, as I explained above.) The formats with the most obvious privacy implications are highlighted: hCard, XFN, and geo. That last one on its own is not privacy-sensitive: it just says that two numbers in a text are coordinates, without saying what they are the coordinates of. But when geo is used inside hCard, it indicates the address of a person.

There are no privacy features built into any of these formats, nor any permissions or licenses. It is not possible to say who may use which part of the information for which purpose. The rel-license microformat can be used to attach a copyright license to a Web page, but the contents of that license are not standardized and thus not machine-readable, except when the copyright is in the form of a link to one of the Creative Commons licenses.
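As a minimal sketch of what such hCard mark-up looks like (the name, address and coordinates are made up, but the class names vcard, fn, adr, street-address, locality, geo, latitude and longitude come from the hCard and geo vocabularies):

    <div class="vcard">
      <span class="fn">Maria Muster</span>,
      <span class="adr">
        <span class="street-address">Beispielgasse 1</span>,
        <span class="locality">Wien</span>
      </span>
      (<span class="geo">
        <span class="latitude">48.21</span>,
        <span class="longitude">16.37</span>
      </span>)
    </div>

A microformat-aware tool (such as the Operator add-on mentioned below) can turn this into a vCard or into a link to a map, while the page itself still reads as ordinary text.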
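The rel-license microformat itself is nothing more than a link whose rel attribute contains the value “license”. A minimal sketch (the choice of license here is only an illustration):

    <a rel="license"
       href="http://creativecommons.org/licenses/by/3.0/">
      This page is available under a Creative Commons Attribution license.
    </a>

The rel="license" token tells a program that the target of the link is the license for the page; only when that target is a well-known license, such as one of the Creative Commons licenses, does the program also know what the license allows.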
(But see “policy languages” below.)

XFN is probably the oldest microformat, dating from before the term microformat itself. It is aimed at bloggers who link to other bloggers. Those links can be annotated with a type, to distinguish different reasons for linking: friends, colleagues, family relations, etc. Implicitly, a blog is identified with a person, because the person himself doesn't have a URL. There are many bigger and smaller problems with this vocabulary, in particular that it is rather biased towards relations that are important in western culture, especially North American culture. Of course, any classification is somewhat arbitrary and subject to changing insights and requirements…

The first example shows the geo microformat on its own. Its function is simply to identify the given numbers as latitude and longitude. With a suitable browser (such as Firefox with the Operator add-on) the coordinates are automatically turned into a link to some maps, such as Google Maps. The second example shows the geo microformat inside the hCard mark-up. Here it functions to set the address of a person.

The semantic Web is distributed: it has multiple authors and is linked together over computer networks. That means it will have errors and contradictions. There is no help for that. The software will just have to be robust enough and not crash. Information on the Web has to be treated like any other information: check its origin and determine who wrote it and for what purpose.

Metadata, just like any other data, is created by somebody and is subject to copyrights and licenses. The fact that information is easy to collect doesn't mean it is permitted to collect it. The usage policy for some metadata is itself also metadata and should be machine-readable as well, if we really want computers to process information automatically. The rel-license microformat was already mentioned. It is the start of a solution: it links a Web page to another page that contains the license. But the license itself is not machine-readable.

Expressing policies (licenses) is a hot topic at the moment. There are many languages being developed, some general, some for specific domains. None of the general ones seem ready to become a standard yet. Of those, ccRel seems the most developed. It is a vocabulary (ontology) for RDF of common copyright terms, which can be combined in various ways. ccRel was submitted by Creative Commons to W3C, but there is at this time (October 2009) no plan to standardize it or something similar.

On the other hand, W3C does plan to standardize a new font format for fonts that are licensed for embedding in documents. The policy language in that case is very simple (“this font is licensed for embedding in documents X, Y and Z”) and limited to a very narrow domain (digital fonts).

This talk: http://www.w3.org/Talks/2009/1026-Various-Privacy-Vienna/all

Copyright © 2009 W3C (MIT, ERCIM, Keio)