Semantic Web and microformats
Bert Bos (W3C)
PrivacyOS Wien 2009, Vienna, Austria
October 27, 2009

I'm “screen-scraping” (i.e., extracting, converting and combining) all kinds of information from various Web pages myself, and through my work I am actively trying to make that process easier and more widely used. It goes well beyond “mash-ups”: I'm using all the data, not just the APIs that the data providers made available. That makes you aware of what computers can do, and you become sensitive to how information is used even outside the Web…

On the way to the conference I saw at a train station a special promotion by the ÖBB, the Austrian railways, for people buying their tickets with their mobile phones. If you buy tickets by SMS instead of with cash, you get a 10% reduction. This is a temporary promotion to celebrate the 10 years that this has been possible, but could it be that privacy in Austria costs 10%? :-)

This is not an attempt to define fully what W3C means by the “semantic Web.” It has to do with increasing the amount of available information by increasing the number of contributing people, and with increasing the use people can make of that information by enabling re-use and automatic processing.

The “semantic Web” is not a specific kind of Web; it is a direction for development, just like “east” is a direction and not a specific place. It is the goal behind all of W3C's technologies, even if the path to that goal isn't always direct, as a result of (temporary) technical limitations, conflicting interests, and the need to maintain backwards compatibility.

The term “semantic Web,” to summarize the goal of the Web, was invented by Tim Berners-Lee in the late eighties. It is slowly getting better known, but another term is also being used at the moment which means more or less the same thing: linked data.

Two examples this year of what “linked data” can do are the site with collected data of the US government, data.gov, and a similar site by the UK government, data.gov.uk. With large amounts of data in common, open formats, they invite re-use and novel applications.

To be able to compare data from various sources, you need a common base to express the data in. RDF is W3C's answer to that. It is a (mathematical) model whose principle is that all data is normalized to (large numbers of) “triples”: the property P of object S has value V. E.g., the price (P) of this car (S) is € 20,000 (V). The model is complemented by sets of properties, each set describing a particular domain. Such a set is called an ontology.

You don't always need to actually convert data to RDF. The theoretical possibility is enough. You can often do the actual computations in a higher-level format. Low-level formats are good for theory and (mathematical) proofs, but not so much for humans. Even programmers make fewer mistakes if they work with data that they can read easily themselves. That's why we don't convert the Web to RDF. We keep using HTML, SVG, XML, etc., and even build new formats on top of those. They are at a more human level.

Microformats use a feature (categorizing elements with the rel and class attributes) that was built into HTML expressly for that purpose, although the name “microformat” and the idea that microformats themselves can be standardized were only invented much later. Tantek Çelik inspired that idea, and maybe also invented the name. The guiding principle is that machine-readable data is likely to contain errors and to be out of date, unless it is also human-readable.
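As a small sketch of that rel/class mechanism (the class name “phone” and the rel value “related” are invented here for illustration; they are not part of any agreed microformat), an ordinary HTML fragment can be labelled without changing how it looks to a reader:

    <p>
      You can reach us at
      <span class="phone">+43 1 234 56 78</span>, or read more on our
      <a rel="related" href="http://example.org/background">background page</a>.
    </p>

A program that has been told what “phone” means can now extract the number, while for a human reader the page is unchanged.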
The best is data that is made primarily for human consumption and is nevertheless machine-readable. Thus microformats are just some simple, agreed conventions for how to use HTML, because HTML is what people use anyway.

One of the microformats is called hCard; it is the Internet standard vCard expressed in HTML (and automatically convertible to vCard). vCard is an electronic business card. It encodes name, address, phone number, etc. in a machine-readable text file that can be attached to e-mail. hCard expresses the same, but in such a way that it can be included in a Web page. It works by annotating elements in an HTML document with keywords from the vCard vocabulary. The structure of the HTML file can nearly always remain unchanged, and thus stay optimized for humans. The annotation makes the relevant data available for automatic processing. In the example above, the upper half is the original, without the hCard mark-up; the lower half is the same, but with the hCard annotations added.

Microformats are designed to be simple and easy to use in many scenarios, from people editing HTML by hand to HTML being generated from content management systems, wikis, blogs, etc. They stress simplicity over generality. You can express very common things in them, but they are limited. More detailed or rarer kinds of information need other formats to make them machine-readable.

Microformats are not currently a standard. Maybe they will become standard one day; the community around them is interested in standardizing them eventually, maybe through W3C.

Two other methods of building on HTML for making information both human- and machine-readable are already W3C standards. They are much more complex, but they allow anything that can be expressed in RDF to be expressed in HTML as well. (Whether that is the best way to express data in all cases is of course a different question.)

GRDDL works by attaching a transformation (usually in XSLT) to an HTML file. The HTML can be structured in any way the author wants; the transformation takes care of creating RDF from it. Unless you can use a pre-existing GRDDL transformation, it means you need a programmer to write that transformation for you, so the technology is not accessible to everybody.

RDFa is an alternative syntax for RDF that allows the RDF to be included directly in XHTML. It is an extension to the XHTML format itself, which makes XHTML more complex, but it avoids the need for the external file that GRDDL requires. Whether RDFa will also be usable in HTML (the non-XML format) is still an open question.

There is only a small number of microformats so far, because the community around them is very careful to only accept formats that are widely useful. (And for specialized applications there are other ways, as I explained above.) The formats with the most obvious privacy implications are highlighted: hCard, XFN, and geo. That last one on its own is not privacy-sensitive: it just says that two numbers in a text are coordinates, without saying what they are the coordinates of. But when geo is used inside hCard, it indicates the address of a person.

There are no privacy features built into any of these formats, nor any permissions or licenses. It is not possible to say who may use which part of the information for which purpose. The rel-license microformat can be used to attach a copyright license to a Web page, but the contents of that license are not standardized and thus not machine-readable, except when the copyright is in the form of a link to one of the Creative Commons licenses.
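As a minimal sketch of what such hCard mark-up looks like (the name, address and coordinates are made up, but the class names vcard, fn, adr, street-address, locality, geo, latitude and longitude come from the hCard and geo vocabularies):

    <div class="vcard">
      <span class="fn">Maria Muster</span>,
      <span class="adr">
        <span class="street-address">Beispielgasse 1</span>,
        <span class="locality">Wien</span>
      </span>
      (<span class="geo">
        <span class="latitude">48.21</span>,
        <span class="longitude">16.37</span>
      </span>)
    </div>

A microformat-aware tool (such as the Operator add-on mentioned below) can turn this into a vCard or into a link to a map, while the page itself still reads as ordinary text.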
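The rel-license microformat itself is nothing more than a link whose rel attribute contains the value “license”. A minimal sketch (the choice of license here is only an illustration):

    <a rel="license"
       href="http://creativecommons.org/licenses/by/3.0/">
      This page is available under a Creative Commons Attribution license.
    </a>

The rel="license" token tells a program that the target of the link is the license for the page; only when that target is a well-known license, such as one of the Creative Commons licenses, does the program also know what the license allows.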
(But see “policy languages” below.)

XFN is probably the oldest microformat, dating from before the term microformat itself. It is aimed at bloggers who link to other bloggers. Those links can be annotated with a type, to distinguish different reasons for linking: friends, colleagues, family relations, etc. Implicitly, a blog is identified with a person, because the person himself doesn't have a URL. There are many bigger and smaller problems with this vocabulary, in particular that it is rather biased towards relations that are important in western culture, especially North American culture. Of course, any classification is somewhat arbitrary and subject to changing insights and requirements…

The first example shows the geo microformat on its own. Its function is simply to identify the given numbers as latitude and longitude. With a suitable browser (such as Firefox with the Operator add-on) the coordinates are automatically turned into a link to some maps, such as Google Maps. The second example shows the geo microformat inside the hCard mark-up. Here it functions to set the address of a person.

The semantic Web is distributed: it has multiple authors and is linked together over computer networks. That means it will have errors and contradictions. There is no help for that. The software will just have to be robust enough and not crash. Information on the Web has to be treated like any other information: check its origin and determine who wrote it and for what purpose.

Metadata, just like any other data, is created by somebody and is subject to copyrights and licenses. The fact that information is easy to collect doesn't mean it is permitted to collect it. The usage policy for some metadata is itself also metadata and should be machine-readable as well, if we really want computers to process information automatically. The rel-license microformat was already mentioned. It is the start of a solution: it links a Web page to another page that contains the license. But the license itself is not machine-readable.

Expressing policies (licenses) is a hot topic at the moment. There are many languages being developed, some general, some for specific domains. None of the general ones seem ready to become a standard yet. Of those, ccRel seems the most developed. It is a vocabulary (ontology) for RDF of common copyright terms, which can be combined in various ways. ccRel was submitted by Creative Commons to W3C, but there is at this time (October 2009) no plan to standardize it or something similar.

On the other hand, W3C does plan to standardize a new font format for fonts that are licensed for embedding in documents. The policy language in that case is very simple (“this font is licensed for embedding in documents X, Y and Z”) and limited to a very narrow domain (digital fonts).

This talk: http://www.w3.org/Talks/2009/1026-Various-Privacy-Vienna/all

Copyright © 2009 W3C (MIT, ERCIM, Keio)