Sender: mho@w3.org
Date: Mon, 29 Oct 2001 09:04:25 +0100
From: Martyn Horner
X-Mailer: Mozilla 4.78 [en] (X11; U; Linux 2.2.12-20 i686)
X-Accept-Language: en
To: Eric Miller
Subject: PRIMER (RDFCore): Requested text for primer

Eric,

I've just realized that I never submitted the piece you asked for for the primer (`Martyn agreed to write up his router example'). It's been lying around on my disk for weeks!

I'm sending it to you for a quick check rather than labour the maillist with it (they seem to have enough they need to ignore...). Is that OK?

M

--
Martyn Horner
Profium, Les Espaces de Sophia, Immeuble Delta, B.P. 037,
F-06901 Sophia-Antipolis, France
Tel. +33 (0)4.93.95.31.44   Fax. +33 (0)4.93.95.52.58
Mob. +33 (0)6.21.01.54.56
Internet: http://www.profium.com


RDF in action: information routers

RDF is not a theoretical concept: it is in service, directly and indirectly, in the real world. RDF is a major component of the Semantic Web, but it has much valuable work to do on and off the World Wide Web.

The world is full of information. Behind the millions of pages on the Internet's publicly visible part, the Web, many times as many documents flow in and out of organizations via email, cross-company networks and always-on information `feeds'. Every document that passes along the wires has to be inspected, processed or re-routed. A document written by one human being has to be read by another before anybody knows its worth or where it should be redirected. This is fine for a person-to-person email but, for information destined for broad circulation, it can be expensive, often reducing the value of the information by raising its handling cost or simply by making it late.

For example, when an individual subscribes to a source of news, it is usually on the understanding that everything in that feed is of interest and so everything will be delivered without question. For the distributor to sort out the interesting items manually would be time-consuming, expensive and boring; so instead we accept dozens of emails and delete most of them every morning. And of course that too is time-consuming, expensive and boring. Subscription to some less self-critical sources is a step to be taken very seriously.

When a company subscribes to a news feed, it may be risking a deluge of unwanted data. If it intends to circulate the information within the company or to a broad range of clients, it must either check every document by eye or invest in extra software technology. Without such protection, the company's networks will soon collapse under the load, or its clients will consider themselves willfully `spammed' and withdraw their custom. The redirection of such feeds is therefore a matter of the utmost commercial sensitivity in a context of huge and increasing volumes and complexity of data. The technology concerned is `routing' and, in the most modern cases, it relies on RDF.

The traditional need for human inspection of incoming documents comes from the fact that, on its own, text has no value. It has value only when you know what it is about, what authority is its source and who it is intended for. Everything else is, as we know, just material for spamming. For a software agent to recognize a document's worth, it must have access to an evaluation that is consistently readable, whatever the format of the document, and reliable in its description. For those two objectives we need an internationally standardized language and a globally recognized set of values.
These are RDF and RDF Schemas such as Dublin Core and PRISM. The longed-for independent evaluation takes the form of an associated RDF document.

Not every document from every information source comes with an associated RDF description... yet. Almost every serious source, however, supplies some value-based annotation in the form of metadata, the significant content of RDF. News feeds, for example, generally arrive in one of a selection of annotated formats, mostly based on XML, such as NewsML or XMLNews. Most standards-oriented companies are adding freely accessible metadata to their document formats. Adobe, for example, recently announced XMP, whereby metadata can be inserted into (and, more importantly, extracted from) PDF documents. The message from such companies is that, even if you cannot understand the contents, or have no right to read them, you are entitled to know enough to make an evaluation for your own use or for clients who can use the information. For freely available information, standards are the key.

Now, this basic process (the source embeds standard annotations; the annotations are used to divert and sort documents) is certainly not new. The email (SMTP) and news (NNTP) protocols use standard keyword-value-pair headers which are fundamental to their operation: such documents are marked up according to known and publicized standards. What is new is to normalize all these local formats to a general one, and thereby be able to appeal to a globally consistent set of values in making judgments.

For a universal router to do its job, it needs to cancel out these variations in format. Until the world adopts one standard, this will be a matter of tact and ingenuity, but the existence of a core standard is important here. When a slew of formats and value systems must be compared, it is safer to convert everything to one standard first and then compare, rather than to do it piecemeal - and that standard must be broader than all the others. Again, the RDF (and RDF Schema) standards are the natural choice.

An Information Router collects metadata and stores it (rather like an enormous RDF document describing perhaps millions of resources at once). This metadata store holds the descriptions in exactly the terms of RDF. They can therefore be exported or imported as industry-standard RDF without loss or confusion. World-wide, repositories of metadata may be synchronized and refreshed by exchanging RDF. While humans are exchanging images, videos and news items, metadata servers are exchanging compact RDF evaluations of them (of the images, videos and news items, not of the humans). The actual documents described, orders of magnitude larger than the metadata, can be stored elsewhere or simply left where they are (located by URI, of course). The metadata is compact and loaded with value. Judgments about distributing material can be made in a context of universally accepted and agreed values (the standard predicate systems like Dublin Core) and a vast number of alternatives, all without moving the actual documents around or indeed even looking at them, by computer or by human eye.

Judgments are made by applying RDF `queries': formulae for testing the value of a document to the reader - whether the subject is interesting, the content suitable, the author respected, the source reliable, the document accessible, the cost reasonable, the language intelligible, the conclusion desirable, the format tractable, the medium handleable, and so on. The actual form of a query varies from product to product.
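Before looking at query forms, it helps to see what one entry in such a metadata store might look like. The sketch below is illustrative only - the item URI and the property values are invented - but it uses the real RDF and Dublin Core namespaces:

  <?xml version="1.0"?>
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
    <!-- One entry in the store: a compact evaluation of a news item
         held elsewhere; all names and values here are invented. -->
    <rdf:Description rdf:about="http://news.example.org/items/4711">
      <dc:title>Central bank raises interest rates</dc:title>
      <dc:creator>A. Reporter</dc:creator>
      <dc:date>2001-10-29</dc:date>
      <dc:subject>Economics</dc:subject>
      <dc:language>en</dc:language>
    </rdf:Description>
  </rdf:RDF>

Note that the news item itself never enters the store: the rdf:about URI is all the router needs to locate it.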
(In any case, the consumer would be given a graphical way to express his or her wishes rather than writing queries by hand.) In one case, a query takes the form of a modified RDF description which, if you like, asks to be proved or disproved by a body of metadata. So an RDF description stating that a document exists with the title `Financial history of Belize' can be viewed as a request to find such a document (a sketch of such a query appears at the end of this piece). The news distributor's server runs, in addition to the usual server software, one of these Information Router packages; it applies queries on behalf of its clients and delivers just those documents that survive the evaluation.

If a complex, multi-layered query describing just what it takes to please you is associated with your name as a subscriber then, using software available today, you can guarantee that what you are sent is exactly and only what you need. It's an end to spam, thanks to RDF.
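To make the Belize example concrete, here is what the `modified description' style of query might look like. The syntax is a sketch under the same invented conventions as the store entry above; real products each have their own query forms. Omitting the rdf:about attribute lets the description constrain a document rather than name one:

  <?xml version="1.0"?>
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
    <!-- A query in the shape of a description: `find any resource
         about which these statements can be proved'. Sketch only. -->
    <rdf:Description>
      <dc:title>Financial history of Belize</dc:title>
      <dc:language>en</dc:language>
    </rdf:Description>
  </rdf:RDF>

Any resource whose stored description entails these statements satisfies the query, and the corresponding document is delivered to the subscriber.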