Personal information management and the semantic web

Contribution to the Bristol meeting debate

Jérôme Euzenat (Jerome.Euzenat@inrialpes.fr)
INRIA Rhône-Alpes

Personal information (e.g., agendas, address books, bibliographies) has several advantages with regard to the semantic web:

it can be found in great quantities over the web;
it is structured (and relatively standardized);
people are not shy about entering it;
it is still difficult to search on the web.

So, if the semantic web could provide some help in dealing with personal information (or PIM data here), it could prove really useful right away. It could also produce high-quality data that could be used for other applications.

Considering the existence of various independent initiatives for accommodating this data to the XML world in general and the semantic web in particular, I sketch a few ideas about what could be done for dealing with PIM data in a semantic web embryo.

Think link

In [Walsh 2002], Norman Walsh made the relevant diagnosis that though PIM data can now easily be extracted from digital organisers, they have the drawback of not being connected. The links between a conference schedule and its address, or a conference presentation and the paper presented are missing. We already faced that problem in trying to semantically annotate the web pages of the ISWC2002 conference.

There are two ways of introducing links in a PIM system:

identifying the object by a URL, i.e. by providing a handle to the identified application (or class of applications) that can handle it. This is similar to URL schemes or to the APPL resource used on some systems to identify the application. On the Palm, many applications know that there can be various DOC document handlers, and the [MegaWiki 2002] application generalizes this by adding links virtually anywhere. This solution is general, modular and scalable, and can be used for interoperability within a whole PIM system such as a PDA (e.g., busstop=MTro:Paris/Montparnasse);
identifying the class of an object in an "ontology" independently of the application that can handle it. This solution has the advantage of being regular, typed and extensible. Moreover, it exposes the semantics.

We can consider that the first version is the web way, and the second the semantic web way. In the former case, only the application knows the meaning of the data chunk. In the latter, the semantics of the chunk is transparent and can be handled, at various levels of elaboration, by any application that wants to (and by different applications in different contexts).
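The contrast between the two kinds of link can be made concrete with a minimal Python sketch. The "field=APPL:payload" syntax is read off the busstop example above; the TypedLink structure and the class name "Location" are assumptions introduced here for illustration only.

```python
from dataclasses import dataclass
from typing import Tuple


def parse_handle(link: str) -> Tuple[str, str, str]:
    """Split 'field=APPL:payload' into (field, application, payload)."""
    field, _, target = link.partition("=")
    application, _, payload = target.partition(":")
    return field, application, payload


@dataclass(frozen=True)
class TypedLink:
    """A link whose target is described by an ontology class (hypothetical)."""
    relation: str      # e.g. "busstop"
    target_class: str  # assumed class name, e.g. "Location"
    value: str


# Web way: the payload is opaque to everything except the "MTro" application.
handle = parse_handle("busstop=MTro:Paris/Montparnasse")

# Semantic web way: any application that understands "Location" can use it.
typed = TypedLink("busstop", "Location", "Paris/Montparnasse")
```

In the first case a consumer can only dispatch on the application name; in the second it can decide what to do from the class of the target.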

In order to build meaningful links between PIM data, a data model or ontology must be elaborated. It must be interconnected, i.e., it must define the relationships that can exist between objects.

Towards a global model for personal information

Our feeling is that there is an opportunity for the group meeting in Bristol to put forth a "standard" minimal model for these data. It is not necessary to have a precise model of the world but rather a set of concepts that:

are precise enough to cover the high-level concepts found in PIM applications, together with the specification of their relationships;
offer enough detail that the concepts shareable among commonsense applications are already specified;
can be refined further for the needs of particular applications;
are expressed in RDF with a canonical encoding in XML that guarantees straightforward syntactic translation to standard formats.
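As an illustration of what a "straightforward syntactic translation" could look like, here is a minimal Python sketch that turns an XML record into vCard 3.0 text. The XML element names (person, fn, family, given, email) and the sample values are assumptions made for this example; they are not taken from any of the cited vocabularies.

```python
import xml.etree.ElementTree as ET

# Hypothetical canonical XML for a person; element names are illustrative only.
PERSON_XML = """
<person>
  <fn>Jane Doe</fn>
  <family>Doe</family>
  <given>Jane</given>
  <email>jane.doe@example.org</email>
</person>
"""


def person_to_vcard(xml_text: str) -> str:
    """Translate the XML record into vCard 3.0 text, field by field."""
    person = ET.fromstring(xml_text.strip())
    lines = [
        "BEGIN:VCARD",
        "VERSION:3.0",
        f"FN:{person.findtext('fn')}",
        # N has five components: family; given; additional; prefixes; suffixes.
        f"N:{person.findtext('family')};{person.findtext('given')};;;",
        f"EMAIL;TYPE=INTERNET:{person.findtext('email')}",
        "END:VCARD",
    ]
    return "\r\n".join(lines)


vcard = person_to_vcard(PERSON_XML)
```

The translation is purely syntactic: each element maps to one vCard property, which is the kind of mapping a canonical encoding is meant to make possible.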

There are several semantic web efforts for dealing with PIM data [Iannella 2002, Studer 2001, Brickley 2001, Miller 2002] which currently use standard models for each of these units: iCalendar [RFC2245], vCard [RFC2426] and Dublin Core [DublinCore 1999]. The wide agreement on these standards and their adoption in software are good reasons for building on them, and let us expect that a relevant model of these data can be produced in the short term.

This task will require a great deal of caution, but we can start with nicely designed data models: vCard+FOAF [RFC2426, Brickley 2001], vCal+iCalendar [RFC2245], DC+BibTeX+MARC [DublinCore 1999]. These data models need classes and relationships that connect them to one another.

A first step would consist in defining such an integrated data model (or ontology) with top-level classes and typed attributes (see Figure 1). Applications could then be developed on that basis and would be interoperable among us. The next step would be to promote such a scheme to standardization bodies and PIM software providers, with nice applications in hand.

One can identify a number of simple classes: location, document, person, organization, event, task and period. Among these there is a minimal number of relationships (e.g., locatedAt(event,location), occursAt(event,period), belongsTo(person,organization)) and many other possibly useful relationships that can be added by specialized applications (e.g., parentOf(person,person) for genealogy software). A draft skeleton is given in Figure 1.

Figure 1: A few relationships between PIM objects (relations are in italics, italicized labels of classes are self-referent relations).

Expressing this model in RDF would provide a solid and useful foundation for the language in which queries are expressed and answers returned.
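To give an idea of what querying such a model involves, the following Python sketch stores statements as (subject, relation, object) triples, checks them against the relation signatures quoted above, and answers simple pattern queries. It uses plain tuples rather than an RDF library; the instance identifiers (meeting1, room12, week40, alice, acme) are made up.

```python
from typing import List, Optional, Tuple

# Relation signatures taken from the text: relation -> (domain, range).
SIGNATURES = {
    "locatedAt": ("event", "location"),
    "occursAt": ("event", "period"),
    "belongsTo": ("person", "organization"),
}

# Illustrative instances and their classes; all identifiers are made up.
CLASS_OF = {
    "meeting1": "event",
    "room12": "location",
    "week40": "period",
    "alice": "person",
    "acme": "organization",
}

TRIPLES = set()


def add(subject: str, relation: str, obj: str) -> None:
    """Store a triple after checking it against the relation's signature."""
    if (CLASS_OF[subject], CLASS_OF[obj]) != SIGNATURES[relation]:
        raise TypeError(f"{relation}({subject}, {obj}) violates its signature")
    TRIPLES.add((subject, relation, obj))


def query(subject: Optional[str] = None, relation: Optional[str] = None,
          obj: Optional[str] = None) -> List[Tuple[str, str, str]]:
    """Return the sorted triples matching a pattern; None is a wildcard."""
    pattern = (subject, relation, obj)
    return sorted(t for t in TRIPLES
                  if all(p is None or p == v for p, v in zip(pattern, t)))


add("meeting1", "locatedAt", "room12")
add("meeting1", "occursAt", "week40")
add("alice", "belongsTo", "acme")
```

With these statements, query(subject="meeting1") returns the location and period triples of the meeting, while add("alice", "locatedAt", "room12") is rejected because a person is not an event.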

Semantic web applications of PIM data

As intensive users of PIM data, we have started modest experiments in using it within the semantic web:

A transformation workbench for generating conference programs in various formats through XML transformations. This workbench can generate vCal and vCard data [Euzenat 2002]. It has been used for a few conference schedules so far and has proved useful, albeit limited by the lack of connection between formats (see the ISWC2002 [ISWC 2002] annotations, which are not connected to the schedule).
An application to the search, verification and completion of PIM data. We have started the development of an application which finds, on the web, information for completing a vCard. I describe its principles below.

Dealing with PIM data on the semantic web can help answer questions like "What is the homepage of this individual?", "What is the phone number of his assistant?", "Will he and I have an opportunity to meet within a month?", etc. In general the questions are simple, but the resources mobilized to find the answer can be large. The semantic part will consist in developing a model of the PIM domain that allows these questions to be answered by taking advantage of the relationships between pieces of PIM information.
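A question such as "what is the phone number of his assistant" amounts to following a path of relationships. A minimal Python sketch, assuming facts stored as (subject, relation, object) triples with made-up names and values:

```python
# Illustrative facts as (subject, relation, object); all values are made up.
FACTS = {
    ("jdoe", "assistant", "asmith"),
    ("asmith", "phone", "+00 0 00 00 00 01"),
    ("jdoe", "homepage", "http://www.example.org/~jdoe"),
}


def follow(start: str, *relations: str) -> set:
    """Follow a path of relations from `start`; return the objects reached."""
    current = {start}
    for relation in relations:
        current = {o for (s, r, o) in FACTS if s in current and r == relation}
    return current


assistant_phone = follow("jdoe", "assistant", "phone")
homepage = follow("jdoe", "homepage")
```

The question itself is simple; what can be large is the set of facts that must be gathered before the path can be followed.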

The relationships involved with PIM data can be very complex. Think of the following rules:

IF workat( x, y ) AND address( y, z ) THEN address( x, z )
IF workat( x, y ) AND addressing-scheme( y, z )
THEN email( x, apply-scheme( x, z ))
IF author( x, y ) AND subject( y, s )
THEN work-on( x, s )
IF located( x, y ) AND city( y, z ) AND country( z, k ) AND salestax( k, t )
THEN salestax( x, t )

Many such rules can be used for deducing information from that available on the web or elsewhere. These rules can also take many parameters into account such as the confidence in sources, the confidence in the rule, the result merging policies, etc.
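A minimal Python sketch of how such rules could be applied by forward chaining is given below. It only covers rules of the shape IF p(x, y) AND q(y, z) THEN r(x, z), which is the case for the workat/address and author/subject rules above. The confidence values, the choice of multiplying confidences, and the policy of keeping the maximum when a fact is derived several times are assumptions made for this example, not the behaviour of an existing system.

```python
# Each rule has the chain shape  IF p(x, y) AND q(y, z) THEN r(x, z),
# with a confidence attached to the rule itself (values are made up).
RULES = [
    ("workat", "address", "address", 0.9),
    ("author", "subject", "work-on", 0.8),
]


def forward_chain(facts: dict) -> dict:
    """Apply the rules until no fact is added or improved.

    `facts` maps (predicate, arg1, arg2) to a confidence in [0, 1].
    A derived fact gets the product of its premises' and the rule's
    confidence; if it is derived several times, the maximum is kept.
    """
    known = dict(facts)
    changed = True
    while changed:
        changed = False
        for p, q, r, rule_conf in RULES:
            for (pred1, x, y), c1 in list(known.items()):
                if pred1 != p:
                    continue
                for (pred2, y2, z), c2 in list(known.items()):
                    if pred2 != q or y2 != y:
                        continue
                    conf = c1 * c2 * rule_conf
                    if conf > known.get((r, x, z), 0.0):
                        known[(r, x, z)] = conf
                        changed = True
    return known


facts = {
    ("workat", "alice", "acme"): 1.0,
    ("address", "acme", "1 Main Street"): 0.5,
    ("author", "alice", "paper1"): 1.0,
    ("subject", "paper1", "ontologies"): 1.0,
}
derived = forward_chain(facts)
```

Here the address of alice is derived from that of her organization with a lower confidence than the source fact, which is one simple way of taking both the confidence in sources and the confidence in the rule into account.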

We have started developing a system (see Figure 2) whose goal is simply to fill a vCard from the web [Charre 2002]. It takes advantage of wrappers for known web sites and has only limited inference capabilities.

Figure 2: Graphic interface to vCard data and completion application.
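The principle of completing a card from wrappers can be sketched as follows in Python. This is not the code of the system described in [Charre 2002]: the two wrappers are hypothetical stand-ins that return canned data so that the sketch runs offline, and the "never overwrite an existing value" merging policy is an assumption.

```python
from typing import Callable, Dict, List

Card = Dict[str, str]
Wrapper = Callable[[Card], Card]


# Hypothetical wrappers: each stands in for code that would query one known
# web site. They return canned, made-up data here.
def directory_wrapper(card: Card) -> Card:
    return {"TEL": "+00 0 00 00 00 00", "ORG": "Example Lab"}


def homepage_wrapper(card: Card) -> Card:
    return {"URL": "http://www.example.org/~jdoe", "ORG": "Another Lab"}


def complete(card: Card, wrappers: List[Wrapper]) -> Card:
    """Fill the missing fields of `card`; existing values are kept."""
    result = dict(card)
    for wrapper in wrappers:
        for field, value in wrapper(result).items():
            result.setdefault(field, value)
    return result


partial = {"FN": "Jane Doe", "EMAIL": "jane.doe@example.org"}
full = complete(partial, [directory_wrapper, homepage_wrapper])
```

Because earlier wrappers take precedence, the order of the wrapper list acts as a crude ranking of sources; a fuller system would replace it with an explicit confidence and merging policy.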

Another interesting application along these lines is Vitæ checking.

A killer application for connected handhelds...

The problem with current PIM format designs has been reproduced in the connected version of the PalmOS-based organizers: a wireless application can only talk to one server, which very often uses only its own database, and there is no connection among the various servers. Consequently, it is again very difficult to answer queries involving data not dealt with by the same PQA.

Applications such as the one sketched above offer the opportunity to search and use the (semantic) web rather than one particular data provider. They can be used for checking or filling, offline or online, the records in a PDA database. This would put part of the semantic web in the palm of your hand.

... and a challenge for the semantic web?

We have proposed this PIM data completion task as a general challenge for the semantic web [Klein 2002]. The SW Challenge is aimed at providing yearly snapshots of the state of the semantic web technology and promoting emulation and comparison of various solutions.

The PIM data completion task would demonstrate the ability to merge heterogeneous information sources. It would also demonstrate the ability to take advantage of the Web. It has been proposed as a MUC-like challenge, which proposes a range of problems whose solutions can be unambiguously defined (see the queries above) and compares the results of various systems. We ran an embryonic version of such a challenge, in a different area, a few years ago [Escrire 2000], with more difficulty in defining a data set.

The challenge can be divided into several categories. The main category is the fully open category, in which participants can take advantage of any kind of information available on the web or already compiled. As a second step, in order to show the specificity of the semantic web, we can set up a category in which the resources used have to be expressed in a language like RDF or OWL.

The challenge can evolve on several aspects over the years:

By making the queries more complex;
By enriching the query and answer APIs (providing more PIM information);
By measuring other parameters (than just the correct answer): degree of confidence associated with the answer, rank of the correct answer, cost of the resource burned, ability to explain the reasoning, etc.
By imposing constraints on the languages or tools to use (one can imagine a Prolog-class challenge with only the rules compared).
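As an example of a measure based on the rank of the correct answer, here is a minimal Python sketch of the mean reciprocal rank, a standard information retrieval measure. Its use here is only a suggestion; the query identifiers and answers are made up.

```python
from typing import Dict, List


def reciprocal_rank(ranked_answers: List[str], correct: str) -> float:
    """Return 1/rank of the correct answer, or 0.0 if it is absent."""
    for rank, answer in enumerate(ranked_answers, start=1):
        if answer == correct:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: Dict[str, List[str]],
                         gold: Dict[str, str]) -> float:
    """Average the reciprocal rank over all queries of the gold standard."""
    total = sum(reciprocal_rank(runs.get(query, []), answer)
                for query, answer in gold.items())
    return total / len(gold)


# Made-up gold standard and system output for two queries.
gold = {"q1": "a", "q2": "b"}
runs = {"q1": ["a", "x"], "q2": ["x", "b"]}
score = mean_reciprocal_rank(runs, gold)
```

In this example the correct answer is ranked first for q1 and second for q2, so the score is (1 + 0.5) / 2 = 0.75. Unanswered queries count as zero, which penalizes systems that stay silent.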

We expect the applications submitted to the challenge to become more professional each year and to come into common use shortly.

Conclusion

This note started as an email emphasizing the elements I find important to consider at this Bristol meeting, which I cannot attend. I feel that what I wrote is certainly shared by most of the participants.

Anyway, if I can summarize what is important in four actions, they are:

get PIM data in standard format (RDF);
link it;
model it (with ontologies+rules);
play with it.

This requires a common top-level model that anyone can use and that is:

based on common standards (vCard, vCal...);
expressed in semantic web languages (RDF, OWL with a trend towards a canonical XML representation);
using links and classes.

Acknowledgements

The vCard completion application has been developed in large part by Bruno Charre.

Bibliography

[Brickley 2001] Dan Brickley et al., Friend of a friend RDF vocabulary, http://xmlns.com/foaf/0.1/
[Charre 2002] Bruno Charre, Web sémantique et recherche d'informations personnelles, Mémoire de DESS Intelligence artificielle, Université Pierre et Marie Curie, Paris (FR), 2002 ftp://ftp.inrialpes.fr/pub/exmo/reports/dessia-charre.pdf
[DublinCore 1999] Dublin core metadata element set 1.1, http://dublincore.org/documents/dces/, 1999
[Dumbill 2002] Ed Dumbill, XML watch: finding friends with XML and RDF, 2002 http://www-106.ibm.com/developerworks/xml/library/x-foaf.html
[Escrire 2000] http://escrire.inrialpes.fr
[Euzenat 2002] Jérôme Euzenat, Confman, http://co4.inrialpes.fr/xml/pimlib/confman
[Iannella 2002] Renato Iannella, Representing vCard objects in RDF/XML, Note, W3C, 2001 http://www.w3.org/TR/vcard-rdf
[RFC2245] Network working group, Internet Calendaring and Scheduling core object specification (iCalendar), RFC 2445, IETF, 1998 http://www.imc.org/rfc2445
[RFC2426] vCard, RFC 2426, IETF, 1998 http://www.imc.org/rfc2426
[ISWC 2002] ISWC2002 annotations, http://annotation.semanticweb.org/iswc/documents.html
[Klein 2002] Michel Klein, Ubbo Visser, http://challenge.semanticweb.org/
[MegaWiki 2002] http://www.geocities.com/ddvteach/MegaWiki.html
[Miller 2002] Libby Miller, Greg Fitzpatrick, Dan Brickley, SkiCal and iCalendar in DAML+OIL: a case study, http://ilrt.org/discovery/2002/03/skical-daml/
[Studer 2001] Rudi Studer, York Sure, Raphael Volz, Zheng Jijuan, Robert Meersman, Seed Ontology, Ontoweb deliverable 6.1, http://www.ontoweb.org, 2001
[Walsh 2002] Norman Walsh, Generalized metadata in your Palm, Proc. Extreme markup languages, Montréal (PQ CA), 2002

$Id: SyncLink.html,v 1.1 2002/10/03 16:26:48 lmiller Exp $