Learning new things about your friends
A FOAF Position Paper

Gunnar AAstrand Grimnes & Alun Preece & Pete Edwards

Computing Science Dept.
King's College
University of Aberdeen
AB24 3UE Scotland
{ ggrimnes, apreece, pedwards }@csd.abdn.ac.uk

Introduction

We believe that as the Semantic Web grows and gains in popularity, machine learning will be increasingly important for identifying interesting and potentially useful classes of information for which membership is not explicitly stated anywhere on the Semantic Web. Such missing classes may not be stated explicitly for several reasons:

The original author designed his markup for a specific purpose, and did not envisage any alternative conceptualisation. For example, a university has a semantic payroll system giving the employment level of people: lecturer, research assistant, teaching fellow etc. However, this gives no information about research area or research group membership.
The classification may be left out for space or optimisation reasons.
Numerous classifications are of such a personal or context-dependent nature that it would not make sense to specify class membership explicitly; consider for instance the class of restaurants you like, or people you have met at a conference.

We believe that although these classifications are not stated explicitly on the Semantic Web, they could probably be inferred using the right rules. For instance, research group membership could probably be determined by looking at co-authored papers. It is for identifying these rules that we believe machine learning will be of tremendous importance. Once the missing classes are identified, they could be added to future versions of ontologies or used on the fly for personalisation tasks.

Work to date

In our work to date [1] [2] we have investigated the learning of such unspecified classes within the graph of FOAF people. To accomplish this we used a crawl of the FOAF space from September 2003, consisting of about 150,000 triples describing 8908 people. We first employed a Hierarchical Agglomerative Clustering (HAC) algorithm to detect clusters of similar people in the FOAF graph. HAC works by initially creating a cluster for each individual, then repeatedly computing the distance between all clusters and merging the two closest clusters until a similarity threshold is met or only one cluster is left. The result is a set of clusters, arranged in one or more trees. As a similarity measure for FOAF people we experimented with several methods, but ended up using a variation on the similarity measure for conceptual graphs detailed in [3]. Our approach allows comparison of the RDF subgraph surrounding a person on two axes: conceptual and relational similarity, i.e. overlap of resource nodes and overlap of property edges.
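The clustering loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in the experiments: each person is reduced to a set of resource nodes and a set of property edges, the two overlaps are scored with a Dice coefficient and averaged with equal weight, and clusters are compared by average link. The measure in [3] may combine the two axes differently, and all names below (`dice`, `similarity`, `hac`, the toy `people` data) are hypothetical.

```python
from itertools import combinations


def dice(a, b):
    """Dice overlap of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def similarity(p, q):
    """Equal-weight mix of node overlap and edge overlap (an assumed combination)."""
    conceptual = dice(p["resources"], q["resources"])
    relational = dice(p["properties"], q["properties"])
    return 0.5 * conceptual + 0.5 * relational


def cluster_similarity(c1, c2, people):
    """Average-link similarity between two clusters of person ids."""
    pairs = [(a, b) for a in c1 for b in c2]
    return sum(similarity(people[a], people[b]) for a, b in pairs) / len(pairs)


def hac(people, threshold):
    """Merge the two most similar clusters until no pair reaches the threshold."""
    clusters = [frozenset([pid]) for pid in people]
    merges = []  # (left, right, similarity) records the tree bottom-up
    while len(clusters) > 1:
        best = max(
            combinations(clusters, 2),
            key=lambda pair: cluster_similarity(pair[0], pair[1], people),
        )
        score = cluster_similarity(best[0], best[1], people)
        if score < threshold:
            break
        clusters = [c for c in clusters if c not in best] + [best[0] | best[1]]
        merges.append((best[0], best[1], score))
    return clusters, merges


# Toy data: "a" and "b" share everything, "c" shares only one property.
people = {
    "a": {"resources": {"abdn", "aktors"}, "properties": {"name", "mbox", "groupHomepage"}},
    "b": {"resources": {"abdn", "aktors"}, "properties": {"name", "mbox", "groupHomepage"}},
    "c": {"resources": {"mit"}, "properties": {"name", "nick"}},
}
clusters, merges = hac(people, threshold=0.5)
```

Recomputing every pairwise cluster similarity on each pass keeps the sketch short but is expensive; a crawl of thousands of people would need cached distances.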

From the set of clusters of similar FOAF people we selected the most interesting clusters, identified as the clusters where the similarity internally in the cluster is relatively small compared to the similarity with the parent cluster in the tree. These potentially interesting clusters were fed to the Inductive Logic Programming (ILP) program Aleph.

Aleph takes a list of positive and negative examples, and descriptions of these in Prolog. The members of a cluster became our positive examples, and all other people the negative examples. The Prolog descriptions were converted from RDF; this conversion is straightforward, and an example is shown in Figure 1.

Figure 1 : FOAF Fragment Converted to Prolog.

<foaf:Person>
  <foaf:name>Gunnar AAstrand Grimnes</foaf:name>
  <foaf:mbox>ggrimnes@csd.abdn.ac.uk</foaf:mbox>
  <foaf:knows>
    <rdf:Description>
      <foaf:mbox rdf:resource="mailto:apreece@csd.abdn.ac.uk" />
    </rdf:Description>
  </foaf:knows>
</foaf:Person>

rdf___type('genid:002', 'foaf___Person').
foaf___name('genid:002', 'Gunnar AAstrand Grimnes').
foaf___mbox('genid:002', 'ggrimnes@csd.abdn.ac.uk').
foaf___knows('genid:002', 'genid:003').
foaf___mbox('genid:003', 'apreece@csd.abdn.ac.uk').
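As a rough illustration of the conversion in Figure 1, the Python sketch below renders triples as binary Prolog facts, written with standard single-quoted atoms. It assumes the triples have already been parsed into (subject, predicate, object) strings with prefixed predicate names; the function names are hypothetical, and details such as stripping the `mailto:` scheme are left out. The `member/1` example format is also an assumption, chosen to match the head of the rules Aleph is asked to learn.

```python
def prolog_atom(text):
    """Quote a string as a Prolog atom, escaping backslashes and single quotes."""
    return "'" + text.replace("\\", "\\\\").replace("'", "\\'") + "'"


def predicate_name(qname):
    """Turn a prefixed name such as foaf:name into foaf___name."""
    prefix, local = qname.split(":", 1)
    return f"{prefix}___{local}"


def to_fact(subject, predicate, obj):
    """Render one triple as a binary Prolog fact."""
    if predicate == "rdf:type":
        # Class names are rewritten the same way as predicates, as in Figure 1.
        obj = predicate_name(obj)
    return f"{predicate_name(predicate)}({prolog_atom(subject)}, {prolog_atom(obj)})."


def to_example(person_id):
    """Render a cluster member (or non-member) as a member/1 example."""
    return f"member({prolog_atom(person_id)})."


triples = [
    ("genid:002", "rdf:type", "foaf:Person"),
    ("genid:002", "foaf:name", "Gunnar AAstrand Grimnes"),
    ("genid:002", "foaf:knows", "genid:003"),
]
facts = [to_fact(*t) for t in triples]
```

With this shape, the facts would form the background knowledge, and the `member/1` lines for cluster members and for everyone else would form the positive and negative example sets respectively.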

Initial experiments with this combination of clustering and Aleph showed that even with a tiny subset of 5% of the people in the FOAF crawl, Aleph was already overwhelmed; a single experiment took 1-3 weeks to run. To improve performance we sorted all the predicates used in the FOAF graph by frequency, and included only the top 100 predicates out of the total 1066 when generating the Prolog descriptions. In addition we took out the foaf:knows predicate and its inverse (generated by the crawler). If present, Aleph would learn rules constructed only from foaf:knows links, as they are the most useful for distinguishing people in the FOAF graph. However, we felt these rules were not very re-usable and, in addition, we were interested in learning the missing classifications, not just variations on the explicit information.
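The predicate pruning step can be sketched as follows. This is a minimal Python illustration, assuming triples are (subject, predicate, object) tuples and that the excluded predicates are removed after the frequency ranking; the text does not say which order was used. The name of the crawler-generated inverse of foaf:knows is not given here, so it is left for the caller to pass in `excluded`.

```python
from collections import Counter


def filter_triples(triples, top_n=100, excluded=("foaf:knows",)):
    """Keep triples whose predicate is among the top_n most frequent, minus excluded ones."""
    counts = Counter(predicate for _, predicate, _ in triples)
    keep = {predicate for predicate, _ in counts.most_common(top_n)} - set(excluded)
    return [t for t in triples if t[1] in keep]


# Toy data: foaf:name and foaf:knows are the two most frequent predicates.
triples = [
    ("a", "foaf:name", "x"),
    ("b", "foaf:name", "y"),
    ("a", "foaf:knows", "b"),
    ("b", "foaf:knows", "a"),
    ("a", "foaf:nick", "z"),
]
kept = filter_triples(triples, top_n=2)
```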

Results

Using the methods outlined above we were able to identify several clusters, and rules describing these clusters, that we believe are potentially interesting. Some of these rules are shown in Figure 2: rule 1 describes the group of people trusted highly by someone, rule 2 the members of an implicitly specified research group, rule 3 the group of people located in Aberdeen, rule 4 the group of people who have created a PostScript document (the PostScript cult!), and rule 5 the authors of a particular paper. The rules are presented here in Prolog, but the inverse of the conversion shown in Figure 1 is equally trivial using an RDF representation of Horn clauses, such as SWRL.

Figure 2 : Selected FOAF Cluster Description Rules.

#	Rule	Cluster Size	Recall	False Neg.	False Pos.
1	member(A) :- trust___trustsHighly(B,A).	8	8	0	0
2	member(A) :- foaf___groupHomepage(A,'http://www.aktors.org').	13	13	0	0
3	member(A) :- pim___nearestAirport(A,'http://www.daml.org/cgi-bin/airport?ABZ').	12	12	0	2
4	member(A) :- dc___creator(B,A), dc___format(B,'application/postscript').	17	15	2	0
5	member(A) :- dc___creator(B,A), dc___title(B,'Managing Reference: Ensuring Referential Integrity of Ontologies for the Semantic Web').	8	8	0	0
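The counts in Figure 2 can be read as a comparison between the set of people a rule covers and the cluster it is meant to describe. The Python sketch below shows that reading on toy data. It assumes "Recall" is the number of cluster members the rule covers, which is consistent with the figures (for rule 4, 15 covered plus 2 missed gives the cluster size of 17). The function names and the sample triples are hypothetical.

```python
def evaluate(covered, cluster):
    """Compare the set a rule covers with the cluster it should describe."""
    covered, cluster = set(covered), set(cluster)
    return {
        "cluster_size": len(cluster),
        "recall": len(covered & cluster),      # members the rule finds
        "false_neg": len(cluster - covered),   # members the rule misses
        "false_pos": len(covered - cluster),   # non-members the rule admits
    }


def rule4(triples):
    """member(A) :- dc___creator(B,A), dc___format(B,'application/postscript')."""
    ps_docs = {
        s for s, p, o in triples
        if p == "dc:format" and o == "application/postscript"
    }
    return {o for s, p, o in triples if p == "dc:creator" and s in ps_docs}


# Toy data: p1 and p4 created PostScript documents, p2 did not.
triples = [
    ("doc1", "dc:format", "application/postscript"),
    ("doc1", "dc:creator", "p1"),
    ("doc2", "dc:format", "text/html"),
    ("doc2", "dc:creator", "p2"),
    ("doc3", "dc:format", "application/postscript"),
    ("doc3", "dc:creator", "p4"),
]
scores = evaluate(rule4(triples), cluster={"p1", "p2"})
```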

Conclusion

We have shown one way in which machine learning techniques can be used to identify potentially re-usable and interesting classifications that are not explicitly stated from a corpus of RDF documents. In developing the methods above, we have identified the following issues:

Current RDF data is very scruffy; problems range from simple human errors to the use of different ontologies, different units (meters vs. feet anyone?), RDF model ambiguities (resource vs. literal) etc. We do not believe these anomalies will go away as the Semantic Web takes off; rather, we believe programmers will have to deal with them. Pre-processing and cleaning of data before attempting any sort of automatic processing is very important.
The scalability issues are HUGE; our experiments were conducted using only a subset of the FOAF data that was available in September 2003. With the data available today even just crawling, database entry and smushing are a challenge, and that's not even considering any type of automatic inference or processing!

Please refer to [1] and [2] for more detail on this work; in addition, the primary author is currently working on a PhD on the intersection of machine learning and Semantic Web technology.

Bibliography

[1] G. A. Grimnes, P. Edwards, and A. Preece.
Learning from Semantic Flora and Fauna.
In Semantic Web Personalization Workshop, AAAI, 2004.
[2] G. AA. Grimnes, P. Edwards, and A. Preece.
Learning Meta-Descriptions of the FOAF Network.
In Proceedings of International Semantic Web Conference, Hiroshima, Japan, November 2004, to appear.
[3] M. Montes-y-Gómez, A. Gelbukh, and A. López-López.
Comparison of conceptual graphs.
In Lecture Notes in Artificial Intelligence, volume 1793, pages 548-556, Springer-Verlag, 2000.
