We believe that as the Semantic Web grows and gains in popularity, machine learning will be increasingly important for identifying interesting and potentially useful classes of information for which membership is not explicitly stated anywhere on the Semantic Web. Such missing classes may not be stated explicitly for several reasons:
We believe that although these classifications are not stated explicitly on the Semantic Web, they could probably be inferred using the right rules; for instance, research group membership could probably be determined from co-authored papers. It is in identifying these rules that we believe machine learning will be of tremendous importance. Once the missing classes are identified, they could be added to future versions of ontologies or used on the fly for personalisation tasks.
In our work to date we have investigated the learning of such unspecified classes within the graph of FOAF people. To accomplish this we used a crawl of the FOAF space from September 2003, consisting of about 150,000 triples describing 8,908 people. We first employed a Hierarchical Agglomerative Clustering (HAC) algorithm to detect clusters of similar people in the FOAF graph. HAC works by creating a cluster for each individual, then repeatedly computing the distance between all pairs of clusters and merging the two closest, until a similarity threshold is met or only one cluster is left. The result is a set of clusters, arranged in one or more trees. As a similarity measure for FOAF people we experimented with several methods, but ended up using a variation on the similarity measure for conceptual graphs as detailed in ; our approach allows comparison of the RDF subgraph surrounding a person on two axes: conceptual and relational similarity, i.e. overlap of resource nodes and overlap of property edges.
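The clustering step described above can be sketched roughly as follows. This is a minimal illustration, not the implementation used in this work: the Jaccard overlap stands in for the conceptual-graph similarity measure, the equal weighting of the two axes and the average-link merging rule are assumptions, and the names `person_similarity`, `hac`, `nodes` and `edges` are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Set overlap: 1.0 for identical sets, 0.0 for disjoint ones."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def person_similarity(p, q, weight=0.5):
    """Combine conceptual (resource node) and relational (property edge) overlap.

    Assumption: a plain weighted sum of two Jaccard scores, used here only as
    a stand-in for the conceptual-graph measure mentioned in the text.
    """
    conceptual = jaccard(p["nodes"], q["nodes"])
    relational = jaccard(p["edges"], q["edges"])
    return weight * conceptual + (1 - weight) * relational


def cluster_similarity(c1, c2, people):
    """Average-link similarity between two clusters of person ids (assumed)."""
    pairs = [(a, b) for a in c1 for b in c2]
    total = sum(person_similarity(people[a], people[b]) for a, b in pairs)
    return total / len(pairs)


def hac(people, threshold):
    """Merge the two most similar clusters until none reach the threshold."""
    clusters = [frozenset([pid]) for pid in sorted(people)]
    merges = []  # (left, right, similarity) records the resulting tree(s)
    while len(clusters) > 1:
        c1, c2 = max(
            combinations(clusters, 2),
            key=lambda pair: cluster_similarity(pair[0], pair[1], people),
        )
        best = cluster_similarity(c1, c2, people)
        if best < threshold:
            break
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
        merges.append((c1, c2, best))
    return clusters, merges
```

Each person is represented here by the set of resource nodes and the set of property edges in the surrounding RDF subgraph; the `merges` list records which clusters were joined, so the tree structure can be rebuilt from it.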
From the set of clusters of similar FOAF people we selected the most interesting ones, identified as the clusters whose internal similarity is relatively small compared to the similarity with the parent cluster in the tree. These potentially interesting clusters were fed to the Inductive Logic Programming (ILP) program Aleph.
Aleph takes a list of positive and negative examples, and descriptions of these in Prolog. The members of a cluster became our positive examples, and all other people the negative examples. The Prolog descriptions were converted from RDF; this conversion is straightforward, and an example is shown in Figure 1.
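As a rough illustration of the RDF-to-Prolog conversion, the sketch below turns one triple into a binary Prolog fact. The `prefix___localName` naming is inferred from the predicates that appear in the learned rules; the helper names and the prefix table are hypothetical, and the code assumes the local name is already a valid Prolog atom.

```python
def prolog_atom(value):
    """Quote a URI or literal as a Prolog atom, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def predicate_name(uri, prefixes):
    """Map a property URI to prefix___localName using a prefix table."""
    for prefix, namespace in prefixes.items():
        if uri.startswith(namespace):
            return f"{prefix}___{uri[len(namespace):]}"
    raise ValueError(f"no prefix known for {uri}")


def triple_to_fact(subject, predicate, obj, prefixes):
    """Render one RDF triple as a Prolog fact: predicate(subject, object)."""
    name = predicate_name(predicate, prefixes)
    return f"{name}({prolog_atom(subject)},{prolog_atom(obj)})."


# Hypothetical example data; the person URI is made up for illustration.
PREFIXES = {"foaf": "http://xmlns.com/foaf/0.1/"}
fact = triple_to_fact(
    "http://example.org/people#alice",
    "http://xmlns.com/foaf/0.1/groupHomepage",
    "http://www.aktors.org",
    PREFIXES,
)
print(fact)
```

The positive and negative examples for Aleph would then be `member/1` facts over the same person atoms, with cluster members as positives and everyone else as negatives.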
Initial experiments with this combination of clustering and Aleph showed that even with a small subset of 5% of the people in the FOAF crawl, Aleph was already overwhelmed; a single experiment took 1-3 weeks to run. To improve performance we sorted all the predicates used in the FOAF graph by frequency, and included only the top 100 of the 1,066 predicates when generating the Prolog descriptions. In addition we removed the foaf:knows predicate and its inverse (generated by the crawler). If these were present, Aleph would learn rules constructed only from foaf:knows links, as they are the most useful for distinguishing people in the FOAF graph. However, we felt these rules were not very re-usable and, in addition, we were interested in learning the missing classifications, not just variations on the explicit information.
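The predicate pruning described above could look something like the following sketch. The function name and the representation of triples as tuples are assumptions; the name of the crawler-generated inverse of foaf:knows is not given here, so the excluded predicates are passed in as a parameter.

```python
from collections import Counter


def filter_triples(triples, top_n, excluded):
    """Keep only triples whose predicate is among the top_n most frequent.

    Predicates listed in `excluded` (e.g. foaf:knows and its inverse) are
    dropped before counting, so they never take up one of the top_n slots.
    """
    counts = Counter(p for _, p, _ in triples if p not in excluded)
    keep = {p for p, _ in counts.most_common(top_n)}
    return [t for t in triples if t[1] in keep]
```

With `top_n=100` and the two foaf:knows predicates excluded, this corresponds to the reduction described in the text, although whether the exclusion happened before or after the frequency cut-off is an assumption of this sketch.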
Using the methods outlined above we were able to identify several clusters, and rules describing them, that we believe are potentially interesting. Some of these rules are shown in Figure 2: rule 1 is the group of people trusted highly by someone, rule 2 the members of an implicitly specified research group, rule 3 the group of people based in Aberdeen, rule 4 the group of people who have created a PostScript document (the PostScript cult!), and rule 5 the authors of a particular paper. The rules are presented here in Prolog, but the inverse of the conversion shown in Figure 1 is equally straightforward using an RDF representation of Horn clauses, such as SWRL.
|#||Rule||Cluster Size||Recall||False Neg.||False Pos.|
|1||member (A) :- trust___trustsHighly (B,A).||8||8||0||0|
|2||member (A) :- foaf___groupHomepage (A,'http://www.aktors.org').||13||13||0||0|
|3||member (A) :- pim___nearestAirport (A, 'http://www.daml.org/cgi-bin/airport?ABZ').||12||12||0||2|
|4||member (A) :- dc___creator (B,A), dc___format (B, 'application/postscript').||17||15||2||0|
|5||member (A) :- dc___creator (B,A), dc___title (B, 'Managing Reference: Ensuring Referential Integrity of Ontologies for the Semantic Web').|
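The counts in the table can be read as a comparison between the members of a cluster and the people covered by the learned rule. The sketch below shows that reading; it assumes "Recall" is the number of cluster members the rule covers, which is an interpretation consistent with the rows above rather than something stated explicitly, and the function name and person ids are hypothetical.

```python
def evaluate_rule(cluster, covered):
    """Compare a cluster with the set of people a rule covers.

    cluster: ids of the people in the cluster (positive examples)
    covered: ids of the people for which the rule body holds
    """
    cluster, covered = set(cluster), set(covered)
    return {
        "cluster_size": len(cluster),
        "recall": len(cluster & covered),     # members the rule finds
        "false_neg": len(cluster - covered),  # members the rule misses
        "false_pos": len(covered - cluster),  # non-members the rule covers
    }


# Made-up ids reproducing the shape of rule 4: 17 members, 15 covered.
cluster = [f"person{i}" for i in range(17)]
covered = cluster[:15]
print(evaluate_rule(cluster, covered))
```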
We have shown one way in which machine learning techniques can be used to identify potentially re-usable and interesting classifications that are not explicitly stated from a corpus of RDF documents. In developing the methods above, we have identified the following issues:
Please refer to  and  for more detail on this work; in addition, the primary author is currently working on a PhD on the intersection of machine learning and Semantic Web technology.