This is an archive of an inactive wiki and cannot be modified.

This is one of the possible Use Cases.

1. Abstract

Creation of rules and ontologies is a demanding knowledge engineering task. Instead of, or in addition to, manual creation, one should take advantage of the large amounts of unstructured and partially structured data already available on the Web. Data mining and machine learning techniques can be used to extract semantic annotations in the form of relations, ontologies and rules.

2. Status

Originally proposed by IgorMozetic at the first F2F meeting.

3. Links to Related Use Cases

Most other Use Cases deal with manually created rules only. However, there are two related cases:

4. Relationship to OWL/RDF Compatibility

It seems that the query interface to RDF(S) and OWL suffices for compatibility. This means that SPARQL and OWL-QL queries may be included in the body of a rule, as proposed by the Blackbox approach in OWL Compatibility and by Viewing an RDF(S) graph as a fact base in RDF Compatibility.
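
For concreteness, here is a minimal sketch of that idea, assuming Python with rdflib and a hypothetical RDF(S) file and vocabulary: the body of a rule is evaluated as a SPARQL query over the fact base, and the rule head is asserted for every resulting binding.

    from rdflib import Graph, Namespace
    from rdflib.namespace import RDF

    # Hypothetical fact base and vocabulary; rdflib stands in for a generic
    # SPARQL query interface to an RDF(S) graph.
    EX = Namespace("http://example.org/")
    g = Graph()
    g.parse("companies.rdf")  # hypothetical company descriptions

    # Rule (informally):  candidate(?x) :- sparql{ ?x a ex:Company ; ex:sector "telecom" }
    body = """
    PREFIX ex: <http://example.org/>
    SELECT ?x WHERE { ?x a ex:Company ; ex:sector "telecom" . }
    """
    for row in g.query(body):                   # evaluate the rule body as a query
        g.add((row.x, RDF.type, EX.Candidate))  # assert the rule head for each binding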

5. Examples of Rule Platforms Supporting this Use Case

The Data Mining Group's work on the Predictive Model Markup Language (PMML) provides an XML Schema format for specifying statistical and data mining models. To a large degree this covers the zero-order (essentially propositional) rules that can be learned automatically. However, it does not extend to a more expressive first-order representation (e.g., rules learned by Inductive Logic Programming techniques).
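
To illustrate the gap, here is a small sketch in Python with made-up attribute and predicate names: the first rule is zero-order (a condition over fixed attributes, the level PMML covers), the second is first-order (variables ranging over objects and relations, as learned by ILP systems).

    # Zero-order (propositional) rule: a condition over fixed attribute values.
    def buys_insurance(customer):
        return customer["age"] > 40 and customer["owns_car"]

    # First-order rule, e.g.  daughter(X, Y) :- female(X), parent(Y, X),
    # expressed over relations rather than over a fixed attribute vector.
    def daughters(female, parent):
        return {(x, y) for (y, x) in parent if x in female}

    print(buys_insurance({"age": 45, "owns_car": True}))         # True
    print(daughters({"ann"}, {("tom", "ann"), ("tom", "bob")}))  # {('ann', 'tom')}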

6. Benefits of Interchange

7. Requirements on the RIF

8. Breakdown

8.1. Actors and their Goals

8.2. Main Sequence

  1. Data gathering - collecting the relevant data, e.g., by focused web crawling or from distributed databases.

  2. Preprocessing - the collected data are cleaned (e.g., markup removal, stemming, lemmatization) and transformed into a form suitable for efficient learning (e.g., a Bag-of-Words representation as sparse vectors).

  3. Learning - depending on the goal, different learning or clustering algorithms are applied (e.g., k-means clustering, Support Vector Machines, Latent Semantic Indexing, Inductive Logic Programming). Appropriate background knowledge can be used if provided. Steps 2-5 are illustrated by the sketch after this list.

  4. Evaluation - the rules are evaluated against test data or by inspection by a human expert. The evaluation results are fed back to step 3 for incremental learning if needed.

  5. Application - the reasoner is queried to address appropriate tasks, e.g., classification, prediction, visualization.
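
A minimal sketch of steps 2-5, assuming scikit-learn and a tiny made-up corpus and label set in place of crawled data:

    from sklearn.pipeline import make_pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.svm import LinearSVC
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import classification_report

    # 1.-2. Gathered and cleaned documents, about to become sparse
    #       Bag-of-Words vectors (corpus and labels are hypothetical).
    docs = ["acme builds optical routers", "beta sells life insurance",
            "gamma operates mobile networks", "delta offers car insurance",
            "epsilon manufactures antennas", "zeta underwrites home policies"]
    labels = ["telecom", "finance", "telecom", "finance", "telecom", "finance"]

    # 3. Learning: TF-IDF vectors fed into a linear Support Vector Machine.
    model = make_pipeline(TfidfVectorizer(), LinearSVC())
    docs_train, docs_test, y_train, y_test = train_test_split(
        docs, labels, test_size=1/3, stratify=labels, random_state=0)
    model.fit(docs_train, y_train)

    # 4. Evaluation on held-out test data.
    print(classification_report(y_test, model.predict(docs_test)))

    # 5. Application: classify a previously unseen document.
    print(model.predict(["eta installs fibre networks"]))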

9. Narratives

To be further elaborated...

9.1. Virtual organizations

Structuring organizations' competencies (extracted from the web sources of a large pool of potential partners) to facilitate the dynamic creation of alliances for specific business opportunities.

9.2. User profiling

Monitoring and analysis of user profiles (e.g. during online shopping) to learn association rules.
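
As a sketch, consider the association rule "bread => butter" over a few made-up shopping baskets, with its support and confidence computed directly rather than by a real mining library:

    # Hypothetical shopping baskets collected from user profiles.
    baskets = [
        {"bread", "butter", "milk"},
        {"bread", "butter"},
        {"bread", "jam"},
        {"milk", "butter"},
    ]

    def support(itemset):
        return sum(itemset <= basket for basket in baskets) / len(baskets)

    def confidence(antecedent, consequent):
        return support(antecedent | consequent) / support(antecedent)

    # Candidate rule "bread => butter", kept if support and confidence are high enough.
    print(support({"bread", "butter"}), confidence({"bread"}, {"butter"}))  # 0.5 0.666...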

9.3. Online learning from stream data

Learning from stream data (e.g., business news) to predict possible future events. The prediction rules change gradually over time and must be inspected, manipulated and adapted.
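
A minimal sketch of such incremental learning, assuming scikit-learn; the news feed, event labels and batch contents are placeholders:

    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import SGDClassifier

    vectorizer = HashingVectorizer(n_features=2**16)  # stateless, suitable for streams
    model = SGDClassifier()
    classes = ["merger", "bankruptcy"]                # hypothetical event types

    def news_batches():
        # Placeholder for a real, unbounded news feed.
        yield ["acme acquires beta corp", "gamma files for liquidation"], ["merger", "bankruptcy"]
        yield ["delta and epsilon announce merger", "zeta declares insolvency"], ["merger", "bankruptcy"]

    for texts, events in news_batches():
        X = vectorizer.transform(texts)
        model.partial_fit(X, events, classes=classes)  # model is updated batch by batch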

9.4. Cross-lingual document retrieval and categorization

Multilingual documents are aligned and a language-independent representation is learned. The resulting rules can then be used for document retrieval based only on the meaning of the query, regardless of its language. Document classification and clustering can also be done once, for all languages.
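
A minimal sketch in the spirit of cross-lingual Latent Semantic Indexing, assuming scikit-learn and a made-up English-German corpus of aligned document pairs:

    from scipy.sparse import hstack
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD

    # Aligned (translated) document pairs; the texts are made up.
    english = ["the bank approved the loan",
               "heavy rain flooded the river bank",
               "the central bank raised interest rates"]
    german  = ["die bank genehmigte den kredit",
               "starker regen ueberflutete das flussufer",
               "die zentralbank erhoehte die zinsen"]

    # Each aligned pair becomes one row spanning both vocabularies ...
    vec_en, vec_de = TfidfVectorizer(), TfidfVectorizer()
    X = hstack([vec_en.fit_transform(english), vec_de.fit_transform(german)])

    # ... and SVD extracts language-independent coordinates, usable for
    # retrieval, classification or clustering across languages.
    lsa = TruncatedSVD(n_components=2, random_state=0)
    shared = lsa.fit_transform(X)
    print(shared.shape)  # (3, 2)

A monolingual query would be vectorized in its own language, padded with zeros for the other vocabulary block, and projected into the same shared space.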

9.5. Cross media

Different media (text, images, sound, video) about the same source can be aligned and used for annotation. The same sparse vector representation is used, just over different primitives (words, textures, visual patterns). In the same way that two languages are correlated, one can correlate text to images, text to sound, text to video, and so on.

9.6. Data compression

From large data sets, rules can be learned which preserve some properties of the data, e.g., the ability to discriminate between the values of one attribute based on the other attributes. The rules are not equivalent to the original data, but they require considerably less space.
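
A minimal sketch, assuming scikit-learn and using a small standard dataset as a stand-in for a large one: a shallow decision tree is learned whose printed rules discriminate the values of one attribute (the class) based on the others, and only those rules need to be kept.

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    data = load_iris()
    tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(data.data, data.target)

    # A handful of rules approximating 150 rows: not equivalent to the
    # original data, but far more compact.
    print(export_text(tree, feature_names=list(data.feature_names)))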

10. Commentary

Comments, issues, etc.