This is an archive of an inactive wiki and cannot be modified.

Use Case Extraction-Annotation


Version: 3
Date/Time: February 05, 2008
Original author: PeterVojtas
Current lead: PeterVojtas
Last Modified By:
Primary Actors: Web user
Secondary Actors:
Application domain: Automating current web
Triggering event: user wants a web-scale overview of available information

Purpose/Goals

The motivating situation is a user (or a web service) who wants a web-scale overview of available information – e.g. an overview of all car-selling shops. The advantage would be the possibility of comparing different market offers. Another application is a competitor-tracking system.

The main problem is the size of the data and the fact that these data are designed mainly for human consumption.

Many of our use cases assume that, e.g., "web resources have been populated using a property set and property values that have a machine processable representation of the vocabulary used". On the other hand, the W3C Gleaning Resource Descriptions from Dialects of Languages (GRDDL) specification (see www.w3.org/TR/grddl/) introduces markup, based on existing standards, for declaring that an XML document includes data compatible with the Resource Description Framework (RDF) and for linking to algorithms (typically represented in XSLT) for extracting this data from the document (e.g. products in an e-shop).
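As an illustration, a minimal Python sketch of the GRDDL mechanism (using lxml; the file name and the page's transformation link are hypothetical): parse the page, follow its rel="transformation" link, and apply the linked XSLT to obtain RDF.

    from lxml import etree

    # Parse the XHTML page, find the XSLT transformation it declares via
    # GRDDL's rel="transformation" link, and apply it to obtain RDF/XML.
    # File and URL names are illustrative assumptions.
    page = etree.parse("eshop-products.xhtml")
    xhtml = {"x": "http://www.w3.org/1999/xhtml"}
    href = page.xpath('//x:link[@rel="transformation"]/@href',
                      namespaces=xhtml)[0]
    extract = etree.XSLT(etree.parse(href))
    rdf = extract(page)
    print(etree.tostring(rdf, pretty_print=True).decode())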

Our approach tries to generalize this to arbitrary HTML and XHTML sources, extending "dialects of languages" to semi-structured HTML pages and also to dominantly textual pages (e.g. accident reports). The main goal is to do this "gleaning" (also known as web content mining, or extraction) automatically for a large number of resources. The task is easy for humans, but humans cannot process a large number of pages; it is difficult for machines, but machines can process a large number of resources. The main trick is to find a trade-off between the amount of human assistance (especially in training and ontology creation) and automation. The second issue is domain dependence: one can easily write a script extracting RDF triples from a single page (see the sketch below), but the goal is to extract data from pages never visited before. The third dimension of the problem is the "machine difficulty" of the resource: some pages (e.g. those generated from a database) are easier for machine extraction than others that are dominantly textual.
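To illustrate how easy the single-page case is, and how domain-dependent the result, a minimal Python sketch (the URL and the markup pattern are hypothetical; the scraper breaks on any page with a different layout, which is exactly the domain-dependence problem):

    import re
    import urllib.request

    # A page-specific scraper: trivial to write for one known page,
    # useless for pages never visited before.
    html = urllib.request.urlopen("http://example.org/ntb1.html").read().decode()
    match = re.search(r'<td class="price">\s*([\d ]+)\s*CZK</td>', html)
    if match:
        price = match.group(1).replace(" ", "")
        # emit one RDF triple for the extracted price
        print(f'<ntb1> <o1:has_priceProperty> "{price}" .')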

Our task is: given such a resource and an ontology, extract the data contained in this resource (to obtain an instance of some ontology parts, typically instances of a class and some properties of that class) and annotate the original resource (with respect to the given ontology).

Issues and Relevance to Uncertainty

The solution is extraction and annotation tools. Many annotation tools are linked from http://annotation.semanticweb.org/annotationtool_view, mostly using proprietary uncertainty representations (or built-in uncertainty handling). One of the main tasks of this XG is to provide the fundamentals of a standardized representation of uncertainty that could serve as the basis for information exchange; an uncertainty annotation of extraction results would be especially helpful here.

In what follows we draw on experience with uncertainty issues from experiments with web content mining, as described in http://c4i.gmu.edu/ursw2007/files/papers/URSW2007_T9_VojtasEtAl.pdf; see also the presentation http://c4i.gmu.edu/ursw2007/files/talks/URSW2007_T9_VojtasEtAl_Slides.pdf.

In what follows we present issues and relevance to uncertainty that are specific to this use case, and we annotate them (UncAnn) with references to the Uncertainty Ontology (UncertaintyOntology) and to the extensions of classes and properties described in the fine-grained version of the Uncertainty Ontology.

Assume that a user is looking for notebooks and we would like to provide machine support for his/her search. A typical statement that is a subject of uncertainty assignment in this use case is: (UncAnn Sentence) an HTML-coded web page with a given URL contains information which, according to an ontology o1 (UncAnn World:DomainOntology) about notebooks, can be expressed by the RDF triple (ntb1, o1:has_priceProperty, 20000). The agent producing this statement is a machine agent (UncAnn Agent:MachineAgent), specifically an induction agent (UncAnn Agent:MachineAgent:InductiveAgent).
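One possible encoding of this uncertain statement is a sketch using rdflib and RDF reification; the unc:confidence property, its value, and both namespace URIs are assumptions for illustration, not part of any standard.

    from rdflib import Graph, Namespace, Literal, BNode
    from rdflib.namespace import RDF, XSD

    O1 = Namespace("http://example.org/o1#")    # assumed namespace for o1
    UNC = Namespace("http://example.org/unc#")  # assumed uncertainty vocabulary

    g = Graph()
    stmt = BNode()
    # reify the extracted triple (ntb1, o1:has_priceProperty, 20000) ...
    g.add((stmt, RDF.type, RDF.Statement))
    g.add((stmt, RDF.subject, O1.ntb1))
    g.add((stmt, RDF.predicate, O1.has_priceProperty))
    g.add((stmt, RDF.object, Literal(20000, datatype=XSD.integer)))
    # ... and attach the induction agent's confidence to the statement
    g.add((stmt, UNC.confidence, Literal(0.85, datatype=XSD.decimal)))
    print(g.serialize(format="turtle"))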

The uncertainty nature of this statement is (UncAnn UncertaintyNature:Epistemic:MachineEpistemic); the uncertainty type is usually (UncAnn UncertaintyType:Empirical:Randomness). The instances used for training an extraction tool (UncAnn World:DomainOntology:Instances) are web pages; the uncertainty model is usually complicated (a mixture of HTML structure, regular expressions, annotation ontology and similarity measures) and a combination of several models, typically (UncAnn UncertaintyModel:CombinationOfSeveralModels:ProbabilityAndFuzzySetsCombinationModels). Depending on this, the evidence for this uncertainty statement (UncAnn World:DomainOntology:Instances:Evidence) is the precision and recall on the training set.
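A toy computation of this evidence, assuming the extractor's output and the gold annotations are both represented as sets of triples (the data below is invented):

    # Evidence for the uncertainty statement: precision and recall of the
    # induced extractor on a labelled training set.
    gold = {("ntb1", "o1:has_priceProperty", "20000"),
            ("ntb1", "o1:ntb_memory", "2GB")}
    extracted = {("ntb1", "o1:has_priceProperty", "20000"),
                 ("ntb1", "o1:ntb_disk", "2GB")}   # one wrong guess

    correct = extracted & gold
    precision = len(correct) / len(extracted)      # 1/2 = 0.5
    recall = len(correct) / len(gold)              # 1/2 = 0.5
    print(f"precision={precision:.2f} recall={recall:.2f}")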

Assumptions/Preconditions

An extraction ontology enriched with usual value ranges (types), e.g.

o1:ntb_memory owlExtension:usually_has_range 128MB..8GB

o1:ntb_disk owlExtension:usually_has_range 20GB..500GB.
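A sketch of how such usual ranges could support extraction, e.g. deciding which property a value like "2 GB" most plausibly instantiates (property names follow the example above; the conversion to a common unit is an assumption):

    import re

    # Usual value ranges from the extraction ontology, converted to GB.
    USUAL_RANGE = {"o1:ntb_memory": (0.125, 8),   # 128MB..8GB
                   "o1:ntb_disk": (20, 500)}      # 20GB..500GB

    def candidate_properties(text):
        """Properties whose usual range covers the value mentioned in text."""
        m = re.search(r"(\d+(?:\.\d+)?)\s*(MB|GB)", text)
        if not m:
            return []
        value = float(m.group(1)) / (1024 if m.group(2) == "MB" else 1)
        return [p for p, (lo, hi) in USUAL_RANGE.items() if lo <= value <= hi]

    print(candidate_properties("2 GB DDR2 RAM"))   # ['o1:ntb_memory']
    print(candidate_properties("160 GB SATA"))     # ['o1:ntb_disk']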

Required resources

  1. as above

Associated methodologies that could help

Papers on data extraction; see e.g. http://www.cs.uic.edu/~liub/Web-Content-Mining-2.pdf

Recommend those aspects considered most important for inclusion in a standard representation of vagueness and uncertainty

The goal of this use case is to find out which models of uncertainty and vagueness are appropriate. In particular, it is clear that a more detailed ontology is needed (containing information supporting successful automatic extraction - this is not human uncertainty, it is machine uncertainty). One can expect that the system learns and improves during usage and that the extraction ontology is extended accordingly.

Extraction from textual pages needs another type of knowledge, e.g. transforming a sentence into a (Subject, Verb, Object) tree (full or partial); a naive sketch follows.
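A naive sketch of such an SVO transformation over a dependency parse, assuming spaCy and its en_core_web_sm model are available (real accident-report sentences would need far more robust handling):

    import spacy

    nlp = spacy.load("en_core_web_sm")

    def svo_triples(sentence):
        """Extract naive (subject, verb, object) triples from one sentence."""
        doc = nlp(sentence)
        triples = []
        for token in doc:
            if token.dep_ == "nsubj" and token.head.pos_ == "VERB":
                verb = token.head
                for obj in (c for c in verb.children if c.dep_ == "dobj"):
                    triples.append((token.text, verb.lemma_, obj.text))
        return triples

    print(svo_triples("The driver hit a tree."))  # [('driver', 'hit', 'tree')]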

Successful End

We will be able to extract RDF data from pages that are plain HTML files (both structured and textual) with respect to a given ontology, and annotate the original page with RDFa (a minimal sketch of this annotation step follows). Moreover, the result should be machine-understandable and serve as input for further processing - see the Discovery use case.
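A minimal sketch of the annotation step, producing an RDFa-annotated span for one extracted value (the helper function and the subject URI #ntb1 are illustrative; a prefix declaration for o1: is assumed to exist elsewhere in the page):

    def rdfa_span(visible_text, subject, prop, value):
        """Wrap an extracted value in an RDFa-annotated span (sketch only)."""
        return (f'<span about="{subject}" property="{prop}" '
                f'content="{value}">{visible_text}</span>')

    print(rdfa_span("20 000 CZK", "#ntb1", "o1:has_priceProperty", "20000"))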

Failed End

Many pages will not be machine processable, and a lot of information will remain practically unreachable for a human.

Main Scenario

The first type of scenario describes the process of extraction and annotation (details above or in the links), e.g.

  1. classify different web resources - some are easy for extraction (simple tables), some are more difficult (non-well-formed HTML pages with irregularities in the tag tree), up to dominantly textual content (see the sketch after this list)
  2. specify whether we are training a wrapper for a specific, frequently changing page or for a wide spectrum of pages
  3. specify methods and uncertainty issues
  4. and so on
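A toy illustration of step 1, scoring a page's "machine difficulty" by the amount of free text per markup tag (the interpretation of the score is an assumption: low means table-like and easier, high means dominantly textual and harder):

    from html.parser import HTMLParser

    class StructureProfile(HTMLParser):
        """Counts tags and visible text to estimate extraction difficulty."""
        def __init__(self):
            super().__init__()
            self.tags = 0
            self.text_chars = 0
        def handle_starttag(self, tag, attrs):
            self.tags += 1
        def handle_data(self, data):
            self.text_chars += len(data.strip())

    def difficulty(html):
        """Text characters per tag: low for table-like, high for textual pages."""
        profile = StructureProfile()
        profile.feed(html)
        return profile.text_chars / max(profile.tags, 1)

    print(difficulty("<table><tr><td>2GB</td><td>20000</td></tr></table>"))
    print(difficulty("<p>The car skidded off the road and hit a tree...</p>"))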

Or we can understand this scenario as the sequence of steps needed for inclusion in the final report; then the scenario has the following steps

  1. today's form of the use case
  2. formally design a core of an extraction ontology
  3. connect it to the Uncertainty Ontology http://www.w3.org/2005/Incubator/urw3/wiki/UncertaintyOntology and/or the PR-OWL ontology http://pr-owl.org and/or the Task_oriented_uncertainty_ontology http://www.w3.org/2005/Incubator/urw3/wiki/Discussion

Additional background information or references

The quoted papers contain sufficient references.

Variations

Open Issues