3.3 AO Facets

Faceted classification as proposed in this paper combines several notions from software-oriented technology designed for text management, such as full-text indexing and retrieval, free-text scan, document clustering, and the unique-word and vector-space approaches [3].

The full-text approach first generates a list of strings associated with a document. Then, at retrieval time, a match is attempted between each string in the index and a string in the available thesaurus. This strategy is combined with the unique-word and vector-space approaches in order to give more retrieval power to the strings that occur more often in the text.
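The indexing and matching steps just described can be sketched as follows. This is only an illustration under stated assumptions, not SOUR's actual implementation: the function names are hypothetical, the thesaurus is taken to be a flat list of terms, and relative frequency stands in for whatever weighting the unique-word and vector-space approaches apply.

```python
import re
from collections import Counter


def index_strings(text):
    """Build the full-text index: each lowercase word mapped to its frequency."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def match_thesaurus(index, thesaurus):
    """Match index strings against thesaurus terms (hypothetical helper).

    Returns the matched terms weighted by relative frequency, so that
    strings occurring more often in the text carry more retrieval power.
    """
    total = sum(index.values())
    if total == 0:
        return {}
    return {
        term: index[term.lower()] / total
        for term in thesaurus
        if index[term.lower()] > 0
    }


index = index_strings("The Web browser renders HTML. The web is hypertext.")
weights = match_thesaurus(index, ["Web", "HTML", "CERN"])
```

Here "Web" receives twice the weight of "HTML" because it occurs twice in the sample text, and "CERN" is dropped because it does not occur at all.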

Document clustering attempts to mimic the human thought process by grouping together documents with related ideas, concepts and terminology [8]. This notion is managed by SOUR's COMPARATOR & MODIFIER subsystem [14], as described later in this paper (see section 5).

All these notions can be put together to provide a default facet classification, which is attempted by SOUR's so-called Attempt Automatic Conceptualization (AAC) mechanism [17]. The AAC is applied to the HTML source text of the URL currently being accessed, provided that the URL identifies an HTML file. The quality of the available CTS/LTS pair is crucial to obtaining good results in faceted classification.

To extract the relevant information from the HTML source text, we choose the words that are included in the following HTML structures [6]:

Title - words between the tags <TITLE> and </TITLE>;
Headings - words between the tags <Hy> and </Hy>, where y is a number between 1 and 6 specifying the level of the heading.
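The extraction of words from these two structures can be sketched with Python's standard html.parser module. The class and function names below are hypothetical, and the word pattern (runs of ASCII letters and digits) is an assumption; the paper does not say how words are delimited.

```python
import re
from html.parser import HTMLParser


class TitleHeadingExtractor(HTMLParser):
    """Collect words found inside <TITLE> and <H1>..<H6> elements."""

    # HTMLParser reports tag names in lowercase.
    TARGET_TAGS = {"title", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self._depth = 0   # number of target elements currently open
        self.words = []   # extracted words, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.TARGET_TAGS:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in self.TARGET_TAGS and self._depth > 0:
            self._depth -= 1

    def handle_data(self, data):
        # Keep text only while inside a title or heading element.
        if self._depth > 0:
            self.words.extend(re.findall(r"[A-Za-z0-9]+", data))


def extract_words(html):
    """Return the title and heading words of an HTML source text."""
    parser = TitleHeadingExtractor()
    parser.feed(html)
    parser.close()
    return parser.words


sample = (
    "<html><head><TITLE>Classifying Internet Objects</TITLE></head>"
    "<body><H1>AO Facets</H1><p>ignored text</p>"
    "<h3>Fuzzy graph</h3></body></html>"
)
words = extract_words(sample)
```

Text outside the title and headings, such as the paragraph in the sample, is skipped.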

Since we are interested in the classification of documents by their contents, these must be reflected in the lexical terms available in the LTS. If, for example, we have a special interest in documents about the WWW, then the LTS should contain terms like Internet, Information, Web, Hypertext, Virtual, Browser, CERN, HTML, and so on. This specialization of lexical terms, which can improve both the conceptualization and the query mechanism, is supported by SOUR's capability of working with several CTS/LTS repositories.

CTS provides the capability to cope with features of human reasoning such as classifying by analogy and terminological vagueness [10][16][15]. In particular, lexical terms can be connected by conceptual distances interrelating terms (words) according to their contextual meaning. These distances may be regarded as degrees of membership of arcs in a fuzzy graph.
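One way to read such a fuzzy graph is sketched below. The membership degrees are invented for illustration, arcs are assumed to be symmetric, and the max-min rule (the strength of a path is its weakest arc, and the strength of connection between two terms is that of their strongest path) is a standard fuzzy-graph convention adopted here as an assumption; the paper does not state how SOUR composes conceptual distances.

```python
import heapq


def connection_strength(arcs, source, target):
    """Max-min strength of connection between two terms in a fuzzy graph.

    `arcs` maps (term_a, term_b) pairs to membership degrees in [0, 1].
    Returns 0.0 when no path links the two terms.
    """
    graph = {}
    for (a, b), mu in arcs.items():
        graph.setdefault(a, {})[b] = mu
        graph.setdefault(b, {})[a] = mu  # assumption: arcs are symmetric

    best = {source: 1.0}
    heap = [(-1.0, source)]  # max-heap emulated with negated strengths
    while heap:
        negated, node = heapq.heappop(heap)
        strength = -negated
        if node == target:
            return strength
        if strength < best.get(node, 0.0):
            continue  # stale heap entry
        for neighbour, mu in graph.get(node, {}).items():
            candidate = min(strength, mu)
            if candidate > best.get(neighbour, 0.0):
                best[neighbour] = candidate
                heapq.heappush(heap, (-candidate, neighbour))
    return 0.0


# Invented degrees of membership, for illustration only.
arcs = {
    ("Web", "Hypertext"): 0.8,
    ("Hypertext", "HTML"): 0.9,
    ("Web", "HTML"): 0.5,
    ("Web", "Browser"): 0.7,
}
strength = connection_strength(arcs, "Web", "HTML")
```

With these degrees, the indirect path through Hypertext (weakest arc 0.8) links Web and HTML more strongly than the direct arc (0.5).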

Figure 2 shows a possible set of conceptual relations among the terms described above.

Figure 2: Example of Conceptual Relations

The fuzzy logic technique associated with this information structure provides a method to reduce the so-called precision/recall trade-off. It is one of the methods that have had some success in decreasing the chances of missing important information [3].

The 6-tuples of predefined terms presented earlier (see section 2.1) were designed by Prieto-Díaz for the specific task of software classification. It is an open problem how these can be extended or adapted to information as generic as that accessed through arbitrary Internet navigation.

The pre-inserted values for each one of these facets will serve as guidelines for document classification. Possible matches between those values and the words extracted from the HTML text constitute part of the so-called AAC mechanism. The other parts are described in the sections below.
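The matching of pre-inserted facet values against extracted words might look like the sketch below. The facet names and values are placeholders rather than the ones defined in section 2.1, the function name is hypothetical, and taking the first matching value per facet is an assumption; the AAC mechanism may resolve multiple matches differently.

```python
def classify_by_facets(words, facet_values):
    """Propose a default value for each facet (hypothetical helper).

    For every facet, return the first pre-inserted value that also occurs
    among the extracted words (case-insensitive), or None if none matches.
    """
    seen = {word.lower() for word in words}
    return {
        facet: next((v for v in values if v.lower() in seen), None)
        for facet, values in facet_values.items()
    }


# Placeholder facets and pre-inserted values, for illustration only.
facet_values = {
    "function": ["search", "classifying"],
    "medium": ["disk", "internet"],
    "setting": ["library"],
}
default = classify_by_facets(["Classifying", "Internet", "Objects"], facet_values)
```

Facets for which no value matches are left as None, so that a later step or the user can fill them in.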


F. Luís Neves and José N. Oliveira, "Classifying Internet Objects", in WWW National Conference '95, Minho University, Braga, Portugal.