
High-level Thesaurus (HILT)

Contact e-mail:

Application

General purpose and services to the end user

Problems relating to the use of terminologies have been an impediment to information retrieval for many years, but the growth of the Web, the associated heterogeneous digital repositories, and the need for distributed cross-searching within multi-scheme information environments have recently drawn the issue into sharp focus. The HILT project, now in phase III, aims to research and develop solutions to the problems of cross-searching multi-subject-scheme information environments, and to provide a variety of other terminological searching aids. The project is currently at a pilot stage.

The current phase of HILT (phase III) is researching and developing an M2M demonstrator that will offer web-services access via the (SOAP-based) SRW protocol and use SKOS-Core as the 'mark-up' for sending terminology sets and for maintaining the structure of the terminological data requested and/or found in the database.

The expectation is that services will employ Search/Retrieve Web service (SRW) clients to interact transparently with the SRW-compliant terminology mapping server during normal service operation. Client requests made to the server are passed to a database of terminology sets and their associated mappings to DDC (the Dewey Decimal Classification system is used as the basis of vocabulary switching). Any hits identified are then sent back to the server for onward communication to the SRW clients. Although one of the primary purposes of HILT is to provide mappings, it also offers a variety of other terminological functions (e.g. data for interactive query expansion, hierarchical browsing of specific scheme hierarchies, etc.).

Experimentation with bona fide services has been conducted as part of HILT phase III (e.g. GoGeo!: http://www.gogeo.ac.uk/). To date, only a pilot implementation is available.

Functionality examples

In brief, HILT provides a series of functions that client services can invoke for a variety of purposes. It is therefore difficult to anticipate how such data might be used by third parties or how it might enhance the functionality of local services. However, the current functions are described and summarised in the table below, and hints at anticipated use are also provided. Only those functions requesting terminological data (and ergo SKOS-Core) are described below.

Application architecture

The application is invoked via an appropriately configured SRW client. The application comprises an SRW server, a SOAP server (which handles requests and wraps responses in SKOS-Core) and a terminology database (complete with mappings to the DDC spine). The testing architecture also includes collections databases (e.g. the UK Information Environment Services Registry (IESR), the Scottish Collections Network (SCONE), etc.); however, these do not return terminological data and can be ignored for the purposes of this document. Figure 1 illustrates how the various nodes interact.
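
As a rough illustration of the client side of this architecture, the sketch below wraps a standard SRW searchRetrieveRequest in a SOAP 1.1 envelope. It is only a sketch under stated assumptions: the function name build_srw_request and the query string are hypothetical, and how HILT encodes its own function names and parameters inside a request is not specified here.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SRW_NS = "http://www.loc.gov/zing/srw/"                # SRW request/response namespace


def build_srw_request(query: str, maximum_records: int = 10) -> str:
    """Return a SOAP envelope containing one SRW searchRetrieveRequest.

    Hypothetical helper: the query syntax a HILT server expects is an
    assumption and is passed through unchanged.
    """
    ET.register_namespace("SOAP", SOAP_NS)
    ET.register_namespace("SRW", SRW_NS)

    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    request = ET.SubElement(body, f"{{{SRW_NS}}}searchRetrieveRequest")

    # Standard SRW 1.1 request parameters.
    ET.SubElement(request, f"{{{SRW_NS}}}version").text = "1.1"
    ET.SubElement(request, f"{{{SRW_NS}}}query").text = query
    ET.SubElement(request, f"{{{SRW_NS}}}maximumRecords").text = str(maximum_records)

    return ET.tostring(envelope, encoding="unicode")


if __name__ == "__main__":
    # Placeholder query; a real client would POST this to the SRW endpoint.
    print(build_srw_request("example term", maximum_records=5))
```

A real client would send the resulting envelope over HTTP to the SRW server and then unpack the SKOS-Core payload from the response body.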

Figure 1: HILT (Phase III) architecture

Integration between vocabulary-linked functions and other application functions

HILT is a web-service and it is up to the client service administrator whether to incorporate the types of functionality mentioned previously. In addition, HILT is a third-party service and therefore has little knowledge of the documents held by services (or of their indexes). HILT does currently use the Google spellchecker (via the Google API) on the server side. As it does not meet the functionality required by HILT, it is expected to be replaced in the near future with a spellchecking/suggestions tool more suitable for the application.

Additional references

HILT website: http://hilt.cdlr.strath.ac.uk/

HILT III pilots & demonstrators: http://hilt.cdlr.strath.ac.uk/hilt3web/pilots.html

HILT III requirements document (Version 6.0): http://hilt.cdlr.strath.ac.uk/hilt3web/reports/h3requirementsv6.pdf

Nicholson, D. & McCulloch, E. Investigating the feasibility of a distributed, mapping-based, approach to solving subject interoperability problems in a multi-scheme, cross-service, retrieval environment, International Conference on Digital Libraries, 5-8 December 2006, India Habitat Center, New Delhi, India, 2006.

Nicholson, D. & McCulloch, E. HILT Phase III: Design requirements of an SRW-compliant Terminologies Mapping Pilot, 5th European Networked Knowledge Organization Systems (NKOS) Workshop, 10th ECDL Conference, 21 September 2006, Alicante, Spain, 2006. Available: http://hilt.cdlr.strath.ac.uk/hilt3web/Dissemination/HILTECDLwithnotes.pdf

Nicholson, D. & McCulloch, E. Interoperable Subject Retrieval in a Distributed Multi-Scheme Environment: New Developments in the HILT Project, Ibersid, 2-4 November 2005, Zaragoza, Spain, 2005. Available: http://cdlr.strath.ac.uk/pubs/nicholsond/ZaragosaPaperFinal.pdf

Macgregor, G., Joseph, A. & Nicholson, D. A SKOS Core approach to implementing an M2M terminology mapping server, International Conference on Semantic Web and Digital Libraries (ICSD-2007), 21-23 February 2007, Documentation Research & Training Centre (DRTC), Indian Statistical Institute (ISI), R.V. College, Bangalore, India, 2007. Available: http://hilt.cdlr.strath.ac.uk/hilt3web/Dissemination

Vocabulary

Titles

For HILT to solve issues pertaining to interoperability, it is possible that any terminology may need to be modelled using SKOS-Core. Currently however, HILT is experimenting with the following terminologies:

Because HILT has to serve data on each terminology dynamically, it is not possible to model the particular nuances of every terminology exactly; rather, a generic approach is taken, since most of the terminologies represent some form of relational vocabulary (e.g. thesauri, subject heading lists, etc.). The exception to this is DDC. Terminological data pertaining to DDC is modelled in SKOS according to basic guidance discussed on the SKOS email list (http://lists.w3.org/Archives/Public/public-esw-thes/) and at http://esw.w3.org/topic/SkosDev/ClassificationPubGuide.

General characteristics (size, coverage) of the vocabulary

Language(s) in which the vocabulary is provided

All terminologies are currently provided in English (UK/US). Experimenting with and incorporating multilingual terminologies is a future aspiration of the HILT team.

Structure explanation

HILT is a web-service. Any terminological data HILT provides is wrapped in SKOS-Core and is delivered to clients in response to client service requests. It is up to client services to use the data as they wish; this includes how they parse the data and how they present the requested terminological data to the user. The data sent to clients is modelled rather generically, but is sufficiently accurate to allow clients to parse it correctly, particularly for browsing dynamically created, scheme-specific hierarchical trees.
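
To make the client-side parsing step more concrete, the following sketch reads a small SKOS-Core RDF/XML fragment and rebuilds a browsable hierarchy from skos:broader links. The sample data, the example.org URIs and the helper names are assumptions made for illustration; they are not taken from an actual HILT response, whose exact wrapping may differ.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS = "http://www.w3.org/2004/02/skos/core#"

# Illustrative SKOS-Core payload (hypothetical URIs and labels).
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="http://example.org/scheme/animals">
    <skos:prefLabel>Animals</skos:prefLabel>
  </skos:Concept>
  <skos:Concept rdf:about="http://example.org/scheme/mammals">
    <skos:prefLabel>Mammals</skos:prefLabel>
    <skos:broader rdf:resource="http://example.org/scheme/animals"/>
  </skos:Concept>
  <skos:Concept rdf:about="http://example.org/scheme/birds">
    <skos:prefLabel>Birds</skos:prefLabel>
    <skos:broader rdf:resource="http://example.org/scheme/animals"/>
  </skos:Concept>
</rdf:RDF>"""


def parse_concepts(rdf_xml):
    """Return (labels, children, roots) extracted from SKOS-Core RDF/XML."""
    root = ET.fromstring(rdf_xml)
    labels = {}                   # concept URI -> preferred label
    children = defaultdict(list)  # parent URI -> list of child URIs
    roots = []                    # concepts with no skos:broader
    for concept in root.iter(f"{{{SKOS}}}Concept"):
        uri = concept.get(f"{{{RDF}}}about")
        labels[uri] = concept.findtext(f"{{{SKOS}}}prefLabel")
        parents = [b.get(f"{{{RDF}}}resource")
                   for b in concept.findall(f"{{{SKOS}}}broader")]
        if parents:
            for parent in parents:
                children[parent].append(uri)
        else:
            roots.append(uri)
    return labels, children, roots


def render_tree(labels, children, roots):
    """Return the hierarchy as indented lines, siblings sorted by label."""
    lines = []

    def walk(uri, depth):
        lines.append("  " * depth + labels[uri])
        for child in sorted(children[uri], key=labels.get):
            walk(child, depth + 1)

    for uri in sorted(roots, key=labels.get):
        walk(uri, 0)
    return lines


if __name__ == "__main__":
    print("\n".join(render_tree(*parse_concepts(SAMPLE))))
```

The sketch assumes every skos:broader target is itself present in the payload and that the hierarchy contains no cycles; a production client would need to guard against both.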

The terminologies offered are widely used by digital libraries, repositories and information services within the UK and beyond. Further details on any specific characteristics of the terminologies used can be found at the URLs given in the previous section. Most are relational vocabularies, such as thesauri and subject heading lists (i.e. UNESCO, NMR, HASSET, MeSH, IPSV, LCSH, AAT). The mapping spine used for switching (i.e. DDC) is a taxonomic classification with analytico-synthetic attributes. There are also several term lists (i.e. JACS, JITA, GCMD), which have a correspondingly simple structure.

Machine-readable representation of the vocabulary

A mock client (i.e. HILT SOAP client demonstrator) has been created to enable testing: http://hiltm2m.cdlr.strath.ac.uk/hiltm2m/hiltsoapclient.php. This allows the HILT functions to be invoked and for machine-readable representations of the vocabularies (i.e. SKOS-Core) and their mappings (where applicable) to be viewed within SOAP envelopes. Representations of the terminologies in use can be viewed in this way.

Please refer to the functions table (Table 1) to interpret HILT functions correctly. Please note that regular and ongoing modifications to the SKOS-Core wrappings are made to improve the way in which terminologies are modelled.

Software applications used to create and/or maintain the vocabulary, features lacking for the case

Maintenance of the terminologies is not applicable in our case, since all the terminologies used are maintained by external agencies; we normally receive copies of these terminologies from the maintenance agencies (e.g. as XML) for import into the database. Where updates have taken place, terminologies are re-imported.

Occasional 'cleaning up' of this data is required in order to eliminate non-standard characters arising from ASCII, but this is normally undertaken by editing data directly or running routines in the database (SQL Server). Mappings (using DDC notation) and their equivalences are maintained in this way also, although it is our intention to create a suitable user interface to aid mapping management.

Structure of the database used to currently manage the vocabulary

A relational database management system (SQL Server) is used to manage the terminologies. The table structure is different for each terminology in order to maintain the structure of the terminology as received from external agencies. This aids consistency and makes maintenance simpler. Even though all the terminologies are included in a single database, they remain independent of each other.

Owing to the large number of terminologies HILT is managing - many of which are complex - we provide only a small sample in the attached file below: AAT (two tables), HASSET (five tables) and IPSV (three tables).

Table diagram

Please note: RT is Related Term, BT Broader Term, PT Preferred Term and NPT Non Preferred Term.
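
The relationship codes above can be pictured with a small relational sketch. The schema below is hypothetical (the table and column names are invented for illustration, and SQLite stands in for SQL Server); the actual HILT tables differ per terminology and are not reproduced here.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with one hypothetical terminology."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE demo_terms (
            term_id      INTEGER PRIMARY KEY,
            label        TEXT NOT NULL,
            is_preferred INTEGER NOT NULL      -- 1 = PT, 0 = NPT
        );
        CREATE TABLE demo_relations (
            from_id INTEGER REFERENCES demo_terms(term_id),
            rel     TEXT CHECK (rel IN ('BT', 'RT', 'PT')),
            to_id   INTEGER REFERENCES demo_terms(term_id)
        );
    """)
    conn.executemany("INSERT INTO demo_terms VALUES (?, ?, ?)", [
        (1, "Animals", 1),
        (2, "Mammals", 1),
        (3, "Mammalia", 0),
    ])
    conn.executemany("INSERT INTO demo_relations VALUES (?, ?, ?)", [
        (2, "BT", 1),  # Mammals has broader term Animals
        (3, "PT", 2),  # Mammalia (NPT) points to its preferred term Mammals
    ])
    return conn


def related(conn, label, rel):
    """Return labels linked from `label` by relationship code `rel`."""
    rows = conn.execute(
        """
        SELECT t2.label
        FROM demo_terms t1
        JOIN demo_relations r ON r.from_id = t1.term_id
        JOIN demo_terms t2 ON t2.term_id = r.to_id
        WHERE t1.label = ? AND r.rel = ?
        ORDER BY t2.label
        """,
        (label, rel),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    db = make_demo_db()
    print(related(db, "Mammals", "BT"))
    print(related(db, "Mammalia", "PT"))
```

Keeping one such set of tables per terminology, as described above, means each vocabulary can be re-imported without touching the others.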

Additional references

Nicholson, D. & McCulloch, E. (2006). HILT Phase III: Design requirements of an SRW-compliant Terminologies Mapping Pilot, 5th European Networked Knowledge Organization Systems (NKOS) Workshop, 10th ECDL Conference, 21 September 2006, Alicante, Spain. Available: http://hilt.cdlr.strath.ac.uk/hilt3web/Dissemination/HILTECDLwithnotes.pdf

Mitchell, J. S. (ed.). (2001). People, places and things. A list of popular Library of Congress Subject Headings with Dewey numbers. Forest Press, Ohio.

Vocabulary mappings

Mapped vocabularies

As mentioned before, HILT is using a number of terminologies. Concepts from these terminologies are mapped to a central spine (DDC) which is used as a switching language to facilitate the following HILT functions: get_all_records, get_ddc_records and get_non_ddc_records.
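
The idea of using DDC as a switching language can be sketched as two lookups: from a term to its DDC notation, then from that notation to the terms other schemes map to it. The data, scheme names and function names below are purely illustrative assumptions; they do not reproduce HILT's mappings or the behaviour of the HILT functions named above.

```python
from collections import defaultdict

# (scheme, term, DDC notation) rows; illustrative only, not HILT data.
MAPPINGS = [
    ("SchemeA", "Mammals", "599"),
    ("SchemeB", "Mammalia", "599"),
    ("SchemeA", "Birds", "598"),
    ("SchemeB", "Aves", "598"),
]


def build_indexes(rows):
    """Index the mapping rows in both directions around the DDC spine."""
    term_to_ddc = defaultdict(set)   # lower-cased term -> DDC notations
    ddc_to_terms = defaultdict(set)  # DDC notation -> (scheme, term) pairs
    for scheme, term, ddc in rows:
        term_to_ddc[term.lower()].add(ddc)
        ddc_to_terms[ddc].add((scheme, term))
    return term_to_ddc, ddc_to_terms


def switch(term, target_scheme, term_to_ddc, ddc_to_terms):
    """Translate `term` into `target_scheme` by going through DDC."""
    results = set()
    for ddc in term_to_ddc.get(term.lower(), ()):
        for scheme, candidate in ddc_to_terms[ddc]:
            if scheme == target_scheme:
                results.add(candidate)
    return sorted(results)


if __name__ == "__main__":
    t2d, d2t = build_indexes(MAPPINGS)
    print(switch("mammals", "SchemeB", t2d, d2t))
```

Because every scheme is mapped only to the spine, adding a new terminology requires one set of mappings to DDC rather than pairwise mappings to every other scheme.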

Extracts of Mappings

HILT is a web-service. Any terminological data HILT provides is wrapped in SKOS-Core (and the Mapping Vocabulary Specification (MVS) where applicable) and is delivered to clients in response to client service requests. Client services are responsible for how they wish to present mappings or other terminological data to users.

Some illustrative SKOS-Core and MVS examples are provided below. However, since HILT uses a number of terminologies, readers are encouraged to use the mock client (i.e. HILT SOAP client demonstrator, http://hiltm2m.cdlr.strath.ac.uk/hiltm2m/hiltsoapclient.php) to view the mappings.

Useful test terms to input when using the SOAP client demonstrator for get_all_records and get_ddc_records are:

Useful test DDC numbers to input for get_non_ddc_records are:

see RucHilt/BriefExamples

Types of mapping used

The mapping types currently used in HILT are those specified in the Mapping Vocabulary Specification (MVS): exactMatch, narrowMatch, broadMatch, minorMatch, majorMatch.

However, it is felt that in the HILT context these mapping types may be inadequate for such things as:

In addition to the above, it is also worth noting that HILT offers the ability for clients to create a 'disambiguation' stage during user searching. This process of disambiguation not only resolves the existence of homographs (as the term 'disambiguation' may suggest), but encompasses a variety of processes allowing users to qualify their search requirements.

To this end HILT has been exploring the use of a second set of match types to be used by clients instead of - or in tandem with - the SKOS MVS match types. The match types considered include some of those proposed by Chaplan (1995). Chaplan's match types are in many respects a departure from the conceptual approach taken in the SKOS MVS and the general proclivity on the Semantic Web for representing concepts. The focus of Chaplan's match types is more on differences in the way in which a term is represented (e.g. singular/plural match, spelling variation, word order variation, etc.) than on reconciling concepts. However, we hypothesise that a set of more detailed match types will be needed to assist users during the aforementioned disambiguation stage and to address the other issues noted above. We also consider both approaches to be complementary, with the conceptual nature of the SKOS MVS providing a level of abstraction above - and preceding the use of - a lexically-based set of match types.
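
A lexically-based match type of the kind described above might be detected along the following lines. This is a minimal sketch, not Chaplan's scheme or HILT's implementation: the labels returned, the naive singularisation rule and the function names are all assumptions, and spelling variation is deliberately left out.

```python
import re


def _tokens(term):
    """Lower-case a term and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", term.lower())


def _singular(word):
    """Very naive English singularisation (assumption, not a full stemmer)."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith("es") and word[:-2].endswith(("s", "x", "z", "ch", "sh")):
        return word[:-2]
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def lexical_match(a, b):
    """Classify how two term strings differ lexically, or return None."""
    ta, tb = _tokens(a), _tokens(b)
    if ta == tb:
        return "exact"
    if [_singular(w) for w in ta] == [_singular(w) for w in tb]:
        return "singular/plural"
    if sorted(ta) == sorted(tb):
        return "word order variation"
    return None


if __name__ == "__main__":
    print(lexical_match("Library", "Libraries"))
    print(lexical_match("Labor unions", "Unions, labor"))
```

A client could run such a check after a conceptual match has been found, using the result to explain to the user why two differently worded terms were offered together.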

It is worth noting that this more detailed set of match types need not necessarily be lexically-based (although we currently consider such an approach useful) and could feasibly be an extension of the current MVS with the introduction of finer match types for such purposes. In particular, it is thought that the majorMatch and minorMatch types would benefit from further definition and perhaps the introduction of further gradations. For example, there is currently no indication whether a majorMatch between two concepts is 'weak' or 'strong' (i.e. quantification of the level of match beyond 50% as currently specified as a guideline).
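
One way to picture such gradations is to grade a partial overlap numerically. The sketch below assumes the degree of match is measured as the share of one concept's indexed resources that also fall under the other concept; the 'weak'/'strong' labels and the 0.8 threshold are hypothetical and are not part of the MVS. Proper-subset cases, which would more naturally be expressed as broadMatch or narrowMatch, are outside its scope.

```python
def grade_partial_match(a, b, strong_threshold=0.8):
    """Grade the overlap between two sets of indexed resources.

    Returns (match_type, strength). `strength` is only set for majorMatch
    and is a hypothetical refinement, not an MVS property.
    """
    if not a or not b:
        raise ValueError("both resource sets must be non-empty")
    if a == b:
        return ("exactMatch", None)

    share = len(a & b) / len(a)  # fraction of a's resources also in b
    if share == 0:
        return (None, None)      # no overlap, so no mapping
    if share <= 0.5:
        # Exactly 50% is treated as minor here; the guideline is silent on it.
        return ("minorMatch", None)
    strength = "strong" if share >= strong_threshold else "weak"
    return ("majorMatch", strength)


if __name__ == "__main__":
    print(grade_partial_match({1, 2, 3, 4}, {1, 2, 3, 9}))
```

Under these assumptions a 75% overlap is reported as a weak majorMatch and a 90% overlap as a strong one, which is the kind of distinction the paragraph above suggests is currently missing.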

Additional references

Chaplan, M. A. (1995). Mapping Laborline thesaurus terms to Library of Congress subject headings: implications for vocabulary switching, Library Quarterly, 65(1), 39-61.

Nicholson, D., Dawson, A. & Shiri, A. (2006). HILT: A pilot terminology mapping service with a DDC spine, Cataloging & Classification Quarterly 42(3/4), 187-200. Available: http://hilt.cdlr.strath.ac.uk/hilt3web/Dissemination