High-level Thesaurus (HILT)
George Macgregor: <george DOT macgregor AT strath DOT ac DOT uk>
Emma McCulloch: <e DOT mcculloch AT strath DOT ac DOT uk>
Dennis Nicholson: <d DOT m DOT nicholson AT strath DOT ac DOT uk>
General purpose and services to the end user
Problems relating to the use of terminologies have been an impediment to information retrieval for many years, but the growth of the Web, the associated heterogeneous digital repositories, and the need for distributed cross-searching within multi-scheme information environments have recently drawn the issue into sharp focus. The HILT project, which is now in phase III, aims to research and develop solutions to the problems of cross-searching multi-subject scheme information environments, as well as providing a variety of other terminological searching aids. The project is currently at a pilot stage.
The current phase of HILT (phase III) is developing a machine-to-machine (M2M) demonstrator that will offer web-services access via the (SOAP-based) SRW protocol and use SKOS-Core as the 'mark-up' for sending terminology sets and for preserving the structure of the terminological data requested and/or found in the database.
The expectation is that services will employ Search/Retrieve Web service (SRW) clients to interact transparently with the SRW-compliant terminology mapping server during normal service operation. Client requests made to the server will be sent to a database of terminology sets and associated mappings to DDC (the Dewey Decimal Classification system is used as the basis of vocabulary switching). Any hits identified are then sent back to the server for onward communication to the SRW clients. Although one of the primary purposes of HILT is to provide mappings, it also offers a variety of other terminological functions (e.g. data for interactive query expansion, hierarchical browsing of specific scheme hierarchies, etc.).
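As a rough illustration of the client side of this exchange, the sketch below builds a SOAP 1.1 envelope around an SRW searchRetrieveRequest using only the Python standard library. The helper name `build_srw_request` and the CQL query string are assumptions made for illustration; how HILT's own functions are expressed within SRW requests is not specified here, and the sketch does not contact any server.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SRW_NS = "http://www.loc.gov/zing/srw/"                # SRW namespace


def build_srw_request(cql_query, maximum_records=10):
    """Return a SOAP 1.1 envelope wrapping an SRW searchRetrieveRequest.

    Hypothetical helper: the parameters HILT actually expects are not
    documented in this text, so only generic SRW elements are used.
    """
    ET.register_namespace("SOAP", SOAP_NS)
    ET.register_namespace("SRW", SRW_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    request = ET.SubElement(body, f"{{{SRW_NS}}}searchRetrieveRequest")
    ET.SubElement(request, f"{{{SRW_NS}}}version").text = "1.1"
    ET.SubElement(request, f"{{{SRW_NS}}}query").text = cql_query
    ET.SubElement(request, f"{{{SRW_NS}}}maximumRecords").text = str(maximum_records)
    return ET.tostring(envelope, encoding="unicode")


# A quoted phrase is a valid CQL query; the term is one of the test terms
# suggested later in this document.
envelope_xml = build_srw_request('"shore protection"')
```

A real client would POST this envelope to the SRW server and unpack the SKOS-Core payload from the response.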
In brief, HILT provides a series of functions that can be invoked by client services for a variety of purposes. It is therefore difficult to anticipate how such data might be used by third parties or how they might enhance the functionality of local services. However, the current functions are summarised in the table below, together with hints at their anticipated use. Only those functions requesting terminological data (and ergo SKOS-Core) are included.
Table 1: HILT functions requesting terminological data
Function: get_all_records
Description: Get records that include – or are directly or indirectly mapped to records that include – specified term or term phrase.
Notes and anticipated use: User enters term via the embedded SRW client service (and request is sent to SRW server and SOAP requests handler). The client service processes the results to offer DDC and non-DDC records (e.g. LCSH, IPSV, AAT, etc.) to the user.
Function: get_ddc_records
Description: Get any DDC record that either includes the term specified, or that is mapped to by a record that includes the term specified.
Notes and anticipated use: User enters term via the embedded SRW client service (and request is sent to SRW server and SOAP requests handler). The client service processes the results to offer the user captions possibly relevant to their query from DDC with corresponding DDC numbers.
Function: get_non_ddc_records
Description: Get any non-DDC record that includes a mapping to the DDC number sent.
Notes and anticipated use: Following identification of a DDC number by the user (either via a disambiguation process or by some other means), get_non_ddc_records allows the client to request details of any non-DDC record that includes a mapping to the DDC number sent. This process enables the identification of a variety of terms from disparate terminologies associated with a particular DDC number to be used to search relevant repositories or information services using the correct terminology to match local indexes.
Function: get_filtered_set
Description: Get record and fields terminology set that meets the specified parameters.
Notes and anticipated use: Filters would be by things such as subject scheme (e.g. UNESCO and LCSH only, say), or specified fields (e.g. mappings only, say, or broader terms and narrower terms only, say) and so on. The primary anticipated purpose of get_filtered_set is to facilitate the enrichment of users' search vocabulary, provide user feedback and allow limited interactive query expansion. The filtered search can provide (where they exist) related terms (RT), broader terms (BT), narrower terms (NT), preferred terms (PT), and non-preferred terms (NPT). Scope notes may also be provided, depending on the characteristics of the terminology. As a requirement of earlier project work, get_filtered_set can also be invoked to provide the data necessary to create browsable hierarchical concept trees.
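The kind of filtering described for get_filtered_set can be sketched as follows. This is a minimal in-memory illustration only: the record layout, the `schemes` and `fields` parameters, and all of the sample data are assumptions invented for the example, not HILT's actual interface or database content.

```python
# Hypothetical, simplified records. The field names follow the abbreviations
# used in this document (PT, BT, NT, RT); the data itself is invented.
RECORDS = [
    {"scheme": "UNESCO", "PT": "Pollution", "BT": ["Environmental degradation"],
     "NT": ["Air pollution"], "RT": ["Wastes"], "mappings": ["363.73"]},
    {"scheme": "LCSH", "PT": "Pollution", "BT": [],
     "NT": ["Marine pollution"], "RT": [], "mappings": ["363.73"]},
    {"scheme": "MeSH", "PT": "Environmental Pollution", "BT": [],
     "NT": [], "RT": [], "mappings": ["363.73"]},
]


def get_filtered_set(records, schemes=None, fields=None):
    """Keep records from the given schemes, trimmed to the given fields.

    `schemes` and `fields` are optional sets; None means "no restriction".
    The scheme and preferred term are always kept so that each record
    remains identifiable.
    """
    always_kept = ("scheme", "PT")
    filtered = []
    for record in records:
        if schemes is not None and record["scheme"] not in schemes:
            continue
        trimmed = {
            name: value
            for name, value in record.items()
            if name in always_kept or fields is None or name in fields
        }
        filtered.append(trimmed)
    return filtered


# Broader and narrower terms only, from UNESCO and LCSH only.
subset = get_filtered_set(RECORDS, schemes={"UNESCO", "LCSH"}, fields={"BT", "NT"})
```

A client could use such a subset to offer broader and narrower terms for interactive query expansion.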
The application is invoked via an appropriately configured SRW client. The application comprises an SRW server, a SOAP server (a requests handler that wraps responses in SKOS-Core) and a terminology database (complete with mappings to the DDC spine). The testing architecture also includes collections databases (e.g. the UK Information Environment Services Registry (IESR), the Scottish Collections Network (SCONE), etc.); however, these do not return terminological data and can be ignored for the purposes of this document. Figure 1 illustrates how the various nodes interact.
Integration between vocabulary-linked functions and other application functions
HILT is a web-service, and it is up to client service administrators whether they wish to incorporate the types of functionality mentioned previously. In addition, HILT is a third-party service and therefore has little knowledge of the documents held by services (or of their indexes). HILT does currently use the Google spellchecker (via the Google API) on the server side. As this does not meet the functionality required by HILT, it is expected to be replaced in the near future with a spellchecking/suggestions tool more suitable for the application.
HILT website: http://hilt.cdlr.strath.ac.uk/
HILT III pilots & demonstrators: http://hilt.cdlr.strath.ac.uk/hilt3web/pilots.html
HILT III requirements document (Version 6.0): http://hilt.cdlr.strath.ac.uk/hilt3web/reports/h3requirementsv6.pdf
Nicholson, D. & McCulloch, E. Investigating the feasibility of a distributed, mapping-based, approach to solving subject interoperability problems in a multi-scheme, cross-service, retrieval environment, International Conference on Digital Libraries, 5-8 December 2006, India Habitat Center, New Delhi, India, 2006.
Nicholson, D. & McCulloch, E. HILT Phase III: Design requirements of an SRW-compliant Terminologies Mapping Pilot, 5th European Networked Knowledge Organization Systems (NKOS) Workshop, 10th ECDL Conference, 21 September 2006, Alicante, Spain, 2006. Available: http://hilt.cdlr.strath.ac.uk/hilt3web/Dissemination/HILTECDLwithnotes.pdf
Nicholson, D. & McCulloch, E. Interoperable Subject Retrieval in a Distributed Multi-Scheme Environment: New Developments in the HILT Project, Ibersid, 2-4 November 2005, Zaragoza, Spain, 2005. Available: http://cdlr.strath.ac.uk/pubs/nicholsond/ZaragosaPaperFinal.pdf
Macgregor, G., Joseph, A. & Nicholson, D. A SKOS Core approach to implementing an M2M terminology mapping server, International Conference on Semantic Web and Digital Libraries (ICSD-2007), 21-23 February 2007, Documentation Research & Training Centre (DRTC), Indian Statistical Institute (ISI), R.V. College, Bangalore, India, 2007. Available: http://hilt.cdlr.strath.ac.uk/hilt3web/Dissemination
For HILT to solve issues pertaining to interoperability, it is possible that any terminology may need to be modelled using SKOS-Core. Currently, however, HILT is experimenting with the following terminologies:
Art and Architecture Thesaurus (AAT), The J. Paul Getty Trust. See: http://www.getty.edu/research/conducting_research/vocabularies/aat/
Dewey Decimal Classification (DDC), OCLC. See: http://www.oclc.org/dewey/default.htm
Global Change Master Directory (GCMD) (Science Keywords), NASA. See: http://gcmd.nasa.gov/Resources/valids/keyword_list.html
HASSET Thesaurus, UK Data Archive at the University of Essex. See: http://www.data-archive.ac.uk/search/hassetSearch.asp
Integrated Public Sector Vocabulary (IPSV), e-Government Unit (UK). See: http://www.esd.org.uk/standards/ipsv/2.00/viewer/
Joint Academic Coding System (JACS), Universities and Colleges Admission Service (UK). See: http://www.ucas.ac.uk/figures/ucasdata/subject/
JITA Classification Schema, E-Prints in Library and Information Science (E-LIS). See: http://eprints.rclis.org/jita.html
Library of Congress Subject Headings (LCSH), Library of Congress (USA). See: http://www.loc.gov/cds/lcsh.html
Medical Subject Headings (MeSH), National Library of Medicine (USA). See: http://www.nlm.nih.gov/mesh/
National Monuments Record Thesaurus (NMR), English Heritage. See: http://thesaurus.english-heritage.org.uk/
UNESCO Thesaurus, UNESCO and the University of London Computer Centre. See: http://www2.ulcc.ac.uk/unesco/
Because HILT has to serve data on each terminology dynamically, it is not possible to model the particular nuances of every terminology exactly; rather, a generic approach is taken since most of the terminologies represent some form of relational vocabulary (e.g. thesauri, subject heading lists, etc.). The exception to this is DDC. Terminological data pertaining to DDC is modelled in SKOS according to basic guidance discussed on the SKOS email list http://lists.w3.org/Archives/Public/public-esw-thes/ and http://esw.w3.org/topic/SkosDev/ClassificationPubGuide.
General characteristics (size, coverage) of the vocabulary
- AAT: 34,000 concepts, comprising 131,000 terms. Focussed on describing art, architecture, decorative arts, material culture, and archival materials.
- DDC: 40,000 numbers and associated captions. A universal classification scheme aiming to accommodate most areas of knowledge at varying levels of specificity. Note that DDC contains many more numbers for representing concepts; however, the version in use by HILT is a truncated version supplied by OCLC based on their People, Places and Things handbook (See Mitchell, 2001).
- GCMD: 1,500 terms. Focussed on concepts pertaining to earth science (e.g. geology, marine science, oceanography, etc.).
- HASSET: 10,000 terms.
- IPSV: 8,000 terms. A scheme primarily optimised for resource discovery in UK public sector organisations.
- JACS: A simple term list pertaining to the UK Higher Education sector, with only 100 terms.
- JITA: A simple term list as used by E-LIS. Focussed on library and information science and comprises 150 terms.
- LCSH: 62,000 terms. A universal subject heading scheme aiming to accommodate most areas of knowledge.
- MeSH: 25,000 terms, focussed on medicine and allied sciences.
- NMR: 10,000 terms. A scheme primarily focussed on representing common assets found in the area of national heritage, such as buildings, monuments, cultural sites, and so forth.
- UNESCO: 5,000 terms. The UNESCO Thesaurus includes subject terms for the areas of education, science, culture, social and human sciences, information and communication, and politics, law and economics. It also includes countries and groupings of countries: political, economic, geographic, ethnic and religious, and linguistic groupings.
Language(s) in which the vocabulary is provided
All terminologies are currently provided in English (UK/US). Experimenting with, and incorporating, multilingual terminologies is a future aspiration of the HILT team.
HILT is a web-service. Any terminological data HILT provides is wrapped in SKOS-Core and is delivered to clients in response to client service requests. It is up to client services to use the data as they wish; this includes how they parse the data and how they present the requested terminological data to the user. The data sent to clients is modelled rather generically, but is sufficiently accurate to allow clients to parse data correctly, particularly for browsing dynamically created scheme-specific hierarchical trees.
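To give a sense of what client-side parsing for a hierarchical tree might involve, the sketch below reads a small SKOS-Core RDF/XML fragment and renders an indented tree from its skos:broader links. The fragment is an invented sample with example.org URIs, not actual HILT output, and the helper names `build_tree` and `render` are hypothetical; only the RDF and SKOS Core namespaces are standard.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS = "http://www.w3.org/2004/02/skos/core#"

# Invented sample data for illustration only.
SAMPLE = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="http://example.org/c/1">
    <skos:prefLabel>Hydraulic engineering</skos:prefLabel>
  </skos:Concept>
  <skos:Concept rdf:about="http://example.org/c/2">
    <skos:prefLabel>Shore protection</skos:prefLabel>
    <skos:broader rdf:resource="http://example.org/c/1"/>
  </skos:Concept>
  <skos:Concept rdf:about="http://example.org/c/3">
    <skos:prefLabel>Harbours</skos:prefLabel>
    <skos:broader rdf:resource="http://example.org/c/1"/>
  </skos:Concept>
</rdf:RDF>
"""


def build_tree(rdf_xml):
    """Return (labels, children, roots) from skos:Concept elements."""
    root = ET.fromstring(rdf_xml.strip())
    labels = {}                   # concept URI -> preferred label
    children = defaultdict(list)  # broader concept URI -> narrower concept URIs
    has_parent = set()
    for concept in root.findall(f"{{{SKOS}}}Concept"):
        uri = concept.get(f"{{{RDF}}}about")
        labels[uri] = concept.findtext(f"{{{SKOS}}}prefLabel")
        for broader in concept.findall(f"{{{SKOS}}}broader"):
            children[broader.get(f"{{{RDF}}}resource")].append(uri)
            has_parent.add(uri)
    roots = [uri for uri in labels if uri not in has_parent]
    return labels, children, roots


def render(labels, children, uri, depth=0):
    """Return the subtree under `uri` as a list of indented label lines."""
    lines = ["  " * depth + labels[uri]]
    for child in sorted(children.get(uri, []), key=labels.get):
        lines.extend(render(labels, children, child, depth + 1))
    return lines


labels, children, roots = build_tree(SAMPLE)
tree_lines = render(labels, children, roots[0])
```

A production client would more likely use a full RDF parser, since RDF/XML permits several equivalent serialisations that this element-based approach does not handle.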
The terminologies offered are widely used by digital libraries, repositories and information services within the UK and beyond. Further details on any specific characteristics of the terminologies used can be found at the URLs given in the previous section. Most are relational vocabularies, such as thesauri and subject heading lists (i.e. UNESCO, NMR, HASSET, MeSH, IPSV, LCSH, AAT). The mapping spine used for switching (i.e. DDC) is a taxonomic classification with analytico-synthetic attributes. There are several term lists (i.e. JACS, JITA, GCMD); their simple structure reflects this.
Machine-readable representation of the vocabulary
A mock client (i.e. HILT SOAP client demonstrator) has been created to enable testing: http://hiltm2m.cdlr.strath.ac.uk/hiltm2m/hiltsoapclient.php. This allows the HILT functions to be invoked and for machine-readable representations of the vocabularies (i.e. SKOS-Core) and their mappings (where applicable) to be viewed within SOAP envelopes. Representations of the terminologies in use can be viewed in this way.
Please refer to the functions table (Table 1) to interpret HILT functions correctly. Please note that regular and ongoing modifications to the SKOS-Core wrappings are made to improve the way in which terminologies are modelled.
Software applications used to create and/or maintain the vocabulary, features lacking for the case
Maintenance of the terminologies is not applicable in our case since all the terminologies used are maintained by external agencies; we normally receive copies of these terminologies from the maintenance agencies (e.g. as XML) for importation into the database. Where updates have taken place, terminologies are re-imported.
Occasional 'cleaning up' of this data is required in order to eliminate non-standard characters arising from ASCII, but this is normally undertaken by editing data directly or running routines in the database (SQL Server). Mappings (using DDC notation) and their equivalences are maintained in this way also, although it is our intention to create a suitable user interface to aid mapping management.
Structure of the database used to currently manage the vocabulary
A relational database management system (SQL Server) is used to manage the terminologies. The table structure is different for each terminology in order to maintain the structure of the terminology as received from external agencies. This aids consistency and makes maintenance simpler. Even though all the terminologies are included in a single database, they remain independent of each other.
Owing to the large number of terminologies HILT is managing - many of which are complex - we provide only a small sample in the attached file below: AAT (two tables), HASSET (five tables) and IPSV (three tables).
Please note: RT is Related Term, BT Broader Term, PT Preferred Term and NPT Non Preferred Term.
Nicholson, D. & McCulloch, E. (2006). HILT Phase III: Design requirements of an SRW-compliant Terminologies Mapping Pilot, 5th European Networked Knowledge Organization Systems (NKOS) Workshop, 10th ECDL Conference, 21 September 2006, Alicante, Spain. Available: http://hilt.cdlr.strath.ac.uk/hilt3web/Dissemination/HILTECDLwithnotes.pdf
Mitchell, J, S. (ed). (2001). People, places and things. A list of popular Library of Congress Subject Headings with Dewey numbers. Forest Press, Ohio.
As mentioned before, HILT is using a number of terminologies. Concepts from these terminologies are mapped to a central spine (DDC) which is used as a switching language to facilitate the following HILT functions: get_all_records, get_ddc_records and get_non_ddc_records.
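The switching role of the DDC spine can be sketched with a small in-memory model. The function names match those above, but their signatures, the substring matching, and the term-to-DDC pairings in the toy data are all assumptions made for illustration; the real service works against the terminology database and returns SKOS-Core.

```python
# DDC captions as quoted among this document's test DDC numbers.
DDC_CAPTIONS = {"363.73": "Pollution", "627.58": "Shore protection"}

# Toy mapping table of (scheme, term, DDC number). These pairings are
# assumed for illustration and are not taken from HILT's database.
MAPPINGS = [
    ("GCMD", "Environmental impacts", "363.73"),
    ("LCSH", "Pollution", "363.73"),
    ("LCSH", "Shore protection", "627.58"),
]


def get_ddc_records(term):
    """DDC records whose caption includes the term, or that are mapped to
    by a non-DDC record that includes the term."""
    needle = term.lower()
    numbers = {n for n, caption in DDC_CAPTIONS.items() if needle in caption.lower()}
    numbers |= {n for _, t, n in MAPPINGS if needle in t.lower()}
    return sorted((n, DDC_CAPTIONS[n]) for n in numbers)


def get_non_ddc_records(ddc_number):
    """Non-DDC records that include a mapping to the given DDC number."""
    return [(scheme, term) for scheme, term, n in MAPPINGS if n == ddc_number]


def get_all_records(term):
    """DDC records for the term, plus every non-DDC record reachable from
    those DDC numbers via the spine."""
    ddc = get_ddc_records(term)
    non_ddc = [rec for number, _ in ddc for rec in get_non_ddc_records(number)]
    return ddc, non_ddc
```

In this toy model a query in one scheme's vocabulary reaches terms from another scheme only by passing through a shared DDC number, which is the essence of using DDC as a switching language.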
Extracts of Mappings
HILT is a web-service. Any terminological data HILT provides is wrapped in SKOS-Core (and the Mapping Vocabulary Specification (MVS) where applicable) and is delivered to clients in response to client service requests. Client services are responsible for how they wish to present mappings or other terminological data to users.
Some illustrative SKOS-Core and MVS examples are provided below. However, since HILT uses a number of terminologies, readers are encouraged to use the mock client (i.e. HILT SOAP client demonstrator, http://hiltm2m.cdlr.strath.ac.uk/hiltm2m/hiltsoapclient.php) to view the mappings.
Useful test terms to input when using the SOAP client demonstrator for get_all_records and get_ddc_records are:
- Environmental impacts (GCMD)
- Shore protection (DDC)
- Plant genetics (HASSET)
- Civil emergencies (IPSV)
- Land use site (NMR)
Useful test DDC numbers to input for get_non_ddc_records are:
- 363.73 (DDC caption: Pollution)
- 627.58 (DDC caption: Shore protection)
- 631.53 (DDC caption: Plant propagation)
- 363.34 (DDC caption: Disasters)
- 333 (DDC caption: Economics of land and energy)
Types of mapping used
The mapping types currently used in HILT are those specified in the Mapping Vocabulary Specification (MVS): exactMatch, narrowMatch, broadMatch, minorMatch, majorMatch.
However, it is felt that in the HILT context these mapping types may be inadequate for such things as:
- The ranking of large result sets according to the degree of concordance with users' preferred terminology.
- For example, a user query, 'tooth', is submitted. A large result set is returned comprising hundreds of exactMatch resources. Among the large result set, resources pertaining to 'teeth' are found and – under SKOS MVS definitions – are considered equivalent to those on 'tooth'. However, in such a large result set it would be useful to rank results more meaningfully for the user. In the aforementioned example therefore, resources indexed 'tooth' would feature slightly higher in the result set than 'teeth' since the term is an exact match and is, ultimately, most relevant to the user since it exactly matches the original information query. As such, 'teeth' might be considered a plural match and therefore feature lower in the rankings. Match types exemplifying greater specificity would be useful to aid such ranking.
- Providing users with details of the precise nature of the relationship(s) between their entered query and their retrieved result set (which will invariably include mapped terms from other terminologies, or comprise resources retrieved using terms derived from mapped terminologies), or imparting sufficient information during subject hierarchy browsing to enable users to make informed decisions about the relevance of mapped terms.
- The need to reconcile resources by concept is clearly necessary on the Semantic Web. However, the role of match types is of importance (in HILT and probably other potential services) when informing users (via client services) of why particular terms have been retrieved in response to a user query. For example, a user who is searching for resources on lung disease and submits the query 'lung disease' may retrieve resources indexed under 'pneumoconiosis'. These resources are indeed relevant, but a user who is uninformed about the way in which indexes are mapped and matched may doubt that pneumoconiosis is (under the SKOS MVS) an exactMatch when browsing the results. Informing users in some detail is envisaged as necessary in order to facilitate the re-formulation of subsequent queries. Providing users with such mappings could also be used to generate (potentially) improved relevance feedback.
- Helping identify mapping regularities between specific terminologies, thus facilitating the research and development of improved automated routines to assist in large-scale terminology mapping. (This is not directly relevant to HILT, but the identification of subtle patterns in mapping relationships between terminologies could assist HILT at a later date if large-scale, machine-assisted terminology mapping were being undertaken.)
In addition to the above, it is also worth noting that HILT offers the ability for clients to create a 'disambiguation' stage during user searching. This process of disambiguation not only resolves the existence of homographs (as the term 'disambiguation' may suggest), but encompasses a variety of processes allowing users to qualify their search requirements.
To this end, HILT has been exploring the use of a second set of match types to be used by clients instead of - or in tandem with - the SKOS MVS match types. The match types considered include some of those proposed by Chaplan (1995). Chaplan's match types are in many respects a departure from the conceptual approach taken in the SKOS MVS and the general proclivity on the Semantic Web for representing concepts. The focus of Chaplan's match types is more on differences in the way in which a term is represented (e.g. singular/plural match, spelling variation, word order variation, etc.) than on reconciling concepts. However, we hypothesise that a set of more detailed match types will be needed to assist users during the aforementioned disambiguation stage and to address the other issues noted above. We also consider both approaches to be complementary, with the conceptual nature of the SKOS MVS providing a level of abstraction above - and preceding the use of - a lexically-based set of match types.
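How a lexically-based set of match types might support ranking, as in the 'tooth' and 'teeth' example earlier, can be sketched as follows. The match type labels are loosely inspired by the categories mentioned above and are not Chaplan's definitions; the singularisation rule, the irregular plural table and the rank order are deliberately naive assumptions.

```python
# Small, hand-written table of irregular plurals (an assumption; a real
# system would need a proper morphological resource).
IRREGULAR_PLURALS = {"teeth": "tooth", "feet": "foot"}

# Lower numbers rank higher. The ordering is an illustrative choice.
RANK = {"exact": 0, "singular/plural": 1, "word order variation": 2, "other": 3}


def singular(word):
    """Very naive singularisation, for illustration only."""
    if word in IRREGULAR_PLURALS:
        return IRREGULAR_PLURALS[word]
    if word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def lexical_match_type(query, term):
    """Classify how `term` differs lexically from `query`."""
    q, t = query.lower().split(), term.lower().split()
    if q == t:
        return "exact"
    if sorted(q) == sorted(t):
        return "word order variation"
    if [singular(w) for w in q] == [singular(w) for w in t]:
        return "singular/plural"
    return "other"


def rank_terms(query, terms):
    """Order terms so that closer lexical matches to the query come first."""
    return sorted(terms, key=lambda term: (RANK[lexical_match_type(query, term)], term))


ranked = rank_terms("tooth", ["teeth", "dentistry", "tooth"])
```

Here resources indexed under 'tooth' would be listed ahead of those indexed under 'teeth', even though both may be conceptually equivalent under the MVS.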
It is worth noting that this more detailed set of match types need not necessarily be lexically-based (although we currently consider such an approach useful) and could feasibly be an extension of the current MVS with the introduction of finer match types for such purposes. In particular, it is thought that the majorMatch and minorMatch types would benefit from further definition and perhaps the introduction of further gradations. For example, there is currently no indication whether a majorMatch between two concepts is 'weak' or 'strong' (i.e. quantification of the level of match beyond 50% as currently specified as a guideline).
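One way such gradations might be quantified is sketched below, assuming that the sets of resources indexed under each concept are available for comparison. The 50% boundary follows the guideline mentioned above; the `strong_threshold` value of 0.75, the function name `grade_match` and the 'weak'/'strong' labels are hypothetical, and the sketch deliberately ignores exactMatch, broadMatch and narrowMatch.

```python
from typing import Optional, Set, Tuple


def grade_match(indexed_a: Set[str], indexed_b: Set[str],
                strong_threshold: float = 0.75
                ) -> Tuple[Optional[str], Optional[str], float]:
    """Return (match type, gradation, overlap) for concept A against concept B.

    `overlap` is the share of resources indexed under A that are also
    indexed under B. Above 50% is treated as majorMatch, graded 'weak' or
    'strong' by the hypothetical `strong_threshold`; above 0% as minorMatch.
    """
    if not indexed_a:
        raise ValueError("concept A has no indexed resources")
    overlap = len(indexed_a & indexed_b) / len(indexed_a)
    if overlap > 0.5:
        gradation = "strong" if overlap >= strong_threshold else "weak"
        return "majorMatch", gradation, overlap
    if overlap > 0:
        return "minorMatch", None, overlap
    return None, None, overlap


# Resource identifiers are invented placeholders.
result = grade_match({"r1", "r2", "r3", "r4", "r5"}, {"r1", "r2", "r3"})
```

In this example three of the five resources indexed under concept A are shared, so the match would be reported as a 'weak' majorMatch.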
Chaplan, M. A. (1995). Mapping Laborline thesaurus terms to Library of Congress subject headings: implications for vocabulary switching, Library Quarterly, 65(1), 39-61.
Nicholson, D., Dawson, A. & Shiri, A. (2006). HILT: A pilot terminology mapping service with a DDC spine, Cataloging & Classification Quarterly 42(3/4), 187-200. Available: http://hilt.cdlr.strath.ac.uk/hilt3web/Dissemination