Section 0. Contact and confidentiality
George Macgregor: <george DOT macgregor AT strath DOT ac DOT uk>
Emma McCulloch: <e DOT mcculloch AT strath DOT ac DOT uk>
Dennis Nicholson: <d DOT m DOT nicholson AT strath DOT ac DOT uk>
Do you mind your use case being made public on the working group website and documents?
Section 1. Application
In this section we ask you to provide some information about the application for which the vocabulary(ies) and/or vocabulary mappings are being used. Please note:
- If your use case does not involve any specific application, but consists rather in the description of a specific vocabulary, skip straight to Section 2.
- If your application makes use of links between different vocabularies, do not forget to fill in Section 3!
1.1. What is the title of the application?
High-level Thesaurus (HILT) (http://hilt.cdlr.strath.ac.uk/)
1.2. What is the general purpose of the application?
- What services does it provide to the end-user?
Problems relating to the use of terminologies have been an impediment to information retrieval for many years, but the growth of the Web, associated heterogeneous digital repositories, and the need for distributed cross-searching within multi-scheme information environments have recently drawn the issue into sharp focus. The HILT project, which is now in phase III, aims to research and develop solutions for problems pertaining to cross-searching multi-subject scheme information environments, and to provide a variety of other terminological searching aids. The project is currently at a pilot stage. The current phase of HILT (phase III) is developing a machine-to-machine (M2M) demonstrator that will offer web-services access via the (SOAP-based) SRW protocol and use SKOS-Core as the 'mark-up' for sending terminology sets and maintaining the structural nature of the terminological data requested and/or found in the database. The expectation is that services will employ Search/Retrieve Web service (SRW) clients to interact transparently with the SRW-compliant terminology mapping server during normal service operation. Client requests made to the server will be sent to a database of terminology sets and associated mappings to DDC (the Dewey Decimal Classification system is used as the basis of vocabulary switching). Hits identified are then sent back to the server for onward communication to the SRW clients. Although one of the primary purposes of HILT is to provide mappings, it also offers a variety of other terminological functions (e.g. data for interactive query expansion, hierarchical browsing of specific scheme hierarchies, etc.). Experimentation with bona fide services has been conducted as part of HILT phase III (e.g. GoGeo!: http://www.gogeo.ac.uk/). To date, only a pilot implementation is available.
1.3. Provide some examples of the functionality of the application. Try to illustrate all of the functionalities in which the vocabulary(ies) and/or vocabulary mappings are involved.
In brief, HILT provides a series of functions that can be invoked by client services for a variety of purposes. It is therefore difficult to anticipate how such data might be used by third parties or how they might enhance the functionality of local services. However, the current functions are described and summarised in Table 1 and hints at anticipated use are also provided. Only those functions requesting terminological data (and ergo SKOS-Core) are described below.
Table 1. HILT functions that request terminological data
Function: get_all_records
Description: Get records that include (or are directly or indirectly mapped to records that include) the specified term or term phrase.
Notes and anticipated use: The user enters a term via the embedded SRW client service (the request is sent to the SRW server and SOAP requests handler). The client service processes the results to offer DDC and non-DDC records (e.g. LCSH, IPSV, AAT, etc.) to the user.
Function: get_ddc_records
Description: Get any DDC record that either includes the term specified, or that is mapped to by a record that includes the term specified.
Notes and anticipated use: The user enters a term via the embedded SRW client service (the request is sent to the SRW server and SOAP requests handler). The client service processes the results to offer the user DDC captions possibly relevant to their query, with corresponding DDC numbers.
Function: get_non_ddc_records
Description: Get any non-DDC record that includes a mapping to the DDC number sent.
Notes and anticipated use: Following identification of a DDC number by the user (either via a disambiguation process or by some other means), get_non_ddc_records allows the client to request details of any non-DDC record that includes a mapping to the DDC number sent. This enables terms from disparate terminologies associated with a particular DDC number to be identified and then used to search relevant repositories or information services using the correct terminology to match local indexes.
Function: get_filtered_set
Description: Get the set of terminology records and fields that meets the specified parameters.
Notes and anticipated use: Filters would be by things such as subject scheme (e.g. UNESCO and LCSH only) or specified fields (e.g. mappings only, or broader terms and narrower terms only). The primary anticipated purpose of get_filtered_set is to facilitate the enrichment of users' search vocabulary, provide user feedback and allow limited interactive query expansion. The filtered search can provide (where they exist) related terms (RT), broader terms (BT), narrower terms (NT), preferred terms (PT), and non-preferred terms (NPT). Scope notes may also be provided, depending on the characteristics of the terminology. As a requirement of earlier project work, get_filtered_set can also be invoked to provide the data necessary to create browsable hierarchical concept trees.
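The behaviour of the four functions can be sketched in a few lines of Python. This is only an illustrative model over a toy in-memory store, not the HILT implementation: the Record class, the substring matching rule, the sample records and their DDC mappings are all assumptions made for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Record:
    scheme: str   # e.g. "DDC", "GCMD", "LCSH"
    term: str     # caption or term
    ddc: str      # the DDC number the record has, or is mapped to
    relations: dict = field(default_factory=dict)  # e.g. {"BT": [...], "NT": [...]}

# Toy data; the mappings and relations shown are hypothetical.
RECORDS = [
    Record("DDC", "Shore protection", "627.58"),
    Record("DDC", "Pollution", "363.73"),
    Record("GCMD", "Environmental impacts", "363.73"),
    Record("LCSH", "Pollution", "363.73",
           {"BT": ["Environmental sciences"], "NT": ["Air pollution"]}),
]

def _includes(record, term):
    # Assumed matching rule: case-insensitive substring match.
    return term.lower() in record.term.lower()

def _hit_numbers(term):
    # DDC numbers reached by any record that includes the term.
    return {r.ddc for r in RECORDS if _includes(r, term)}

def get_all_records(term):
    # Records that include the term, or are linked to such records via the DDC spine.
    numbers = _hit_numbers(term)
    return [r for r in RECORDS if r.ddc in numbers]

def get_ddc_records(term):
    # DDC records that include the term, or are mapped to by a record that does.
    numbers = _hit_numbers(term)
    return [r for r in RECORDS if r.scheme == "DDC" and r.ddc in numbers]

def get_non_ddc_records(ddc_number):
    # Non-DDC records that carry a mapping to the DDC number sent.
    return [r for r in RECORDS if r.scheme != "DDC" and r.ddc == ddc_number]

def get_filtered_set(term, schemes=None, fields=None):
    # Matching records, restricted to the given schemes and relation fields.
    results = []
    for r in RECORDS:
        if not _includes(r, term):
            continue
        if schemes is not None and r.scheme not in schemes:
            continue
        kept = {k: v for k, v in r.relations.items()
                if fields is None or k in fields}
        results.append((r.scheme, r.term, kept))
    return results
```

In this model a client could call get_ddc_records("Environmental impacts") to obtain candidate DDC captions, let the user pick a number, and then call get_non_ddc_records with that number to collect the equivalent terms in other schemes.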
1.4. What is the architecture of the application?
- What are the main components?
- Are the components and/or the data distributed across a network, or across the Web?
The application is invoked via an appropriately configured SRW client. The application comprises an SRW server, a SOAP server (requests handler, wrapping responses in SKOS-Core) and a terminology database (complete with mappings to the DDC spine). Our testing architecture also includes collections databases (e.g. the UK Information Environment Services Registry (IESR), Scottish Collections Network (SCONE), etc.); however, these do not return terminological data and can be ignored for the purposes of this document. Figure 1 illustrates how the various nodes interact.
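To make the role of the requests handler more concrete, the following Python sketch wraps a list of concepts as SKOS-Core RDF/XML inside a SOAP 1.1 envelope. It is a simplified assumption about the payload shape: the SRW response elements that would normally surround the record data are omitted, and the function name, dictionary keys and example.org URIs are hypothetical.

```python
import xml.etree.ElementTree as ET

SOAP = "http://schemas.xmlsoap.org/soap/envelope/"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS = "http://www.w3.org/2004/02/skos/core#"

# Register readable prefixes for serialisation.
ET.register_namespace("soap", SOAP)
ET.register_namespace("rdf", RDF)
ET.register_namespace("skos", SKOS)

def wrap_in_soap(concepts):
    """Serialise concepts as SKOS-Core inside a SOAP envelope.

    Each concept is a dict with 'uri' and 'prefLabel' keys and an
    optional 'broader' list of URIs (hypothetical structure).
    """
    envelope = ET.Element(f"{{{SOAP}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP}}}Body")
    rdf = ET.SubElement(body, f"{{{RDF}}}RDF")
    for concept in concepts:
        node = ET.SubElement(rdf, f"{{{SKOS}}}Concept",
                             {f"{{{RDF}}}about": concept["uri"]})
        label = ET.SubElement(node, f"{{{SKOS}}}prefLabel")
        label.text = concept["prefLabel"]
        for parent in concept.get("broader", []):
            ET.SubElement(node, f"{{{SKOS}}}broader",
                          {f"{{{RDF}}}resource": parent})
    return ET.tostring(envelope, encoding="unicode")
```

A client receiving such a message would strip the envelope and hand the rdf:RDF element to whatever SKOS-aware parsing it uses locally.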
1.5. Briefly describe any special strategy involved in the processing of user actions, e.g. query expansion using the vocabulary structure.
Not applicable in our instance. See 1.6 below.
1.6. Are the functionalities associated with the controlled vocabulary(ies) integrated in any way with functionalities provided by other means? (For example, search and browse using a structured vocabulary might be integrated with free-text searching and/or some sort of social bookmarking or recommender system.)
HILT is a web-service and it would be up to the client service administrator whether they wished to incorporate the types of functionalities mentioned in 1.6. In addition, HILT is a third party service and therefore has little knowledge of the documents held by services (or of their indexes). HILT does currently use the Google spellchecker (via the Google API) on the server side. This is expected to be removed in the near future, as the 'Did you mean' suggestions offered by Google do not meet the functionality required by HILT, and replaced with a spellchecking/suggestions tool more suitable for the application.
1.7. Any additional information, references and/or hyperlinks.
HILT website: http://hilt.cdlr.strath.ac.uk/
HILT III pilots & demonstrators: http://hilt.cdlr.strath.ac.uk/hilt3web/pilots.html
HILT III requirements document (Version 6.0): http://hilt.cdlr.strath.ac.uk/hilt3web/reports/h3requirementsv6.pdf
Nicholson, D. & McCulloch, E. Investigating the feasibility of a distributed, mapping-based, approach to solving subject interoperability problems in a multi-scheme, cross-service, retrieval environment, International Conference on Digital Libraries, 5-8 December 2006, India Habitat Center, New Delhi, India, 2006.
Nicholson, D. & McCulloch, E. HILT Phase III: Design requirements of an SRW-compliant Terminologies Mapping Pilot, 5th European Networked Knowledge Organization Systems (NKOS) Workshop, 10th ECDL Conference, 21 September 2006, Alicante, Spain, 2006. Available: http://hilt.cdlr.strath.ac.uk/hilt3web/Dissemination/HILTECDLwithnotes.pdf
Nicholson, D. & McCulloch, E. Interoperable Subject Retrieval in a Distributed Multi-Scheme Environment: New Developments in the HILT Project, Ibersid, 2-4 November 2005, Zaragoza, Spain, 2005. Available: http://cdlr.strath.ac.uk/pubs/nicholsond/ZaragosaPaperFinal.pdf
Macgregor, G., Joseph, A. & Nicholson, D. A SKOS Core approach to implementing an M2M terminology mapping server, International Conference on Semantic Web and Digital Libraries (ICSD-2007), 21-23 February 2007, Documentation Research & Training Centre (DRTC), Indian Statistical Institute (ISI), R.V. College, Bangalore, India, 2007. (An online version should be available by late Feb 2007)
Section 2. Vocabulary(ies)
In this section we ask you to provide some information about the vocabulary or vocabularies you would like to be able to represent using SKOS. Please note:
- If you have multiple vocabularies to describe, you may repeat this section for each one individually or you may provide a single description that encompasses all of your vocabularies.
- If your use case describes a generic application of one or more vocabularies and/or vocabulary mappings, you may skip this section.
- If your vocabulary case contains cross-vocabulary links (between the vocabularies you presented or to external vocabularies), please fill in section 3!
For HILT to solve issues pertaining to interoperability, it is possible that any terminology may need to be modelled using SKOS-Core. Currently however, HILT is experimenting with the following terminologies:
• Art and Architecture Thesaurus (AAT), The J. Paul Getty Trust. See: http://www.getty.edu/research/conducting_research/vocabularies/aat/
• Dewey Decimal Classification (DDC), OCLC. See: http://www.oclc.org/dewey/default.htm
• Global Change Master Directory (GCMD) (Science Keywords), NASA. See: http://gcmd.nasa.gov/Resources/valids/keyword_list.html
• HASSET Thesaurus, UK Data Archive at the University of Essex. See: http://www.data-archive.ac.uk/search/hassetSearch.asp
• Integrated Public Sector Vocabulary (IPSV), e-Government Unit (UK). See: http://www.esd.org.uk/standards/ipsv/2.00/viewer/
• Joint Academic Coding System (JACS), Universities and Colleges Admission Service (UK). See: http://www.ucas.ac.uk/figures/ucasdata/subject/
• JITA Classification Schema, E-Prints in Library and Information Science (E-LIS). See: http://eprints.rclis.org/jita.html
• Library of Congress Subject Headings (LCSH), Library of Congress (USA). See: http://www.loc.gov/cds/lcsh.html
• Medical Subject Headings (MeSH), National Library of Medicine (USA). See: http://www.nlm.nih.gov/mesh/
• National Monuments Record Thesaurus (NMR), English Heritage. See: http://thesaurus.english-heritage.org.uk/
• UNESCO Thesaurus, UNESCO and the University of London Computer Centre. See: http://www2.ulcc.ac.uk/unesco/
Because HILT has to serve data on each terminology dynamically, it is not possible to model the particular nuances of every terminology exactly; rather, a generic approach is taken since most of the terminologies represent some form of relational vocabulary (e.g. thesauri, subject heading lists, etc.). The exception to this is DDC.
Terminological data pertaining to DDC is modelled in SKOS according to basic guidance discussed on the SKOS email list http://lists.w3.org/Archives/Public/public-esw-thes/ and http://esw.w3.org/topic/SkosDev/ClassificationPubGuide.
2.1. What is the title of the vocabulary? If you're describing multiple vocabularies, please provide as many titles as you can.
See answer to question 2.
2.2. Briefly describe the general characteristics of the vocabulary, e.g. scope, size...
• AAT: 34,000 concepts, comprising 131,000 terms. Focussed on describing art, architecture, decorative arts, material culture, and archival materials.
• DDC: 40,000 numbers and associated captions. A universal classification scheme aiming to accommodate most areas of knowledge at varying levels of specificity. Note that DDC contains many more numbers for representing concepts; however, the version in use by HILT is a truncated version supplied by OCLC based on their People, Places and Things handbook (See Mitchell, 2001).
• GCMD: 1,500 terms. Focussed on concepts pertaining to earth science (e.g. geology, marine science, oceanography, etc.).
• HASSET: 10,000 terms.
• IPSV: 8,000 terms. A scheme primarily optimised for resource discovery in UK public sector organisations.
• JACS: A simple term list pertaining to the UK Higher Education sector, with only 100 terms.
• JITA: A simple term list as used by E-LIS. Focussed on library and information science and comprises 150 terms.
• LCSH: 62,000 terms. A universal subject heading scheme aiming to accommodate most areas of knowledge.
• MeSH: 25,000 terms, focussed on medicine and allied sciences.
• NMR: 10,000 terms. A scheme primarily focussed on representing common assets found in the area of national heritage, such as buildings, monuments, cultural sites, and so forth.
• UNESCO: 5,000 terms. The UNESCO Thesaurus includes subject terms for the areas of education, science, culture, social and human sciences, information and communication, and politics, law and economics. It also includes countries and groupings of countries: political, economic, geographic, ethnic and religious, and linguistic groupings.
2.3. In which language(s) is the vocabulary provided?
- In the case of partial translations, how complete are these?
All terminologies are currently provided in English (UK/US). Experimenting with, and incorporating, multilingual terminologies is a future aspiration of the HILT team.
2.4. Please provide below some extracts from the vocabulary. Use the layout or presentation format that you would normally provide for the users of the vocabulary. Please ensure that the extracts you provide illustrate all of the features of the vocabulary.
This question is not necessarily applicable to HILT (see 2.6 for machine-readable representations of some of the terminologies used by HILT). HILT is a web-service. Any terminological data HILT provides is wrapped in SKOS-Core and is delivered to clients in response to client service requests. It is up to client services to use the data how they wish; this includes how they parse the data and how they present the terminological data requested to the user. The data sent to clients is modelled rather generically, but is sufficiently accurate to allow clients to parse data correctly, particularly for browsing dynamically created scheme-specific hierarchical trees. The terminologies offered are widely used by digital libraries, repositories and information services within the UK and beyond. Further details on any specific characteristics of the terminologies used can be gleaned at the URLs given in the answer to question 2. Most are relational vocabularies, such as thesauri and subject heading lists (i.e. UNESCO, NMR, HASSET, MeSH, IPSV, LCSH, AAT). The mapping spine used for switching (i.e. DDC) is a taxonomic classification with analytico-synthetic attributes. There are several term lists (i.e. JACS, JITA, GCMD); their simple structure reflects this.
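As an illustration of how a client might turn such data into a browsable tree, the Python sketch below parses a small SKOS-Core RDF/XML fragment and groups concepts by their skos:broader links. The sample fragment, its example.org URIs and its hierarchy are invented for the example and are not taken from HILT output.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS = "http://www.w3.org/2004/02/skos/core#"

# Hypothetical SKOS-Core fragment, as a client might extract it from a response.
SAMPLE = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:skos="{SKOS}">
  <skos:Concept rdf:about="http://example.org/t/1">
    <skos:prefLabel>Earth sciences</skos:prefLabel>
  </skos:Concept>
  <skos:Concept rdf:about="http://example.org/t/2">
    <skos:prefLabel>Oceanography</skos:prefLabel>
    <skos:broader rdf:resource="http://example.org/t/1"/>
  </skos:Concept>
  <skos:Concept rdf:about="http://example.org/t/3">
    <skos:prefLabel>Geology</skos:prefLabel>
    <skos:broader rdf:resource="http://example.org/t/1"/>
  </skos:Concept>
</rdf:RDF>"""

def build_tree(rdf_xml):
    """Return (labels, children, roots) derived from skos:broader links."""
    root = ET.fromstring(rdf_xml)
    labels, children, has_parent = {}, defaultdict(list), set()
    for concept in root.iter(f"{{{SKOS}}}Concept"):
        uri = concept.get(f"{{{RDF}}}about")
        labels[uri] = concept.findtext(f"{{{SKOS}}}prefLabel", default=uri)
        for broader in concept.findall(f"{{{SKOS}}}broader"):
            children[broader.get(f"{{{RDF}}}resource")].append(uri)
            has_parent.add(uri)
    roots = [uri for uri in labels if uri not in has_parent]
    return labels, children, roots

def render(labels, children, uris, depth=0):
    """Indented listing, sorted by label; assumes the hierarchy has no cycles."""
    lines = []
    for uri in sorted(uris, key=lambda u: labels.get(u, u)):
        lines.append("  " * depth + labels.get(uri, uri))
        lines.extend(render(labels, children, children.get(uri, []), depth + 1))
    return lines
```

Calling render(*build_tree(SAMPLE)) yields one line per concept, indented by depth, which a client could map onto whatever tree widget it uses.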
2.5. Describe the structure of the vocabulary.
- What are the main building blocks?
- What types of relationship are used? If you can, provide examples by referring to the extracts given in paragraph 2.4.
See 2.4 above.
2.6. Is a machine-readable representation of the vocabulary already available (e.g. as an XML document)? If so, we would be grateful if you could provide some example data or point us to a hyperlink.
We have created a mock client (i.e. HILT SOAP client demonstrator) to enable testing. This allows the HILT functions to be invoked and for machine-readable representations of the vocabularies (i.e. SKOS-Core) and their mappings (where applicable) to be viewed within SOAP envelopes. Representations of the terminologies in use can be viewed in this way. The HILT SOAP client demonstrator is available at: http://hiltm2m.cdlr.strath.ac.uk/hiltm2m/hiltsoapclient.php. Please refer to the functions table (Table 1) provided in 1.3 to interpret HILT functions correctly. Please note that regular and ongoing modifications to the SKOS-Core wrappings are made to improve the way in which terminologies are modelled.
2.7. Are any software applications used to create and/or maintain the vocabulary?
- Are there any features which these software applications currently lack which are required by your use case?
Maintenance of the terminologies is not applicable in our case since all the terminologies used are maintained by external agencies; we normally receive copies of these terminologies from the maintenance agencies (e.g. as XML) for importation into the database. Where updates have taken place, terminologies are re-imported. Occasional 'cleaning up' of this data is required in order to eliminate non-standard characters arising from ASCII, but this is normally undertaken by editing data directly or running routines in the database (SQL Server). Mappings (using DDC notation) and their equivalences are also maintained in this way, although it is our intention to create a suitable user interface to aid mapping management.
2.8. If a database application is used to store and/or manage the vocabulary, how is the database structured? Illustration by means of some table sample is welcome.
A relational database management system (SQL Server) is used to manage the terminologies. The table structure is different for each terminology in order to maintain the structure of the terminology as received from external agencies. This aids consistency and makes maintenance simpler. Even though all the terminologies are included in a single database, they remain independent of each other. Owing to the large number of terminologies HILT is managing - many of which are complex - we provide only a small sample below. Please feel free to get in touch to discuss this further or to receive further samples. Below is a table diagram of AAT (two tables), HASSET (five tables) and IPSV (three tables). Please note:
RT - Related Term
BT - Broader Term
PT - Preferred Term
NPT - Non Preferred Term
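For readers without access to the diagram, the general idea of per-terminology tables plus a DDC mapping table can be sketched with Python's built-in sqlite3 module. This is a hypothetical, much simplified layout: the table and column names, the sample rows and the mapping shown are assumptions for illustration, and HILT itself uses SQL Server with a different structure for each terminology.

```python
import sqlite3

def make_db():
    """Create an in-memory database with a toy, hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE ipsv_term (
            id INTEGER PRIMARY KEY,
            term TEXT NOT NULL,
            is_preferred INTEGER NOT NULL
        );
        CREATE TABLE ipsv_relation (
            from_id INTEGER NOT NULL REFERENCES ipsv_term(id),
            to_id INTEGER NOT NULL REFERENCES ipsv_term(id),
            rel_type TEXT NOT NULL CHECK (rel_type IN ('BT', 'RT', 'PT'))
        );
        CREATE TABLE ddc_mapping (
            scheme TEXT NOT NULL,
            term_id INTEGER NOT NULL,
            ddc_number TEXT NOT NULL,
            match_type TEXT NOT NULL
        );
    """)
    # Sample rows are invented for the example.
    conn.executemany("INSERT INTO ipsv_term VALUES (?, ?, ?)",
                     [(1, "Civil emergencies", 1), (2, "Flood emergencies", 1)])
    conn.execute("INSERT INTO ipsv_relation VALUES (2, 1, 'BT')")
    conn.execute("INSERT INTO ddc_mapping VALUES ('IPSV', 1, '363.34', 'exactMatch')")
    return conn

def terms_for_ddc(conn, ddc_number):
    """IPSV terms mapped to the given DDC number."""
    rows = conn.execute(
        """SELECT t.term FROM ddc_mapping AS m
           JOIN ipsv_term AS t ON t.id = m.term_id
           WHERE m.scheme = 'IPSV' AND m.ddc_number = ?
           ORDER BY t.term""", (ddc_number,))
    return [row[0] for row in rows]

def broader_terms(conn, term):
    """Broader terms (BT) recorded for the given IPSV term."""
    rows = conn.execute(
        """SELECT b.term FROM ipsv_term AS t
           JOIN ipsv_relation AS r ON r.from_id = t.id AND r.rel_type = 'BT'
           JOIN ipsv_term AS b ON b.id = r.to_id
           WHERE t.term = ?
           ORDER BY b.term""", (term,))
    return [row[0] for row in rows]
```

Keeping the mapping table separate from the per-scheme tables mirrors the point made above: each terminology stays independent, and only the mappings tie it to the DDC spine.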
2.9. Were any published standards, textbooks or written guidelines followed during the design and construction of the vocabulary?
- Did you decide to diverge from their recommendations in any way, and if so, how and why?
Not applicable. See 2.7.
2.10. How are changes to the vocabulary managed?
Changes to the terminologies are not managed by HILT (See 2.7). As noted in 2.7, terminology mappings and their equivalences are maintained by editing the database directly.
2.11. Any additional information, references and/or hyperlinks.
Nicholson, D. & McCulloch, E. (2006). HILT Phase III: Design requirements of an SRW-compliant Terminologies Mapping Pilot, 5th European Networked Knowledge Organization Systems (NKOS) Workshop, 10th ECDL Conference, 21 September 2006, Alicante, Spain. Available: http://hilt.cdlr.strath.ac.uk/hilt3web/Dissemination/HILTECDLwithnotes.pdf
Mitchell, J. S. (ed.). (2001). People, places and things. A list of popular Library of Congress Subject Headings with Dewey numbers. Forest Press, Ohio.
Section 3. Vocabulary Mappings
In this section we ask you to provide some information about the mappings or links between vocabularies you would like to be able to represent using SKOS. Please note:
- If your use case does not involve vocabulary mappings or links, you may skip this section!
3.1. Which vocabularies are you linking/mapping from/to?
As mentioned in sections 1 and 2, HILT is using a number of terminologies. These are listed in section 2. Concepts from these terminologies are mapped to a central spine (DDC) which is used as a switching language to facilitate the following HILT functions: get_all_records, get_ddc_records and get_non_ddc_records.
3.2. Please provide below some extracts from the mappings or links between the vocabularies. Use the layout or presentation format that you would normally provide for the users of the mappings. Please ensure that the examples you provide illustrate all of the different types of mapping or link.
As in 2.4, this question is not necessarily applicable to HILT. HILT is a web-service. Any terminological data HILT provides is wrapped in SKOS-Core (and the Mapping Vocabulary Specification (MVS) where applicable) and is delivered to clients in response to client service requests. Client services are responsible for how they wish to present mappings or other terminological data to users. Some illustrative SKOS-Core and MVS examples are provided below. However, since HILT uses a number of terminologies, readers are encouraged to use the mock client (i.e. HILT SOAP client demonstrator) to view the mappings. The HILT SOAP client demonstrator is available at: http://hiltm2m.cdlr.strath.ac.uk/hiltm2m/hiltsoapclient.php.
Useful test terms to input when using the SOAP client demonstrator for get_all_records and get_ddc_records are:
• Environmental impacts (GCMD)
• Shore protection (DDC)
• Plant genetics (HASSET)
• Civil emergencies (IPSV)
• Land use site (NMR)
Useful test DDC numbers to input for get_non_ddc_records are:
• 363.73 (DDC caption: Pollution)
• 627.58 (DDC caption: Shore protection)
• 631.53 (DDC caption: Plant propagation)
• 363.34 (DDC caption: Disasters)
• 333 (DDC caption: Economics of land and energy)
3.3. Describe the different types of mapping used, with reference to the examples given in paragraph 3.2.
The mapping types currently used in HILT are those specified in the Mapping Vocabulary Specification: exactMatch, narrowMatch, broadMatch, minorMatch, majorMatch. However, it is felt that in the HILT context these mapping types may be inadequate for such things as:
• The ranking of large result sets according to the degree of concordance with users' preferred terminology. For example, a user query, 'tooth', is submitted. A large result set is returned comprising hundreds of exactMatch resources. Among these, resources pertaining to 'teeth' are found and, under SKOS MVS definitions, are considered equivalent to those on 'tooth'. However, in such a large result set it would be useful to rank results more meaningfully for the user. In this example, resources indexed 'tooth' would feature slightly higher in the result set than those indexed 'teeth', since the term exactly matches the original query and is, ultimately, most relevant to the user. 'Teeth' might be considered a plural match and therefore feature lower in the rankings. Match types exemplifying greater specificity would be useful to aid such ranking.
• Providing users with details of the precise nature of the relationship(s) between their entered query and their retrieved result set (which will invariably include mapped terms from other terminologies, or comprise resources retrieved using terms derived from mapped terminologies).
• Informing users (via client services) of why particular terms have been retrieved in response to a query. Reconciling resources by concept is clearly necessary on the Semantic Web, but match types are also important here (in HILT and probably in other potential services). For example, a user who is searching for resources on lung disease and submits the query 'lung disease' may retrieve resources indexed under 'pneumoconiosis'. These resources are indeed relevant, but a user who is uninformed about the way in which indexes are mapped and matched may doubt that pneumoconiosis is (under the SKOS MVS) an exactMatch when browsing their results. Going to some lengths to inform users is envisaged as necessary in order to facilitate the re-formulation of subsequent queries. Providing users with such mappings could also be used to generate (potentially) improved relevance feedback.
• Imparting sufficient information during subject hierarchy browsing to enable users to make informed decisions about the relevance of mapped terms. (This is similar and related to the issues raised in the bullet point immediately above.)
• Helping identify mapping regularities between specific terminologies, thus facilitating the research and development of improved automated routines to assist in large-scale terminology mapping. (This is not directly relevant to HILT at present, but identifying subtle patterns in mapping relationships between terminologies could assist HILT at a later date if large-scale, machine-assisted terminology mapping were undertaken.)
In addition to the above, it is also worth noting that HILT offers the ability for clients to create a 'disambiguation' stage during user searching. This process of disambiguation not only resolves the existence of homographs (as the term 'disambiguation' may suggest), but encompasses a variety of processes allowing users to qualify their search requirements. To this end HILT has been exploring the use of a second set of match types to be used by clients instead of - or in tandem with - the SKOS MVS match types. The match types considered include some of those proposed by Chaplan (1995). Chaplan's match types are in many respects a departure from the conceptual approach taken in the SKOS MVS and the general proclivity on the Semantic Web for representing concepts. The focus of Chaplan's match types is more on differences in the way in which a term is represented (e.g. singular/plural match, spelling variation, word order variation, etc.) than on reconciling concepts. However, we hypothesise that a set of more detailed match types will be needed to assist users during the aforementioned disambiguation stage and to address the other issues noted above. We also consider both approaches to be complementary, with the conceptual nature of the SKOS MVS providing a level of abstraction above - and preceding the use of - a lexically-based set of match types.
It is worth noting that this more detailed set of match types need not necessarily be lexically-based (although we currently consider such an approach useful) and could feasibly be an extension of the current MVS with the introduction of finer match types for such purposes. In particular, it is thought that the majorMatch and minorMatch types would benefit from further definition and perhaps the introduction of further gradations. For example, there is currently no indication whether a majorMatch between two concepts is 'weak' or 'strong' (i.e. no quantification of the level of match beyond the 50% currently specified as a guideline).
Recent informal discussions between Alistair Miles and members of the HILT team indicated that we might be in a position to inform the W3C Semantic Web Deployment Working Group before the end of 2006 on our match type work in this area. However, due to work priorities this has unfortunately not been possible. We nevertheless intend to continue this line of research as soon as possible and test the utility of both the SKOS MVS and Chaplan-based match types in a controlled user study.
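The idea of layering a lexical set of match types beneath the MVS can be sketched as follows. This Python example assumes that all candidate terms have already been judged exactMatch at the conceptual level and then ranks them by a lexical match type. The category names, their ordering, the naive plural handling and the small irregular-plural table are assumptions for illustration only; they are neither Chaplan's full scheme nor anything implemented in HILT.

```python
# Tiny illustrative table of irregular plurals (singular -> plural).
IRREGULAR_PLURALS = {"tooth": "teeth"}

def singular(word):
    """Very naive singularisation, sufficient only for this example."""
    w = word.lower()
    for single, plural in IRREGULAR_PLURALS.items():
        if w == plural:
            return single
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        return w[:-1]
    return w

def lexical_match(query, term):
    """Classify how a term relates lexically to the user's query."""
    q, t = query.lower().split(), term.lower().split()
    if q == t:
        return "exact"
    if [singular(w) for w in q] == [singular(w) for w in t]:
        return "singular/plural"
    if sorted(q) == sorted(t):
        return "word order"
    return "other"

# Assumed ranking: closer lexical forms come first.
RANK = {"exact": 0, "singular/plural": 1, "word order": 2, "other": 3}

def rank_terms(query, terms):
    """Order conceptually equivalent terms by lexical closeness to the query."""
    return sorted(terms, key=lambda term: RANK[lexical_match(query, term)])
```

For the 'tooth' query discussed above, this ordering places resources indexed 'tooth' ahead of those indexed 'teeth', while the lexical label itself could be shown to the user to explain why a mapped term was retrieved.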
3.4. Any additional information, references and/or hyperlinks.
Chaplan, M. A. (1995). Mapping Laborline thesaurus terms to Library of Congress subject headings: implications for vocabulary switching, Library Quarterly, 65(1), 39-61.
Nicholson, D., Dawson, A. & Shiri, A. (2006). HILT: A pilot terminology mapping service with a DDC spine, Cataloging & Classification Quarterly 42(3/4), 187-200.