Extended Metadata Registry (XMDR) Prototype
Contact e-mail:
John McCarthy <JLMcCarthy AT lbl DOT gov>
Bruce Bargmeyer <BEBargmeyer AT lbl DOT gov>
Application
General purpose and services to the end user
This is a prototype implementation of the metadata design specifications proposed for edition 3 of ISO/IEC 11179 part 3.
A description of use cases is available at http://hpcrd.lbl.gov/SDM/XMDR/use-cases.html
Functionality examples
- Register metadata, including concept systems such as terminologies and ontologies, as well as data elements and value domains (code sets).
- Register and manage any semantic information that is useful in data management, data administration, data analysis, and linkage of concepts to data.
- Provide semantics services for semantic computing, such as the Semantic Web, semantics service oriented architectures, and semantic grids.
- Interrelate concept systems with other concept systems.
- Interrelate concept systems with data held in databases, terminologies, and metadata deriving from natural language text understanding systems.
- Enable use of new services for semantic computing: Semantics Service Oriented Architecture, Semantic Grids, semantics-based workflows, Semantic Web, etc.
- Capture semantics with more formal techniques (in addition to natural language): First Order Logic, Description Logic, Common Logic, OWL.
- Encourage and enable the sharing of concept systems and traditional metadata through means that reduce the cost of accessing, obtaining, and interacting with the broadest range of content.
- Provide the semantic services needed to support semantic computing, such as dereferencing the URIs used in RDF statements by returning relevant information that describes the referenced concept and its authoritative standing within some community of interest.
Application architecture
The application relies on a REST (Representational State Transfer) architecture. Its components are a metamodel (in OWL) and data format (XML), a RegistryStore (persistence and versioning), metadata content validation, indexing (text, asserted, and logical inference), mapping, authentication, and a (human) user interface. More details are available at http://hpcrd.lbl.gov/SDM/XMDR/arch.html
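In a REST architecture, a client retrieves a registry item by issuing an HTTP GET on that item's URI and stating the representation it wants. The following is a minimal sketch of that interaction, not the XMDR API: the base URL, the "items/" path layout, and the item identifier are all hypothetical, and the request is only built, not sent.

```python
from urllib.parse import quote, urljoin
from urllib.request import Request

# Hypothetical base URL and path layout; the real XMDR URI scheme may differ.
BASE_URL = "http://example.org/xmdr/"


def build_item_request(item_id: str, media_type: str = "application/xml") -> Request:
    """Build (but do not send) a GET request for one registry item."""
    # Percent-encode the identifier so it is safe as a single path segment.
    url = urljoin(BASE_URL, "items/" + quote(item_id, safe=""))
    return Request(url, headers={"Accept": media_type}, method="GET")


# "NCI Thesaurus/C12345" is an invented identifier used only for illustration.
request = build_item_request("NCI Thesaurus/C12345")
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would perform the actual retrieval; the `Accept` header is where a client could ask for XML rather than another representation.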
The application is not distributed at present. In the future, both the content data and the extended metadata registry software might be.
Special strategies involved in the processing of user actions
Users may choose to expand queries to include inferred as well as asserted information. Users may draw inferences based on the XMDR metamodel (ISO/IEC 11179) as well as on the specific content and relationships of individual sets of metadata.
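The difference between an asserted-only query and one expanded with inferred information can be sketched as follows. This is an illustration under stated assumptions, not the XMDR implementation: the concept names are invented, and the "broader" relationship is assumed to be transitive so that indirect narrower concepts can be inferred.

```python
from collections import defaultdict

# Hypothetical asserted "broader" links: child concept -> parent concepts.
ASSERTED_BROADER = {
    "lake": {"surface water"},
    "river": {"surface water"},
    "surface water": {"water"},
}


def narrower_closure(concept, asserted_broader):
    """Return every concept asserted or inferred to be narrower than `concept`."""
    children = defaultdict(set)
    for child, parents in asserted_broader.items():
        for parent in parents:
            children[parent].add(child)
    found, stack = set(), [concept]
    while stack:
        # Follow asserted links repeatedly; this assumes "broader" is transitive.
        for child in children[stack.pop()]:
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def expand_query(concept, asserted_broader, include_inferred=True):
    """Return the set of concepts a search for `concept` should match."""
    direct = {c for c, parents in asserted_broader.items() if concept in parents}
    if include_inferred:
        return {concept} | narrower_closure(concept, asserted_broader)
    return {concept} | direct


asserted_only = expand_query("water", ASSERTED_BROADER, include_inferred=False)
with_inference = expand_query("water", ASSERTED_BROADER)
```

With these sample links, the asserted-only query for "water" reaches only "surface water", while the expanded query also reaches "lake" and "river".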
Integration between vocabulary-linked functions and other application functions
One of the main purposes of the XMDR Prototype is to demonstrate how concept systems can be used to help integrate, search, and harmonize more traditional metadata registry information about data elements, valid value sets, etc. For both humans and computers, the key functionality is the linkage of concept systems and data. The system combines text search with inference (semantic) search.
Additional references
Project website: http://xmdr.org/
Vocabulary
Titles
XMDR has loaded a number of different concept systems in order to demonstrate different kinds of capabilities, particularly for large, complex concept systems. For the list of the current concept systems included and proposed for the XMDR Prototype, and a summary of their respective characteristics, see the table at http://hpcrd.lbl.gov/SDM/XMDR/contentlist.html
General characteristics (size, coverage) of the vocabulary
A portion of http://hpcrd.lbl.gov/SDM/XMDR/contentlist.html is included below:
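DTIC Thesaurus (Defense Technology Info. Center Thesaurus)
- XMDR Contact: Gail Hodge (IIA for USGS)
- Graph Structure: Directed graph (tree + related terms)
- Priority: 1
- Licensing Issues: No
- Status Survey Form: (no entry)
- Status LexGrid Loading: Yes
- Status XMDR Loading: Yes
- References and Comments: This is an outdated version.

NCI Thesaurus (National Cancer Institute Thesaurus)
- XMDR Contact: Sherri De Coronado (NCI)
- Graph Structure: Directed graph (tree + related terms)
- Priority: 1
- Licensing Issues: No
- Status Survey Form: (no entry)
- Status LexGrid Loading: Yes
- Status XMDR Loading: Yes
- References and Comments: (none)

NCI caDSR (National Cancer Institute Data Standards Repository)
- XMDR Contact: Sherri De Coronado (NCI)
- Graph Structure: Directed graph
- Priority: 1
- Licensing Issues: No
- Status Survey Form: (no entry)
- Status LexGrid Loading: NA
- Status XMDR Loading: In progress
- References and Comments: (none)

ISO 3166 Country Codes
- XMDR Contact: Frank Olken
- Graph Structure: List
- Priority: 1
- Licensing Issues: No
- Status Survey Form: (no entry)
- Status LexGrid Loading: Yes, may need a language reload
- Status XMDR Loading: Yes
- References and Comments: ISO 3166 Country Codes Download Page (English and French) or extract from EPA EDR

GEMET (GEneral Multilingual Environmental Thesaurus)
- XMDR Contact: Gail Hodge (IIA for USGS) and Linda Spencer (EPA)
- Graph Structure: Directed graph (trees + related terms)
- Priority: 1
- Licensing Issues: No
- Status Survey Form: missing
- Status LexGrid Loading: Yes
- Status XMDR Loading: Yes
- References and Comments: Bruce Bargmeyer has a new set of GEMET files from 2006/04. Nothing has been done with the new GEMET files. Multilingual.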
Structure explanation
XMDR is intended to input concept systems in their entirety from any format.
Wherever possible, XMDR uses LexGrid as an intermediate step in loading content. As SKOS gains wider acceptance and tool support, XMDR will consider using SKOS for many of the purposes for which it currently uses LexGrid.
Should SKOS be able to incorporate many of the current features of LexGrid, XMDR could easily use concept systems that use SKOS. LexGrid and XMDR may prove to be useful tools for working with content expressed in SKOS. In the meantime, it might be very useful to have software that could translate from SKOS to LexGrid and vice-versa.
Machine-readable representation of the vocabulary
Substantial content is available in two prototype implementation instances on the XMDR web site at http://xmdr.org/.
Structure of the database used to currently manage the vocabulary
Concept systems are translated into XML files that conform to the XMDR metamodel, as described at https://xmdr.lbl.gov/mediawiki/index.php/11179_Diagrams
Standards and guidelines considered during the design and construction of the vocabulary
XMDR is trying to coordinate its work with the development of ISO/IEC 11179 edition 3 and with other standards efforts, particularly ISO TC37 and the W3C Semantic Web Working Groups (XML, RDF, OWL and SKOS). Other related standards include ISO 639, 704, 3166, 11179 and 12620, and UML (Unified Modeling Language).
XMDR also has used the LexGrid specification (http://LexGrid.org/) because it bridges the SKOS/OWL boundary.
Management of changes
Changes to vocabularies are the responsibility of the different organizations from which XMDR obtains them. How to keep the experimental XMDR Prototype updated with respect to changing external sources is an active research and development topic.
Vocabulary mappings
Although the XMDR Prototype does not yet include facilities for mapping between different concept systems, that is one of XMDR's important goals.
Mapped vocabularies
This part of XMDR's research and development efforts has just begun, starting with mappings between the old Standard Industrial Classification (SIC) codes and their successor, the North American Industry Classification System (NAICS) codes.
Extracts of Mappings
See for example the mappings provided with NAICS 2002, http://www.census.gov/epcd/naics02/N02TOS87.HTM, where a one-to-many matching indicates ambiguity. Translation tables are sometimes qualified with a confidence and/or completeness scale or measure, which is necessarily direction-dependent.
An example (in LexGrid format):
<lgRel:association association="mapsTo" forwardName="mapsTo" reverseName="mappedFrom" targetCodingScheme="NAICS">
  <lgRel:sourceConcept sourceConcept="10">
    <lgRel:targetConcept targetConcept="21">
      <lgRel:associationQualification associationQualifier="approx" />
    </lgRel:targetConcept>
  </lgRel:sourceConcept>
</lgRel:association>
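To show how such an association could be consumed by a program, the sketch below reads the example with Python's standard XML parser and flattens it into one row per source/target pair. The excerpt does not declare a namespace for the lgRel prefix, so the sketch adds a placeholder namespace URI; that URI is an assumption made only so the fragment parses, not the real LexGrid namespace.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI: the excerpt does not declare one for "lgRel".
LGREL_NS = "urn:example:lgRel"

FRAGMENT = f"""
<lgRel:association xmlns:lgRel="{LGREL_NS}" association="mapsTo"
                   forwardName="mapsTo" reverseName="mappedFrom"
                   targetCodingScheme="NAICS">
  <lgRel:sourceConcept sourceConcept="10">
    <lgRel:targetConcept targetConcept="21">
      <lgRel:associationQualification associationQualifier="approx" />
    </lgRel:targetConcept>
  </lgRel:sourceConcept>
</lgRel:association>
""".strip()


def read_mappings(xml_text):
    """Return one dict per (source concept, target concept) pair."""
    ns = {"lgRel": LGREL_NS}
    association = ET.fromstring(xml_text)
    rows = []
    for source in association.findall("lgRel:sourceConcept", ns):
        for target in source.findall("lgRel:targetConcept", ns):
            qualifiers = [
                q.get("associationQualifier")
                for q in target.findall("lgRel:associationQualification", ns)
            ]
            rows.append({
                "association": association.get("association"),
                "scheme": association.get("targetCodingScheme"),
                "source": source.get("sourceConcept"),
                "target": target.get("targetConcept"),
                "qualifiers": qualifiers,
            })
    return rows


mappings = read_mappings(FRAGMENT)
```

For the example above this yields a single row: source "10" maps to target "21" in the NAICS scheme, qualified as "approx".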
Types of mapping used
The Census Bureau mapping provides three levels of statistical comparability:
- Comparable: NAICS derivable from SIC data.
- Almost comparable: sales or receipts from SIC are within 3% of NAICS sales or receipts.
- Not comparable: NAICS sales or receipts cannot be estimated within 3% from SIC data.
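The three levels can be read as a simple decision rule. The sketch below is one possible reading, not the Census Bureau's procedure: it assumes that "derivable" is known in advance and passed in as a flag, and that the 3% tolerance is measured relative to the NAICS figure. The function name and parameters are hypothetical.

```python
def comparability(naics_value: float, sic_estimate: float,
                  derivable: bool = False, tolerance: float = 0.03) -> str:
    """Classify a NAICS figure against its SIC-based estimate.

    Assumptions: `derivable` is supplied by the caller, and the tolerance
    is a fraction of the NAICS value.
    """
    if derivable:
        return "comparable"
    if naics_value <= 0:
        raise ValueError("naics_value must be positive")
    relative_gap = abs(sic_estimate - naics_value) / naics_value
    if relative_gap <= tolerance:
        return "almost comparable"
    return "not comparable"


levels = [
    comparability(100.0, 100.0, derivable=True),
    comparability(100.0, 102.0),
    comparability(100.0, 110.0),
]
```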
XMDR might expand the target mapping attribute to include “almost exact”, giving a tri-level mapping between NAICS and SIC. Mappings can be 1-to-1 (i.e. exact), 1-to-many, or many-to-many. Since exact mappings are straightforward, what XMDR should capture are the inexact mappings, which describe the dispersion of a single NAICS code into multiple SIC codes and vice versa. If a single SIC code maps wholly (with no leftover) to a list of NAICS codes, that is a 1-to-many mapping. However, if an SIC code maps to only part of particular NAICS codes, the mapping is many-to-many. One way to represent this would be to create a list of mapping targets for each source code and to identify, for each target, whether it is an exact or an approximate match.
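The representation suggested above, a list of targets per source code with an exact/approximate flag, could look like the following sketch. The codes are invented placeholders rather than real SIC or NAICS codes, and the cardinality rule is a simplification: any target shared with another source is treated as evidence of a many-to-many mapping.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class MappingTarget:
    code: str
    exact: bool  # True: exact match; False: approximate match


# Hypothetical SIC -> NAICS entries, for illustration only.
MAPPINGS = {
    "SIC-A": [MappingTarget("NAICS-1", exact=True)],
    "SIC-B": [MappingTarget("NAICS-2", exact=False),
              MappingTarget("NAICS-3", exact=False)],
    "SIC-C": [MappingTarget("NAICS-4", exact=False),
              MappingTarget("NAICS-5", exact=False)],
    "SIC-D": [MappingTarget("NAICS-5", exact=False)],
}


def cardinality(source, mappings):
    """Classify one source code's mapping as 1-to-1, 1-to-many, or many-to-many."""
    usage = Counter(t.code for targets in mappings.values() for t in targets)
    targets = mappings[source]
    # Simplification: a target used by several sources means the source
    # covers only part of that target, so the mapping is many-to-many.
    if any(usage[t.code] > 1 for t in targets):
        return "many-to-many"
    return "1-to-1" if len(targets) == 1 else "1-to-many"


def inexact_mappings(mappings):
    """Return only the approximate targets, keyed by source code."""
    return {
        source: [t.code for t in targets if not t.exact]
        for source, targets in mappings.items()
        if any(not t.exact for t in targets)
    }


kinds = {source: cardinality(source, MAPPINGS) for source in MAPPINGS}
```

Here SIC-A is 1-to-1, SIC-B is 1-to-many, and SIC-C and SIC-D are many-to-many because both touch NAICS-5.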
The mapping can be statistical (i.e. which portion of the economic activity for an SIC code is apportioned to one or more NAICS codes, or vice versa) or semantic (i.e. the meanings of the codes are identical or overlapping).
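A statistical mapping can be pictured as a table of shares that splits each SIC total across NAICS codes. The sketch below uses invented codes and invented shares purely to show the arithmetic; it assumes that the shares for each SIC code sum to 1.

```python
# Hypothetical shares: fraction of each SIC code's activity assigned to NAICS codes.
SHARES = {
    "SIC-X": {"NAICS-1": 0.75, "NAICS-2": 0.25},
    "SIC-Y": {"NAICS-2": 1.0},
}


def apportion(sic_totals, shares):
    """Redistribute totals reported by SIC code onto NAICS codes."""
    naics_totals = {}
    for sic_code, amount in sic_totals.items():
        weights = shares[sic_code]
        # Assumption: each SIC code is fully apportioned (weights sum to 1).
        if abs(sum(weights.values()) - 1.0) > 1e-9:
            raise ValueError(f"shares for {sic_code} do not sum to 1")
        for naics_code, weight in weights.items():
            naics_totals[naics_code] = naics_totals.get(naics_code, 0.0) + amount * weight
    return naics_totals


naics_totals = apportion({"SIC-X": 200.0, "SIC-Y": 40.0}, SHARES)
```

With these numbers, NAICS-1 receives 150.0 and NAICS-2 receives 90.0 (50.0 from SIC-X plus 40.0 from SIC-Y). The reverse direction would need its own share table, which is why such measures are direction-dependent.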
Different general approaches to mapping representation will be considered:
- Translation or Correspondence Tables (table of pairs between the classification scheme items, concepts, or values which have corresponding or overlapping meaning).
- DL-based Translation (using a description logic such as OWL to express mappings)
- FOL-based Translation (using First-order logic such as Simple Common Logic)
- Rule- or Query-based Translation (using relational database views, or SWRL for OWL/RDF).
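As an illustration of the last approach, a correspondence table combined with a relational view lets data coded in one scheme be queried in the other. This is a minimal sketch using SQLite from Python's standard library; the table names, view name, codes, and establishment names are all hypothetical.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a correspondence table and a view."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE correspondence (sic_code TEXT, naics_code TEXT, match_type TEXT);
        CREATE TABLE establishments (name TEXT, sic_code TEXT);
        -- The view translates SIC-coded rows into NAICS-coded rows.
        CREATE VIEW establishments_naics AS
            SELECT e.name, c.naics_code, c.match_type
            FROM establishments AS e
            JOIN correspondence AS c ON c.sic_code = e.sic_code;
    """)
    conn.executemany(
        "INSERT INTO correspondence VALUES (?, ?, ?)",
        [("SIC-A", "NAICS-1", "exact"),
         ("SIC-B", "NAICS-2", "approx"),
         ("SIC-B", "NAICS-3", "approx")],
    )
    conn.executemany(
        "INSERT INTO establishments VALUES (?, ?)",
        [("Alpha", "SIC-A"), ("Beta", "SIC-B")],
    )
    return conn


conn = build_demo_db()
rows = conn.execute(
    "SELECT name, naics_code, match_type FROM establishments_naics "
    "ORDER BY name, naics_code"
).fetchall()
```

A row coded with a 1-to-many SIC code appears once per NAICS target in the view, with the match type carried along so that a consumer can tell exact translations from approximate ones.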