Extended Metadata Registry (XMDR) Prototype

Contact e-mail:

John McCarthy <JLMcCarthy AT lbl DOT gov>
Bruce Bargmeyer <BEBargmeyer AT lbl DOT gov>

Application

General purpose and services to the end user

This is a prototype implementation of the metadata design specifications proposed for edition 3 of ISO/IEC 11179 part 3.

A description of use cases is available at http://hpcrd.lbl.gov/SDM/XMDR/use-cases.html

Functionality examples

Register metadata, including concept systems such as terminologies and ontologies, as well as data elements and value domains (codesets).
Register and manage any semantic information that is useful in data management, data administration, data analysis, and linkage of concepts to data.
Provide semantics services for semantic computing, such as the Semantic Web, semantics service oriented architectures, and semantic grids.
Interrelate concept systems with other concept systems.
Interrelate concept systems with data held in databases, terminologies, and metadata derived from natural language text understanding systems.
Enable use of new services for semantic computing: Semantics Service Oriented Architecture, Semantic Grids, semantics-based workflows, Semantic Web, and so on.
Capture semantics with more formal techniques (in addition to natural language): First Order Logic, Description Logic, Common Logic, OWL.
Encourage and enable the sharing of concept systems and traditional metadata through means that reduce the cost of accessing, obtaining, and interacting with the broadest range of content.
Provide semantic services needed to support semantic computing, such as dereferencing the URIs used in creating RDF statements, by providing relevant information describing the referenced concept and its authoritative standing within some community of interest.

Application architecture

The application relies on a REST (Representational State Transfer) architecture with the following components: a metamodel (in OWL) and data format (XML), RegistryStore (persistence and versioning), metadata content validation, indexing (text, asserted, and logical inference), mapping, authentication, and a (human) user interface. More details are available at http://hpcrd.lbl.gov/SDM/XMDR/arch.html

The application is not distributed at present, but in the future both the content data and the extended metadata registry software might be.

Special strategies involved in the processing of user actions

Users may choose to expand queries to include inferred as well as asserted information. Users may draw inferences based on the XMDR metamodel (ISO/IEC 11179) as well as on the specific content and relationships of individual sets of metadata.
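The difference between an asserted-only query and one expanded with inferred information can be sketched as follows. This is a minimal illustration, not the XMDR implementation: the concept names, the "broader" relation, and the function narrower_concepts are all hypothetical, and the only inference shown is following asserted links transitively.

```python
from collections import defaultdict

# Hypothetical asserted "narrower -> broader" links between concepts.
ASSERTED_BROADER = [
    ("beagle", "dog"),
    ("dog", "mammal"),
    ("mammal", "animal"),
]


def narrower_concepts(concept, asserted, include_inferred=False):
    """Return the concepts narrower than `concept`.

    With include_inferred=False only directly asserted links are used.
    With include_inferred=True the asserted links are followed
    transitively, so indirectly narrower concepts are returned too.
    """
    children = defaultdict(set)
    for narrower, broader in asserted:
        children[broader].add(narrower)

    result = set(children[concept])
    if not include_inferred:
        return result

    # Depth-first walk over the asserted links to collect the closure.
    stack = list(result)
    while stack:
        current = stack.pop()
        for child in children[current]:
            if child not in result:
                result.add(child)
                stack.append(child)
    return result
```

Calling narrower_concepts("mammal", ASSERTED_BROADER) uses only the asserted link from "dog", while passing include_inferred=True also reaches "beagle" through it.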

Integration between vocabulary-linked functions and other application functions

One of the main purposes of the XMDR Prototype is to demonstrate how concept systems can be used to help integrate, search, and harmonize more traditional metadata registry information about data elements, valid value sets, etc. For both human and machine users, the key functionality is the linkage of concept systems and data. The system combines text search with inference (semantic) search.

Additional references

Project website: http://xmdr.org/

Vocabulary

Titles

XMDR has loaded a number of different concept systems in order to demonstrate different kinds of capabilities, particularly for large, complex concept systems. For the list of the current concept systems included and proposed for the XMDR Prototype, and a summary of their respective characteristics, see the table at http://hpcrd.lbl.gov/SDM/XMDR/contentlist.html

General characteristics (size, coverage) of the vocabulary

A portion of http://hpcrd.lbl.gov/SDM/XMDR/contentlist.html is included below:

Dataset Name	XMDR Contact	Graph Structure	Priority	Licensing Issues	Status Survey Form	Status LexGrid Loading	Status XMDR Loading	References and Comments
DTIC Thesaurus (Defense Technical Information Center Thesaurus)	Gail Hodge (IIA for USGS)	Directed graph (tree + related terms)	1	No	DTIC Thesaurus survey form	Yes	Yes	This is an outdated version.
NCI Thesaurus (National Cancer Institute Thesaurus)	Sherri De Coronado (NCI)	Directed graph (tree + related terms)	1	No	NCI Thesaurus survey form	Yes	Yes	.
NCI caDSR (National Cancer Institute Data Standards Repository)	Sherri De Coronado (NCI)	Directed graph	1	No	survey form	NA	In progress	.
ISO 3166 Country Codes	Frank Olken	List	1	No	survey form	Yes, may need a language reload	Yes	ISO 3166 Country Codes Download Page (English and French) or extract from EPA EDR
GEMET (GEneral Multilingual Environmental Thesaurus)	Gail Hodge (IIA for USGS) and Linda Spencer (EPA)	Directed graph (trees + related terms)	1	No	missing	Yes	Yes	Bruce Bargmeyer has a new set of GEMET files from 2006/04. Nothing has been done with the new GEMET files. Multilingual.

Structure explanation

XMDR is intended to load concept systems in their entirety, whatever their source format.

Wherever possible, XMDR uses LexGrid as an intermediate step in loading content. As SKOS gains wider acceptance and software tools to work with it, using SKOS for many of the same purposes for which XMDR is currently using LexGrid will be considered.

Should SKOS be able to incorporate many of the current features of LexGrid, XMDR could easily use concept systems that use SKOS. LexGrid and XMDR may prove to be useful tools for working with content expressed in SKOS. In the meantime, it might be very useful to have software that could translate from SKOS to LexGrid and vice-versa.

Machine-readable representation of the vocabulary

Substantial content is available in two prototype implementation instances on the XMDR web site at http://xmdr.org/.

Structure of the database used to currently manage the vocabulary

Concept systems are translated into XML files that conform to the XMDR metamodel, as described at https://xmdr.lbl.gov/mediawiki/index.php/11179_Diagrams

Standards and guidelines considered during the design and construction of the vocabulary

XMDR is trying to coordinate its work with the development of ISO/IEC 11179 edition 3 and with other standards efforts, particularly ISO TC37 and the W3C Semantic Web Working Groups (XML, RDF, OWL and SKOS). Other related standards include ISO 639, 704, 3166, 11179, and 12620, as well as UML (Unified Modeling Language).

XMDR also has used the LexGrid specification (http://LexGrid.org/) because it bridges the SKOS/OWL boundary.

Management of changes

Changes to vocabularies are the responsibility of the different organizations from which XMDR obtains them. How to keep the experimental XMDR Prototype updated with respect to changing external sources is an active research and development topic.

Vocabulary mappings

Although the XMDR Prototype does not yet include facilities for mapping between different concept systems, that is one of XMDR's important goals.

Mapped vocabularies

This part of XMDR's research and development efforts has just begun, starting with mappings between the old Standard Industrial Classification (SIC) codes and their successor, the North American Industry Classification System (NAICS) codes.

Extracts of Mappings

See for example the mappings provided with NAICS 2002, http://www.census.gov/epcd/naics02/N02TOS87.HTM, where a one-to-many matching indicates ambiguity. Translation tables are sometimes qualified with a confidence and/or completeness scale or measure, which is necessarily direction-dependent.

An example (in LexGrid format):

<lgRel:association association="mapsTo" forwardName="mapsTo" reverseName="mappedFrom" targetCodingScheme="NAICS">
        <lgRel:sourceConcept sourceConcept="10">
                <lgRel:targetConcept targetConcept="21">
                        <lgRel:associationQualification associationQualifier="approx" /> 
                </lgRel:targetConcept>
        </lgRel:sourceConcept>
</lgRel:association>
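As a rough sketch of how such a fragment could be read programmatically, the following parses the association above and lists each source/target pair with its qualifiers. The namespace URI is a placeholder assumption added so that the lgRel prefix resolves (the fragment above does not declare one), and extract_mappings is a hypothetical helper, not part of LexGrid or XMDR.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI (an assumption); the real LexGrid URI may differ.
LG_REL = "http://example.org/lgRel"

SAMPLE = f"""
<lgRel:association xmlns:lgRel="{LG_REL}" association="mapsTo"
    forwardName="mapsTo" reverseName="mappedFrom" targetCodingScheme="NAICS">
  <lgRel:sourceConcept sourceConcept="10">
    <lgRel:targetConcept targetConcept="21">
      <lgRel:associationQualification associationQualifier="approx" />
    </lgRel:targetConcept>
  </lgRel:sourceConcept>
</lgRel:association>
""".strip()


def extract_mappings(xml_text):
    """Return (source, target, target_scheme, qualifiers) tuples."""
    ns = {"lgRel": LG_REL}
    root = ET.fromstring(xml_text)
    scheme = root.get("targetCodingScheme")
    rows = []
    for source in root.findall("lgRel:sourceConcept", ns):
        for target in source.findall("lgRel:targetConcept", ns):
            qualifiers = [
                q.get("associationQualifier")
                for q in target.findall("lgRel:associationQualification", ns)
            ]
            rows.append(
                (source.get("sourceConcept"), target.get("targetConcept"),
                 scheme, qualifiers)
            )
    return rows
```

For the sample above, extract_mappings(SAMPLE) yields a single row linking source concept 10 to NAICS target concept 21 with the qualifier "approx".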

Types of mapping used

The Census Bureau mapping provides three levels of statistical comparability:

Comparable	NAICS derivable from SIC data
Almost comparable	Sales or receipts from SIC are within 3% of NAICS sales or receipts
Not comparable	NAICS sales or receipts cannot be estimated within 3% from SIC data.
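The three levels can be expressed as a small classification function. This is only a sketch under stated assumptions: the function name and parameters are hypothetical, the 3% tolerance is taken relative to the NAICS figure, and whether NAICS is derivable from SIC data is passed in as a flag because the text does not say how that is determined.

```python
def comparability(naics_value, sic_estimate, derivable=False, tolerance=0.03):
    """Classify a NAICS figure against an estimate built from SIC data.

    derivable: True when the NAICS figure can be derived from SIC data
               (assumed to be known in advance).
    tolerance: relative gap allowed for "Almost comparable" (3% by default),
               measured against the NAICS figure (an assumption).
    """
    if derivable:
        return "Comparable"
    if naics_value == 0:
        raise ValueError("relative gap is undefined for a zero NAICS value")
    relative_gap = abs(sic_estimate - naics_value) / abs(naics_value)
    if relative_gap <= tolerance:
        return "Almost comparable"
    return "Not comparable"
```

For example, an SIC-based estimate of 98.0 against a NAICS figure of 100.0 differs by 2% and is classified as "Almost comparable", while an estimate of 90.0 is "Not comparable".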

XMDR might expand the target mapping attribute to include “almost exact”, giving a tri-level mapping between NAICS and SIC. The mappings can be 1-to-1 (i.e. exact), 1-to-many, or many-to-many. Since exact mappings are straightforward, what XMDR should capture is the inexact mappings, describing the dispersion of a single NAICS code into multiple SIC codes and vice versa. If a single SIC code maps wholly (with no leftover) to a list of NAICS codes, that is a 1-to-many mapping. However, if an SIC code maps to only part of particular NAICS codes, the mapping is many-to-many. One way to represent this would be to create a list of mapping targets for each source code and identify, for each target, whether it is an exact match or an approximate match.
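The cardinality distinction described above can be sketched by checking, for each source code, how many targets it has and whether any of those targets also receives another source code. The codes and the function classify_mappings are hypothetical. The "many-to-1" label is an addition made here for the case of a single shared target, which the text does not name.

```python
from collections import defaultdict


def classify_mappings(pairs):
    """Label each source code by the cardinality of its mapping.

    pairs: iterable of (source_code, target_code) tuples.
    A target that also receives other source codes is treated as only
    partly covered by any one of them.
    """
    targets_of = defaultdict(set)
    sources_of = defaultdict(set)
    for source, target in pairs:
        targets_of[source].add(target)
        sources_of[target].add(source)

    labels = {}
    for source, targets in targets_of.items():
        shared = any(len(sources_of[t]) > 1 for t in targets)
        if len(targets) == 1:
            labels[source] = "many-to-1" if shared else "1-to-1"
        else:
            labels[source] = "many-to-many" if shared else "1-to-many"
    return labels


# Placeholder codes, not real SIC or NAICS values.
EXAMPLE_PAIRS = [
    ("S1", "N1"),
    ("S2", "N2"), ("S2", "N3"),
    ("S3", "N4"), ("S3", "N5"),
    ("S4", "N5"),
]
```

Here S2 maps wholly onto N2 and N3, so it is 1-to-many, whereas S3 shares target N5 with S4 and is therefore many-to-many.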

The mapping can be statistical (i.e. which portion of the economic activity for an SIC code is apportioned to one or more NAICS codes, or vice versa) or semantic (i.e. the meanings of the codes are identical or overlapping).

Different general approaches to mapping representation will be considered:

Translation or Correspondence Tables (table of pairs between the classification scheme items, concepts, or values which have corresponding or overlapping meaning).
DL-based Translation (using a description logic such as OWL to express mappings)
FOL-based Translation (using First-order logic such as Simple Common Logic)
Rule- or Query-based Translation (using relational database views, or SWRL for OWL/RDF).
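The first approach in the list above, a translation or correspondence table, might be represented as a list of qualified pairs. The sketch below is illustrative only: the Correspondence type, the translate function, and the table rows are assumptions, and the codes are placeholders rather than verified SIC/NAICS pairs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Correspondence:
    source: str  # code in the source scheme
    target: str  # code in the target scheme
    match: str   # "exact" or "approx"


# Illustrative rows only; the codes are placeholders.
SIC_TO_NAICS = [
    Correspondence("10", "21", "approx"),
    Correspondence("10", "22", "approx"),
    Correspondence("20", "31", "exact"),
]


def translate(table, code, exact_only=False):
    """Return the target codes listed for `code`, in table order.

    With exact_only=True, approximate correspondences are skipped.
    """
    return [
        row.target
        for row in table
        if row.source == code and (not exact_only or row.match == "exact")
    ]
```

Because match qualifiers are direction-dependent, a table for the reverse direction (NAICS to SIC) would need its own qualifiers rather than simply swapping the columns of this one.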