Specification of Requirements/Metadata

From Ontology-Lexica Community Group

Summary on Requirements on the Metadata for the Lexicon Ontology Model (Synthesis by AS)

The lexicon-ontology model should provide metadata information to assess the level of expressivity of an RDF vocabulary/dataset with respect to a given language (or, even, linguistic resource, see below).

The purpose of this metadata is close in spirit to vocabularies such as VoID (http://www.w3.org/TR/void/) and VOAF (http://purl.org/vocommons/voaf), which provide general metadata about RDF datasets and vocabularies (RDFS vocabularies or OWL ontologies), respectively. Just as the scope of VoID and VOAF is to support linking between datasets and the discovery and reuse of vocabularies, the scope of the metadata vocabulary of the Lexicon Ontology Model is twofold:

  1. enabling immediate linking between datasets/vocabularies and linguistic resources by semantically specifying the type of dependency (e.g. a dataset D may import a linguistic resource LR via owl:imports, and this is specified in the imports declaration as well as in its VoID description, but it should also be possible to state that LR is used to decorate D for the purpose of linguistically enriching its content with respect to language L).
  2. supporting mediation/mapping activities (mapping/matching agents may thus understand the linguistic nature of a dataset and prepare the

A list of more specific requirements for this metadata vocabulary:

  1. Reporting quantitative and qualitative information on the level of expressivity of an RDF dataset with respect to a given language / linguistic resource
    • Allowing for referencing specific entities from "famous" linguistic resources (e.g. FrameNet frames or WordNet synsets), to enrich datasets and ontologies (this is actually a requirement on the "Modelling lexical resources", which can then be exploited here).
  2. Specify the resources used for enriching the content of RDF datasets, and clarify their role as such
  3. (not sure it fits, to be discussed) Expressing the reliability of links between ontological and linguistic objects (these could also be used for evaluation purposes).

This metadata vocabulary should be included as a module/extension of the core LOM vocabulary, and should rely at least on the parts of the vocabulary dedicated to: Modelling lexical resources and Express meaning with respect to ontology.

Premise

This set of requirements probably cannot be fully formalized until the requirements and core vocabulary are defined (or at least until their structure is reasonably clear) for the following areas:

  1. Modelling lexical resources
  2. Express meaning with respect to ontology
  3. Properties and Relations of Lexical Entries merged from "Relations between Lexical Entries" and "Lexical and linguistic properties of lexical entries"


Domain of the main Metadata properties

The following resources can be considered as candidates for the domain of the main metadata properties in the metadata vocabulary:

  • owl:Ontology
  • void:Dataset
  • voaf:Vocabulary

Note that:

voaf:Vocabulary rdfs:subClassOf void:Dataset .

Thus, if it extends void:Dataset, the LOM metadata vocabulary can be instantiated on a given ontology or on a dataset.
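
As a minimal sketch of the reasoning above, the following Python fragment checks whether a resource falls in the domain of a metadata property by following rdfs:subClassOf links. The hierarchy table and the function names (superclasses, in_domain) are assumptions made for illustration only; they are not part of VoID, VOAF, or the LOM vocabulary.

```python
# Hypothetical, hand-written fragment of the class hierarchy discussed above.
SUBCLASS_OF = {
    "voaf:Vocabulary": {"void:Dataset"},
}


def superclasses(cls, subclass_of=SUBCLASS_OF):
    """Return cls together with all of its direct and indirect superclasses."""
    seen = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(subclass_of.get(current, ()))
    return seen


def in_domain(resource_types, domain, subclass_of=SUBCLASS_OF):
    """True if any asserted type is the domain class or a subclass of it."""
    return any(domain in superclasses(t, subclass_of) for t in resource_types)


# A vocabulary qualifies for properties whose domain is void:Dataset.
print(in_domain({"voaf:Vocabulary"}, "void:Dataset"))
# A resource typed only as owl:Ontology does not, under this sketch.
print(in_domain({"owl:Ontology"}, "void:Dataset"))
```

Under these assumptions, an ontology is covered only when it is also typed as voaf:Vocabulary (or directly as void:Dataset).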

Discussion on the domain of Metadata properties

Note that there is an indirection between void:Dataset and the dataset it describes. The purpose of void:Dataset is to reify a given dataset and to provide metadata about it. The triple space of a VoID description is not the same as that of the dataset it describes. In fact, the base URI of a dataset is not the same as the URI used to describe it in VoID.

For instance, in http://www.w3.org/TR/void/#dataset, the example:

:DBpedia a void:Dataset .

states that the RDF resource :DBpedia is intended as a proxy for the well-known DBpedia dataset.

I asked one of the main authors of VOAF (Bernard Vatant) whether the same holds for voaf:Vocabulary and owl:Ontology, since to me it makes sense to attach statistical metadata about ontology vocabularies directly to owl:Ontology. Despite the subClassOf relationship between voaf:Vocabulary and void:Dataset, the best practice that the authors suggest for VOAF is to create resources which are instances of both voaf:Vocabulary and owl:Ontology, as in the following example, taken from http://www.w3.org/ns/adms:

<owl:Ontology rdf:about="&adms;">
    <rdfs:label xml:lang="en">Asset Description Metadata Schema (ADMS)</rdfs:label>
   ...
    <rdf:type rdf:resource="&voaf;Vocabulary"/>
    <voaf:specializes rdf:resource="&rad;"/>
    <vann:preferredNamespaceUri>&adms;</vann:preferredNamespaceUri>
     ...

There is also another option for the domain: the enrichment itself, between an ontology/dataset and a linguistic resource, treated as an entity. Furthermore, while reifying an enrichment with a URI is quite natural, this enrichment could even bear the same indirection, and thus we could have RDF documents on the Web describing how FOAF has been decorated with references to WordNet.

Subjects of the Metadata

There are mainly two types of resources addressed by the metadata:

  1. Generic Datasets/Vocabularies: for expressing "how the given dataset/vocabulary has been enriched with linguistic content". So, in this case, the domain of the metadata should be: void:Dataset
  2. Linguistic Resources: since we are also modelling lexical resources (see Modelling lexical resources), we may be interested in providing specific metadata for them.

Information of Interest

Given the two subjects of the metadata, there are thus two sets of metadata information which should be represented.

Metadata about Linguistic Asset of a generic RDF Dataset/Vocabulary

The following information should (at least) be provided by the metadata vocabulary, to support use cases such as SAOM, by enabling an initial "linguistic coordination" between agents willing to attempt on-the-fly alignments [1] of their heterogeneous ontologies (cross-language queries) when communicating:

  1. Language coverage
    • List of languages for which the ontology is being linguistically described (this enables immediate verification of linguistic compatibility between heterogeneous ontologies to be aligned)
    • For each language
      • the percentage of RDF resources, per type (classes, properties, concepts), described by at least one term
      • average number of terms per resource
  2. Information about use of a specific linguistic resource (see below)
    • this requires some agreement on the reuse of elements from popular linguistic resources and how to include them in the lexicon-ontology model (part of Modelling lexical resources)
    • An example: if WordNet synsets are reused as instances of Sense in our model, and linked to the ontology through properties defined in Express meaning with respect to ontology, we may then report, per that linguistic resource:
      • the percentage of RDF resources, per type (classes, properties, concepts), described by at least one synset from WordNet
  3. Information about linguistic resources
    • just as information from LRs should be properly described in the LOM model, an LR could be described as a whole through a proper subclass of void:Dataset
  4. Linguistic Model being adopted
    • LOM Metadata should account for the presence of linguistic information modeled even through non-LOM vocabularies. It should thus be possible to know whether linguistic information is available in the form of traditional rdfs:labels, SKOS labels, or SKOS-XL reified labels
  5. URILangs: Evocative names for URIs?
    • Are the local names of the URIs of the resource expressed in some natural language?
    • Not sure... some regexp rewriting for cleaning local name expressions and making them more readable (i.e. "RedCar" --regexp--> "Red Car")
    • it should be stated whether these URILang expressions are already covered by labels expressed through LOM or any other modeling vocabulary (the other option is not to provide this URILang info at all)
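
The language coverage figures and the local name rewriting in the list above can be sketched as follows. This is only an illustration under stated assumptions: the function names (split_local_name, language_coverage), the input shapes, and the ex: identifiers are hypothetical, and the average is taken over all resources of a type, which is one possible reading of "average number of terms per resource".

```python
import re
from collections import defaultdict


def split_local_name(local_name):
    """Turn a CamelCase or snake_case local name into a readable phrase."""
    # Insert a space at every lowercase/digit -> uppercase boundary.
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", local_name)
    return spaced.replace("_", " ").strip()


def language_coverage(resources, labels, language):
    """Compute per-type coverage statistics for one language.

    resources: dict mapping resource id -> type ("class", "property", ...)
    labels: iterable of (resource id, term, language tag) tuples
    """
    # Collect the distinct terms attached to each known resource.
    terms = defaultdict(set)
    for resource, term, lang in labels:
        if lang == language and resource in resources:
            terms[resource].add(term)

    totals = defaultdict(int)
    described = defaultdict(int)
    term_counts = defaultdict(int)
    for resource, rtype in resources.items():
        totals[rtype] += 1
        n = len(terms.get(resource, ()))
        described[rtype] += 1 if n else 0
        term_counts[rtype] += n

    return {
        rtype: {
            "percent_described": 100.0 * described[rtype] / totals[rtype],
            "avg_terms_per_resource": term_counts[rtype] / totals[rtype],
        }
        for rtype in totals
    }


# Hypothetical toy data, not taken from any real dataset.
resources = {"ex:Car": "class", "ex:Person": "class", "ex:owns": "property"}
labels = [
    ("ex:Car", "car", "en"),
    ("ex:Car", "automobile", "en"),
    ("ex:Car", "automobile", "it"),
    ("ex:owns", "owns", "en"),
]
print(language_coverage(resources, labels, "en"))
print(split_local_name("RedCar"))  # prints "Red Car"
```

In a real setting the label tuples would be extracted from rdfs:label, SKOS, SKOS-XL, or LOM lexicalizations; that extraction step is deliberately left out here.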

Metadata about RDF Linguistic Resources

Lexical Resources modeled here need proper metadata so that they can be immediately analyzed by the agents that need them. This metadata will also provide useful information for yellow-pages services, which may thus index Lexical Resources and help agents access the right resource for their purpose. Note that here we do not address how certain information will be represented in Lexical Resources at content level (which is addressed elsewhere), but only which summary information is important to represent at the metadata level.

The following summary information may be of help when describing the overall characteristics of a Lexical Resource. Please be patient with the most probably improper terminology; I feel these elements are important, despite the improper names they may have here.

  1. SemanticallyIndexed: Is the resource content centered around entities representing agglomerates of meaning? E.g. WordNet synsets provide a conceptual backbone around which words are attached. I would call these general objects SemanticAnchors (partial/total overlap with what we called Senses?)
  2. SemanticallyIndexed-->Multilingual: this resource has several languages attached to the same "meaning constructs" (e.g. EuroWordNet, where the synsets are the same across different languages).
    • languages
  3. SemanticallyIndexed-->isTaxonomical: referred in particular to the presence or not of a hierarchy of SemanticAnchors in the resource
  4. Bilingual (maybe worth telling whether or not there is a symmetry between the two languages)
    • source Language
    • target Language
  5. hasGlosses

The expression A-->B is meant to represent that the characteristic B is intended to further describe a resource which exposes characteristic A. Actually, these characteristics could be represented as classes of the resource.
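
The A-->B notation can be read as a prerequisite relation between characteristics. The following sketch, which assumes a simple lookup table (REQUIRES) and a hypothetical helper (missing_prerequisites), flags a resource description that declares B without also declaring A. The characteristic names are the tentative ones from the list above.

```python
# Each entry "B": "A" encodes A-->B: B further describes a resource exposing A.
REQUIRES = {
    "Multilingual": "SemanticallyIndexed",
    "isTaxonomical": "SemanticallyIndexed",
}


def missing_prerequisites(characteristics, requires=REQUIRES):
    """Return sorted (B, A) pairs where B is declared without its prerequisite A."""
    declared = set(characteristics)
    return sorted(
        (b, a)
        for b, a in requires.items()
        if b in declared and a not in declared
    )


# A consistent description: no prerequisite is missing.
print(missing_prerequisites({"SemanticallyIndexed", "Multilingual", "hasGlosses"}))
# An inconsistent one: isTaxonomical is declared without SemanticallyIndexed.
print(missing_prerequisites({"isTaxonomical", "Bilingual"}))
```

If the characteristics were modeled as classes instead, the same constraint could be expressed by making Multilingual and isTaxonomical subclasses of SemanticallyIndexed.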

References

  1. FIPA (2001). FIPA Ontology Service Specification. Document XC00086D.