Specification of Requirements/Linked Data
This requirement concerns the use of the OntoLex model for rendering existing language resources on the web as linked data.
- 1 Lexical *Nets
- 2 Examples of Modelling in RDF
- 3 Examples of Modelling in RDF (Alternative approach)
- 4 Example: WordNet as lemon-SKOS
- 5 Standards
- 6 Naming Guidelines
Lexical *Nets
A major class of resources available on the web consists of networks of lexical concepts and related elements, i.e. wordnets, framenets, and the like. The OntoLex model must have a clear mapping for the element types included in such resources:
- Synsets (equivalence classes of senses, used in all wordnets)
- Senses that compose synsets
- Words, consisting of a lemma and ordered into a part-of-speech group (following Princeton WordNet, these are generally 'noun', 'verb', 'adjective', 'adjective satellite')
- Relations between synsets (e.g., hypernym, pertainym etc.)
- Frames (equivalence classes of situation/event types, e.g. related to verb valences)
- Relations between frames
Checks on whether the mappings are reasonably complete can be based on existing ontologies/vocabularies for *Nets: W3C WordNet-RDF, FrameNet-OWL, VerbNet-OWL, etc.
It would be a plus if the OntoLex model allowed the representation of mappings between existing lexical vocabularies. This requirement opens up two possible (and combinable) approaches/requirements:
- The model should have primitives that make it a lingua franca between different lexical vocabularies. This would require a preliminary analysis of the vocabularies of the most representative lexical resources, and the construction of a model that is able to express the merged semantics of all those vocabularies. Examples:
  - wordnetschema:Sense owl:equivalentClass ontolex:Sense
  - framenetschema:LexicalUnit owl:equivalentClass ontolex:Sense
  - wordnet:wordsense-connect-verb-1 rdf:type ontolex:Sense
  - framenetlu:connect.v rdf:type ontolex:Sense
- The model should provide a basic mapping vocabulary between classes/properties of lexical vocabularies, e.g. by reusing SKOS. Additional linguistic mapping relations can be introduced as subproperties of the SKOS ones. Examples:
  - wordnetschema:Sense skos:exactMatch framenetschema:LexicalUnit
  - wordnet:wordsense-connect-verb-1 skos:exactMatch framenetlu:connect.v
- Ontological mappings using OWL/RDFS axioms
- The model might also provide a more sophisticated mapping statement, in order to represent the confidence of a mapping. Possible design patterns here include at least:
  - Annotating the relation with a confidence value (via RDF reification or OWL axiom annotation). This makes it possible to reuse SKOS-based triples by annotating them.
  - Creating a Mapping class with appropriate relations to the mapped entities, the confidence and, possibly, provenance, the algorithm used (if any), etc. This has been formalized, e.g., in a vocabulary by François Scharffe and Jerome Euzenat for the AlignmentAPI in the NeOn project, which has since evolved into EDOAL, a mapping description language optimized for ontology matching data. Its edoal:Cell element nevertheless has the features needed by this requirement.
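The "Mapping class" design pattern can be illustrated with a small Python sketch that holds one mapping and serializes it to triples. This is only an illustration under stated assumptions: the `Mapping` record, the `ex:` property names (`ex:entity1`, `ex:entity2`, `ex:relation`, `ex:confidence`) and the confidence value 0.85 are hypothetical placeholders and are not taken from EDOAL, SKOS or OntoLex.

```python
from dataclasses import dataclass
from typing import List, Tuple

Triple = Tuple[str, str, str]


@dataclass(frozen=True)
class Mapping:
    """A first-class mapping node between two lexical entities (hypothetical)."""
    node: str          # identifier of the mapping node itself
    entity1: str
    entity2: str
    relation: str      # e.g. "skos:exactMatch"
    confidence: float  # expected to lie in [0, 1]

    def __post_init__(self) -> None:
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")

    def to_triples(self) -> List[Triple]:
        """Describe the mapping node; all ex: names are placeholders."""
        return [
            (self.node, "rdf:type", "ex:Mapping"),
            (self.node, "ex:entity1", self.entity1),
            (self.node, "ex:entity2", self.entity2),
            (self.node, "ex:relation", self.relation),
            (self.node, "ex:confidence", f'"{self.confidence}"^^xsd:float'),
        ]

    def to_plain_triple(self) -> Triple:
        """Drop the confidence and recover the plain SKOS-style statement."""
        return (self.entity1, self.relation, self.entity2)


mapping = Mapping(
    node="ex:mapping1",
    entity1="wordnet:wordsense-connect-verb-1",
    entity2="framenetlu:connect.v",
    relation="skos:exactMatch",
    confidence=0.85,  # invented value, for illustration only
)
for s, p, o in mapping.to_triples():
    print(s, p, o, ".")
```

Compared with annotating an existing triple, this pattern leaves room for further fields (provenance, algorithm) on the mapping node, while `to_plain_triple` shows that the simple SKOS statement can still be recovered.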
Order of senses
Some models (notably WordNet) order senses according to their prominence. We should capture this in the model:
Capturing by means of a datatype property, e.g.,
    <cat:v> a lemon:LexicalEntry ;
        lemon:sense <cat::2:29:0::>, <cat::2:35:0::> .

    <cat::2:29:0::> a lemon:LexicalSense ;
        wordnet:senseNumber "6"^^xsd:integer ;
        lemon:reference <VerbSynset76400> .
Other possible modelling strategies by means of rdf:List or rdf:Seq?
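To make the rdf:List alternative concrete, the following Python sketch derives the rdf:first/rdf:rest chain from per-sense numbers like the one in the datatype-property example above. It is a sketch under assumptions: the linking property `ex:orderedSenses`, the blank node labels and the sense number 7 for the second sense are hypothetical (only the value 6 appears in the example).

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def senses_as_rdf_list(entry: str, sense_numbers: Dict[str, int],
                       list_property: str = "ex:orderedSenses") -> List[Triple]:
    """Encode the senses of `entry`, ordered by sense number, as an rdf:List.

    `list_property` is a hypothetical property linking the entry to the
    head of the list.
    """
    ordered = sorted(sense_numbers, key=lambda sense: sense_numbers[sense])
    if not ordered:
        return [(entry, list_property, "rdf:nil")]
    triples: List[Triple] = [(entry, list_property, "_:b0")]
    for i, sense in enumerate(ordered):
        node = f"_:b{i}"
        rest = f"_:b{i + 1}" if i + 1 < len(ordered) else "rdf:nil"
        triples.append((node, "rdf:first", sense))
        triples.append((node, "rdf:rest", rest))
    return triples


numbers = {
    "<cat::2:35:0::>": 7,  # invented sense number, for illustration only
    "<cat::2:29:0::>": 6,
}
for s, p, o in senses_as_rdf_list("<cat:v>", numbers):
    print(s, p, o, ".")
```

The list encoding makes the order explicit in the graph structure but costs two triples per sense, whereas the datatype property costs one and is easier to query with a simple ORDER BY.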
Examples of Modelling in RDF
Synsets as ontology classes
    ubylemonwn:WN_LexicalEntry_397
        lemon:canonicalForm [ a lemon:Form ; lemon:writtenRep "cat"@en ] ;
        lemon:sense ubylemonwn:WN_Sense_573, ubylemonwn:WN_Sense_574,
            ubylemonwn:WN_Sense_575, ubylemonwn:WN_Sense_576,
            ubylemonwn:WN_Sense_577, ubylemonwn:WN_Sense_578,
            ubylemonwn:WN_Sense_579 .

    ubylemonwn:WN_Sense_574 a lemon:LexicalSense ;
        lemon:reference ubylemonwn:WN_Synset_11048 .

    ubylemonwn:WN_Synset_11048 a owl:Class ;
        rdfs:comment "feline mammal usually having thick soft fur and no ability to roar: domestic cats"@en .
Synsets as broader/equivalent senses
    ubylemonfn:FN_LexicalEntry_7176
        lemon:canonicalForm [ a lemon:Form ; lemon:writtenRep "catch"@en ] ;
        lemon:sense ubylemonfn:FN_Sense_9295 .

    ubylemonfn:FN_Sense_9295 a lemon:LexicalSense ;
        lemon:broader ubylemonfn:FN_SemanticPredicate_653 .
Synsets only using properties
    example:e1 lemon:canonicalForm [ lemon:writtenRep "small"@en ] ;
        lemon:sense example:s1 .
    example:e2 lemon:canonicalForm [ lemon:writtenRep "little"@en ] ;
        lemon:sense example:s2 .
    example:e3 lemon:canonicalForm [ lemon:writtenRep "tiny"@en ] ;
        lemon:sense example:s3 .

    example:s2 lexinfo:synonym example:s1 .
    example:s3 lexinfo:synonym example:s1 .
    example:s1 lexinfo:synonym example:s2 .
    example:s3 lexinfo:synonym example:s2 .
    example:s1 lexinfo:synonym example:s3 .
    example:s2 lexinfo:synonym example:s3 .
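The property-only approach needs one lexinfo:synonym triple for every ordered pair of distinct senses, so a synset with n senses costs n(n-1) statements (six for the three senses above). A minimal Python sketch of that expansion, where the helper name `synonym_triples` is a hypothetical illustration:

```python
from itertools import permutations
from typing import Iterable, List, Tuple

Triple = Tuple[str, str, str]


def synonym_triples(senses: Iterable[str],
                    prop: str = "lexinfo:synonym") -> List[Triple]:
    """Return one triple per ordered pair of distinct senses."""
    unique = list(dict.fromkeys(senses))  # drop duplicates, keep input order
    return [(a, prop, b) for a, b in permutations(unique, 2)]


triples = synonym_triples(["example:s1", "example:s2", "example:s3"])
for s, p, o in triples:
    print(s, p, o, ".")
print(len(triples), "triples")
```

The quadratic growth is the main cost of this modelling strategy: the seven senses of "cat" in the first example would already need 42 synonym triples if they all belonged to one synset.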
Examples of Modelling in RDF (Alternative approach)
LOM: Lexicon Ontology Model
My perspective is to offer a metamodel able to host specific resources, which may differ in the semantics and organization they expose. These resources can thus be identified from the commonalities in their structure, while their content can be traced back to a basic set of elements characterizing the LOM. This is much in the spirit of the plain old OKBC (http://www.ai.sri.com/~okbc/), that is: don't deal with the details of each specific model (and we know that linguists may have very different theories for representing even very similar resources!) and offer something which can be shared. Where OKBC offered a common ground of programming interfaces for accessing ontologies, LOM should provide a core set of classes and properties, not overloaded with excessive semantics, but usable as recognizable hooks for linking ontologies to elements of linguistic resources.
During the call held on 1st of March (http://www.w3.org/community/ontolex/wiki/Teleconference,_2013.1.03,_12-13_pm_CET), some of us described a synset as a mere agglomerate of words (which is what its etymology may suggest), though, as others pointed out, it actually conveys some semantics: the fact that a gloss is attached to it tells us that the lexicographers are "seeing" some kind of concept in it. The point is that we don't have to delve into the details of what a synset is (or what an analogous element in another resource is, maybe with a slightly different intended meaning). We should just represent the fact that some resources deal only with words (e.g. some poor translation dictionaries do not even provide a sense distinction in their translations of a term), while others provide agglomerates of these words. Whether such an agglomerate is a sense (and how fine-grained it is), or something more or less close to a notion of concept, is IMHO not something that should be of interest to us.
Obviously, the use of a specific linked lexical resource should not be made "anonymous". Users who know that a given ontology/dataset has been enriched with synsets from WordNet may make the assumptions that the use of such a resource warrants. At the same time, a mapping agent that recognizes that the linguistic identifiers connected to the two ontologies it is trying to align are both synsets (or, more generally, come from the same lexical resource) may use them as an interlingua (instead of words) to improve the precision (and performance) of the matching algorithm. The matching algorithm itself can be described on the basis of elements described with our metamodel.
lom: prefix for our lexicon ontology model
SemanticIndex: a class describing a generic semantic index for collecting lexical entries under some "semantic hat".
Definition of a synset in Wordnet RDF
    <wn20schema:NounSynset rdf:about="wn20instances:synset-entity-person-1" rdfs:label="entity">
      <wn20schema:synsetId>00007846</wn20schema:synsetId>
    </wn20schema:NounSynset>
Wrapping the wn20schema:Synset class under our lom:SemanticIndex hat
    <rdf:Description rdf:about="wn20schema:Synset">
      <rdfs:subClassOf rdf:resource="lom:SemanticIndex"/>
    </rdf:Description>
Enriching an ontology with a wordnet synset, by using the generic property: lom:semanticDescriptor
    <someOntology:Person>
      <lom:semanticDescriptor rdf:resource="wn20instances:synset-entity-person-1"/>
    </someOntology:Person>
Note that the model here may be oversimplified: I deliberately used a fictitious property, lom:semanticDescriptor, where a longer path containing "sense" and "reference" would have been required. The intent is to propose this alternative model and to understand whether we can go in this direction.
Is it not the case that this SemanticIndex is the same as skos:Concept?
In this case we could do the following modelling:
    :cat a lom:Word ;
        lom:sense [ lom:reference :cat_synset_1 ] .

    :cat_synset_1 a lom:SemanticIndex ;
        rdfs:comment "A domestic animal of species felis catus"@en ;
        owl:sameAs dbpedia:Cat .
Where the semantic index is as follows:
    lom:SemanticIndex rdfs:subClassOf skos:Concept .
Armando: Absolutely yes, in fact my proposal is that our vocabulary for describing lexical resources can inherit from the SKOS/SKOS-XL one.
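To illustrate what this inheritance would give a consumer, the Python sketch below follows rdfs:subClassOf links transitively, so that a WordNet synset can be recognized as a lom:SemanticIndex and as a skos:Concept. The link from wn20schema:NounSynset to wn20schema:Synset is an assumption about the WordNet schema that is not stated on this page; the other two links come from the snippets above. The function names are hypothetical.

```python
from typing import Dict, Set

# rdfs:subClassOf statements, child -> direct parents
SUBCLASS_OF: Dict[str, Set[str]] = {
    "wn20schema:NounSynset": {"wn20schema:Synset"},  # assumed, not stated here
    "wn20schema:Synset": {"lom:SemanticIndex"},
    "lom:SemanticIndex": {"skos:Concept"},
}


def superclasses(cls: str) -> Set[str]:
    """All classes reachable from `cls` via rdfs:subClassOf (transitively)."""
    seen: Set[str] = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        for parent in SUBCLASS_OF.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_instance_of(asserted_type: str, target: str) -> bool:
    """True if a resource typed `asserted_type` is also a `target`."""
    return target == asserted_type or target in superclasses(asserted_type)


print(sorted(superclasses("wn20schema:NounSynset")))
print(is_instance_of("wn20schema:NounSynset", "skos:Concept"))
```

A generic SKOS-aware tool could thus treat wn20instances:synset-entity-person-1 as a concept without knowing anything about WordNet, which is the kind of "recognizable hook" the metamodel is meant to provide.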
Example: WordNet as lemon-SKOS
    <cat:v> lexinfo:derivedForm <cat:n> ;
        lexinfo:partOfSpeech lexinfo:verb ;
        lemon:canonicalForm <cat:v#canonicalForm> ;
        lemon:language "eng" ;
        lemon:otherForm <cat:v#otherForm1>, <cat:v#otherForm2> ;
        lemon:sense <cat::2:29:0::>, <cat::2:35:0::> ;
        lemon:synBehavior <VerbFrames#frame2>, <VerbFrames#frame9> ;
        a lemon:LexicalEntry .

    <cat:v#canonicalForm> lemon:writtenRep "cat"@eng ;
        a lemon:LexicalForm .
    <cat:v#otherForm1> lemon:writtenRep "catted"@eng ;
        a lemon:LexicalForm .
    <cat:v#otherForm2> lemon:writtenRep "catting"@eng ;
        a lemon:LexicalForm .

    <cat::2:29:0::> a lemon:LexicalSense ;
        lemon:reference <VerbSynset76400> .
    <cat::2:35:0::> a lemon:LexicalSense ;
        lemon:reference <VerbSynset1411870> .

    <VerbSynset76400> a skos:Concept ;
        skos:broader <VerbSynset72989> ;
        skos:definition "eject the contents of the stomach through the mouth; \"After drinking too much, the students vomited\"; \"He purged continuously\"; \"The patient regurgitated the food we gave him last night\""@eng .
    <VerbSynset1411870> a skos:Concept ;
        skos:broader <VerbSynset1411085> ;
        skos:definition "beat with a cat-o'-nine-tails"@eng .
Standards
The model should have a simple mapping to LMF, so that LMF and LMF-based resources can be easily represented with this model. In particular, we should be able to model:
- Core LMF model (Senses, lexical entries, forms and representations)
- Morphology (see Specification_of_Requirements/Morphology)
- Machine readable dictionary extensions (Subjects, contexts, equivalence links)
- Syntax (Subcategorization frames, argument structures)
- Semantics (Predicate representations, synsets)
- Multilingual Representation (Translation links)
- Morphological Patterns (see Specification_of_Requirements/Morphology)
- Multi-word expression (see Specification_of_Requirements_on_Terminological_Analysis)
The model should be able to represent terms with feature structures, parts of speech and definitions.
TBX, UTX, XLIFF
The model should be able to accommodate data from translation memory formats; it is not intended to be a model for translation memories.
Naming Guidelines
From the point of view of OntoLex we use URIs as identifiers, which are essentially physical objects (referring to a file on some server), so we cannot mandate the use of a particular naming scheme. However, we can recommend a particular scheme, and we will certainly take the Kyoto scheme into account.
The convention in wordnet-LMF is to use identifiers of the form LLL-VV-OOOOOOOO-P, where LLL is the language, VV is the version, OOOOOOOO is the offset and P is the part of speech. So instead of syn_n_08225481, people use eng-30-08225481-n. Adopting the same convention would make interoperability a little easier.
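As a sketch of how such identifiers could be produced and checked, the Python functions below convert the older style to the wordnet-LMF style and split an LMF-style identifier into its parts. The `syn_<pos>_<offset>` input pattern is inferred from the single example above, and the defaults "eng" and "30" are assumptions taken from that same example; the function names are hypothetical.

```python
import re
from typing import Dict

# Old style, inferred from the one example "syn_n_08225481" (assumption).
_OLD_ID = re.compile(r"syn_([a-z])_(\d{8})")
# wordnet-LMF style: LLL-VV-OOOOOOOO-P
_LMF_ID = re.compile(r"([a-z]{3})-(\d{2})-(\d{8})-([a-z])")


def parse_wordnet_lmf_id(identifier: str) -> Dict[str, str]:
    """Split an LLL-VV-OOOOOOOO-P identifier into its four parts."""
    match = _LMF_ID.fullmatch(identifier)
    if match is None:
        raise ValueError(f"not a wordnet-LMF style identifier: {identifier!r}")
    language, version, offset, pos = match.groups()
    return {"language": language, "version": version,
            "offset": offset, "pos": pos}


def to_wordnet_lmf_id(old_id: str, language: str = "eng",
                      version: str = "30") -> str:
    """Convert e.g. 'syn_n_08225481' to 'eng-30-08225481-n'."""
    match = _OLD_ID.fullmatch(old_id)
    if match is None:
        raise ValueError(f"unrecognized identifier: {old_id!r}")
    pos, offset = match.groups()
    new_id = f"{language}-{version}-{offset}-{pos}"
    parse_wordnet_lmf_id(new_id)  # raises if language/version are malformed
    return new_id


print(to_wordnet_lmf_id("syn_n_08225481"))
print(parse_wordnet_lmf_id("eng-30-08225481-n"))
```

Since URIs cannot be mandated, a check like `parse_wordnet_lmf_id` would apply only to the local part of an identifier, and only for resources that choose to follow the recommendation.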