Ontology-Lexica Community Group

The mission of the Ontology-Lexica Community Group is to:

1. Develop models for the representation of lexica (and machine-readable dictionaries) relative to ontologies. These lexicon models are intended to represent lexical entries containing information about how ontology elements (classes, properties, individuals, etc.) are realized in multiple languages. In addition, the lexical entries contain appropriate linguistic (syntactic, morphological, semantic and pragmatic) information that constrains the usage of the entry.
2. Demonstrate the added value of representing lexica on the Semantic Web, in particular focusing on how the use of linked data principles can allow for the re-use of existing linguistic information from resources such as WordNet.
3. Provide best practices for the use of linguistic data categories in combination with lexica.
4. Demonstrate that the creation of such lexica, in combination with the semantics contained in ontologies, can improve the performance of NLP tools.
5. Bring together people working on standards for representing linguistic information (syntactic, morphological, semantic and pragmatic), building on existing initiatives and identifying collaboration tracks for the future.
6. Cater for interoperability among existing models used to represent and structure linguistic information.
7. Demonstrate the added value of applications relying on the combination of lexica and ontologies.

Note: Community Groups are proposed and run by the community. Although W3C hosts these conversations, the groups do not necessarily represent the views of the W3C Membership or staff.

Greetings from not-so-sunny California

Paul and John, we met at K-CAP a few years back.

Hello everybody.

Your call for participation in the ontolex workgroup addresses an issue that has been my main concern with RDF and OWL and most of the tools operating on them. I have been building a framework that allows RDF import/export, but in order to correlate different languages (change the appropriate rdfs:label) there has to be a much more elaborate structure supporting this, unless you want to rely on a mapping tool (RDF-translator?).

The structures supporting this are interwoven with concept nodes of the RDF-framework type and link inter-language synonyms (or synonymous phrases), which we call SYNTRANS relations; they then branch out into their compositions (syntactic representations). Each layer, of course, has its own SYNTRANS relations.
What results is a true semantic network with the ability to generate phrases in any ‘learned’ language, or to recognize concepts (even concepts that are described in whole sentences) in any language (including multiple alphabets and/or embedded coding systems), even if each word of the input is in a different language.
Unfortunately, it does not lend itself very well to RDF-style sharing.
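
To make the idea concrete, here is a minimal sketch (my own reconstruction, not the actual framework) of how a SYNTRANS-style relation could tie language-specific surface forms to one shared, language-independent concept node; all names (`Concept`, `realize`, `ex:Purchase`) are hypothetical:

```python
# Sketch: a concept node whose SYNTRANS set holds inter-language synonyms.
from dataclasses import dataclass, field

@dataclass
class Form:
    text: str       # surface string in some language
    language: str   # e.g. "en", "de", "fr"

@dataclass
class Concept:
    ident: str
    forms: list = field(default_factory=list)  # SYNTRANS set: inter-language synonyms

    def add_form(self, text, language):
        self.forms.append(Form(text, language))

    def realize(self, language):
        """Generate a surface form in the requested language, if 'learned'."""
        for f in self.forms:
            if f.language == language:
                return f.text
        return None  # language not yet learned for this concept

purchase = Concept("ex:Purchase")
purchase.add_form("purchase", "en")
purchase.add_form("Kauf", "de")
purchase.add_form("achat", "fr")

print(purchase.realize("de"))  # -> Kauf
```

In the real system each form would presumably be a full syntactic node with its own relations rather than a bare string, but the lookup direction is the same.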

At the moment we are running four syntactic forms (noun, adjectival, verbal and adverbial forms), each of which links to its specific concepts (with SYNTRANS relations to the corresponding syntactic forms in other languages). Each syntactic form relates to its head concept in a particular way (a certain class of semantic relations). Each syntactic form contains all script forms representing that form in a particular language (e.g. by syntactic role: case, number, gender, …). There are multiple adjectival forms that are semantically linked to different roots: e.g. ‘purchased’ as in ‘purchased goods’ is linked to the noun form, while ‘purchasing’ as in ‘purchasing party’ is linked to the verb form; the verb form is in turn linked to the noun form.
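
The four-form arrangement and the derivation links between forms could be sketched like this (again my own illustration, with hypothetical names; feature tuples stand in for the per-language script forms):

```python
# Sketch: the four syntactic forms, each carrying per-language script forms
# and a semantic link to the form it derives from.
NOUN, VERB, ADJ, ADV = "noun", "verbal", "adjectival", "adverbial"

class SynForm:
    def __init__(self, kind, lemma, derived_from=None):
        self.kind = kind
        self.lemma = lemma
        self.derived_from = derived_from  # semantic relation to another form
        self.script_forms = {}            # (case, number, gender) -> surface string

    def add_script_form(self, features, surface):
        self.script_forms[features] = surface

noun = SynForm(NOUN, "purchase")
verb = SynForm(VERB, "purchase", derived_from=noun)          # verb linked to noun form
adj_from_noun = SynForm(ADJ, "purchased", derived_from=noun) # as in "purchased goods"
adj_from_verb = SynForm(ADJ, "purchasing", derived_from=verb)# as in "purchasing party"

noun.add_script_form(("nominative", "plural", None), "purchases")
print(noun.script_forms[("nominative", "plural", None)])  # -> purchases
```

The point of the two adjectival nodes is that they keep distinct semantic roots while still converging, via `derived_from`, on the same head concept.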

Once you get further away from the syntactic nodes, the nodes are effectively language-independent, RDF-exportable nodes. You can pick any one and request that it be presented in a given language and in the syntactic role needed to generate a sentence; conversely, a node can be recognized within a search input, the best-matching nodes identified, and a semantic-network graph representing the input generated. I can send you some examples if you like.
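
The recognition direction (mapping mixed-language input back onto language-independent concept nodes) could look roughly like this; the lexicon entries and concept identifiers here are made up for illustration:

```python
# Sketch: recognize concepts from input even when each word is in a
# different language, by looking words up across all learned languages.
lexicon = {
    ("purchase", "en"): "ex:Purchase",
    ("Kauf", "de"): "ex:Purchase",
    ("goods", "en"): "ex:Goods",
    ("Waren", "de"): "ex:Goods",
}

def recognize(tokens):
    """Map each token to a concept node, regardless of its language."""
    concepts = []
    for tok in tokens:
        for (word, lang), concept in lexicon.items():
            if word == tok:
                concepts.append(concept)
                break
    return concepts

print(recognize(["Kauf", "goods"]))  # -> ['ex:Purchase', 'ex:Goods']
```

A real implementation would rank candidate nodes rather than take the first hit, and would build the semantic-network graph from the matched concepts; this only shows the language-independent lookup step.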

In terms of use cases, I can see:
• Obviously, ontology-based machine translation.
• Lexicon-style explanation of terms, even interactively: the system can generate an explanation and, if it is not clear, modify it with simpler terms or different wording, depending on the depth of the semantic network.
• Rephrasing of a concept (to assist in human translation).
• Recognition of a story and correlation with known knowledge, or flagging of unknown or conflicting knowledge (I really like this feature).
• Supplementing concepts from, e.g., a news story with knowledge collected from other sources.
• Identification of content not mentioned (a form of supplementing that can be used to indicate that content has been omitted, purposely or not).

The multi-language feature greatly assists in seeding our system. Through the exploration of word lists and dictionaries, the mapping of single words in one language to complex terms in another helped build up common semantic relations. The more languages the merrier! I think Ed Hovy presented a report along those lines a few years back.

It would not take much to synthesize speech from the generated output. The syntactic forms we store contain IPA representations (another thing you might want to think about).
We had a meeting with a voice-recognition group recently and talked about a piece of hardware that can reliably transform the spoken word into IPA, or some other coded form, that then becomes an input that can be SYNTRANS’ed.
Then, given enough CPU power, we are talking a true Babel fish: you pick up the phone in the US, speak English into it, and have a conference call with a Russian, a German and a Chinese speaker, each hearing and speaking their own language simultaneously, even looking at the same corresponding web pages in their own languages.

What has been recognized as a great danger of this technology is also its biggest asset: it will eventually settle on the basic truth, and that might be very uncomfortable for a lot of people. Whoever controls it controls what is publicly known (as with other media).

So, I hope this didn’t sound too much like a sales pitch; rather, I am saying that I have spent quite some time on the subject of language representation in ontologies and lexica, as well as on machine-generated presentation of concepts.

I hope this experience can be useful to this group.