Paul and John, we met at K-cap a few years back.
Your call for participation in the ontolex workgroup addresses an issue that has been my main concern with RDF and OWL and most of the tools operating with them. I have been building a framework that allows RDF import/export, but correlating different languages (changing the appropriate rdfs:label) requires a much more elaborate supporting structure, unless you want to rely on a mapping tool (RDF-translator?).
The structures to support this are interwoven with concept nodes of the RDF framework type and link inter-language synonyms (or synonymic phrases) (we call them SYNTRANS relations), then branch out into their compositions (syntactic presentations). Each of the layers, of course, has its own SYNTRANS relations.
What results is a true semantic network with the ability to generate phrases in any ‘learned’ language or recognize concepts (even concepts that are described in whole sentences) in any language (incl. multiple alphabets and/or containing coding systems), even if the input consists of each word in a different language.
Unfortunately it does not lend itself very well to RDF style sharing.
At the moment we are running 4 syntactic forms (noun, adjectival, verbal and adverbial forms) that each link to their specific concepts (with SYNTRANS to their corresponding syntactic forms in different languages). Each syntactic form relates to its head concept in a particular way (a certain class of semantic relations). Each syntactic form contains all script forms representing that form in a particular language (e.g. syntactic roles, case, plurality, gender …). There are multiple adjectival forms that are semantically linked to different roots (e.g. ‘purchased’ as in ‘purchased goods’ is linked to the noun form, while ‘purchasing’ as in ‘purchasing party’ is linked to the verb form; the verb form is in turn linked to the noun form).
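To make the layering a bit more concrete, here is a minimal sketch in Python. It is purely illustrative and not the actual implementation: the class names, the `ex:` URI, the feature keys such as `("sg",)` and the `syntrans` helper are all assumptions made up for this example. In particular, SYNTRANS is derived here from a shared concept and category, whereas a real system might store it as an explicit relation.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Concept:
    """Language-independent node (hypothetical stand-in for an RDF resource)."""
    uri: str


@dataclass
class SyntacticForm:
    """One syntactic form of a concept in one language."""
    concept: Concept
    language: str   # e.g. "en", "de"
    category: str   # "noun", "adjective", "verb" or "adverb"
    # Script forms keyed by a feature tuple, e.g. ("sg",) or ("pl",).
    script_forms: dict = field(default_factory=dict)


def syntrans(form, all_forms, target_language):
    """Return the forms in target_language that share concept and category.

    Assumption: SYNTRANS is computed from the shared concept rather than
    stored as an explicit link between the two forms.
    """
    return [
        f for f in all_forms
        if f.concept == form.concept
        and f.category == form.category
        and f.language == target_language
    ]


purchase = Concept("ex:Purchase")
forms = [
    SyntacticForm(purchase, "en", "noun", {("sg",): "purchase", ("pl",): "purchases"}),
    SyntacticForm(purchase, "de", "noun", {("sg",): "Kauf", ("pl",): "Käufe"}),
    SyntacticForm(purchase, "en", "verb", {("inf",): "purchase", ("3sg",): "purchases"}),
]

german_nouns = syntrans(forms[0], forms, "de")
```

With this toy data, `german_nouns` holds the single German noun form, whose plural script form can then be looked up under `("pl",)`.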
Once you get further away from the syntactic nodes, the nodes are effectively language-independent, RDF-exportable nodes. You can pick any one and request that it be presented in a given language and in the syntactic role needed to generate a sentence. Conversely, the system recognizes nodes within a search input, identifies the best-matching ones, and can generate a semantic network graph representing the input. I can send you some examples if you like.
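The two directions described above (presenting a node in a requested language and role, and recognizing nodes in an input) can be sketched roughly as follows. This is only a toy illustration under stated assumptions: the `LEXICON` table, the role labels such as `noun.sg`, the `ex:` identifiers and both function names are hypothetical, and recognition is reduced to exact token lookup rather than any real best-match ranking.

```python
# Hypothetical lexicon: (concept, language, role) -> surface string.
LEXICON = {
    ("ex:Dog", "en", "noun.sg"): "dog",
    ("ex:Dog", "de", "noun.sg"): "Hund",
    ("ex:Bark", "en", "verb.3sg"): "barks",
    ("ex:Bark", "de", "verb.3sg"): "bellt",
}

# Reverse index from lower-cased surface string to the concepts it may express.
REVERSE = {}
for (concept, language, role), surface in LEXICON.items():
    REVERSE.setdefault(surface.lower(), set()).add(concept)


def render(concept, language, role):
    """Present a language-independent concept in a given language and role."""
    return LEXICON[(concept, language, role)]


def recognize(text):
    """Return the concepts found in text, in token order.

    Tokens may come from different languages; unknown tokens are skipped.
    Assumption: whitespace tokenization and exact matching only.
    """
    found = []
    for token in text.lower().split():
        found.extend(sorted(REVERSE.get(token, ())))
    return found


german = render("ex:Dog", "de", "noun.sg")
mixed = recognize("Hund barks")
```

Here `mixed` is built from an input whose two words are in different languages, which is the mixed-language case mentioned earlier.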
Possible applications:
• Obviously, ontology-based machine translation.
• Lexicon style explanation of terms, even interactively. The system can generate an explanation and if not clear, modify the explanation with simpler terms or different wording, depending on the depth of the semantic network.
• Rephrasing of a concept (to assist in human-based translation).
• Recognition of a story and correlation to known knowledge or flagging of unknown or conflicting knowledge (I really like this feature)
• Supplementing of concepts from, e.g. a news story, with knowledge collected from other sources.
• Identification of content not mentioned (a form of supplementing that can be explored to indicate that content has been omitted, purposely or not).
The multi-language feature greatly assists in seeding our system. When exploring word lists or dictionaries, mapping single words in one language to complex terms in another helped greatly in building up common semantic relations. The more languages the merrier! I think Ed Hovy presented a report along those lines a few years back.
It would not take much to synthesize speech with the generated output. The syntactic forms we store contain IPA presentations (another thing you might want to think about).
We had a meeting with a voice recognition group recently and talked about a piece of hardware that can reliably transform spoken words into IPA or some other coded form, which then becomes an input that can be SYNTRANS’ed.
Then, given enough CPU power, we are talking about a true Babelfisch. You will pick up the phone in the US, speak English into it, and have a conference call with a Russian, a German and a Chinese speaker, each hearing and speaking in their own language simultaneously, even looking at the same corresponding web pages in their own languages.
What has been recognized as a great danger of this technology is also its biggest asset: it will eventually settle on the basic truth, and that might be very uncomfortable for a lot of people. Whoever controls it controls what is publicly known (as with other media).
So, I hope this sounded less like a sales pitch and more like a statement that I have spent quite some time on language representation in ontologies and lexica, as well as on machine-generated presentation of concepts.
I hope this experience can be useful to this group.