Unstructured List of Requirements
Appearance
- Part-of-speech categories (IE,QA,NLG,LOE,OA...)
- IE: "gross domestic product" is a noun
- QA: "paint" (en), "malte" (de) are (transitive) verbs
- NLG: "french" is an adjective, "painter" is a noun, "born" is a past participle
- LOE: "asset manager" and "hedge fund" are nouns
- OA example: equivalence between OntologyA#book and OntologyB#book can be safely discarded if one is a verb and the other is a noun
- REQ: external POS categories / registries / tagsets should be reusable (assumption: they have a URI that can be pointed to)
- Lemma / Inflectional Information (IE,QA, NLG, LOE ...)
- QA: "painted" (en), "malte" (de) are inflectional variants of the lemmas "paint" (en) and "malen" (de), respectively
- REQ: relation to inflected form should be explicit
- LOE: "hedge funds" is the plural of "hedge fund", and "assets" is the plural of "asset". In certain cases, however, lemmatisation should not be applied: only the plural "assets" is used in "fixed assets", so reduction to the lemma would be wrong here.
- Gender (IE, QA, NLG, ...)
- QA: "pintor" is masculine and thus requires the article "el" and the adjective "francés" as well as the participle "nacido"; "Maler" is also masculine and thus requires the inflected adjective "französischer" as modifier
- REQ: should be possible to reuse external gender categories
- Number, Case (IE, QA, NLG, LOE) are needed for similar reasons to gender
- REQ: should be possible to reuse external number and case categories
- LOE: mainly "case" is relevant, for determining the (semantic) role of the string
- Abbreviations (IE,QA,NLG,OA...)
- IE/OA: "GDP" is the abbreviation of "gross domestic product"
- REQ: relation to long form should be explicit
- Subcategorization Information / Valence (IE,QA,NLG,LOE)
- IE: "gross domestic product" CAN subcategorize a prepositional phrase headed by the preposition "of" as well as a prepositional phrase headed by "in".
- QA: "paint" (en) is a transitive verb that requires a subject and an object
- NLG: "was born" subcategorizes a subject as well as a prepositional phrase headed by "in" expressing the location of birth or a prepositional phrase headed by "in" or "on" expressing the year or date, respectively
- NLG: "die" subcategorizes a subject as well as optional prepositional phrases headed by "in"/"on" expressing the location of death and/or the year or date of death
- REQ: arguments can be specified to be optional!
- LOE: relevant for detecting ontological relations, not only arguments but also adjuncts (temporal, etc)
- OMT: Syntactic roles must be inferrable
- Mapping subcategorization structures to ontology structures
- IE: "gross domestic product(of: X, in: Y) is Z" maps to <RDF> b rdf:type ex:gdp, b ex:year Y, b ex:country X, b ex:value Z </RDF>
- QA: "paint(subj:X, obj:Y)" maps to <RDF> Y dbpedia:artist X </RDF>
- NLG: "french N" maps to N ex:nationality "french" or N rdf:type "French_People", ...
- LOE: relevant, example to come
- Compliance with basic semiotic distinctions (expression, meaning, reference, linguistic act or context) (e.g. as implemented in the semiotics.owl OWL ontology)
- IE1/SS: (NER) "Barack Obama" (reference) vs. "sinking of the Titanic" (?reference?) vs. "US President" (?meaning?) vs. "State of the Union Address" (?linguistic act?)
- IE2/LOE: (logical representation of sentences) reference vs. meaning as with "John hates cats" vs. "Dogs hate cats". They should be extracted with different formal patterns, e.g. <math>\forall(c)(Cat(c) \rightarrow hates(John,c))</math> vs. <math>\forall(c,d)((Cat(c) \wedge Dog(d)) \rightarrow hates(d,c))</math> (an OWL representation for these exists, but FOL is clearer in this case)
- IE3/OA: (term extraction) expressions can have different meanings in different syntactic contexts: e.g. in "Al Pacino is an American film and stage actor", "film actor" cannot be trivially extracted; instead we have alternatives: "American film"+"stage actor" vs. "film and stage actor" vs. "film"+"stage actor". It is then important to follow good practices of distinguishing expression vs. meaning patterns
- QA: "who's afraid of the big bad wolf?" needs to be answered in one of several ways:
- via meaning: the frame being afraid should be extracted from the question, and detected in a corpus of potential answers, possibly by typing the named entities that are themes in the frame (?humans, ?goats, ?pigs)
- via reference: the frame being afraid should be extracted from the question, and detected in a corpus of potential answers, directly on specific entities, e.g. "the Three Little Pigs"
- via context or linguistic act: in the context of the movie "The Shining", it is a song when Jack, having gone mad, enacts the Big Bad Wolf
- via expression: the expression "who's afraid of the big bad wolf?" is musically attached to the refrain "tra la la la la" (which is actually the answer given by WolframAlpha to the question!)
- NLG: compare relations between referencing expressions vs. relations between a referencing expression and a concept:
- <dbp:History_of_South_Khorasan_Province> <skos:broader> <dbp:South_Khorasan_Province> ---> "History of South Khorasan Province is related to the South Khorasan Province"
- <dbp:History_of_South_Khorasan_Province> <skos:broader> <dbp:History_of_Iran_by_province> ---> "History of South Khorasan Province is an example/a part/a case of the History of Iran by province"
- LLD: WordNet's "synsets", VerbNet's "verb classes", FrameNet's "frames" can be partly mapped as meaning types, while WordNet's "words", VerbNet's "verbs", FrameNet's "lexemes" can be partly mapped as expression types, etc.
- TRANS: different translations often depend on different linguistic act representations, not only on different meanings
- Higher-order ontology mapping (OMT)
- OMT: Lexical entries should map to concepts constructed from multiple ontology properties, e.g., "X gives Y to Z"
- Morphological variants (OA, LOE)
- QA: "paints" is the third person singular form of "paint". "painted" (en), "malte" (de) are the past forms of the lemmas "paint" (en) and "malen" (de), respectively
- LOE: Reduction to lemma helpful in case an expression is not constrained to be in plural (for example).
- REQ: Morphological variants should be expressible by classes, e.g., "aqua" is a first-declension noun in Latin. Furthermore, these variants should be expressible with some "parametric" forms, e.g., "to speak" is a strong English verb with past "spoke" and perfect "spoken".
- REQ(?): Representation of such patterns should define the classes in an imperative manner, e.g., the third person singular present form of an English verb is made by adding "~(e)s"
- Lexical Variants/Paraphrases:
- QA: artist(of: Y), created(subj: X, obj: Y), painted(subj: X, obj: Y) are all valid ways of expressing the property <RDF> Y dbpedia:artist X </RDF>
- LOE: Variants/paraphrases detection is relevant in order not to induce ontology evolution on the basis of paraphrase
- Metadata about type of linguistic resource
- LOE: Whether the linguistic resource is structured, associated with lexical semantics, etc.
- Metadata about expressivity of an ontology/concept scheme for a given language (in general) or, more specifically, for a given lexical resource
- VoID-like statistical info about:
- the expressivity of an ontology/concept scheme with respect to a given language (e.g. how many resources are labelled or linguistically expressed somehow, in a given language)
- the coverage of an ontology/concept scheme with respect to a given lexical resource (i.e. how much of the conceptual content is covered by that resource's lexical content)
- the "linguistic enrichment" of an ontology wrt a given lexical resource, could be an entity on its own, thus mapping the specific relation ontology-lexresource for a specific lexresource, and statistical data of point above would be associated to each "linguistic enrichment" instance
- LOE: The ontology is to be modified/extended only if a certain number of expressions are available for a concept that is not yet in the ontology, etc. Threshold to be determined
- Supporting translation mapping varieties
- Representation of forms, senses, synsets in a lexical resource
- LOE: If new lexical entries appear, but can be linked to synsets that are already in relation to established ontology elements, then there is no need to modify the ontology
- Representation of relations between entities in LRs
- Representations of relations between lexical entries for the same concept across languages (ANY EXAMPLE?)
- Representation of the contextual conditions under which one lexicalization is preferred, cannot be used, etc. (IE, QA, NLG, OA)
- X dbpedia:creator Y can only be verbalized as "Y painted X" if X is a painting, i.e. X rdf:type painting holds.
- Lexico-syntactic Patterns (cf. LSP page) (LOE)
- LOE: "The statement of financial position (sometimes called the balance sheet) includes an entity’s assets, liabilities and equity as of the end of the reporting period." (IFRS)
- Lexico-syntactic ODP, equivalence relation NP<class> (call in passive) NP<class>: [NP0 statement of financial position] [VBN sometimes called] [NP1 balance sheet]
- Hearst pattern: [NP0 statement of financial position] [VBZ includes] [NP1 assets] [NP2 liabilities] [NP3 equity]
- Term structure, term decomposition (LOE)
- LOE: Rechnungsabgrenzungsposten (Deferred charges and accrued income): [NP0 Rechnung] [infix s] [NP1 Abgrenzung][infix s] [NP2 Posten] + Number, Case, POS
- General Desiderata
- Conciseness: Use fewer triples to do the job
- Clarity: Use clear readable names. Use single word identifiers; avoid "JavaBeans identifiers" (hasProperty)
- Modularity: Ensure that there are modules (as per formal definition of Ontology Modularity)
- Extensibility: Attempt to anticipate future changes.
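
The requirements above on subcategorization, optional arguments, and mapping subcategorization structures to ontology structures can be illustrated with a minimal sketch. It is not a proposed model, only one possible reading: the class names `Argument` and `LexicalEntry`, the function `to_triples`, the `?role` placeholder syntax, and the `value` role for the copula complement of "gross domestic product" are all assumptions made for illustration. Triples are plain string tuples rather than objects from an RDF library.

```python
from dataclasses import dataclass

# A triple is (subject, predicate, object), written as prefixed names.
Triple = tuple[str, str, str]


@dataclass(frozen=True)
class Argument:
    """One slot of a subcategorization frame (hypothetical structure)."""
    role: str               # e.g. "subj", "obj", or a preposition such as "of"
    optional: bool = False  # REQ: arguments can be specified to be optional


@dataclass(frozen=True)
class LexicalEntry:
    """A lemma, its frame, and the triple patterns the frame maps to."""
    lemma: str
    arguments: tuple[Argument, ...]
    # Pattern terms of the form "?role" are replaced by the filler of that role.
    patterns: tuple[Triple, ...]


def to_triples(entry: LexicalEntry, fillers: dict[str, str]) -> list[Triple]:
    """Instantiate the entry's triple patterns with the given role fillers."""
    missing = [a.role for a in entry.arguments
               if not a.optional and a.role not in fillers]
    if missing:
        raise ValueError(f"{entry.lemma}: missing required argument(s) {missing}")
    triples = []
    for pattern in entry.patterns:
        roles = [term[1:] for term in pattern if term.startswith("?")]
        # A pattern that mentions an unfilled (optional) role is skipped.
        if any(role not in fillers for role in roles):
            continue
        s, p, o = (fillers[t[1:]] if t.startswith("?") else t for t in pattern)
        triples.append((s, p, o))
    return triples


# QA example from the list: "paint(subj:X, obj:Y)" maps to Y dbpedia:artist X.
PAINT = LexicalEntry(
    lemma="paint",
    arguments=(Argument("subj"), Argument("obj")),
    patterns=(("?obj", "dbpedia:artist", "?subj"),),
)

# IE example from the list: "gross domestic product(of: X, in: Y) is Z".
# Both prepositional phrases are treated as optional; "value" is an assumed
# role name for Z, and "_:b" stands for the node written as b in the list.
GDP = LexicalEntry(
    lemma="gross domestic product",
    arguments=(Argument("of", optional=True),
               Argument("in", optional=True),
               Argument("value")),
    patterns=(("_:b", "rdf:type", "ex:gdp"),
              ("_:b", "ex:year", "?in"),
              ("_:b", "ex:country", "?of"),
              ("_:b", "ex:value", "?value")),
)
```

With placeholder fillers, `to_triples(PAINT, {"subj": "ex:X", "obj": "ex:Y"})` yields the single triple `("ex:Y", "dbpedia:artist", "ex:X")`. Calling `to_triples(GDP, {"in": "ex:Y", "value": "ex:Z"})` omits the `ex:country` triple because the optional "of" phrase is absent. A contextual condition such as "X rdf:type painting" could be added as a guard on each pattern, which this sketch leaves out.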
General Issues
- How to represent meaning of a lexical entry?
- Do we need lexical relations (synonymy, hypernymy, hyponymy)?
- Do we need to include the following?
- Top-level semantic classes of referenced ontology entities
- Higher-order ontological predicates
- basic similarity measures