Cypher

From W3C Wiki

The Cypher alpha release is an AI software program that generates the RDF graph and SPARQL/SeRQL query representation of a plain-language input, allowing users to update and query databases in plain language. With robust definition languages, Cypher's grammar and lexicon can be extended quickly to process highly complex sentences and phrases of any natural language, and to cover any vocabulary. Equipped with Cypher, programmers can begin building next-generation Semantic Web applications that harness what is already the most widely used tool known to man: natural language.

Transcography

Cypher rigorously conforms to a sub-discipline of natural language processing called Transcography, which was developed by Monrai with the goal of merging the field of natural language processing with the increasingly popular Semantic Web movement. Transcography is a set of core principles for converting parsed phrases into RDF triples. More specifically, transcography is the process of parsing the phrase structure of a natural language construct and translating the resulting grammar tree into a semantic graph. Each NL construct yields three outputs:

  1. a URI representation of the NL construct
  2. a set of one or more subject-predicate-object triples involving the URI
  3. the set of all triples produced by sub-phrases

Thus, Cypher views any and all linguistic input as a URI + related triples. This notion makes the lexical component a powerful NL resource for Cypher.

As an example of transcographic output, consider the phrase "John's coach". The transcographic process produces a URI representing the phrase, for example http://john.mysite.com/MrDouglass, and a set of triples representing the statements involved in the phrase:

{http://john.mysite.com/me} jo:hasCoach {http://john.mysite.com/MrDouglass}
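The three outputs listed earlier can be pictured as one small record. The following is only a minimal sketch: the type name `TranscodedOutput` and its field names are hypothetical, and the empty sub-phrase set assumes that the sub-phrase "John" contributes nothing beyond its URI.

```python
from typing import List, NamedTuple, Tuple

# A triple is (subject, predicate, object), written here as plain strings.
Triple = Tuple[str, str, str]


class TranscodedOutput(NamedTuple):
    """Hypothetical container for the three outputs of one NL construct."""
    uri: str                          # 1. URI representing the construct
    triples: List[Triple]             # 2. triples involving that URI
    sub_phrase_triples: List[Triple]  # 3. triples produced by sub-phrases


# The "John's coach" example from the text.
johns_coach = TranscodedOutput(
    uri="http://john.mysite.com/MrDouglass",
    triples=[
        ("http://john.mysite.com/me", "jo:hasCoach",
         "http://john.mysite.com/MrDouglass"),
    ],
    sub_phrase_triples=[],  # assumption: "John" adds no triples of its own
)
```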

Cypher leverages these triples to create either an RDF model or a SPARQL query. The mode of output depends on the type of NL construct: a clause or description yields an RDF model, while a noun phrase or question yields a query. The triples of sub-phrases are recursively merged to produce a root graph representing the root NL phrase or clause. For example, consider "John's coach knows Martin". The URI produced will represent this clause (e.g. the URI of a reified RDF triple, or the URI of a semantic frame), and the graph will contain:

{qv:node1} foaf:knows {http://john.mysite.com/MartinCrump}

The URI qv:node1 represents a variable of a SPARQL query that has been serialized in RDF. This is because the phrase "John's coach" is a relational noun phrase and thus an anaphoric reference. By reconstructing the SPARQL query for the variable (by following the links from qv:node1) and then executing it, a program can retrieve the resource that represents John's coach at the time of the query. This technique is used because John may have a new coach by then. Transcography stipulates that any anaphoric reference be represented by a query variable (linked to the RDF representation of the SPARQL query) unless the program is ready to apply the variable's value (e.g. to present it to a human user in an interface).
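The reconstruction step can be sketched roughly as follows. The source does not show the RDF vocabulary Cypher uses to serialize a query, so the `QUERY_PATTERNS` dictionary is a hypothetical in-memory stand-in for the links followed from qv:node1, and the function names are illustrative. Prefix declarations for jo: are omitted from the generated query.

```python
# Hypothetical stand-in for the RDF-serialized query linked from qv:node1:
# each query variable maps to the triple patterns that constrain it.
QUERY_PATTERNS = {
    "qv:node1": [
        ("<http://john.mysite.com/me>", "jo:hasCoach", "qv:node1"),
    ],
}


def to_sparql_term(term: str) -> str:
    """Render a qv: node as a SPARQL variable; pass other terms through."""
    if term.startswith("qv:"):
        return "?" + term[len("qv:"):]
    return term


def rebuild_query(variable: str) -> str:
    """Rebuild a SELECT query for the given query-variable node."""
    patterns = QUERY_PATTERNS[variable]
    body = " ".join(
        f"{to_sparql_term(s)} {to_sparql_term(p)} {to_sparql_term(o)} ."
        for s, p, o in patterns
    )
    return f"SELECT {to_sparql_term(variable)} WHERE {{ {body} }}"


query = rebuild_query("qv:node1")
```

Executing the rebuilt query at presentation time, rather than storing a fixed URI, is what lets the result reflect whoever is John's coach at that moment.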

The word transcography combines transcode, which means "to convert media from one format to another", and -graphy, which denotes "writing or text representation produced in a specified manner or by a specified process". Thus the literal meaning is "text transcoding". Knowledge representation frameworks used in the process include RDF and Frame Semantics.

The following six principles form the core of transcography:

Symbolic Reference

Each constituent of a phrase must resolve to a concept, referenced either by description (e.g. the blue bird) or by unique identifier (e.g. Henry Ford). A transcoder, therefore, produces either a URI or a BNode that represents the phrase, plus a set of triples representing the description given by the phrase.

Node Expansion

The set of triples produced by each child node of a phrase is included in the parent phrase’s output.
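Node expansion amounts to a recursive merge over the parse tree. The sketch below is illustrative only: `PhraseNode` and `expand` are hypothetical names, and the triple attached to the noun phrase (linking John to the query variable qv:node1) is an assumption about what that sub-phrase would contribute.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Triple = Tuple[str, str, str]


@dataclass
class PhraseNode:
    """Hypothetical parse-tree node carrying the triples it produced."""
    label: str
    own_triples: List[Triple] = field(default_factory=list)
    children: List["PhraseNode"] = field(default_factory=list)


def expand(node: PhraseNode) -> List[Triple]:
    """Return the node's own triples plus those of all its descendants."""
    merged = list(node.own_triples)
    for child in node.children:
        merged.extend(expand(child))
    return merged


noun_phrase = PhraseNode(
    label="John's coach",
    own_triples=[("http://john.mysite.com/me", "jo:hasCoach", "qv:node1")],
)
clause = PhraseNode(
    label="John's coach knows Martin",
    own_triples=[("qv:node1", "foaf:knows",
                  "http://john.mysite.com/MartinCrump")],
    children=[noun_phrase],
)
root_graph = expand(clause)
```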

Subcategorization

Transcography conforms to the theory that verbs and other atomic units of meaning in a language subcategorize for their arguments, and that this information is specified in the lexicon.

Identity Transfer

The human language processor produces semantic output by consulting a dictionary, and retrieving an entry for each word encountered in the input. The entry contains the description of an anonymous entity, and this description is transferred to the instance concept.
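A rough sketch of identity transfer: each dictionary entry describes an anonymous entity as predicate-object pairs, and those pairs are copied onto the instance that the phrase denotes. The lexicon contents, the ex: names, and the function names are all hypothetical and chosen only for illustration.

```python
import itertools
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical lexicon: each word describes an anonymous entity
# as a list of (predicate, object) pairs.
LEXICON: Dict[str, List[Tuple[str, str]]] = {
    "bird": [("rdf:type", "ex:Bird")],
    "blue": [("ex:color", "ex:Blue")],
}

_bnode_ids = itertools.count(1)


def new_bnode() -> str:
    """Mint a fresh blank-node label for an instance concept."""
    return f"_:b{next(_bnode_ids)}"


def transfer_identity(words: List[str], instance: str) -> List[Triple]:
    """Copy each word's dictionary description onto the instance."""
    triples: List[Triple] = []
    for word in words:
        for predicate, obj in LEXICON.get(word, []):
            triples.append((instance, predicate, obj))
    return triples


instance = new_bnode()
description = transfer_identity(["blue", "bird"], instance)
```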

Inference

Each phrase and clause in natural language expresses information not explicit in the phrase. The human language processor makes use of a dictionary which provides a semantic map, linking the explicit description provided by the phrase, to implicit descriptions inferred from the phrase.

Gestalt

Because of the influence of Frame Semantics, Cypher adheres to the principles of gestalt. The mind tends to see things not in isolation, but as part of a greater whole which encompasses (or rather, is encompassed by) the body of world knowledge we gather from prior experiences. Thus, the phrase "a book" inherently makes reference to its author (though anonymous) and its topic (though unknown), as these are a couple of the things brought to mind by the word book. When such implied elements are not present in the grammatical context, the mind tends to fill in the semantic gaps with anonymous objects that fit the minimum requirements of the missing element.
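The gap-filling behaviour can be sketched as follows, under the assumption that each word brings a fixed list of frame elements with it. The `FRAME_ELEMENTS` table, the ex: predicates, and the `fill_gaps` function are hypothetical; missing elements are filled with anonymous (blank) nodes.

```python
import itertools
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical frame table: elements brought to mind by a word.
FRAME_ELEMENTS: Dict[str, List[str]] = {
    "book": ["ex:author", "ex:topic"],
}


def fill_gaps(word: str, instance: str, stated: Dict[str, str]) -> List[Triple]:
    """Emit one triple per frame element, using an anonymous node
    wherever the grammatical context did not supply a value."""
    anon_ids = itertools.count(1)
    triples: List[Triple] = []
    for element in FRAME_ELEMENTS.get(word, []):
        value = stated.get(element)
        if value is None:
            value = f"_:anon{next(anon_ids)}"
        triples.append((instance, element, value))
    return triples


# "a book" with a stated topic but no stated author.
book_triples = fill_gaps("book", "_:book1", {"ex:topic": "ex:Linguistics"})
```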

MetaLanguage Ontology (MLO)

Cypher uses explicit information about phrase structure, lexical rules, and semantic relations. This information is encapsulated in the MetaLanguage Ontology (MLO), which is an RDF ontology. An example of a lexical entry in MLO is:

    <mlo:Sense rdf:about="&mlo-terms;meet">
      <rdfs:label>meet</rdfs:label>
      <mlo:lemma>meet</mlo:lemma>
      <mlo:pos>V</mlo:pos>
      <mlo:allows rdf:resource="&mlo-terms;Clause/subject"/>
      <mlo:requires rdf:resource="&mlo-terms;Clause/directObject"/>
      <mlo:ref rdf:nodeID="meetFrame"/>
    </mlo:Sense>

    <mlo:MeetFrame rdf:nodeID="meetFrame">
      <mlo:agent rdf:resource="&mlo-terms;Clause/subject"/>
      <mlo:theme rdf:resource="&mlo-terms;Clause/directObject"/>
      <mlo:theme rdf:resource="&mlo-terms;Clause/withComplement"/>
      <dc:description>&mlo;agent has come upon &mlo;theme as by chance or arrangement</dc:description>
    </mlo:MeetFrame>
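One way a parser might use the subcategorization data in the entry above is sketched here. The reading of mlo:requires as mandatory and mlo:allows as optional is an assumption drawn from the property names, not something the source states, and the dictionary layout and `check_arguments` function are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical in-memory copy of the "meet" entry's argument slots.
MEET_ENTRY: Dict[str, object] = {
    "lemma": "meet",
    "pos": "V",
    "allows": {"Clause/subject"},         # assumed optional
    "requires": {"Clause/directObject"},  # assumed mandatory
}


def check_arguments(entry: Dict[str, object],
                    present: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (missing required slots, slots the entry does not license)."""
    found = set(present)
    required = set(entry["requires"])
    allowed = set(entry["allows"])
    missing = sorted(required - found)
    unexpected = sorted(found - required - allowed)
    return missing, unexpected


complete = check_arguments(MEET_ENTRY, ["Clause/subject", "Clause/directObject"])
incomplete = check_arguments(MEET_ENTRY, ["Clause/subject"])
```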

Output Types

Cypher produces RDF triples (in various serializations, including Turtle, N3, TriX, and N-Triples) from natural language clauses, and both SPARQL and SeRQL queries from natural language noun phrases and questions. In addition, it generates grammar parse trees that encode information such as phrase type, part of speech, morphological data, parse duration, and the lexical resource used. Cypher is also equipped with a plug-in framework for creating custom output such as Cyc microtheories.
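To give a feel for the simplest of these serializations, the sketch below writes triples as N-Triples lines. It is not Cypher's serializer: the `to_ntriples` function is hypothetical, it handles only triples whose three terms are all IRIs (no literals or blank nodes), and the example triple is assumed for illustration.

```python
from typing import Iterable, Tuple

Triple = Tuple[str, str, str]


def to_ntriples(triples: Iterable[Triple]) -> str:
    """Serialize IRI-only triples, one N-Triples statement per line."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


# Assumed example triple; N-Triples requires full IRIs, so the foaf:
# prefix is expanded to its namespace.
document = to_ntriples([
    ("http://john.mysite.com/MrDouglass",
     "http://xmlns.com/foaf/0.1/knows",
     "http://john.mysite.com/MartinCrump"),
])
```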

Similar Technologies

Semantra

Powerset

Trueknowledge

Hakia

Carabao Language Kit

See also

Symbolic Species by Terrence Deacon

Context-free grammar

Semantic Frames

Semantic Web

Linked Data

External links

  • Online Cypher Demo
  • Cypher User Guide
  • Harnessing Social Collaboration - Presentation on Cypher
  • Controlled Natural Languages for the Semantic Web - A case study on the need for better NL-based UI tools for the Semantic Web

Project Homepage