Warning:
This wiki has been archived and is now read-only.

Feature:FullText

From SPARQL Working Group


Feature: Full Text Search

We need full text search functionality; the XPath/XQuery Full Text extension can be reused to provide it.

Feature description

We propose support for a full text search capability. Since a full text match typically returns a match score, this would be in the form of a pattern like

?o contains <text condition> ?score .

Some implementations exist, e.g. Virtuoso and ARQ. However, they are not interoperable.

Full text search capabilities can be discussed on different levels. The simplest option is to standardize only the function that performs the match, possibly with a requirement that the search string can be truncated (e.g. with a trailing wildcard).

The next level is to retain the score, as in the above example.

For the convenience of standardization, OpenLink proposes reusing the XPath/XQuery Full Text extension. This will not interoperate with existing implementations. However, not reusing the XPath or possibly SQL/MM specifications would make the task unachievable within the scope of the WG, whereas reusing existing proposals makes the addition trivial. Supporting the XPath/XQuery text expression syntax is not very hard on implementors, since they must already parse some syntax for text conditions.

Finally, one could address the problems encountered in use cases where a number of literals should be indexed, which is a common requirement in "simple free text search" scenarios.

Example

?o contains <text condition> ?score .

or a simpler case:

FILTER (?o contains(<text condition>)) .


Existing Implementation(s)

  • Virtuoso supports the functionality but not with the proposed syntax.
  • ARQ has a LARQ extension that uses Lucene to maintain indexes, and it supports Lucene text condition syntax.
  • There is a Fulltext Querying SAIL for Sesame, based on Lucene.
  • Glitter

Existing Specification / Documentation

XQuery and XPath Full Text 1.0

Compatibility

No issues, because full text expressions can be recognized by the ftcontains keyword, which was not in use before.

Links to postponed Issues

Related Features

Feature:FunctionLibrary

Champions

Use cases

Nowadays, almost every site has a "search box", and this is an important and useful feature users have come to expect. When SPARQL is used against a backend triplestore, it is very important to be able to use the free text keywords in the SPARQL query, as one may need to combine the search term with conditions on other properties (e.g. search only resources that are approved for publication).

Computas has delivered a system that has a public search interface, and where the free text search capabilities are a major feature. The main motivation for backing this feature is that the migration costs between ARQ and Virtuoso were high due to the differing syntax.

Within that system, users can issue both simple free text queries and advanced queries for specific properties of the resources, some of which may hold literals. The latter use case can be satisfied by simply standardising the function name and the syntax for a truncated string (e.g. "foo*" will match "foobar").
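To make the truncation semantics concrete, here is a minimal Python sketch of how a truncated search string might be matched against a literal. The function name matches_truncated, the word-level tokenization, and the case-insensitive comparison are all assumptions made for illustration; the proposal itself does not define any of them.

```python
import re


def matches_truncated(literal: str, pattern: str) -> bool:
    """Return True if any word in `literal` matches `pattern`.

    A trailing "*" in `pattern` requests a prefix (truncated) match;
    without it, a whole word must match exactly.
    Case-insensitive matching is an assumption, not part of the proposal.
    """
    # Split the literal into lower-cased words (assumed tokenization).
    words = re.findall(r"\w+", literal.lower())
    needle = pattern.lower()
    if needle.endswith("*"):
        prefix = needle[:-1]
        return any(word.startswith(prefix) for word in words)
    return needle in words


print(matches_truncated("foobar", "foo*"))  # truncated match
print(matches_truncated("foobar", "foo"))   # whole-word match
```

Under these assumptions "foo*" matches the literal "foobar", while the untruncated "foo" does not.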

The former is a harder problem to solve in the general case, as it requires that one can specify the properties that should be matched. I described the problem in an email to the Virtuoso-users list.

Perhaps something like

?s ?p ?o .
FILTER ( ?p IN (dct:title, foaf:name, rdfs:label)) .
FILTER ( ?o contains "foo*" ) .

would do?
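The intent of the two FILTERs above can be sketched in Python over a small in-memory list of triples. The sample data, the helper names prefix_match and search, and the word-prefix interpretation of "foo*" are hypothetical and serve only to illustrate the idea of restricting a free text match to a chosen set of properties.

```python
# Hypothetical sample data: (subject, predicate, object literal) triples.
TRIPLES = [
    ("ex:doc1", "dct:title", "Foobar handbook"),
    ("ex:doc1", "dct:description", "All about foolish things"),
    ("ex:alice", "foaf:name", "Alice Foo"),
    ("ex:doc2", "rdfs:label", "Bar chart"),
]

# Corresponds to: FILTER ( ?p IN (dct:title, foaf:name, rdfs:label) )
SEARCHED_PROPERTIES = {"dct:title", "foaf:name", "rdfs:label"}


def prefix_match(literal, prefix):
    """Assumed meaning of "foo*": some word in the literal starts with "foo"."""
    prefix = prefix.lower()
    return any(word.startswith(prefix) for word in literal.lower().split())


def search(triples, properties, prefix):
    """Keep triples whose predicate is searched and whose literal matches."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if p in properties and prefix_match(o, prefix)
    ]


for triple in search(TRIPLES, SEARCHED_PROPERTIES, "foo"):
    print(triple)
```

In this sketch the dct:description literal is skipped even though it contains a matching word, because its property is not in the searched set.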

References