SPARQL End Point Self Description
This page summarizes and builds on conversations on the subject at ISWC 2006.
Self-description of end points may serve at least the following purposes:
- Query composition - A client must know the capabilities of a server in order to compose suitable queries. ODBC and JDBC have fairly extensive metadata about each DBMS's SQL dialect and other properties. These may in part serve as a model.
- Content Discovery - What is the data about? What graphs does the end point contain?
- Query planning - When making an execution plan for federated queries, the planner practically needs to know the cardinalities of predicates and similar statistics in order to evaluate join orders and the like.
- Query feasibility - Does it make sense to send a particular query to this end point? The answer may include whether the query could be parsed at all, whether its result is known to be always empty, the estimated computation time, the estimated count of results and, optionally, a platform-dependent query plan.
We will look at each one in turn.
End Point Data and Capabilities
- Server software name and version
- Must the predicate be constant? Must an rdf:type be given for a subject? Must a graph be given? Can the graph be a variable known at execution time only?
- List of supported built-in SPARQL functions.
- Language extensions - For example, whether there is a full text match predicate.
- Name and general description of the purpose of the end point.
- What organization/individual is maintaining the end point?
- Contacts for technical support and for legal or administrative matters, for example support and webmaster addresses.
- Ontologies used. This could be a list of graphs, each with a list of ontologies describing the data therein. Each graph would be listed with a rough estimate of size expressed in triples.
- Topic - Each graph/ontology pair could have a number of identifiers drawn from standard taxonomies. Examples would be the Latin names of genera and species for biology, HS codes for customs, ISO codes for countries, and various industry-specific classifications of goods and services.
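The items above could be gathered into a single description record that a client fetches before composing a query. The following is a minimal sketch only: the class names, field names and the `missing_features` helper are all hypothetical and are not part of any SPARQL specification or existing server.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class GraphInfo:
    """One graph in the end point (hypothetical structure)."""
    iri: str
    ontologies: List[str] = field(default_factory=list)
    topics: List[str] = field(default_factory=list)  # ids from standard taxonomies
    approx_triples: Optional[int] = None              # rough size; None if unknown


@dataclass
class EndPointDescription:
    """Hypothetical self-description record; all field names are illustrative."""
    software: str
    version: str
    name: str
    purpose: str
    maintainer: str
    contact: str
    predicate_must_be_constant: bool = False
    type_required_for_subject: bool = False
    graph_required: bool = False
    graph_may_be_variable: bool = True
    builtin_functions: Set[str] = field(default_factory=set)
    extensions: Set[str] = field(default_factory=set)
    graphs: List[GraphInfo] = field(default_factory=list)


def missing_features(desc, functions_used, extensions_used=()):
    """Return what a query needs but the end point does not advertise."""
    supported = {f.upper() for f in desc.builtin_functions}
    missing = {f.upper() for f in functions_used} - supported
    missing |= set(extensions_used) - desc.extensions
    return sorted(missing)


# Made-up example data.
desc = EndPointDescription(
    software="ExampleStore", version="1.0",
    name="Example end point", purpose="Demonstration data",
    maintainer="Example Org", contact="support@example.org",
    builtin_functions={"REGEX", "BOUND", "LANG"},
    extensions={"full-text-match"},
    graphs=[GraphInfo("http://example.org/g1",
                      ["http://example.org/onto#"], ["ISO 3166:FI"], 1_000_000)],
)
print(missing_features(desc, ["regex", "isIRI"]))
```

A client composing a query would consult such a record first and either rewrite the query or pick another end point when `missing_features` returns a non-empty list.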
The end point should give a ballpark cardinality for the following combinations of G, S, P, O (graph, subject, predicate, object) given as constants.
- G, P
- G, P, O
- G, S
- G, S, P
Based on our experience, these are the most interesting questions, but for completeness the end point might offer an API that accepts either a constant or a wildcard for each of the four parts of a quad. If the information is not readily available, "unknown" could be returned, together with the count of triples in the whole end point, or in the graph if a graph is specified. Even if the end point does not support real-time sampling of data for cardinality estimates, it would at least have an idea of the count of triples per graph, which is still far better than nothing.
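As an illustration of the API described above, here is a small sketch in which `None` stands for a wildcard and an unknown cardinality is reported together with a fallback triple count. The names `CardinalityStats`, `Estimate` and `order_patterns` are assumptions made for this example, and the statistics are a toy in-memory table rather than anything a real end point would use.

```python
from typing import Dict, NamedTuple, Optional, Tuple

# (G, S, P, O); None means wildcard.
Quad = Tuple[Optional[str], Optional[str], Optional[str], Optional[str]]


class Estimate(NamedTuple):
    cardinality: Optional[int]  # None means "unknown"
    fallback_total: int         # triples in the graph if given, else in the end point


class CardinalityStats:
    """Toy statistics table; a real end point would sample its indexes."""

    def __init__(self, graph_sizes: Dict[str, int], known: Dict[Quad, int]):
        self.graph_sizes = graph_sizes
        self.known = known

    def estimate(self, g=None, s=None, p=None, o=None) -> Estimate:
        if g is not None:
            total = self.graph_sizes.get(g, 0)
        else:
            total = sum(self.graph_sizes.values())
        return Estimate(self.known.get((g, s, p, o)), total)


def order_patterns(stats, patterns):
    """Sort quad patterns most selective first; unknown counts as the fallback total."""
    def key(pattern):
        est = stats.estimate(*pattern)
        return est.cardinality if est.cardinality is not None else est.fallback_total
    return sorted(patterns, key=key)


# Made-up numbers.
stats = CardinalityStats(
    graph_sizes={"g1": 1000, "g2": 500},
    known={
        ("g1", None, "p:name", None): 200,          # G, P
        ("g1", None, "p:type", "c:Person"): 50,     # G, P, O
    },
)
patterns = [
    ("g1", None, "p:name", None),
    ("g1", None, "p:type", "c:Person"),
    ("g1", "s:1", None, None),                      # G, S: not in the table
]
ordered = order_patterns(stats, patterns)
print(ordered[0])
```

Treating an unknown estimate as the full graph size is a deliberately pessimistic choice: a federated planner using this sketch would evaluate the patterns with known, small cardinalities first.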
Given the full SPARQL request, the end point could return the following data, without executing the query itself.
- Did the query parse successfully, or were there syntax errors?
- Are there graph, predicate or subject constants which do not exist in this end point? Does this cause the query result to always be empty? What are these?
- How many results are expected, according to the SPARQL compiler cost model? This is a row count. If the query is a construct or describe query, it is the count of rows that will go as input to the construct/describe.
- What is the execution time, as estimated by the SPARQL compiler cost model?
- Execution plan, in whatever implementation-specific format is available, in principle human readable.
All these elements would be optional.
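One way to picture such an answer is as a record in which every field may be absent, plus a client-side rule for deciding whether to send the query. This is a sketch under our own assumptions: `PreflightReport`, its field names and the `worth_sending` threshold are hypothetical and do not describe any existing end point.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PreflightReport:
    """Hypothetical pre-execution answer; every element is optional."""
    parsed: Optional[bool] = None
    syntax_error: Optional[str] = None
    unknown_constants: List[str] = field(default_factory=list)
    always_empty: Optional[bool] = None
    estimated_rows: Optional[int] = None
    estimated_seconds: Optional[float] = None
    plan: Optional[str] = None  # implementation-specific, human readable


def worth_sending(report, max_seconds=30.0):
    """Client-side decision; fields the end point left out never veto the query."""
    if report.parsed is False:
        return False
    if report.always_empty:
        return False
    if report.estimated_seconds is not None and report.estimated_seconds > max_seconds:
        return False
    return True


# Made-up reports.
empty = PreflightReport(parsed=True, always_empty=True,
                        unknown_constants=["http://example.org/no-such-graph"])
cheap = PreflightReport(parsed=True, estimated_rows=120, estimated_seconds=0.4)
print(worth_sending(empty), worth_sending(cheap))
```

Because all elements are optional, the sketch treats a missing value as "no information" rather than as a reason to reject the query.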
This somewhat overlaps with the optimization questions, but it may still be more efficient to support a separate interface for the optimization-related questions.