RDF Stream Model

From RDF Stream Processing Community Group

Definitions

We introduce a proposal for a model for representing RDF streams. With this we want to extend RDF to model data streams: potentially infinite sequences of time-varying data elements encoded in RDF. The temporal aspect of the data needs to be taken into account in the data representation, and we extend the definition of an RDF graph for this purpose.

  • An RDF triple consists of a subject, predicate and object: (s, p, o) ∈ (I ∪ B) × I × (I ∪ B ∪ L), where I, B and L denote the sets of IRIs, blank nodes and literals, respectively.
  • An RDF graph is a set of RDF triples.
  • An RDF stream S is a sequence of time-annotated graphs <g, [t]>, where g is an RDF graph and t is a timestamp.
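The definitions above can be sketched in plain Python, with no RDF library: a triple is a 3-tuple, a graph is a set of triples, and a stream is an ordered sequence of time-annotated graphs. All names are illustrative, and the integer-tick timestamps anticipate one possible answer to the question below; nothing here is a fixed API.

```python
from typing import FrozenSet, Iterable, Tuple

Triple = Tuple[str, str, str]          # (subject, predicate, object)
Graph = FrozenSet[Triple]              # an RDF graph is a set of triples
TimestampedGraph = Tuple[Graph, int]   # <g, [t]>, with t an integer tick here

def is_valid_stream(stream: Iterable[TimestampedGraph]) -> bool:
    """Check that timestamps are non-decreasing (one reading of the
    ordering question discussed in this section)."""
    last = None
    for _, t in stream:
        if last is not None and t < last:
            return False
        last = t
    return True

stream = [
    (frozenset({(":sensor1", ":hasValue", '"21.5"')}), 1),
    (frozenset({(":sensor1", ":hasValue", '"21.7"')}), 2),
    (frozenset({(":sensor2", ":hasValue", '"3.1"')}), 2),  # same tick, different graph
]
```

Note that the model permits two graphs with the same timestamp; only the overall order must be non-decreasing under this reading.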

What is the nature of the timestamp? In the abstract model, is it just a non-decreasing integer (e.g. a tick), or a full date timestamp (which could itself also be seen as a number)?

Mikko: Doesn't the restriction to non-decreasing timestamps (integer or xsd:dateTime) rule out all distributed systems?

Alasdair: Not really, as we could put in the assumption that all clocks are synchronised. This solves it at the model level; of course, practical implementations would need to decide how to cope with that.

Or can we extend this to be possibly a timestamp interval? See discussion: [1]

Mikko: How about the support of multiple timestamps per event or multiple intervals per state? For an illustration see e.g. slide 10 in: [2]

Josi: Mikko, I think we do need support for multiple timestamps, as illustrated in your slide. An event can have a timestamp given by the source that generated it, and also a timestamp for when it reached the stream processor, for instance. However, I don't think we need to have this definition in the data model, but rather in the processing model. The timestamp generated by the processor has no validity outside that scope, IMHO.

Possible representations of the model

  • o :timeStamp "foo"^^xsd:dateTimeStamp . GRAPH o { t1, …, tn }


Josi: I think we can find arguments for both timestamps and time intervals; the question is whether we can find a unique mapping between the two. For instance, if we say a graph o is associated with a time interval [t1, t2], does it mean that the following holds:

o :timeStamp ti exists, for all ti with t1 <= ti <= t2
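Under a discrete (integer-tick) clock, the reading proposed above does yield a unique mapping: the interval expands to the finite set of instants it covers. With dense time the set of instants is infinite, so no such expansion exists. A minimal sketch of the discrete case (illustrative helper, not part of the model):

```python
def interval_to_instants(t1: int, t2: int) -> list:
    """Expand the interval [t1, t2] into the individual integer
    timestamps ti with t1 <= ti <= t2."""
    if t2 < t1:
        raise ValueError("empty interval")
    return list(range(t1, t2 + 1))
```

For example, interval_to_instants(3, 6) produces the instants 3, 4, 5 and 6, i.e. every ti at which "o :timeStamp ti" would hold.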

General considerations

Streams of streams

In the definition above we assume we have all knowledge for a given timestamp. In a more general case (e.g. a scenario with no guarantee of having all the triples for a given time "t"), we can have "streams of timestamped streams of triples", as exemplified below [3]:

| ordered RDF triples ---> t1[2013-11-22T12:59], t2[2013-11-22T12:59], t3[2013-11-22T12:59], t4[2013-11-22T12:59]

| ordered RDF triples ---> t1[2013-11-22T13:00], t2[2013-11-22T13:00], t3[2013-11-22T13:00] ...

| ...

|

V
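One way to relate the two views is to collapse an ordered stream of timestamped triples into timestamped graphs, under the (debated) assumption that all triples for a given timestamp arrive contiguously. A minimal Python sketch, with illustrative data:

```python
from itertools import groupby

def triples_to_graphs(timestamped_triples):
    """Group adjacent (triple, timestamp) pairs that share a timestamp
    into <graph, timestamp> stream elements."""
    for t, group in groupby(timestamped_triples, key=lambda pair: pair[1]):
        yield frozenset(triple for triple, _ in group), t

triples = [
    ((":s1", ":p", ":o1"), "2013-11-22T12:59"),
    ((":s2", ":p", ":o2"), "2013-11-22T12:59"),
    ((":s1", ":p", ":o3"), "2013-11-22T13:00"),
]
# collapses into two <graph, timestamp> pairs, one per distinct timestamp
```

This sketch bakes in exactly the completeness assumption questioned in the comments below; grouping by event/observation instead would require an explicit grouping key rather than the timestamp.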

[Alasdair:] What in the stream of graphs approach assumes that we have all the knowledge for a given timestamp?

[Alasdair:] My understanding of the stream of graphs approach is that the triples would be grouped by say observation into a graph; meaning that there can be more than one graph with the same timestamp, one per observation. Thus there would not be one graph per timestamp with all triples from that timestamp, but one graph per event/observation which has a timestamp associated with it.

[Alasdair:] This is a fundamental difference that we need to come to some agreement upon.

Graphs and triples

Streaming 'objects' or 'entities' are usually captured not by one triple but by a set of triples. This set of triples therefore needs to be grouped together and annotated with the same timestamp. Two graphs may also have the same timestamp, which would make them contemporaneous.

Alasdair: be aware that triples with the same timestamp are not necessarily about the same event.

Emanuele: [...] we should consider allowing the consumption of time-(interval-)annotated graphs rather than time-annotated triples. Classic RDF triples are defined to be facts. Facts are true at the current point in time. Yet, when processing streams, there is at least an ordering component, if not a time component, added to such facts. Regarding TEF-SPARQL [Kietz et al. 2012], we propose to distinguish between (possibly) complex events and facts. Simple events happen at a single time instant, complex events start and end happening, whereas facts always have a time interval during which they are valid. Facts are usually triggered by events or represent background data which we consider valid. We hence need to express complex events as graphs and be able to assign a time interval both to these and to facts.

Mikko: I still haven't seen a definition of complex events, which would support statements like "complex events start and end happening". I also doubt that the text above would originate from Emanuele. :-)

Roland: I support the idea of using quadruples (graphs) instead of triples. More structure is possible in events as opposed to just plain triples. Just one example: flexibility in timestamping (one vs. two times, or application time vs. system time) is only possible if timestamps can be attached to the event structure. Flat triples cannot do that. Every triple (a,b,c) is associated with a named graph g as a quadruple like this: (g,a,b,c). Timestamps and other "header" data (like event source, etc.) can be attached to the graph: (g,g,startTime,c1), (g,g,source,c2). Payload data (as opposed to header data) can use arbitrary triples (a,b,c) belonging to the same graph g: (g,a,b,c). We can exploit the distinction between header and payload data when discussing immutability of payload data, see below.
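The quadruple layout sketched in the comment above can be made concrete as follows. This is only an illustration of the convention, not a fixed vocabulary: the property names (:startTime, :source) follow the examples in the text, and the helper is hypothetical.

```python
def make_event(g, start_time, source, payload_triples):
    """Build an event as a list of quads (g, a, b, c): header quads use
    the graph name g itself as subject; payload triples are simply
    placed into graph g."""
    header = [
        (g, g, ":startTime", start_time),  # application/event time
        (g, g, ":source", source),         # event source, etc.
    ]
    payload = [(g, a, b, c) for (a, b, c) in payload_triples]
    return header + payload

event = make_event(":g1", '"2013-11-22T13:00"', ":sensor1",
                   [(":obs1", ":hasValue", '"21.5"')])
```

The header/payload split shows up directly in the quads: header quads are exactly those whose subject equals the graph name, which is what makes the immutability distinction below easy to enforce.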

Immutability and Event Derivation

In many event processing systems [...] events are immutable (Luckham & Schulte 2011). This stems from the definition of what an event is: An event is an occurrence within a particular system or domain; it is something that has happened, or is contemplated as having happened [...] (Etzion & Niblett 2010). So events cannot be made to unhappen.

Open Question: Does this apply to all systems/applications/usecases or just to many as stated above?

Roland: I made immutability a general assumption in my work. It is very useful for building systems (distributed systems, consistency, ...).

Q: How can a Stream processing agent process events if they are immutable?

A: Every processing task produces new derived events as results. Advantage: the underived events are still available for other uses and remain immutable.

For RSP this means: (1) create a new (unique) graph for the derived event (2) possibly link back to the base event(s) thus enabling drill-down or root cause / provenance analysis of the derived event. The links can be made with DUL:hasConstituent from DOLCE Ultralight (Gangemi 2009). In my own work (Harth & Stühmer 2011) I use a new :members property to link from a derived event to its simple events. The property is a subproperty of the mentioned DUL:hasConstituent.
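A minimal sketch of steps (1) and (2): derivation never touches the base event graphs; it mints a fresh graph name and links back to the base events. The ":members" property follows Harth & Stühmer (2011) as cited above; the counter-based naming and the helper itself are illustrative assumptions.

```python
import itertools

_counter = itertools.count(1)

def derive_event(base_graph_names, derived_triples):
    """Create the quads of a new derived event: (1) a fresh, unique
    graph name; (2) :members links back to the base events for
    drill-down / provenance analysis."""
    g = f":derived{next(_counter)}"
    links = [(g, g, ":members", base) for base in base_graph_names]
    payload = [(g, a, b, c) for (a, b, c) in derived_triples]
    return g, links + payload

g, quads = derive_event([":g1", ":g2"],
                        [(":alarm1", ":severity", '"high"')])
```

Because the base graphs :g1 and :g2 are only referenced, never rewritten, they remain immutable and available for other consumers.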

Observation: We talk about adding "received time" and other metadata later by receiving agents. Adding triples later to the event graph, with the graph name as subject, can still be legal and considered as amending the event header. Much like with email: headers can be added by intermediate mail servers, but the mail body and ID are immutable.

References:

  • Luckham, D. C. & Schulte, R. Event Processing Glossary - Version 2.0 (2011) [4]
  • Etzion, O. & Niblett, P. Event Processing in Action Manning Publications Co. (2010)
  • Gangemi, A. DOLCE+DnS Ultralite (DUL) (2009) [5]
  • Harth, A. & Stühmer, R. Publishing Event Streams as Linked Data Karlsruhe Institute of Technology, FZI Forschungszentrum Informatik (2011) [6]

Punctuation

A punctuation is a pattern p inserted into the data stream with the meaning that no data item i matching p will occur further on in the stream. (Tucker et al. 2003), (Maier et al. 2005).

For streams of RDF graphs this can be used as follows: a punctuation is a pattern p inserted into the quadruple stream with the meaning that no quadruple i belonging to graph p will occur further on in the stream.
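A sketch of how a consumer could exploit such punctuations: quads are buffered per graph, and a punctuation for graph p releases p as a complete, immutable event. The in-band PUNCT marker is an illustrative convention (Roland's comment below argues for an out-of-band alternative).

```python
PUNCT = "punct"

def complete_graphs(quad_stream):
    """Consume (graph, s, p, o) quads; on a punctuation ('punct', g),
    emit graph g as a complete set of triples, since no further quads
    for g can occur in the stream."""
    open_graphs = {}
    for item in quad_stream:
        if item[0] == PUNCT:
            _, g = item
            yield g, frozenset(open_graphs.pop(g, ()))
        else:
            g, s, p, o = item
            open_graphs.setdefault(g, []).append((s, p, o))

stream = [
    (":g1", ":s", ":p", ":o1"),
    (":g2", ":s", ":p", ":o2"),
    (":g1", ":s", ":p", ":o3"),
    (PUNCT, ":g1"),  # :g1 is now complete; :g2 remains open
]
```

Only punctuated graphs are emitted; :g2 stays buffered until (and unless) its punctuation arrives.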

Roland: Punctuation could use special quadruples, but I think that because we use the Web stack(!) we can do punctuation out-of-band, i.e. by doing punctuation on a lower layer of the stack. For example, we can communicate through chunked transfer encoding (Fielding et al. 1999, Section 3.6.1) from HTTP 1.1. Each chunk contains a complete graph, and the receiver will know that once a chunk is received the event is completely received and can be processed further in an atomic fashion. There is a guarantee that no quads for this graph will arrive later. Using HTTP chunked connections, no special (or magic) quads are needed.

References:

  • Tucker, P.; Maier, D.; Sheard, T. & Fegaras, L. Exploiting Punctuation Semantics in Continuous Data Streams. IEEE Transactions on Knowledge and Data Engineering, 2003, 15, 555-568 [7]
  • Maier, D.; Li, J.; Tucker, P.; Tufte, K. & Papadimos, V. Semantics of Data Streams and Operators. Proceedings of the 10th International Conference on Database Theory, Springer-Verlag, 2005, 37-52 [8]
  • Fielding, R.; Gettys, J.; Mogul, J.; Frystyk, H.; Masinter, L.; Leach, P. & Berners-Lee, T. Hypertext Transfer Protocol -- HTTP/1.1 RFC Editor, 1999 [9]