Second of the BrainStorming sessions, after the first BrainStorming/2007-12-13 session.

SIOC Brainstorming Session on 29th January 2008

Attendees

Alexandre Passant
Stéphane Corlosquet
Uldis Bojars
John Breslin
Richard Cyganiak
Thomas Schandl
Hak Lae Kim
Ilko Grigorov
Jung Jin-uk
Tuukka Hastrup
Vassilios Peristeras
Knud Möller
Sungkwon Yang
Sheila Kinsella

SIOC-based software ideas

"Metablogger"

Vassilios would like to see a site like Technorati, but powered by SIOC:

like a SIOC browser exclusively for blog data (a SIOC browser for e.g. collaborative work environments needs to be different)
a portal that would look like a nice web 2.0 site (as opposed to something that looks like a database)
target audience: bloggers - they want an interface that looks like what they are used to working with.
it keeps your profile, shows all the SIOC data from blogs that is out there in a nice form, allows queries etc.
the name could be "Metablogger" and probably shouldn't be anything with SIOC, because users don't care about ontologies, they care about blogging.

This is a long way off - but it would be good to identify from the very beginning where we want to go, so that we can split the work into smaller pieces.

Bookmarks manager in RDF (alex)

http://www.ejeliot.com/projects/php-delicious/
it should be easy to convert bookmarks from delicious => SIOC and get a local bookmark manager (ARC-based?)
Maybe plugged with MOAT

Desktop Application

Interesting use case: Automatically look up the people in my address book in my SIOC browser - filter articles by those people that I know.

Widget

Knud is working on a proposal for a new Enterprise Ireland commercialisation project. A widget that uses DERI technology should be implemented as part of this project. SIOC might be a candidate for that.

Knud is now looking for ideas about what this widget could do - maybe a SIOC browser? Somehow generate SIOC data?

SIOC reader (universal) widget

Alex works on a SIOC reader widget that will query any SPARQL endpoint of your choice for fresh SIOC data.

It would be nice if Sindice provided a way to filter that list of posts so that you see only the posts of people you know (i.e. restrict the list to posts from people who are in your FOAF file), or to click on the topic of a post and show other posts on this topic made by your friends.

Richard: With Sindice at the moment you cannot filter just for SIOC data. You can get the documents that mention a certain URI, but you would have to do some filtering yourself to see whether they contain SIOC.
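
A minimal sketch of the client-side filtering described above, under stated assumptions: the widget has already fetched the documents and turned them into plain Python records. The record layout, the example.org URIs and the function names are hypothetical, and the "is this SIOC?" check is only a crude test for the SIOC namespace URI. For simplicity each post's creator is assumed to be already resolved from the sioc:User to the person's FOAF URI.

```python
SIOC_NS = "http://rdfs.org/sioc/ns#"


def looks_like_sioc(rdf_text):
    """Crude filter: does the document mention the SIOC namespace at all?"""
    return SIOC_NS in rdf_text


def known_people(foaf_knows, me):
    """Return the set of person URIs that `me` is linked to via foaf:knows."""
    return {obj for subj, obj in foaf_knows if subj == me}


def posts_by_friends(posts, friends, topic=None):
    """Keep posts created by a friend, optionally restricted to one topic."""
    return [
        p for p in posts
        if p["creator"] in friends and (topic is None or topic in p["topics"])
    ]


# Hypothetical sample data standing in for a parsed FOAF file and SIOC posts.
ME = "http://example.org/me#i"
FOAF_KNOWS = [
    (ME, "http://example.org/alice#me"),
    (ME, "http://example.org/bob#me"),
    ("http://example.org/alice#me", "http://example.org/carol#me"),
]
POSTS = [
    {"uri": "http://example.org/blog/1",
     "creator": "http://example.org/alice#me", "topics": {"sioc"}},
    {"uri": "http://example.org/blog/2",
     "creator": "http://example.org/carol#me", "topics": {"sioc"}},
    {"uri": "http://example.org/blog/3",
     "creator": "http://example.org/bob#me", "topics": {"foaf"}},
]

friends = known_people(FOAF_KNOWS, ME)
friend_posts = posts_by_friends(POSTS, friends)
friend_sioc_posts = posts_by_friends(POSTS, friends, topic="sioc")
```

Only direct foaf:knows links are followed here, so Carol (a friend of a friend) is left out; whether the widget should go further is an open design choice.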

Straw Feed Reader

Tuukka is working on the RSS aggregator Straw and wants to make it SIOC-enabled, so that you could subscribe to certain blogs, forums, etc.

But it still has to be defined what we want to do and what is possible. We need a way to express that you want to subscribe to something. A SIOC feed is not like an RSS feed (where you have one URI and you load it every now and then to see what's going on) - SIOC data will probably not all be in one file.

We need some kind of vocabulary to describe where to go for archives, where to go for new items, etc.

iTunes-like SIOC browser

Knud would like to see a SIOC browser that worked like iTunes. Similar to "intelligent playlists" it could have something like a SPARQL query over SIOC data.

You have an address book, and the browser could look on the SIOC-o-sphere to see if there is anything new from any of those people, e.g. a new post or a new blog, and you would be asked: "do you want to import that / subscribe to that?"

Alex: maybe have something like an IMAP server for RDF data

Tuukka: Did anybody ever do that - RDF over IMAP? (Dan Brickley wrote an article about SPARQL over XMPP)

Feeds for SIOC data

We may want to have a feed of SIOC items, or some list of recently updated SIOC items, which would help crawling and help to synchronize two stores: if you have e.g. an online community site which publishes SIOC data, then you want a list of the most recent updates so that you know which SIOC items you have to retrieve.

We do have such feed-like things for posts, for other SIOC items there are two options:

we could build such feeds/pages for comments and other things into SIOC exporters/API
or piggyback on existing RSS and Atom feeds (even if the feed is not RDF, there should be nothing that prevents us from attaching a link to that RSS or Atom entry and saying "there is more RDF / SIOC data somewhere")

Discussion

Knud: maybe there should be a convention saying that if there is a newsfeed at e.g. url.com/feed.rss then its SIOC data will probably be at e.g. url.com/sioc.rdf? Then for any newsfeed URI you get you could always check whether there is some SIOC data. Uldis: that could be a minimum solution.

Richard: But you mainly want to identify whether there is a new SIOC Post. It is not so hard to find the SIOC feed; the hard part is that in SIOC you have the archive going back to e.g. 1998, but you are only interested in finding new posts / posts you haven't seen yet.

Knud: this info you would get from the newsfeed.

Richard: ok, so you will find out that there is a new RSS item, but how do you discover the corresponding SIOC item in the SIOC feed?

Uldis: Knud has a point - if you get the RSS feed with the latest updates, then you could go and resolve the HTML content at that URI and see if it has a SIOC autodiscovery link. (But RSS has the limitation that you will get blackouts if you don't check it often enough.)
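
The two discovery strategies from this discussion could be sketched roughly as follows. This is an illustration, not an agreed convention: `guess_sioc_url` implements Knud's proposed (and so far hypothetical) naming rule, and `find_sioc_links` looks for an autodiscovery link of the form `<link rel="meta" type="application/rdf+xml" title="SIOC" href="...">` in the HTML of a feed item. The function names, the example.com URLs and the sample page are assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


def guess_sioc_url(feed_url, sioc_name="sioc.rdf"):
    """Knud's proposed convention: replace the feed's file name with sioc.rdf."""
    return urljoin(feed_url, sioc_name)


class SiocLinkFinder(HTMLParser):
    """Collects href values of <link> elements that advertise SIOC RDF data."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        is_rdf = attr.get("type", "").lower() == "application/rdf+xml"
        is_sioc = attr.get("title", "").upper() == "SIOC"
        if is_rdf and is_sioc and attr.get("href"):
            # Resolve relative hrefs against the page URL.
            self.links.append(urljoin(self.base_url, attr["href"]))


def find_sioc_links(html, base_url):
    """Return absolute URLs of SIOC autodiscovery links found in `html`."""
    finder = SiocLinkFinder(base_url)
    finder.feed(html)
    return finder.links


# Hypothetical HTML page of one RSS item.
PAGE_URL = "http://example.com/blog/2008/01/post-7"
PAGE_HTML = """<html><head>
<link rel="stylesheet" type="text/css" href="/style.css" />
<link rel="meta" type="application/rdf+xml" title="SIOC" href="/sioc.php?id=7" />
</head><body>...</body></html>"""

guessed = guess_sioc_url("http://example.com/blog/feed.rss")
discovered = find_sioc_links(PAGE_HTML, PAGE_URL)
```

The guessed URL still has to be fetched and checked, since nothing guarantees that a site follows the naming convention; the autodiscovery route costs one extra HTTP request per feed item but does not rely on guessing.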

Current "implied best practices" for exporting SIOC data

Requirements in this particular case: we want to be able to retrieve SIOC data of what has changed since when we last looked at a site. If we are storing the data, that would mean that we want to do incremental updates of our data store. We need to write suggestions for how people can create such feeds of SIOC data, in order to make this incremental crawling possible.

We do have a feed of SIOC posts (and we could implement the same for comments): we export all sioc:Posts in a linked list which is paged. Page number 1 contains e.g. the first 15 posts and an rdfs:seeAlso link to page 2 with more posts, and another link to page 3, and so on (see drawing on whiteboard 1). These pages are generated on demand when someone requests e.g. http://example.com/sioc.php?type=forum&id=1&page=1

So each page carries SIOC data for only a small number of posts (so as not to overload the site), but by following the rdfs:seeAlso links you can retrieve everything. However, this practice of pagination with seeAlso links is not documented, so the list is not guaranteed to be ordered by date, nor is it certain that the newest posts are on the first page.
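
An incremental crawl over such a paged list might look like the sketch below. It rests on two assumptions that the current practice does not guarantee: that each page points to exactly one following page, and that pages are ordered newest first, so crawling can stop at the first page that contains nothing new. `fetch_page` is a hypothetical callable standing in for "download the page, parse the RDF, return the post URIs and the next-page link"; the in-memory `PAGES` dictionary fakes that for the example.

```python
def fetch_new_posts(first_page_url, fetch_page, seen, max_pages=50):
    """Follow next-page links and collect post URIs that are not in `seen`.

    `fetch_page(url)` must return (post_uris, next_url); next_url is None
    on the last page.
    """
    new_posts = []
    visited = set()
    url = first_page_url
    while url and url not in visited and len(visited) < max_pages:
        visited.add(url)
        post_uris, next_url = fetch_page(url)
        fresh = [p for p in post_uris if p not in seen]
        new_posts.extend(fresh)
        # ASSUMPTION: newest posts come first, so a page without any new
        # post means all older pages are already known.
        if not fresh:
            break
        url = next_url
    return new_posts


# Hypothetical paged export: 3 posts per page, newest first.
BASE = "http://example.com/sioc.php?type=forum&id=1&page="
POST = "http://example.com/post/"
PAGES = {
    BASE + "1": ([POST + "45", POST + "44", POST + "43"], BASE + "2"),
    BASE + "2": ([POST + "42", POST + "41", POST + "40"], BASE + "3"),
    BASE + "3": ([POST + "39", POST + "38", POST + "37"], BASE + "4"),
    BASE + "4": ([POST + "36"], None),
}
fetched = []


def fake_fetch(url):
    """Stand-in for an HTTP request plus RDF parsing."""
    fetched.append(url)
    return PAGES[url]


already_stored = {POST + str(n) for n in range(36, 42)}  # posts 36..41
new_posts = fetch_new_posts(BASE + "1", fake_fetch, already_stored)
```

In this run the crawler stops after page 3 and never requests page 4. If the export were not ordered by date, the early stop would miss posts, which is one reason the ordering needs to be documented.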

Documentation needed

Scalability

How will this work and scale for e.g. 1 million blogs?

Each blog would have the link structures described above, so a million blogs would mean a million of these link structures. Each of them scales well, as it is exactly the same technology that is used to publish blogs - just the output is RDF instead of HTML.
If you want to consume massive amounts of this data from multiple sources, the approach is much the same as what web crawlers do. This is of course fairly expensive, but manageable: 3 Sindice servers can fetch a couple of hundred thousand pages per day.
Then there is the question of storing, querying and building your actual application, which is an issue with large amounts of data. If you want to run SPARQL queries on it then you need a really scalable triple store. But do we really need all of the SIOC data, or can we limit it to certain kinds of SIOC posts from a certain domain for a specific application, and ignore (and not even store) everything else? Then it should also be manageable.

Richard thinks there will be some kind of intermediaries like Technorati which will facilitate this process. There could also be two steps to ease the scalability problem:

Identifying sources of interest. Go to a search engine and ask e.g. "Who is talking about the semantic web?" You get a bunch of sources and put them into your local system.
The local system will just watch these sources.

Search on the web of data

How to get good graphs?

Everywhere on the semantic web we have the problem: how do we find the triples that we want.

We have the image of a connected graph, but in reality it is divided into documents. Are there people working to solve this problem or are any people in DERI who would be interested in solving this problem?
Need for seeAlso's to the right places (but rdfs:seeAlso isn't very semantic and won't help many tools to do what they want to do)
Need for different kind of aggregation system.

A number of projects are working on that - but often an ideal world is assumed, where people use the same URIs when they are talking about the same thing.

Main problem: How do we get good connected graphs? At the moment two approaches are used in practice:

Linking with seeAlso's or more specific properties like foaf:knows and following these links between documents. Problems:
1. All the necessary links have to be there
2. There can be too many links: which of these 100 seeAlso's am I supposed to follow?
Sindice's approach: global search - try to find as many documents as possible and then do keyword search or search for URIs mentioned in these documents. Problems:
1. We don't have everything and are not always up to date.
2. Just searching for keywords or mentioned URIs might be too unspecific for many applications.

→ Something in between these approaches is missing, where you can search for specific documents even if they are not linked directly.

For Sindice's approach it would be interesting if you could limit a search to certain types of objects.

Discovering SIOC data

It is currently not so easy to discover SIOC data that is out there: there are not many links from FOAF profiles into SIOC data.

One possibility to do it (apart from using PingtheSemanticWeb.com): the homepage specified in a foaf file could be checked to see if it has SIOC autodiscovery links.

There is a need for documented best practices / FAQ / blog post explaining how to point from e.g. FOAF to SIOC → establish a working group to write a document describing how people can point to their SIOC data

Basically it is done by connecting a foaf:Person with foaf:holdsAccount to a sioc:User and also adding a triple with rdfs:seeAlso to point to the place with the SIOC data.
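
As an illustration, the linking pattern just described can be written out as N-Triples. This is a sketch rather than an agreed best practice: the example.org URIs and the helper name are hypothetical, and attaching rdfs:seeAlso to the sioc:User (rather than to the foaf:Person) is an assumption, since the notes leave that open.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
SIOC = "http://rdfs.org/sioc/ns#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"


def link_person_to_sioc(person_uri, user_uri, sioc_doc_url):
    """Return N-Triples linking a foaf:Person to a sioc:User and its data."""
    triples = [
        (person_uri, RDF + "type", FOAF + "Person"),
        (user_uri, RDF + "type", SIOC + "User"),
        (person_uri, FOAF + "holdsAccount", user_uri),
        # ASSUMPTION: the seeAlso pointer hangs off the user account.
        (user_uri, RDFS + "seeAlso", sioc_doc_url),
    ]
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


ntriples = link_person_to_sioc(
    "http://example.org/foaf.rdf#me",
    "http://example.org/blog/user/1",
    "http://example.org/blog/sioc.php?type=user&id=1",
)
print(ntriples)
```

A crawler that starts from the FOAF profile can then follow foaf:holdsAccount to the user and rdfs:seeAlso to the document holding the SIOC data.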

Other Issues

Ontology

Alex would like to see if the "SIOC-topic extension" can fit in a SIOC module
Some properties should be changed, more properties could be added - that should be discussed in a SIOC working group next week.
The SIOC ontology doesn't open in Protégé - maybe have a look at it, although the SIOC ontology is valid RDF, so it is Protégé's fault that it doesn't load the specification (to browse the ontology one can use e.g. Tuukka's Fenfire).
Richard and Thomas will look at the output of the SIOC API to produce a checklist for its use with Tabulator (or, respectively, as a small test case for RDF browsers / generic RDF applications)

News

There is int.ere.st for searching SCOT information - it should also work with SIOC.
Alex made a Flickr FOAF and SIOC exporter.
SIOC module for Drupal 5 released last week.
SIOC module for Drupal 6 released yesterday.