SemWalker
- A data browser, aka "semantic web browser" (cf. Longwell, Tabulator)
- Demonstrates "downhill steps" to decentralizing the Data Web
- In progress
What is a Semantic Web Browser?
Software which lets users browse RDF data published on the web
- Single-Item Page
- Multi-Item Page
- Class-Specific Views
- Real-Time Harvesting
- Real-Time Inferencing
- Control Over Sources
- A better way to make data-oriented web sites
Hopefully the term "Data Browser" will come to mean this
Single-Item Page
see a lot of information about a single item
- Person
- Book
- Meeting
- Photograph
- RDF Property
Multi-Item Page
see a little information about each of many items
- The people danbri foaf:knows
- Books on Danny's wish list
- DIG meetings
- Photographs of the W3C team
- Properties with the word "sister" in their documentation
Class-Specific Views
different content and layout for items of different classes
- People: cf. Friendster, Orkut, department pages, OkCupid
- Book: cf. Amazon, library card catalogs
- Meeting: cf. conference websites, agenda e-mails
- Photograph: cf. Flickr, custom photo sites
- RDF Property: cf. Ontaria 9-board, ...?
Real-Time Harvesting
fetch the data from the web when the user wants it (with caching)
delays in publication are often unacceptable
- Author's feedback-loop
- Meeting agenda changes
- Photographs we just took
- (isn't this obvious?)
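The "with caching" parenthetical above is the key engineering point: data is fetched when the user asks for it, but a fresh-enough cached copy is reused. A minimal sketch of that policy, with hypothetical names (this is not SemWalker's actual API):

```python
import time

class HarvestCache:
    """Re-fetch a source only when the cached copy is older than max_age seconds."""

    def __init__(self, fetch, max_age=300):
        self.fetch = fetch          # fetch(url) -> harvested data (e.g. parsed triples)
        self.max_age = max_age
        self._cache = {}            # url -> (timestamp, data)

    def get(self, url, now=None):
        now = time.time() if now is None else now
        entry = self._cache.get(url)
        if entry is not None and now - entry[0] < self.max_age:
            return entry[1]         # fresh enough: serve the cached copy
        data = self.fetch(url)      # otherwise harvest in real time
        self._cache[url] = (now, data)
        return data
```

In practice the real harvester also honors the server's cache-control headers (see the Harvester slide below) rather than a single fixed max-age.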
Real-Time Inference
Let people rely on the formal semantics; let the machines do the work
- subclass/subproperty
- vocabulary mapping rules
- ...?
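The subclass case above is the classic example of "letting the machines do the work": an item asserted to be a foaf:Person should also show up as a foaf:Agent. An illustrative sketch (not SemWalker's actual Prolog code) of closing rdf:type under rdfs:subClassOf:

```python
def types_of(resource, type_facts, subclass_facts):
    """Return all classes of `resource`, including inferred superclasses.

    type_facts:     dict resource -> set of asserted classes
    subclass_facts: dict class -> set of direct superclasses
    """
    result = set()
    todo = list(type_facts.get(resource, ()))
    while todo:
        cls = todo.pop()
        if cls not in result:
            result.add(cls)                          # record this class
            todo.extend(subclass_facts.get(cls, ())) # then walk up the hierarchy
    return result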
Control Over Sources
Let readers see (and control) where data came from
one possible design:
- blue -- on reader's trusted-sites list
- white -- trusted by uri-owner
- grey -- 3rd party
- red -- on reader's highly-suspect list
- combined items (eg inferences) colored at the least-trusted level
- mouse-over or nearby [!?] icon for more details
- list of Sources at the bottom of the page, with Mark-As-Trusted and Mark-As-Highly-Suspect
- harvesting and inference depths much greater on more-trusted information
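The "least-trusted level wins" rule for combined items can be stated in a few lines. A sketch using the four colors from the design above (names are illustrative):

```python
# Least to most trusted, per the design sketch:
# red = highly-suspect, grey = 3rd party, white = trusted by uri-owner,
# blue = on the reader's trusted-sites list.
TRUST_ORDER = ["red", "grey", "white", "blue"]

def combined_color(source_colors):
    """An item derived from several sources (eg by inference) gets the
    color of its least-trusted contributing source."""
    return min(source_colors, key=TRUST_ORDER.index)
```

So an inference drawn from one blue source and one grey source displays as grey, which is what keeps a suspect source from laundering its claims through trusted ones.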
A Better Way To Make Data-Oriented Web Sites?
- What big web-sites are not data oriented?
- Content-Providers:
- expose RDF views of their database(s)
- host an off-the-shelf SemWalker installation (or contract for one)
- can decentralize internally (eg between sales, marketing, shipping, customer relations, etc -- all are automatically integrated on "their" website)
Consumers (Data Users)
- Get better data-oriented web-sites
- Get to use their own data browser
- Incidentally get to query/integrate across provider sites
Market Forces
- Providers looking for best server-side data browsers (SemWalker-clones)
- Consumers looking for best data browsers
- Providers become notably absent from the "Data Web" for consumers using their own data browsers (pressure to expose data)
SemWalker Strategy
- Offers software that lets people build world-class websites just by publishing their public data as RDF
- ... (they might have to add data-organization metadata)
- ... (and an RDF-based transactional-services interface)
Downhill steps to the Semantic Web
Applications
(@@@ expand these examples)
- social networking (view of Person)
- photography (view of a Photo)
- shared vocabularies (view of Ontology, Class, Property)
- shared calendars (view of TimeRange, Event)
- collaboration (view of Project)
- shopping (view of ItemForSale)
@@? Some Requirements
- Allow data source branding
- As usable as a typical data-oriented website (simple, fast, reliable)
- Customizable views for new Classes
- Privileging of term-origin data (over 3rd-party data)
- Quiet-but-available inference
- Good Caching
- Access to search features
The Code
- just my work, so far
- most development driven by Ontaria
- Key item classes: Ontology, Class, Property
- Central-hosting plan (portal style), at w3.org
- Extensive shared indexing - plan for 10^8 triples?
Harvester
- prioritized threads (hi-queue, low-queue)
- all harvest data visible as RDF
- reads RDF/XML, HTML
- fast, but there's lots of cruft out there
- breaks large graphs into chunks (10K triples)
- can keep old copies of data
- follows redirects, cache-control, etc.
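Breaking a large harvested graph into fixed-size chunks is simple but worth seeing; here is a sketch (illustrative names, using the 10K-triple size the slide mentions):

```python
def chunk_triples(triples, size=10_000):
    """Yield successive lists of at most `size` triples, so a huge
    harvested graph can be stored and paged in piecewise."""
    for i in range(0, len(triples), size):
        yield triples[i:i + size]
```

Chunks, not whole graphs, then become the unit of indexing and paging for the Indexer and RDF Store described below.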
Indexer
- separate thread, running behind harvester
- maps keyword -> (subject, chunk)
- map uri -> chunks that use it
- Berkeley DB implementation (currently broken)
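A plain in-memory sketch of the two maps described above, standing in for the Berkeley DB implementation (names and the literal-vs-URI test are illustrative):

```python
import re
from collections import defaultdict

def build_indexes(chunks):
    """chunks: dict chunk_id -> list of (subject, predicate, object) triples.

    Returns (keyword_index, uri_index):
      keyword_index: word -> set of (subject, chunk_id)
      uri_index:     uri  -> set of chunk_ids that use it
    """
    keyword_index = defaultdict(set)
    uri_index = defaultdict(set)
    for chunk_id, triples in chunks.items():
        for s, p, o in triples:
            uri_index[s].add(chunk_id)
            uri_index[p].add(chunk_id)
            if o.startswith("http"):            # crude URI-vs-literal test
                uri_index[o].add(chunk_id)
            else:                               # literal: index its words
                for word in re.findall(r"\w+", o.lower()):
                    keyword_index[word].add((s, chunk_id))
    return keyword_index, uri_index
```

The keyword map answers queries like the earlier example (properties with "sister" in their documentation); the URI map is what pagein_around consults.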
RDF Store
- library('semweb/rdf_db') in memory (an 8-way indexed quad store)
- rdfpage library provides fast loading from disk (pre-parsed)
- pagein_around
- loads all the chunks that use this URI as subject/value
- (may need ontology smarts -- ie pagein_around(rdf:Resource))
- uris given to chunks are all hosted/served
- backward-chaining of rules (as prolog, but incremental depth-limited)
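A sketch of pagein_around in Python rather than Prolog (the real code lives alongside SWI-Prolog's semweb/rdf_db; names here are illustrative): given an index from URI to the chunk ids that mention it, load every not-yet-resident chunk that uses the URI as subject or value.

```python
def pagein_around(uri, uri_index, load_chunk, loaded):
    """Page in all chunks that use `uri`.

    uri_index:  dict uri -> set of chunk ids mentioning that uri
    load_chunk: callable paging one chunk into the in-memory store
    loaded:     set of chunk ids already resident (mutated in place)
    """
    for chunk_id in uri_index.get(uri, ()):
        if chunk_id not in loaded:
            load_chunk(chunk_id)
            loaded.add(chunk_id)
```

The ontology-smarts caveat above is about very common URIs: paging in everything around rdf:Resource would pull in nearly the whole store, so such terms need special handling.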
Inference
(say more?) forward chaining, too, eg for the index
- experimentally reads surnia rules for OWL inference
Views
rules for mapping from triples to XHTML trees
- Current implementation makes a tab for each view that applies
- I think they're pretty easy/fun to write
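The one-tab-per-applicable-view behavior is easy to sketch: each view rule names a target class and a render function, and an item gets a tab for every view whose class it belongs to. All names below are illustrative (the real rules map triples to XHTML trees, not strings):

```python
def applicable_views(item_classes, views):
    """views: list of (view_name, target_class, render_fn).
    Return (name, render_fn) pairs for every view whose target class
    the item belongs to -- one tab per applicable view."""
    return [(name, render) for name, cls, render in views
            if cls in item_classes]

def render_person(triples):
    """Toy class-specific view: render a person's foaf:name as XHTML."""
    props = dict((p, o) for _s, p, o in triples)
    return "<div class='person'><h1>%s</h1></div>" % props.get("foaf:name", "?")
```

With real-time inference feeding item_classes (eg foaf:Person implying foaf:Agent), an item can pick up tabs for views written against its superclasses too.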