RDB2RDF Working Group Teleconference

15 Dec 2009


See also: IRC log


Present: Seema, +43.316.876.aaaa, +1.562.249.aabb, +039046188aacc, +39.046.188.aadd, Ashok_Malhotra, Souri, EricP, mhausenblas, MacTed, soeren, cygri, whalb, [IPcaller], angela_UNITN, HeikoStoermer, jsequeda, +44.131.208.aaee, hhalpin, Orri
Regrets: Ben_Szekely, Nuno, Ahmed




<trackbot> Date: 15 December 2009

<Ashok> Do we have telcons on Dec 22 and 27 ?

<mhausenblas> no, Ashok ;)

<ericP> slackers

<Ashok> Thanks, Michael!

<mhausenblas> scribenick: cygri


<MacTed> MacTed = Ted Thibodeau

<MacTed> correct

<angela_UNITN> aacc is me

<angela_UNITN> aadd is heiko

<HeikoStoermer> right

PROPOSAL: Accept the minutes of the 8 December 2009 telecon


<whalb> +1

<Marcelo> +1


<soeren> +1

RESOLUTION: Accept the minutes of the 8 December 2009 telecon

Use Case planning


mhausenblas: http://www.w3.org/2001/sw/rdb2rdf/wiki/Use_Cases_and_Requirements
... invite ppl to add their use cases

Ashok: format? HTML or only wiki?

mhausenblas: initially collaborate on the wiki, then turn into proper WG Note with help of EricP

Soeren: present use cases as database schemas?

mhausenblas: rather keep it on user level, e.g., "we have a web shop..."
... or "combine CRM system with web shop"
... for now, it's structured brainstorming

number of use cases we're aiming at?

EricP: a size that we can manage

Presentation - Okkam/ENS


Heiko Stoermer is presenting

work is part of OKKAM, EU project

ENS -- Entity Naming System

thanks mhausenblas!

slide 2

slide 3

ENS provides services for re-use of identifiers

several public services

ID search, ID creation, ID management (alternative IDs), create+update profiles of entities

scalable architecture

access through SOAP services, REST is coming

web frontends

slide 4

benefits from using ENS

heiko: easily retrieve all data attached to the same ID

thx ericP!

heiko: maintain metadata about entities
... profile updates based on popularity
... application in business intelligence
... integrate data across systems
... potentially get links to stuff outside on the web for free
... e.g. other people talking about your product (SAP use case)

slide 5

heiko: architecture
... storage
... lifecycle, e.g. ageing, merging, splitting of IDs
... entity matching (queries)
... access management: no mining queries ("give me all XYZ")
... access APIs

slide 6

heiko: scalability
... storage has distributed index, and distributed entity store, both clustered
... replication+sharding
... solr
... ENS Core does life cycle etc, also clustered

slide 7

heiko: currently also working on offline processing
... batch processing, deduplication, data quality assessment etc

slide 8

heiko: under development for 2 years, version 2 coming
... now at 7.5M records, system scales to 50M
... want to be at 50M records and capability of 500M at project end 06/2010

slide 9

heiko: entity repository = ID + attached entity description

slide 10

heiko: challenges
... no defined fixed schema, just vocabularies
... we don't define vocabularies
... users specify name-value pairs
... matching afterwards is difficult
... users can use whatever vocab they want, "professor" instead of "person", we must deal with that

slide 11

heiko: internal representation: XML documents with name-value pairs describing the entities
... and alternative identifiers
... can be interpreted as linked data style sameAs
... e.g. dbpedia URI
... API call for retrieving the canonical OKKAM ID for an alternative identifier
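The alternative-identifier lookup described on slide 11 can be pictured with a small in-memory sketch. This is only an illustration under stated assumptions: the class and method names are hypothetical and are not the ENS API, and the DBpedia URI is a placeholder.

```python
# Hypothetical stand-in for an alternative-ID lookup; not the OKKAM/ENS API.
class AlternativeIdIndex:
    def __init__(self):
        # Maps each alternative identifier to one canonical identifier.
        self._canonical_by_alt = {}

    def add_alternative(self, canonical_id, alternative_id):
        """Record that alternative_id names the same entity as canonical_id."""
        self._canonical_by_alt[alternative_id] = canonical_id

    def canonical_for(self, identifier):
        """Return the canonical ID for an alternative ID, or None if unknown."""
        return self._canonical_by_alt.get(identifier)


index = AlternativeIdIndex()
# The OKKAM ID is the example pasted later in these minutes;
# pairing it with this placeholder DBpedia URI is purely illustrative.
okkam_id = "http://www.okkam.org/entity/ok5f23a5ce-a683-4c4d-ae73-b78cdc17aec1"
index.add_alternative(okkam_id, "http://dbpedia.org/resource/Example")
print(index.canonical_for("http://dbpedia.org/resource/Example"))
```

Unlike an owl:sameAs statement, the lookup is one-directional: it answers "which canonical ID does this identifier belong to" and nothing more.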

slide 12

heiko: current content of the repo
... wikipedia, geonames, manually created
... total 7.5M entities
... currently adding DBLP
... no restriction w.r.t. types of entities, we can manage everything

slide 13

heiko: entity ID search
... user submits key-value pairs as query
... query must be matched against profiles
... result is canonical identifier
... skip slide 14

slide 15

heiko: 2 phase process in search
... 1. entity search, 2. refined entity matching
... entity search is for recall, pull out everything that is relevant, that's fast
... refined matching then to increase precision, can be more expensive
... return match or no match
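The two-phase process on slide 15 can be sketched roughly as follows. The scoring rule, the threshold, the profile layout and all function names are assumptions made for illustration; the minutes do not say how ENS implements either phase (apart from mentioning solr for the index).

```python
def candidate_search(query, profiles):
    """Phase 1 (recall): keep every profile that shares any value with the query.

    Keys are ignored here, since users may pick arbitrary vocabularies.
    """
    wanted = {str(v).lower() for v in query.values()}
    return [p for p in profiles
            if wanted & {str(v).lower() for v in p["attributes"].values()}]


def refined_match(query, candidates, threshold=0.75):
    """Phase 2 (precision): score each candidate, return the best ID or None."""
    if not query:
        return None
    best_id, best_score = None, 0.0
    for profile in candidates:
        attrs = profile["attributes"]
        hits = sum(1 for k, v in query.items()
                   if str(attrs.get(k, "")).lower() == str(v).lower())
        score = hits / len(query)
        if score > best_score:
            best_id, best_score = profile["id"], score
    return best_id if best_score >= threshold else None


def find_entity_id(query, profiles):
    """Cheap broad search first, then the more expensive matching step."""
    return refined_match(query, candidate_search(query, profiles))


# Illustrative data only; the IDs are made up.
profiles = [
    {"id": "ens:1", "attributes": {"name": "Oracle", "type": "company"}},
    {"id": "ens:2", "attributes": {"name": "Oracle", "type": "database"}},
]
print(find_entity_id({"name": "Oracle", "type": "company"}, profiles))
print(find_entity_id({"name": "Unknown"}, profiles))
```

The first phase deliberately over-generates; only the second phase decides between "match" and "no match".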

slide 16

heiko: bridging to database integration
... expose two DBs as two knowledge bases (graph)
... typical approach for integration: owl:sameAs between records in diff DBs

slide 17

heiko: owl:sameAs has strong semantics, you forget where the data came from
... (slide 18) better: use same ID everywhere
... OKKAM ID as "mediator" in the middle
... without undesirable consequences of sameAs

slide 19

heiko: you can give local identifiers and then connect them to OKKAM ID
... then you can merge based on the ID, with desired semantic rules
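One way to picture merging on a shared ID while keeping track of where the data came from is the sketch below. The function name, the data layout and the "ok:1" identifier are hypothetical; they only illustrate the idea from slides 17 to 19.

```python
from collections import defaultdict


def group_by_global_id(sources, local_to_global):
    """Group records from several sources under a shared global ID.

    The source name is kept next to each record, so provenance survives,
    which is the property lost when graphs are collapsed via owl:sameAs.
    """
    merged = defaultdict(list)
    for source_name, records in sources.items():
        for local_id, record in records.items():
            global_id = local_to_global.get((source_name, local_id))
            if global_id is not None:
                merged[global_id].append((source_name, record))
    return dict(merged)


# Illustrative data: two databases with their own local keys.
sources = {
    "crm": {17: {"name": "ACME"}},
    "shop": {"c-9": {"customer": "ACME Corp"}},
}
# Mapping from (source, local ID) to a made-up global ID.
local_to_global = {("crm", 17): "ok:1", ("shop", "c-9"): "ok:1"}

print(group_by_global_id(sources, local_to_global))
```

Records without a known global ID are simply left out of the result; what to do with them is a separate decision.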

slide 20

heiko: a database alignment project with okkam
... client has bunch of databases
... want unified view
... convert them all to RDF
... use ENS to align
... so entities are linked without having to merge the graphs

slide 22

heiko: in RDB you have PKs, so unique ID is often a number
... in RDF you need a URI
... ENS is the thing that can enable stepping from the RDB world to the RDF world
... afterwards, coreference is syntactically evident
... so okkam provides mapping between local ID and global OKKAM ID
... DERI has sig.ma application
... you can give it an okkam ID and it will give view on all data out there that uses the ID

<Souri> +q


ericP: similar to Shared Names project? Concept Wiki?

heiko: they do life science IDs, we do all domains
... they are vertical app

ericP: different proteins are sometimes the same, sometimes not considered the same
... predicated similarity?

heiko: frequently raised point... up to which point is X still the same when you start replacing all its parts?
... we don't deal with that kind of semantics
... what's the same or not is in your knowledge base
... if you describe things differently from me, if we need insulation, we will have two different entities

ericP: when I do SPARQL queries, should engine be aware of OKKAM?

heiko: no SPARQL interface yet

Souri: q related to goal of this WG... how do you do mapping in the DBs?

heiko: that's up to mapping infrastructure. we just provide a URI. ENS is not a mapping layer between DB and RDF. ENS is ID management

Souri: do you hand an ID to the user, "build your DB using this"? or does user give all his IDs to the ENS?

heiko: can do two things. first, whenever I create an entity, ENS assigns it an ID. when someone else wants to talk about same entity, ID is already there in the ENS
... second, we already have distributed data. you give data to the ENS, it gives you an ID (existing or newly created). repeat for different data sources, you get same ID
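The "existing or newly created" behaviour heiko describes can be sketched as a get-or-create lookup. Everything here is an assumption for illustration: the class name, the exact-match key (a crude stand-in for real entity matching) and the urn:example ID scheme are not taken from ENS.

```python
import uuid


class ToyNamingService:
    """Hypothetical get-or-create ID service; not the ENS implementation."""

    def __init__(self):
        self._id_by_profile = {}

    def resolve_or_create(self, description):
        """Return the existing ID for this description, or mint a new one."""
        # Case-insensitive exact match on the name-value pairs.
        key = frozenset((k.lower(), str(v).lower())
                        for k, v in description.items())
        if key not in self._id_by_profile:
            self._id_by_profile[key] = f"urn:example:entity:{uuid.uuid4()}"
        return self._id_by_profile[key]


ens = ToyNamingService()
# The same entity described by two different data sources.
id_from_db_a = ens.resolve_or_create({"name": "ACME", "city": "Trento"})
id_from_db_b = ens.resolve_or_create({"Name": "acme", "City": "trento"})
print(id_from_db_a == id_from_db_b)
```

Repeating the call for each data source yields the same ID whenever the descriptions match, which is the property the second scenario relies on.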

Ashok: are okkam IDs URIs? what's the structure of the URI?

heiko: yes, they are URIs

<HeikoStoermer> http://www.okkam.org/entity/ok5f23a5ce-a683-4c4d-ae73-b78cdc17aec1

heiko: that's an okkam ID
... it's a UUID
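Judging only from the example pasted above, the last path segment looks like the prefix "ok" followed by a UUID. Under that assumption (the minutes do not confirm the general format), the UUID part could be extracted like this; the function name is hypothetical.

```python
import uuid
from urllib.parse import urlparse


def uuid_from_okkam_id(okkam_id):
    """Extract the UUID from an OKKAM ID, assuming an 'ok' + UUID last segment."""
    last_segment = urlparse(okkam_id).path.rsplit("/", 1)[-1]
    if not last_segment.startswith("ok"):
        raise ValueError(f"unexpected identifier format: {okkam_id!r}")
    # uuid.UUID raises ValueError if the remainder is not a valid UUID string.
    return uuid.UUID(last_segment[2:])


example = "http://www.okkam.org/entity/ok5f23a5ce-a683-4c4d-ae73-b78cdc17aec1"
parsed = uuid_from_okkam_id(example)
print(parsed, parsed.version)
```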

<angela_UNITN> you can aggragate data by okkamID using sig.ma for example

<angela_UNITN> http://sig.ma/search?q=http://www.okkam.org/entity/ok5f23a5ce-a683-4c4d-ae73-b78cdc17aec1

jsequeda: let's say I have legacy DB about companies with PKs. so I would map my PKs to okkam IDs?

heiko: yes you want to have the okkam ID somewhere in your data, because then it's stable
... either do it entity by entity, or use batch processor where you send the data to the ENS
... privacy issues of course

jsequeda: how much disambiguation do you do? how do you tell apart Oracle the company and Oracle the DB?

heiko: if you just have a string, we can do nothing for you. need more info in your record
... sometimes can fall back on global popularity. IBM the company vs IBM the band
... in practice, today: build a slightly more elaborate description of your entity; do it one by one; send query to ENS
... real examples from use case partners have sufficient detail
... structure of query: simplest is bag of words; more complex is key value pairs; easy to pull that from a DB and that helps us a great deal
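The two query shapes heiko mentions, bag of words and key-value pairs, are easy to derive from a database row. A minimal sketch with an in-memory SQLite table follows; the table, its columns, the sample row and the helper names are all made up for illustration, and no actual ENS call is shown.

```python
import sqlite3

# Throwaway in-memory table standing in for a legacy database.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row
conn.execute("CREATE TABLE company (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("INSERT INTO company VALUES (1, 'Example Corp', 'Trento')")
row = conn.execute("SELECT * FROM company WHERE id = 1").fetchone()


def key_value_query(row, skip=("id",)):
    """Richer query form: column names become keys; the local PK is left out."""
    return {k: row[k] for k in row.keys()
            if k not in skip and row[k] is not None}


def bag_of_words_query(row, skip=("id",)):
    """Simplest query form: just the values, joined into one string."""
    return " ".join(str(v) for v in key_value_query(row, skip).values())


print(key_value_query(row))
print(bag_of_words_query(row))
```

The local primary key is skipped because it means nothing outside its own database; it is the value one would later map to the returned ID.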

mhausenblas: further questions on the mailing list


<ericP> +1

mhausenblas: no telecon on December 22nd and 29th

<Marcelo> +1

PROPOSAL: reconvene Jan 5th

<mhausenblas> http://www.w3.org/2001/sw/rdb2rdf/wiki/ScribeList

next scribe is Souri

Microsoft patent ... apparently does not come from SQL Server team but perhaps Live Search

<jsequeda> Email on the New York Semantic Web mailing list

<jsequeda> Actually its not a patent yet, just an application. The USPTO is looking at ways to improve discovery of prior art, and has a pilot program where you can participate in the examination process. So if you know of prior art, post it here:

<jsequeda> http://www.peertopatent.org/

<MacTed> there is a date that "prior art" must exist before, associated with the patent ... but I forget whether that's the "submission date" or something else

<Souri> Oracle has a paper in VLDB 2005

<mhausenblas> [adjourned]

Summary of Action Items

[End of minutes]

Minutes formatted by David Booth's scribe.perl version 1.135 (CVS log)
$Date: 2009/12/15 18:07:02 $

Agenda: http://lists.w3.org/Archives/Public/public-rdb2rdf-wg/2009Dec/0008.html