HCLSIG BioRDF Subgroup/Meetings/2006-09-11 Conference Call

Conference Details

  • Date of Call: Monday September 11, 2006
  • Time of Call: 11:00am Eastern Time
  • Dial-In #: +1.617.761.6200 (Cambridge, MA)
  • Participant Access Code: 246733 ("BIORDF")
  • IRC Channel: irc.w3.org port 6665 channel #BioRDF (see W3C IRC page for details, or see Web IRC)
  • Duration: ~1 hour
  • Convener: Susie Stephens
  • Scribe: Susie Stephens

Agenda

  • Discuss BioRDF representation at the F2F
  • Continue discussion on URIs
  • Fred Zemke (Oracle) update on SPARQL


Attendees

Kerstin Forsberg, Andy Seaborne, Fred Zemke, Scott Marshall, Matthias Samwald, Kei Cheung, Joanne Luciano, John Barkley, Marja Koivunen, Bill Bug, Alan Ruttenberg, Susie Stephens

Fred Zemke (FZ) gave a presentation of some use cases for SPARQL. The use cases focused on concerns regarding blank nodes, duplicates and counting. His five questions to the BioRDF group at the end of the presentation were:

1. Do you view blank nodes as a device to create data structures without bothering with IRIs, or do you view them as existential assertions?
2. Do you expect to use linked lists? Do you expect to use what I call flattened lists?
3. Do you expect to use RDF in an “open world” fashion, or a “closed world”?
4. Do you expect to assume “distinct names”?
5. Do you want SPARQL to treat blank node identifiers existentially, or just the same as variables?

Susie Stephens (SS) – It was very good that FZ could give an overview of his concerns with SPARQL. It’s very important that SPARQL is a solid standard; otherwise the alternative query languages will continue to flourish.

Andy Seaborne (AS) – The first four questions are really RDF related questions.

Bill Bug (BB) – Good presentation. Blank nodes are very important in life sciences.

Matthias Samwald (MS) – Originally I thought there was a bug with how SPARQL was working with blank nodes, but now it seems like it is a feature that could be problematic.

John Barkley (JB) – Great presentation. You could deal with the problems by increasing the number of data types for use with OWL and RDF. With relational databases people tend to be more interested in data than knowledge. People tend to use RDF for different things. You could have an RDF representation for a purchase order. You could use more sophisticated techniques than pattern matching triples. People will want to link the relational and RDF worlds together. Relational databases allow really arbitrary data types. Maybe the same would help with RDF.

FZ – This may be the solution. There could possibly be significant performance implications. This is an issue with the object-relational world too.

JB – In response to question three, RDF and particularly OWL are meant to represent the open world. If you want to do a data search, rather than knowledge representation, then you could use the closed world. However, the semantics wouldn’t be used properly in a closed world scenario.

BB – There are a lot of open world representations in the biomedical world.

Joanne Luciano (JL) – Open world and closed world both have their places. The life sciences are very open world, while business tends to be very closed world. We need to be able to bridge both worlds.

BB – There are many closed world environments, for example, XML and schema modeling. Some applications of RDF, and especially OWL, need to allow for open world. We need best practices for closing part of the world.

JL – There’s a covering axiom that can be used for closing OWL for specific parts of applications. We need this capability to be better defined, more than having a need for best practices. We need constructs to enable users on either side of the world to do whatever they want.

Alan Ruttenberg (AR) – Pellet for OWL uses blank nodes as regular variables. Non-distinguished variables, which are bound implicitly, can’t return values. For example, if every parent has a child, and the child is a blank node, then you get all instances back in a query. If the child is a variable, then you don’t get every instance back in a query. Returning blank nodes as values of variables is at best confusing.

FZ – I’ve had similar conversations with Bijan Parsia. It’s also not possible to get a linked list back, as there isn’t any recursive query capability with SPARQL. The only way to walk the list would be to issue one fetch, then another fetch, and so on.
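The fetch-by-fetch walk FZ describes can be sketched in plain Python. This is only an illustration, not SPARQL: triples are modelled as tuples, `fetch` is a hypothetical stand-in for a single non-recursive query, and the `ex:` names and item values are invented. It also assumes that a node identifier returned by one fetch can be reused in the next one.

```python
# Triples as (subject, predicate, object) tuples; all names are illustrative.
RDF_FIRST, RDF_REST, RDF_NIL = "rdf:first", "rdf:rest", "rdf:nil"

TRIPLES = {
    ("ex:order1", "ex:items", "_:b1"),
    ("_:b1", RDF_FIRST, "widget"),
    ("_:b1", RDF_REST, "_:b2"),
    ("_:b2", RDF_FIRST, "gadget"),
    ("_:b2", RDF_REST, "_:b3"),
    ("_:b3", RDF_FIRST, "widget"),
    ("_:b3", RDF_REST, RDF_NIL),
}


def fetch(subject, predicate):
    """Hypothetical stand-in for one non-recursive query."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


def walk_list(head):
    """Follow rdf:rest links one step at a time, counting the steps."""
    items, rounds, node = [], 0, head
    while node != RDF_NIL:
        items.append(fetch(node, RDF_FIRST)[0])  # value held by this cell
        node = fetch(node, RDF_REST)[0]          # move to the next cell
        rounds += 1
    return items, rounds


head = fetch("ex:order1", "ex:items")[0]
items, rounds = walk_list(head)
print(items, rounds)
```

The number of rounds grows with the length of the list, which is the cost being pointed out: without recursion, the client rather than the query has to do the walking.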

AR – Blank nodes are commonly used incorrectly. You shouldn’t use them unless you need existential characteristics. The semantics will bite you otherwise. Something should distinguish a blank node extension from a blank node where there is no name.

The only way you can tell a blank node from an individual with OWL is if there is entailment. You can’t test for a blank node in OWL with subsumption.

With OWL if you define a transitive property you usually get an unordered list. You could use pairs to follow the path back to the root.

Until everyone, and not just the logicians, is clear on blank nodes, there is a limited penalty in using a named individual instead.

FZ – The primer needs to be re-written to make blank nodes clearer, as it sounds like there are serious consequences in their use.

AR – To answer question one, people view blank nodes as existential assertions.

BB – Are flattened lists related to named graphs in RDF?

FZ – A flattened list is what I call the case where you have a node and a verb, the verb points to many other nodes, and the verb is then reused many times to point to additional nodes. This is useful in SPARQL. It is dangerous in the open world as there can be many contributors, but in the closed world you can get away with flattened lists.

AR – My first issue is that that isn’t a list, it’s a set, as it has no order.

My second issue is that line items need to be distinguished by more than their value, so you need to add an identifier number or a name.

You can close the open world by saying that you only have 20 of something for example.
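AR's two objections to flattened lists can be illustrated with a small Python sketch. It assumes a hypothetical purchase order with invented `ex:` names, and models an RDF graph as a Python set of triples, so a repeated identical statement collapses into one and no order is kept.

```python
# Hypothetical purchase order in the "flattened" style: one verb reused per item.
statements = [
    ("ex:order1", "ex:item", "widget"),
    ("ex:order1", "ex:item", "gadget"),
    ("ex:order1", "ex:item", "widget"),  # a second widget on the same order
]

# A graph is a set of triples: the repeated statement disappears and
# nothing records which item came first.
graph = set(statements)
flat_items = sorted(o for s, p, o in graph if p == "ex:item")

# Giving each line item its own identifier keeps the two widgets apart.
keyed = set()
for n, (s, p, o) in enumerate(statements, start=1):
    line = f"ex:line{n}"  # illustrative identifier for the line item
    keyed.add((s, "ex:line", line))
    keyed.add((line, "ex:product", o))
keyed_items = sorted(o for s, p, o in keyed if p == "ex:product")

print(flat_items)
print(keyed_items)
```

In the flattened form only two items survive, while the keyed form retains all three, at the price of an extra node per line item.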

FZ – It is possible to use a unique key. However, in practical queries, users tend not to be interested in using a unique key, and this therefore gets projected away and the user hits the duplicate problem.

In the relational world, if the duplicate is suppressed, the user views it as an incorrect result. There won’t be much traction for SPARQL if users perceive it as giving incorrect results.
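The duplicate problem FZ raises can be shown with a minimal Python sketch of projection. The rows, column names and helper functions are invented for illustration and do not represent any particular query engine.

```python
# Hypothetical result rows: each line item has a unique key and a product.
rows = [
    {"line": 1, "product": "widget"},
    {"line": 2, "product": "gadget"},
    {"line": 3, "product": "widget"},
]


def project(rows, columns):
    """Keep only the requested columns, preserving duplicates."""
    return [tuple(r[c] for c in columns) for r in rows]


def distinct(tuples):
    """Remove duplicate tuples, keeping first occurrences in order."""
    return list(dict.fromkeys(tuples))


with_key = project(rows, ["line", "product"])  # key kept: rows stay unique
without_key = project(rows, ["product"])       # key projected away
deduped = distinct(without_key)                # duplicates suppressed

print(len(with_key), len(without_key), len(deduped))
```

Once the key is projected away, two identical rows remain. Keeping them matches what a relational user would expect, whereas suppressing them makes the order look one item short.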

AR – It’s a matter of educating people about distinct.

FZ – If there’s no distinct, then the engine may randomly remove duplicates.

AR – This is a good thing to look into.

FZ – This concerns me too.

AR – There’s a problem with count in the open world. If you assert a count, then you want to know what it is.

BB – You don’t want to have to tell the database the count.

AS – You need to make an explicit assertion. You may want to close the world to do the count. Sometimes you will want to do that, and sometimes you won’t.

FZ – We need more language options.

AR – That could be hard to do, and could potentially be very inefficient. It would force certain things. There’s no way to efficiently close the OWL world.

AS – Then you shouldn’t get a result to the query in the OWL world.

AR – RDF is the same.

AS – You do get closed and open world modes. With a uniformly implemented standard it’s difficult to include all options.

BB – In current systems with SPARQL, there could be the core standard that people conform to, and extensions. The extensions could be used to close the world.

Kei Cheung (KC) – There are currently many RDF query dialects.

AS – But there aren’t any variations of SPARQL itself.

AR – You can ask for how many distinct individuals there are as a minimum number. I don’t think you can say whether they are the same or different. With OWL it makes sense to ask ‘at most’ and ‘at least’.

SS – What topics would people be interested in focusing on at the F2F in Amsterdam? Would people be interested in discussing how we can link together our various data sets in RDF using vocabularies and ontologies? We could focus on additional data sets that we need to convert to RDF in order to have a demo that spans from the bench to the bedside. We could also focus discussions on URIs.

AR – Who will be attending the meeting?

Joanne, Susie, Kei, Alan in person, and Bill and John by telcon.

AR – I would suggest a mini-hackathon where everyone brings their RDF data, and we work on integrating the data sets together. Doing this work in an F2F environment would be very useful. People could post the data for download in advance of the meeting.

I would like to give a presentation on URI resolution in ontologies. This would force me to really get my head around the topic area.

I would also be very interested to learn about the ontologies that the Bio-ONT group has developed.

I would also like to discuss the HCLS workshop at ISWC.

SS – Should we have a session focused on URIs? The original plan was to have an initial draft of a best practices document in place in time for the meeting. However, I believe that this has slipped too much to be realistic.

JB – Part of the delay is that we aren't very aware of the areas in which there is controversy. How about we solicit short URI-related statements from people prior to the meeting, and during the meeting we could see where we have consensus and where we need to focus future work?

SS – As setting up a Wiki page doesn’t always appear to be constructive, I’ll send out a mail to the list to solicit URI statements. I can then post them on the Wiki if necessary. I’ll chase for input from those people who have stated that they are especially interested in URIs.

Please don’t forget to submit abstracts for the HCLS meeting. The deadline is Monday.

Please let me know if you have suggestions for the next call.

AR – Jonathan Rees (JR) is working on developing queries for Huntington’s Disease (HD). He’s currently not sure what information would be useful to include, because of data quality issues, etc. It would be good if we could have a call with some neuroscientists who could help him make some decisions in this regard.

BB – This sounds like a good idea. There hasn’t been much of a focus on HD to date. HD is also great data to work with as it has a rich genetic component, e.g. pedigrees, populations, etc.

AR – I’ll investigate JR’s availability.