HCLSIG/LODD/Meetings/2010-09-01 Conference Call
Conference Details
- Date of Call: Wednesday September 1, 2010
- Time of Call: 11:00am Eastern Daylight Time (EDT), 16:00 British Summer Time (BST), 17:00 Central European Summer Time (CEST)
- Dial-In #: +1.617.761.6200 (Cambridge, MA)
- Dial-In #: +33.4.26.46.79.03 (Paris, France)
- Dial-In #: +44.203.318.0479 (London, UK)
- Participant Access Code: 4257 ("HCLS").
- IRC Channel: irc.w3.org port 6665 channel #HCLS (see W3C IRC page for details, or see Web IRC)
- Duration: ~1h
- Convener: Susie
Agenda
- Mapping the WHO Global Health Atlas - Amrapali Zaveri
- Mapping experimental data
- Identify second data set & questions - All
- Discuss methodology of setting up the hypothetical data environment - Elgar
- Begin answering identified questions from paper - Susie
- Data updates - Egon, Matthias, Anja, Oktie
- AOB
Minutes
Attendees: Janos, Matthias, Elgar, Claus, Susie, Joanne, Amrapali
Apologies: Bosse, Oktie
<Susie> Claus introduction
<Susie> Claus works for Lundbeck
<Susie> Want to link to external data
<Susie> Best Practices Document
<matthias_samwald> susie: one topic: the overarching goal of how to make linked data available.
<matthias_samwald> susie: ADNI data (Alzheimer's Disease Neuroimaging Initiative longitudinal data)
<matthias_samwald> susie: elgar proposed how to set up a hypothetical environment
<Susie> http://esw.w3.org/images/2/28/DataAggregationMethodology.pdf
<matthias_samwald> elgar: last time we talked about a document of best practices there was a misunderstanding. i was just thinking about the steps i go through (metaconcepts, where are the problems, where would i like to be). nothing groundbreaking.
<matthias_samwald> elgar: first, the metamodel. when i know what concepts are in datasets i think about how to model them.
<matthias_samwald> elgar: then i choose an implementation, e.g., which datastore, how is it stored
<matthias_samwald> ... the third step is the analysis / presentation step
<matthias_samwald> ... i spend most of my time with these three steps, HCLS does as well, although we should spend much more time on the third step (how analyses are represented to users)
<matthias_samwald> ... what i call class concept identification: i identify concepts in the sources. another important step: identifying identical and new concepts.
<matthias_samwald> ... useful to know what a CLASS and what an INSTANCE is.
<matthias_samwald> ... after you have done concept identification, you structure them into hierarchies.
<matthias_samwald> ... we need to go from maps-to relations to something more sophisticated... equivalence mapping, more details about mapping
<matthias_samwald> ... we have lots of experience with how to RDFize data
<matthias_samwald> ... we represent information as graphs, this seems pretty much straightforward
<matthias_samwald> ... when it comes to statistical analysis, we did not pay much attention to how data is used
<matthias_samwald> ... and how it needs to be presented
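The move Elgar describes from plain "maps-to" relations to more sophisticated equivalence mappings can be sketched in a few lines. This is a hypothetical illustration, not the group's actual tooling: the concept URIs and the curator-confidence labels are invented, and the example simply emits N-Triples with the standard SKOS mapping properties instead of a single undifferentiated predicate.

```python
# Hypothetical sketch: refining generic "maps-to" links into typed
# SKOS mappings. All URIs below are illustrative placeholders.
SKOS = "http://www.w3.org/2004/02/skos/core#"

# Cross-dataset links as they might come out of an RDFization step,
# annotated with how confident the curator is in the match.
raw_links = [
    ("http://example.org/a/Alzheimers", "http://example.org/b/AD", "exact"),
    ("http://example.org/a/Dementia",   "http://example.org/b/AD", "close"),
]

def typed_mappings(links):
    """Emit N-Triples using skos:exactMatch / skos:closeMatch
    instead of one undifferentiated maps-to predicate."""
    pred = {"exact": f"{SKOS}exactMatch", "close": f"{SKOS}closeMatch"}
    return [f"<{s}> <{pred[kind]}> <{t}> ." for s, t, kind in links]

for line in typed_mappings(raw_links):
    print(line)
```

The point of the extra detail is that a consumer can then decide whether to merge two concepts outright (exactMatch) or merely cross-reference them (closeMatch).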
<matthias_samwald> susie: thanks.
<matthias_samwald> ... for the best practices document, we should think about it in terms of effectively making data available
<matthias_samwald> ... one thing we should consider is how far we should 'pretend' implementing these things
<matthias_samwald> ... elgar was highlighting the presentation layer -- is that where we start? e.g. ADME dataset.
<matthias_samwald> elgar: my desire was to work in groups, where i understand what the problems are and how they are solved. i would say: pick a good question first
<matthias_samwald> elgar: we have tried to come up with lots of use cases for TMO, then we ended up with one that we actually used, not necessary to find too many questions
<matthias_samwald> elgar: we should shift from RDFization to data analysis
<matthias_samwald> susie: it would be good to focus on one use-case, but one use-case could have several (e.g. 3) questions, for example
<matthias_samwald> ... i was not thinking that the best practices document would be too focused on presentation, but rather on how to make data available as RDF. i would keep that focus, but you have a good point. the presentation layer has been a weak spot.
<matthias_samwald> ... i invited Chris Bouton, who was head of integrated data mining; he built a linked data browsing tool that i think looked pretty nice.
<matthias_samwald> ... resembling a mind-map
<matthias_samwald> ... i asked him to give a demo
<matthias_samwald> ... he actually used the LODD datasets
<matthias_samwald> ... he talked about making this tool freely available to academia and to make it available to our group, to make LODD data explorable on our web site
<matthias_samwald> ... the Allen Brain Atlas is making data available as linked data
<matthias_samwald> ... could be a connection to ADNI data
<matthias_samwald> susie: i invited someone from the allen brain atlas as well
<matthias_samwald> matthias: the availability of the allen brain atlas is exciting
<matthias_samwald> susie: could you look at how to connect to ADNI data?
<matthias_samwald> matthias: yes.
<matthias_samwald> TOPIC: Amrapali Zaveri talks about converting WHO dataset
<Susie> http://esw.w3.org/HCLSIG/LODD/Meetings/2010-09-01_Conference_Call
<matthias_samwald> amrapali: the presentation is how we turned excel sheets into RDF
<matthias_samwald> ... we used scovo
<matthias_samwald> ... and a plugin in ontowiki
<matthias_samwald> amrapali: (describes SCOVO, see http://semanticweb.org/wiki/Scovo)
<matthias_samwald> amrapali: slide 5 -- doing the interconversion automatically was not possible, there had to be some user intervention
<matthias_samwald> ... we created a plugin in ontowiki, developed by AKSW group
<matthias_samwald> ... the plugin converts CSV data into RDF
<matthias_samwald> ... slide 6 -- first we created empty knowledge base
<matthias_samwald> ... slide 7 -- imported CSV
<matthias_samwald> ... country, population, diseases
<matthias_samwald> ... we view the resources in ontowiki
<matthias_samwald> ... on the left side are the dimensions (of type class)
<matthias_samwald> ... on the right side you can see a particular data item
<matthias_samwald> ... slide 13 -- N-Triples of SCOVOfied data
<matthias_samwald> ... 2,000 data items, 50,000 triples
<matthias_samwald> ... there were a few challenges -- slide 14 -- some excel sheets contained a particular hierarchy
<matthias_samwald> ... we are also looking into converting other WHO databases
<matthias_samwald> ... probably using SCOVO
<matthias_samwald> ... but they are all in different formats
<matthias_samwald> ... another piece of future work: evolution of knowledge bases
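The SCOVO conversion Amrapali walks through (each cell becomes a statistical item linked to its dimensions) can be sketched roughly as follows. This is a minimal illustration, not the AKSW OntoWiki plugin: the namespace, column names, and sample data are all invented for the example, which turns each CSV row into one scovo:Item with rdf:value and scovo:dimension links.

```python
# Minimal sketch of SCOVO-style CSV-to-RDF conversion (hypothetical
# namespace and column names; the actual OntoWiki plugin differs).
import csv
import io

SCOVO = "http://purl.org/NET/scovo#"
RDF_VALUE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#value"
EX = "http://example.org/who/"  # placeholder namespace, not WHO's

def scovofy(csv_text):
    """Turn each CSV row into one scovo:Item with dimension links,
    emitted as N-Triples lines."""
    triples = []
    for i, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
        item = f"<{EX}item/{i}>"
        # Attach the item to its dataset and record the cell value.
        triples.append(f"{item} <{SCOVO}dataset> <{EX}dataset/atlas> .")
        triples.append(f'{item} <{RDF_VALUE}> "{row["cases"]}" .')
        # Country, disease, and year act as SCOVO dimensions.
        for dim in ("country", "disease", "year"):
            triples.append(f"{item} <{SCOVO}dimension> <{EX}{dim}/{row[dim]}> .")
    return triples

sample = "country,disease,year,cases\nKenya,Malaria,2005,12345\n"
for t in scofied if False else scovofy(sample):
    print(t)
```

Each row thus yields five triples (dataset link, value, three dimensions), which matches the general shape of the "2,000 data items, 50,000 triples" figure: a handful of triples per statistical observation.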
<matthias_samwald> susie: thanks
<matthias_samwald> joanne: how long did it take?
<matthias_samwald> amrapali: 5 minutes start-to-finish using the tool. the entire process took about 4 hours for the whole dataset. someone who never used the tool would need over a day, maybe a week.
<matthias_samwald> amrapali: next steps are to look at the taxonomies of diseases, how to semantically enrich the data
<matthias_samwald> matthias: interesting to see OntoWiki used. OntoWiki is very useful because it is based on a pure RDF database. will look into exposing the HCLS KB via OntoWiki. also interesting to see how the plugin was made
<matthias_samwald> Susie: any news about data update?
<matthias_samwald> matthias: nothing unfortunately, it seems like a lot of interesting datasets have been RDFized by now.
<matthias_samwald> susie: interesting point. still best practice document is very important, especially for people with company-internal data.
<matthias_samwald> claus: looking forward to that document!
<matthias_samwald> susie: oktie said he will publish improvement of LinkedCT in october
<matthias_samwald> janosz: updated RxNorm, work on link to UMLS
<matthias_samwald> susie: next call in two weeks, Chris Bouton will give a presentation
<matthias_samwald> ... also focus on best practices
<matthias_samwald> bye!