HCLSIG/LODD/Meetings/2010-09-01 Conference Call
Conference Details
- Date of Call: Wednesday September 1, 2010
- Time of Call: 11:00am Eastern Daylight Time (EDT), 16:00 British Summer Time (BST), 17:00 Central European Summer Time (CEST)
- Dial-In #: +1.617.761.6200 (Cambridge, MA)
- Dial-In #: +33.4.26.46.79.03 (Paris, France)
- Dial-In #: +44.203.318.0479 (London, UK)
- Participant Access Code: 4257 ("HCLS").
- IRC Channel: irc.w3.org port 6665 channel #HCLS (see W3C IRC page for details, or see Web IRC)
- Duration: ~1h
- Convener: Susie
Agenda
- Mapping the WHO Global Health Atlas - Amrapali Zaveri
- Mapping experimental data
- Identify second data set & questions - All
- Discuss methodology of setting up the hypothetical data environment - Elgar
- Begin answering identified questions from paper - Susie
- Data updates - Egon, Matthias, Anja, Oktie
- AOB
Minutes
Attendees: Janos, Matthias, Elgar, Claus, Susie, Joanne, Amrapali
Apologies: Bosse, Oktie
<Susie> Claus introduction
<Susie> Claus works for Lundbeck
<Susie> Want to link to external data
<Susie> Best Practices Document
<matthias_samwald> susie: one topic: the overarching goal of how to make linked data available.
<matthias_samwald> susie: ADNI data (Alzheimer's Disease Neuroimaging Initiative longitudinal data)
<matthias_samwald> susie: elgar proposed how to set up a hypothetical environment
<Susie> http://esw.w3.org/images/2/28/DataAggregationMethodology.pdf
<matthias_samwald> elgar: last time we talked about a document of best practices there was a misunderstanding. i was just thinking about the steps i go through (metaconcepts, where are the problems, where would i like to be). nothing groundbreaking.
<matthias_samwald> elgar: first, the metamodel. when i know what concepts are in datasets i think about how to model them.
<matthias_samwald> elgar: then i choose an implementation, e.g., which datastore, how is it stored
<matthias_samwald> ... the third step is the analysis / presentation step
<matthias_samwald> ... i spend most of my time with these three steps, HCLS does as well, although we should spend much more time on the third step (how analyses are represented to users)
<matthias_samwald> ... what i call class concept identification: i identify concepts in the sources. another important step: identifying identical and new concepts.
<matthias_samwald> ... useful to know what a CLASS and what an INSTANCE is.
<matthias_samwald> ... after you have done concept identification, you structure them into hierarchies.
<matthias_samwald> ... we need to go from maps-to relations to something more sophisticated... equivalence mapping, more details about mapping
<matthias_samwald> ... we have lots of experience with how to RDFize data
<matthias_samwald> ... we represent information as graphs, this seems pretty much straightforward
<matthias_samwald> ... when it comes to statistical analysis, we did not pay much attention to how data is used
<matthias_samwald> ... and how it needs to be presented
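The move Elgar describes from plain "maps-to" relations to more sophisticated equivalence mappings can be sketched in a few lines. This is a hypothetical illustration, not the group's actual tooling: the concept URIs and the curator-confidence labels are invented, and the example simply emits N-Triples with the standard SKOS mapping properties instead of a single undifferentiated predicate.

```python
# Hypothetical sketch: refining generic "maps-to" links into typed
# SKOS mappings. All URIs below are illustrative placeholders.
SKOS = "http://www.w3.org/2004/02/skos/core#"

# Cross-dataset links as they might come out of an RDFization step,
# annotated with how confident the curator is in the match.
raw_links = [
    ("http://example.org/a/Alzheimers", "http://example.org/b/AD", "exact"),
    ("http://example.org/a/Dementia",   "http://example.org/b/AD", "close"),
]

def typed_mappings(links):
    """Emit N-Triples using skos:exactMatch / skos:closeMatch
    instead of one undifferentiated maps-to predicate."""
    pred = {"exact": f"{SKOS}exactMatch", "close": f"{SKOS}closeMatch"}
    return [f"<{s}> <{pred[kind]}> <{t}> ." for s, t, kind in links]

for line in typed_mappings(raw_links):
    print(line)
```

The point of the extra detail is that a consumer can then decide whether to merge two concepts outright (exactMatch) or merely cross-reference them (closeMatch).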
<matthias_samwald> susie: thanks.
<matthias_samwald> ... for the best practices document, we should think about it in terms of effectively making data available
<matthias_samwald> ... one thing we should consider is how far we should 'pretend' implementing these things
<matthias_samwald> ... elgar was highlighting the presentation layer -- is that where we start? e.g. ADME dataset.
<matthias_samwald> elgar: my desire was to work in groups, where i understand what the problems are and how they are solved. i would say: pick a good question first
<matthias_samwald> elgar: we have tried to come up with lots of use cases for TMO, then we ended up with one that we actually used, not necessary to find too many questions
<matthias_samwald> elgar: we should shift from RDFization to data analysis
<matthias_samwald> susie: it would be good to focus on one use-case, but one use-case could have several (e.g. 3) questions, for example
<matthias_samwald> ... i was not thinking that the best practices document would be too focused on presentation, but rather on how to make data available as RDF. i would keep that focus, but you have a good point. the presentation layer has been a weak spot.
<matthias_samwald> ... i invited Chris Bouton, who was head of integrated data mining; he built a linked data browsing tool that i think looked pretty nice.
<matthias_samwald> ... resembling a mind-map
<matthias_samwald> ... i asked him to give a demo
<matthias_samwald> ... he actually used the LODD datasets
<matthias_samwald> ... he talked about making this tool freely available to academia and to make it available to our group, to make LODD data explorable on our web site
<matthias_samwald> ... the Allen Brain Atlas is making data available as linked data
<matthias_samwald> ... could be a connection to ADNI data
<matthias_samwald> susie: i invited someone from the allen brain atlas as well
<matthias_samwald> matthias: the availability of the allen brain atlas is exciting
<matthias_samwald> susie: could you look at how to connect to ADNI data?
<matthias_samwald> matthias: yes.
<matthias_samwald> TOPIC: Amrapali Zaveri talks about converting WHO dataset
<Susie> http://esw.w3.org/HCLSIG/LODD/Meetings/2010-09-01_Conference_Call
<matthias_samwald> amrapali: the presentation is how we turned excel sheets into RDF
<matthias_samwald> ... we used scovo
<matthias_samwald> ... and a plugin in ontowiki
<matthias_samwald> amrapali: (describes SCOVO, see http://semanticweb.org/wiki/Scovo)
<matthias_samwald> amrapali: slide 5 -- doing the interconversion automatically was not possible, there had to be some user intervention
<matthias_samwald> ... we created a plugin in ontowiki, developed by AKSW group
<matthias_samwald> ... the plugin converts CSV data into RDF
<matthias_samwald> ... slide 6 -- first we created empty knowledge base
<matthias_samwald> ... slide 7 -- imported CSV
<matthias_samwald> ... country, population, diseases
<matthias_samwald> ... we view the resources in ontowiki
<matthias_samwald> ... on the left side are the dimensions (of type class)
<matthias_samwald> ... on the right side you can see a particular data item
<matthias_samwald> ... slide 13 -- N-Triples of SCOVOfied data
<matthias_samwald> ... 2,000 data items, 50,000 triples
<matthias_samwald> ... there were a few challenges -- slide 14 -- some excel sheets contained a particular hierarchy
<matthias_samwald> ... we are also looking into converting other WHO databases
<matthias_samwald> ... probably using SCOVO
<matthias_samwald> ... but they are all in different formats
<matthias_samwald> ... another piece of future work: evolution of knowledge bases
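The SCOVO conversion Amrapali walks through (each cell becomes a statistical item linked to its dimensions) can be sketched roughly as follows. This is a minimal illustration, not the AKSW OntoWiki plugin: the namespace, column names, and sample data are all invented for the example, which turns each CSV row into one scovo:Item with rdf:value and scovo:dimension links.

```python
# Minimal sketch of SCOVO-style CSV-to-RDF conversion (hypothetical
# namespace and column names; the actual OntoWiki plugin differs).
import csv
import io

SCOVO = "http://purl.org/NET/scovo#"
RDF_VALUE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#value"
EX = "http://example.org/who/"  # placeholder namespace, not WHO's

def scovofy(csv_text):
    """Turn each CSV row into one scovo:Item with dimension links,
    emitted as N-Triples lines."""
    triples = []
    for i, row in enumerate(csv.DictReader(io.StringIO(csv_text))):
        item = f"<{EX}item/{i}>"
        # Attach the item to its dataset and record the cell value.
        triples.append(f"{item} <{SCOVO}dataset> <{EX}dataset/atlas> .")
        triples.append(f'{item} <{RDF_VALUE}> "{row["cases"]}" .')
        # Country, disease, and year act as SCOVO dimensions.
        for dim in ("country", "disease", "year"):
            triples.append(f"{item} <{SCOVO}dimension> <{EX}{dim}/{row[dim]}> .")
    return triples

sample = "country,disease,year,cases\nKenya,Malaria,2005,12345\n"
for t in scofied if False else scovofy(sample):
    print(t)
```

Each row thus yields five triples (dataset link, value, three dimensions), which matches the general shape of the "2,000 data items, 50,000 triples" figure: a handful of triples per statistical observation.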
<matthias_samwald> susie: thanks
<matthias_samwald> joanne: how long did it take?
<matthias_samwald> amrapali: 5 minutes start-to-finish using the tool. the entire process took about 4 hours for the whole dataset. someone who never used the tool would need over a day, maybe a week.
<matthias_samwald> amrapali: next steps are to look at the taxonomies of diseases, how to semantically enrich the data
<matthias_samwald> matthias: interesting to see OntoWiki used. OntoWiki is very useful because it is based on a pure RDF database. will look into exposing the HCLS KB via OntoWiki. also interesting to see how the plugin was made
<matthias_samwald> Susie: any news about data update?
<matthias_samwald> matthias: nothing unfortunately, it seems like a lot of interesting datasets have been RDFized by now.
<matthias_samwald> susie: interesting point. still best practice document is very important, especially for people with company-internal data.
<matthias_samwald> claus: looking forward to that document!
<matthias_samwald> susie: oktie said he will publish improvement of LinkedCT in october
<matthias_samwald> janosz: updated RxNorm, work on link to UMLS
<matthias_samwald> susie: next call in two weeks, Chris Bouton will give a presentation
<matthias_samwald> ... also focus on best practices
<matthias_samwald> bye!