SV_MEETING_TITLE -- 27 Aug 2012

<ericP> Q: is there a way to share templates?

<ericP> Lena: we have bioportal

<Lena> new version of my slides: https://confluence.deri.ie:8443/download/attachments/40304735/bos_linkedData4LifeSciences1.pptx

<Lena> these are the slides for the afternoon: https://confluence.deri.ie:8443/download/attachments/40304735/BigLinkedDataLifeSciences_20120827_bos.pptx

Introductions

<ericP> scribenick: dbooth

<ericP> dbooth: David Booth, involved with Cleveland Clinic for several years

<ericP> ... working with SemWeb for research

<ericP> ... worked with Pangenex (SP?) for decision support

<ericP> ... keen user of SPARQL as a rules language

<bobP> PanGenX

Maryam: Working in sem in LS. Project in sem in LS. PubMed.

Nima: Harvard med school. Partners Healthcare clinical decision support system.

__: MIT. Financial models.

<ericP> Maryam Panahiazar - MUlti-Dimensional integrative approach to comparing genes for knowledge DIScovery

HelenaDeus: DERI
... Cancer research.

Sheng Yu: Research fellow, Harvard. Med records.

<ericP> Sheng Yu

<ericP> Loren Wilde

___: Clinician in audiology.

<ericP> Peter Mager

Peter: CS background. Looking at gene seq and synth biology. Also run a seminar series for IEEE. Having someone from Church lab about writing things on DNA sequences.

<ericP> stuart turner

StuartTurner: Veterinarian. Post doc in bioinformatics. Research in biosurveillance. Want to sem webify a project at NCI.

Frans: Entagen, int in deriv of structured data from unstructured.

MattMackdonald: Entagen.

ChrisBouton: Neurobio, engineer, sponsor, TripleMap.

Terence: Up-to-date, CDS, many teaching hospitals. In charge of info retrieval, and to get better int w EMR sys.

Yakubo: Visiting MIT for dissertation. Nat Lan Proc to symbolic then computation. Background philos. Formal modeling, logic.

Luke: Lead dev SADI, linking dynamic data into SW.

DanielaBourges: Harv Med School, working on SW years, Eagle-I project.

<Lena> (created a Twitter hashtag for this hackathon: #hclshack)

JustinLancaster: CSO QuickBio BioMedServer. Dynamic modeling of sys biology. Came to LS after env sci 10 yrs ago. Simulate complex sys from the knowledge, in the pattern of OpenBEL, then dev automated hypothesis.

Dipankar: Classifiers, clustering. Don't work w SW but want to learn.

<ericP> julie mcmurry

JulieMcMurry: Eagle-I, int biomed resource data. Background in vaccine, immun.

<ericP> juliane schneider

JulianeSchneider: metadata librarian, Harvard Med School. LD, entity extraction, trying to hook MeSH and Medline and Astrophysics.

<Frans> small correction - Entagen product is TripleMap - triplemap.com

<Ray> Hi

<Justin> Hi Justin Lancaster (BiomedServer.com -- kwiKBio project) justin.lancaster@att.net --- based in Boston area.

<Lena> tweethash is #hclshack

Lee and ericP's SPARQL Tutorial

slides - http://www.cambridgesemantics.com/sparql-by-example/slides.html

[slide 1]

<Frans> anyone have a link for the meetup just described?

-> http://www.meetup.com/The-Cambridge-Semantic-Web-Meetup-Group/ Cambridge Semantic Web Meetup

[slide 2]

[slide 3]

<luke> Frans: this looks like the meetup http://www.meetup.com/The-Cambridge-Semantic-Web-Meetup-Group/

[slide 5]

<Frans> thanks - signed up

[slide 6]

Q: How do you specify where to get the data for the query?

A: That will be explained in a moment.

scribe: A query interface often has a place to put the URL of the data you wish to query.

[slide 9]

Try query at: http://librdf.org/query/

<ericP> http://128.30.7.30:8001/

[slide 10]

[slide 12]

[slide 14]

[slide 15]

[slide 18]

<mary> cannot get the slides from web <http://www.cambridgesemantics.com/semantic-university/sparql-by-example#q_negation_new_not_exists_r>

Q: How often are the XML datatype URIs dereferenced?

A: Almost never. The app doesn't look up the URIs, typically just the developer who needs to look up a detail about that datatype.

Q: SPARQL queries RDF resources. But how do you create RDF from other data sources and keep it updated?

A: There are a lot of tools to do that. It isn't standardized. There are software tools for mapping relational databases to RDF.

scribe: Same kind of ETL and data integration approaches already existing.

Q: When you design relational DBs, there are design considerations. Are there similar design considerations for RDF?

A: The design is in the developer's lap. There are fewer things that the DBA must think about, but it may increase in the future.

Q: In rel DB, there's a limited number of columns. But in PubMed, you have a gazillion docs, and all free text. What's the best practice to make sure the user can take advantage of that free text DB? how do you define the API or protocol between a DB designer and user?

A: SPARQL wiped them out. Make it so that the person who writes the queries has an intuitive graph pattern to walk.

Q: How do you pick out the data most useful to the users?

scribe: If you designed RDF data for PubMed, how do you pick what data is most useful?

A: One approach is to get users together and design what attributes you want. At the other end of the spectrum, we can pull out tons of information. But because in RDF we aren't limited by the number of columns, we're going to pull them all out, and let the users do SPARQL queries to pull out what they want.

scribe: For the first approach you may as well use a rel DB. For the other extreme, it isn't really practical. The approach I've seen in practice is an incremental path on that continuum. But incrementally adding new things based on user requirements can be done in RDF without breaking existing data. You don't have to go back and create new tables.

dbooth: Great answer!

Q: Are the entailment features in SPARQL intended for integrity constraints and checking?

Lee: No. Different tools and platforms are treating constraints and checking their own way.

EricP: in the TMO (Translational Medicine Ont), we were evolving the ont and it would break the results of SPARQL queries. We would track what queries broke (went from 5 answers to 0).
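A minimal sketch of the query-tracking practice EricP describes: rerun a fixed set of queries after each ontology change and flag any whose answer count dropped to zero. The in-memory triples, the helper functions, and the predicate names are hypothetical stand-ins for a real SPARQL endpoint; this is not the TMO group's actual tooling.

```python
# Hypothetical in-memory "store": a set of (subject, predicate, object) triples.
def count_answers(triples, predicate):
    """Count triples using the given predicate (stand-in for running a query)."""
    return sum(1 for (_, p, _) in triples if p == predicate)

def broken_queries(before, after, predicates):
    """Return predicates whose queries had answers before but none after."""
    return [p for p in predicates
            if count_answers(before, p) > 0 and count_answers(after, p) == 0]

# Assumed scenario: version 2 of the data renamed :hasDiagnosis to :diagnosedWith.
v1 = {("p1", ":hasDiagnosis", "copd"), ("p2", ":hasDiagnosis", "asthma"),
      ("p1", ":age", 61)}
v2 = {("p1", ":diagnosedWith", "copd"), ("p2", ":diagnosedWith", "asthma"),
      ("p1", ":age", 61)}

regressions = broken_queries(v1, v2, [":hasDiagnosis", ":age"])
print(regressions)
```

Only the query over the renamed predicate is reported; the :age query still returns an answer and is left alone.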

[slide 26]

[slide 27]

dbooth: You can also think of constraints as SPARQL queries that look for violations.
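A rough illustration of dbooth's point, assuming a made-up constraint that every :Patient must have a :birthDate. The data and names are hypothetical, and plain Python sets stand in for a SPARQL engine; the docstring shows approximately what the equivalent SPARQL 1.1 query might look like.

```python
# Hypothetical data: every :Patient is expected to have a :birthDate.
triples = {
    ("p1", "rdf:type", ":Patient"), ("p1", ":birthDate", "1950-02-11"),
    ("p2", "rdf:type", ":Patient"),
}

def missing_birth_date(triples):
    """Return patients lacking :birthDate, i.e. the rows a violation query returns.

    Roughly equivalent SPARQL 1.1:
      SELECT ?p WHERE { ?p a :Patient . FILTER NOT EXISTS { ?p :birthDate ?d } }
    """
    patients = {s for (s, p, o) in triples if p == "rdf:type" and o == ":Patient"}
    with_date = {s for (s, p, _) in triples if p == ":birthDate"}
    return sorted(patients - with_date)

violations = missing_birth_date(triples)
print(violations)  # an empty list would mean the constraint holds
```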

Q: Is it possible to query for an artist that has at least one of those attributes?

A: Yes. BOUND is a SPARQL construct that asks whether a variable has a value in the query, so you can use that.

Lee: I've been interested in doing a life-sciences-specific version of this tutorial. Let me know if you're interested in helping on that.

[slide 30]

[slide 31]

[slide 33]

[slide 35]

[slide 36]

Q: Is there a place where these best practices are being collected?

Lee: Semantic University is one place.

dbooth: A good way to develop a CONSTRUCT query is first to develop and debug it as a SELECT query, and then convert it to CONSTRUCT after you've debugged the WHERE clause.

Q: When would it be better to create a view using CONSTRUCT versus converting the query?

EricP: Question of materializing the view or not. Same trade-offs as in DB world.

Q: What if a value is unbound when you're doing CONSTRUCT?

A: That triple is automatically filtered out, per the SPARQL standard.
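The behaviour in this answer can be sketched in a few lines. This is a toy model of CONSTRUCT template instantiation, not a real SPARQL engine; the `construct` function, the template, and the bindings are all hypothetical.

```python
def construct(template, solutions):
    """Instantiate a triple template once per solution, skipping any triple
    that mentions a variable the solution leaves unbound."""
    out = set()
    for binding in solutions:
        for triple in template:
            # Terms starting with "?" are variables; everything else is a constant.
            terms = [binding.get(t) if t.startswith("?") else t for t in triple]
            if None not in terms:
                out.add(tuple(terms))
    return out

template = [("?artist", ":name", "?name"), ("?artist", ":homepage", "?page")]
solutions = [
    {"?artist": ":a1", "?name": "Ada", "?page": "http://example.org/ada"},
    {"?artist": ":a2", "?name": "Bo"},  # ?page unbound, e.g. from an OPTIONAL
]
graph = construct(template, solutions)
print(sorted(graph))
```

The second solution still contributes its :name triple; only the :homepage triple, which needs the unbound ?page, is dropped.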

[slide 37]

Q: Are you guaranteed that the Amazon and Nile lengths are in the same units?

A: No. You'd better be careful in your query. Good practice that I like (but it bothers ontologists): put the units in the predicate name, e.g. :lengthInKm
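A small sketch of why unit-bearing predicate names help: the consumer can normalize values by looking at the predicate alone. Only :lengthInKm comes from the discussion; :lengthInMiles, the helper, and the river figures are illustrative assumptions, not authoritative measurements.

```python
# Hypothetical predicates that carry their unit in the name.
KM_PER_UNIT = {":lengthInKm": 1.0, ":lengthInMiles": 1.609344}

def length_km(triples, river):
    """Return the river's length in km, converting based on the predicate used."""
    for s, p, o in triples:
        if s == river and p in KM_PER_UNIT:
            return o * KM_PER_UNIT[p]
    return None  # no length recorded with a known unit

# Illustrative numbers only.
triples = [(":Nile", ":lengthInKm", 6650.0), (":Amazon", ":lengthInMiles", 4000.0)]
nile, amazon = length_km(triples, ":Nile"), length_km(triples, ":Amazon")
print(nile, round(amazon, 1))
```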

[slide 39]

[slide 40]

Q: Queries involving time durations?

A: Yes, there is time arithmetic.

scribe: They're defined in terms of the XML Schema operators spec.
... But it isn't required for SPARQL 1.1 conformance.

[slide 42]

[slide 43]

[slide 46]

Q: Is there a difference between MINUS and using the old !BOUND idiom?

A: They're pretty much the same, except maybe some edge cases.

[slide 50]

[slide 51]

[slide 53]

[slide 54]

[slide 56]

EricP: SADI talk will be next, after lunch, at 1:30pm Eastern. Then Helena's talk after that.

<luke> Questions from Max:

<luke> How are triples hashed/indexed?

<luke> What sits between SPARQL and the web data?

<luke> Where is the logic stored for the relations?

<luke> How good are SPARQL queries for proprietary data?

[Lunch break until 1:30pm Eastern US]

SADI

<luke> Slides for this talk are at http://sadiframework.org/slides/MIT2012.pdf

<frago> thank you, I was about to ask about the slides

[slide 2]

<ericP> scribenick: ericP

[slide 3]

luke: use case for computed data
... my clinic recently changed their gold-standard for COPD factors
... was costly 'cause the data was all stored in the old format

[slide 4]

[slide 5]

luke: semantic web services (e.g. OWL-S) aim for the world where the service models the state of the universe before and after
... e.g. tell Siri to purchase plane tickets and do all the debiting etc.

[slide 6]

[slide 7]

<mary> sadi slide is not available!

huh, indeed

<frago> i'm following from http://sadiframework.org/slides/MIT2012.pdf

mary, just takes a while to download

[slide 11]

[slide 12]

luke: given a db of heights and weights, use a SADI service to query BMI
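As a rough sketch of the computed-data idea, the function below takes one individual's properties and returns the same individual decorated with a computed BMI value. It assumes height in metres and weight in kilograms; the dict layout, property names, and example URI are hypothetical, and a real SADI service would exchange RDF rather than Python dicts.

```python
def bmi_service(individual):
    """Take a dict of properties for one individual and return the same
    individual (same "id") decorated with a computed "bmi" property."""
    height_m = individual["height_m"]
    weight_kg = individual["weight_kg"]
    return {"id": individual["id"], "bmi": weight_kg / height_m ** 2}

# Hypothetical patient record.
patient = {"id": "http://example.org/patient/1", "height_m": 1.80, "weight_kg": 81.0}
result = bmi_service(patient)
print(result["id"], round(result["bmi"], 1))
```

Keeping the same identifier on input and output is what lets the computed value be merged back with the stored height and weight.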

<mary> thanks

[slide 17]

Maryam: can you orchestrate SADI services?

Luke: the goal is that *you* don't have to, that it happens for you
... the SHARE client exports the latest Taverna workflow format

[slide 19]

[slide 20]

[slide 22]

[slide 23]

luke: the input is a named individual, and the output is the same individual with a hello:greeting property

[slide 26]

luke: ideally service description would describe services
... for now, use the SADI registry

[slide 28]

luke: OWL reasoning in Java 'cause that's where the reasoners are

[slide 29]

luke: SHARE is a SPARQL processor which matches queries against the local store + the services in the registry
... also decomposes OWL classes
... if you have a bunch of triples which constitute an entity, you can use an OWL class to capture them

Maryam: how do you invoke WSDL services?

luke: SHARE is for SADI services
... you can describe that in WSDL, but it's just RDF-in/RDF-out
... there are ways to use e.g. SAWSDL or wrappers to make WSDL services available as SADI
... example: increasing creatinine (blood urea nitrogen) level indicates a rejected transplant

[slide 32]

includes OWL class patients:AtRiskPatient

<dbooth> http://biordf.net/cardioSHARE/

Luke: I want the genes in a pathway and the proteins they code for
... interface completes from LSRN (Life Sciences Resources Network)
... could use identifiers.org (if they export RDF)

[something close to http://sadiframework.org/content/2010/06/10/cardioshare-walkthrough/ ]

<luke> Download SHARE command-line client: https://code.google.com/p/sadi/wiki/SHAREClient

<luke> SHARE example queries: http://biordf.net/cardioSHARE/queries.html

-> http://sadi.googlecode.com/files/SHARE-client-0.1.jar share client jar

-> http://biordf.net/cardioSHARE/queries.html example queries

<Justin> What was command line string to execute the jar file?

<Justin> ... for the SHARE client

<mary> has anybody found the link for the paper?

<dbooth> justin, java -Xmx1024m -jar SHARE-client-0.1.jar

luke: [Re: http://biordf.net/cardioSHARE/queries.html #14]
... phd student in our lab trying to emulate clinical classification in OWL
... i.e. have the OWL reasoner perform diagnosis support like a clinician
... the measurement units in his data were inconsistent and frequently unspecified
... #14 demos that SHARE maps units to a standard, and can guess them when not specified

Maryam: phylogeny analysis is a hard case

Lena: no single tree of life
... also they are huge

luke: we needed a stable taxonomy so we're using one from NCBO
... if you build the tree from the ribosomal RNA, you hard-code your biases

luke: SADI can't solve the social prob, but can address the size
... we had a group doing molecular modeling
... a query for the polygons on a molecular surface exceeded 4G
... so back to the phylogeny use case, probably need to pack the hierarchy as a literal and unpack when needed
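One way to picture "pack the hierarchy as a literal" is to serialize the whole tree into a single string and parse it only when a client needs it. The sketch below uses JSON purely as an assumption for illustration; the discussion does not say which encoding was intended, and the taxonomy fragment is made up.

```python
import json

def pack(tree):
    """Serialize a nested hierarchy to one compact string (a single literal)."""
    return json.dumps(tree, separators=(",", ":"), sort_keys=True)

def unpack(literal):
    """Recover the hierarchy from the literal when a client actually needs it."""
    return json.loads(literal)

# Hypothetical fragment of a taxonomy.
tree = {"Bacteria": {"Proteobacteria": {}}, "Eukaryota": {"Metazoa": {}, "Fungi": {}}}
literal = pack(tree)        # one string instead of one triple per edge
restored = unpack(literal)
print(len(literal), restored == tree)
```

The trade-off is that the packed hierarchy is opaque to SPARQL graph patterns until it is unpacked.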

Peter Mager: can't you pass a parameter?

luke: yep, we have URIs and we can use them for this
... there's a predicate called rdfs:isDefinedBy which SADI derefs

maryam: phylogeny researchers care about methods used
... can i go through the info for methods?

luke: if they write it down, but we don't force them
... Jim McCusker proposed pointing to the code for the service in a public repo

ericP: for e.g. homology, you code the way something is known to be homologous

luke: we use OWL to infer that e.g. BLAST homology is a form of homology
... there are tools to use SAWSDL to make WSDL services be SADI services

StuartTurner: where do you see SADI going in the next few years?

luke: i described SADI in terms of a toolkit 'cause i work on the toolkit
... but SADI is just a set of practices
... i worked on a submission to W3C

<luke> link to SADI summary/spec: https://code.google.com/p/sadi/wiki/SADITrail

Big Data

-> https://confluence.deri.ie:8443/download/attachments/40304735/BigLinkedDataLifeSciences_20120827_bos.pptx Big Data slides

[slide 15]

[slide 32]

ericP: the trust axis can also capture latent nuances which make data more applicable

mark: the prob with federation is that we can't move 1k genome sets around on the network
... (without exotic fiber infrastructure)

Lena: but are you using all that data? can you select for the data you need?

mark: need to be able to run analysis on computers that you don't own
... e.g. i don't want to pull the 1k genome data and swissprot to my local computer

ericP: [mumbles about the Grid marrying the SemWeb]

[shipping code to data a la SciDB]

Justin: are you losing too much by believing that you will capture all of the understanding of the publishers? [@@corrections please]

Lena: i think the question is can we capture that knowledge
... if we can eliminate human subjectivity, you increase the quality of the data
... having spent a year manually capturing data, i recognize how imprecise it is

Justin: i'm not convinced that letting the machine crank on 1k dimensions will capture the intuitions of the scientist

peter: there's a lot of interesting data which is lost
... e.g. astronomers who fedex disk arrays around the country to do fourier analysis
... that data could still be useful but is not available

ericP: [meta data tracking of raw data, e.g. disk arrays]

Lena: [fold-it example]

<tez> "Big Data" should enable both cases debated

@@1: going back a few slides, you spoke of identifiers disappearing on you. that seems like a higher priority

<tez> How can the "crackpot" be accelerated ?

Lena: yes that's first, but it's basically a solved problem

TFMorris: i don't know what your example was, but how has this been solved?

Lena: through frameworks, exposing e.g. RDBs on the SemWeb
... my issue was URLs which were simply non-dereferenceable

maryam: one of the probs with reactome is that it's just for humans
... on KEGG we can't find a molecule in a pathway on certain days

Lena: need to capture context and provenance
... e.g. reactome is curated human data

maryam: how can we decide whether KEGG or Reactome is better?

Lena: we invented pathways to keep things in boxes
... (like species)
... we need to capture the underlying data
... imo, we need to talk about system states instead of pathways

Justin: there are so many variables which can be measured on a patient
... the data mining problem is so huge, but coming at it with a big data approach might allow us to compartmentalize and analyze

StuartTurner: in clinical care we have clinical practice to avoid opinion-based medicine
... humans make cognitive mistakes and have biases

Justin: we want the machine to amplify the human

Lena: can't we have the machine perform the standardized tests while humans work on e.g. new methodologies
