15:01:06 RRSAgent has joined #HCLS
15:01:06 logging to http://www.w3.org/2012/05/21-HCLS-irc
15:01:25 amrapali has joined #hcls
15:01:36 amrapali has joined #hcls
15:03:26 Jun has joined #hcls
15:03:40 matthias_samwald has joined #hcls
15:03:54 ram has joined #hcls
15:04:02 Zakim, who is here?
15:04:02 sorry, mscottm, I don't know what conference this is
15:04:03 On IRC I see ram, matthias_samwald, Jun, amrapali, RRSAgent, Zakim, Janos, egombocz, rkiefer, bbalsa, achille_zappa, mscottm, ericP
15:04:11 Zakim, this is hcls
15:04:11 ok, mscottm; that matches SW_HCLS(BioRDF)11:00AM
15:04:15 Zakim, who is here?
15:04:15 On the phone I see Tony, +1.510.705.aaaa, tlebo, +1.631.444.aabb, ??P9, Scott_Marshall, ??P1, +46.7.08.13.aadd, ??P13, Chimezie
15:04:17 On IRC I see ram, matthias_samwald, Jun, amrapali, RRSAgent, Zakim, Janos, egombocz, rkiefer, bbalsa, achille_zappa, mscottm, ericP
15:04:18 matthias_samwald1 has joined #hcls
15:04:43 http://dl.dropbox.com/u/21690634/Quantifying%20RDF%20data%20sets.pdf
15:04:44 +??P15
15:05:27 + +1.412.623.aaee
15:05:30 BrianLowe has joined #hcls
15:05:40 HarryH has joined #hcls
15:05:47 chimezie has joined #hcls
15:06:03 Zakim, who is on the phone?
15:06:03 On the phone I see Tony, +1.510.705.aaaa, tlebo, +1.631.444.aabb, ??P9, Scott_Marshall, ??P1, +46.7.08.13.aadd, ??P13, Chimezie, ??P15, +1.412.623.aaee
15:07:00 412 623 is me - Harry Hochheiser Pittsburgh
15:07:01 Brian Lowe: Developer on VIVO project, Susan Mitchell also works as developer / ontology on VIVO
15:07:36 Harry Hochheiser - University of Pittsburgh, interested in HCLS
15:09:03 Brian Lowe: Developer on VIVO project, Stella Mitchell also works as developer / ontology on VIVO
15:09:06 Stella has joined #hcls
15:09:09 Ram from Metaome - We have a life science search engine called DistilBio (distilbio.com)
15:10:01 scribe: Jun
15:10:27 michael has joined #hcls
15:10:34 s/Susan/Stella/
15:11:08 + +1.206.732.aaff
15:11:52 Chimezie Ogbuji - Cleveland Clinic, Case Western, Recently started a startup
15:12:34 http://dl.dropbox.com/u/21690634/Quantifying%20RDF%20data%20sets.pdf
15:12:55 + +1.857.250.aagg
15:13:31 Zakim, mute me
15:13:31 Chimezie should now be muted
15:13:53 Scott: introduce Janos' talk: it's important to differentiate RDF datasets, apart from by their content, licenses, etc
15:13:54 mattgamble has joined #hcls
15:15:49 VIVO - scientific research network ontology
15:15:57 Janos: one of the members of CTSA Connect graduate programme, to connect two major ontologies, VIVO and ***, to connect clinical sciences data
15:16:23 yes, I do
15:16:34 +Tony.a
15:16:38 http://dl.dropbox.com/u/21690634/Quantifying%20RDF%20data%20sets.pdf
15:16:52 BobF has joined #hcls
15:17:09 Slide 1: a lot of further work. this just presents a start
15:17:26 slide 2
15:17:47 Janos: Semantic Web is based on RDF, a graph-based data model
15:18:12 CTSA Connect: http://www.ctsaconnect.org/about-us
15:19:15 ... more flexible than relational DBs by allowing parallel edges
15:19:18 slide 3
15:19:41 Janos: a paper submitted to the Triple Challenge 2010
15:19:59 ... they did some quantification of datasets, looking into the internal structure of the data
15:20:08 ... drew on some of the approaches of this paper
15:20:32 ... took a look at the datasets of the challenge, and did some structural analysis and others
15:20:48 slide 4
15:21:22 AmitSheth has joined #hcls
15:21:51 Janos: a basic python library to parse n-triples. it's a memory-based approach, and does some processing. based on PyPy
15:22:26 ... PyPy for just-in-time compiling. speeds up the processing
15:22:26 Amit has joined #hcls
15:22:41 conference is full! cannot join by voice
15:23:02 ... just some basic statistical analysis, then started to do some pattern matching analysis. not by using SPARQL endpoint
15:23:25 ... each file is treated as its own graph. didn't use Named Graphs
15:23:51 Q: on scalability
15:24:08 Janos: largest one is LinkedCT
15:24:38 ... 28 million triples. took 30% of a 64G memory
15:25:20 ... SPARQL1.1 might provide better performance promises
15:25:28 slide 5
15:26:00 ... started with some basic counts
15:27:46 slide 6
15:28:21 Janos: do some simple fraction calculations
15:28:33 ... e.g., how many literals in your triples
15:28:40 ... how many literals are unique?
15:28:48 ... how many objects are unique?
15:29:43 ... structure measurement, by taking out the typing sort of information and literals
15:30:06 egonw_ has joined #HCLS
15:30:24 ... subject/object coverage, more pointing or more pointed?
15:30:39 ... more concrete examples to follow
15:30:45 slide 7
15:31:47 scribenick: Jun
15:31:53 Janos: computed it against a couple of LOD datasets, 4 of the LODD, DailyMed, LinkedCT, DrugbankRDF, RxNorm
15:32:55 ... BioGrid database: an open access DB on Protein and Genetic Interactions
15:33:09 ... BioPAX: pathways in BioPAX format
15:33:21 ... bioGrid can be downloaded via OWL format
15:33:42 ... VIVO: NIH funded project for scientific networking
15:34:03 ... got n-triples for VIVO dataset
15:34:30 ... go through by the number of triples, descending
15:34:38 slide 8
15:35:13 Janos: top subjects, top classes, predicates, etc
15:35:21 markthompson has joined #hcls
15:35:24 ... give you a good idea of how people use ontologies
15:36:15 ... LinkedCT: 40% are literals, objects have 80% repetition
15:36:31 ... three dominant classes
15:36:54 Michael: have you done this analysis on the GO ontology?
15:37:00 Janos: not yet
15:37:09 Michael: expecting more diverse coverage
15:37:21 Janos: would be interesting to look at
15:37:24 slide 9
15:37:38 Janos: BioGrid in BioPAX
15:37:58 ... 50MB in owl but 40 million triples in n-triple format
15:38:29 ... again, subject, object coverage, and top classes. they are not LOD yet
15:38:41 egonw_ has joined #HCLS
15:38:53 ... get a good sense of what's actually in the content
15:38:54 slide 10
15:39:00 Janos: RxNorm
15:39:13 ... only 6 classes. pretty small
15:39:36 ... quite a bit of literals. structure data is higher than other datasets
15:41:35 Q: do you see big structural differences between these datasets?
15:41:38 Janos: TBD
15:41:49 slide 11
15:42:04 Janos: 1.2 million triples
15:42:37 ... data about publications, such as Authorship, Person ...
15:42:44 -??P15
15:42:59 ... publication is dominant data source there. pretty good subject/object coverage
15:44:11 slide 12
15:44:59 Janos: it has a lot of links to outside datasets, have a much higher object coverage
15:45:11 slide 13
15:45:33 Janos: top predicate: owl:sameAs. again has a lot of links to outside datasets
15:48:50 Scott: any idea about how one type of matrix could be more useful than another, or searching for others?
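[Editor's sketch] The counts and fractions scribed for slides 5 to 7 (literal fraction, unique literals, unique objects, subject/object coverage, top classes and predicates) can be illustrated roughly as below. This is not Janos' library, which is not shown in the log. The simplified regex parser, the function names, the sample data under example.org, and in particular the coverage definition (share of resource nodes that appear as a subject or as an object) are assumptions made for illustration only.

```python
import re
from collections import Counter

# Simplified N-Triples line pattern (assumption: one triple per line,
# no escaped edge cases). Not a full N-Triples parser.
TRIPLE_RE = re.compile(r'^\s*(<[^>]*>|_:\S+)\s+(<[^>]*>)\s+(.+?)\s*\.\s*$')
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"


def parse_ntriples(lines):
    """Return a list of (subject, predicate, object) string tuples."""
    triples = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = TRIPLE_RE.match(line)
        if match:
            triples.append(match.groups())
    return triples


def is_literal(term):
    """Literals start with a double quote in N-Triples."""
    return term.startswith('"')


def dataset_metrics(triples):
    """Basic counts and fractions over an in-memory list of triples."""
    total = len(triples)
    objects = [o for _, _, o in triples]
    literals = [o for o in objects if is_literal(o)]
    subjects = {s for s, _, _ in triples}
    resource_objects = {o for o in objects if not is_literal(o)}
    nodes = subjects | resource_objects  # all non-literal nodes

    def frac(part, whole):
        return part / whole if whole else 0.0

    return {
        "triples": total,
        "literal_fraction": frac(len(literals), total),
        "unique_literal_fraction": frac(len(set(literals)), len(literals)),
        "unique_object_fraction": frac(len(set(objects)), len(objects)),
        # Hypothetical coverage definition, chosen for this sketch only.
        "subject_coverage": frac(len(subjects), len(nodes)),
        "object_coverage": frac(len(resource_objects), len(nodes)),
        "top_predicates": Counter(p for _, p, _ in triples).most_common(3),
        "top_classes": Counter(
            o for _, p, o in triples if p == RDF_TYPE
        ).most_common(3),
    }


# Made-up sample data; the IRIs under example.org are placeholders.
SAMPLE = """
<http://example.org/drug1> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/Drug> .
<http://example.org/drug2> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://example.org/Drug> .
<http://example.org/drug1> <http://example.org/label> "Aspirin" .
<http://example.org/drug2> <http://example.org/label> "Aspirin" .
<http://example.org/drug1> <http://www.w3.org/2002/07/owl#sameAs> <http://example.org/drug2> .
"""

metrics = dataset_metrics(parse_ntriples(SAMPLE.splitlines()))
for name, value in metrics.items():
    print(name, value)
```

Holding everything in Python lists and sets mirrors the memory-based approach mentioned in the talk, so memory use grows with the number of triples; a streaming variant would be needed for datasets that do not fit in RAM.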
15:49:04 s/Scott/mscottm/
15:49:12 slide 14
15:49:28 s/matrix/metric/
15:49:32 Janos: there are a lot of tools for graph vis and analysis, but not so good with RDF data
15:50:41 +EricP
15:51:22 slide 15
15:52:38 Janos: the twist is to allow multiple paths between 2 nodes
15:52:40 slide 16
15:53:45 Janos: there are ways to collapse the parallel edges, or put RDF into XML, in order to use some graph analysis tools
15:54:06 slide 17
15:54:13 Janos: show some examples
15:54:38 ... get co-authors that are only members of a site, to get a smaller co-author network
15:54:46 slide 18
15:55:05 Janos: do some basic graph analysis using Mathematica
15:56:26 ... basic in-degrees, out-degrees, histograms, one/two degree separation etc
15:56:39 Nice!
15:56:47 slide 19
15:57:07 Janos: Gephi doesn't support parallel edges. you have to do some pre-processing
15:57:24 -tlebo
15:57:49 slide 20
15:58:02 Janos: some links
15:58:35 q+ have you thought about encoding some of the statistics using VoID?
15:58:57 thanks, janos, i need to drop off
15:59:02 - +1.206.732.aaff
15:59:40 -Tony
16:00:18 Eric: any further analysis on some of the results, like the social network?
16:00:33
16:01:42 First how do you work out which metrics are useful?
16:01:54 -EricP
16:03:26 - +46.7.08.13.aadd
16:05:23 Our Knowledge Explorer also provides metrics for weighing of connections in several ways
16:05:57 -??P1
16:06:10 Zakim, unmute me
16:06:10 Chimezie should no longer be muted
16:08:12 Chime - would you please jot your comment/question into IRC? I received an urgent call exactly when you started.. :(
16:08:28 -Tony.a
16:09:08 My question was whether he had considered using rdflib (https://github.com/RDFLib)
16:10:54 - +1.510.705.aaaa
16:11:50 michael has left #hcls
16:12:59 - +1.857.250.aagg
16:17:53 -Chimezie
16:18:49 CTSA Connect - ISF - Integrated Semantic Framework: core is combining VIVO ontology and eagle-i ontology
16:19:28 Thanks, Janos - very interesting!
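[Editor's sketch] The pre-processing mentioned for slides 16 to 19 (collapsing parallel edges so that tools without multigraph support can be used, then computing in-degrees and out-degrees) might look roughly like the following. The log does not say how the edges were actually collapsed; summing the number of predicates into an edge weight is one possible choice and is an assumption here, as are the function names and the made-up co-author data.

```python
from collections import Counter


def collapse_parallel_edges(triples):
    """Merge all predicates between the same ordered (subject, object)
    pair into a single edge whose weight is the number of merged triples.
    Assumption: weight-by-count is one of several possible strategies."""
    weights = Counter()
    for subject, _predicate, obj in triples:
        weights[(subject, obj)] += 1
    return weights


def degrees(edges):
    """In-degree and out-degree per node over the collapsed (simple) graph."""
    in_degree, out_degree = Counter(), Counter()
    for subject, obj in edges:
        out_degree[subject] += 1
        in_degree[obj] += 1
    return in_degree, out_degree


# Hypothetical data: two parallel edges from alice to bob.
SAMPLE_TRIPLES = [
    ("alice", "coAuthorOf", "bob"),
    ("alice", "colleagueOf", "bob"),
    ("bob", "coAuthorOf", "carol"),
]

edges = collapse_parallel_edges(SAMPLE_TRIPLES)
in_degree, out_degree = degrees(edges)
print(dict(edges))
print(dict(in_degree), dict(out_degree))
```

The collapsed edge list could then be exported to whatever format a graph tool accepts; which predicates to keep or drop before collapsing is a separate modelling decision not covered by this sketch.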
16:19:49 Thanks Janos
16:20:02 thanks all, bye
16:20:04 - +1.412.623.aaee
16:20:08 -Scott_Marshall
16:20:10 -??P13
16:20:10 - +1.631.444.aabb
16:20:12 bye all
16:20:12 -??P9
16:20:12 SW_HCLS(BioRDF)11:00AM has ended
16:20:12 Attendees were Tony, +1.510.705.aaaa, tlebo, +1.631.444.aabb, +46.7.08.13.aacc, Scott_Marshall, +46.7.08.13.aadd, Chimezie, +1.412.623.aaee, +1.206.732.aaff, +1.857.250.aagg, EricP
16:20:24 Zakim, please draft minutes
16:20:24 I don't understand 'please draft minutes', mscottm
16:21:02 RRSagent, draft minutes
16:21:02 I have made the request to generate http://www.w3.org/2012/05/21-HCLS-minutes.html mscottm
16:21:04 rrsagent, make log world-visible
17:05:40 matthias_samwald has joined #hcls
17:27:45 mscottm has joined #hcls
18:21:37 egonw_ has joined #HCLS
18:29:19 Zakim has left #HCLS