HCLSIG/SWANSIOC/Actions/RhetoricalStructure/meetings/20101115

Meeting November 15 2010, 10 am EDT

Agenda

1. Recap goals and use cases for the Rhetorical Document Model Subtask:

Coarse-grained model: Use case 3
Medium-grained model: Use case 2
Fine-grained model: Use case 1

2. ORB (Ontology of Rhetorical Blocks) progress: ORB OWL file

3. Medium-grained progress: first pass at Medium, very much open for discussion.

4. DoCo (Document Components Ontology) - David Shotton: DoCO v1.0

architecture (png)
Example (PDF) in article from the Journal of Cell Biology

---DoCo Background

5. Next steps.

Three alignment efforts:

1) ORB - MediumGrain,
2) MediumGrain - Data+Discourse,
3) ORB & MediumGrain - DoCO

Who to do: offline discussion.
When: report back on next call.

Minutes

Participants: Anita de Waard, Scott Marshall, Jodi Schneider, David Shotton, Joanne Luciano, Tim Clark, Silvio Peroni, Paolo Ciccarese
Scribe: Scott

Anita: Shall we Recap use cases?
1 - Would like to identify the section of the document where you've identified a gene, for example... application of Coarse-grained structure model, "ORB" (Ontology of Rhetorical Blocks) http://esw.w3.org/images/d/d2/Orb-0_1.owl
2 - Coarse-grained model is very simple: Introduction, Methods, Results and Discussion
3 - Medium-grained (DRO), identify part of section, paragraph level, research question

4 - Fine-grained sentence level, phrase, clause.
Paolo: Coarse-grained - We assume contiguous piece of text.
Anita: IMRaD Model = Introduction, Methods, Results and Discussion
Header: in Dublin Core/PRISM; Fabio/Bibo can model references
Classes are disjoint: coarse-grained blocks are contiguous, not overlapping in the article
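
The contiguity and disjointness assumptions above can be illustrated with a minimal sketch. It is not part of ORB or of any tooling discussed on the call: the function name, the (label, start, end) representation and the character offsets are all hypothetical, and the class names are simply those listed in these minutes.

```python
from typing import List, Tuple

# Coarse-grained class names as listed in these minutes (assumption:
# this is the full set; the offsets used below are invented).
ORB_CLASSES = {"Header", "Introduction", "Methods",
               "Results", "Discussion", "References"}

# One block = one contiguous span: (label, start offset, end offset),
# with the end offset exclusive.
Block = Tuple[str, int, int]


def check_coarse_blocks(blocks: List[Block]) -> List[str]:
    """Return a list of problems; an empty list means no overlaps were found."""
    problems = []
    previous_end = 0
    for label, start, end in sorted(blocks, key=lambda b: b[1]):
        if label not in ORB_CLASSES:
            problems.append(f"unknown class: {label}")
        if start < previous_end:
            problems.append(f"{label} overlaps the previous block")
        previous_end = max(previous_end, end)
    return problems


article = [
    ("Header", 0, 120),
    ("Introduction", 120, 900),
    ("Methods", 900, 2100),
    ("Results", 2100, 3800),
    ("Discussion", 3800, 4600),
    ("References", 4600, 5000),
]
print(check_coarse_blocks(article))
```

Representing each block as a single start/end pair builds the "contiguous piece of text" assumption into the data structure, so only the overlap condition needs an explicit check.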
Tim: (Tudor is traveling; he and Ron and Alex and Anita and Paolo made ORB) - is straightforward, adds a way to talk about the entire article, disjunct from the other levels, useful for text miners
Joanne: Tim suggests we adopt it, I agree.
Tim: notes that ORB is simple and useful
Paolo: that was motivation for this ontology, this is consensus and quite neat in terms of definition

Joanne: The classes are Discussion, Header, Introduction, Methods, References, Results

Anita: Write out SIG note and send to HCLS as a whole
Scott: Tim is discussing with Dietrich Rebholz-Schuhmann (Pistoia SESL Project) about annotation of a microarray corpus
Tim: how is Dietrich annotating these? Any further talks?
Tim: wanted to collaborate; DRS didn't respond yet. Scott and Tim and Pfizer people are interested.
Paolo: This is the purl for the most recent ORB version: http://purl.org/orb/
Joanne: There are folks here at RPI that are interested also, for provenance purposes, they couldn't make the call because of conflicting schedules, but contact us off-line.
Anita: Anita is chasing the Pistoia corpus from the Elsevier end - will let the group know if it is available with or without annotation by EBI and others.

Paolo: This is the PURL for the version 0.1 only http://purl.org/orb/0.1/
Anita: Anita asks Paolo and Tudor to write note for releasing on the world(wideweb)
Paolo: agrees - will take lead, Tudor to help.
Anita: Let's not subdivide header and references, since we already have standards for that...
Paolo: don't give options?
Paolo: do provide options! Let's start listing them and go through existing bibliographic standards.
Tim: task for the next time we take up this thread - let's take ones we like the best, at least DC and PRISM and XMP/ElPub standards and also CiTO/Bibo
Anita: Involve Ron Daniel in this discussion - keep placeholder

http://esw.w3.org/HCLSIG/SWANSIOC/Actions/RhetoricalStructure/models/medium

Anita: Ron Daniel headed up the PRISM metadata project (?)
Jodi: I think that paper type needs to be distinguished for medium grain. Experimental vs. theoretical, review, ...
Scott: yes, Ron Daniel headed up PRISM, http://en.wikipedia.org/wiki/Publishing_Requirements_for_Industry_Standard_Metadata
Jodi: For math, for instance, I don't think the method and results are going to work very well
Anita: PAM: Prism Aggregator Message = http://www.prismstandard.org/faq/
Jodi: except perhaps positioning, central problem, definition
Paolo: Objects of study: ?
Tim: biomaterials or what
Anita: HI Jodi, good comments, will address in the call in a minute!
Tim: what is use case? How can we use this? Overlap with Data + Discourse + Experiment task
Tim: offline, should chat with Philippe and Susana - overlap with other task.

Tim: link to Sudeshna's talk from Nov 1 on DEXI: http://www.slideshare.net/sdas617/sci-discourse-nov-2010
Tim: link to cartoon of current Data+Discourse+Experiment (DEXI) ontology: http://esw.w3.org/File:SWAN-myExp-v4.jpg
Tim/Anita - let's discuss integration between medium-grained structure and research data/workflow output a la beyond the pdf: https://sites.google.com/site/beyondthepdf/
Tim: link to the Data+Discourse+Experiment Task: http://esw.w3.org/HCLSIG/SWANSIOC/Actions/SWANmyExpArray
Jodi: this is life science focused - we need old stuff in other disciplines, that we can apply retrospectively
Tim: Well - this is a Life Sciences SIG! :-)
Anita: Three distinctions: Life science/Physical sciences/everything else
Second distinction: Research article, Review article, Quick research note
Third distinction: new material vs. existing text
Let's discuss other article types as well?
Jodi: clinical reports are a nice example
Scott: has been looking at clinical reports as well. This group is more aimed at scientific literature, right? See a very acute need to mine clinical reports.
Scott: Assign mapping to terminologies, UMLS, SNOMED etc. - but clinical reports don't have a normalised structure, which makes it difficult to make something that is generally useful in terms of an ontology of document structures

Jodi: I think that's really pragmatic, Tim!
Jodi: I guess, let's just be explicit about what the scope is, when giving a medium-grain structure.
Tim: my 2 cents - fan of restricted initial scope - let's start on Research papers in life sciences, make that use case work; do a stepwise incremental expansion. Take users in astronomy, etc., then look at other types of articles.

Anita: volunteers for medium-grained model - Tim will send an email to Experimental Discourse Group
Tim: jodi - ?
Howard: I'm interested in this discussion
Joanne: interested in being involved in the medium-grained model discussion (RPI)
Jodi: yes, interested ;)
Anita: Sorry that's Data + Discourse + Experiment task
David Shotton - access to figures and examples?
Yes, I think, it's all linked from http://esw.w3.org/HCLSIG/SWANSIOC/Actions/RhetoricalStructure/meetings/20101115
Anita: Significant overlap between DoCo and medium-grained model - big blocks that describe structural components
Joanne: This is Item #4 that David is talking about.
Tim: Overlap with DoCO and Coarsegrained and Medium grained ontology
Tim: David, what is motivation for this work?
David: Tried to create an ontology that would accurately describe the components of a document.
David: it could be used by publishers and researchers. Document sections have a rhetorical function
Tim: Is this a 1:1 map from the NLM DTD to DoCo? Yes.
Tim: could you do automated processing of NLM DTD to DoCo?
David: yes but haven't done that yet
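
As an illustration only of what automated processing of NLM DTD markup into DoCo classes might look like (David notes above that this has not been done yet), here is a minimal sketch. The element-to-class table is an assumption made for this example and is not the actual DoCo mapping; the `doco:` prefixed names, the function name and the sample fragment are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical lookup table from NLM DTD element names to DoCo class
# names. The real mapping was not given in the minutes.
NLM_TO_DOCO = {
    "sec": "doco:Section",
    "p": "doco:Paragraph",
    "fig": "doco:Figure",
    "table-wrap": "doco:Table",
}


def map_elements(xml_text):
    """Return (element name, DoCo class) pairs in document order."""
    root = ET.fromstring(xml_text)
    return [
        (element.tag, NLM_TO_DOCO[element.tag])
        for element in root.iter()
        if element.tag in NLM_TO_DOCO
    ]


# Invented fragment in the style of NLM DTD article markup.
sample = "<body><sec><title>Results</title><p>Text.</p><fig id='f1'/></sec></body>"
for tag, doco_class in map_elements(sample):
    print(tag, "->", doco_class)
```

If the mapping really is 1:1, a plain lookup table like this is enough; elements without an entry (here `body` and `title`) are skipped rather than guessed.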
Tim: will you be using this in your JISC project?
David: In Peter Murray Rust's project for JISC
Anita: did you look at any other publisher's DTDs?
David: no -
Anita: long history of DTD development
David: are looking at PLoS, then at BioMed Central
Anita: Is there a way to integrate with Medium-grained system
David: how do we connect DoCo to medium-grained structure?
Silvio: developed the ontology with colleagues in Bologna, interested in patterns in XML documents - we identified structural patterns that allow structuring of textual documents; we study this topic in the Masters
Tim: so - what I see is multiple alignment tasks here: ORB - MediumGrain, MediumGrain - Data+Discourse, ORB & MediumGrain - DoCO
Anita: Silvio, what do you mean by patterns?
Silvio: a pattern is a general solution to a recurrent problem. E.g. textures, and such. Paragraphs are blocks containing text, and many other elements, such as emphasis, citations, etc.
Silvio, what software tool did you use to generate the DoCO documentation?
Anita: 1) ORB - MediumGrain, 2) MediumGrain - Data+Discourse, 3) ORB & MediumGrain - DoCO. Who: offline discussion. When: report back on next call!

David: Silvio's mail is speroni@cs.unibo.it