HCLSIG/SWANSIOC/Actions/RhetoricalStructure/meetings/20100215

Rhetorical Document Structure Group HCLS SIG W3C, Phone Meeting February 15th 2010, 9AM Boston / 2PM Irish / 3PM Amsterdam

Agenda items:

1. Paul Groth from the Concept Web Alliance to present his Nano-publications format (slides: cwa-anatomy-nanopub-v3.pdf)

2. Action items from previous meeting:

Jack to try models on global climate change discourse
Paolo to consult with curators about models
Tudor and Paolo upload a single document example
- SALT: SALT
Tudor to do an intermediate medium-grained model (in a comparison grid)
- Comparison Grid

3. Assessment: where are we now? Can we go back with our current model and re-assess the use cases? What else needs to be done before then?

4. AOB.

Notes (stream of discussion scribed by Anita)

1. Discussion with Paul Groth about Concept Web Alliance Nanopublication

Paul Groth (PG): Concept Web Alliance is about nano-publications. Goal is to say: what is out there, what is out there in terms of something that looks like what all the speeches around nano-pubs is out there, can we make that happen At the core of this is the idea of a triple, like an rdf-triple This is a statement that a scientist has made on the web; lots of these are redundant, let’s get rid of this redundancy. We want something that’s accredited, that scientists review and do.

Tudor Groza (TG): Is their goal to compact the knowledge? Or is it your goal?

PG: Goal of the CWA is to help reduce redundancy of knowledge on the web.

Anita de Waard (AdW): Can you say something about the background of the project?

PG: This is about what CWA is, am still understanding that. CWA thinks there is lots of information in publications and databases, with a lot of duplication, lots of redundancy – this makes it hard to do automated reasoning and to assign credit where it’s due. Many redundant statements are out there; CWA wants to help aggregate them and reduce this redundancy.

AdW: So for existing publications?

PG: Yes, and for databases; eventually people could publish in this format. First step: go through existing publications, people do text mining, extracting facts.

Tim Clark (TC): That is a highly nontrivial task!! Let’s find all the statements in biology that are the same – this is very very difficult to do – we have some unpublished work in this area.

PG: I agree, is a very very hard task. I think the way they want to approach this is through some common namespace, and refer to it – that’s one way of helping along this process – essentially a wiki, a concept wiki that you can refer to – e.g. malaria is defined there, we mean the same thing.

TC: So e.g. ‘Malaria is transmitted by mosquitoes’ – but actually only certain kinds of mosquitoes – true but limited in its applicability, very well-established. Are we talking about non-disputed statements?

PG: Good segue into the model – you can make any statement you want. I can say people come from swans – but need a way of marking where it comes from, why you believe it’s true – want some sort of annotation on that statement – attribution, peer review, provenance etc.

AdW: Indeed, author makes a statement in a context

PG: In an earlier draft of this document I used the word ‘context’ – now I call it ‘annotations on a statement’. We have a core statement and an annotation – together, they are a nanopublication.
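The structure PG describes (a core statement plus annotations on it) can be sketched as a small data structure. This is only an illustrative sketch, not the CWA format itself: the class names, the `ex:` identifiers and the annotation keys (`importedBy`, `assertedBy`) are assumptions made up for the example.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class Statement:
    """A core assertion, written as a subject-predicate-object triple."""
    subject: str
    predicate: str
    obj: str


@dataclass
class Nanopublication:
    """A core statement together with the annotations made on it."""
    statement: Statement
    annotations: Dict[str, str] = field(default_factory=dict)

    def annotate(self, key: str, value: str) -> None:
        """Attach one annotation (e.g. attribution or provenance)."""
        self.annotations[key] = value


# Hypothetical example using the malaria statement from the discussion.
nanopub = Nanopublication(
    Statement("ex:malaria", "ex:isTransmittedBy", "ex:mosquitoes")
)
nanopub.annotate("importedBy", "text extractor")   # assumed annotation key
nanopub.annotate("assertedBy", "ex:someScientist")  # assumed annotation key
```

Keeping the statement immutable and the annotations open-ended mirrors the point made in the discussion that the set of annotations is expected to grow while the core statement stays the same.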

TG: How about for this paper, make an example?

PG: E.g. this paper was written by these people, on this date

Jodi Schneider (JS): Some assertion that you are making, rather than what you are saying?

Matthias Samwald (MS): This is a methodology paper, not one that records research results

PG: No, they have a good point! One is that

TG: We are not trying to grill anyone – just to help a bit

TC: Yes, these are statements in the paper – he is using the malaria in mosquitoes thing –

PG: We have metadata on it – imported by text extractor etc. What it doesn’t include is e.g. malaria is only transmitted by a certain type of mosquitoes –

AdW: E.g. adding what the evidence is –

PG: And adding in what publications this is evidenced

TG: Anita is referring to where this is first asserted

TC: ‘It is well-known that’ – show the reference – and cite Walter Reed, or whoever figured it out

PG: this is what we were considering, kind of annotation we want on this statement

TC: ‘Imported by text extractor from this publication’ – that publication is not the one that is cited in support –

AdW: HypER is about making networks between statements and evidence – is this the same?

PG: No, we don’t describe the entire provenance trail –

TC: If you want this to be useful for scientists – this is an interesting step, could offer some suggestions: they care about 3 things: 1) Is this a novel statement? 2) What’s the evidence that supports it? 3) Who is making this statement? – implied, imported by ... – with SWAN we try to make these chains of evidence – only believe it because it is common knowledge. If someone says malaria is transmitted by owls: how was it studied? Who studied it?

PG: Statements are asserted by an entity, along with annotation – who is responsible for it? Could be a person?

AdW: The knowledge substrate, is that modeled?

PG: I’m not opposed to modeling more – but we just want to model what is in a paper – the more I get into it, modeling scientific discourse is what people are trying to agree on.

TC: Is there some sort of intersection? Can we be helpful to one another?

PG: Yes, we have some annotations, there are others that would be useful. We could use help on this.

TC: We have done some work on fleshing out even further this imported-by, text-extraction etc. could share this if useful

PG: Yes, if we can get enough technology push and people using that, we can expand outwards from that – can do the most simple thing.

TC: Paolo is working on this, what people don’t know – we evolved our ontology ahead of where we were, export of SWAN 1.2 to current knowledgebase. Basically a periodic export every month – he’ll let you have that.

MS: Why isn’t this data publicly available?

TC: We are making it available through the Neuroscience Information Framework

AdW: can you put it on a wiki?

TG: Our coarse-grained structure can perhaps be a context to the statement?

TC: I think that’s a great point: slide you presented at ISWC – SWAN and rhetorical structure – intersection can be a statement.

AdW: Inside Elsevier, we are now making each paragraph of our documents externally accessible.

[Paolo joins. Some discussion about lack of reminders. AdW tries to defend herself, promises improvement. TC summarizes discussion so far. Back to PG]

PG: So we are trying to see what is the underlying format so people can ship around statements plus their contextual information – awesome if we would work together between this group and CWA – how can this set of annotations grow

TC: To the extent that we do converge, that will be useful

PG: You guys propose a lot of things that we would like to do – we can follow

AdW: maybe we can work together on a new use case?

PG: I don’t want to work on any new ontologies – let’s get some of the state-of-the-art and agree on a minimal set. I want to go back and propose we all make a demo – HCLS can show us things that are extra and we can use

TC: Sounds good! Paolo can we make SWAN 1.2 export available? Through NIF – have to ask Elisabeth; and technically

AdW: Can you send email to everyone on the call?

PC: NIF is not sharing RDF – we have to go to who owns the content, ask AlzForum

TC: We can put it up on our lab webpage – with license terms – cc license

PC: Some incarnation of something marked up with your ontology – will be very useful

PG: Small example or whole thing, either is nice.

TC: We’ll go back and post some things and circulate some things

AdW: Example on the wiki?

TC: Yes

AdW: How to start?

TC: If CWA wants a demo – they want to do automated reasoning on triples, we are concerned with provenance of whole discourse – have to figure out a way to harmonise that

PG: Our role is to aggregate, to grab data and get into a common format – at a very fine-grained level

TC: What we came up with in SWAN is the concept of canonical statements – can appear in a lot of different statements – verbal statement can be different, we have the idea of a canonical statement, considered even including negations in a cluster. Could have dual formats of statements, triple formulation.

PG: Triple formulation is more concrete, using URIs – canonical version can be more explicit or more accurate
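The idea of a canonical statement that clusters differently worded variants (possibly including negations) and can also carry a triple formulation might look roughly like the sketch below. It is not SWAN's actual model: the class and field names, the `ex:` identifiers and the sources `paper-A` and `paper-B` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str]


@dataclass
class Variant:
    """One verbal formulation of a statement, as found in some source."""
    wording: str
    source: str
    negated: bool = False


@dataclass
class CanonicalStatement:
    """A canonical text form, an optional triple form, and its variants."""
    text: str
    triple: Optional[Triple] = None
    variants: List[Variant] = field(default_factory=list)

    def add_variant(self, wording: str, source: str, negated: bool = False) -> None:
        self.variants.append(Variant(wording, source, negated))

    def sources(self) -> List[str]:
        """All distinct sources in which some variant appears, sorted."""
        return sorted({v.source for v in self.variants})

    def negations(self) -> List[Variant]:
        """Variants in the cluster that negate the canonical statement."""
        return [v for v in self.variants if v.negated]


# Hypothetical cluster: two wordings from two made-up sources.
canonical = CanonicalStatement(
    text="Malaria is transmitted by mosquitoes",
    triple=("ex:malaria", "ex:isTransmittedBy", "ex:mosquitoes"),
)
canonical.add_variant("Mosquitoes transmit malaria", "paper-A")
canonical.add_variant("Malaria is not spread by mosquitoes", "paper-B", negated=True)
```

Holding both `text` and `triple` on the same object is one way to read the "dual formats" remark: the triple is the concrete, URI-based form, while the canonical text can stay more explicit.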

AdW: FEBS SDA experience teaches us that triples can be claimed by the authors, but not e.g. approved by the curators

PG: May filter out these contextual statements in various ways

PG: Notion is a repeated statement

JS: good thing is: they are more explicit.

TC: So, do you want to poll appetite in CWA to work with us? Let’s fit together: Tudor and Anita on the rhetorical structure, SWAN is more about statements and evidence, both can converge potentially with CWA stuff – we can start a series of discussions on how to converge.

PG: Push to ‘let’s start doing something’ – we’re having those conversations soon – I’ll make it clear that you guys would like to see some overlap – we can send something to you, get your feedback!

AdW: Great idea, let’s get it going

PG: CWA is taking minimal set, pushing it out there.

TC: We’ll send around examples and follow up.

2. Other points

PC: ‘Paolo to talk with curators about models’ – I spoke with curators, just have to get it up on the wiki. Problem is that these low-level statements are in textual format; how fine-grained do they want to go? Currently the curators read a paper and extract a list of claims; they believe a list of claims that explain a particular hypothesis, then they rephrase them; they try to turn them into subject-verb-object triples. A hypothesis is a little sentence explaining what a claim is about, it’s free text – not finer than that!

This is the maximum they want to do – do not want to go more granular than that, because biological knowledge is too complex for this. Can show them the model we propose

MS: On the CWA: my remark is that fitting biological statements into Subject-Predicate-Object format is almost impossible; we often have more than three entities

PC: SWAN is an evolving story; at level SVO – triple can be false tomorrow – increases level of complexity – are you just representing text?

AdW: Statements: hedges are eroded when they are cited, we see the linguistic formulations change because the statement gets to be eroded

MS: Have been working with Reflect to make use of RDF-a – can now inject some RDF-a that links to linked data representations – still quite preliminary, can start by sending an email. Not everything we see must be modeled, need to focus on what can be done, where we see a real benefit! There are properties and things going on that we are unable to model.

TG: I agree they are complex, but temporality is the most interesting thing we can do besides extracting facts: the temporal evolution of a statement

PC: We had a model for doing this – but they don’t want to increase the complexity, break down into little pieces, make other artifact tomorrow! We understand evolution by looking at two artifacts; some things are hard to represent – biologist does not want to spend time going into that level of detail. They understand a sentence – don’t want to model

PC: Model what was said in 1984, then what is said in 1994 – don’t take original hypotheses, if contrast or in agreement, then track that link back.
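One way to read this (model each claim at its own date, then link a later claim back to an earlier one when it agrees or contrasts) is sketched below. The relation names `agrees_with` and `contrasts_with`, the `Claim` class and the example texts are assumptions for illustration, not part of SWAN.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Claim:
    """A claim as stated in a given year, optionally linked to an earlier one."""
    text: str
    year: int
    relation: Optional[str] = None      # assumed values: "agrees_with", "contrasts_with"
    earlier: Optional["Claim"] = None   # the earlier claim this one refers back to


def trail(claim: Optional[Claim]) -> List[Tuple[int, Optional[str]]]:
    """Follow the links back from a claim to the earliest claim it relates to."""
    steps = []
    while claim is not None:
        steps.append((claim.year, claim.relation))
        claim = claim.earlier
    return steps


# Hypothetical pair of claims, ten years apart.
claim_1984 = Claim("Placeholder claim as stated in 1984", 1984)
claim_1994 = Claim(
    "Placeholder claim as restated in 1994",
    1994,
    relation="contrasts_with",
    earlier=claim_1984,
)

history = trail(claim_1994)
```

The earlier claim is left untouched; the evolution is read from the link between the two artifacts, which matches the point that curators prefer comparing whole statements over breaking them into smaller pieces.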

TG: I got an idea! [... sorry Tudor I missed that...]

PC: Everything in SWAN is created and maintained by biologists -

TG: They are there in the database, add or update in time – we could use the discourse relations ontology, context of a coarser-grained

AdW: Can we now take a couple of documents and try to mark them up? Taking a new document and perform the annotations?

PC: I can do that, but probably need to show what is done by the annotators – ask a curator to take a document and perform the annotation in the document that they have done – want to do that anyway to use the software and link to documents they have created. Will ask to mark up the document and decide how to present it to you. Technically, Tim nominated me responsible for SWAN so I can do this – I’ll write the email now, have to put it in their pipeline!

TG: I made this! Port these two items to next meeting.

AdW: Great! What date?

PC: David Shotton does not have time for alignment, this is now under PC’s guidance. With David and Andrew – I have to do the job by myself.

TG: Can we use this slot?

JS: Is this a good time? Every other week? -> 1 March

AdW: Andrew, Scott, Marco Roos; Jack Park; Joanne Luciano?

JS: Can we have an email listserv?

TG: We can set this up? There is a HCLS-One, not for announcements etc. I have a folder in GMail –

AdW: I’ll ask Scott and try to get it set up. We can go back to a Google group.

Next meeting 1/3/10, 9am/2 pm/3 pm; Agenda:

1. Paolo walks us through example of annotation in SWAN

2. Tudor walks us through his marked up document in SALT

3. News/demo from CWA?