13468 – Support Microdata values that are HTML snippets

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Status:	RESOLVED NEEDSINFO

Alias:	None

Product:	HTML WG
Classification:	Unclassified
Component:	LC1 HTML Microdata (editor: Ian Hickson)
Version:	unspecified
Hardware:	PC Linux

Importance:	P3 normal
Target Milestone:	---
Assignee:	Ian 'Hixie' Hickson
QA Contact:	HTML WG Bugzilla archive list

Reported:	2011-07-30 15:54 UTC by Manu Sporny
Modified:	2011-08-23 05:12 UTC
CC List:	10 users

Description Manu Sporny 2011-07-30 15:54:19 UTC

This feedback is filed as a personal comment and is not intended to be any sort
of official feedback from any standards working group.

The Drupal content management platform has a use case for RDFa where they want to express the portion of a page that is the actual post content, including all HTML markup. The Microdata spec currently cannot do this; please specify a mechanism that allows an author to easily express an HTML snippet.

The end-result must allow for at least the following to be expressed without resorting to using HTML entities or other hacks:

"<h1><s>Microdata</s> HTML Literal Test</h1><p>This is a test.</p>"

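For illustration, the entity-based workaround that the description wants to avoid might look like the following Python sketch. The property name "body" and the use of a <meta> element are assumptions made for this example; neither is proposed in the report.

```python
import html

snippet = "<h1><s>Microdata</s> HTML Literal Test</h1><p>This is a test.</p>"

# The hack: entity-escape the markup and carry it in a content attribute.
# "body" is a hypothetical property name chosen for this sketch.
escaped = html.escape(snippet, quote=True)
markup = f'<meta itemprop="body" content="{escaped}">'

# html.unescape stands in for the attribute decoding an HTML parser performs.
# The consumer gets the original characters back, but only as a plain string.
recovered = html.unescape(escaped)

print(markup)
print(recovered == snippet)
```

The round trip preserves the characters, but nothing in the Microdata tells a consumer that the string holds markup that should be parsed again.
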
Comment 1 Jirka Kosek 2011-07-30 16:00:34 UTC

You mean expressing HTML markup using RDFa/Microdata? Why can't you embed the HTML snippet directly, possibly enclosing it in a <div> and giving it a special class? Could you be more specific about this use case?

Comment 2 Philip Jägenstedt 2011-07-30 22:41:57 UTC

It sounds like the use case here is syndication, perhaps the HTML to Atom conversion algorithm would be more suitable?

http://www.whatwg.org/specs/web-apps/current-work/multipage/microdata.html#atom

Comment 3 scor 2011-08-01 18:55:50 UTC

(In reply to comment #2)
> It sounds like the use case here is syndication, perhaps the HTML to Atom
> conversion algorithm would be more suitable?
> 
> http://www.whatwg.org/specs/web-apps/current-work/multipage/microdata.html#atom

Isn't Atom limited to a specific schema, and doesn't it assume that each Atom item can contain only a single atom:summary element in HTML? What if a page contains more than just a body in HTML, but also a headline and comments in HTML?

How does that address the case of making HTML snippets available at the Microdata DOM API level, or in JSON?

Comment 4 Philip Jägenstedt 2011-08-01 20:41:51 UTC

Rather than guessing further at what the use case is (last guess: "something with syndication") I'll await clarification.

Comment 5 Ian 'Hixie' Hickson 2011-08-02 07:25:51 UTC

EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are satisfied with this response, please change the state of this bug to CLOSED. If you have additional information and would like the editor to reconsider, please reopen this bug. If you would like to escalate the issue to the full HTML Working Group, please add the TrackerRequest keyword to this bug, and suggest title and text for the tracker issue; or you may create a tracker issue yourself, if you are able to do so. For more details, see this document:
   http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Did Not Understand Request
Change Description: no spec change
Rationale: I don't understand. What is the use case?

Comment 6 Michael[tm] Smith 2011-08-04 05:05:42 UTC

mass-move component to LC1

Comment 7 scor 2011-08-04 16:02:06 UTC

I build websites for scientific communities where we often require HTML for writing molecules or formulas. Here are a couple of simple examples:

The acetylene chemical compound is written like this in HTML: C<sub>2</sub>H<sub>2</sub>
If you strip the subscript tags, you end up with C2H2, which does not mean anything. A number in front of a molecular formula indicates the number of molecules, and a subscript number refers to an atom in a given molecule. You need to be able to keep these tags intact in order to keep your formula meaningful.

Let's take another example involving a simple formula written like this in HTML: E = mc<sup>2</sup>. If you strip out HTML, you end up with E = mc2 and you will fail your exam.
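
The loss described in this comment can be reproduced with a minimal Python sketch. It assumes the consumer reduces a value to its text content by discarding tags; the class and function names are hypothetical, chosen only for this example.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Keep character data and ignore every tag, like a textContent read."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def text_content(fragment: str) -> str:
    parser = TextOnly()
    parser.feed(fragment)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)


formula = text_content("C<sub>2</sub>H<sub>2</sub>")
equation = text_content("E = mc<sup>2</sup>")

print(formula)   # C2H2
print(equation)  # E = mc2
```

Once the tags are gone, the subscripts and the exponent cannot be told apart from ordinary digits.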

Comment 8 Ian 'Hixie' Hickson 2011-08-04 16:19:40 UTC

There are lots of places where markup is relevant, but when are those places relevant to microdata?

Comment 9 scor 2011-08-04 16:40:43 UTC

(In reply to comment #8)
> There are lots of places where markup is relevant, but when are those places
> relevant to microdata?

Maybe I'm just missing the point of microdata. If you want to reuse this data elsewhere (e.g. search results), you're losing some important piece of information during the microdata parsing. I don't get why you would not want this (HTML) data to be available in the microdata output, and let the end user decide whether they want to use them or strip them out.

Comment 10 Paolo Ciccarese 2011-08-04 17:29:43 UTC

(In reply to comment #9)
> (In reply to comment #8)
> > There are lots of places where markup is relevant, but when are those places
> > relevant to microdata?
> 
> Maybe I'm just missing the point of microdata. If you want to reuse this data
> elsewhere (e.g. search results), you're losing some important piece of
> information during the microdata parsing. I don't get why you would not want
> this (HTML) data to be available in the microdata output, and let the end user
> decide whether they want to use them or strip them out.

Here is another use case. I develop software for creating annotations on online scientific resources. Typically scientists can write comments related to a resource or a fragment of it. For instance, if we are looking at the following text: 

"The function of full-length APP is undefined, and it is likely that full-length APP performs distinct roles from any its cleavage products"

The author of this piece, or another scientist, could comment - using an online rich text editor that our users like very much - including a link:

"I am referring to the full-length <a href="http://en.wikipedia.org/wiki/Amyloid_precursor_protein">APP</a>"

or, if she prefers to use a term from an ontology:

"I am referring to the full-length <a href="http://purl.obolibrary.org/obo/PRO_000004168">APP</a>"

When you republish the comment with Microdata, if you strip out the HTML, the comment is useless, as APP can be many different things (related and not related to the topic). The only way to understand that the comment refers to the APP protein is to look at the link, which gets 'lost in translation'. The same considerations apply when scientists attach references or evidence in their comments. Something like:

"<a href="http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2659208/">[Hoe et al.]</a>"

Of course there is always the alternative of forcing users to use plain text - with an old-style textarea instead of a rich text editor - or of developing software that translates users' HTML into a wiki-like format. However, this would be a big limitation and an additional burden.

Comment 11 Ian 'Hixie' Hickson 2011-08-04 20:09:18 UTC

I was with you until:

> When you republish the comment with Microdata

Why would you do that? What are you using the item*="" attributes for here? Who is going to be consuming this data? What are they going to do with it?

When I ask for a use case, I mean can you describe a scenario in which a user (or other tool working on behalf of a user, e.g. a search engine) would interact with a Web page in a manner that is not possible today, but which people actually want to do.

Comment 12 scor 2011-08-04 20:24:39 UTC

kalrcow asked me to add a full code example where I would expect microdata to allow for HTML snippet values:

<div itemscope itemtype="http://myschema.org/ChemicalCompound">
  <span itemprop="name">acetylene</span>
  <span itemprop="formula">C<sub>2</sub>H<sub>2</sub></span>
</div>
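
To make the request concrete, the following Python sketch reads the example above twice: once as text, which is roughly what a text-content based reading gives, and once as an HTML snippet, which is the behaviour being asked for. The collector class is hypothetical and simplified: it assumes well-formed markup, handles only one level of itemprop, and does not treat void elements such as <br> specially.

```python
import html
from html.parser import HTMLParser

EXAMPLE = """
<div itemscope itemtype="http://myschema.org/ChemicalCompound">
  <span itemprop="name">acetylene</span>
  <span itemprop="formula">C<sub>2</sub>H<sub>2</sub></span>
</div>
"""


class PropertyCollector(HTMLParser):
    """Collect each itemprop value as text and as an HTML snippet."""

    def __init__(self):
        super().__init__()
        self.text_values = {}
        self.html_values = {}
        self._name = None  # property currently being captured
        self._depth = 0    # nesting depth inside the property element
        self._text = []
        self._html = []

    def handle_starttag(self, tag, attrs):
        if self._name is not None:
            # Nested tag: keep it in the snippet, drop it from the text.
            self._depth += 1
            self._html.append(self.get_starttag_text())
            return
        itemprop = dict(attrs).get("itemprop")
        if itemprop:
            self._name, self._depth = itemprop, 0
            self._text, self._html = [], []

    def handle_endtag(self, tag):
        if self._name is None:
            return
        if self._depth == 0:
            # Closing tag of the property element itself: store both values.
            self.text_values[self._name] = "".join(self._text)
            self.html_values[self._name] = "".join(self._html)
            self._name = None
        else:
            self._depth -= 1
            self._html.append(f"</{tag}>")

    def handle_data(self, data):
        if self._name is not None:
            self._text.append(data)
            self._html.append(html.escape(data, quote=False))


collector = PropertyCollector()
collector.feed(EXAMPLE)
collector.close()

print(collector.text_values["formula"])  # C2H2
print(collector.html_values["formula"])  # C<sub>2</sub>H<sub>2</sub>
```

Only the second reading keeps the formula meaningful for a consumer that has no access to the original page.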

Comment 13 Paolo Ciccarese 2011-08-05 02:30:20 UTC

(In reply to comment #11)
> I was with you until:
> 
> > When you republish the comment with Microdata
> 
> Why would you do that? What are you using the item*="" attributes for here? Who
> is going to be consuming this data? What are they going to do with it?
> 
> When I ask for a use case, I mean can you describe a scenario in which a user
> (or other tool working on behalf of a user, e.g. a search engine) would
> interact with a Web page in a manner that is not possible today, but which
> people actually want to do.

I would re-publish the comment created with the rich text editor as something like:

...
<div itemprop="has_reply" itemscope itemtype="http://schema.org/Comment">
    ...
    <span itemprop="content">
        For this experiment I used <a href="http://purl.obolibrary.org/obo/PRO_000004168">APP</a>
    </span>
    ...
</div>

In my specific case, I have an application - already deployed in alpha - parsing documents to extract proteins and genes. That HTML link would save my scientists additional interactions to resolve ambiguities.
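
As a rough illustration of what such a parsing application gains from the markup, the Python sketch below compares the plain-text value of the "content" property with the links recoverable from its HTML. The collector class is hypothetical and is not taken from the application described here.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Gather the plain text plus (href, anchor text) pairs of a snippet."""

    def __init__(self):
        super().__init__()
        self.text = []
        self.links = []
        self._href = None
        self._anchor = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._anchor = []

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._anchor)))
            self._href = None

    def handle_data(self, data):
        self.text.append(data)
        if self._href is not None:
            self._anchor.append(data)


content = (
    "For this experiment I used "
    '<a href="http://purl.obolibrary.org/obo/PRO_000004168">APP</a>'
)

collector = LinkCollector()
collector.feed(content)
collector.close()

plain = "".join(collector.text)
print(plain)            # For this experiment I used APP
print(collector.links)  # the ontology URI paired with "APP"
```

The plain-text value leaves "APP" ambiguous, while the snippet still carries the ontology URI that resolves it.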

Comment 14 Ian 'Hixie' Hickson 2011-08-05 14:39:57 UTC

Could you elaborate on what your application does? I'm trying to learn more about your use case to better understand the need here.

Comment 15 Paolo Ciccarese 2011-08-05 15:37:51 UTC

(In reply to comment #14)
> Could you elaborate on what your application does? I'm trying to learn more
> about your use case to better understand the need here.

Sure... and I apologize in advance for the size of this message.

The application is called DOMEO (Document Metadata Exchange) [1]. The idea is simple: scientists read online documents and want to create annotations on them. These are visually created through the application - and specifically through a GWT component - and stored in a separate store with access control. In this first phase, I can trigger pipelines for document analysis. In other words, when a scientist opens a document I can do a bunch of things for her to save her time. Examples are the extraction of bibliographic citations and biological entities (genes, proteins, antibodies).

Most of the documents we deal with are out of our control, so we cannot insert any markup or Microdata back into them. However, our group also builds online portals for scientific communities working on a specific disease or area (examples: Pain, Parkinson Disease, MS...). In this last case we have control over the documents and we can, after a moderation process, re-publish the comments/notes of our users in the document. So if another user opens that document with DOMEO, or with a text mining algorithm/tool, she will automatically get pieces of knowledge that she can simply look at (possibly with additional data as a result of mashups with external sources) or reuse and organize in her private knowledge management space.

One way of embedding those notes back into the document is to use Microdata. 

One example is the portal for the Harvard Stem Cell Institute, http://www.stembook.org/ . We have control over these peer-reviewed articles (e.g. http://www.stembook.org/node/471). We can therefore think of embedding valuable notes back into the document with some Microdata that allows our applications, which run outside that specific environment, and all the text mining applications of other research groups to better understand what to look for and how to parse it for knowledge extraction. In the case of a comment, we can think of embedding a snippet such as the one in my previous email. But we have many other forms of annotation that are more specific to science: hypotheses, claims. And these are even more powerful. If we publish an article in our portal, we want to be able to use Microdata to isolate important scientific claims in the text. Such claims, though, include references and other entities (such as proteins, as I showed in my previous email) that are ambiguous if you cannot follow the provided links. Extracting Microdata from those documents will make it possible to extract their scientific discourse automatically.

You can get the flavor of what scientific discourse is by looking at this example - http://tinyurl.com/3pvvjsc - of another application I developed for Alzheimer Disease researchers. The list of statements you see there is incredibly structured. It is actually a very detailed graph that you can see here: http://tinyurl.com/3hrraje . As we have structured data, we can embed powerful and very detailed Microdata back in the original document. This will allow better knowledge discovery and also make it possible to generate multiple views of the classic document, which is still linear and very poor for today's technologies.

I truly believe a lot of the knowledge our scientists encode in their annotations is tied to links and other markup that took them a long time to master. Bringing them back to plain text would probably be a big step backward.

Let me know if you want to know more on the topic and thank you for following up and trying to better understand our needs.
Paolo

[1] If you want to see the application live, here is a screencast of one of my presentations: http://www.bioontology.org/annotation-ontology .
At minute 11.30 I explain the goals. At minute 28.55 I show the annotation process live.

Comment 16 Ian 'Hixie' Hickson 2011-08-15 04:28:12 UTC

The annotation thing makes sense from a publisher point of view but I'm still trying to understand this from a consuming software point of view.

Could you elaborate on the way that users are actually going to be processing this microdata? That is, why do you need to mark it up at all, rather than just having the HTML files include the comments and then if you want to process the comment data, doing it directly from the raw database?

Comment 17 Ian 'Hixie' Hickson 2011-08-23 05:12:15 UTC

Status: Did Not Understand Request
Change Description: no spec change
Rationale: see comment 16