9533 – The Microdata extraction algorithm should include image alt-text when extracting the contents of an element as a string

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 9533 - The Microdata extraction algorithm should include image alt-text when extracting the contents of an element as a string

Status:	RESOLVED WONTFIX

Alias:	None

Product:	HTML WG
Classification:	Unclassified
Component:	pre-LC1 HTML Microdata (editor: Ian Hickson)
Version:	unspecified
Hardware:	PC Linux

Importance:	P2 normal
Target Milestone:	---
Assignee:	Ian 'Hixie' Hickson
QA Contact:	HTML WG Bugzilla archive list

Reported:	2010-04-16 00:01 UTC by Tab Atkins Jr.
Modified:	2010-10-05 13:03 UTC
CC List:	8 users

Description Tab Atkins Jr. 2010-04-16 00:01:21 UTC

In the "Values" section of the Microdata section of the spec (currently section 5.2.4), the "Otherwise" clause says that the value of the Microdata property should be the textContent of the element. The textContent extraction algorithm defined in DOM3CORE does not include the value of the @alt attribute on an <img> element in the returned string.

This is a problem for many common cases on the web, where Microdata may be used to extract information from a page that uses an image logo with appropriate alt-text. For example, it is common for corporate pages to have markup resembling "<h1><img src=foo alt='Example Corp'></h1>". Currently, using this markup to get the company name as the value of some Microdata property is impossible. If you set an @itemprop on the <img>, the value for the property is the value of the @src attribute. If you set an @itemprop on the <h1>, the value for the property is the empty string.
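A minimal sketch of the behaviour described above, assuming Python's xml.etree.ElementTree as a stand-in for a real HTML DOM. The helper name property_value is hypothetical and models only the two cases relevant here (the actual "Values" rules cover more elements and resolve URLs to absolute form):

```python
import xml.etree.ElementTree as ET


def property_value(element):
    """Hypothetical, simplified model of the Microdata "Values" rules.

    Only two cases are modelled: <img> yields its src attribute
    (left unresolved here), and every other element yields its
    textContent, i.e. the concatenation of descendant text nodes.
    """
    if element.tag == "img":
        return element.get("src", "")
    return "".join(element.itertext())


# Well-formed version of the markup from the report.
h1 = ET.fromstring('<h1><img src="foo" alt="Example Corp"/></h1>')
img = h1.find("img")

value_on_img = property_value(img)  # the src value, not the company name
value_on_h1 = property_value(h1)    # empty: the <h1> has no text nodes
```

In neither case does the alt text reach the property value.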

Currently, the only way to get the company name as the value of a Microdata property is to duplicate the company name in a <meta> element and set the @itemprop on that instead. This is precisely the type of duplication that Microdata is intended to prevent.

Ideally, you would be able to set an @itemprop on the <h1> and get the value of the <img>'s @alt attribute, since you are asking for the text inside the element and @alt is the textual replacement for the image.

It can be argued that more elements could benefit from special handling when formatting their text content. For example, the <q> element could emit its contents with quotes, the <bdo> element could emit its contents with Unicode directionality characters, or the <br> element could substitute itself with a linebreak. However, these elements will still emit *something* useful if they just provide their plain textContent, even if it ends up being somewhat misformatted. <img alt> provides *nothing* and will require data duplication in the current algorithm, and thus is much more important to address.

The actual change to algorithms extracting Microdata from a document is minimal. If one is using DOM methods, one has to manually iterate through an element's nodes per the DOM3Core algorithm for textContent and add a single additional case to extract the @alt value from <img alt>. This is somewhat more work than just requesting the .textContent value from a node, but is still quite trivial. If one is using lower-level or alternate methods to parse a page and extract Microdata from it, then the change should be smaller still: a single additional case while building the text content string, as described earlier in this paragraph.
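The manual traversal described above could look roughly like the following. This is a sketch only, assuming Python's xml.etree.ElementTree in place of a browser DOM; the function name text_with_alt is hypothetical, and the walk is a simplified textContent-style concatenation with one extra case for <img>:

```python
import xml.etree.ElementTree as ET


def text_with_alt(element):
    """Concatenate descendant text, substituting alt text for <img>.

    Hypothetical helper: mirrors a textContent-style walk over the
    tree, with a single additional case for <img alt>.
    """
    if element.tag == "img":
        # The one additional case: emit the alt text for images.
        return element.get("alt", "")
    parts = []
    if element.text:
        parts.append(element.text)
    for child in element:
        parts.append(text_with_alt(child))
        # In ElementTree, text following a child is stored as its tail.
        if child.tail:
            parts.append(child.tail)
    return "".join(parts)


logo_heading = ET.fromstring('<h1><img src="foo" alt="Example Corp"/></h1>')
company_name = text_with_alt(logo_heading)

mixed = ET.fromstring('<p>Call <img src="x" alt="Example Corp"/> today</p>')
mixed_text = text_with_alt(mixed)
```

With this walk, an @itemprop on the <h1> from the earlier example would yield the alt text instead of the empty string, and text surrounding an image is preserved in order.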

Comment 1 Tab Atkins Jr. 2010-04-16 00:06:54 UTC

So, um, to summarize the above, the suggested change is exactly what the title suggests.  When extracting the text from an element as the value of a Microdata property, the alt-text for <img> elements should be included.

Comment 2 Ian 'Hixie' Hickson 2010-04-16 00:12:02 UTC

EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are satisfied with this response, please change the state of this bug to CLOSED. If you have additional information and would like the editor to reconsider, please reopen this bug. If you would like to escalate the issue to the full HTML Working Group, please add the TrackerRequest keyword to this bug, and suggest title and text for the tracker issue; or you may create a tracker issue yourself, if you are able to do so. For more details, see this document:
   http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Rejected
Change Description: no spec change
Rationale: This is intentionally trivial to implement. I'm not convinced that there are that many use cases for which this is important. It's easy to work around (put the logo to the side of the company name — that's what most business cards do anyway, and it means you'll get better text selection behaviour).

Having said that, it would be good to get feedback on this from early consumers of microdata, such as the Google Rich Snippets team or foolip. If consumers think the added complexity is enough, then maybe it's worth it.