14470 – microdata: support for structured data (HTML) as a property value

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019.

Status:	RESOLVED WONTFIX

Alias:	None

Product:	WHATWG
Classification:	Unclassified
Component:	HTML
Version:	unspecified
Hardware:	PC All

Importance:	P3 enhancement
Target Milestone:	Needs Impl Interest
Assignee:	Ian 'Hixie' Hickson
QA Contact:	HTML WG Bugzilla archive list

URL:	https://www.w3.org/Bugs/Public/show_b...

Reported:	2011-10-14 19:21 UTC by Jeni Tennison
Modified:	2016-03-16 18:02 UTC
CC List:	9 users

Description Jeni Tennison 2011-10-14 19:21:01 UTC

It is not clear how microdata handles languages. Language is not mentioned as part of the microdata data model [1]. It is not exposed within microdata JSON [2]. It is not used in the algorithm for creating vCard [3] or iCalendar [4], where it should be used to provide a value for the LANGUAGE property [5][6].

There is a list of examples of multi-lingual content on the web at [7]. Other examples are the EUR-LEX site, where information about items of European legislation is available in multiple languages [8], and legislation.gov.uk, where Welsh and English titles for the same item of legislation are listed together [9].

Microdata will be unusable for multi-lingual content if it doesn't preserve the language of textual values. The spec should make it clear whether language should be preserved by consumers, ignored, or if this is implementation dependent. Regardless, the vCard and iCalendar conversions in the spec should take account of language.

[1] http://www.whatwg.org/specs/web-apps/current-work/multipage/microdata.html#the-microdata-model
[2] http://www.whatwg.org/specs/web-apps/current-work/multipage/microdata.html#json
[3] http://www.whatwg.org/specs/web-apps/current-work/multipage/microdata.html#conversion-to-vcard
[4] http://www.whatwg.org/specs/web-apps/current-work/multipage/microdata.html#conversion-to-icalendar
[5] http://tools.ietf.org/html/rfc6350#section-5.1
[6] http://tools.ietf.org/html/rfc5545#section-3.2.10
[7] http://microformats.org/wiki/multilingual-examples
[8] http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=CELEX:31994Y0702(01):FR:NOT
[9] http://www.legislation.gov.uk/wsi

Comment 1 Ian 'Hixie' Hickson 2011-10-18 22:41:06 UTC

It's entirely up to the vocabulary to specify a property to carry the language. Microdata is just a group of name-value pairs.

Comment 2 Jeni Tennison 2011-10-19 15:27:16 UTC

(In reply to comment #1)
> It's entirely up to the vocabulary to specify a property to carry the language.
> Microdata is just a group of name-value pairs.

Your (now dropped) mapping of microdata to RDF [1] did take into account language from the element when generating RDF (step 6.1.4.). Does that mean that it is OK for a consumer to take language into account when processing microdata, despite it not being part of the microdata data model, or was that an error in that mapping?

Having some text in the spec that clarifies the interaction of microdata and HTML language would be really useful to avoid publisher and consumer confusion.

[1] http://www.w3.org/TR/2011/WD-microdata-20110525/#generate-the-triples-for-an-item

Comment 3 Ian 'Hixie' Hickson 2011-10-25 02:58:01 UTC

The RDF mapping was not a microdata to RDF mapping, it was an HTML to RDF mapping, and so it did much more than just expose the microdata model. It was also, IMHO, a rather misguided idea.

I don't understand what is unclear here. It seems crystal clear that the microdata model doesn't have language, just like it doesn't list prices for each property, or data types, or the phase of the moon when the property was set: if it had anything to do with a language, it would be mentioned, and it is not.

Comment 4 Jeni Tennison 2011-10-26 09:00:40 UTC

The issue is that the natural/obvious method of getting information about the language of a property value is for an application to use the lang DOM property of the relevant property element. Without a clear indication that doing so is non-conformant, the assumption will be that the HTML language can be used by applications that interpret microdata and map to other formats because even though it's not part of the microdata data model, language is information that is accessible from the DOM.

It is also not clear to microdata vocabulary creators that they must provide properties/types to indicate the language of a property's value if they want to capture that information. Illustrating the use of other languages in one of the example vocabularies would be one way of making this clearer.

Comment 5 Henri Sivonen 2011-10-27 08:52:47 UTC

(In reply to comment #1)
> It's entirely up to the vocabulary to specify a property to carry the language.
> Microdata is just a group of name-value pairs.

It seems very inconvenient to have to specify a vocabulary-specific language markup mechanism instead of using the language markup mechanism from the HTML layer.

Unfortunately, language info from the HTML layer doesn't map nicely to JSON. Is that the reason why language isn't part of the data model? If not, what is the reason?

Comment 6 Ian 'Hixie' Hickson 2011-11-03 16:00:27 UTC

I agree that it's inconvenient. The JSON issue isn't the reason, though it is certainly a factor.

The original reason is simply that none of the use cases indicated a need for this. It's still not entirely clear to me what use cases exist. Certainly multilingual content exists, but what are people intending to do with it in a microdata context that requires the labeling to persist?

Comment 7 Jeni Tennison 2011-11-05 07:17:00 UTC

A use case is that a search engine wants to bring together reviews and other information about films into film-centric pages. It gathers that information about that film from all over the web and wants to present people with reviews in their preferred language(s). This requires it to preserve information about the language of the reviews.

Also in this case, the film might have different titles in different languages; the search engine would be able to link together the information provided in different languages about the same film using pages in which there were multiple translations of the title (see, e.g., [1]).

A perhaps more esoteric use case: translation services such as Google Translate might look for examples where the same information about an item was given in different languages as potential sources for improving their translation services.

[1] http://fr.wikipedia.org/wiki/Les_Aventures_de_Tintin_:_Le_Secret_de_La_Licorne

Comment 8 Ian 'Hixie' Hickson 2011-11-11 20:01:18 UTC

(In reply to comment #7)
> A use case is that a search engine wants to bring together reviews and other
> information about films into film-centric pages. It gathers that information
> about that film from all over the web and wants to present people with reviews
> in their preferred language(s). This requires it to preserve information about
> the language of the reviews.

(I assume you mean aggregator, not search engine.)

The above can be solved today, you just need to include the language information in the microdata:

   <p itemscope itemtype="http://example.com/movie/review">
    <span itemprop=text> bla bla bla </span>
    <meta itemprop=language content="en">
   </p>

It's redundant with lang="", but lang="" doesn't have the same coarseness as microdata. Consider:

   <p itemscope itemtype="http://example.com/movie/review" lang="en">
    <span itemprop=text>
     <span lang="de">bla</span>
     <span lang="fr">bla</span>
    </span>
   </p>

What language would you associate with the "text" property?
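The ambiguity in the example above can be made concrete with a rough sketch. It is not part of microdata or of any spec algorithm: it only walks the markup with Python's standard html.parser and records which lang="" value applies to each run of text inside the property element. The class name LangRuns and its attributes are hypothetical, and the sketch assumes well-formed markup with explicit end tags and no void elements.

```python
from html.parser import HTMLParser


class LangRuns(HTMLParser):
    """Hypothetical helper: records the inherited lang="" of each text run
    inside one itemprop element. Assumes well-formed markup with explicit
    end tags and no void elements (br, img, meta, ...)."""

    def __init__(self, prop):
        super().__init__(convert_charrefs=True)
        self.prop = prop
        self.lang_stack = [None]  # effective language of each open element
        self.prop_depth = None    # stack depth of the open property element
        self.element_lang = None  # language of the property element itself
        self.runs = []            # (language, text) pairs inside the property

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # lang="" is inherited from the nearest ancestor unless overridden.
        self.lang_stack.append(attrs.get("lang", self.lang_stack[-1]))
        if self.prop_depth is None and attrs.get("itemprop") == self.prop:
            self.prop_depth = len(self.lang_stack)
            self.element_lang = self.lang_stack[-1]

    def handle_endtag(self, tag):
        if self.prop_depth == len(self.lang_stack):
            self.prop_depth = None  # leaving the property element
        if len(self.lang_stack) > 1:
            self.lang_stack.pop()

    def handle_data(self, data):
        if self.prop_depth is not None and data.strip():
            self.runs.append((self.lang_stack[-1], data.strip()))


REVIEW = """<p itemscope itemtype="http://example.com/movie/review" lang="en">
 <span itemprop=text>
  <span lang="de">bla</span>
  <span lang="fr">bla</span>
 </span>
</p>"""

parser = LangRuns("text")
parser.feed(REVIEW)
parser.close()
print(parser.element_lang)  # language declared for the property element
print(parser.runs)          # language of each text run inside it
```

Under these assumptions the property element itself inherits "en", while the text inside it is tagged "de" and "fr", so a single language label for the "text" value has at least three candidate answers.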

Also, note that microdata isn't currently intended for handling cases where entire blobs of HTML content are aggregated. For example, it would completely fail with something like:

  <div itemprop=adcopy>
   <style scoped> em { color: purple } </style>    
   This product costs <s>$500</s> just $100!
   You should get <em>this</em> version, not any version.
  </div>

The microdata extraction would get:

   "adcopy": [ "\n    em { color: purple }     \n   This product costs $500 just $100!\n   You should get this version, not any version.\n  \n" ]

...which isn't at all what was intended.
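The flattening shown above can be reproduced with a small sketch. This is not the spec's extraction algorithm; it only approximates DOM textContent for a single itemprop element using Python's standard html.parser, and it ignores itemscope, itemref, meta/content, and URL-valued properties. The names TextContentExtractor and text_content are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have an end tag, so they must not change the depth.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class TextContentExtractor(HTMLParser):
    """Hypothetical helper: collects the text inside the first element
    carrying a given itemprop, roughly like DOM textContent."""

    def __init__(self, prop):
        super().__init__(convert_charrefs=True)
        self.prop = prop
        self.depth = 0      # > 0 while inside the property element
        self.done = False   # True once the property element has closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        if self.depth:
            self.depth += 1
        elif not self.done and dict(attrs).get("itemprop") == self.prop:
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID or not self.depth:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True

    def handle_data(self, data):
        # Text is kept, including the contents of <style>; tags are dropped.
        if self.depth:
            self.chunks.append(data)


def text_content(html, prop):
    parser = TextContentExtractor(prop)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


AD_COPY = """<div itemprop=adcopy>
 <style scoped> em { color: purple } </style>
 This product costs <s>$500</s> just $100!
 You should get <em>this</em> version, not any version.
</div>"""

flattened = text_content(AD_COPY, "adcopy")
print(repr(flattened))
```

In this sketch the style rule ends up inside the value, and the strike-through and emphasis disappear, which is the loss of structure described above.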


> A perhaps more esoteric use case: translation services such as Google Translate
> might look for examples where the same information about an item was given in
> different languages as potential sources for improving its translation
> services.

Such a tool would presumably want intra-text language annotations, not just coarse language annotations.


I think if we're to address the use cases presented, we need to add more than just lang="" support; we need to add subtree support (which would give us language support for free). I don't think it makes sense to make such a radical addition so early in the technology's development. We should wait to see how people are using it, first.

Comment 9 Ian 'Hixie' Hickson 2011-12-07 00:15:39 UTC

EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are satisfied with this response, please change the state of this bug to CLOSED. If you have additional information and would like the editor to reconsider, please reopen this bug. If you would like to escalate the issue to the full HTML Working Group, please add the TrackerRequest keyword to this bug, and suggest title and text for the tracker issue; or you may create a tracker issue yourself, if you are able to do so. For more details, see this document:
   http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Partially Accepted
Change Description: none yet
Rationale: I have marked this LATER so that we can look at this again once browsers have caught up with what we've specified so far, per the last paragraph of comment 8.

Comment 10 public-rdfa-wg 2013-01-24 06:48:26 UTC

This bug was cloned to create HTML WG bug 19050.

Comment 11 Ian 'Hixie' Hickson 2013-03-18 23:08:31 UTC

Are there widely used microdata vocabularies that are currently working around the lack of structured data support in microdata, or tool chains that use something other than microdata because the lack of structured data support was one of the main driving factors in choosing another, otherwise less well-suited, technology?

Comment 12 Ian 'Hixie' Hickson 2014-07-21 20:46:20 UTC

If we did this:

 - I guess we'd represent the fragment of HTML as a string of HTML markup, in the 
   microdata data model, as opposed to a DOM, since a DOM is an in-memory structure
 - we'd need a way to mark where the fragment started and ended, presumably either
   an attribute next to itemprop="", or, probably better, an element that indicates
   that its value is its structured contents, not its textual contents.

Comment 13 Anne 2016-03-16 18:02:42 UTC

Resolving WONTFIX given the lack of interest. Hopefully that's acceptable.