19050 – Microdata: Language handling

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Status:	RESOLVED LATER

Alias:	None

Product:	HTML WG
Classification:	Unclassified
Component:	HTML Microdata (editor: Ian Hickson)
Version:	unspecified
Hardware:	PC All

Importance:	P2 normal
Target Milestone:	---
Assignee:	This bug has no owner yet - up for the taking
QA Contact:	HTML WG Bugzilla archive list

Reported:	2012-09-25 22:00 UTC by contributor
Modified:	2013-02-25 22:49 UTC
CC List:	8 users

Description contributor 2012-09-25 22:00:24 UTC

This was cloned from bug 14470 as part of operation LATER convergence.
Originally filed: 2011-10-14 19:21:00 +0000
Original reporter: Jeni Tennison <jeni@jenitennison.com>

================================================================================
 #0   Jeni Tennison                                   2011-10-14 19:21:01 +0000 
--------------------------------------------------------------------------------
It is not clear how microdata handles languages. Language is not mentioned as part of the microdata data model [1]. It is not exposed within microdata JSON [2]. It is not used in the algorithm for creating vCard [3] or iCalendar [4], where it should be used to provide a value for the LANGUAGE property [5][6].

There is a list of examples of multi-lingual content on the web at [7]. Another example is the EUR-LEX site where information about items of European legislation is available in multiple languages [8] or on legislation.gov.uk where Welsh and English titles for the same item of legislation are listed together [9].

Microdata will be unusable for multi-lingual content if it doesn't preserve the language of textual values. The spec should make it clear whether language should be preserved by consumers, ignored, or if this is implementation dependent. Regardless, the vCard and iCalendar conversions in the spec should take account of language.

[1] http://www.whatwg.org/specs/web-apps/current-work/multipage/microdata.html#the-microdata-model
[2] http://www.whatwg.org/specs/web-apps/current-work/multipage/microdata.html#json
[3] http://www.whatwg.org/specs/web-apps/current-work/multipage/microdata.html#conversion-to-vcard
[4] http://www.whatwg.org/specs/web-apps/current-work/multipage/microdata.html#conversion-to-icalendar
[5] http://tools.ietf.org/html/rfc6350#section-5.1
[6] http://tools.ietf.org/html/rfc5545#section-3.2.10
[7] http://microformats.org/wiki/multilingual-examples
[8] http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=CELEX:31994Y0702(01):FR:NOT
[9] http://www.legislation.gov.uk/wsi
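
For illustration, here is a minimal Python sketch of what carrying language through to vCard could look like, using the LANGUAGE parameter from RFC 6350 ([5] above). The helper name `vcard_line` is hypothetical and not part of any spec; the escaping covers only simple text values (backslash, comma, newline), and line folding is omitted.

```python
from typing import Optional


def vcard_line(name: str, value: str, lang: Optional[str] = None) -> str:
    """Format one vCard content line, adding LANGUAGE when a tag is known.

    Hypothetical helper: handles simple text values only, no line folding.
    """
    # Escape backslash first so later escapes are not doubled.
    escaped = (
        value.replace("\\", "\\\\")
        .replace(",", "\\,")
        .replace("\n", "\\n")
    )
    params = f";LANGUAGE={lang}" if lang else ""
    return f"{name}{params}:{escaped}"


print(vcard_line("ROLE", "hoca", "tr"))
print(vcard_line("ROLE", "teacher"))
```

Without a language in the microdata model, a converter only ever has the second form available unless it looks at the HTML layer itself.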
================================================================================
 #1   Ian 'Hixie' Hickson                             2011-10-18 22:41:06 +0000 
--------------------------------------------------------------------------------
It's entirely up to the vocabulary to specify a property to carry the language. Microdata is just a group of name-value pairs.
================================================================================
 #2   Jeni Tennison                                   2011-10-19 15:27:16 +0000 
--------------------------------------------------------------------------------
(In reply to comment #1)
> It's entirely up to the vocabulary to specify a property to carry the language.
> Microdata is just a group of name-value pairs.

Your (now dropped) mapping of microdata to RDF [1] did take into account language from the element when generating RDF (step 6.1.4.). Does that mean that it is OK for a consumer to take language into account when processing microdata, despite it not being part of the microdata data model, or was that an error in that mapping?

Having some text in the spec that clarifies the interaction of microdata and HTML language would be really useful to avoid publisher and consumer confusion.

[1] http://www.w3.org/TR/2011/WD-microdata-20110525/#generate-the-triples-for-an-item
================================================================================
 #3   Ian 'Hixie' Hickson                             2011-10-25 02:58:01 +0000 
--------------------------------------------------------------------------------
The RDF mapping was not a microdata to RDF mapping, it was an HTML to RDF mapping, and so it did much more than just expose the microdata model. It was also, IMHO, a rather misguided idea.

I don't understand what is unclear here. It seems crystal clear that the microdata model doesn't have language, just like it doesn't list prices for each property, or data types, or the phase of the moon when the property was set: if it had anything to do with a language, it would be mentioned, and it is not.
================================================================================
 #4   Jeni Tennison                                   2011-10-26 09:00:40 +0000 
--------------------------------------------------------------------------------
The issue is that the natural/obvious method of getting information about the language of a property value is for an application to use the lang DOM property of the relevant property element. Without a clear indication that doing so is non-conformant, the assumption will be that the HTML language can be used by applications that interpret microdata and map to other formats because even though it's not part of the microdata data model, language is information that is accessible from the DOM.

It is also not clear to microdata vocabulary creators that they must provide properties/types to indicate the language of a property's value if they want to capture that information. Illustrating the use of other languages in one of the example vocabularies would be one way of making this clearer.
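
To make the "natural/obvious method" concrete, here is a rough Python sketch of a consumer that reads the inherited lang="" value for each property element. It is only an illustration of the behaviour under discussion, not something the spec defines. The class name `PropLangParser` is made up, and the sketch assumes well-nested markup with explicit end tags.

```python
from html.parser import HTMLParser

# Elements that never have an end tag, so they must not touch the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class PropLangParser(HTMLParser):
    """Record the inherited lang="" value in effect at each itemprop element."""

    def __init__(self):
        super().__init__()
        self.lang_stack = [""]  # "" means the language is unknown
        self.found = []         # (property name, effective language)

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "lang" in attrs:
            lang = attrs["lang"] or ""
        else:
            lang = self.lang_stack[-1]  # inherit from the parent element
        if "itemprop" in attrs:
            self.found.append((attrs["itemprop"], lang))
        if tag not in VOID:
            self.lang_stack.append(lang)

    def handle_endtag(self, tag):
        if tag not in VOID and len(self.lang_stack) > 1:
            self.lang_stack.pop()


parser = PropLangParser()
parser.feed('<div lang="en"><div itemscope>'
            '<span itemprop="title" lang="cy">Teitl</span>'
            '<span itemprop="title">Title</span>'
            '</div></div>')
print(parser.found)
```

A consumer written this way would label the two titles "cy" and "en" even though neither label is part of the microdata model.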
================================================================================
 #5   Henri Sivonen                                   2011-10-27 08:52:47 +0000 
--------------------------------------------------------------------------------
(In reply to comment #1)
> It's entirely up to the vocabulary to specify a property to carry the language.
> Microdata is just a group of name-value pairs.

It seems very inconvenient to have to specify a vocabulary-specific language markup mechanism instead of using the language markup mechanism from the HTML layer.

Unfortunately, language info from the HTML layer doesn't map nicely to JSON. Is that the reason why language isn't part of the data model? What's the reason why language isn't part of the data model?
================================================================================
 #6   Ian 'Hixie' Hickson                             2011-11-03 16:00:27 +0000 
--------------------------------------------------------------------------------
I agree that it's inconvenient. The JSON issue isn't the reason, though it is certainly a factor.

The original reason is simply that none of the use cases indicated a need for this. It's still not entirely clear to me what use cases exist. Certainly multilingual content exists, but what are people intending to do with it in a microdata context that requires the labeling to persist?
================================================================================
 #7   Jeni Tennison                                   2011-11-05 07:17:00 +0000 
--------------------------------------------------------------------------------
A use case is that a search engine wants to bring together reviews and other information about films into film-centric pages. It gathers that information about that film from all over the web and wants to present people with reviews in their preferred language(s). This requires it to preserve information about the language of the reviews.

Also in this case, the film might have different titles in different languages; the search engine would be able to link together the information provided in different languages about the same film using pages in which there were multiple translations of the title (see e.g. [1]).

A perhaps more esoteric use case: translation services such as Google Translate might look for examples where the same information about an item was given in different languages as potential sources for improving its translation services.

[1] http://fr.wikipedia.org/wiki/Les_Aventures_de_Tintin_:_Le_Secret_de_La_Licorne
================================================================================
 #8   Ian 'Hixie' Hickson                             2011-11-11 20:01:18 +0000 
--------------------------------------------------------------------------------
(In reply to comment #7)
> A use case is that a search engine wants to bring together reviews and other
> information about films into film-centric pages. It gathers that information
> about that film from all over the web and wants to present people with reviews
> in their preferred language(s). This requires it to preserve information about
> the language of the reviews.

(I assume you mean aggregator, not search engine.)

The above can be solved today, you just need to include the language information in the microdata:

   <p itemscope itemtype="http://example.com/movie/review">
    <span itemprop=text> bla bla bla </span>
    <meta itemprop=language content="en">
   </p>
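
As a rough sketch of how an aggregator could use such a property, the following Python assumes the items have already been extracted into the microdata JSON shape (a "properties" object mapping names to lists of values). The function name `reviews_in_languages` is hypothetical; the property names "text" and "language" come from the example above.

```python
def reviews_in_languages(items, preferred):
    """Return review texts whose "language" property matches a preferred tag."""
    wanted = {tag.lower() for tag in preferred}
    texts = []
    for item in items:
        props = item.get("properties", {})
        # Language tags compare case-insensitively.
        langs = {value.lower() for value in props.get("language", [])}
        if langs & wanted:
            texts.extend(props.get("text", []))
    return texts


items = [
    {"type": ["http://example.com/movie/review"],
     "properties": {"text": [" bla bla bla "], "language": ["en"]}},
    {"type": ["http://example.com/movie/review"],
     "properties": {"text": [" bla bla "], "language": ["fr"]}},
]
print(reviews_in_languages(items, ["EN"]))
```

This only matches exact tags; a real consumer would probably want range matching such as "en" against "en-GB".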

It's redundant with lang="", but lang="" doesn't have the same coarseness as microdata. Consider:

   <p itemscope itemtype="http://example.com/movie/review" lang="en">
    <span itemprop=text>
     <span lang="de">bla</span>
     <span lang="fr">bla</span>
    </span>
   </p>

What language would you associate with the "text" property?
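
The mismatch can be shown with a small Python sketch that records the language of each text run inside a property element next to the language of the property element itself. The class name `TextRunLangs` is made up, and the sketch assumes well-nested markup with explicit end tags.

```python
from html.parser import HTMLParser

VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class TextRunLangs(HTMLParser):
    """Collect (language, text) runs found inside itemprop elements."""

    def __init__(self):
        super().__init__()
        self.lang_stack = [""]  # "" means the language is unknown
        self.depth_in_prop = 0  # > 0 while inside a property element
        self.prop_lang = None   # language of the last property element seen
        self.runs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "lang" in attrs:
            lang = attrs["lang"] or ""
        else:
            lang = self.lang_stack[-1]
        if "itemprop" in attrs:
            self.prop_lang = lang
        if tag in VOID:
            return
        self.lang_stack.append(lang)
        if self.depth_in_prop or "itemprop" in attrs:
            self.depth_in_prop += 1

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if len(self.lang_stack) > 1:
            self.lang_stack.pop()
        if self.depth_in_prop:
            self.depth_in_prop -= 1

    def handle_data(self, data):
        if self.depth_in_prop and data.strip():
            self.runs.append((self.lang_stack[-1], data.strip()))


parser = TextRunLangs()
parser.feed('<p itemscope itemtype="http://example.com/movie/review" lang="en">'
            '<span itemprop=text>'
            '<span lang="de">bla</span> <span lang="fr">bla</span>'
            '</span></p>')
parser.close()
print(parser.prop_lang, parser.runs)
```

Here the property element inherits "en", yet none of the text it contains is in that language, so no single tag describes the value.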

Also, note that microdata isn't currently intended for handling cases where entire blobs of HTML content are aggregated. For example, it would completely fail with something like:

  <div itemprop=adcopy>
   <style scoped> em { color: purple } </style>    
   This product costs <s>$500</s> just $100!
   You should get <em>this</em> version, not any version.
  </div>

The microdata extraction would get:

   "adcopy": [ "\n    em { color: purple }     \n   This product costs $500 just $100!\n   You should get this version, not any version.\n  \n" ]

...which isn't at all what was intended.
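
The flattening can be reproduced with a short Python sketch that approximates textContent by concatenating every text node and discarding the tags. The names `TextContent` and `text_content` are illustrative, and Python's html.parser is only a stand-in for a real HTML parser.

```python
from html.parser import HTMLParser


class TextContent(HTMLParser):
    """Concatenate every text node, roughly as textContent does."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for ordinary text and for the raw contents of <style>.
        self.parts.append(data)


def text_content(markup):
    parser = TextContent()
    parser.feed(markup)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)


markup = ('<div itemprop=adcopy>'
          '<style scoped> em { color: purple } </style>'
          'This product costs <s>$500</s> just $100! '
          'You should get <em>this</em> version, not any version.'
          '</div>')
print(repr(text_content(markup)))
```

The style rule ends up inside the value, and the strikethrough and emphasis are lost, which is the failure described above.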


> A perhaps more esoteric use case: translation services such as Google Translate
> might look for examples where the same information about an item was given in
> different languages as potential sources for improving its translation
> services.

Such a tool would presumably want intra-text language annotations, not just coarse language annotations.


I think if we're to address the use cases presented, we need to add more than just lang="" support; we need to add subtree support (which would give us language support for free). I don't think it makes sense to make such a radical addition so early in the technology's development. We should wait to see how people are using it, first.
================================================================================
 #9   Ian 'Hixie' Hickson                             2011-12-07 00:15:39 +0000 
--------------------------------------------------------------------------------
EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are satisfied with this response, please change the state of this bug to CLOSED. If you have additional information and would like the editor to reconsider, please reopen this bug. If you would like to escalate the issue to the full HTML Working Group, please add the TrackerRequest keyword to this bug, and suggest title and text for the tracker issue; or you may create a tracker issue yourself, if you are able to do so. For more details, see this document:
   http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Partially Accepted
Change Description: none yet
Rationale: I have marked this LATER so that we can look at this again once browsers have caught up with what we've specified so far, per the last paragraph of comment 8.
================================================================================