Bug 12417 - HTML5 is missing attribute for specifying translatability of content
Status: VERIFIED FIXED
Product: HTML WG
Classification: Unclassified
Component: LC1 HTML5 spec
Version: unspecified
Hardware: All All
Importance: P2 enhancement
Assigned To: Ian 'Hixie' Hickson
HTML WG Bugzilla archive list
implement proposal in comment 64
Reported: 2011-04-04 14:54 UTC by Jirka Kosek
Modified: 2014-03-06 23:19 UTC (History)
Description Jirka Kosek 2011-04-04 14:54:01 UTC
Problem:
A lot of web content has to be provided in many languages. Usually this is done through machine translation, either by the website owner or by a user running a translation widget such as Google Translate or Microsoft Translator. During automatic translation, however, more is usually translated than is desirable: things like program listings, terms, and names should not be translated. The problem can be solved by providing a flag that excludes an element's content from the translation process. This can greatly increase the precision and usefulness of machine translation of web content.

Solution:
A new global attribute "translate" with the permitted values "yes" and "no" should be added.

Rationale:
Having such an attribute is a best practice for localization of content (see e.g. http://www.w3.org/TR/xml-i18n-bp/#DevTransOver)

Such an attribute is currently supported by Microsoft Translator, and there are already pages using it and relying on the functionality.
Comment 1 Karl Dubost 2011-04-04 16:46:13 UTC
@jirka

- Do you have pointers to the pages which are using it?
- Would it have the same inheritance rules as the "lang" attribute, for example?

Related to this, I haven't yet seen tools that honor the lang attribute during automatic translation. Are there any online? That is, take a text in two languages, say French and English, with proper lang markup, and translate the page from French to English: only the French part would be translated (avoiding the English-to-English translation that sometimes happens).
Comment 2 Kornel Lesinski 2011-04-04 18:54:38 UTC
For the record, there's a related W3C recommendation:

http://www.w3.org/TR/its/

which, among lots of other features, has its:translate="no".

Google Translate supports `class="notranslate"` (http://translate.google.com/support/)


Theoretically such an effect could be achieved with `lang=zxx`, although this loses information about which language the text is in (which may be useful for screen readers).

Another solution could be something like `lang="en-notranslate"` (i.e. correct language code with some suffix indicating it should not be translated).
Comment 3 Jirka Kosek 2011-04-05 08:42:23 UTC
(In reply to comment #1)

> - Do you have pointers to the pages which are using it?

Actually no. What I have for now is a lot of XML content that uses its:translate="no" and has to be put on the web, where translation widgets will be used to translate it into other languages. So I started looking for such a facility in HTML5 and realized that there is none.

But I had a chat with some folks from MS, and according to them quite a lot of pages already use translate="no". Maybe WG members whose companies run search engines can run a simple query to get some numbers.

> - Would it have the same inheritance rules than "lang" attribute for example?

Yes. Actually full definition of inheritance and override rules can be found at http://www.w3.org/TR/its/#datacategories-defaults-etc
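
A minimal sketch of the inheritance behaviour described here, assuming well-formed markup and the proposed "yes"/"no" values: an element without a translate attribute takes its parent's setting, and a nested translate="yes" overrides an enclosing translate="no". The class and function names are illustrative only and are not taken from ITS or any HTML5 draft.

```python
from html.parser import HTMLParser

# Void elements have no content and no end tag, so they never open a scope.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class TranslatableText(HTMLParser):
    """Collect text whose nearest explicit translate flag is not "no"."""

    def __init__(self):
        super().__init__()
        self.stack = [True]  # assumed document default: translatable
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            return
        value = (dict(attrs).get("translate") or "").lower()
        if value == "no":
            self.stack.append(False)
        elif value == "yes":
            self.stack.append(True)
        else:
            self.stack.append(self.stack[-1])  # inherit from the parent

    def handle_endtag(self, tag):
        if tag not in VOID and len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        if self.stack[-1] and data.strip():
            self.chunks.append(data.strip())


def translatable_text(html):
    """Return the text chunks a translator would be allowed to touch."""
    parser = TranslatableText()
    parser.feed(html)
    parser.close()
    return parser.chunks
```

For example, in `<p translate="no">Run <code>ls</code> <span translate="yes">now</span></p>` only "now" would be handed to the translator under these assumptions.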

> Related to this, I haven't seen tools yet enforcing the lang attribute for
> automatic translation. Are there online? Aka a text having two languages with
> proper markup for lang let's say French and English. And you translate the page
> from French to English. only the French part would be translated (and avoid to
> translate English to English as it is happening sometimes.)

I don't know. I don't think that such pages are very common.
Comment 4 Jirka Kosek 2011-04-05 08:47:13 UTC
(In reply to comment #2)

> Google Translate supports `class="notranslate"`
> (http://translate.google.com/support/)

I don't think that abusing a CSS class for this is something we should endorse.

> Theoretically such effect could be achieved with `lang=zxx`, although this
> loses information about language text is in (which may be useful for screen
> readers).

Translatability is orthogonal to language. Language is important not only for screen readers but also for hyphenation, spell-checking, proper bidi, ...

> Another solution could be something like `lang="en-notranslate"` (i.e. correct
> language code with some suffix indicating it should not be translated).

Again, I don't think translatability has anything to do with language, so using the lang attribute to convey such information looks like abuse to me. Also, even if a new lang subtag were introduced, the syntax would be too cumbersome for the average web developer.
Comment 5 Karl Dubost 2011-04-05 12:35:58 UTC
(In reply to comment #3)
> > from French to English. only the French part would be translated (and avoid to
> > translate English to English as it is happening sometimes.)
> 
> I don't know. I don't think that such pages are very common.

Quebec sites for example. :)
Comment 6 Ian 'Hixie' Hickson 2011-06-15 23:20:00 UTC
Using class="notranslate" isn't an abuse as far as I can tell.

This seems like a feature that would get only narrow use. Is it really worth adding to the language?

It seems like the ideal solution would be a new language subtag, frankly.


EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are satisfied with this response, please change the state of this bug to CLOSED. If you have additional information and would like the editor to reconsider, please reopen this bug. If you would like to escalate the issue to the full HTML Working Group, please add the TrackerRequest keyword to this bug, and suggest title and text for the tracker issue; or you may create a tracker issue yourself, if you are able to do so. For more details, see this document:
   http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Rejected
Change Description: no spec change
Rationale: there's already a solution, a better solution might be to use a new language subtag, and it's not clear a solution is actually needed.
Comment 7 Jirka Kosek 2011-06-29 14:52:37 UTC
(In reply to comment #6)

> Using class="notranslate" isn't an abuse as far as I can tell.

I disagree. The class attribute is primarily intended as a hook for CSS. The HTML5 spec currently provides no mechanism for reserving some class values for special use -- it would be very bad design to have special values recognized in a place where the user is supposed to be free to choose any identifier.

Moreover, similar use of the class attribute by Microformats was criticized in the past and led to the creation of RDFa and Microdata.

> This seems like a feature that would get only narrow use. Is it really worth
> adding to the language?

I don't think its use would be narrow, otherwise I wouldn't ask for it. More and more web content is produced by translation, either machine or human, and such a non-translatability tag can vastly improve the precision of translation.

> It seems like the ideal solution would be a new language subtag, frankly.

This was discussed before. Translatability is orthogonal information to language information, so it doesn't make sense to tangle it into single value.

> Rationale: there's already a solution,

Using the class attribute has several drawbacks, and it is an abuse of this attribute.

> a better solution might be to use a new
> language subtag,

Nope.

> and it's not clear a solution is actually needed.

It is clear that a solution is actually needed. Try speaking with people from the localization industry.
Comment 8 Richard Ishida 2011-07-25 11:27:13 UTC
> This seems like a feature that would get only narrow use. Is it really worth
> adding to the language?

If Jirka is too modest to say it, let me say it for him.  He is raising this issue on behalf of a number of key people involved in localisation technology who have been wanting this for some time. As an additional note, I have been running workshops for the past year looking for gaps in the standards for the multilingual web and this is a topic that has consistently created significant interest. The use of machine translation is increasing rapidly of late, and making significant inroads into the localisation industry, and such a facility is a no-brainer for industrial use of MT.  Google has long since seen the need to implement this feature for their MT apparatus, as has Microsoft (and MS has requested this feature in the past of HTML5 too). 

So I do not see this getting narrow use, and I definitely see it worth adding to the language.

One reason for that is to standardise the approach.

Whereas Google and MS currently both use the class=notranslate approach, we should feel fortunate that they decided to use an identical approach.  On the other hand, there are other parts of this that are definitely not standardised. 

MS apparently also supports style="notranslate"; it certainly supports the custom attribute translate="no".

Microsoft doesn't translate content within <code> elements, but there don't seem to be instructions about how to override this if you do want (perhaps even just certain parts of) your <code> element content to be translated.

It's also not made clear, btw, how to make subelements translatable inside an element that has been set to notranslate -  which may sometimes be appropriate.

With Google, if you have an entire page that should not be translated, you can add:

    <meta name="google" value="notranslate">

to the <head> of your page and they won't translate any of the content on that page.

However they also support:

    <meta name="google" content="notranslate">

This shouldn't be Google specific, and a translate=no attribute on the html tag would be far cleaner.

Microsoft doesn't seem to offer the same markup here, but on the other hand using <meta name="microsoft" content="notranslateclasses myclass1 myclass2" /> anywhere on the page (or as part of a widget snippet) ensures that any of the CSS classes listed following “notranslateclasses” should behave the same as the “notranslate” class. 

This is a mess and HTML5 is just what is needed to standardise this.
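
To make the fragmentation concrete, here is a rough sketch of what a tool has to do today just to notice the opt-out signals listed above. It only covers the variants quoted in this comment (class="notranslate", the custom translate="no" attribute, both spellings of the Google meta element, and the Microsoft "notranslateclasses" list); the class name and the simplification that the meta elements appear before the content they affect are assumptions made for the example.

```python
from html.parser import HTMLParser


class VendorSignals(HTMLParser):
    """Record which vendor-specific do-not-translate signals a page uses."""

    def __init__(self):
        super().__init__()
        self.page_opt_out = False   # <meta name="google" ...>
        self.extra_classes = set()  # classes named after "notranslateclasses"
        self.flagged = []           # (tag, reason) pairs

    def handle_starttag(self, tag, attrs):
        a = {k: (v or "") for k, v in attrs}
        if tag == "meta":
            name = a.get("name", "").lower()
            # Both spellings quoted above: content= and value=
            words = (a.get("content") or a.get("value") or "").split()
            if name == "google" and "notranslate" in words:
                self.page_opt_out = True
            if name == "microsoft" and words[:1] == ["notranslateclasses"]:
                self.extra_classes.update(words[1:])
            return
        classes = set(a.get("class", "").split())
        if "notranslate" in classes:
            self.flagged.append((tag, "class=notranslate"))
        elif classes & self.extra_classes:
            self.flagged.append((tag, "listed class"))
        if a.get("translate", "").lower() == "no":
            self.flagged.append((tag, "translate=no"))


def scan(html):
    """Parse a page and return the collected signals."""
    parser = VendorSignals()
    parser.feed(html)
    parser.close()
    return parser
```

Four different mechanisms end up being mapped onto one yes/no decision, which is the case for a single standard attribute.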


Overloading language tags is not the solution. For example, a language tag can indicate which text is to be spellchecked against a particular dictionary.  This has nothing to do with whether that text is to be translated or not.  They are different concepts.  In a document that has lang=en in the html header, if you set lang=nottranslate lower down the page, that text will no longer be spellchecked, since the language is no longer English. (Nor, for that matter, will styling work, voice browsers pronounce correctly, etc.)


Some links:

http://www.microsoft.com/Web/solutions/mstranslator.aspx

http://translate.google.com/support/
Comment 9 Yves 2011-07-26 15:57:55 UTC
Hi,

I support Jirka and Richard on this.


> Using class="notranslate" isn't an abuse as far as I can tell.

The class attribute is used for other things. Overloading it with "notranslate" is not a viable solution.


> This seems like a feature that would get only narrow use.
> Is it really worth adding to the language?

I disagree. Several content providers are using workarounds because a standard solution does not exist.


> It seems like the ideal solution would be a new
> language subtag, frankly.

Using a subtag is not the solution either. It overloads the value with information that has nothing to do with the language.

-Yves Savourel
ENLASO Corporation
www.translate.com
Comment 10 Felix Sasaki 2011-07-27 08:38:05 UTC
I support Richard, Yves and Jirka and second what they said about this.
Comment 11 Sergei Vasilyev 2011-07-27 09:51:09 UTC
+1 to having the "translate" attribute.
Comment 12 Karl Dubost 2011-07-27 11:03:42 UTC
(In reply to comment #9)
> > Using class="notranslate" isn't an abuse as far as I can tell.
> 
> The class attribute is used for other things. Overloading it with "notranslate" is not a viable solution.

I think the question Hixie had was: does it answer the use cases?
In which ways? What does it break, and what does it improve? (with code scenarios)

> > This seems like a feature that would get only narrow use.
> > Is it really worth adding to the language?
> 
> I disagree. Several content providers are using workaround because a standard
> solution does not exist.

Is there a document describing what the content providers are doing?
What code (workaround) are they using?
How can it be solved with code, without focusing on one possible solution?

> > It seems like the ideal solution would be a new
> > language subtag, frankly.
> 
> Using a subtag is not the solution either. It overloads the value with
> information that has nothing to do with the language.

In which ways? Could you give more details. (code+explanation)
Comment 13 Felix Sasaki 2011-07-27 11:18:30 UTC
Hi Karl,

(In reply to comment #12)
> (In reply to comment #9)
> > > Using class="notranslate" isn't an abuse as far as I can tell.
> > 
> > The class attribute is used for other things. Overloading it with "notranslate" is not a viable solution.
> 
> I think the question that hixie had, was does it answer the use cases?
> In which ways, what does it break, what does it improve? (with code scenarios)
> 
> > > This seems like a feature that would get only narrow use.
> > > Is it really worth adding to the language?
> > 
> > I disagree. Several content providers are using workaround because a standard
> > solution does not exist.
> 
> Is there a document with what the content providers are doing? 
> What code (workaround) are they using?

Richard described the workarounds here http://www.w3.org/Bugs/Public/show_bug.cgi?id=12417#c8

> How to solve it with code without focusing on  one possible solution?

Not sure what you mean by "without focusing on one possible solution" - could you explain?

> 
> > > It seems like the ideal solution would be a new
> > > language subtag, frankly.
> > 
> > Using a subtag is not the solution either. It overloads the value with
> > information that has nothing to do with the language.
> 
> In which ways? Could you give more details. (code+explanation)

<p lang=en-us>...</p> identifies the content of "p" to be American English. This identification is independent of an application like machine translation. So if you change this e.g. to <p lang=en-us-trans>...</p>, other applications that "understand" en-us will be confused. The issue behind this is that applications like choice of fonts or spell checkers rely on language *identification*, whereas "Translation" information is rather a directive for a process - "translate me" (me, being a human translator, an MT system, or both).
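
A small sketch of that confusion, under loudly stated assumptions: the dictionary table and the exact-match lookup stand in for a hypothetical spell checker, and "en-us-trans" is only the illustrative tag from the paragraph above, not a registered language tag. A consumer that does prefix fallback would behave differently; the point is only that identification and the translation directive are easier to keep apart as two attributes.

```python
# Hypothetical spell-checker dictionaries keyed by lowercase language tag.
DICTIONARIES = {
    "en-us": "American English",
    "en-gb": "British English",
    "fr": "French",
}


def pick_dictionary(lang):
    """Exact-match lookup, as a simple language-aware tool might do it."""
    return DICTIONARIES.get(lang.lower())


def should_translate(attrs):
    """Read the process directive from its own attribute, not from lang."""
    return attrs.get("translate", "yes").lower() != "no"


# Overloaded tag: the spell checker no longer recognises the language.
overloaded = {"lang": "en-us-trans"}

# Separate attributes: language identification is left untouched.
separate = {"lang": "en-us", "translate": "no"}
```

With the overloaded tag, `pick_dictionary(overloaded["lang"])` finds nothing; with separate attributes the dictionary lookup still succeeds while `should_translate(separate)` reports that the text is to be left alone.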
Comment 14 Bob Clark 2011-07-27 11:21:50 UTC
Absolutely spot-on, Richard! I couldn't agree more, nor could anyone else I know. Let's hope that sanity prevails.
Comment 16 Karl Dubost 2011-07-27 12:32:38 UTC
Hello Felix :)

(In reply to comment #13)
> Hi Karl,
> > > I disagree. Several content providers are using workaround because a standard
> > > solution does not exist.
> > 
> > Is there a document with what the content providers are doing? 
> > What code (workaround) are they using?
> 
> Richard described the workarounds here
> http://www.w3.org/Bugs/Public/show_bug.cgi?id=12417#c8

It was said "several content providers". In Richard's answer, I see two:

* Microsoft
* Google

Are there others and, if so, what do they do and how?
(just trying to get a better overview of the extent of things)

> > How to solve it with code without focusing on  one possible solution?
> 
> Not sure what you mean by "without focusing on one possible solution" - could
> you explain?

We often have a tendency to focus on the solution we have already imagined. That's normal; no blame there. But sometimes there is more than one way to solve an issue, and more practical or less costly ones emerge in the discussion.

For example, I had not realized that the "code" element is already skipped by Microsoft's translator (without relying on any attribute); that is quite cool.


> <p lang=en-us>...</p> identifies the content of "p" to be American English.
> This identification is independent of an application like machine translation.
> So if you change this e.g. to <p lang=en-us-trans>...</p>, other applications
> that "understand" en-us will be confused. The issue behind this is that
> applications like choice of fonts or spell checkers rely on language
> *identification*, whereas "Translation" information is rather a directive for a
> process - "translate me" (me, being a human translator, an MT system, or both).

Thanks a lot. That is clearer.
Comment 17 Pedro L. D 2011-07-27 12:55:06 UTC
It would be very important for the localisation sector to have a translatability tag that identifies these pieces of content for real-time translation systems, which operate directly on the HTML page, and for handling HTML content embedded in XML in interoperability systems.

This is especially relevant for Germanic and other languages where capitalization cannot be used to distinguish proper nouns from common nouns. Brands, acronyms, special idioms, etc. can be kept in the original language, and the content creator would have full control over this without specific rules in the translation back offices, for both human and machine translation.

How? Surely you know better than I do, but I think "keep it simple" applies here. If the default value is “to translate”, it could be just a tag like <NoTranslate> … </NoTranslate>.
Comment 18 Yves 2011-07-27 13:11:05 UTC
(In reply to comment #12)
> (In reply to comment #9)
> > > Using class="notranslate" isn't an abuse as far as I can tell.
> > 
> > The class attribute is used for other things. Overloading it with "notranslate" is not a viable solution.
> 
> I think the question that hixie had, was does it answer the use cases?
> In which ways, what does it break, what does it improve? (with code scenarios)

Good point. Let me try:

If the translate feature were implemented with class, I think it would need:

- two values: e.g. 'notranslate' and 'translate' because the feature needs to
be able to override a parent directive.

- those values would need to be explicitly defined in the specification, so
they are standard.

- class would need to have some additional semantic associated to its behavior
as far as the scope. The idea is to implement the Translate data category
described in http://www.w3.org/TR/its/#trans-datacat 
I don't think 'class' currently is very explicit in that aspect.

BTW, if class can be used for indicating boolean information such as
translate/do-not-translate along with other values (display classes, script
behaviour, etc.), why do other types of boolean information not use it as well?
For example, 'hidden' and 'draggable' have their own attributes. I suppose the
same rationale that applied to them can be applied to having a distinct attribute
for the translate feature.


> > > This seems like a feature that would get only narrow use.
> > > Is it really worth adding to the language?
> > 
> > I disagree. Several content providers are using workaround because a standard
> > solution does not exist.
> 
> Is there a document with what the content providers are doing? 
> What code (workaround) are they using?

I don't think much documentation exists on this. I can mention some solutions
I've seen on material we had to localize (can't share names because of NDAs)

- Use of the class attribute with various values ('protect', 'noloc', etc.)

- Use of HTML comments before the element to which the directive pertains: e.g.
<!--notrans--><p>Text not to translate</p>

- Use of 'custom' attributes: <p notrans>Text not to translate</p>


> How to solve it with code without focusing on one possible solution?

It's a good point. Any solution that would provide an implementation of the ITS
Translate data category would be fine.


> > > It seems like the ideal solution would be a new
> > > language subtag, frankly.
> > 
> > Using a subtag is not the solution either. It overloads the value with
> > information that has nothing to do with the language.
> 
> In which ways? Could you give more details. (code+explanation)

I think Felix has provided some answer for this already.
Mixing identification information with process information is not a good way to
keep separation of concerns.

And it would cause problems with formats where that separation is correctly
done. For example one could technically have: <elem xml:lang='en-notranslate'
its:translate='yes'>translatable or not?</elem>
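
The contradiction in that example can be sketched as follows. The ITS and XML namespace URIs are the real ones; the "-notranslate" suffix is the hypothetical convention discussed in this thread, and the function name is made up for the illustration.

```python
import xml.etree.ElementTree as ET

ITS_NS = "http://www.w3.org/2005/11/its"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def translate_signals(xml_text):
    """Report what each mechanism says about one element, and whether they clash."""
    elem = ET.fromstring(xml_text)
    lang = elem.get(f"{{{XML_NS}}}lang", "")
    its_value = elem.get(f"{{{ITS_NS}}}translate")
    # Hypothetical convention: a "-notranslate" suffix on the language tag.
    lang_value = "no" if lang.lower().endswith("-notranslate") else None
    conflict = (lang_value is not None
                and its_value is not None
                and lang_value != its_value)
    return {"lang_says": lang_value, "its_says": its_value, "conflict": conflict}


SAMPLE = ('<elem xmlns:its="http://www.w3.org/2005/11/its" '
          'xml:lang="en-notranslate" its:translate="yes">'
          'translatable or not?</elem>')
```

Running `translate_signals(SAMPLE)` reports a conflict, and nothing in either mechanism says which one should win.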

I think a subtag would be a more dangerous solution than class. But class, as
described currently, is not enough.
Comment 19 Arle Lommel 2011-07-27 15:02:05 UTC
Very interesting discussion. I can state, as a representative of the Globalization and Localization Association (GALA), an international organization for organizations involved in localization and translation, that such a feature in HTML 5 would be quite important for GALA members. Having this feature available in a standard format from the start would be an aid to everyone from content creator through to the end consumer of the content.

Since HTML is increasingly the output format of choice for many systems (including authoring systems), a standard attribute would be a major improvement for everyone involved. A significant portion of web content will be translated at some point in its lifecycle, whether by human translators or by machine translations, and having the ability to explicitly state what should (or should not) be translated, would be tremendously important for improving quality and assuring that translated output makes sense. Since it is impossible to know in advance what methods will be used for translation, the current non-standardized situation poses a significant obstacle to indicating this information in a consistent and predictable way.

As others have mentioned, texts often contain a mix of translatable and non-translatable content, and knowledge of what should and should not be translated is vital for rendering texts in other languages. As Jirka mentions, a number of organizations are already indicating this information, albeit in inconsistent and non-standard ways. That they are doing so in advance of a standard way to tag translatability shows that this feature is needed by large content creators and is not a fringe use case.

I would second the comments by Richard and others that this is not a function of language tagging, but rather of instruction for processes. This function should be able to be indicated totally independently from indicating the language of the text. The two are related, but distinct, issues. And Yves' comments represent an excellent summary of the various options. I would second him that a language subtag is the least desirable of the options considered.
Comment 20 Karl Dubost 2011-07-27 15:11:46 UTC
What happens with Ruby markup, say kanji annotated with hiragana? What does the translation machine do? Convert the hiragana to romaji?
Comment 21 Jirka Kosek 2011-07-27 15:20:22 UTC
(In reply to comment #20)
> What is happening in Ruby markup cases? Let's say a kanji and hiragana? What
> the translation machine does? Convert the hiragana in romaji?

Please note that the proposed attribute just indicates whether content should or shouldn't be translated. How it is translated is completely out of scope for the HTML5 spec; it is up to the particular translation software or human translator to decide.

I can imagine cases where Ruby can be safely translated and completely removed from document. But I can as well imagine cases where it shouldn't be translated at all and such case can be easily indicated by putting translatability flag on the Ruby element.

Jirka
Comment 22 andrewc 2011-07-27 23:57:41 UTC
(In reply to comment #20)
> What is happening in Ruby markup cases? Let's say a kanji and hiragana? What
> the translation machine does? Convert the hiragana in romaji?

That would be transliteration, not translation.
Comment 23 andrewc 2011-07-28 00:07:01 UTC
I tend to work a lot with government documents, and translated government documents. The following question assumes human rather than machine translation.

Would marking content as non-translatable change the element's contenteditable attribute to false?
Comment 24 Yves 2011-07-28 04:46:01 UTC
(In reply to comment #23)
> Would  marking content as non translatable change the elements' contenteditable
> attribute to false?

I suppose some tool, at some point in a localization process, may find it useful to turn off contenteditable if translate is false.

But I don't think a translate=false implies contenteditable=false all the time.

For example, the author of the source text may still need to edit the content of an element that is not meant to be translated (e.g. some legal blurb, some country-specific information, or a command-line example).
Comment 25 Helena S Chapman 2011-07-28 16:16:06 UTC
I second the opinion of having a separate "translate" attribute as opposed to overloading the language tag for the purpose of translation. There are already plenty of good arguments from others about why this is important; I need not repeat them here.

It is also not appropriate to suggest that only MS and Google have this concern. Any multinational company that needs to reach a more global audience has a requirement to rapidly deliver information to its clients in any corner of the world. And, given enough experience with machine translation, quality will not matter for certain types of content. Without some support from the underlying web standards, there is no easy way to communicate with third-party systems for translation automation in an end-to-end process.

Yours, 

Just someone who works at IBM
Comment 26 Didier Briel 2011-07-28 16:25:06 UTC
(In reply to comment #24)
> (In reply to comment #23)
> > Would  marking content as non translatable change the elements' contenteditable
> > attribute to false?
> 
> I suppose some tool--at some point of a localization process--may find useful
> to turn off contenteditable if translate is false.

We could definitely use a non-translatable attribute in OmegaT (a
Computer-Aided Translation application), possibly optionally.

I approve of Yves' summary, and I'm in favor of such an attribute.

Didier
Comment 27 Ultan O'Broin (@localization) 2011-07-28 20:02:10 UTC
I think it's a fine idea. A single standard would resolve some of the problems we have seen in knowing how to mark up text:

http://www.multilingualblog.com/?p=14

Unfortunately, I have seen many cases where such markup has been abused as a way to reduce translation cost (at the expense of the reader), and of course the translation tool is at the mercy of the person applying the attribute. Therefore there needs to be some consideration for overriding a no-translate flag.
Comment 28 Christian Buratto 2011-07-28 20:14:12 UTC
Let me add more complexity to this...

A "translate" attribute should be flexible enough to indicate a status change
in the translation requirement. Thinking specifically on the localization
workflow, an author may block the translation of content temporarily, while the
suitable translation for the element is not ready -- but then this translation
might become available, and the tag could allow the translator to ping the
'author' for it. For example:

"translate=URL"

The endpoint on URL could provide an empty string ("no translation available")
or one or more translations available for a specific language.
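
A rough sketch of how a tool might consume such a value. Everything here is hypothetical: no endpoint format is defined anywhere in this thread, the example.com URLs are placeholders, and an in-memory table stands in for the HTTP request so the fallback logic can be read in isolation.

```python
# Stand-in for the author's server: URL -> {target language: translation}.
# A real tool would fetch this over HTTP; the response format is assumed.
FAKE_ENDPOINTS = {
    "https://example.com/terms/file-menu": {"fr": "Fichier"},
    "https://example.com/terms/new-brand": {},  # no translation available yet
}


def lookup(url, target_lang):
    """Return the published translation, or "" when none is available."""
    return FAKE_ENDPOINTS.get(url, {}).get(target_lang, "")


def render_term(source_text, translate_value, target_lang, fetch=lookup):
    """Resolve a term whose translate attribute may hold a URL."""
    if translate_value.lower() in ("yes", "no"):
        # Plain flags carry no URL, so there is nothing to look up.
        return source_text
    translation = fetch(translate_value, target_lang)
    # An empty answer means "no translation available": keep the source term.
    return translation or source_text
```

In this sketch, a term pointing at the first URL would be rendered as "Fichier" for French, while a term pointing at the second would stay in the source language until the author publishes a translation.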

I know this demands an implementation of a query mechanism on the author's
server, but the gains in quality and productivity might be worth the discussion.
Imagine that a web page that contains some terms in English is always
parsed before serving and, if a translation is found, it replaces the term
before serving. Old pages could always be up to date, without long processing.
(Your company was bought by another one? All references to its name will be
up to date immediately.)

Similarly, and more important, this could greatly improve the translation
process. A translator would be allowed to have immediate access to that term's
current state and apply it right away. We can even have the translation tool
suggest the translation -- similarly to what termbases do today, but in a
highly targeted way. If we think about contextual translation, where identical
source expressions may be translated differently according to context ("File"
as an application menu, and "File" as a verb, for example), a unique URL for
each could be of extreme value.

Just to give you more info, usually documentation and UI are translated in
parallel, and frequently you don't have access to the final UI terms that are
mentioned on the product instructions. You could finish the doc translation
just to have to change it later when the UI terms are translated or updated.
With the translate=URL approach, the doc could self-update and reflect with
precision what is shown on the localized application.

Of course, the full power of this approach would only be achieved when
authoring tools allow the writers to easily tag terms, and when this tagging is
integrated and available to be queried, but I am having a glimpse of what is
possible. Since you are discussing the creation of a new attribute, maybe it is
the right moment to aim high.

I bet you can see what this could do to TMs...

Regards.

Comment 29 Karl Dubost 2011-07-28 20:25:31 UTC
Question which might be borderline but do not hesitate to tell me if there is
something wrong with the reasoning.

<p lang="en">
<span title="Organization for the human rights" 
      translate="no">Amnesty International</span>, 
in a report, said…
</p>


Automatic translation to French for example

<p lang="fr">
<span title=" […] " 
      translate="no">Amnesty International</span> 
a dit dans un rapport…
</p>

Is the title translated into "Organisation pour les droits de l'homme" or not?

This might be even trickier with nested blocks where the parent has
translate="no". Think about images with alt="" text in a block of text with
translate="no".

How is it done with the current choices and/or hacks mentioned earlier?

PS: not pushing back, just trying to see the consequences for each decision.
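
One way to see the open question is to list the attribute text that falls inside a translate="no" scope. The sketch below assumes the inheritance rule discussed earlier in this thread and looks only at title and alt; whether a translator should touch the listed values is exactly what is undecided, so the code reports them rather than making a choice. The names are illustrative.

```python
from html.parser import HTMLParser

TEXT_ATTRS = ("title", "alt")  # attributes carrying human-readable text
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class AttributeAudit(HTMLParser):
    """List title/alt values found on elements in a translate="no" scope."""

    def __init__(self):
        super().__init__()
        self.stack = [True]  # assumed default: translatable
        self.undecided = []  # (tag, attribute, value) triples

    def handle_starttag(self, tag, attrs):
        a = {k: (v or "") for k, v in attrs}
        flag = a.get("translate", "").lower()
        translatable = {"yes": True, "no": False}.get(flag, self.stack[-1])
        if not translatable:
            for name in TEXT_ATTRS:
                if a.get(name):
                    self.undecided.append((tag, name, a[name]))
        if tag not in VOID:
            self.stack.append(translatable)

    def handle_endtag(self, tag):
        if tag not in VOID and len(self.stack) > 1:
            self.stack.pop()


def audit(html):
    """Return attribute text whose translatability the flag leaves unclear."""
    parser = AttributeAudit()
    parser.feed(html)
    parser.close()
    return parser.undecided
```

Applied to the span above, the audit lists the title "Organization for the human rights" as undecided; an image with alt text nested in a translate="no" block shows up the same way.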
Comment 30 Felix Sasaki 2011-07-28 20:26:40 UTC
(In reply to comment #16)
> Hello Felix :)
> 
> (In reply to comment #13)
> > Hi Karl,
> > > > I disagree. Several content providers are using workaround because a standard
> > > > solution does not exist.
> > > 
> > > Is there a document with what the content providers are doing? 
> > > What code (workaround) are they using?
> > 
> > Richard described the workarounds here
> > http://www.w3.org/Bugs/Public/show_bug.cgi?id=12417#c8
> 
> It was said "several content providers". In Richard answers, I see two:
> 
> * Microsoft
> * Google
> 
> are there others and if yes how, what do they do, etc? 
> (just trying to have a better overview of the extent of things)
> 

The above are providers of online machine translation systems. I am currently
preparing a project with two companies (one localization, one MT provider) who
want to implement the same functionality as a standard means.

So these are then three online MT systems, and one is cooperating with a
localization provider. But a standardized mechanism can be relevant for many
other "participants" on the web, and also in the "deep Web": CMS providers who
produce the necessary information in their templates, which is then processed in MT systems or localization chains; and larger companies with often quite complex content delivery chains (see what Helena mentioned), which result in surface web HTML pages.

It is not important to use the ITS translate mechanism. But it is important to have a mechanism that can be separated from others, and that can be easily shared among all these groups.

And I would call these also content providers.
Comment 31 Felix Sasaki 2011-07-28 20:42:08 UTC
(In reply to comment #29)
> Question which might be borderline but do not hesitate to tell me if there is
> something wrong with the reasoning.
> 
> <p lang="en">
> <span title="Organization for the human rights" 
>       translate="no">Amnesty International</span>, 
> in a report, said…
> </p>
> 
> 
> Automatic translation to French for example
> 
> <p lang="fr">
> <span title=" […] " 
>       translate="no">Amnesty International</span> 
> a dit dans un rapport…
> </p>
> 
> is title translated or not into "Organisation pour les droits de l'homme"?
> 
> and this might be even trickier with nested blocks where the parent has
> translate="no". Think about images with an alt="" text in a block of text with
> a translate="no". 
> 
> How is it done with the current choices and/or hacks mentioned earlier?

The choices and hacks focus only on element content. In this respect they are identical to the "spelling and grammar checking" mechanism, that is: not addressing attribute content - taken from http://dev.w3.org/html5/spec/Overview.html#spelling-and-grammar-checking :
"To determine if a word, sentence, or other piece of text in an applicable *element* (as defined above) is to have spelling- and/or grammar-checking enabled, the UA must use the following algorithm: ..."
There are various mechanisms one could envisage to cover attributes, e.g. defaults, or "global rules" using e.g. CSS selectors to select attributes. But I think this would be the next step, once we agree on a basic mechanism for element content. This would already be of value, like in the case of spell checking.
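
A minimal sketch of what "element content only" means in practice, assuming a processor built on Python's standard html.parser module. The class and function names are hypothetical, and the inheritance rule (nearest explicit value wins) is an assumption rather than something agreed in this thread. Attribute values such as title are never collected.

```python
from html.parser import HTMLParser

# Void elements never get an end tag, so they must not be pushed on the stack.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class TranslatableText(HTMLParser):
    """Collect the text nodes a translator would be allowed to touch."""

    def __init__(self):
        super().__init__()
        self.stack = [True]   # resolved flag per open element; root is translatable
        self.segments = []

    def handle_starttag(self, tag, attrs):
        value = dict(attrs).get("translate")
        if value == "no":
            flag = False
        elif value == "yes":
            flag = True
        else:
            flag = self.stack[-1]   # no explicit value: inherit from the parent
        if tag not in VOID:
            self.stack.append(flag)

    def handle_endtag(self, tag):
        if tag not in VOID and len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        text = " ".join(data.split())   # normalise whitespace
        if text and self.stack[-1]:
            self.segments.append(text)


def translatable_segments(html):
    parser = TranslatableText()
    parser.feed(html)
    parser.close()
    return parser.segments
```

For the Amnesty International example from comment 29, only the text outside the span would be handed to the translator; the title value is not addressed either way, which is exactly the gap Karl points at.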

> 
> PS: not pushing back, just trying to see the consequences for each decision.
Comment 32 Jörg Schütz 2011-07-29 09:50:36 UTC
This is a very interesting thread. The request for an additional markup element or just a new attribute/value pair is an important issue for the global multilingual web. This issue is related to the possible consumption process of information encoded in HTML, and here we have to distinguish, for example, the following three use cases: (1) (human) user wants a translation into her language, (2) NLP application (searching, trawling, analyzing) wants to provide multilingual results, and (3) integration into a localization and translation process chain. In a first approximation, the introduction of a new HTML5 language element sounds feasible and appropriate. However, this might end up with additional requests for the markup of terminology, sentence boundaries, semantic constructs, etc. which are all legitimate demands with convincing use cases, i.e. to effectively guide (machine) translation applications and to enhance the output quality of these applications. We already had the elements "acronym" and "abbr" in HTML 4, and now in HTML5 only "abbr" has survived. So for me it is not a good idea to just introduce new syntactic sugar.

Let's analyse a bit more the possible use cases regarding what HTML5 already has on board as a potential solution, and also let's bear in mind that HTML5 is about web technologies and accessibility (see "wai-aria") which to some extent is included in the above translation scenario requirements.

One solution is with styling, for example: <p class="translatable" lang="en-US">...<b class="term">semantic styling</b>...</p>. This solution was already proposed in this thread, and it seems not optimal for our intended application scenario because it may have side effects with traditional CSS styling approaches.

Another possibility is with microdata, for example: constructs with "itemscope", "itemprop", "itemref" and "itemid" including itemtype attributes, and the use of existing (or new) microdata vocabularies. This approach is pretty much in line with the discussions of a semantic HTML5, and is backed by the use cases above.

In summary, it turns out that we need to establish some "best practices" for specifying the translatability of content, and that web translation applications should be guided through them. Therefore, I suggest that we also discuss a possible microdata approach. I am looking forward to your opinions.
Comment 33 Felix Sasaki 2011-07-29 10:36:26 UTC
Hello Jörg,

(In reply to comment #32)
> This is a very interesting thread. The request for an additional markup element
> or just a new attribute/value pair is an important issue for the global
> multilingual web. This issue is related to the possible consumption process of
> information encoded in HTML, and here we have to distinguish, for example, the
> following three use cases: (1) (human) user wants a translation into her
> language, (2) NLP application (searching, trawling, analyzing) wants to provide
> multilingual results, and (3) integration into a localization and translation
> process chain. In a first approximation, the introduction of a new HTML5
> language element sounds feasible and appropriate. However, this might end up
> with additional requests for the markup of terminology, sentence boundaries,
> semantic constructs, etc. which are all legitimate demands with convincing use
> cases, i.e. to effectively guide (machine) translation applications and to
> enhance the output quality of these applications. We already had the elements
> "acronym" and "abbr" in HTML 4, and now in HTML5 only "abbr" has survived. So
> for me it is not a good idea to just introduce new syntactic sugar.

HTML5 added the "syntactic sugar" of a spell check attribute (see this thread), since there is a clear use case and implementations. I think you can say the same about a translate mechanism, see the list of implementations and groups interested in this http://www.w3.org/Bugs/Public/show_bug.cgi?id=12417#c30 , and the people in this thread. So "translate" is much more central for many users compared to what you mentioned above, e.g. "semantic constructs" or "sentence boundaries". 

> 
> Let's analyse a bit more the possible use cases regarding what HTML5 already
> has on board as a potential solution, and also let's bear in mind that HTML5 is
> about web technologies and accessibility (see "wai-aria") which to some extent
> is included in the above translation scenario requirements.
> 
> One solution is with styling, for example: <p class="translatable"
> lang="en-US">...<b class="term">semantic styling</b>...</p>. This solution was
> already proposed in this thread, and it seems not optimal for our intended
> application scenario because it may have side effects with traditional CSS
> styling approaches.
> 
> Another possibility is with microdata, for example: constructs with
> "itemscope", "itemprop", "itemref" and "itemid" including itemtype attributes,
> and the use of existing (or new) microdata vocabularies. This approach is
> pretty much in line with the discussions of a semantic HTML5, and is backed by
> the use cases above.
> 
> In summary, it turns out that we need to establish some "best practices" for
> specifying the translatability of content, and that web translation
> applications should be guided through them. Therefore, I suggest that we also
> discuss a possible microdata approach. I am looking forward to your opinions.

As Yves said in this thread, it is OK to think about various mechanisms. The key is to have agreement on one solution, and to have it available in the HTML DOM, like in the "spell check" use case. This would not be the case for microdata or RDFa, since, as you know, neither is part of the HTML5 core.
Comment 34 Najib Tounsi 2011-07-29 11:23:10 UTC
+1 to the "translate" attribute in HTML5.

Najib
Comment 35 Jörg Schütz 2011-07-29 12:28:29 UTC
(In reply to comment #33)
Hi Felix,

Thanks for your comments!

Frankly speaking, I also don't much like the spelling and grammar checking feature in HTML5. This feature concerns the content creation process, and therefore doesn't have a particular relation to the actual content representation in HTML (which could be yet another separate process).

Translatability, on the other hand, has such a relation to the content representation because it is a means to guide or to trigger possible subsequent processes such as for instance a (machine) translation application because these processes would directly operate on the HTML representation (maybe also through a certain user agent interaction - my first use case). In this context, my examples of other markup candidates point in exactly this processability direction.

Since the specification of HTML5 is still floating, I wouldn't distinguish between core and non-core features. This means that, depending on a particular use case scenario, there should be room for the discussion of different solution alternatives such as the proposed microdata approach, or even fully fledged semantic web technologies. It would also be a challenging opportunity to demonstrate their applicability in real life scenarios. Last but not least, the proposed microdata approach is available in the HTML5 DOM.

Best -- Jörg
Comment 36 Sergei Vasilyev 2011-07-30 08:03:05 UTC
Now I am thinking that translation is really a process on a text, one of possibly many. Maybe some "protected" attribute would be more efficient to signify that the text it marks is not allowed to be modified by whatever process (including translation). Then translation tools can rely on this attribute to decide whether to translate that span:

> <p lang="fr">
> <span title="[…]" protected="yes">Amnesty International</span> a dit dans un rapport…
> </p>

Or maybe it's all about having "translate" or not, then I prefer to have it, but then "autocorrect" yes/no, "capitalize" yes/no, etc. should be added in HTML6 :).

Sergei
Comment 37 Martin Dürst 2011-08-01 01:04:48 UTC
Just as an additional argument for having a translation-related attribute in HTML itself, I would like to point out that besides third-party translation services, there is also software that directly hooks into the browser and works on the client side. Such software was very popular a few years ago here in Japan. Because of on-line translation services, it's now less popular, but still in use.
Comment 38 Chris Wendt 2011-08-02 22:46:47 UTC
I want to add my support for Richard’s post, and underscore the necessity of a single, simple mechanism to control translation of an element, with an override for inheritance. It is relevant to include the standardization into HTML5 itself, so that authoring tools can reliably support a single mechanism. Automatic translators as well as human translators depend on it. 

An attribute on the element looks like a good solution to me. In addition, it makes sense to allow a style to carry the translatability attribute, so that an author can designate a style to be not translated, for instance to replicate the effect of <code>. You could consider the words to represent the element content as being just a “style”. 

Two ugly wrinkles to be considered:

1) There are certain elements that are assumed to be untranslatable: code, address, samp, script, var, kbd, textarea. All others are by default translatable. Best if we had clarity in the standard that translate is ON by default, except for [this list] of elements, for which it is OFF by default.

2) The element attribute, or any style magic, do not solve the translatability of attributes. The ALT attribute is normally translatable, most other attributes aren’t. 

To solve the attributes issues, we can consider a predefined list of attributes that are translatable, mainly to grandfather in existing HTML4 assumptions, and postulate that the rest aren’t. Define a new attribute like attrib_translate=”list of attributes to translate”, with inheritance, so we can control translation of attributes on all child elements. Plus the reverse setting for overriding inheritance.
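
A rough sketch of the two rules above, under stated assumptions. The element list is the one given in wrinkle 1. The default attribute set contains only alt, because that is the only attribute named here. The attrib_translate parameter stands for the proposed attribute, which does not exist anywhere yet, and the sketch assumes that an explicit list replaces the default set rather than extending it. Function names are hypothetical.

```python
# Elements assumed to be untranslatable by default (list from wrinkle 1).
DEFAULT_OFF = {"code", "address", "samp", "script", "var", "kbd", "textarea"}

# Attributes assumed translatable by default. Only "alt" is named above;
# any further entries would have to be agreed on separately.
DEFAULT_TRANSLATABLE_ATTRS = {"alt"}


def element_translatable(tag, explicit=None, inherited=True):
    """Resolve the translate flag for one element.

    explicit  -- the element's own translate value ("yes"/"no"), or None
    inherited -- the parent's resolved flag
    """
    if explicit == "yes":
        return True
    if explicit == "no":
        return False
    if tag.lower() in DEFAULT_OFF:
        return False
    return inherited


def attribute_translatable(name, attrib_translate=None):
    """Decide whether an attribute value should be translated.

    attrib_translate -- space-separated list taken from the proposed
                        (hypothetical) attrib_translate attribute, or None
                        when no such list is in effect
    """
    if attrib_translate is not None:
        return name.lower() in attrib_translate.lower().split()
    return name.lower() in DEFAULT_TRANSLATABLE_ATTRS
```

With these rules an explicit translate="yes" on a code element would still win over the default, which is the override for inheritance asked for above.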

Chris Wendt
Microsoft Translator
Comment 39 Kevin Lenzo 2011-08-03 15:08:08 UTC
Use attribute name 'localize' rather than 'translate'?

'translate', the transitive verb, is asymmetric with respect to the source and target documents -- it only applies properly to the source version. What should the value of this attribute be in the translated versions?

Perhaps this should be more accurately 'localize', which would make sense both in the source and target documents. This would also allow it to remain 'yes' once the document is converted, without implying that the target document elements should be translated again. It could be argued that 'localize' is imperative and means it should be localized again, but at least it is neutral with respect to direction -- that is, localize this element in the document, irrespective of the document's particular language.
Comment 40 Arle Lommel 2011-08-03 15:24:41 UTC
Kevin’s comment raises an interesting issue. Having a “translate” or “localize” attribute is useful, but without some indication of what the (original) source was, it may end up that derivative representations are treated as source for subsequent processes. This situation could be rather problematic if, for example, an MT product is republished on the web and then another process comes along and treats it as a “clean” source, thus compounding any translation errors.

Although this is now tangential to the original proposal, are there any existing mechanisms in HTML5 that could be used to indicate document relationships in such a way that an agent encountering an HTML5 document could be alerted that that document is derivative from another document? This is not an area I've considered before, so I don’t really know what is available, but it seems that this sort of information would fit in with the semantic emphasis of HTML5. Perhaps someone else will know if there would be a good way of indicating the source/target nature of an HTML5 document.

My quick review of the specification does not show anything immediately promising, and a review of the Meta Extensions wiki (http://wiki.whatwg.org/wiki/MetaExtensions) also shows nothing in this area.

If there is currently no way to indicate this status, I would suggest that some sort of standard meta tag be considered for this information.
Comment 41 Kevin Lenzo 2011-08-03 16:38:15 UTC
I'm honestly not certain if 'localize' is better than 'translate', because of the existence of its:translate, but 'localize' has a couple of things going for it: 

 - in cases of things like URIs, the things really are localized rather than translated; 
 - this tag may be applied to other things like images that are not necessarily textual; 
 - and it has the direction-neutral property (input and output docs can logically have the same values).

I suppose another possibility for indicating the source vs. target distinction would be to extend the value structure of the attribute, from 'yes' and 'no' to include one that indicates that this is the target ('done'? 'target'?) but this seems awkward.

My general preference would be that the HTML5 document could simply be localized in place, without having to insert additional state attributes or value structures. That is, the value of localize= is stable between the source and target document, and is semantically appropriate.
Comment 42 Arle Lommel 2011-08-03 17:15:50 UTC
> My general preference would be that the HTML5 document could simply be
> localized in place, without having to insert additional state attributes or
> value structures. That is, the value of localize= is stable between the source
> and target document, and is semantically appropriate.

This is certainly the simplest way to handle it and would probably account for 90%+ of the use case needs. Additional state attributes would probably not yield much in the way of advantage since the purpose is really to provide a pragma-type instruction to downstream processes about what to do. In this case the only thing most of them would care about is “do I translate/localize this or not”? In most cases this information would be added by authoring tools with some sort of integrated terminology feature that is flagging non-translatables. The simpler we can keep it for them, the better.

The only complication I see right off is knowing whether or not something was already translated, but in most cases MT results will be automatically generated, not published, so this issue wouldn’t normally be a problem.

So my vote would be to keep this as simple as possible and treat my idea of indicating whether a page is source or target as a separate, less important, issue. (My guess is that this attribute really would be best handled as a <meta> tag. As far as I can see, the appropriate place to propose my idea is on the WhatWG wiki. Unfortunately (for me), without someone already implementing the idea and endorsing it, the likelihood of it being accepted there is slim.)
Comment 43 Chris Wendt 2011-08-03 17:42:32 UTC
(In reply to comment #42)
Agree with keeping it simple. Do not overload the notranslate attribute with indicating source and target, that can happen elsewhere, and is optional.

(In reply to comment #39)
>'translate', the transitive verb, is asymmetric with respect to the source and
>target documents -- it only applies properly to the source version.
No. Applies everywhere. notranslate maintains the original language regardless of translation. translate allows this element to be translated. The status of being an "original" or not is irrelevant.

>What should the value of this attribute be in the translated versions?
The same.
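
A small sketch of that behaviour, assuming a hypothetical in-memory Node type instead of a real DOM and an arbitrary translate_text callback standing in for the MT engine. Attributes are copied to the output unchanged, so the flag has the same value in source and target.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)   # Node or str items


def translate_tree(node, translate_text, inherited=True):
    """Return a translated copy of node; attributes are copied unchanged."""
    value = node.attrs.get("translate")
    # An explicit yes/no wins; anything else falls back to the parent's flag.
    flag = {"yes": True, "no": False}.get(value, inherited)
    out = Node(node.tag, dict(node.attrs))
    for child in node.children:
        if isinstance(child, Node):
            out.children.append(translate_tree(child, translate_text, flag))
        else:
            out.children.append(translate_text(child) if flag else child)
    return out
```

Running the output through a second translation pass leaves the protected span untouched again, without any extra source/target state on the attribute.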
Comment 44 Michael[tm] Smith 2011-08-04 05:36:00 UTC
mass-move component to LC1
Comment 45 Aaron Madlon-Kay 2011-08-15 12:25:00 UTC
I was going to point out that ITS already has a translate="yes/no" attribute, but this was already brought up in comment #2(!). So my question is: What is wrong with simply using the ITS attribute? Why in the world does HTML5 need its own attribute? I don't think anyone's addressed this point yet.
Comment 46 Jirka Kosek 2011-08-15 12:28:54 UTC
HTML serialization of HTML5 doesn't support namespaces hence you can't use its:translate. Need for translatability is enough widespread to justify having "native" attribute.
Comment 47 Arle Lommel 2011-08-15 12:37:08 UTC
(In reply to comment #46)
> HTML serialization of HTML5 doesn't support namespaces hence you can't use
> its:translate. Need for translatability is enough widespread to justify having
> "native" attribute.

Just seconding what Jirka has to say. ITS is intended for XML documents, but HTML5 (with the exception of polyglot markup) is not XML and so cannot use ITS. Even polyglot markup runs into trouble when using external namespaces since the resulting document would then be using XML features not required to be supported in HTML5. So while using ITS appears like a good idea at first, it won't work for HTML5.

But Jirka's second point—“Need for translatability is enough widespread…”—is also crucial. If translation of web content is enough of a core function on the web (and I would argue that it is), then core web technologies ought to support it. I would agree with Jirka’s assessment of the importance of this feature, but that is ultimately the question to be determined: is translatability important enough to be defined as a native attribute?
Comment 48 Aaron Madlon-Kay 2011-08-15 12:40:21 UTC
Thanks Jirka and Arle for the clarification. I had assumed that ITS would be usable within HTML5, but I see that that is not true (at least not in all use cases). In that case I have to agree with the proposal for a native translate attribute!
Comment 49 Yves Savourel 2011-10-14 17:21:19 UTC
We are now working on tools to translate HTML5 files.
Has any progress been made on this issue?
Comment 50 Thorsten Trippel 2011-11-16 15:35:14 UTC
DIN NA 105-00-06 issues the following comment on this issue:


Within the German standardization organization DIN the working group NA 105-00-06 deals with language resources as being used in natural language processing (NLP) and the standardization of procedures and data formats for these purposes. We strongly support the inclusion of the translate-attribute natively embedded into HTML 5, as this attribute will provide numerous benefits for NLP. We expect an increased quality of various NLP applications, including named entity recognition, syntactic processing, and machine translation if the correct translation attribute is being used in web pages. We expect the correct use of the translation attribute only if it is a standard attribute in HTML 5 and not a proprietary or customized extension. For this reason we advocate the inclusion at this point, as the development of HTML 5 provides the historic opportunity to provide for localization and the NLP requirements by the addition of this attribute.
Comment 51 Laurent Romary 2011-11-28 14:33:32 UTC
As chairperson of ISO committee TC 37/SC 4 (language resource management) the discussion on a translate attribute for HTML 5 was brought to my attention. With standards such as
the ISO 24614 family (Word segmentation of written texts), ISO 24616
(Multilingual Information framework) or ISO 24617 (semantic annotation
framework), we provide a core portfolio of standardized representations for language resources. Webpages are often an integral part of language resources, hence we closely follow the standardization
initiatives by the W3C and liaise with them in areas of our expertise,
which includes all fields of language technology such as
internationalization efforts for resources, promoting the creation,
exchange and archiving of resources for multiple languages.  We have a
vast number of uses of a translate attribute, both on the side of
publishing material using the translate attribute but also for using the
translate attribute within semantically rich applications.

It is for instance extremely common that language resources and
publications about them come in multiple languages (e.g. examples are in a
different language). Many applications also involve parallel texts.
Typically a search engine for interlinear glossed texts could benefit from the translate attribute. The
Online Database of Interlinear Text (http://odin.linguistlist.org/) is an
example of such an application. In such a situation
the translation of the source text would render the page useless, while the
translation of the surrounding text might be useful.

Similarly, lexical resources have language specific parts that cannot
be translated without rendering the resource useless.  Though these
resources are not necessarily being natively stored as websites, they
are often transformed and published as such. One example of this is
Wordnet, a frequently used resource in the language technologies. See
for example the original (English) Wordnet at
http://wordnet.princeton.edu/ which also has a web interface. This also
applies to other lexical resources using ISO-24613:2008 (Lexical Markup
Framework): These resources are multilingual by design, but parts could
be translated sensibly (for example definitions), others should never be
translated. In fact, translation has been applied to some resources to
support manual creation processes for other languages.

At the same time large websites with high quality parallel content using
a translate attribute are being sought for the creation of parallel
resources. However, to use the translate-attribute on webpages for
creating services, providing parallel content, etc. there is a problem:
without a sufficiently large number of available websites using these
attributes, natural language processing applications cannot use them to
analyze and work with the content.

We thus support the introduction of a translate attribute as a standard
attribute in HTML 5, since our resources would heavily make use of this mechanism
when we publish them, but also because the NLP services that we implement 
could benefit from its existence.
Comment 52 Ian 'Hixie' Hickson 2011-12-03 22:45:01 UTC
So the use case for inline fragments marked as not to be translated is things like product names or quotations from other languages or people's names, etc. It makes sense to have a dedicated way to mark that up because as automated translation and assisted translation becomes more widely used, and as sites become more widely translated, human and computerised translators need assistance in determining what the original author didn't want translated.

But what's the use case for the full-page case?
Comment 53 Yves Savourel 2011-12-04 00:27:56 UTC
(In reply to comment #52)
> So the use case for inline fragments 
> ... 
> But what's the use case for the full-page case?

An example of a use case at the paragraph or set-of-paragraphs level is, for instance, a section that is legal content and should remain in its original language. Another one is a section of the page that is a text excerpt or a citation that needs to remain as is.

An example at the page level would be to allow the author to specify that the whole page should not be translated except for some paragraphs.
Comment 54 Jirka Kosek 2011-12-04 09:14:01 UTC
(In reply to comment #52)

> So the use case for inline fragments marked as not to be translated is things
> like product names or quotations from other languages or people's names, etc.
> It makes sense to have a dedicated way to mark that up because as automated
> translation and assisted translation becomes more widely used, and as sites
> become more widely translated, human and computerised translators need
> assistance in determining what the original author didn't want translated.

Yves already provided examples in #53.

Just to speak about technical implementation -- non-translatable content can be both inline (just a phrase, a few words) and block (a complete paragraph, a few consecutive paragraphs). A dedicated attribute seems the most natural way to support this.
Comment 55 Ian 'Hixie' Hickson 2011-12-07 20:06:29 UTC
(In reply to comment #53)
> 
> An example at the page level would be to allow the author to specify that the
> whole page should not be translated except for some paragraphs.

Can you give an example of this? (Just trying to understand the use case here to be properly informed when addressing this.)
Comment 56 Yves Savourel 2011-12-07 23:02:32 UTC
(In reply to comment #55)
> Can you give an example of this?

Let's try: If a help system in HTML is to not be translated (e.g. because of cost), but the UI of the application it refers to is translated: you would want to leave most of each page in the source language, but translate the few strings that are direct references to the UI menus/buttons/etc.
Comment 57 Martin Dürst 2011-12-08 01:14:58 UTC
(In reply to comment #55)
> (In reply to comment #53)
> > 
> > An example at the page level would be to allow the author to specify that the
> > whole page should not be translated except for some paragraphs.
> 
> Can you give an example of this? (Just trying to understand the use case here
> to be properly informed when addressing this.)

I agree it may be more difficult to find cases where "don't translate" applies to a full page. But Yves (comment #56) gave a good example.

Even if we didn't find such a good example currently, would that mean that we should not allow the attribute (or whatever) on the <html> element? In my view, that would be a bad idea. Even if it gets used very rarely, it's much easier for users, for implementers, and for spec writers and readers,... if the attribute is just allowed everywhere. Then if a use case comes up later, nobody has to wonder "why is this allowed everywhere but not here".

Of course if you had something different in mind when asking for such specific use cases, then the above reasoning might not apply, but then it would be very helpful to know what you had in mind.
Comment 58 Yves Savourel 2011-12-08 03:38:16 UTC
> Can you give an example of this?

Another example that someone suggested:

Sometimes one starts translating a set of pages before they are completely done. Many parts can be 'draft' and not to be translated. Early on it would be much easier to declare the whole page as not translatable and mark up the sections to translate. Then, at some point after several iterations, the page default can be switched to to-translate and the few draft sections left can be marked up as not translatable.

Like Martin noted the idea is that having the attribute available at all levels allows for greater flexibility.
Comment 59 Felix Sasaki 2011-12-08 05:43:30 UTC
(In reply to comment #58)
> > Can you give an example of this?
> 
> Another example that someone suggested:
> 
> > Sometimes one starts translating a set of pages before they are completely done.
> Many parts can be 'draft' and not to be translated. Early on it would be much
> easier to declare the whole page as not translatable and mark up the sections
> to translate. Then, at some point after several iterations, the page default
> can be switched to to-translate and the few draft sections left can be marked
> up as not translatable.
> 
> Like Martin noted the idea is that having the attribute available at all levels
> allows for greater flexibility.

As for the inheritance of the attribute and for overriding of existing values, it makes sense to align that with the "lang" attribute.
Comment 60 Thorsten Trippel 2011-12-08 08:45:48 UTC
(In reply to comment #55)
> Can you give an example of this? (Just trying to understand the use case here
> to be properly informed when addressing this.)

Another example of whole pages that should not be translated is pages that are themselves translations. If you have a multilingual site, you want the translation from the original source, not from other translations. In machine translation we all had fun with circular translations, did we not? Something like English-French-Spanish-English to see if we could find some traces of the original meaning. A machine translated page would definitely set the translate attribute to "no" for the whole page.
Comment 61 Thorsten Trippel 2011-12-08 10:29:06 UTC
(In reply to comment #60)

To clarify my previous remark: If a website is translated from say English into Spanish, the resulting page should not be translated again. That is: if a machine translation application comes to the Spanish page, the translate="no" will instruct the processor not to translate this page. The application will then have to decide on what to do next, for example ask the user to ignore this instruction or just show the website as it is. In many of these cases the user will see a language selection on top of these pages (such as flags, etc.), indicating that a version of the page in the desired language is already available. Hence the instruction translate="no" complements the xml:lang="es" in the said example. 

The English original could however have xml:lang="en" and an explicit or inherited translate="yes", instructing an MT or other translation process to actually translate that page, if the value of the lang attribute is not the desired language.
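
The decision described above could be sketched as follows. The function name and the three return labels are hypothetical, and comparing only the primary language subtag is a simplifying assumption.

```python
def mt_action(page_lang, desired_lang, translate="yes"):
    """Decide what an MT application does with a whole page.

    page_lang    -- value of the page's lang / xml:lang attribute
    desired_lang -- the language the user wants to read
    translate    -- explicit or inherited translate value for the page
    """
    def primary(tag):
        # "en-US" and "en" are treated as the same language here.
        return tag.split("-")[0].lower()

    if primary(page_lang) == primary(desired_lang):
        return "show"        # already in the desired language
    if translate == "no":
        return "ask"         # let the user decide whether to override
    return "translate"
```

What "ask" turns into (a prompt, or simply showing the page as it is) is left to the application, as said above.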
Comment 62 Jirka Kosek 2011-12-08 10:33:47 UTC
(In reply to comment #61)
> (In reply to comment #60)
> 
> To clarify my previous remark: If a website is translated from say English into
> Spanish, the resulting page should not be translated again. That is: if a
> machine translation application comes to the Spanish page, the translate="no"
> will instruct the processor not to translate this page. The application will
> then have to decide on what to do next, for example ask the user to ignore this
> instruction or just show the website as it is. In many of these cases the user
> will see a language selection on top of these pages (such as flags, etc.),
> indicating that a version of the page in the desired language is already
> available. Hence the instruction translate="no" complements the xml:lang="es"
> in the said example. 

I think that this example is not a good use of translate=yes/no. In this case there is no reason to forbid translation of the Spanish page. Rather, more complex metadata should be attached, saying that this is not the primary version of the page and that the English original is available elsewhere. But this would be a different feature, and hence a separate bug should be opened.
Comment 63 Martin Dürst 2011-12-08 11:11:49 UTC
(In reply to comment #62)

> I think that this example is not good use of translate=yes/no. In this case
> there is no reason to forbid translation of Spanish page. But there should be
> more complex metadata attached saying that this is not primary version of page
> and that English original is available somewhere else. But this would be
> different feature and hence separate bug should be opened.

There is no need to submit a separate bug. If you want to associate metadata with Web pages, there are already a lot of different ways to do that. What you might want to do is look into link relations, and define a new one (pointing to the original for translations) if nothing suitable is available. But that's outside the HTML spec.
Comment 64 Ian 'Hixie' Hickson 2012-01-13 22:54:31 UTC
Ok. Since this is becoming a widely-felt problem, it seems like a more formal solution than class="notranslate" is due. Based on the proposals above, I suggest that we add a translate="" global attribute to HTML, with two valid values, "yes" and "no", with the obvious semantics. The value would inherit to descendants, much like "lang". Invalid values (translate="foo", etc) would be ignored. A missing value (translate="") would be treated like translate="yes".
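
As a rough sketch of how these rules might play out (the markup below is illustrative only, not text from the spec):

   <body>
    <p>Run <code translate="no">git commit</code> to save your work.</p>
    <div translate="no">
     <p>Inherits "no" from the div, so this is not translated.</p>
     <p translate="yes">Explicitly re-enabled, so this is translated.</p>
     <p translate="">Empty value, treated like "yes".</p>
     <p translate="foo">Invalid value, ignored, so the inherited "no" applies.</p>
    </div>
   </body>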

Any objections?
Comment 65 Yves Savourel 2012-01-16 02:32:19 UTC
> Based on the proposals above, I suggest that we add a translate=""
> global attribute to HTML, with two valid values, "yes" and "no", 
> with the obvious semantics. The value would inherit to
> descendants, much like "lang". Invalid values (translate="foo", etc)
> would be ignored. A missing value (translate="") would be
> treated like translate="yes". 
> Any objections?

How would the semantics apply to the attributes? Some are translatable but most are not. Maybe only the 'translatable' ones like 'alt' should be affected by the new attribute?
Comment 66 Ian 'Hixie' Hickson 2012-01-17 21:24:06 UTC
The translate="" attribute would apply to the element, attributes, contents, and all. What exactly needs translating is a policy issue; presumably <style> elements' contents wouldn't be translated, for example, and similarly lang="" attributes wouldn't be translated.
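
For instance, in a fragment like the following (a made-up example, not spec text), the translate="no" covers the title attribute as well as the text content of the first paragraph, while in the second paragraph a translation tool would presumably still leave src and lang alone as a matter of policy:

   <p translate="no" title="Acme product name">RoadRunner 3000</p>
   <p><img src="chart.png" alt="Sales by quarter" lang="en"></p>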
Comment 67 Lea Verou 2012-01-18 00:50:10 UTC
While we’re at it, it would be great to have a way to provide pre-translated texts for translatable elements or attributes.

Then, not only machine translation could take advantage of them to improve results, but the browser could expose a UI similar to the language pickers many multi-lingual websites have. This wouldn't just make website i18n easier for authors, but it would also make multi-lingual websites easier for web users.

Not sure if I should make a new bug for this. I'd love to hear Hixie's opinion first.
Comment 68 Martin Dürst 2012-01-18 04:23:50 UTC
(In reply to comment #67)
> While we’re at it, it would be great to have a way to provide pre-translated
> texts for translatable elements or attributes.
> 
> Then, not only machine translation could take advantage of them to improve
> results, but the browser could expose a UI similar to the language pickers many
> multi-lingual websites have. This wouldn't just make website i18n easier for
> authors, but it would also make multi-lingual websites easier for web users.
> 
> Not sure if I should make a new bug for this. I'd love to hear Hixie's opinion
> first.

Please make this a separate 'bug'. It is way more far-reaching than the translatable attribute.
Comment 69 Felix Sasaki 2012-01-18 07:36:34 UTC
(In reply to comment #66)
> The translate="" attribute would apply to the element, attributes, contents,
> and all. What exactly needs translating is a policy issue; 

The question then is how a policy (even if not defined by the HTML5 spec) would relate to the translate="" attribute. If I set via a separate policy all attributes to be non-translatable, would it be possible to override that with a translate="yes" attribute (or the other way round)? Some guidance might be useful, e.g. saying "the translate attribute takes precedence over policies related to translation of an HTML document".
Comment 70 Martin Dürst 2012-01-18 09:06:43 UTC
(In reply to comment #69)
> (In reply to comment #66)
> > The translate="" attribute would apply to the element, attributes, contents,
> > and all. What exactly needs translating is a policy issue; 
> 
> The question then is how a policy (even if not defined by the HTML5 spec) would
> relate to the translate="" attribute. If I set via a separate policy all
> attributes to be non-translatable, would it be possible to override that with a
> translate="yes" attribute (or the other way round)? Some guidance might be
> useful, e.g. saying "the translate attribute takes precedence over policies
> related to translation of an HTML document".

I think there is no need to complicate this more than necessary. We already know that we can't set the natural language (with the lang/xml:lang attribute) independently on attributes such as title and alt, and that this is sometimes a problem. We also know that stuff like CSS styles (except potentially for generated content), and therefore attributes such as lang and style (as mentioned by Ian) don't have or need natural language information.

From that, we know that it's a bad idea to have attributes with human-readable text. But we have some of these, in particular the already mentioned title and alt, and we will have to live with them. Whether (and to what extent) they will have to be translated will have to be decided in a more manual fashion, probably in the same way this is decided currently for all of the content of a Web page. The translate attribute addresses a very big need, and in an average document will cover between 80 and close to 100% of the content. For attributes, we are no worse off than before having the translatability attribute: some stuff (such as excluding the lang attribute as no-need-to-translate) can already be done automatically, and some other text will have to rely on the human judgement of a translator (or the whim of a translation engine).
Comment 71 Richard Ishida 2012-01-18 10:32:34 UTC
(In reply to comment #67)
> Not sure if I should make a new bug for this. I'd love to hear Hixie's opinion
> first.

As Martin says, please create a separate bug for this, since it is a different topic.
Comment 72 Richard Ishida 2012-01-18 10:38:56 UTC
(In reply to comment #64)

To dot the i's and cross the t's, maybe we need to say that by default, in the absence of a translate attribute on the current element or parent, the content should be translatable, ie. you don't have to add this attribute to the html tag of every document to make MT tools translate it.
Comment 73 Ian 'Hixie' Hickson 2012-01-26 19:49:21 UTC
Felix: what is the use case?

Generally speaking, it's possible to have elements and attributes have different languages; you just use multiple elements. e.g.:

   <p lang="en">Hello
    <span lang=fr title="Ceci est le mot anglais pour 'chat'.">
     <span lang=en>Cat</span>
    </span>.
   </p>

Without a compelling reason to have to put everything on one element, I think we're far better off without the added complexity.
Comment 74 Felix Sasaki 2012-01-26 20:16:33 UTC
(In reply to comment #73)
> Felix: what is the use case?

The use case is to cover existing web content, which doesn't follow the good practice you and Martin have mentioned with regards to attributes.

But I won't push for that here - we have an upcoming working group that will also define best practices for dealing with that kind of existing HTML content, see
http://www.w3.org/2011/12/mlw-lt-charter.html
and it's OK that this is not covered in the HTML5 spec itself.

So for the record, I agree with your proposal at
https://www.w3.org/Bugs/Public/show_bug.cgi?id=12417#c64

Felix

> 
> Generally speaking, it's possible to have elements and attributes have
> different languages; you just use multiple elements. e.g.:
> 
>    <p lang="en">Hello
>     <span lang=fr title="Ceci est le mot anglais pour 'chat'.">
>      <span lang=en>Cat</span>
>     </span>.
>    </p>
> 
> Without a compelling reason to have to put everything on one element, I think
> we're far better off without the added complexity.
Comment 75 Ian 'Hixie' Hickson 2012-01-26 22:15:37 UTC
Well, if authors don't do what they can do now, there's nothing we can add to the language to help them!
Comment 76 Felix Sasaki 2012-01-27 05:44:06 UTC
(In reply to comment #75)
> Well, if authors don't do what they can do now, there's nothing we can add to
> the language to help them!
When I said "dealing with existing web content" I wasn't talking about the language and a mechanism for authors, but about consumers of web pages for translation process purposes - sorry for not being clear. I think Yves' question 
https://www.w3.org/Bugs/Public/show_bug.cgi?id=12417#c65
was from that consumer perspective. Again, I agree that there is no need to complicate things for browser implementors and for Web authors - having just the "translate" attribute with the semantics you describe is already a big win.
Comment 77 Felix Sasaki 2012-01-27 06:23:06 UTC
(In reply to comment #76)
> (In reply to comment #75)
> > Well, if authors don't do what they can do now, there's nothing we can add to
> > the language to help them!
> When I said "dealing with existing web content" I wasn't talking about the
> language and a mechanism for authors, but about consumers of web pages for
> translation process purposes - sorry for not being clear. I think Yves'
> question 
> https://www.w3.org/Bugs/Public/show_bug.cgi?id=12417#c65
> was from that consumer perspective. Again, I agree that there is no need to
> complicate things for browser implementors and for Web authors - having just
> the "translate" attribute with the semantics you describe is already a big win.

This may have not been very clear, so let me rephrase:

- A is authoring a web page, not following the best practice "don't create attributes with translatable content"
- B needs to prepare the web page for subsequent translation, specifying what has to be translated or not. In terms of the industry, B are e.g. localization service providers.
- C does the actual translation - as a human translator, a machine translation system, or both.

The "translate" attribute is to be used by A to make life easier for B and C. The mechanism I (and again, I think Yves) talked about is for B.
Comment 78 Yves Savourel 2012-01-27 11:50:29 UTC
(In reply to comment #77)
> (In reply to comment #76)
> > (In reply to comment #75)
>
> - B needs to prepare the web page for subsequent translation, specifying what
> has to be translated or not. In terms of the industry, B are e.g. localization
> service providers.
> the "translate" attribute is to be used by A and make the life of B and C
> easier. The mechanism I (and again, I think Yves) talked about is for B.

Yes.

And I'm fine with Hixie's proposal in https://www.w3.org/Bugs/Public/show_bug.cgi?id=12417#c64. I just wanted to know how translatable attributes were affected.
Comment 79 Ian 'Hixie' Hickson 2012-02-07 00:22:15 UTC
EDITOR'S RESPONSE: This is an Editor's Response to your comment. If you are satisfied with this response, please change the state of this bug to CLOSED. If you have additional information and would like the editor to reconsider, please reopen this bug. If you would like to escalate the issue to the full HTML Working Group, please add the TrackerRequest keyword to this bug, and suggest title and text for the tracker issue; or you may create a tracker issue yourself, if you are able to do so. For more details, see this document:
   http://dev.w3.org/html5/decision-policy/decision-policy.html

Status: Accepted
Change Description: see diff given below
Rationale: Done.
Comment 80 contributor 2012-02-07 00:23:08 UTC
Checked in as WHATWG revision r6971.
Check-in comment: Introduce a new global attribute for localisers to tell whether or not content should be translated.
http://html5.org/tools/web-apps-tracker?from=6970&to=6971
Comment 81 Jirka Kosek 2012-02-07 08:45:01 UTC
(In reply to comment #80)
> Checked in as WHATWG revision r6971.
> Check-in comment: Introduce a new global attribute for localisers to tell
> whether or not content should be translated.
> http://html5.org/tools/web-apps-tracker?from=6970&to=6971

Hi Ian, thanks for the fix.

There is probably a < missing in your commit source before the code start tag near "Otherwise, the element's ****code title="attr-translate">translate attribute..."

Also I think it would be more reasonable to replace:

"When an element is in the translate-enabled state, the element's attribute values and the values of its Text node children are to be translated when the page is localized."

with

"When an element is in the translate-enabled state, the element's attribute values *of type Text* and the values of its Text node children are to be translated when the page is localized."

The rationale is that it doesn't make much sense to label attributes like class or id as translatable. Of course some attributes of Text type are also not translatable by nature, but at least this definition is closer to reality.
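
As an illustration (hypothetical markup; it assumes title counts as a Text-type attribute, as it does in the HTML 4 DTD), on an element in the translate-enabled state only title and the text content would be candidates for translation under the revised wording, while id, class and href would not:

   <a id="nav-home" class="button primary" href="/home" title="Back to the home page">Home</a>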
Comment 82 Andy Mabbett 2012-02-09 12:49:02 UTC
Observation:

Some comment has an implicit translate="no", which may be implied from the presence of microformat/schema.org/RDFa markup; for example, the names of people, street addresses, species names, etc.

Regarding the latter, I wrote [1], in 2008:

<blockquote>Another benefit [of the 'species' microformat] would be that user-agents could be instructed to treat text marked up in this way as not being in the base language of the document or element in which they occur - pronunciation should be as for Latin, they should not be translated (e.g. where a component word happens also to be a valid word in that language, such as the genus <i><strong>Colon</strong><i/>, <i><strong>Circus</strong> cyaneus</i>, <i>Hesperia <strong>comma</strong></i>, or anything with major or minor on an English-language page) and should not be spell-checked, or be spell-checked with a specialised dictionary (a need [I] identified in this [2] 2003 ietf-languages discussion of language values for taxonomic names).</blockquote>

[1] http://microformats.org/wiki/index.php?title=species&oldid=24984#Proposal]

[2] http://www.alvestrand.no/pipermail/ietf-languages/2003-February/000574.html
Comment 83 Andy Mabbett 2012-02-09 12:52:31 UTC
Apologies for the non-rendered HTML in that, and for "Some comment", which should read "Some content".

I wish Bugzilla allowed contributors to edit their comments...
Comment 84 Addison Phillips 2014-03-06 23:19:58 UTC
I18N is satisfied by these changes. Thank you.