ISSUE-295: Remove code point restrictions from IMSC

Constrained code pages

State:
CLOSED
Product:
TTML IMSC 1.0
Raised by:
Nigel Megitt
Opened on:
2013-11-07
Description:
Concerns have been expressed within the EBU XMLSubtitles group regarding the restriction of xml:lang to a single value at the top level of IMSC documents, and the restriction of supported Unicode code points dependent on the value of xml:lang, on the basis that this usage is contrary to the accepted use of the Unicode standard.

The IMSC document [1] states in Appendix A. Recommended Unicode Code Points per Language that “The following table specifies the [UNICODE] code points that SHOULD be used in a document's textual content if xml:lang is present (Primary language subtag is as defined in IETF RFC 5646).”

[1] https://dvcs.w3.org/hg/ttml/raw-file/tip/ttml-ww-profiles/ttml-ww-profiles.html#recommended-unicode-code-points-per-language

IMSC only allows a single xml:lang value, so all content of an IMSC-compliant document is labelled as belonging to a single language. Further, the implication of the Appendix A heading is that only the code points identified for a specific xml:lang value should be used in a document with that xml:lang value. This effectively recommends that no ‘foreign language’ phrase can appear in an IMSC document. It could be argued that a foreign phrase would not appear in a subtitle document, since everything would be translated. This is, however, incorrect. For example, subtitles for a travel program are quite likely to contain foreign phrases. In addition, many languages use ‘loan words’ and phrases which, when correctly represented, should retain their proper accents and presentation. More importantly, in many countries captions are expected to be verbatim. The scope of the IMSC document expressly includes captions, so it is difficult to understand how verbatim speech might be conveyed in an IMSC document if the speaker being captioned chooses to use foreign words or phrases.
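To make the concern concrete, here is a minimal Python sketch of what a per-language code point check would do to a verbatim caption. The allow-list below is a deliberately simple, hypothetical one (printable Basic Latin for "en"); it is not the table from IMSC Appendix A, and the function name is invented for illustration.

```python
# Hypothetical allow-list: printable Basic Latin only (U+0020..U+007E).
# This is NOT the IMSC Appendix A table; it only illustrates the mechanism.
HYPOTHETICAL_ALLOWED = {"en": set(range(0x20, 0x7F))}


def disallowed_code_points(text, lang, allowed=HYPOTHETICAL_ALLOWED):
    """Return the distinct characters in text outside the allow-list for lang."""
    permitted = allowed[lang]
    return sorted({ch for ch in text if ord(ch) not in permitted})


# A verbatim English caption containing a loan word and a foreign phrase.
caption = "She ordered a caf\u00e9 au lait and said \u00a1gracias!"
flagged = disallowed_code_points(caption, "en")

# Show each flagged character with its code point.
print([f"U+{ord(ch):04X} {ch}" for ch in flagged])
```

Under this assumed allow-list, the accented letter and the inverted exclamation mark are both flagged, even though the caption is an accurate transcription of what was said.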

Of course, if the xml:lang attribute were permitted on elements within the IMSC document, with different values on different elements (i.e. correctly identifying languages), then the above recommendations would have less impact.
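As an illustration of per-element language identification, the following Python sketch parses a small TTML-like fragment in which a span carries its own xml:lang, and resolves the effective language of each text run by inheritance. The fragment and the helper name text_runs are made up for this example; the sketch uses only the standard library and does not validate against TTML or IMSC.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the predefined xml: prefix under this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Illustrative fragment: English document with one French phrase.
DOC = """<tt xmlns="http://www.w3.org/ns/ttml" xml:lang="en">
  <body><div>
    <p>The waiter said <span xml:lang="fr">bon app\u00e9tit</span> and left.</p>
  </div></body>
</tt>"""


def text_runs(element, inherited_lang=None):
    """Yield (language, text) pairs, resolving xml:lang by inheritance."""
    lang = element.get(XML_LANG, inherited_lang)
    if element.text and element.text.strip():
        yield lang, element.text.strip()
    for child in element:
        yield from text_runs(child, lang)
        # Text following a child belongs to the parent, so it uses the
        # parent's language rather than the child's.
        if child.tail and child.tail.strip():
            yield lang, child.tail.strip()


runs = list(text_runs(ET.fromstring(DOC)))
for lang, text in runs:
    print(lang, repr(text))
```

A consumer that knows the language of each run could, for example, select an appropriate font or speech synthesis voice per run instead of assuming one language for the whole document.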

Removing the xml:lang restriction would also support the use of IMSC subtitles as a source of 'spoken subtitles' in which the distributed subtitles are rendered as speech, which may require the use of language-dependent speech synthesis models.

It is apparent that the intention of the IMSC document is to simplify the implementation requirements, essentially permitting and encouraging implementations to support only perceived specific regional requirements. However, this perspective is fundamentally flawed as it does not accept the multi-cultural nature of the world. For example, in the USA, the number of speakers of each main language, according to the American Community Survey 2009 (and endorsed by the United States Census Bureau), is:

English - 229 million
Spanish - 35 million
Chinese languages - 2.6 million + (mostly Cantonese speakers, with a growing group of Mandarin speakers)
Tagalog - 1.5 million + (Most Filipinos may also know other Philippine dialects)
French - 1.3 million
Vietnamese - 1.3 million
German - 1.1 million (High German) + German dialects
Korean - 1.0 million
Russian - 881,000
Arabic - 845,000
Italian - 754,000
Portuguese - 731,000
French Creole - 659,000
Polish - 594,000
Hindi - 561,000
… followed by many more languages with fewer than 500,000 speakers per language

Other countries and regions have even more diverse demographics (e.g. Europe, SE Asia).

EBU XML Subtitles group does not consider it appropriate for a W3C specification to make a recommendation that is so limited in acceptance of Internationalisation principles or to encourage the development of client implementations that are unable to support the real broader internationalisation needs of the viewing audience.

Further, Table B “Typical Practice for Subtitles per Region (Informative)” in the IMSC document [1] perpetuates this ‘nationalistic and parochial’ viewpoint by implying that only certain subtitle languages are typically used in certain regions. This is clearly also flawed. For example I can quite readily purchase a single (region 2) DVD in the UK that has the following subtitle languages available: English, French, Italian, Spanish, Danish, Dutch, Finnish, Icelandic, Norwegian, Portuguese, Swedish and Arabic.

A final concern is that whilst the tables included in Appendix A of the IMSC specification may seem to have some ‘single point of reference’ utility, there is the very real potential for a loss of synchronisation between IMSC and the real owner of this intellectual space: the Unicode Consortium. The Unicode characters used by specific languages are already clearly identified by the Unicode Consortium in the CLDR project. This should be the reference for such information, should it be needed.

We propose that the IMSC document should be amended to remove the implications that limited character sets in client implementations are acceptable or recommended, that the tables should be removed, and that due references should be made to the appropriate authorities (Unicode CLDR).
Related Action Items:
Related emails:
  1. {minutes} TTWG Meeting 19/6/2014 (from nigel.megitt@bbc.co.uk on 2014-06-19)
  2. RE: {agenda} TTWG Meeting 19/6/2014 (from mdolan@newtbt.com on 2014-06-18)
  3. {agenda} TTWG Meeting 19/6/2014 (from nigel.megitt@bbc.co.uk on 2014-06-18)
  4. RE: {agenda} TTWG Meeting 12/6/2014 (from mdolan@newtbt.com on 2014-06-12)
  5. Re: {agenda} TTWG Meeting 12/6/2014 (from pal@sandflow.com on 2014-06-11)
  6. Re: {agenda} TTWG Meeting 12/6/2014 (from nigel.megitt@bbc.co.uk on 2014-06-11)
  7. Re: {agenda} TTWG Meeting 12/6/2014 (from pal@sandflow.com on 2014-06-11)
  8. Re: {agenda} TTWG Meeting 12/6/2014 (from nigel.megitt@bbc.co.uk on 2014-06-11)
  9. {agenda} TTWG Meeting 12/6/2014 (from nigel.megitt@bbc.co.uk on 2014-06-11)
  10. {minutes} TTWG Meeting 27/2/2014 (from nigel.megitt@bbc.co.uk on 2014-02-27)
  11. {agenda} TTWG Meeting 27/2/2014 (from nigel.megitt@bbc.co.uk on 2014-02-26)
  12. {minutes} TTWG Meeting 20/2/2014 (from nigel.megitt@bbc.co.uk on 2014-02-20)
  13. {agenda} TTWG Meeting 20/2/2014 (from nigel.megitt@bbc.co.uk on 2014-02-19)
  14. RE: {agenda} TTWG Meeting 13/2/2014 (from mdolan@newtbt.com on 2014-02-13)
  15. Re: {agenda} TTWG Meeting 13/2/2014 (from glenn@skynav.com on 2014-02-13)
  16. {agenda} TTWG Meeting 13/2/2014 (from nigel.megitt@bbc.co.uk on 2014-02-12)
  17. Re: {agenda} TTWG Meeting 6/2/2014 (from silviapfeiffer1@gmail.com on 2014-02-06)
  18. {agenda} TTWG Meeting 6/2/2014 (from nigel.megitt@bbc.co.uk on 2014-02-05)
  19. {agenda} TTWG Meeting 30/01/2014 (from nigel.megitt@bbc.co.uk on 2014-01-30)
  20. {minutes} 16/1/14 TTWG meeting (from nigel.megitt@bbc.co.uk on 2014-01-16)
  21. {agenda} 16/1/14 TTWG meeting (from nigel.megitt@bbc.co.uk on 2014-01-16)
  22. {minutes} 9/1/14 TTWG Meeting (from nigel.megitt@bbc.co.uk on 2014-01-09)
  23. TTWG Agenda for 9/1/14 (from nigel.megitt@bbc.co.uk on 2014-01-08)
  24. Minutes for 12/12/13 (from nigel.megitt@bbc.co.uk on 2013-12-12)
  25. Revised IMSC ED (from pal@sandflow.com on 2013-12-11)
  26. TTWG Agenda for 12/12/13 (from nigel.megitt@bbc.co.uk on 2013-12-11)
  27. Re: TTWG Agenda for 5/12/13 (from nigel.megitt@bbc.co.uk on 2013-12-11)
  28. Re: TTWG Agenda for 5/12/13 (from glenn@skynav.com on 2013-12-11)
  29. RE: TTWG Agenda for 5/12/13 (from mdolan@newtbt.com on 2013-12-10)
  30. TTWG Minutes for 5/12/13 (from nigel.megitt@bbc.co.uk on 2013-12-05)
  31. TTWG Agenda for 5/12/13 (from nigel.megitt@bbc.co.uk on 2013-12-04)
  32. TTML Minutes for 15/11/13 (from nigel.megitt@bbc.co.uk on 2013-11-21)
  33. TTML Minutes for 11/11/13 (from nigel.megitt@bbc.co.uk on 2013-11-21)
  34. Re: TTML Agenda for 21/11/13 (from cyril.concolato@telecom-paristech.fr on 2013-11-21)
  35. Re: TTML Agenda for 21/11/13 (from glenn@skynav.com on 2013-11-21)
  36. Re: TTML Agenda for 21/11/13 (from tmichel@w3.org on 2013-11-21)
  37. TTML Agenda for 21/11/13 (from nigel.megitt@bbc.co.uk on 2013-11-20)
  38. ISSUE-296 (xml:lang constraints in IMSC): Remove xml:lang placement restrictions from IMSC [IMSC] (from sysbot+tracker@w3.org on 2013-11-07)
  39. ISSUE-295 (Multiple languages): Allow xml:lang to be set anywhere needed within IMSC documents [IMSC] (from sysbot+tracker@w3.org on 2013-11-07)

Related notes:

This is related to ISSUE-296.

Nigel Megitt, 7 Nov 2013, 13:52:51

[nigel]: action on glenn and pierre to consult richard ishida - is there a baseline to reference, or an external source?

11 Nov 2013, 09:41:10

[nigel]: see action-244 and action-243.

15 Nov 2013, 08:56:32
