This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 28255 - [webvtt] no provision for indicating overall content language(s) [I18N-ISSUE-420]
Status: RESOLVED MOVED
Alias: None
Product: TextTracks CG
Classification: Unclassified
Component: WebVTT
Version: unspecified
Hardware: PC All
Importance: P2 normal
Target Milestone: ---
Assignee: This bug has no owner yet - up for the taking
QA Contact: Web Media Text Tracks CG
Whiteboard: widereview, see comment 16
Reported: 2015-03-22 00:01 UTC by Silvia Pfeiffer
Modified: 2017-08-09 12:26 UTC

See Also:
silviapfeiffer1: needinfo? (ishida)
silviapfeiffer1: needinfo? (addison)


Description Silvia Pfeiffer 2015-03-22 00:01:16 UTC
Feedback by Addison Phillips from W3C I18N group:
http://lists.w3.org/Archives/Public/public-tt/2015Mar/0054.html

I18N comment:  https://www.w3.org/International/track/issues/420

This is a comment on:
http://www.w3.org/TR/webvtt1

We didn't find a means for indicating the natural language of the content as a whole. There is a cue language span, for spanning runs of text, but no way to indicate the language of the file as a whole (that is, the language outside a spanned section).

Lack of language tags means that language-specific processing, such as language-specific font selection, may not be available for the text, leading to substandard presentation.
Comment 1 Simon Pieters 2015-03-23 07:57:32 UTC
For out-of-band tracks in HTML there is <track srclang>, FWIW.
Comment 2 Philip Jägenstedt 2015-03-23 10:00:27 UTC
Yes, but it will not affect rendering. We could of course make it behave as if all the cues of a track were in a div with a corresponding lang attribute.
Comment 3 Silvia Pfeiffer 2015-06-08 11:31:52 UTC
(In reply to Philip Jägenstedt from comment #2)
> Yes, but it will not affect rendering. We could of course make it behave as
> if all the cues of a track were in a div with a corresponding lang attribute.

Then I suggest we also have a DEFAULTS section that sets the language and the @srclang would overwrite that.
Comment 4 Philip Jägenstedt 2015-06-08 12:34:56 UTC
We don't have a cue-level language setting, so we'd have to add that before DEFAULTS could save the day.
Comment 5 Silvia Pfeiffer 2015-06-08 12:51:44 UTC
(In reply to Philip Jägenstedt from comment #4)
> We don't have a cue-level language setting, so we'd have to add that before
> DEFAULTS could save the day.

We have a <lang> tag.
Comment 6 Philip Jägenstedt 2015-06-09 10:30:14 UTC
(In reply to Silvia Pfeiffer from comment #5)
> (In reply to Philip Jägenstedt from comment #4)
> > We don't have a cue-level language setting, so we'd have to add that before
> > DEFAULTS could save the day.
> 
> We have a <lang> tag.

Yes, but that isn't a cue-level setting. Having "DEFAULTS lang:en" map to <lang> in some way seems odd, how would that be written?
Comment 7 Silvia Pfeiffer 2015-06-09 19:47:06 UTC
(In reply to Philip Jägenstedt from comment #6)
> (In reply to Silvia Pfeiffer from comment #5)
> > (In reply to Philip Jägenstedt from comment #4)
> > > We don't have a cue-level language setting, so we'd have to add that before
> > > DEFAULTS could save the day.
> > 
> > We have a <lang> tag.
> 
> Yes, but that isn't a cue-level setting. Having "DEFAULTS lang:en" map to
> <lang> in some way seems odd, how would that be written?

I misunderstood: I thought you referred to author ability to overwrite a default language in a cue.

What is the problem with DEFAULTS lang and mapping to cue-level settings? In my mind, providing a DEFAULTS lang sets the language for all the cues. Currently, that language is defined by <track srclang>, so there already should be a language setting on cues.
Comment 8 Philip Jägenstedt 2015-06-10 17:27:33 UTC
(In reply to Silvia Pfeiffer from comment #7)
> (In reply to Philip Jägenstedt from comment #6)
> > (In reply to Silvia Pfeiffer from comment #5)
> > > (In reply to Philip Jägenstedt from comment #4)
> > > > We don't have a cue-level language setting, so we'd have to add that before
> > > > DEFAULTS could save the day.
> > > 
> > > We have a <lang> tag.
> > 
> > Yes, but that isn't a cue-level setting. Having "DEFAULTS lang:en" map to
> > <lang> in some way seems odd, how would that be written?
> 
> I misunderstood: I thought you referred to author ability to overwrite a
> default language in a cue.
> 
> What is the problem with DEFAULTS lang and mapping to cue-level settings? In
> my mind, providing a DEFAULTS lang sets the language for all the cues.
> Currently, that language is defined by <track srclang>, so there already
> should be a language setting on cues.

The problem is that there isn't a cue-level setting, but we could add one.
Comment 9 Silvia Pfeiffer 2015-06-27 13:51:19 UTC
(In reply to Philip Jägenstedt from comment #8)
> 
> The problem is that there isn't a cue-level setting, but we could add one.

Actually, I think we don't need it. WebVTT internal nodes already have an applicable language and the root node is such a node. So, while we don't have a way for authors to set it, the concept of an applicable language of a cue already exists. We now just need to define a file-wide setting, such as

DEFAULTS
Language: en-US

and we can propagate that as the applicable language on the root node of trees of WebVTT Node objects.
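The propagation idea can be sketched roughly as follows, in Python. This is only an illustration under assumptions: the `Node` class, its fields, and `applicable_language` are hypothetical names, not the data model from the WebVTT spec, and the file-wide default is assumed to have been parsed already.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a WebVTT node object (not the spec's model)."""
    name: str
    lang: Optional[str] = None  # set only on the root or on <lang> spans
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)
    children: List["Node"] = field(default_factory=list)

    def add(self, child: "Node") -> "Node":
        # Attach a child and remember its parent so lookups can walk upwards.
        child.parent = self
        self.children.append(child)
        return child

    def applicable_language(self) -> str:
        # Walk up until some ancestor declares a language; the root carries
        # the file-wide default, so it acts as the final fallback.
        node: Optional["Node"] = self
        while node is not None:
            if node.lang:
                return node.lang
            node = node.parent
        return ""  # no language known anywhere in the tree


# Assumed file-wide default, e.g. taken from a "Language: en-US" setting.
root = Node("root", lang="en-US")
bold = root.add(Node("b"))
span = root.add(Node("lang", lang="fr"))
inner = span.add(Node("i"))
```

Here `bold` has no language of its own and picks up the root default, while `inner` picks up the language of the enclosing `<lang>` span.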
Comment 10 Silvia Pfeiffer 2015-09-30 23:27:50 UTC
After discussion at FOMS, we've arrived at the following:
WebVTT files are typically used through the <track> element, through DASH or HLS, or in-band in MP4 or WebM files.

<track> has @srclang to provide the language.

DASH and HLS have manifest files to provide the language.

MP4 ISO format and WebM have a header to specify the language.

In essence: the language of the file is something that is specified externally to the WebVTT file.

Language-specific font selection is supported through the CSS :lang() selector, which can be applied to all cues in a WebVTT file through the use of ::cue, i.e. ::cue:lang().
Comment 11 nigelmegitt 2015-10-01 10:17:39 UTC
It may be true that language can be specified externally, but that's not especially helpful for asset management. 

I would argue also that the spec is incomplete in that it includes the WebVTT Language Objects to indicate when the language is different from the surrounding text language but no mechanism to indicate what the outer level language is. The obvious fix for this would be to add a language header with file scope, e.g. a reserved keyword for a metadata header, like "lang" so you might have a file that begins:

WEBVTT

lang: en-GB

[rest of file]
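
To show how little a reader would need to do to recover such a header, here is a minimal Python sketch. It assumes the hypothetical "lang" keyword proposed above, which is not part of WebVTT, and the function name `read_header_language` is made up for illustration. As a simplification it scans everything before the first cue timing line (the first line containing "-->") rather than implementing the real header parsing rules.

```python
from typing import Optional


def read_header_language(vtt_text: str) -> Optional[str]:
    """Return the value of a hypothetical file-scope 'lang:' header, if any."""
    lines = vtt_text.lstrip("\ufeff").splitlines()  # tolerate a leading BOM
    if not lines or not lines[0].startswith("WEBVTT"):
        raise ValueError("not a WebVTT file: missing WEBVTT signature")
    for line in lines[1:]:
        if "-->" in line:
            break  # first cue timing line reached; stop looking for headers
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "lang":
            return value.strip() or None
    return None


sample = "WEBVTT\n\nlang: en-GB\n\n00:00.000 --> 00:02.000\nHello\n"
print(read_header_language(sample))
```

A file without the header simply yields None, so a caller could still fall back to whatever the environment provides.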
Comment 12 Silvia Pfeiffer 2015-10-01 23:54:30 UTC
(In reply to nigelmegitt from comment #11)
> It may be true that language can be specified externally, but that's not
> especially helpful for asset management. 

Particularly in asset management, you have a database which contains this information and is much more easily searchable than the content files themselves. I don't buy this argument.

> I would argue also that the spec is incomplete in that it includes the
> WebVTT Language Objects to indicate when the language is different from the
> surrounding text language but no mechanism to indicate what the outer level
> language is. The obvious fix for this would be to add a language header with
> file scope, e.g. a reserved keyword for a metadata header, like "lang" so
> you might have a file that begins:
> 
> WEBVTT
> 
> lang: en-GB
> 
> [rest of file]

There didn't seem to be enough of a use case from practitioners to make it worth the effort at this point in time. It seems to be an academic request without a real need.
Comment 13 nigelmegitt 2015-10-02 09:00:51 UTC
Asset management databases are like search indexes - they're very useful but shouldn't be relied on as a permanent record of the content. The general experience with asset management is there should always be an alternative mechanism to recover asset information in case the database cannot be relied upon.

From http://www.w3.org/TR/html5/embedded-content-0.html#the-track-element :

> The srclang attribute gives the language of the text track data. The value must be a valid BCP 47 language tag. This attribute must be present if the element's kind attribute is in the subtitles state. [BCP47]

> If the element has a srclang attribute whose value is not the empty string, then the element's track language is the value of the attribute. Otherwise, the element has no track language.

So even if there is no asset management system, anyone wanting to publish a WebVTT file in a track with @kind="subtitles" is forced to derive the main language from some external knowledge. Since the language in the case of @kind="subtitles" is expected to be different from the main media language there may be multiple WebVTT files to publish. Having no standard mechanism to derive the language of each file is a huge hole in the spec in my opinion. This is a practical point not an academic one.

Additionally, I've proposed a very simple solution that will not break existing implementations, so on that basis I'm reopening this issue.
Comment 14 David Singer 2015-10-02 17:38:14 UTC
It is much easier to manage self-describing assets, i.e. if the file has an internal lang header. It is then also easier to build, e.g., the web page. Carrying side info is a pain. People will fudge it, e.g. make up a header or rely on naming conventions, but...
Comment 15 Martin Dürst 2015-10-03 05:43:46 UTC
I agree with Nigel and David. For HTML, we have both external and internal mechanisms for indicating language (as well as charset, i.e. character encoding). Nobody has ever questioned this, and its usefulness has been widely validated in practice. I cannot see any reason that WebVTT would be different here.

It is clear that for some usage scenarios, e.g. asset/content management systems, external information will work well. But there are other usages where internal information works much better (or is essentially the only thing that works).

The spec shouldn't dictate how the data is managed, stored, and so on, but should be amenable to various different needs. The request isn't in any way academic; in fact one use case where it's most helpful is very small projects. If Web technology were only supporting the use case of big users who have the resources to set up an asset management system, then we would be doing something wrong.
Comment 16 Simon Pieters 2015-10-04 20:36:04 UTC
It seems to me there are two bugs here:

1. It is not specified that an external language definition actually affects the language of the cue text.

2. Missing internal file-wide language declaration.
Comment 17 Addison Phillips 2015-10-04 21:21:40 UTC
I agree with Martin's comment #15.

(In reply to Simon Pieters from comment #16)
> It seems to me there are two bugs here:
> 
> 1. It is not specified that an external language definition actually affects
> the language of the cue text.

Affects is probably the wrong word. Indicates is probably a better choice. 

Probably the best you can do here is allow an external language definition to indicate the language of text (including the cue text) in the file, in the absence of local-to-the-file information.
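
One possible reading of that precedence, as a rough Python sketch: the most local information wins, and the external definition is used only when nothing in the file says otherwise. The function name and parameters are hypothetical, and `file_lang` in particular assumes an in-file declaration that WebVTT does not currently define.

```python
from typing import Optional


def effective_language(span_lang: Optional[str],
                       file_lang: Optional[str],
                       external_lang: Optional[str]) -> str:
    """Pick the language that applies to a run of cue text.

    span_lang:     language from an enclosing cue language span, if any
    file_lang:     hypothetical file-wide declaration inside the file
    external_lang: language supplied by the environment (e.g. <track srclang>)
    """
    for candidate in (span_lang, file_lang, external_lang):
        if candidate:  # skip None and empty strings
            return candidate
    return ""  # language unknown


print(effective_language(None, None, "en"))
```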

> 
> 2. Missing internal file-wide language declaration.

That would be the initial intention of this issue. 

In particular, WebVTT currently allows only for span-based override of some unspecified base language. That requires every bit of text to be wrapped in a cue language span if the language information is important to the processing or rendering. This is not very efficient (especially since most files will likely contain only a single language). 

For example, a cue file containing Chinese might need to be tagged with a language tag of "zh-Hans" or "zh-CN" so that a Simplified Chinese font will be applied by the font fallback system rather than defaulting to an inappropriate Japanese font on certain systems (producing a ransom note effect).

A couple of nits from reading your current editor's copy:

In http://dev.w3.org/html5/webvtt/#webvtt-cue-text-parsing-rules you should say "language tag", not "language code".

In http://dev.w3.org/html5/webvtt/#webvtt-cue-text you still say the cue language span "must be a valid BCP 47 language tag", where 'valid' has a particular meaning in BCP 47 and it is not clear if you intend that specific meaning. You should: (a) omit the word valid; (b) clarify that you mean BCP 47 valid; or (c) use the word 'well-formed' instead. (Valid in BCP 47 means that the implementation checks to see that all of the subtags are registered, at least as of an implementation specific date)
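
To illustrate the distinction, here is a deliberately simplified Python sketch. The pattern covers only the language[-script][-region][-variant] shape and ignores extlang, extensions, private use and grandfathered tags, so it is not the full BCP 47 grammar. The registry sample is a tiny made-up subset for illustration, not the IANA Language Subtag Registry, and both function names are hypothetical.

```python
import re

# Simplified shape check only; NOT the complete BCP 47 ABNF.
_WELL_FORMED = re.compile(
    r"^[A-Za-z]{2,3}"                                 # primary language subtag
    r"(-[A-Za-z]{4})?"                                # optional script
    r"(-(?:[A-Za-z]{2}|[0-9]{3}))?"                   # optional region
    r"(-(?:[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}))*$"  # optional variants
)

# Tiny illustrative sample of registered subtags (assumption, not the registry).
_SAMPLE_REGISTRY = {
    "language": {"en", "zh", "ja", "fr"},
    "script": {"hans", "hant", "latn"},
    "region": {"us", "gb", "cn"},
}


def is_well_formed(tag: str) -> bool:
    """True if the tag has the right shape, whatever the subtags mean."""
    return bool(_WELL_FORMED.match(tag))


def is_valid(tag: str) -> bool:
    """True if the tag is well-formed and every subtag is in the sample registry."""
    if not is_well_formed(tag):
        return False
    parts = tag.lower().split("-")
    if parts[0] not in _SAMPLE_REGISTRY["language"]:
        return False
    for part in parts[1:]:
        if len(part) == 4 and part.isalpha():
            kind = "script"
        elif (len(part) == 2 and part.isalpha()) or (len(part) == 3 and part.isdigit()):
            kind = "region"
        else:
            return False  # variants are not in the sample, so treat as unregistered
        if part not in _SAMPLE_REGISTRY[kind]:
            return False
    return True
```

With this sketch, a tag such as "xx-Zzzz" passes the shape check but fails the registry check, which is the gap between 'well-formed' and 'valid' described above.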
Comment 18 Simon Pieters 2015-10-05 14:58:09 UTC
(In reply to Addison Phillips from comment #17)
> A couple of nits from reading your current editor's copy:
> 
> In http://dev.w3.org/html5/webvtt/#webvtt-cue-text-parsing-rules you should
> say "language tag", not "language code".
> 
> In http://dev.w3.org/html5/webvtt/#webvtt-cue-text you still say the cue
> language span "must be a valid BCP 47 language tag", where 'valid' has a
> particular meaning in BCP 47 and it is not clear if you intend that specific
> meaning. You should: (a) omit the word valid; (b) clarify that you mean BCP
> 47 valid; or (c) use the word 'well-formed' instead. (Valid in BCP 47 means
> that the implementation checks to see that all of the subtags are
> registered, at least as of an implementation specific date)

Thanks, filed
https://github.com/w3c/webvtt/issues/216
https://github.com/w3c/webvtt/issues/217
Comment 19 nigelmegitt 2015-10-30 13:52:22 UTC
(In reply to Addison Phillips from comment #17)
> I agree with Martin's comment #15.
> 
> (In reply to Simon Pieters from comment #16)
> > It seems to me there are two bugs here:
> > 
> > 1. It is not specified that an external language definition actually affects
> > the language of the cue text.
> 
> Affects is probably the wrong word. Indicates is probably a better choice. 
> 
> Probably the best you can do here is allow an external language definition
> to indicate the language of text (including the cue text) in the file, in
> the absence of local-to-the-file information.

Actually 'affects' is correct since AIUI the text rendering should take the language into account: the same set of code points can be differently mapped to glyphs (including variant glyphs) and laid out depending on which language applies. Another reason why indicating language correctly is not academic.
Comment 20 Addison Phillips 2015-10-30 16:22:12 UTC
(In reply to nigelmegitt from comment #19)
> (In reply to Addison Phillips from comment #17)
> > I agree with Martin's comment #15.
> > 
> > (In reply to Simon Pieters from comment #16)
> > > It seems to me there are two bugs here:
> > > 
> > > 1. It is not specified that an external language definition actually affects
> > > the language of the cue text.
> > 
> > Affects is probably the wrong word. Indicates is probably a better choice. 
> > 
> > Probably the best you can do here is allow an external language definition
> > to indicate the language of text (including the cue text) in the file, in
> > the absence of local-to-the-file information.
> 
> Actually 'affects' is correct since AIUI the text rendering should take the
> language into account: the same set of code points can be differently mapped
> to glyphs (including variant glyphs) and laid out depending on which
> language applies. Another reason why indicating language correctly is not
> academic.

No, 'affects that language' suggests that it changes what the language of the text actually is. The use case you cite is precisely why I filed this issue in the first place. What you (and I) mean is 'affects the processing of the cue'.
Comment 21 Simon Pieters 2015-11-13 12:12:42 UTC
(In reply to Simon Pieters from comment #16)
> It seems to me there are two bugs here:
> 
> 1. It is not specified that an external language definition actually affects
> the language of the cue text.

https://github.com/w3c/webvtt/pull/257
Comment 22 Simon Pieters 2015-11-13 12:40:49 UTC
(In reply to Simon Pieters from comment #16)
> 2. Missing internal file-wide language declaration.

https://github.com/w3c/webvtt/issues/259
Comment 23 Simon Pieters 2015-11-13 13:13:24 UTC
HTML part of (1) is

https://github.com/whatwg/html/pull/338
Comment 24 David Singer 2016-10-11 17:44:51 UTC
Fixed: we now inherit the overall language from the environment, and this is propagated into the algorithm (see "fallback language").

Though I like self-describing files, internal tagging would open the possibility of conflict, since the environment can also indicate the overall language. We did not introduce a self-describing language mechanism for VTT files.
Comment 25 David Singer 2016-10-11 17:45:29 UTC
see also https://github.com/w3c/webvtt/issues/259
Comment 26 Silvia Pfeiffer 2017-08-09 12:26:49 UTC
Addison,

Would you mind checking if your bug is resolved?

We've landed the following fixes:
* https://github.com/w3c/webvtt/issues/216 : Editorial: language code -> language tag
* https://github.com/w3c/webvtt/pull/257 : Allow external language information to apply
* https://github.com/w3c/webvtt/issues/259 : Missing internal file-wide language declaration
* https://github.com/whatwg/html/pull/338 : Forward the track language to the track rendering rules

Hmm, just looking at this, it actually seems to me that https://github.com/w3c/webvtt/issues/259 was your core issue and it's not resolved yet?