This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Bug 18571 - Note about Vorbis appears incorrect
Summary: Note about Vorbis appears incorrect
Status: RESOLVED INVALID
Alias: None
Product: HTML WG
Classification: Unclassified
Component: Media Source Extensions
Version: unspecified
Hardware: PC Linux
Importance: P2 normal
Target Milestone: ---
Assignee: Adrian Bateman [MSFT]
QA Contact: HTML WG Bugzilla archive list
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-08-15 08:53 UTC by Philip Jägenstedt
Modified: 2012-08-21 01:53 UTC
CC List: 4 users

See Also:

Description Philip Jägenstedt 2012-08-15 08:53:44 UTC
http://dvcs.w3.org/hg/html-media/raw-file/tip/media-source/media-source.html#source-buffer-media-segment-constraints

"Gaps between media segments that are smaller than the audio frame size are allowed and should be rendered as silence. Such gaps should not be reflected by buffered.
Note: This is intended to simplify switching between audio streams where the frame boundaries don't always line up across encodings (e.g. Vorbis)."

To quote http://xiph.org/vorbis/doc/vorbisfile/crosslap.html

"Vorbis introduces no extra samples at the beginning or end of a stream, nor does it remove any samples."

Given this, it's not clear why gaps are allowed.
Comment 1 Aaron Colwell (c) 2012-08-15 15:19:22 UTC
(In reply to comment #0)
> http://dvcs.w3.org/hg/html-media/raw-file/tip/media-source/media-source.html#source-buffer-media-segment-constraints
> 
> "Gaps between media segments that are smaller than the audio frame size are
> allowed and should be rendered as silence. Such gaps should not be reflected by
> buffered.
> Note: This is intended to simplify switching between audio streams where the
> frame boundaries don't always line up across encodings (e.g. Vorbis)."
> 
> To quote http://xiph.org/vorbis/doc/vorbisfile/crosslap.html
> 
> "Vorbis introduces no extra samples at the beginning or end of a stream, nor
> does it remove any samples."
> 
> Given this, it's not clear why gaps are allowed.

Vorbis encodes at different bitrates do not guarantee that the exact same frame durations will be picked at each instant in the timeline. Higher bitrate encodes may opt to use a short window where the lower bitrate picked a long one. This means that the frame boundaries in the two encodes won't always line up if you splice them at an arbitrary point. This is why support for gaps is needed.
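
The misalignment described above can be illustrated with a minimal sketch. The per-frame sample counts below are made up for illustration and are not taken from real Vorbis streams, and the helper names `frame_boundaries` and `gap_at_switch` are hypothetical. The sketch assumes a switch leaves the old encode at its last frame boundary at or before the switch point and joins the new encode at its first frame boundary at or after it.

```python
from itertools import accumulate


def frame_boundaries(frame_sizes):
    """Return the sample indices where frames start/end, beginning at 0."""
    return [0] + list(accumulate(frame_sizes))


def gap_at_switch(old_sizes, new_sizes, switch_sample):
    """Samples left uncovered when switching encodes near switch_sample.

    Assumption: the old encode is cut at its last frame boundary at or
    before switch_sample, and the new encode is joined at its first frame
    boundary at or after switch_sample.
    """
    old_end = max(b for b in frame_boundaries(old_sizes) if b <= switch_sample)
    new_start = min(b for b in frame_boundaries(new_sizes) if b >= switch_sample)
    return new_start - old_end


# Hypothetical output sample counts per frame for two encodes of one source.
low_bitrate = [1024, 1024, 1024, 1024]
high_bitrate = [1024, 128, 128, 128, 128, 512, 1024, 1024]

print(gap_at_switch(high_bitrate, low_bitrate, 1500))  # 640
print(gap_at_switch(high_bitrate, low_bitrate, 2048))  # 0
```

In this made-up example, switching at sample 1500 leaves 640 samples uncovered, which is less than one long frame, while switching at sample 2048 happens to hit a boundary shared by both encodes.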
Comment 2 Philip Jägenstedt 2012-08-16 09:33:25 UTC
Oh right, we don't get to choose how the frames align, which has nothing to do with what I quoted...

http://xiph.org/vorbis/doc/Vorbis_I_spec.html says that the frame size is 64 to 8192, which ought to mean that if one cuts at an arbitrary sample the worst case mismatch is 4096 samples? That's a rather big gap to smooth over, is that something you see in practice?

Why is this not a problem for other formats? http://en.wikipedia.org/wiki/Advanced_Audio_Coding says that AAC has variable frame length.
Comment 3 Ralph Giles 2012-08-16 21:38:35 UTC
(In reply to comment #1)

> Vorbis encodes at different bitrates do not guarantee that the exact same frame
> durations will be picked at each instant in the timeline.

I think the point of what Aaron is saying is that Vorbis (and Opus) are variable-frame-size codecs, and as such there's no reason to expect that the frame boundaries will align between two different encodes.

That said, we should not specify 'render as silence'. That will cause an audible dropout (or a pop if you do it without any cross-fade). We should specify that implementations cross-lap the streams in a way that preserves the sample index when switching streams. A bad implementation could indeed pad with silence, but a good one would do intelligent reconstruction to bridge the gap, the way packet-loss concealment works in VoIP. This is the technique described on the crosslap page, implemented by the libvorbisfile crosslap API, and described by Monty at OVC a couple of years ago.
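
To make the contrast with padding concrete, here is a minimal sketch of a plain linear cross-fade. It is not the libvorbisfile crosslap algorithm and not packet-loss concealment. It assumes decoded samples from both streams are available for the same range of sample indices, which is a simplification of the gap case discussed here. The function name `linear_crossfade` is hypothetical.

```python
def linear_crossfade(outgoing, incoming):
    """Blend two equal-length sample lists that cover the same sample indices.

    The output has the same length as the inputs, so sample indices after
    the blend are unchanged.
    """
    if len(outgoing) != len(incoming):
        raise ValueError("overlap regions must have equal length")
    n = len(outgoing)
    mixed = []
    for i, (old, new) in enumerate(zip(outgoing, incoming)):
        # Weight of the incoming stream rises linearly, staying strictly
        # between 0 and 1 inside the overlap region.
        w = (i + 1) / (n + 1)
        mixed.append((1.0 - w) * old + w * new)
    return mixed


# Fading from a constant 1.0 signal to silence over three samples.
print(linear_crossfade([1.0, 1.0, 1.0], [0.0, 0.0, 0.0]))  # [0.75, 0.5, 0.25]
```

Padding with silence would instead replace the region with zeros, producing the abrupt level change that the comment above warns about.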
Comment 4 Adrian Bateman [MSFT] 2012-08-16 21:49:36 UTC
(In reply to comment #3)
> That said, we should not specify 'render as silence'. That will cause an
> audible drop out (or pop if you do it without any cross-fade). We should
> specify that implementations cross-lap the streams in a way that preserves the
> sample index when switching streams. A bad implementation could indeed pad with
> silence, but a good one would do intelligent reconstruction to bridge the gap,
> the way packet-loss concealment works in voip. This is the technique described
> on the crosslap page, implemented by the libvorbisfile crosslap API, and
> described by Monty at OVC a couple of years ago.

This sounds like a good quality of implementation feature. Simple implementations of media source may still be useful without this. The spec can suggest implementations might consider implementing something more sophisticated without requiring any particular technique.
Comment 5 Aaron Colwell (c) 2012-08-17 23:57:15 UTC
(In reply to comment #2)
> Oh right, we don't get to choose how the frames align, which has nothing to do
> with what I quoted...
> 
> http://xiph.org/vorbis/doc/Vorbis_I_spec.html says that the frame size is 64 to
> 8192, which ought to mean that if one cuts an arbitrary sample the worst case
> mismatch is 4096 samples? That's a rather big gap to smooth over, is that
> something you see in practice?

Most of the Vorbis content I've seen tends to keep frame durations around ~23ms (i.e. 1024 samples @ 44100 Hz). My guess is that the higher numbers are for higher sample rate content.
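
The ~23ms figure follows from dividing the samples per frame by the sample rate. A small sketch of that arithmetic, with a hypothetical helper name `frame_duration_ms`:

```python
def frame_duration_ms(samples_per_frame, sample_rate_hz):
    """Duration of one frame in milliseconds."""
    return 1000.0 * samples_per_frame / sample_rate_hz


print(round(frame_duration_ms(1024, 44100), 1))  # 23.2
```

By the same arithmetic, the 4096-sample worst case raised in comment 2 would be roughly 93ms at 44100 Hz.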

> 
> Why is this not a problem for other formats?
> http://en.wikipedia.org/wiki/Advanced_Audio_Coding says that AAC has variable
> frame length.
I believe this is referring to the number of bits in the coded frame, not the number of samples output per frame. MP3 had a fixed set of coded frame sizes, which AAC does not. I'm pretty sure both MP3 & AAC output a fixed number of samples per coded frame though. There may be counterexamples, but in my experience the Xiph codecs (Vorbis, Opus?) tend to be the only ones that have varying output sample counts.
Comment 6 Philip Jägenstedt 2012-08-20 10:54:55 UTC
Ah, I see, resolving this as invalid.

Nevertheless, is it not feasible to write a segment splitter that makes the cuts where the audio frames align across encodes, potentially letting the video frames straddle the boundaries?
Comment 7 Aaron Colwell (c) 2012-08-20 15:32:13 UTC
(In reply to comment #6)
> Ah, I see, resolving this as invalid.
> 
> Nevertheless, is it not feasible to write a segment splitter that makes the
> cuts where the audio frames align across encodes, potentially letting the video
> frames straddle the boundaries?
It isn't clear to me how much distance there would be between such points in practice. That would also only work for cases where you are switching between different encodes of the same source content. The UA has no idea whether that is the case or not. MSE was also intended to be used for splicing completely different content, and waiting for such an alignment point would be unacceptable because it could result in playing audio that doesn't match the video being displayed.
Comment 8 Timothy B. Terriberry 2012-08-21 01:53:36 UTC
(In reply to comment #5)
> Most of the Vorbis content I've seen tends to keep frame durations around ~23ms
> (ie 1024 @ 44100). My guess is that the higher numbers are for higher sample
> rate content.

It's actually for lower-bitrate content.

> samples per coded frame though. There may be counter examples, but in my
> experience the Xiph codecs (Vorbis, Opus?) tend to be the only ones that have
> varying output sample counts.

It is not hard to configure Opus to always use a constant frame size (in fact, this is what all the current tools do by default... right now you would have to write some custom code to be able to change it). But that can only be enforced if you control the encoder.

(In reply to comment #7)
> It isn't clear to me how much distance there would be between such points in
> practice. That would also only work for cases where you are switching between

In practice, they would line up on any transient. Depending on the content, that may not be often enough (it's also possible to construct pathological streams where they would _never_ line up).
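
The alignment points discussed in comments 6 to 8 can be sketched as the frame boundaries that two encodes share. The frame-size lists are made up for illustration, and `common_boundaries` is a hypothetical helper, not part of any Vorbis tool.

```python
from itertools import accumulate


def common_boundaries(sizes_a, sizes_b):
    """Return the sample indices where both encodes end a frame."""
    ends_a = set(accumulate(sizes_a))
    ends_b = set(accumulate(sizes_b))
    return sorted(ends_a & ends_b)


# Hypothetical encodes that line up again after a run of short frames.
encode_a = [1024, 1024, 1024, 1024]
encode_b = [1024, 128, 128, 128, 128, 512, 1024, 1024]
print(common_boundaries(encode_a, encode_b))  # [1024, 2048, 3072, 4096]

# Hypothetical offset encodes that only meet at the very end.
offset_a = [1024, 1024]
offset_b = [512, 1024, 512]
print(common_boundaries(offset_a, offset_b))  # [2048]
```

A segment splitter of the kind suggested in comment 6 could only cut at these shared indices, so how usable it is depends on how often they occur in real content.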