19676 – timestampOffset accuracy

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Status:	RESOLVED LATER

Alias:	None

Product:	HTML WG
Classification:	Unclassified
Component:	Media Source Extensions
Version:	unspecified
Hardware:	All Windows 3.1

Importance:	P2 normal
Target Milestone:	---
Assignee:	Aaron Colwell (c)
QA Contact:	HTML WG Bugzilla archive list

Whiteboard:	tpac2012

Reported:	2012-10-23 17:39 UTC by Pierre Lemieux
Modified:	2013-04-09 15:37 UTC
CC List:	7 users

Description Pierre Lemieux 2012-10-23 17:39:15 UTC

Time offsets and durations within media streams, e.g. to specify splice points, are often expressed in multiples of video frame or audio sample durations. These durations are therefore typically rational* numbers, e.g. 1/24, 1001/30000, 1/48000, etc. 

As a double-precision floating-point value, timestampOffset cannot exactly represent such rational time offsets. For instance, (double) ((1/24)*17) < 17*1/24 < (double) (17/24) -- at least in Win32 Python.

Accuracy can be important when a splice point needs to fall on a specific frame boundary or when comparing multiple timestampOffset values. For instance, depending on how it was calculated, (double) (17/24) might actually fall either within the 16th frame or the 17th frame, instead of on the boundary between the 16th and 17th frame.
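The inequality above can be checked with Python's `fractions` module, which converts a double to the exact rational value it stores. This is only an illustrative sketch: the 24 fps rate comes from the example, while `frame_index` is a hypothetical helper and the zero-based frame numbering is an assumption.

```python
from fractions import Fraction
import math

FRAME_RATE = 24  # frames per second, as in the example above

# Two ways of computing "17 frames at 24 fps" as a double.
product = (1 / 24) * 17
quotient = 17 / 24
exact = Fraction(17, 24)  # the intended rational offset


def frame_index(t):
    """Zero-based index of the frame containing time t (hypothetical helper).

    Fraction(t) is the exact value held by a double, so the multiplication
    below introduces no further rounding.
    """
    return math.floor(Fraction(t) * FRAME_RATE)


# The two doubles bracket the exact rational value.
print(Fraction(product) < exact < Fraction(quotient))

# Evaluated exactly, the two doubles land in different frames.
print(frame_index(product), frame_index(exact), frame_index(quotient))
```

With IEEE 754 doubles the first line prints `True` and the second prints `16 17 17`: the value computed as a product falls just before the boundary and the value computed as a quotient falls just after it.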

Potential approaches include:

- specifying rounding and closest frame boundary selection algorithms
- expressing timestampOffset as a rational (the implementation could store this internally as it wishes)

* Many container formats (ISO BMFF, MXF...) express durations and offsets as rationals, e.g. as integer multiples of a rational timescale expressed as the ratio between an integer numerator and an integer denominator.

Comment 1 Adrian Bateman [MSFT] 2012-10-23 17:56:54 UTC

Adding tpac2012 tag.

Would it be sufficient to specify that the presentation & decode timestamps must be rounded to the nearest microsecond after the timestampOffset is applied? It seems like this should be sufficient for most practical applications. This seems like a better option than introducing rationals everywhere and also provides a simple way to maintain support for variable frame-rate & variable samplerate content.
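A minimal sketch of the microsecond-rounding idea follows, assuming timestamps are held as seconds in a double. The helper name `to_microsecond_grid` is hypothetical and is not part of any specification text.

```python
MICROSECONDS_PER_SECOND = 1_000_000


def to_microsecond_grid(seconds):
    """Round a timestamp in seconds to an integer count of microseconds."""
    return round(seconds * MICROSECONDS_PER_SECOND)


# Two doubles that both approximate 17/24 s but differ in the last bit.
product = (1 / 24) * 17
quotient = 17 / 24

print(product == quotient)  # the raw doubles are not equal
print(to_microsecond_grid(product) == to_microsecond_grid(quotient))
print(to_microsecond_grid(quotient))  # 708333
```

Both values map to the same grid point, so comparisons on the grid are stable. In principle two nearly equal doubles could still straddle a half-microsecond point and round to different grid values, so this reduces the ambiguity rather than removing it in every case.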

Comment 3 Pierre Lemieux 2013-01-08 08:14:22 UTC

In terms of providing an unambiguous splice point I am not sure that rounding is sufficient. I think the specification would also need to indicate where the splice will happen exactly, e.g. the closest "edit unit boundary/access unit boundary" in time or the next "edit unit boundary/access unit boundary" in time. A challenge is defining "edit unit boundary/access unit boundary" for each essence kind, e.g. audio sample, coded audio frame, video frame, GOP, etc... Makes sense?

(In reply to comment #3)
> In terms of providing an unambiguous splice point I am not sure that
> rounding is sufficient. I think the specification would also need to
> indicate where the splice will happen exactly, e.g. the closest "edit unit
> boundary/access unit boundary" in time or the next "edit unit
> boundary/access unit boundary" in time. A challenge is defining "edit unit
> boundary/access unit boundary" for each essence kind, e.g. audio sample,
> coded audio frame, video frame, GOP, etc... Makes sense?

I still don't understand why this is necessary. Please provide concrete examples where rounding to the nearest microsecond would not be sufficient for the common use cases on the web. It seems to me that microsecond precision is sufficient for the typical audio sample rates, and frame rates that are commonly used.

Comment 5 Pierre Lemieux 2013-01-08 16:36:41 UTC

> Please provide concrete examples where rounding
> to the nearest microsecond would not be sufficient
> for the common use cases on the web.

Splices between media streams do not occur on millisecond boundaries, but on editable unit boundaries (frame, sample, GOP, etc...), so the specification needs to specify at which frame boundary the splice will happen.

(In reply to comment #5)
> > Please provide concrete examples where rounding
> > to the nearest microsecond would not be sufficient
> > for the common use cases on the web.
> 
> Splices between media streams do not occur on millisecond boundaries, but on
> editable unit boundaries (frame, sample, GOP, etc...), so the specification
> needs to specify which frame boundary the splice will happen.

My point is that I believe microsecond precision is sufficient to unambiguously indicate which "editable unit boundary" is intended. Do you agree? If not can you provide a concrete example where using microseconds would be problematic?
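The claim can be illustrated with a round trip: express a unit boundary on a microsecond grid, then recover the nearest boundary from the grid value. This is a sketch under the assumption of a constant integer rate (frames or samples per second); both helper names are hypothetical.

```python
from fractions import Fraction

MICROSECONDS_PER_SECOND = 1_000_000


def to_microseconds(units, rate):
    """Time of boundary number `units` at `rate` units per second, rounded to 1 us."""
    return round(Fraction(units, rate) * MICROSECONDS_PER_SECOND)


def nearest_boundary(microseconds, rate):
    """Index of the unit boundary closest to a microsecond timestamp."""
    return round(Fraction(microseconds, MICROSECONDS_PER_SECOND) * rate)


for units, rate in [(17, 24), (1000, 44_100), (123_457, 48_000)]:
    us = to_microseconds(units, rate)
    print(units, rate, us, nearest_boundary(us, rate))
```

The grid introduces at most 0.5 us of error, which is well under half a unit duration at these rates (about 20.8 us per sample at 48 kHz), so the original index is recovered. The sketch says nothing about variable frame rates or about which boundary a decoder actually splices on, which is the point still under discussion in this thread.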

Comment 7 Pierre Lemieux 2013-01-08 17:01:46 UTC

> If not can you provide a concrete example where using microseconds would be problematic?

Why require rounding to the nearest microsecond instead of keeping the full precision of the double?

Comment 8 Cyril Concolato 2013-01-20 16:44:03 UTC

In the same spirit, why is the spec mandating in section 4.5.6 "Coded Frame Processing", points 1.1 and 1.2, that the Media Source Engine maintains internal timestamps as double precision floating point values, as in:

"Let presentation timestamp be a double precision floating point representation of the coded frame's presentation timestamp."
An implementation could decide to store the timestamp and run the algorithm using rationals.

I suggest deleting the part "a double precision floating point representation of" in those two points.

(In reply to comment #8)
> In the same spirit, why is the spec mandating in section 4.5.6 "Coded Frame
> Processing" point 1.1 and 1.2 that the Media Source Engine maintains
> internal timestamps as double precision floating points, as in:
> 
> "Let presentation timestamp be a double precision floating point
> representation of the coded frame's presentation timestamp."
> An implementation could decide to store the timestamp and run the algorithm
> using  rationals.
> 
> I suggest deleting the part "a double precision floating point
> representation of" in those two points.

The reason I added text about converting to double precision floating point was to address issues with timestamp rollover that can occur in bytestreams like MPEG2-TS. The idea is to convert the bytestream timestamp representation to a common representation that doesn't have the same rollover problems that the bytestream format may have. I picked double since that is what all timestamps in the existing HTML5 APIs as well as the MSE APIs use. It is also easier to talk about adding the timestamp offset to these timestamps because addition of doubles is well defined and doesn't require any consideration for timestamp rollover.
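As a rough illustration of the rollover problem: MPEG-2 transport stream presentation timestamps are 33-bit counts of a 90 kHz clock, so they wrap roughly every 26.5 hours. The sketch below converts raw values to seconds in a double while unwrapping. The function name and the simple wrap detection are assumptions made for illustration, and the sketch ignores the frame reordering a real demuxer has to handle.

```python
PTS_MODULUS = 1 << 33   # MPEG-2 TS timestamps are 33-bit counters
PTS_CLOCK_HZ = 90_000   # ticks per second


def unwrap_pts(raw_values):
    """Yield timestamps in seconds that keep increasing across a rollover.

    A wrap is assumed whenever the raw value drops by more than half the
    counter range (hypothetical heuristic, presentation order assumed).
    """
    wraps = 0
    previous = None
    for raw in raw_values:
        if previous is not None and previous - raw > PTS_MODULUS // 2:
            wraps += 1
        previous = raw
        yield (raw + wraps * PTS_MODULUS) / PTS_CLOCK_HZ


# One second before the wrap, the last tick, then one second after the wrap.
raw = [PTS_MODULUS - 90_000, PTS_MODULUS - 1, 0, 90_000]
print(list(unwrap_pts(raw)))
```

Once the values are plain seconds, adding timestampOffset is ordinary addition and needs no modular arithmetic.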

I don't think removing "a double precision floating point representation of" is sufficient to address these original concerns. Perhaps a note along the following lines would help: "Implementations do not have to store timestamps internally as doubles but they must use sufficient precision to avoid timestamp rollovers when applying a timestamp offset. The conversion to double precision floating point is suggested here to make understanding timestamp modification easier."

So I plan on adding a note indicating that implementations don't have to use double precision floating point for the timestamps, and indicate that they are simply used in the spec for clearly describing the intended addition behavior for applying timestampOffset to the timestamps in the media data. While I think that is an important clarification to the spec, I don't think it fully addresses your concern here and I'm not sure what else needs to be done for this bug. 

- I don't think it makes sense to convert the timestampOffset field to a rational.

- closest frame boundary is a hard concept to nail down when multiple frame rates or sample rates could be used in a single presentation. It is only really defined when you are overlapping existing frames in the buffer. That makes me nervous because it means rounding only happens during overlaps, which I think could lead to other problems down the road.

- I understand that rationals can be computed several ways that don't always result in the same double precision value, but the only way I can think of to address that is to define some delta which indicates how close the timestamps have to be for them to be considered identical. That is where I was going with the microsecond resolution comments. It provides a content independent grid that all timestamps can be mapped to.

How would you like me to proceed?

Changes committed.
https://dvcs.w3.org/hg/html-media/rev/77975abeec41

Added a note stating that implementations don't have to use doubles as their internal representation.

Comment 12 Pierre Lemieux 2013-02-12 05:19:19 UTC

(In reply to comment #10)
> I'm not sure what else needs to be
> done for this bug. 

Two options come to mind:

- describe precisely the algorithm that the implementation will use to determine the out point of the earlier segment and the in point of the later segment, taking  into account the granularity of the segments (frame, sample, etc.) and the granularity of timestampOffset (e.g. microsecond grid); or

- allow the timestampOffset to be specified as a rational. An implementation does not need to preserve the rational representation internally, so I am not sure I understand the burden.

The second approach is well documented in other standards. I am however happy to be shown that the first works.

Changes committed.
https://dvcs.w3.org/hg/html-media/rev/d5956e93b991

Changes have been added to round timestamps to the nearest sample boundary for audio. Video frames that are slightly before a frame already in the buffer will overwrite the one in the buffer. Video frames that are up to 1us after a frame in the track buffer will cause the existing frame to be removed. I believe this should provide sufficient behavior to accurately splice video while also still allowing video with different frame rates to be easily spliced together. As long as the web application doesn't introduce rounding errors greater than 1us, I think everything will work as the content author intends.

Comment 14 Pierre Lemieux 2013-03-09 08:56:44 UTC

Thanks for the updated draft. Some initial comments below based on my attempts at implementing the algorithm.

> and presentation timestamp lies within a coded frame already
> let overlapped frame be the coded frame in track buffer that contains presentation timestamp.

What do 'lie' and 'contain' mean? Specifically, do we mean 'overlapped frame' = 'existing frame N' such that 'existing frame N presentation timestamp' <= presentation timestamp < 'existing frame N+1 presentation timestamp' ?

> If track buffer contains video coded frames and presentation
> timestamp is less than 1 microsecond beyond the presentation
> timestamp of overlapped frame, then remove overlapped frame
> and any coded frames that depend on it from track buffer.

What does 'beyond' mean?

Does 'overlapped frame' mean the coded frame whose presentation timestamp is 'presentation timestamp' +/- 1 us, or something else?

The note below the paragraph states "as long as it is within 1 microsecond".

> Let overlapped frame be the coded frame in track buffer that overlaps
> with new coded frame (ie. it contains presentation timestamp).

In contrast with coded video frames, the timestampOffset for coded audio frames does not include a rounding tolerance, so ambiguities can occur. See below an example using AC3 frames containing 44.1 kHz audio.

(5*1536)/44100 - 5*(1536)*(1/44100) = -2.7755575615628914e-17

> Round & update presentation timestamp and decode timestamp

'round' should be defined. Do we mean floor(x + 0.5)?

Changes committed.
https://dvcs.w3.org/hg/html-media/rev/f0fb58d45f96

Updated text to make the algorithms more clear.

(In reply to comment #14)
> Thanks for the updated draft. Some initial comments below based on my
> attempts at implementing the algorithm.
> 
> > and presentation timestamp lies within a coded frame already
> > let overlapped frame be the coded frame in track buffer that contains presentation timestamp.
> 
> What do 'lie' and 'contain' mean? Specifically, do we mean ''overlapped
> frame' -= 'existing frame N' such that 'existing frame N presentation
> timestamp' <= presentation timestamp < 'existing frame N+1 presentation
> timestamp' ?
>

My most recent changes removed these terms and now use >= & < language that is basically equivalent to this.
 
> > If track buffer contains video coded frames and presentation
> > timestamp is less than 1 microsecond beyond the presentation
> > timestamp of overlapped frame, then remove overlapped frame
> > and any coded frames that depend on it from track buffer.
> 
> What does 'beyond' mean?

I meant >, but it sounded weird to say "less than 1 microsecond greater than." I've rearranged this text in my latest update to make this clearer.

> 
> Does 'overlapped frame' mean the coded frame whose presentation timestamp is
> 'presentation timestamp' +/- 1 us, or something else?
> 
> The note below the paragraph states "as long as it is within 1 microsecond".

I clarified the definition for this as well. Overlapped frame gets removed by the 1 microsecond rule only if the "presentation timestamp" is greater than the overlapped frame's presentation timestamp. If it is before, then the normal frame removal logic applies. This 1 us rule is just there to prevent the existing frame from staying in the buffer if the web application slightly overshoots the existing frame's presentation timestamp.
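One possible reading of the 1 microsecond rule described here, written as a sketch rather than as the spec's algorithm. Timestamps are assumed to be seconds in a double, the function name is hypothetical, and treating an exactly equal timestamp as inside the window is an assumption.

```python
ONE_MICROSECOND = 1e-6  # seconds


def removed_by_one_microsecond_rule(presentation_timestamp, overlapped_pts):
    """Return True if the overlapped video frame falls under the 1 us rule.

    The rule only covers a new frame that starts at or slightly after the
    existing frame (including equality is an assumption). Earlier timestamps
    are left to the normal frame removal logic.
    """
    remove_window_timestamp = overlapped_pts + ONE_MICROSECOND
    return overlapped_pts <= presentation_timestamp < remove_window_timestamp


existing = 17 / 24  # presentation timestamp of a frame already in the buffer

print(removed_by_one_microsecond_rule(existing + 2e-7, existing))  # slight overshoot
print(removed_by_one_microsecond_rule(existing + 1e-3, existing))  # well inside the frame
print(removed_by_one_microsecond_rule(existing - 1e-7, existing))  # undershoot
```

Only the slight overshoot returns `True`; the other two cases fall outside the rule.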

> 
> > Let overlapped frame be the coded frame in track buffer that overlaps
> > with new coded frame (ie. it contains presentation timestamp).
> 
> In contrast with coded video frames, the timestampOffset for coded audio
> frames does not include a rounding tolerance, so ambiguities can occur. See
> below an example using AC3 frames containing 44.1 kHz audio.
> 
> (5*1536)/44100 - 5*(1536)*(1/44100) = -2.7755575615628914e-17
> 
> > Round & update presentation timestamp and decode timestamp
> 
> 'round' should be defined. Do we mean floor(x + 0.5)?

I've removed the word round because it isn't really accurate. The UA just computes the sample timestamps that are higher and lower than the presentation timestamp and picks the closest one. I've also added text to describe what to do in the equidistant case. There is no need for a rounding tolerance for audio.
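A sketch of the "pick the closest sample timestamp" step, applied to the 44.1 kHz example from comment 14. The function name is hypothetical, and resolving the equidistant case toward the higher timestamp is an assumption, since this comment does not say which way the tie goes.

```python
import math


def nearest_sample_timestamp(timestamp, sample_rate):
    """Snap a timestamp in seconds to the closest audio sample timestamp.

    Ties go to the higher sample timestamp (assumed tie-break).
    """
    lower_index = math.floor(timestamp * sample_rate)
    lower = lower_index / sample_rate
    higher = (lower_index + 1) / sample_rate
    if timestamp - lower < higher - timestamp:
        return lower
    return higher


SAMPLE_RATE = 44_100

# The two computations from comment 14; they may differ in the last bit.
a = (5 * 1536) / SAMPLE_RATE
b = 5 * 1536 * (1 / SAMPLE_RATE)

print(a - b)
print(nearest_sample_timestamp(a, SAMPLE_RATE) == nearest_sample_timestamp(b, SAMPLE_RATE))
```

Both inputs snap to sample 7680, so a last-bit difference in the double does not change which sample boundary is selected.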

Comment 17 Pierre Lemieux 2013-03-18 23:20:46 UTC

Thanks. Couple of follow-up issues:

> Let remove window timestamp equal overlapped frame presentation
> timestamp plus 1 microsecond.

This seems to apply only when there is overlap. If so, what about if the splice is supposed to happen at the end of the existing frame buffer, i.e. the end of the last frame of the existing frame buffer corresponds to the start of the first frame of the added segment?

> Update presentation timestamp and decode timestamp to the nearest audio
> sample timestamp based on sample rate of the audio in overlapped frame.

Don't we need to adjust timestampOffset so that the following frames will align as well?

>  then remove overlapped frame and any coded frames that depend on it from track buffer.

Same here, don't we need to adjust timestampOffset so that the presentation time of following frames reflects the small offset as well?

>  There is no need for a rounding tolerance for audio.

What about if the splice is designed with no overlap, and the coded audio frames are supposed to butt?

> that overlap presentation timestamp plus the splice duration of 5 milliseconds.

Larger than or equal?

(In reply to comment #17)
> Thanks. Couple of follow-up issues:
> 
> > Let remove window timestamp equal overlapped frame presentation
> > timestamp plus 1 microsecond.
> 
> This seems to apply only when there is overlap. If so, what about if the
> splice is supposed to happen at the end of the existing frame buffer, i.e.
> the end of the last frame of the existing frame buffer corresponds to the
> start of the first frame of the added segment?

Yes. The frame will be inserted at the specified timestamp if there isn't overlap.

> 
> > Update presentation timestamp and decode timestamp to the nearest audio
> > sample timestamp based on sample rate of the audio in overlapped frame.
> 
> Don't we need to adjust timestampOffset so that the following frames will
> align as well?

No. The point of this mechanism is not to modify timestampOffset. It is to make sure that timestamps that are slightly off can properly trigger removal of existing frames in the buffer. The UA is not in a position to override the web application's intent. It is up to the web application to make sure that timestampOffset is set as accurately as possible.

> 
> >  then remove overlapped frame and any coded frames that depend on it from track buffer.
> 
> Same here, don't we need to adjust timestampOffset so that the presentation
> time of following frames reflect the small offset as well?

No.

> 
> >  There is no need for a rounding tolerance for audio.
> 
> What about if the splice is designed with no overlap, and the coded audio
> frames are supposed to butt?

If there is no overlap then there isn't a problem. The specified timestamp will be used and if there is a slight gap then silence will be inserted. The sample rates aren't very high so if the application has trouble accurately specifying the start of the frame then it can expect problems.

> 
> > that overlap presentation timestamp plus the splice duration of 5 milliseconds.
> 
> Larger than or equal?

I'll add text to clarify this refers to all coded frames that have presentation timestamps > _presentation_timestamp_ and < _presentation_timestamp_ + 5 milliseconds.

Change committed that clarifies text mentioned in comment 18
https://dvcs.w3.org/hg/html-media/rev/1e6898152c5b

Please keep this bug closed. The main accuracy issues have been addressed and I think we should wait until we have implementation experience before determining whether further changes in this area are really needed. If an application is really worried about errors introduced by timestampOffset, it can simply author the content with the desired timestamps and not use the timestampOffset mechanism at all. I believe the current mechanism is "good enough" for most practical uses and I think we should wait until issues arise in actual implementations before reopening this.