23169 – reconsider the jitter video quality metrics again

This is an archived snapshot of W3C's public bugzilla bug tracker, decommissioned in April 2019. Please see the home page for more details.

Status:	RESOLVED FIXED

Alias:	None

Product:	HTML WG
Classification:	Unclassified
Component:	Media Source Extensions
Version:	unspecified
Hardware:	All All

Importance:	P2 normal
Target Milestone:	LC
Assignee:	Aaron Colwell
QA Contact:	HTML WG Bugzilla archive list

Reported:	2013-09-05 18:24 UTC by David Singer
Modified:	2013-12-10 16:45 UTC
CC List:	7 users

Description David Singer 2013-09-05 18:24:26 UTC

<https://www.w3.org/Bugs/Public/show_bug.cgi?id=22148>

We are concerned about the new definition of the displayed frame delay, and the use of this value to accumulate a jitter value in totalFrameDelay.

Displayed Frame Delay
The delay, to the nearest microsecond, between a frame's presentation time and the actual time it was displayed. This delay is always greater than or equal to zero since frames must never be displayed before their presentation time. Non-zero delays are a sign of playback jitter and possible loss of A/V sync.

and
totalFrameDelay
The sum of all displayed frame delays for all displayed frames (i.e., frames included in the totalVideoFrames count, but not in the droppedVideoFrames count).

Here are our concerns:

1. The use of microseconds may be misleading. There is an implied precision here which is rarely (if ever) achievable; by no means everyone can time 'to the nearest microsecond' and sometimes the measurement has to be done 'before the photons emerge from the display', at a point in the pipeline where the rest of it is not completely jitter-free.

2. In any case, frames are actually displayed at the refresh times of the display; display times are actually quantized to the nearest refresh time. So, if I was slightly late in pushing a frame down the display pipeline, but it hit the same refresh as if I had been on time, there is no perceptible effect at all.

3. Thus, ideally, we'd ask for the measurement system to be aware of which display refresh the frame hit, and all results would be quantized to the refresh rate. However, in some (many?) circumstances, though the average or expected pipeline delay is known or can be estimated, the provision of frames for display is not tightly linked to the display refresh, i.e. at the place of measurement, we don't know when the refreshes happen.

4. There is a big difference in jitter between presenting 2000 frames all 5ms late (consistently), and in presenting 50 of them 200ms late and the rest on time, though for both we'd report 10,000ms totalFrameDelay. The 5ms late may not matter at all (see above), whereas 200ms is noticeable (lipsync will probably be perceptibly off). There is nothing in the accumulation of values, today, that takes into account *variation*, which is really the heart of what jitter is about.
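
To make item 4 concrete, here is a minimal sketch that builds both delay patterns and compares their sum with their spread. The helper names are hypothetical and are not part of any spec or browser API; the standard deviation is used only as one possible stand-in for "variation".

```typescript
// Hypothetical helpers for illustration; not part of any spec or browser API.

// Sum of per-frame delays, i.e. what totalFrameDelay would accumulate (in ms).
function sumDelays(delaysMs: number[]): number {
  return delaysMs.reduce((acc, d) => acc + d, 0);
}

// Population standard deviation of the per-frame delays (in ms).
function stdDevDelays(delaysMs: number[]): number {
  const mean = sumDelays(delaysMs) / delaysMs.length;
  const variance =
    delaysMs.reduce((acc, d) => acc + (d - mean) * (d - mean), 0) / delaysMs.length;
  return Math.sqrt(variance);
}

// Pattern A: 2000 frames, each 5 ms late.
const consistentlyLate: number[] = new Array<number>(2000).fill(5);

// Pattern B: 2000 frames, 50 of them 200 ms late and the rest on time.
const occasionallyLate: number[] = new Array<number>(2000).fill(0);
for (let i = 0; i < 50; i++) {
  occasionallyLate[i * 40] = 200; // spread the late frames evenly
}

const totalA = sumDelays(consistentlyLate);     // 10000
const totalB = sumDelays(occasionallyLate);     // 10000
const spreadA = stdDevDelays(consistentlyLate); // 0
const spreadB = stdDevDelays(occasionallyLate); // sqrt(975), about 31.2
```

The two totals are identical, so a consumer that only sees the accumulated sum cannot tell the patterns apart; only a measure of spread separates them.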

I don't have a proposal right now for something better, but felt it was worth surfacing these concerns. Do others have similar, or other, concerns, about these measurements? Or indeed, suggestions for something that might alleviate these or other concerns (and hence, be better)?

I guess a big question is: what are the expected *uses* of these two values?

Comment 1 Mark Watson 2013-09-05 19:19:23 UTC

Regarding 1, 2, 3:

These are fair points. It's probably not correct to require "to the nearest microsecond" - the whole thing will always be approximate.

The downstream delay due to frame refresh will be on average half the refresh interval. So, on average, this could be accounted for.

Regarding 4:

The user of this property should have some notion of the number of frames displayed, or at least elapsed time which will suffice so long as the frame rate is roughly constant. The intention is that they would sample this on a regular basis and evaluate the rate of change. A rate of change below some threshold is in the noise or indicates perfect rendering. If the rate of change is above some threshold then this indicates consistent late rendering.

The application is to detect CPU overload, which in some system manifests as dropped frames but in other systems manifests first as late rendering before frames are dropped in order to "catch up" (if things don't get back on track). An app can track the severity of such events over time and decide to stream at a lower rate on this device.
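
The sampling approach described above might look roughly like the following sketch. The `DelaySample` shape, the use of milliseconds, and the threshold are assumptions made for illustration; the thread does not specify how an application would store its samples or what threshold it would pick.

```typescript
// Hypothetical sketch: the sample shape and the threshold are assumptions.
interface DelaySample {
  totalFrameDelayMs: number; // cumulative displayed frame delay so far
  totalVideoFrames: number;  // cumulative frame count so far
}

// Average delay added per frame between two samples (ms per frame).
function delayPerFrameMs(prev: DelaySample, curr: DelaySample): number {
  const frames = curr.totalVideoFrames - prev.totalVideoFrames;
  if (frames <= 0) {
    return 0; // no new frames, so there is no rate to evaluate
  }
  return (curr.totalFrameDelayMs - prev.totalFrameDelayMs) / frames;
}

// True when the rate of change is above an application-chosen threshold.
function isConsistentlyLate(
  prev: DelaySample,
  curr: DelaySample,
  thresholdMsPerFrame: number
): boolean {
  return delayPerFrameMs(prev, curr) > thresholdMsPerFrame;
}
```

For example, 300 ms of added delay over 30 new frames is 10 ms per frame, which would exceed a threshold of 8 ms per frame, while 30 ms over the same 30 frames would not.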

Comment 2 Aaron Colwell 2013-09-13 20:06:06 UTC

I'm still not a huge fan of this metric but here might be a compromise that can address David's concerns and hopefully be acceptable to Mark.

I propose that we introduce the concept of a "late" frame that represents a frame that was displayed, but not in the correct screen refresh. We then add a lateVideoFrames counter to track the number of frames that fit this criterion.
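
One way to read the "late" frame definition is sketched below. It assumes that refresh intervals are aligned to time zero and that times are in milliseconds; both are simplifications for illustration, and the function names are hypothetical.

```typescript
// Hypothetical sketch; assumes refresh intervals start at time 0.

// Index of the refresh interval (R0, R1, R2, ...) that contains a given time.
function refreshIndex(timeMs: number, refreshIntervalMs: number): number {
  return Math.floor(timeMs / refreshIntervalMs);
}

// A frame is "late" if it was displayed in a later refresh than the one
// its presentation time falls in.
function isLateFrame(
  presentationMs: number,
  displayedMs: number,
  refreshIntervalMs: number
): boolean {
  return (
    refreshIndex(displayedMs, refreshIntervalMs) >
    refreshIndex(presentationMs, refreshIntervalMs)
  );
}
```

With a 60Hz display (an interval of about 16.7 ms), a frame due at 0 ms and shown at 10 ms still lands in R0 and is not late, whereas one shown at 20 ms lands in R1 and is late.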

I'm still not sure whether this is going to be overly useful to the application so I am going to try to outline the various scenarios that I could see happening. For all these examples, I am going to assume a 60Hz refresh rate and that only one frame will actually be displayed per refresh interval. I'll also use notation like R0, R1, R2 to describe individual refresh intervals.

Scenario 1: Clip frame rate > refresh rate.
Say we have 240fps content. Since the refresh rate is only 60 fps, I'd expect that droppedVideoFrames would increment by 3 for every 4 totalVideoFrames because only 1 out of the 4 frames for each refresh interval would get displayed. In this case we don't need late frames.

Scenario 2: Clip frame rate == refresh rate.
For 60fps content, I would expect that droppedVideoFrames would reflect any missed refresh intervals. For example, if a frame was supposed to be displayed in R0 but wasn't displayed until R2, I would expect that the frames that should have been displayed in R1 & R2 would cause the droppedVideoFrames counter to increment twice because these frames were "too late" to display. If we add the concept of a "late" frame, then I would expect the lateVideoFrames count to be incremented by 1 since 1 frame missed its deadline. The droppedVideoFrames would roughly reflect how late it was.

Scenario 3: Clip frame rate < refresh rate.
Say we have 15fps content. I'd expect frames to be delivered at R0, R4, R8, etc. If the R0 frame is displayed in R1, R2, or R3 it increments the lateVideoFrames counter because the display deadline was missed. This late display would not cause the droppedVideoFrames counter to increment because the display of another frame was not affected. If the R0 frame was displayed at R4-R7 then I'd expect the lateVideoFrames & droppedVideoFrames counters to both increment because we had 1 late display and this also resulted in a dropped frame.
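
The counter behaviour expected in Scenarios 2 and 3 can be sketched as follows. This is only one reading of the scenarios, with hypothetical names; it assumes a whole number of refreshes per frame and does not model Scenario 1, where the clip frame rate exceeds the refresh rate.

```typescript
// Hypothetical sketch of the expected counter increments for one late display.
interface LateDisplayOutcome {
  lateIncrement: number;    // how much lateVideoFrames would grow
  droppedIncrement: number; // how much droppedVideoFrames would grow
}

// intendedRefresh / actualRefresh are refresh indices (R0 = 0, R1 = 1, ...).
// refreshesPerFrame is 1 for 60fps on 60Hz and 4 for 15fps on 60Hz.
function accountForLateDisplay(
  intendedRefresh: number,
  actualRefresh: number,
  refreshesPerFrame: number
): LateDisplayOutcome {
  const slip = actualRefresh - intendedRefresh;
  if (slip <= 0) {
    return { lateIncrement: 0, droppedIncrement: 0 }; // displayed on time
  }
  return {
    lateIncrement: 1,
    // Every full frame period of slip pushes out one following frame.
    droppedIncrement: Math.floor(slip / refreshesPerFrame),
  };
}
```

For 60fps content, a frame due in R0 but shown in R2 gives one late frame and two dropped frames. For 15fps content, slipping to R3 gives one late frame and no drops, and slipping to R5 gives one late frame and one drop.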

From my perspective Scenario 3 is the only one where I think we would benefit from the "late" frame concept. It isn't clear to me whether it would provide a huge benefit especially if the clip frame rate and the refresh rate are pretty close to each other. The benefit would seem to increase the lower the frame rate, but low frame rate content doesn't usually tax the CPU as heavily so it doesn't seem like the application would have much room to adapt downward anyway. Also any delays larger than the clip's frame duration would show up as dropped frames so I'm not sure what this extra signal is buying us.

Comment 3 Aaron Colwell 2013-10-04 22:59:11 UTC

Ping. David & Mark please take a look at Comment 2 and provide some feedback so I can make some progress on this bug. Thanks.

Comment 4 Mark Watson 2013-10-08 15:30:01 UTC

Aaron,

What you describe assumes an implementation which drops late frames except the first. That's one possible implementation. What I understand is that there are other implementations where there could be a run of late frames.

Specifically, I believe there are implementations where frames are accompanied through the pipeline not by their absolute rendering time but by the inter-frame interval. In such an implementation there can be an accumulating mis-alignment between the correct and actual rendering time. I believe in the implementation in question such an accumulation is detected after some short time - possibly multiple frames - and accounted for by eventually dropping frames.

The totalFrameDelay was intended to enable detection of this condition by the application before or in concert with dropped frames.

At a first look, it seems like a count of late frames would also suffice for the same purpose. The count does not distinguish between a frame that is a little bit late and a frame that is a lot late. Conversely, the totalFrameDelay does not distinguish between a number of frames that are each slightly late and a single frame which is very late. I assume we do not ever expect an individual frame to be very late (like 10s of frame intervals), so neither of these is a problem and we could choose based on implementation complexity / complexity of definition. The latter favors the late frame count.

I will also check with our implementors.

Comment 5 Aaron Colwell 2013-10-08 16:18:09 UTC

(In reply to Mark Watson from comment #4)
> Aaron,
> 
> What you describe assumes an implementation which drops late frames except
> the first. That's one possible implementation. What I understand is that
> there are other implementations where there could be a run of late frames.

True. In this case, I'd expect the lateVideoFrames counter to be incremented for each frame that was late.

> 
> Specifically, I believe there are implementations where frames are
> accompanied through the pipeline not by their absolute rendering time but by
> the inter-frame interval. In such an implementation there can be an
> accumulating mis-alignment between the correct and actual rendering time. I
> believe in the implementation in question such an accumulation is detected
> after some short time - possibly multiple frames - and accounted for by
> eventually dropping frames.
> 
> The totalFrameDelay was intended to enable detection of this condition by
> the application before or in concert with dropped frames.

It seems like the effectiveness of this metric is based on how deep that pipeline is. Is there a case where incrementing the lateVideoFrames won't cause droppedVideoFrames to at least increment by one? It seems like as soon as the media engine determines that a bunch of frames are late it would start dropping frames to "catch up".
- What is the scenario where late frames are tolerated for a while w/o triggering frame dropping?
- How often do you expect the web application to poll these stats to detect this condition?
- How long do you expect the delta between detecting late frames and the media engine taking action to drop frames would be?

I'm concerned that the window to take action on "lateness" is too small to be worth worrying about. 

> 
> At a first look, it seems like a count of late frames would also suffice for
> the same purpose. The count does not distinguish between a frame that is a
> little bit late and a frame that is a lot late.

Presumably "a lot late" should trigger a ton of dropped frames so the media engine could catch up. This should look catastrophic to the web app and trigger a downshift I would hope.

> Conversely, the
> totalFrameDelay does not distinguish between a number of frames that are
> each slightly late and a single frame which is very late. I assume we do not
> ever expect an individual frame to be very late (like 10s of frame
> intervals), so neither of these is a problem and we could choose based on
> implementation complexity / complexity of definition. The latter favors the
> late frame count.

I'm just trying to sort out whether the application really needs to know the time delta or not. It doesn't seem like the actual time matters because there is nothing the application can do about that. It seems like counts at least provide a signal where the application can compute the percentage of lateness and dropped frames and use those as a signal of quality. The counts are also robust across frame rate changes. If you deal with time, then changes in frame rate may affect the acceptable "lateness" threshold that one uses for adaptation.
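
A count-based quality signal of the kind described above could be computed along these lines. The `FrameCounts` shape is hypothetical: `lateVideoFrames` is the counter proposed in this thread rather than an existing attribute, and dividing both counts by the total frames in the window is an arbitrary choice made for the sketch.

```typescript
// Hypothetical snapshot of the cumulative counters at one polling instant.
interface FrameCounts {
  totalVideoFrames: number;
  droppedVideoFrames: number;
  lateVideoFrames: number; // proposed counter, assumed here for illustration
}

interface QualityFractions {
  lateFraction: number;
  droppedFraction: number;
}

// Fractions of late and dropped frames among the frames seen between two polls.
function lateAndDroppedFractions(prev: FrameCounts, curr: FrameCounts): QualityFractions {
  const frames = curr.totalVideoFrames - prev.totalVideoFrames;
  if (frames <= 0) {
    return { lateFraction: 0, droppedFraction: 0 }; // nothing new to evaluate
  }
  return {
    lateFraction: (curr.lateVideoFrames - prev.lateVideoFrames) / frames,
    droppedFraction: (curr.droppedVideoFrames - prev.droppedVideoFrames) / frames,
  };
}
```

Because the result is a ratio of counts, the same threshold can be applied whether the clip runs at 15fps or 60fps, which is the robustness argument made above.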

> 
> I will also check with our implementors.

Comment 6 Jerry Smith 2013-10-15 18:37:35 UTC

The rationale over all is that there needs to be something that measures frame lateness, and that having dropped frames is not enough.

We believe frame delay communicates more information than late frames.  We expect that the totalFrameDelay metric would be monitored at some interval.  The usage of totalFrameDelay would then be:

1)  Not changing value – good quality playback
2)  Uniformly increasing value – consistent A/V sync (video is behind) but no further improvements or degradations (no jitter)
3)  Non-uniformly increasing value – most likely worsening playback (video falling further behind) or jitter caused by the application trying to compensate by reducing resolution, etc (i.e. improving playback)

So, the totalFrameDelay attribute on its own can provide useful information.

It's possible for a system to go into frame-dropping mode and still be in #1 above since the frames that aren't dropped are still on-time. That state would be detected by the droppedVideoFrames attribute.
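
The three usage cases listed above can be sketched as a classifier over periodic samples of totalFrameDelay. The names, the millisecond units, and the tolerance parameter are assumptions for illustration; the thread does not define how "uniform" would be judged.

```typescript
// Hypothetical labels for the three cases described above.
type DelayTrend = "notChanging" | "uniformlyIncreasing" | "nonUniformlyIncreasing";

// samplesMs: totalFrameDelay values (ms) read at a regular polling interval.
// toleranceMs: how much two readings may differ and still count as equal.
function classifyDelayTrend(samplesMs: number[], toleranceMs: number): DelayTrend {
  // Differences between consecutive samples.
  const deltas: number[] = [];
  let previous: number | undefined;
  for (const sample of samplesMs) {
    if (previous !== undefined) {
      deltas.push(sample - previous);
    }
    previous = sample;
  }
  if (deltas.length === 0 || deltas.every((d) => Math.abs(d) <= toleranceMs)) {
    return "notChanging"; // case 1: the value is flat
  }
  const smallest = Math.min(...deltas);
  const largest = Math.max(...deltas);
  if (largest - smallest <= toleranceMs) {
    return "uniformlyIncreasing"; // case 2: the same increment every interval
  }
  return "nonUniformlyIncreasing"; // case 3: increments vary
}
```

For instance, samples of 0, 100, 200, 300 ms are uniformly increasing, while 0, 100, 150, 400 ms are not.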

Comment 7 Aaron Colwell 2013-10-15 20:34:55 UTC

(In reply to Jerry Smith from comment #6)
> The rationale over all is that there needs to be something that measures
> frame lateness, and that having dropped frames is not enough.

Why does the late frame counter not satisfy this? Is there a specific reason you need to answer the "how late" question?

> 
> We believe frame delay communicates more information than late frames.  We
> expect that the totalFrameDelay metric would be monitored at some interval. 
> The usage of totalFrameDelay would then be:
> 
> 1)  Not changing value – good quality playback
This is equivalent to late & dropped frame counter not incrementing.

> 2)  Uniformly increasing value – consistent A/V sync (video is behind) but
> no further improvements or degradations (no jitter)

Why does the application care about this case? Isn't it up to the UA to make sure that the audio & video are rendered with proper A/V sync? What is the web application supposed to do about it, if the UA isn't doing this properly? This seems like a bug in the MSE implementation and not something that the web application should need to worry about.

> 3)  Non-uniformly increasing value – most likely worsening playback (video
> falling further behind) or jitter caused by the application trying to
> compensate by reducing resolution, etc (i.e. improving playback)

I feel like late and/or dropped frames capture this as well just in a slightly different way. I'd expect late & dropped frame counters to increment non-uniformly in this situation as well.

> 
> So, the totalFrameDelay attribute on its own can provide useful information.

I agree, but it isn't clear to me that exposing such detailed timing information is really necessary. What if we just had an enum that indicated that the UA believes it is in one of those 3 states? That seems like a much clearer way to convey the quality of experience to the web application instead of exposing the totalFrameDelay metric.

> 
> It's possible for a system to go into frame-dropping mode and still be in #1
> above since the frames that aren't dropped are still on-time. That state
> would be detected by the droppedVideoFrames attribute.

I think this would be equivalent to the late counter not incrementing and the dropped counter incrementing. The application could decide if this was an acceptable experience or not.

Ideally I'd like to get away from this time based metric because I believe it may be difficult to get consistent measurements across browsers. I think different measurement precisions and differences in various delays in each browser's media engine will cause this metric to be unreliable or may encourage browser specific interpretation. If that happens, I think we've failed.

Comment 8 Mark Watson 2013-10-22 16:02:31 UTC

I believe that either of total frame delay and dropped frame count could meet the requirement.

In either case, the threshold can be a display refresh interval - that is, a frame is 'late' if it is displayed in the wrong refresh interval.

I still have a mild preference for total frame delay, but without a strong rationale for that preference ;-)

To answer your questions:

- What is the scenario where late frames are tolerated for a while w/o triggering frame dropping?

Imagine 30fps content on a 60Hz display, a few frames are rendered late and then a bunch of frames are rendered at 60fps until we catch up. This might not be the best UX (dropping to catch up might be better), but it's a possible behaviour.

- How often do you expect the web application to poll these stats to detect this condition?

Every second or so.

- How long do you expect the delta between detecting late frames and the media engine taking action to drop frames would be?

I don't know this, but I believe there can be a scenario where there is late rendering and no frame dropping.

Comment 9 Aaron Colwell 2013-10-29 00:19:52 UTC

David please review the comments and provide some feedback. I would like to know if this discussion is addressing your concerns and getting us closer to being able to close this bug.

Comment 10 Jerry Smith 2013-11-05 01:34:02 UTC

It’s true that frame delays would be quantized to the display refresh rate; however, total delay can still provide more information than counting late frames, assuming late frames are counted once per frame whether they are late a single refresh cycle or multiple.  That should mean that once the video stream is one frame late, every frame would be counted as late, the TotalLateFrame metric would expand and client JS would presumably respond by lowering the video quality.  If just one refresh cycle late, that may not be appropriate.

TotalFrameDelay in this instance would accurately communicate that frames were running a specific time interval late, and JS would be allowed to make its own determination on whether the delay is perceptible to users.  If, however, the stream moved to multiple refresh cycles delayed, this would show as a larger value in TotalFrameDelay, but not in TotalLateFrames.  

If this example is accurate, it would suggest that TotalLateFrames may more aggressively trigger quality changes, but perhaps not desirable ones; and TotalFrameDelay communicates more information that would allow tuning of the response to slight, moderate or large delays in the video stream.  The analog nature of the time data makes it a more desirable feedback signal in what is essentially a closed loop system.
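
The difference between the two signals in this example can be sketched as below. It takes the comment's premise at face value (every frame in the run is late by the same number of refresh cycles), uses a 20 ms refresh interval only to keep the numbers round, and all names are hypothetical.

```typescript
// Hypothetical summary of a run of frames.
interface LatenessSummary {
  lateFrames: number;   // counts each late frame once, however late it is
  totalDelayMs: number; // sums the actual lateness
}

// refreshesLatePerFrame: for each displayed frame, how many refresh cycles late it was.
function summarizeLateness(
  refreshesLatePerFrame: number[],
  refreshIntervalMs: number
): LatenessSummary {
  let lateFrames = 0;
  let totalDelayMs = 0;
  for (const refreshesLate of refreshesLatePerFrame) {
    if (refreshesLate > 0) {
      lateFrames += 1;
      totalDelayMs += refreshesLate * refreshIntervalMs;
    }
  }
  return { lateFrames, totalDelayMs };
}

// 100 frames that are each one refresh late versus each three refreshes late.
const oneRefreshLate = summarizeLateness(new Array<number>(100).fill(1), 20);
const threeRefreshesLate = summarizeLateness(new Array<number>(100).fill(3), 20);
// Both report 100 late frames; the total delays are 2000 ms and 6000 ms.
```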

Comment 11 Aaron Colwell 2013-11-05 02:32:52 UTC

(In reply to Jerry Smith from comment #10)
> It’s true that frame delays would be quantized to the display refresh rate;
> however, total delay can still provide more information than counting late
> frames, assuming late frames are counted once per frame whether they are
> late a single refresh cycle or multiple.  That should mean that once the
> video stream is one frame late, every frame would be counted as late, the
> TotalLateFrame metric would expand and client JS would presumably respond by
> lowering the video quality.  If just one refresh cycle late, that may not be
> appropriate.
> 

If one frame misses its display deadline I wouldn't expect that to imply that all future frames would miss their display deadlines too. Only under some sort of constant load would I expect this to happen. In that case it might be a good thing for the application to start thinking about downshifting because there is load present that is preventing the UA from hitting its deadlines.


> TotalFrameDelay in this instance would accurately communicate that frames
> were running a specific time interval late, and JS would be allowed to make
> its own determination on whether the delay is perceptible to users.  If,
> however, the stream moved to multiple refresh cycles delayed, this would
> show as a larger value in TotalFrameDelay, but not in TotalLateFrames.  

I have concerns about leaving this up to the application to sort out. If the delay goes beyond 100ms or so then it is definitely perceptible. Why defer to the application here? Also, if frames are this late, why shouldn't the UA just start dropping frames in an attempt to reestablish A/V sync? This should be a minor & temporary blip in the counts reported if nothing serious is happening.

> 
> If this example is accurate, it would suggest that TotalLateFrames may more
> aggressively trigger quality changes, but perhaps not desirable ones; and
> TotalFrameDelay communicates more information that would allow tuning of the
> response to slight, moderate or large delays in the video stream.  The
> analog nature of the time data makes it a more desirable feedback signal in
> what is essentially a closed loop system.

I think the application should only react if there is persistent lateness and/or dropped frames. I agree that responding to one-off lateness would definitely result in instability. 

I do have concerns though that the totalFrameDelay signal will have different characteristics across UA implementations. I believe that will make writing adaptation algorithms that are not UA specific difficult. I think using counts might make this a little better, but different drop characteristics might lead to the same problem.


In the absence of anyone else supporting my alternate solution and since no other solution has been proposed, I'm happy to concede and just resolve this as WONTFIX. The current text was already something I could live with so if the consensus is to leave things as is, I'm fine with that.

Comment 12 David Singer 2013-11-06 23:21:49 UTC

(In reply to Aaron Colwell from comment #11)
> In the absence of anyone else supporting my alternate solution and since no
> other solution has been proposed, I'm happy to concede and just resolve this
> as WONTFIX. The current text was already something I could live with so if
> the consensus is to leave things as is, I'm fine with that.

I am working with my colleagues to try to find something better that is both easily implemented and more meaningful.  Can you hang on a bit (until after TPAC at least)?

Comment 13 David Singer 2013-11-12 01:06:36 UTC

We think there are (at least) three possible ways to go to get a measure that will help noticing when the media engine or platform is 'in trouble.'  We assume in all cases there is some reasonable way to reset the counters (which is probably not on every fetch, or high-frequency checking is likely to be less helpful as it will involve small numbers, making the web app responsible for analyzing them).  A moving window is another possibility, but these are harder to implement and need another parameter (the window length), so it's not so good.

3 possibilities:


1) Keep the following variables
* dropped frames
* displayed frames
* mean displayed lateness
* standard deviation of the lateness
* late frame count


2) Keep a count of 
* dropped frames, 
* on-time frames (or displayed frames)
* maximum lateness, and 
* late frame counts in buckets. 
* total late frames (sum over all buckets)

Because the precision of the delay is less and less important as the delay increases, we size the bucket widths in powers of two. The bucket windows can be [0;1ms[, [1ms;2ms[, [2ms;4ms[, [4ms;8ms[, [8ms;16ms[ (these first buckets are always 0 for a 60Hz display), [16ms;32ms[, [32ms;64ms[, etc. With this property, all we need to agree on is the base (here 1ms) on which the window size is calculated. I think this value can be more easily determined because it has a low impact on the usefulness of the buckets. (We could also base the buckets on the display frequency (giving [0;1/f[, [1/f;2/f[, [2/f;4/f[, [4/f;8/f[, etc.).)

This gives a reasonable number of buckets.
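
As a rough sketch of the power-of-two bucketing described for option 2, the following maps a lateness value to a bucket index and builds a histogram. The function names and the array-based histogram are assumptions for illustration, not a proposed API.

```typescript
// Index of the bucket containing a lateness value.
// Bucket 0 is [0; base[, bucket 1 is [base; 2*base[, bucket 2 is [2*base; 4*base[, ...
function bucketIndex(latenessMs: number, baseMs: number): number {
  if (!(baseMs > 0) || !isFinite(latenessMs)) {
    throw new Error("baseMs must be positive and latenessMs finite");
  }
  let index = 0;
  let upperBoundMs = baseMs;
  while (latenessMs >= upperBoundMs) {
    upperBoundMs *= 2; // each bucket is twice as wide as the one before it
    index += 1;
  }
  return index;
}

// Late frame counts per bucket for a list of lateness values.
function bucketCounts(latenessValuesMs: number[], baseMs: number): number[] {
  const counts: number[] = [];
  for (const lateness of latenessValuesMs) {
    const index = bucketIndex(lateness, baseMs);
    while (counts.length <= index) {
      counts.push(0); // grow the histogram on demand
    }
    counts[index] += 1;
  }
  return counts;
}
```

With a 1 ms base, a frame that is 17 ms late falls in [16ms;32ms[ (index 5) and one that is 200 ms late falls in [128ms;256ms[ (index 8), so even very late frames need only a handful of buckets.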


3) If those are too complex, keep the following counts
* dropped frames
* on-time frames (or displayed frames)
* slightly late frames
* noticeably late frames


where noticeably late frames are those that are late by more than some threshold, which is set at initialization, or which has a suitable default and a reset API.  For many purposes, frames that are late by 1-2 frame durations (when that is well defined) or late enough to cause audio sync problems, are the ones to notice, whereas 'slightly late' frames may not be a concern, or only a minor concern.


Because when a frame is displayed it is either on-time or late, giving on-time frames is equivalent to giving displayed frames.
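
Option 3 could be tallied as in the sketch below. The input encoding (null for a dropped frame, otherwise lateness in milliseconds) and the names are assumptions made for illustration; the threshold is passed in because the thread leaves its value open.

```typescript
// Hypothetical counters for option 3.
interface LatenessTally {
  dropped: number;
  onTime: number;
  slightlyLate: number;
  noticeablyLate: number;
}

// latenessMs entries: null means the frame was dropped, otherwise its lateness in ms.
function tallyFrames(
  latenessMs: Array<number | null>,
  noticeableThresholdMs: number
): LatenessTally {
  const tally: LatenessTally = { dropped: 0, onTime: 0, slightlyLate: 0, noticeablyLate: 0 };
  for (const lateness of latenessMs) {
    if (lateness === null) {
      tally.dropped += 1;
    } else if (lateness <= 0) {
      tally.onTime += 1;
    } else if (lateness > noticeableThresholdMs) {
      tally.noticeablyLate += 1; // late by more than the threshold
    } else {
      tally.slightlyLate += 1;
    }
  }
  return tally;
}
```

The displayed frame count is then onTime + slightlyLate + noticeablyLate, which is why reporting on-time frames and reporting displayed frames carry the same information.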

Comment 14 Aaron Colwell 2013-11-13 09:10:03 UTC

Of the 3 options that David proposed, #3 is the one I'd support if I had to pick one. Ideally I'd like us to just pick reasonable constants for 'slightly' and 'noticeably' late instead of making these thresholds configurable.

I'm not a fan of #1 because I believe the metrics would become less sensitive to transient changes in performance as the number of frames increases. You could probably back out the underlying sums from the metrics to counteract this effect, but it may not be worth it.
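
Backing out the underlying sums, as mentioned above, might be done roughly as follows for the mean. The `MeanSnapshot` shape and millisecond units are hypothetical; the same idea would need a running sum of squares to recover a windowed standard deviation, which is not shown.

```typescript
// Hypothetical snapshot of option 1's cumulative statistics at one polling instant.
interface MeanSnapshot {
  displayedFrames: number; // cumulative displayed frame count
  meanLatenessMs: number;  // cumulative mean lateness over those frames
}

// Mean lateness of only the frames displayed between two snapshots.
function windowMeanLatenessMs(prev: MeanSnapshot, curr: MeanSnapshot): number {
  const frames = curr.displayedFrames - prev.displayedFrames;
  if (frames <= 0) {
    return 0; // no new frames in the window
  }
  // Recover the cumulative sums from mean * count, then difference them.
  const prevSumMs = prev.meanLatenessMs * prev.displayedFrames;
  const currSumMs = curr.meanLatenessMs * curr.displayedFrames;
  return (currSumMs - prevSumMs) / frames;
}
```

For example, if the cumulative mean moves from 1 ms over 9000 frames to 1.5 ms over 10000 frames, the most recent 1000 frames averaged 6 ms late, a change that the cumulative mean alone largely hides.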

Option #2 seems too complex w/o a whole lot of gain. I don't think it is clear how applications should evaluate the current quality based on the bucket distribution. I'd like to see us stick with a simpler metric than this.


At this point I'm ready to just defer to Jerry, Mark, and David here. I don't really care beyond keeping things as simple as possible for this first version of MSE. For what it's worth, my current preference order based on all the proposals is:
1. Existing text. (Since it made Jerry & Mark happy, I can live with it, and it's a no-op)
2. My single late frame count proposal.
3. David's option #3 which essentially adds 2 late frame counters instead of 1.

Comment 15 Mark Watson 2013-11-13 09:14:41 UTC

My order of preference is the same as Aaron's (except we do need to address the existing text which says 'to the nearest microsecond').

Comparing David's option 3 to the existing text or the late frame count, I don't really understand the value of having two separate late frame counts.

[Regarding the 'to the nearest microsecond', we could replace this with 'in microseconds, to the nearest display refresh interval']

Comment 16 Aaron Colwell 2013-12-02 19:28:46 UTC

Changes committed.
https://dvcs.w3.org/hg/html-media/rev/79954895a223

Text from comment 15 applied based on discussion at TPAC.

Comment 17 Jerry Smith 2013-12-03 22:32:32 UTC

Microsoft recommends that the frame delay metrics be double-precision with units in seconds to match other timing variables used for media.  This matches our implementation in IE11.

Doing this would require the following tweak to Displayed Frame Delay:

The delay between a frame's presentation time and the actual time it was displayed, in a double-precision value in seconds & rounded to the nearest display refresh interval. This delay is always greater than or equal to zero since frames must never be displayed before their presentation time. Non-zero delays are a sign of playback jitter and possible loss of A/V sync.

totalFrameDelay would similarly be double-precision in seconds as a summation of the individual frame delays.
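
The revised definition could be sketched as follows, with times as double-precision seconds. The function names are hypothetical, and clamping negative raw delays to zero is an assumption that follows from the statement that the delay is never negative.

```typescript
// Delay in seconds between presentation time and display time,
// rounded to the nearest display refresh interval.
function displayedFrameDelaySeconds(
  presentationTimeS: number,
  displayTimeS: number,
  refreshIntervalS: number
): number {
  const rawDelayS = Math.max(0, displayTimeS - presentationTimeS); // never negative
  return Math.round(rawDelayS / refreshIntervalS) * refreshIntervalS;
}

// Sum of the individual frame delays, also in seconds.
function sumFrameDelaysSeconds(frameDelaysS: number[]): number {
  return frameDelaysS.reduce((acc, d) => acc + d, 0);
}
```

On a 60Hz display, a frame shown 4 ms after its presentation time rounds to a delay of 0, while one shown 20 ms after rounds to one refresh interval (1/60 s).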

Comment 18 Aaron Colwell 2013-12-10 16:45:32 UTC

Change committed.
https://dvcs.w3.org/hg/html-media/rev/d8ad50e85da3