W3C

– DRAFT –
Media Timed Events

17 January 2022

Attendees

Present
Amber_Ryan, Chris_Needham, Fuqiao_Xue, Karl_Carter, Kaz_Ashimura, Rob_Smith, Takio_Yamaoka, Xabier_Rodriguez_Calvar, Yuhao_Fu
Regrets
-
Chair
Chris
Scribe
cpn

Meeting minutes

Introduction

Chris: This is the Media Timed Events TF call
… Welcome to Amber and Karl

DataCue review

Chris: We had some feedback from C2PA, who were looking at the DataCue API as a potential solution
… The most recent feedback is that they aren't looking at the DASH emsg event as a carrier for their metadata
… The certificate check that was proposed may no longer be a requirement
… Related issue: https://github.com/WICG/datacue/issues/21
… Outcome is that we need to review the explainer, and possibly review the requirements related to encryption
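[For illustration, a minimal sketch of how timed metadata could be surfaced to a page through the proposed DataCue API, assuming the constructor shape in the WICG explainer (startTime, endTime, value, type); the cue type string and payload below are made up for the example:]

  // Sketch only: DataCue as proposed in the WICG explainer, not a standardised API.
  const video = document.querySelector('video')!;
  const track = video.addTextTrack('metadata', 'timed metadata');

  // Hypothetical payload: application metadata delivered alongside the media.
  const cue = new (window as any).DataCue(10, 12, { note: 'example' }, 'org.example.metadata');
  track.addCue(cue);

  track.oncuechange = () => {
    const active = track.activeCues;
    for (let i = 0; active && i < active.length; i++) {
      console.log('active cue', (active[i] as any).type, (active[i] as any).value);
    }
  };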

Rob: Reading that issue, can you explain the use case?

Chris: It's about demonstrating the provenance, how content has been edited along the way

<kaz> Explainer

Rob: I have a similar use case, dashcam evidence for the police, provenance is important
… The public can submit footage to the police; it must be as-captured and not edited
… Not sure how they determine that
… With a sidecar file like WebVMT this could be more difficult

Chris: It does sound interesting to explore. Here's the info: https://c2pa.org/

Rob: Another use case is evidence when an event is reported using a smartphone. Is it authentic? Has the metadata been constructed or genuinely captured?
… Disaster relief, state crime, etc

Chris: I haven't had much time to contribute to the explainer or draft spec
… We talked last time about separating the DataCue part from the in-band emsg sourcing part

SEI event handling

<kaz> video SEI event Explainer

Chris: This was presented at a previous MEIG meeting: https://www.w3.org/2021/12/07-me-minutes.html

<kaz> Yuhao's slides for the Dec-7 meeting

Yuhao: I'm in the web media team at ByteDance. There are many scenarios where we need SEI information
… Broadcasters use software to publish the stream. Events that describe when something happened go into the RTMP stream as SEI events

<kaz> (SEI stands for Supplemental Enhancement Information in H.264.)

Yuhao: The player receives the SEI information, parses it, and synchronises it between the demuxer and the video's current time
… I raised the proposal to see if we can get SEI information directly from the video element, to make it easier to synchronise SEI information with the video frame
… Also to make it easier for the demuxer, don't need to parse manually

Chris: In this group we've looked at DASH emsg which is in the media container
… SEI events are in the media bitstream rather than the container

Yuhao: We use SEI commonly in China; we produce different live stream formats: DASH, HLS, FLV
… It's in a NAL unit in the AVC or HEVC stream, so there's no need to transfer the data; it can be put in any container
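[For illustration, a minimal sketch of finding SEI NAL units in an Annex B (start-code delimited) H.264 elementary stream; in MP4/FLV packaging the NAL units are length-prefixed instead, and HEVC uses NAL types 39/40, so a real demuxer would branch on the format:]

  // Simplified scan: 3-byte start codes only; a 4-byte start code leaves a
  // trailing zero byte on the previous NAL unit, which is fine for a sketch.
  function findH264SeiPayloads(data: Uint8Array): Uint8Array[] {
    const starts: number[] = [];
    for (let i = 0; i + 3 < data.length; i++) {
      if (data[i] === 0 && data[i + 1] === 0 && data[i + 2] === 1) {
        starts.push(i + 3);   // first byte after the 00 00 01 start code
        i += 2;
      }
    }
    const seis: Uint8Array[] = [];
    starts.forEach((start, n) => {
      const end = n + 1 < starts.length ? starts[n + 1] - 3 : data.length;
      const nalType = data[start] & 0x1f;   // H.264: low 5 bits of the NAL header
      if (nalType === 6) {                  // 6 = SEI in H.264
        seis.push(data.subarray(start + 1, end));
      }
    });
    return seis;
  }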

Chris: How does it interact with EME?
… Can the SEI information be extracted before the media enters the CDM?

Yuhao: In most scenarios, the information in SEI is simple, it doesn't need EME

Takio: Should the timing be before or after decoding? There's a decoding order and a presentation order
… If you need the message before decoding, we should recommend which timing SEI events should be fired at
… For emsg, the MP4 container describes the decoding timestamp. The video stream itself doesn't carry any timestamp, so we could make clear the use case and requirement for when to fire the message

Kaz: Thank you for the proposal and discussion. Based on the discussion in the December meeting, it could be useful to clarify use cases
… and describe the timing of decoding and the integration with EME

<kaz> video SEI event Explainer on GitHub

Chris: Let's capture questions in GitHub, use that to update the explainer
… 1. Interaction with EME
… 2. Timing of event firing, decode or presentation order

Chris: You describe the bullet chat use case, where information is used to describe where overlays can be placed in the image

Yuhao: In China, SEI is used to describe the shape and position of the body, and in the player we make a transparent mask

<kaz> Masking in Bullet Chatting
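[For illustration, a minimal sketch of the client-side masking step, assuming the SEI payload has already been parsed into a normalised body polygon (a hypothetical shape, not the actual payload format) and the bullet-chat comments are drawn on a canvas layered over the video:]

  // Punch the body region out of the comment canvas so text never covers the person.
  function applyBodyMask(ctx: CanvasRenderingContext2D, polygon: [number, number][]) {
    const { width, height } = ctx.canvas;
    ctx.save();
    ctx.globalCompositeOperation = 'destination-out';   // erase where we fill
    ctx.beginPath();
    polygon.forEach(([x, y], i) => {
      const px = x * width, py = y * height;
      if (i === 0) ctx.moveTo(px, py); else ctx.lineTo(px, py);
    });
    ctx.closePath();
    ctx.fill();
    ctx.restore();
  }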

Chris: So is the composition of the image done in the client?
… So the client uses the metadata to place the overlaid content while playing the video

Yuhao: In iOS Safari we cannot get the stream content, we cannot demux, so it's not as accurate as we expect

Chris: Interesting from an implementation feasibility perspective
… In this case do you simply give the video element the HLS manifest?

Yuhao: Yes
… Another use case is WebRTC. We can solve the RTC problem with insertable streams. If we can get SEI information from the video directly, it will be simpler
… Getting the information from the video is a simpler solution
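[For illustration, a minimal sketch of how SEI can be read from received WebRTC video today, assuming Chromium's encoded insertable streams API (createEncodedStreams); the standards-track RTCRtpScriptTransform does the same in a worker. findH264SeiPayloads is the hypothetical helper sketched above:]

  const pc = new RTCPeerConnection({ encodedInsertableStreams: true } as any);
  pc.ontrack = (event) => {
    const { readable, writable } = (event.receiver as any).createEncodedStreams();
    const transform = new TransformStream({
      transform(frame, controller) {
        const sei = findH264SeiPayloads(new Uint8Array(frame.data));
        if (sei.length) {
          // Hand the SEI payloads (plus frame.timestamp) to the application.
          console.log('SEI at RTP timestamp', frame.timestamp, sei.length);
        }
        controller.enqueue(frame);   // pass the frame through unchanged
      },
    });
    readable.pipeThrough(transform).pipeTo(writable);
  };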

Chris: What are the synchronization requirements?
… The video playback runs separately to the DOM
… It can be difficult to synchronize changes to the DOM with the video frames
… In the bullet chat use case, are there SEI messages for every frame?

Yuhao: For a 60 fps video, maybe we just need a 15 fps update frequency for the information
… You may not be able to see the difference, and storing data at 60 fps would use more bandwidth

Chris: How precise does the overlay rendering need to be, in relation to the video?

Yuhao: In most cases for playing video, we synchronise everything to currentTime. In most cases we don't need more precision

Chris: How many milliseconds accuracy, roughly?

Yuhao: In our scenarios we use 60 fps video, so 16 ms is the smallest unit, and 10 ms is maybe enough
… Maybe we can use the requestVideoFrameCallback API. When a video frame is rendered the callback is triggered, and we can synchronize to that
… https://wicg.github.io/video-rvfc/
… The callback includes the presentation time of the current frame, so we can use that to synchronize on the specific frame
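[For illustration, a minimal sketch of frame-level synchronisation with requestVideoFrameCallback; metadata.mediaTime is the presentation timestamp of the frame being composited. lookupSeiForTime and renderOverlay are hypothetical application helpers:]

  // Hypothetical app helpers (stubs for the sketch).
  declare function lookupSeiForTime(mediaTime: number): unknown;
  declare function renderOverlay(sei: unknown): void;

  const video = document.querySelector('video')!;
  function onFrame(now: DOMHighResTimeStamp, metadata: VideoFrameCallbackMetadata) {
    const sei = lookupSeiForTime(metadata.mediaTime);   // SEI parsed earlier by the app
    if (sei) renderOverlay(sei);
    video.requestVideoFrameCallback(onFrame);           // re-register for the next frame
  }
  video.requestVideoFrameCallback(onFrame);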

Rob: This relates to a problem I'm thinking about with perspective imagery
… If we're drawing on video, it matters which frame we're looking at. If it's wrong it could be noticeable
… Is there any overlap with AR applications, e.g., WebXR?
… The other problem is latency: how fast can you respond to a frame?

Chris: Good to look at rVFC, to see how our proposal fits

Rob: My use case is geo-pose, location and orientation (tilt, roll), related to geographic coordinates
… The problem I have is that it's easy to sample location, e.g., every second. But orientation can change very quickly
… So how fast do you need to sample it?

Chris: I also have questions about WebCodecs, e.g., related to https://github.com/w3c/webcodecs/issues/198
… Do we need a proposal for WebCodecs and also a proposal for HTML <video> based playback?
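[For illustration, a minimal sketch of the WebCodecs side: the application already holds the encoded bytes it feeds to VideoDecoder, so SEI could be read before decoding without a new API there. findH264SeiPayloads is the hypothetical helper sketched earlier and assumes Annex B framing:]

  function decodeWithSei(decoder: VideoDecoder, chunk: EncodedVideoChunk) {
    const bytes = new Uint8Array(chunk.byteLength);
    chunk.copyTo(bytes);                           // copy out the encoded data
    for (const sei of findH264SeiPayloads(bytes)) {
      console.log('SEI for chunk at', chunk.timestamp, 'µs,', sei.byteLength, 'bytes');
    }
    decoder.decode(chunk);                         // then decode as usual
  }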

Kaz: Do we want to create a TF for this discussion, or continue within the MTE TF or main call?

Chris: I don't think we've figured out yet whether DataCue is the right solution for SEI; it may or may not be
… When should we meet next to continue the discussion?
… Yuhao, can we raise questions in your GitHub repo?

Yuhao: Yes, that's OK

Chris: I'll do that
… And when should we schedule our next meeting?
… The next scheduled MTE call is February 21
… In a future call we could discuss moving the proposal in GitHub to W3C space (probably WICG)

[Adjourned]

Minutes manually created (not a transcript), formatted by scribe.perl version 185 (Thu Dec 2 18:51:55 2021 UTC).