W3C

– DRAFT –
Sync on Web, now and next of realtime media services on web

25 September 2024

Attendees

Present
Chris_Needham, Francois_Daoust, Hisayuki_Ohmata, Kaz_Ashimura, Kunihiko_Toumura, mjwilson, Nigel_Megitt
Regrets
-
Chair
KensakuKOMATSU
Scribe
kaz, kota, cpn

Meeting minutes

Time alignment for media synchronization will be discussed in this session

Komatsu: Media over QUIC, no head of line blocking
… In video cases, each frame is transferred over an independent QUIC stream
… That's the main difference compared to HLS or DASH

Media over QUIC is a relay protocol over QUIC or HTTP/3, often used for live streaming, although the use cases are not limited to it.

Komatsu: CMAF is fragmented MP4. Latency ranges from 50 seconds down to 2 seconds, depending on how you use it
… With MoQ, per-frame transfer is used, so that each QUIC stream contains under 34 milliseconds of data
… For realising low latency live stream services, this duration is important
… You can get very low latency services over MoQ
… HLS and DASH are flexible, and similarly MoQ supports both live and on-demand services
… Synchronising A/V data and arbitrary data is interesting
… Here's a demo to show low latency and sync
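
A minimal TypeScript sketch of the per-frame transfer model described above, assuming WebCodecs and WebTransport are available; the relay URL and the 8-byte timestamp prefix are illustrative stand-ins, not actual MoQT wire format:

  const transport = new WebTransport("https://relay.example/moq"); // hypothetical relay URL
  await transport.ready;

  // One encoded frame per unidirectional stream: a late frame never blocks the
  // next one (no head-of-line blocking).
  async function sendFrame(chunk: EncodedVideoChunk): Promise<void> {
    const payload = new Uint8Array(chunk.byteLength);
    chunk.copyTo(payload);

    const stream = await transport.createUnidirectionalStream();
    const writer = stream.getWriter();

    const header = new DataView(new ArrayBuffer(8));
    header.setBigUint64(0, BigInt(chunk.timestamp)); // capture timestamp in microseconds
    await writer.write(new Uint8Array(header.buffer));
    await writer.write(payload);
    await writer.close();
  }

  const encoder = new VideoEncoder({
    output: (chunk) => void sendFrame(chunk),
    error: (e) => console.error(e),
  });
  encoder.configure({ codec: "vp8", width: 1280, height: 720 });
  // encoder.encode(frame) would then be called once per captured VideoFrame.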

MoQT is flexible enough that developers could handle synchronization between multiple types of data such as audio and video

Komatsu: The sender sends the video and audio data and auxiliary data. Then it's transferred to a relay server
… We have this in the cloud
… We use moxygen, developed by Meta
… [Shows demo with sender and receiver]
… There's very low latency
… Glass-to-glass delay is about 100 ms or less
… Now I'll demo data synchronisation
… [Demo shows real time face detection]
… We can also send MIDI data
… With this data we can provide live services

MoQT Synchronization Demo: face landmarks/avatar data and video synchronization is performed

Komatsu: [Demo shows a virtual avatar overlaid in the video image]
… Just the data is transferred, and the avatar is rendered on the receiver side
… I think this is a fantastic feature of MoQ
… Now I'll explain about the synchronisation. The diagram shows the sender side

Sending only the point cloud data enables developers to render the 3D avatar on the subscriber side with their own preferences

Komatsu: Video image is transferred to WASM
… MIDI data will be transmitted to Web MIDI

Audio, video, and other data are multiplexed into a single MoQT session using multiple tracks on the sender side

Komatsu: In the MoQ context, we can get capture timestamps
… MOQT is the MoQ Transport protocol
… We can send each data in a track: audio, video, data
… Send over MOQT to the relay server
… On the receiver side, the browser receives the MOQT
… Get the raw image data from each frame with its capture timestamp, and the MIDI data, and synchronise rendering
… MIDI can be used with synthesizers etc
… In live entertainment cases, you can show a live show on the screen, and with MIDI data, the piano sound can be enjoyed by the viewers

On the receiver side, AV and other data are synchronized according to the capture timestamp

Komatsu: Rendering not only to screen, but orchestrated to external devices
… How do I synchronise the data?
… On the sender side, the browser uses rAF, which ticks at about 16 ms (60 fps screen update)
… Get the video image data, and data with the same timing. On the receiver side, get from each timeslot and render
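
A minimal sketch of the rAF-driven sampling loop described above: on each tick, the video frame and the latest auxiliary data are stamped with the same capture time so the receiver can render them together. publishVideo and publishData are assumed stand-ins for the MoQT track publishers, not actual APIs:

  declare function publishVideo(frame: VideoFrame, captureTime: number): void;
  declare function publishData(data: unknown, captureTime: number): void;

  const videoEl = document.querySelector("video")!;
  let latestLandmarks: unknown = null; // updated elsewhere, e.g. by the WASM face detector

  function tick(now: DOMHighResTimeStamp): void {
    const captureTime = now; // one timestamp shared by every track in this ~16 ms slot
    const frame = new VideoFrame(videoEl, { timestamp: Math.round(captureTime * 1000) });
    publishVideo(frame, captureTime); // publisher is responsible for frame.close()
    if (latestLandmarks !== null) publishData(latestLandmarks, captureTime);
    requestAnimationFrame(tick);
  }
  requestAnimationFrame(tick);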

Synchronizing face landmarks and AV is relatively easy, as landmarks are sent at the same time in the requestAnimationFrame callback

Komatsu: With external MIDI devices the data is asynchronous. Inside the 16 ms slot, a MIDI event is fired in the browser
… Playing to external MIDI devices on the receiver side. The video clock isn't enough, because there's a time interval
… Web MIDI has a send() method, where you can indicate a time interval. MIDI works well on the receiver side
… Concern is the time lag on the input side
… In this model, once MIDI data is transferred to the browser, it goes to the event buffer, then an event is emitted
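
A minimal sketch of the receiver-side MIDI scheduling described above, using the standard MIDIOutput.send() timestamp argument; the 200 ms playout delay and the senderToLocal() clock mapping are assumptions, not part of the presentation:

  // senderToLocal() maps the sender's capture clock to local performance.now()
  // time; it is assumed to exist (e.g. derived from an offset estimate).
  declare function senderToLocal(captureTime: number): number;

  const PLAYOUT_DELAY_MS = 200; // illustrative fixed playout delay

  const access = await navigator.requestMIDIAccess();
  const output = [...access.outputs.values()][0];

  // For each MIDI event received over the data track, schedule it instead of
  // playing it immediately: MIDIOutput.send() accepts a DOMHighResTimeStamp.
  function onMidiEvent(bytes: Uint8Array, captureTime: number): void {
    const playAt = senderToLocal(captureTime) + PLAYOUT_DELAY_MS;
    output.send(bytes, playAt);
  }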

Concern about time lag of MIDIInput: is there a time lag between the device driver and the event being emitted? Do melody and rhythm change because of that?

Komatsu: With JavaScript we can get the capture time at the time of event emission, not the time of input
… Example of 120 bpm music, each note could be 62.5 ms apart
… If the time lag is 3ms, it makes a 6% fluctuation
… Other use cases beyond entertainment. It can apply in other cases: remote gaming, with the time lag of the GamePad API
… Remote robot control over WebUSB. Is there a time interval argument to transferOut data, similar to WebMIDI?
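
As an illustration of the remote-gaming case (not from the presentation), a minimal sketch of polling the Gamepad API: the existing Gamepad.timestamp attribute, which records when the state was last updated, could serve as the capture time for the transferred input. publishInput is an assumed stand-in for a data-track publisher:

  declare function publishInput(state: { buttons: number[]; axes: number[] }, captureTime: number): void;

  let lastSent = 0;

  function pollGamepad(): void {
    const pad = navigator.getGamepads()[0];
    if (pad && pad.timestamp !== lastSent) {
      lastSent = pad.timestamp; // DOMHighResTimeStamp of the last state change
      publishInput(
        { buttons: pad.buttons.map((b) => b.value), axes: [...pad.axes] },
        pad.timestamp
      );
    }
    requestAnimationFrame(pollGamepad);
  }
  requestAnimationFrame(pollGamepad);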

There are other cases with the same kind of problems such as remote gaming, remote robot control and remote drawing

Komatsu: Similar problem in Smart City

Kaz: This would be useful for Smart City, as there are many components, devices, and sensors to be connected with each other depending on user needs. So this can be an interesting mechanism
… Given the potential time lag between client and server, would it help to have a real-time OS on each side to manage the time synchronisation?

question from kaz: would it be ideal for both publisher/subscriber to have some kinds of realtime operating systems?

Komatsu: A real-time OS could be considered
… There's jitter in the network itself, so whether it would work is a question
… My idea is that event objects have a capture timestamp property
… That would be enough for the internet cases I've seen
… What accuracy is required depends on the use case

answer from komatsu: putting a capture timestamp in event objects might be enough for most use cases

Komatsu: Don't want to talk about details of API changes, but instead talk about whether this is a question or not
… Is timeline alignment really a problem? What use cases should be considered?
… Worth discussing?
… Any other related topics to cover?

Song: Excellent presentation. I raised a similar topic in the Web & Networks IG about cloud gaming
… With cloud gaming, China Mobile launched a service last month; it has millions of subscribers
… We transmit the data with WebRTC, the time duration is 15 ms in the network across China
… Rendering is 15-20ms, still acceptable
… The biggest part for end-to-end commercial use is the translation of mouse and keyboard events for games, which can cost 90 ms
… For every business case, e.g., digital twins, could be very different
… With the data example I mentioned, we get complaints from game and web developers

song: in the new cloud gaming service from China Mobile, data are sent via WebRTC taking about 15 ms, rendering takes 15-20 ms, and user events in games take about 90 ms

Song: The infrastructure is based on IP network, which is best effort
… The request we get from game companies is a deterministic network
… The headache for us is breaking the network conditions for millions of users
… In the Web & Networks IG, we have 2 solutions. One is MoQ. That's in 3GPP release 20, called 6G
… That can change the synchronisation systematically, the router, switch, radio access, coordinate the MIDI with the time clock in the device. Long term solution
… The second is to use cloud-edge-client coordination. If we can't change the best-effort IP network, this is why WNIG incubates the cloud edge
… What do you think?

Komatsu: Delay would fluctuate, does that cause confusion for gaming services?

Song: Can follow up with you

Bernard: There are several classes of use case: syncing media+events from a single participant. Will discuss in Media WG tomorrow
… Trickier is syncing from multiple clients. We found we need additional extensions, both to the network and the web
… The WebRTC WG is working on absolute capture timestamps, synced to the server's wall-clock

Bernard: synchronizing between multiple participants would be way more difficult
… We're investigating the timing information necessary, then in the media, everything to be synchronised will need the capture timestamp

Komatsu: I agree. NTP. Depends on the use case. The current WebRTC activity should be considered

Bernard: We want to make it general, not only WebRTC but MOQ and across the board

Harald: A lot of these problems are familiar
… To sync, you need to know the delay between the real world happening
… Differs by device, which is awkward
… Jitter in the network, difference in delays. That has to be handled with jitter buffers, which introduces delay
… A too short jitter buffer means losing stuff
… That's the trade-off
… When we did WebRTC for Stadia, we had a concept called ??
… You'd sync events that happened before, so the effects are visible. You wish for time travel!
… Timestamping everything at the outset is a good starting point

Komatsu: With MoQ we can manipulate the jitter buffer in the browser
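
A minimal sketch of the jitter buffer trade-off discussed above: items are released once their capture time plus a configurable target delay has passed, so a larger delay absorbs more network jitter at the cost of latency, and items arriving after their deadline are dropped. Names and numbers are illustrative, not from the session:

  // Capture times are assumed to already be mapped into the local clock domain.
  interface Item<T> { captureTime: number; payload: T; }

  class JitterBuffer<T> {
    private queue: Item<T>[] = [];
    constructor(private targetDelayMs: number) {}

    push(item: Item<T>, now: number = performance.now()): void {
      if (now > item.captureTime + this.targetDelayMs) return; // too late: drop
      this.queue.push(item);
      this.queue.sort((a, b) => a.captureTime - b.captureTime);
    }

    // Return everything whose playout deadline has been reached, in order.
    pop(now: number = performance.now()): T[] {
      const due: T[] = [];
      while (this.queue.length > 0 && this.queue[0].captureTime + this.targetDelayMs <= now) {
        due.push(this.queue.shift()!.payload);
      }
      return due;
    }
  }

  // e.g. const midiBuffer = new JitterBuffer<Uint8Array>(200);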

jya: We tried doing media sync with Media Session. Not to this level of synchronicity

Paul: Look at the Proceedings of the Web Audio conference over the years
… We're able to do sub-millisecond sync of real time audio
… In general, the question is tradeoff between latency and resilience
… Need to consider clock domain crossing. Clocks on the sender side are different from those on the receiver side. Need a source of truth, and re-clock and resample so there are no gaps
… This means the relationship between the audio and MIDI events is preserved; then you offset that by the latencies (audio output, video, etc.) and re-clock everything
… Important to preserve the relationship between the two
… Typically between two sound cards there can be a 0.08% difference. If you watch a movie for 1 hour, it's skewed and broken; this needs to be taken care of
… An installation at WAC showed real-time audio streams playing nicely across different audio devices. There is hope, but it's a clock thing. Delay vs resilience is the question

Jer: To add to jya's point, we were going for about a second

Michael: I'm in Audio WG, co-editor of Web MIDI. We have an issue about syncing MIDI with Web Audio on the same host
… IMO jitter in MIDI is more important than latency. Now is a good time to add things to the spec, if those are easy

Kaz: Given those potential and promising technologies, I wonder what kind of mechanism would be preferred to handle sync of multiple threads? It's interesting to think about the control mechanism

Paul: Web Audio gives you the system and audio clocks, so you can calculate the slope. rAF(), two timestamps, understand the slope and drift. With this, it's possible
… A real-time OS might be overkill. We get good results with commercial OSes
… If we're looking at variation of about 1 ms, a regular computer with proper scheduling classes will go far
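
A minimal sketch of the slope estimation Paul describes, using AudioContext.getOutputTimestamp() to sample the audio clock and the system clock at the same instant; the sampling interval is arbitrary:

  const ctx = new AudioContext();

  function sampleClocks() {
    const { contextTime, performanceTime } = ctx.getOutputTimestamp();
    return { audio: contextTime ?? 0, system: (performanceTime ?? 0) / 1000 }; // both in seconds
  }

  const first = sampleClocks();

  setInterval(() => {
    const now = sampleClocks();
    const slope = (now.audio - first.audio) / (now.system - first.system);
    // slope != 1 means the audio clock runs faster or slower than the system
    // clock; a re-clocking/resampling step would use this to avoid gaps or overlaps.
    console.log("audio/system clock slope:", slope.toFixed(6));
  }, 5000);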

Komatsu: To wrap up, I want to talk about next steps
… Community group, or existing CG or IG?

Chris: You're welcome to bring this to MEIG if you want to discuss more about use cases and requirements

Harald: Attend the Media WG / WebRTC meeting where we'll discuss sync

Komatsu: Thank you all!

[adjourned]

Minutes manually created (not a transcript), formatted by scribe.perl version 229 (Thu Jul 25 08:38:54 2024 UTC).

Diagnostics

Maybe present: Bernard, Chris, Harald, Jer, jya, Kaz, Komatsu, Michael, Paul, Song

All speakers: Bernard, Chris, Harald, Jer, jya, Kaz, Komatsu, Michael, Paul, Song

Active on IRC: baboba0, cpn, hta, kaz, kota, ktoumura, mjwilson, nigel, ohmata, song, Sun, tidoust, tpac-breakout-bot