Whatever can be done, will be done
Presenter: Christoph Guttandin
Duration: 14 minutes
Hi everyone, thanks for having me. It’s a real pleasure to be a part of this workshop.
I titled my talk "Whatever can be done, will be done". And I hope it becomes apparent in the end why I did so.
I’m Christoph Guttandin. I have a company called Media Codings. And I do freelance work for various other companies and I guess the two most interesting in the context of this workshop are Source Elements and InVideo.
I’m usually named chrisguttandin anywhere on the internet. So in case you want to chat with me please feel free to reach out. Just send me a message on the platform of your choice.
When I was asked to prepare a talk for this workshop I thought it’s a great opportunity to present our wishes to the world. So I asked my co-workers to help me prepare a list with the things that we would like to implement at some point or which we implemented already but would really like to implement in a better way if possible.
The first item on our list is using custom codecs with WebRTC.
Doing this has been possible for a while, at least to a certain extent. You can do it by encoding the audio data and video data yourself. Then you would ignore the media functionality of WebRTC and send the data over a DataChannel. But the whole process is very cumbersome and, at least for video, it’s not very accurate either. Every video frame needs to be drawn to a canvas and then grabbed from there to hand it over to WebAssembly. It’s very likely that you miss a few frames when doing this with a live feed.
Luckily this is not necessary anymore since we can now use WebCodecs to do this in a much more efficient way.
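A sketch of what that could look like with WebCodecs (all names here, like sendEncodedVideo, are illustrative, not part of any spec; the codec string and dimensions are placeholder choices):

```javascript
// Sketch: encode video frames with WebCodecs and ship them over a
// WebRTC DataChannel instead of using the built-in media pipeline.
async function sendEncodedVideo(videoTrack, dataChannel) {
  const encoder = new VideoEncoder({
    output: (chunk) => {
      // Copy the encoded chunk into a plain ArrayBuffer for the wire.
      const buffer = new ArrayBuffer(chunk.byteLength);
      chunk.copyTo(buffer);
      dataChannel.send(buffer);
    },
    error: (err) => console.error(err)
  });
  encoder.configure({ codec: 'vp8', width: 1280, height: 720 });

  // MediaStreamTrackProcessor turns the track into a stream of VideoFrames,
  // so no canvas round trip is needed and no frames get silently dropped.
  const processor = new MediaStreamTrackProcessor({ track: videoTrack });
  const reader = processor.readable.getReader();
  for (;;) {
    const { value: frame, done } = await reader.read();
    if (done) break;
    encoder.encode(frame);
    frame.close(); // release the frame as soon as it is queued
  }
  await encoder.flush();
}
```

The receiving side would do the mirror image with a VideoDecoder fed from the DataChannel's message events.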
But WebCodecs is only available in Chromium browsers so far. Firefox is working on it. But unfortunately it’s unclear to me what Apple is thinking about it.
The next item in our list is partial decoding.
And with that we mean the ability to only decode a certain range or maybe only one specific frame of a media asset.
There is a very hacky way to do this for audio which works by using the decodeAudioData() method. This method is available on an AudioContext. Unfortunately it automatically re-samples the audio to the sampleRate of the AudioContext. That means the file needs to be parsed manually to know the correct sampleRate before doing the actual decoding. And decodeAudioData() only works with full files which is another reason why the file needs to be parsed before the decoding. We need to find out where it can be sliced. This is not that easy to figure out but it’s possible for most file types. And when done correctly and you’re lucky decodeAudioData() will happily decode a part of a file because it believes that it’s decoding the full file.
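The hack described above could be sketched like this. It assumes a canonical WAV file for simplicity; real code would need format-specific parsing and slicing logic, and the function names are made up for illustration:

```javascript
// Read the sampleRate out of the file header first, so that decoding can
// happen with a context created at that exact rate and no resampling occurs.
function readWavSampleRate(arrayBuffer) {
  // In a canonical WAV file the sample rate sits at byte offset 24.
  return new DataView(arrayBuffer).getUint32(24, true);
}

async function decodePart(arrayBuffer, begin, end) {
  const sampleRate = readWavSampleRate(arrayBuffer);
  // An OfflineAudioContext with a matching sampleRate avoids the
  // automatic resampling mentioned above.
  const context = new OfflineAudioContext(1, 1, sampleRate);
  // The slice must keep a valid header and cut at valid block boundaries,
  // otherwise decodeAudioData() will refuse to decode it.
  const slice = arrayBuffer.slice(begin, end);
  return context.decodeAudioData(slice);
}
```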
Sadly though decodeAudioData() is really broken in the newest version of Safari. Apparently the bug is already fixed in the codebase. But no one knows when that fix will be available for Safari users.
To decode a single video frame one could load the video with a media element and then use seekToNextFrame() to get the frames one by one.
But sadly this only works in Firefox.
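The seekToNextFrame() approach could look roughly like this (a sketch only, since the method is non-standard and Firefox-only; grabFrames and onFrame are illustrative names):

```javascript
// Step through a video element frame by frame using the non-standard
// Firefox-only seekToNextFrame() method and hand each frame's pixels
// to a callback.
async function grabFrames(video, onFrame) {
  const canvas = document.createElement('canvas');
  canvas.width = video.videoWidth;
  canvas.height = video.videoHeight;
  const context = canvas.getContext('2d');
  while (!video.ended) {
    // seekToNextFrame() resolves once the next frame is displayed.
    await video.seekToNextFrame();
    context.drawImage(video, 0, 0);
    onFrame(context.getImageData(0, 0, canvas.width, canvas.height));
  }
}
```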
But again all of this is not necessary anymore now that we have WebCodecs support in Chromium and soon in Firefox.
Another thing which is crucial for us is to offload as much as we can to other threads. In the end the main thread should just be there for triggering the work but not for doing it.
There are a number of APIs which follow this pattern already. One of them is the AudioWorklet and the Web Audio API in general.
For video content there is the OffscreenCanvas which can be used from within a Web Worker.
And last but not least it’s possible to insert a TransformStream into a MediaStream and transfer that to a Web Worker as well.
But as you can see, browser support for anything but the Web Audio API is not that great. Chromium supports all of those APIs. Firefox has an OffscreenCanvas implementation behind a flag but hasn’t said anything about transferable streams so far. And as usual I don’t even dare to guess when this will be available in WebKit or Safari.
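The OffscreenCanvas part of this could be sketched as follows (the function and file names are made up for illustration):

```javascript
// main.js: hand the canvas over to a worker so all drawing happens
// off the main thread. 'render.worker.js' is a hypothetical file name.
function moveRenderingToWorker(canvasElement) {
  const offscreen = canvasElement.transferControlToOffscreen();
  const worker = new Worker('render.worker.js');
  // The OffscreenCanvas is transferred, not copied.
  worker.postMessage({ canvas: offscreen }, [offscreen]);
  return worker;
}

// render.worker.js would then own the canvas from now on:
// self.onmessage = ({ data }) => {
//   const context = data.canvas.getContext('2d');
//   // ...draw without ever touching the main thread...
// };
```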
Another thing which is often tricky for us is to keep media in sync.
Especially if there is some audio or video processing involved which delays one or the other, making sure audio and video appear to be in sync again when played back becomes very tricky.
There are two properties on an AudioContext which allow us to know when a sound scheduled on that AudioContext can actually be heard by the user. And this allows us to make sure the video frame displayed at that time matches the audio.
But sadly these properties only fully work in Firefox so far. I guess I don’t have to mention anymore for which browser I don’t know when they will be available or if they become available at all.
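Presumably the two properties meant here are baseLatency and outputLatency. A minimal sketch of the idea, with an illustrative helper name:

```javascript
// A sound scheduled at contextTime should reach the user's ears at
// roughly contextTime + baseLatency + outputLatency. That estimate tells
// us which video frame has to be on screen at that moment.
function estimatePlayoutTime(contextTime, audioContext) {
  return contextTime + audioContext.baseLatency + audioContext.outputLatency;
}

// Usage sketch:
// const hearableAt = estimatePlayoutTime(startTime, audioContext);
// ...display the video frame whose timestamp matches hearableAt...
```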
Another hot topic for us is the possibility to select a certain output device instead of using the default one.
There is a method one can call to change the output device of a media element but it only works in Chromium browsers so far. It’s called setSinkId().
And as far as I know Firefox is currently implementing the selectAudioOutput() method which is a new way to give consent to access audio output devices. And this is basically what blocks them from enabling setSinkId().
Chromium browsers expose the audio output devices already. Therefore implementing selectAudioOutput() is not really necessary to use setSinkId().
But as usual I don’t have an idea what Apple is up to.
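Putting the two approaches together could look roughly like this (a sketch with an illustrative function name; feature detection stands in for proper capability checks):

```javascript
// Route a media element to a user-selected output device.
async function routeToSelectedOutput(mediaElement) {
  if (navigator.mediaDevices.selectAudioOutput) {
    // The consent-based path: prompts the user and resolves with the
    // chosen output device.
    const device = await navigator.mediaDevices.selectAudioOutput();
    await mediaElement.setSinkId(device.deviceId);
  } else {
    // Chromium path: output devices are already exposed, so pick one
    // from enumerateDevices() (here simply the first output found).
    const devices = await navigator.mediaDevices.enumerateDevices();
    const output = devices.find((d) => d.kind === 'audiooutput');
    if (output) await mediaElement.setSinkId(output.deviceId);
  }
}
```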
So when looking at the wishlist again it looks like it became a todo list. All those things can be done now and existing hacks can be replaced with solid implementations, either today or in the near future.
At least if we ignore Safari for now.
Anyway, I came up with some more wishes which aren’t really spec related anymore and are more geared towards the implementers.
The first item on that list is that I wish for releases to be as boring as possible.
I think Chromium browsers and Firefox do have a pretty good process to ensure that already.
Both browsers publish a nightly build. Right now this is version 97 for Chromium and 95 for Firefox. But the exact numbers don’t really matter here.
Every 4 weeks the state of the nightly version gets promoted to the next stage.
Whatever was the nightly version at that time becomes the beta version, and the nightly version number gets bumped. So it’s clear that whatever is in the nightly versions of those browsers today will be in beta in 4 weeks at the latest.
Another 4 weeks later the beta version becomes the stable version and the nightly version - whatever that will be at that time - becomes the new beta version.
It’s like a steady stream of updates. And by the time a feature reaches the beta channel you can calculate the date at which it becomes available to all regular users.
The whole process is very predictable and yes it's also super boring.
I usually test my code against the nightly versions to check regressions coming down the pipe and to make sure there will be no surprises (at least for me) when a new version of Chromium or Firefox gets published.
Sadly things are a bit different with Safari. There is a Technology Preview of Safari which is currently at version 133. It contains a lot of experimental and unfinished implementations of upcoming features.
But it is unknown which of those features end up in the next stable release of Safari. They treat it as a secret. The Technology Preview and the stable Safari have completely unrelated version numbers. I guess the stable version is a subset of the Technology Preview but I can’t really say that for sure.
There is no way for developers to test their apps against what becomes the next stable version of Safari. Testing is only possible after it has already been released to all the users.
This is of course challenging when you try to build a reliable web app that users can trust to work as expected.
Another problem is that the stable version of Safari only gets updated every 6 months, which means the minimum lifetime of a regression is usually 6 months. And since regressions can’t get caught before they get released, regressions are really not unheard of in Safari.
As I said there is currently one which breaks decodeAudioData() and another one which breaks audio streams in WebRTC.
This leads me to my next wish.
I would love regressions to get fixed as soon as possible.
Imagine building a web app that media professionals rely on to get their work done every single day. And suddenly a browser update causes that app to fail.
I know that even Safari can get security updates in a very timely manner. And I would love that to happen for patches to fix regressions, too.
I know some powerful features have the potential to get abused by malicious pages. And I definitely agree that certain features should not be enabled by default for each and every page.
However I think the users of a browser should have the option to allow certain sites to access the file system, to record the whole screen, to capture the system audio, to receive MIDI messages, to run high priority threads and so on and so on.
I think this doesn’t need to be an explicit permission prompt in every case. It could also be a little toast-style message that pops up to notify the users about the usage of a certain API or something totally different.
The point is, I think the users should be empowered to decide for themselves which features they want to enable and which ones they’d rather not use right now.
Similarly as a developer I would really love to have the same power. As I said before I like to run automated tests against the current and the upcoming version of each browser.
I do this locally and I also do this in the cloud with services like BrowserStack and Sauce Labs.
It’s a real challenge to test media APIs because they usually require user interaction to work. But there is obviously no user when running automated tests.
There are flags that one can set for Chromium browsers and Firefox. But they are not very well documented, they always lag behind the capabilities of the browser, and sadly they have the tendency to break from time to time.
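Flags of this kind, as far as I know, look roughly like this (flag names vary between versions, so treat them as examples rather than a reference):

```shell
# Chromium: disable the autoplay policy and fake out device prompts
# so media APIs work without user interaction in automated tests.
chromium --autoplay-policy=no-user-gesture-required \
         --use-fake-ui-for-media-stream \
         --use-fake-device-for-media-stream

# Firefox reads comparable settings from prefs, e.g. in a user.js:
# user_pref("media.autoplay.default", 0);
# user_pref("media.navigator.permission.disabled", true);
```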
And at least as far as I know it’s not even possible to disable the autoplay policy in Safari when starting the browser programmatically. That means things are more difficult to test in Safari, which in turn means fewer bugs get caught. And this is of course a real problem since, as I said before, a typical bug stays in Safari for at least 6 months. But only if it gets caught in the first place.
Another thing which is of course totally different but I think is a bit problematic as well is that building web apps seems to be so easy at first glance. Last week I saw a full WebCodecs example which fit on a single slide. That was very impressive and it showed that one could build something really powerful in no time.
And yes, that’s absolutely true. I did for example quickly hack together a little app to record this talk in the browser. I didn’t use it in the end, but that’s another story. However, building a real web app which is meant to be used by people in a professional context day in and day out is a completely different matter. Doing that requires a lot of effort.
I think many people underestimate the amount of work that needs to be done to build a full product on the web.
I honestly think building professional apps is challenging in any environment and I wouldn’t expect it to be easy on the web either.
But after all I think it’s mostly a chicken and egg problem. Once a few well known and well established web apps exist that have wide adoption among media professionals, others will become interested in bringing their apps to the web as well.
Source Elements does for example have a plugin for DAWs. It’s currently a native application since the DAWs that it’s meant to be used with are native applications as well. And this could only ever change if the DAWs become web apps at some point.
There is definitely still a long way to go but I think the shift has already begun.
In conclusion I would like to repeat the title of this talk again. A fundamental rule in technology says: whatever can be done, will be done.
I think building professional media applications for the web is something that can be done today. And I know many people working on it. I hope and believe it’s only a matter of time until that becomes the new normal.
Thanks again to the people I work with for helping me to prepare this talk and many thanks to all of you for watching it. See you at the workshop. Bye.