Audio Session

W3C First Public Working Draft, 7 November 2024

This version:
https://www.w3.org/TR/2024/WD-audio-session-20241107/
Latest published version:
https://www.w3.org/TR/audio-session/
Editor's Draft:
https://w3c.github.io/audio-session/
History:
https://www.w3.org/standards/history/audio-session/
Feedback:
GitHub
Editors:
(Apple)
(Mozilla)

Abstract

This specification defines an API for controlling how a web page's audio is rendered and how it interacts with other audio-playing applications.

Status of this document

This section describes the status of this document at the time of its publication. A list of current W3C publications and the latest revision of this technical report can be found in the W3C technical reports index at https://www.w3.org/TR/.

Feedback and comments on this specification are welcome. GitHub Issues are preferred for discussion on this specification. Alternatively, you can send comments to the Media Working Group’s mailing-list, public-media-wg@w3.org (archives). This draft highlights some of the pending issues that are still to be discussed in the working group. No decision has been taken on the outcome of these issues, including whether they are valid.

This document was published by the Media Working Group as a First Public Working Draft using the Recommendation track. This document is intended to become a W3C Recommendation.

This document is a First Public Working Draft.

Publication as a First Public Working Draft does not imply endorsement by W3C and its Members.

This is a draft document and may be updated, replaced or obsoleted by other documents at any time. It is inappropriate to cite this document as other than work in progress.

This document was produced by a group operating under the W3C Patent Policy. W3C maintains a public list of any patent disclosures made in connection with the deliverables of the group; that page also includes instructions for disclosing a patent. An individual who has actual knowledge of a patent which the individual believes contains Essential Claim(s) must disclose the information in accordance with section 6 of the W3C Patent Policy.

This document is governed by the 03 November 2023 W3C Process Document.

1. Introduction

People increasingly consume media (audio/video) through the Web, which has become a primary channel for accessing this type of content. However, media on the Web often lacks seamless integration with underlying platforms. The Audio Session API addresses this gap by enhancing media handling across platforms that support audio session management or similar audio focus features. This API improves how web-based audio interacts with other apps, allowing for better audio mixing or exclusive playback, depending on the context, to provide a more consistent and integrated media experience across devices.

Additionally, some platforms automatically manage a site’s audio session based on media playback and the APIs used to play audio. However, this behavior might not always align with user expectations. This API allows developers to override the default behavior and gain more control over an audio session.

2. Concepts

A web page can do audio processing in various ways, combining different APIs like HTMLMediaElement or AudioContext. This audio processing has a start and a stop, which aggregates all the different audio APIs being used. An audio session represents this aggregated audio processing. It allows web pages to express the general nature of the audio processing done by the web page.
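
For instance, a page might play an HTMLMediaElement while also generating sound with an AudioContext; conceptually, both feed into the page's single audio session. The following non-normative sketch illustrates this (the asset name is hypothetical):

// Two independent audio APIs, one underlying audio session.
const music = new Audio("music.mp3"); // hypothetical asset
music.play();

const audioContext = new AudioContext();
const oscillator = audioContext.createOscillator();
oscillator.connect(audioContext.destination);
oscillator.start();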

An audio session can be of a particular type, and be in a particular state. An audio session manages the audio for a set of individual sources (microphone recording) and sinks (audio rendering), named audio session elements.

An audio session's element has a number of properties:

An audio session element is an audible element if its audible flag is true.

Additionally, an audio session element has associated steps for dealing with various state changes. By default, each of these steps is an empty list of steps:

A top-level browsing context has a selected audio session. Whenever any audio session changes, the user agent updates which audio session is the selected audio session. A top-level browsing context is said to have audio focus if its selected audio session is not null and its state is active.

User agents can decide whether to allow several top-level browsing contexts to have audio focus simultaneously, or to enforce that only a single top-level browsing context has audio focus at any given time.

3. The AudioSession interface

AudioSession is the main interface for this API. It is accessed through the Navigator interface (see § 4 Extensions to the Navigator interface).

[Exposed=Window]
interface AudioSession : EventTarget {
  attribute AudioSessionType type;

  readonly attribute AudioSessionState state;
  attribute EventHandler onstatechange;
};

To create an AudioSession object in realm, run the following steps:

  1. Let audioSession be a new AudioSession object in realm, initialized with the following internal slots:

    1. [[type]] to store the audio session type, initialized to auto.

    2. [[state]] to store the audio session state, initialized to inactive.

    3. [[elements]] to store the audio session elements, initialized to an empty list.

    4. [[interruptedElements]] to store the audio session elements that were interrupted while being audible, initialized to an empty list.

    5. [[appliedType]] to store the type applied to the audio session, initialized to auto.

    6. [[isTypeBeingApplied]] flag to store whether the type is being applied to the audio session, initialized to false.

  2. Return audioSession.

Each AudioSession object is uniquely tied to its underlying audio session.

The AudioSession state attribute reflects its audio session state. On getting, it MUST return the AudioSession [[state]] value.

The AudioSession type attribute reflects the type requested for its audio session; when it is auto, the type actually applied to the audio session is chosen by the user agent (see [[appliedType]]).

On getting, it MUST return the AudioSession [[type]] value.

On setting, it MUST run the following steps with newValue being the new value being set on audioSession:

  1. If audioSession.[[type]] is equal to newValue, abort these steps.

  2. Set audioSession.[[type]] to newValue.

  3. Update the type of audioSession.
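
As a non-normative illustration of these steps, the getter reflects the stored [[type]] value immediately, even if the user agent is still applying the new type to the underlying audio session:

navigator.audioSession.type = "playback";
console.log(navigator.audioSession.type); // "playback"
// Setting the same value again aborts the update steps (step 1 above).
navigator.audioSession.type = "playback";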

3.1. Audio session types

By convention, there are several different audio session types for different purposes. In the API, these are represented by the AudioSessionType enum:

playback
Playback audio, which is used for video or music playback, podcasts, etc. It should not mix with other playback audio, and it might pause all other audio indefinitely.
transient
Transient audio, such as a notification ping. It should usually play on top of playback audio (and might also "duck" persistent audio).
transient-solo
Transient solo audio, such as driving directions. It should pause/mute all other audio and play exclusively. When transient-solo audio ends, the paused/muted audio should resume.
ambient
Ambient audio, which is mixable with other types of audio. This is useful in special cases, such as when the user wants to mix audio from multiple pages.
play-and-record
Play and record audio, which is used for recording audio. This is useful when the microphone is being used, or in video conferencing applications.
auto
Auto lets the user agent choose the best audio session type according to how the web page uses audio. This is the default type of an AudioSession.
enum AudioSessionType {
  "auto",
  "playback",
  "transient",
  "transient-solo",
  "ambient",
  "play-and-record"
};

An AudioSessionType is an exclusive type if it is playback, play-and-record or transient-solo.
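
As a non-normative sketch, a page playing a short notification sound can opt into the non-exclusive transient type so that its audio mixes with, rather than pauses, other playback (the asset name is hypothetical):

// A short ping that should play on top of (or duck) ongoing playback
// instead of claiming exclusive audio focus.
navigator.audioSession.type = "transient";
const ping = new Audio("ping.mp3"); // hypothetical asset
ping.play();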

3.2. Audio session states

An audio session can be in one of the following states, which are represented in the API by the AudioSessionState enum:

active
The audio session is playing sound or recording the microphone.
interrupted
The audio session is neither playing sound nor recording the microphone, but can resume once the interruption ends.
inactive
The audio session is neither playing sound nor recording the microphone.
enum AudioSessionState {
  "inactive",
  "active",
  "interrupted"
};

The audio session's state may change, which will automatically update the state of its AudioSession object.
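
A page can observe these updates through the statechange event; a minimal, non-normative sketch:

navigator.audioSession.onstatechange = () => {
  // state is one of "active", "interrupted" or "inactive".
  console.log(`Audio session state: ${navigator.audioSession.state}`);
};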

4. Extensions to the Navigator interface

Each Window has an associated AudioSession, which is an AudioSession object. It represents the default audio session that is used by the user agent to automatically set up the audio session parameters. The user agent will request or abandon audio focus when audio session elements start or finish playing. Upon creation of the Window object, its associated AudioSession MUST be set to a newly created AudioSession object with the Window object’s relevant realm.

The associated AudioSession's list of elements is updated dynamically as audio sources and sinks of the Window object are created or removed.

[Exposed=Window]
partial interface Navigator {
  // The default audio session that the user agent will use when media elements start/stop playing.
  readonly attribute AudioSession audioSession;
};
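
Because the attribute may not be implemented in all user agents yet, a page can feature-detect it before use; a minimal, non-normative sketch:

if ("audioSession" in navigator) {
  navigator.audioSession.type = "auto";
} else {
  // Fall back to the platform's default audio behavior.
}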

5. Privacy considerations

6. Security considerations

7. Examples

7.1. A site sets its audio session type proactively to "play-and-record"

navigator.audioSession.type = 'play-and-record';
// From now on, volume might be set based on 'play-and-record'.
...
// Start playing remote media
remoteVideo.srcObject = remoteMediaStream;
remoteVideo.play();
// Start capturing
navigator.mediaDevices
  .getUserMedia({ audio: true, video: true })
  .then((stream) => {
    localVideo.srcObject = stream;
  });

7.2. A site reacts upon interruption

navigator.audioSession.type = "play-and-record";
// From now on, volume might be set based on 'play-and-record'.
...
// Start playing remote media
remoteVideo.srcObject = remoteMediaStream;
remoteVideo.play();
// Start capturing
navigator.mediaDevices
  .getUserMedia({ audio: true, video: true })
  .then((stream) => {
    localVideo.srcObject = stream;
  });

navigator.audioSession.onstatechange = async () => {
  if (navigator.audioSession.state === "interrupted") {
    localVideo.pause();
    remoteVideo.pause();
    // Make it clear to the user that the call is interrupted.
    showInterruptedBanner();
    for (const track of localVideo.srcObject.getTracks()) {
      track.enabled = false;
    }
  } else {
    // Let user decide when to restart the call.
    const shouldRestart = await showOptionalRestartBanner();
    if (!shouldRestart) {
      return;
    }
    for (const track of localVideo.srcObject.getTracks()) {
      track.enabled = true;
    }
    localVideo.play();
    remoteVideo.play();
  }
};

8. Acknowledgements

The Working Group acknowledges the following people for their invaluable contributions to this specification:

Conformance

Document conventions

Conformance requirements are expressed with a combination of descriptive assertions and RFC 2119 terminology. The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in the normative parts of this document are to be interpreted as described in RFC 2119. However, for readability, these words do not appear in all uppercase letters in this specification.

All of the text of this specification is normative except sections explicitly marked as non-normative, examples, and notes. [RFC2119]

Examples in this specification are introduced with the words “for example” or are set apart from the normative text with class="example", like this:

This is an example of an informative example.

Informative notes begin with the word “Note” and are set apart from the normative text with class="note", like this:

Note, this is an informative note.

Conformant Algorithms

Requirements phrased in the imperative as part of algorithms (such as "strip any leading space characters" or "return false and abort these steps") are to be interpreted with the meaning of the key word ("must", "should", "may", etc) used in introducing the algorithm.

Conformance requirements phrased as algorithms or specific steps can be implemented in any manner, so long as the end result is equivalent. In particular, the algorithms defined in this specification are intended to be easy to understand and are not intended to be performant. Implementers are encouraged to optimize.


References

Normative References

[DOM]
Anne van Kesteren. DOM Standard. Living Standard. URL: https://dom.spec.whatwg.org/
[HTML]
Anne van Kesteren; et al. HTML Standard. Living Standard. URL: https://html.spec.whatwg.org/multipage/
[RFC2119]
S. Bradner. Key words for use in RFCs to Indicate Requirement Levels. March 1997. Best Current Practice. URL: https://datatracker.ietf.org/doc/html/rfc2119
[WEBAUDIO]
Paul Adenot; Hongchan Choi. Web Audio API. 17 June 2021. REC. URL: https://www.w3.org/TR/webaudio/
[WEBIDL]
Edgar Chen; Timothy Gu. Web IDL Standard. Living Standard. URL: https://webidl.spec.whatwg.org/

IDL Index

[Exposed=Window]
interface AudioSession : EventTarget {
  attribute AudioSessionType type;

  readonly attribute AudioSessionState state;
  attribute EventHandler onstatechange;
};

enum AudioSessionType {
  "auto",
  "playback",
  "transient",
  "transient-solo",
  "ambient",
  "play-and-record"
};

enum AudioSessionState {
  "inactive",
  "active",
  "interrupted"
};

[Exposed=Window]
partial interface Navigator {
  // The default audio session that the user agent will use when media elements start/stop playing.
  readonly attribute AudioSession audioSession;
};