Articulatory Synthesis Markup Language

Introduction

Articulatory Synthesis Markup Language (ASML) is intended to support the model-independent simulation and animation of the muscles and organs of speech articulation. It is also intended to support: the blending in of audio samples, which is useful for difficult-to-simulate consonants; the definition, via custom XML elements, of reusable animation sequences and the stylization and parameterization of such sequences; useful nesting scenarios for custom XML elements; and audio post-processing effects, including the creation of audio post-processing graphs.

Envisioned use-case scenarios for ASML include the cloud-based, server-side, and client-side rendering of high-quality speech audio. An example cloud-based scenario is a multimodal dialogue system or digital character interacted with via WebRTC. In this scenario, ASML processors would run in the cloud, processing ASML into audio streams and supporting both stateless and stateful (session-based) services. In addition to streaming audio and video to end users from cloud-based or server-side rendering, computer-animation and speech-articulation tracks could be streamed to end users for client-side rendering.

Language Features

Which features should ASML have?

  • It should be scriptable and include a JavaScript API.
  • It should be configurable.
  • It should be fine-tuneable, providing users with the means to obtain desired audio outputs with precision.
  • It should be model-independent, working with arbitrary articulatory synthesis models.
  • It should be extensible, allowing custom XML elements to be defined and utilized.
  • It should support multiple levels of abstraction.
  • It should support animation blending, transitions and parametric animation.
  • It should support the use of audio samples, allowing one to blend in pre-recorded audio samples. This could be of use for difficult-to-simulate consonants.
  • It should support audio post-processing graphs for audio post-processing and audio effects.
  • It should support configuring and animating the parameters of components of audio post-processing graphs.
  • It should support both stateless and stateful (session-based) scenarios.

JavaScript API Features

With the JavaScript API (a hypothetical usage sketch follows this list), it should be possible to:

  • iterate, inspect and select available articulatory synthesis models.
  • iterate, inspect and select available preconfigured speakers or voices.
  • create and configure custom speakers or voices.
  • iterate and inspect available audio post-processing node types.
  • create audio post-processing nodes and interconnect them into graphs.
  • configure and animate the parameters of audio post-processing nodes and connections.
  • define custom XML elements for utilization.
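
The following is a minimal, hypothetical sketch of what such a JavaScript API might look like. No API has yet been specified; every identifier below (asml, models, voices, processing, createNode, animate, and so on) is an assumption for illustration only.

// Hypothetical API sketch; every identifier here is an assumption, not a specified API.
// Iterate, inspect, and select an available articulatory synthesis model.
const model = asml.models.find((m) => m.name === "urn:artisynth:models:praat:2020");
asml.selectModel(model);

// Select a preconfigured voice, or create and configure a custom one.
const voice = model.voices.find((v) => v.language === "en-US")
           ?? model.createVoice({ pitch: 120 });
asml.selectVoice(voice);

// Create audio post-processing nodes and interconnect them into a graph.
const reverb = asml.processing.createNode("reverb");
const gain = asml.processing.createNode("gain");
reverb.connect(gain);
gain.connect(asml.processing.output);

// Configure and animate a parameter of a post-processing node.
gain.parameter("level").animate([
  { time: 0.0, value: 1.0 },
  { time: 2.5, value: 0.5 }
]);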

Custom XML Elements

Custom XML elements can be defined for use in ASML via techniques such as templates and custom element registries. Time-based JavaScript callbacks, e.g. onplayenter and onplayexit, could be added to the JavaScript classes which back custom XML elements.

See also: https://developer.mozilla.org/en-US/docs/Web/Web_Components

customElements.defineNS(namespace, name, constructor, options);
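
A hypothetical sketch of defining such a custom element follows. The defineNS function above is a proposed, namespace-aware variant of the standard customElements.define; the ASMLElement base class and the shape of the event object are assumptions for illustration.

// Hypothetical sketch; ASMLElement and customElements.defineNS are proposed,
// not existing platform APIs.
class HelloElement extends ASMLElement {
  onplayenter(event) {
    // Invoked when playback enters this element's animation sequence.
    console.log("hello: playback entered at", event.time);
  }
  onplayexit(event) {
    // Invoked when playback exits this element's animation sequence.
    console.log("hello: playback exited at", event.time);
  }
}

customElements.defineNS("http://www.example.com/ns/custom/", "hello", HelloElement);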

Rough-draft Sketches

The following rough-draft sketch is a markup skeleton that showcases ASML features.

In its <head> section, one can observe expressiveness for JavaScript support, model independence, speaker and voice configurability, and audio post-processing effects.

In its <body> section, one can observe expressiveness for keyframe-based animation and the blending in of audio samples.

<asml xmlns="https://www.w3.org/community/synthetic-media/asml/" version="0.1">
  <head>
    <script type="text/javascript" src="file.js" />
    <model name="urn:artisynth:models:praat:2020">
      <!-- parameters for the selected model -->
    </model>
    <voice>
      <!-- parameters for a speaker in the selected model -->
    </voice>
    <processing>
      <!-- post-processing, audio effects graphs, digital signal processing graphs -->
    </processing>
  </head>
  <body>
    <seq> <!-- "par" and "seq" are timing containers resembling those of SMIL 3.0 -->
      <key ... > <!-- a keyframe -->
        <p name="canuse.dot.notation" value="..." tangent="..." weight="..." ... />
        <p ... />
      </key>
      <audio src="urn:ipa:..." ... /> <!-- can blend in audio samples, of use for difficult-to-simulate consonants -->
      <key ... >
        <p ... />
        <p ... />
      </key>
    </seq>
  </body>
</asml>

The following example illustrates that custom XML elements can define reusable, parameterizable sequences of articulatory animation.

In the following example, the <custom:hello> and <custom:world> elements map to articulatory animation sequences that produce audio for the lexemes "hello" and "world", as stylized or parameterized by a number of elided attributes.

<asml xmlns="https://www.w3.org/community/synthetic-media/asml/" version="0.1"
  xmlns:custom="http://www.example.com/ns/custom/">
  <head>
    <script type="text/javascript" src="custom.js" />
  </head>
  <body>
    <seq>
      <custom:hello ... />
      <custom:world ... />
    </seq>
  </body>
</asml>

The following example illustrates that custom XML elements can be meaningfully nested.

In the following example, the <custom:emph> element adds emphatic prosody to the contained <custom:hello> element, which maps to an articulatory animation sequence that produces audio for the lexeme "hello".

<asml xmlns="https://www.w3.org/community/synthetic-media/asml/" version="0.1"
  xmlns:custom="http://www.example.com/ns/custom/">
  <head>
    <script type="text/javascript" src="custom.js" />
  </head>
  <body>
    <seq>
      <custom:emph ... >
        <custom:hello ... />
      </custom:emph>
      <custom:world ... />
    </seq>
  </body>
</asml>

The following example illustrates the use of custom XML elements from multiple namespaces:

<asml xmlns="https://www.w3.org/community/synthetic-media/asml/" version="0.1"
  xmlns:custom="http://www.example.com/ns/custom/"
  xmlns:alphabet="http://www.example.com/ns/alphabet/"
  xmlns:greek="http://www.example.com/ns/greek-alphabet/"
  xmlns:numbers="http://www.example.com/ns/numbers/">
  <head>
    <script type="text/javascript" src="custom.js" />
    <script type="text/javascript" src="alphabet.js" />
    <script type="text/javascript" src="greek.alphabet.js" />
    <script type="text/javascript" src="numbers.js" />
  </head>
  <body>
    <seq>
      <alphabet:e ... />
      <custom:to ... />
      <custom:the ... />
      <alphabet:i ... />
      <greek:pi ... />
      <custom:plus ... />
      <numbers:one ... />
      <custom:equals ... />
      <numbers:zero ... />
    </seq>
  </body>
</asml>

With JavaScript, one could define custom XML elements which utilize pronunciations provided in attribute values to animate articulatory synthesis:

<asml xmlns="https://www.w3.org/community/synthetic-media/asml/" version="0.1"
  xmlns:p="http://www.example.com/ns/pronounce/">
  <head>
    <script type="text/javascript" src="pronounce.js" />
  </head>
  <body>
    <seq>
      <p:lexeme p:pronounce="pri-ˈsi-zhən" ... />
    </seq>
  </body>
</asml>

Such custom markup could also resemble:

<asml xmlns="https://www.w3.org/community/synthetic-media/asml/" version="0.1"
  xmlns:p="http://www.example.com/ns/pronounce/">
  <head>
    <script type="text/javascript" src="pronounce.js" />
  </head>
  <body>
    <seq>
      <p:lexeme ... >
        <p:syllable p:pronounce="pri" ... />
        <p:syllable p:pronounce="si" ... />
        <p:syllable p:pronounce="zhən" ... />
      </p:lexeme>
    </seq>
  </body>
</asml>
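
A hypothetical sketch of what pronounce.js might contain follows. The ASMLElement base class, the event.timeline API, and the keyframesFor lookup are assumptions for illustration; only the p:pronounce attribute appears in the markup above.

// Hypothetical sketch of pronounce.js; identifiers are assumptions for illustration.
const PRONOUNCE_NS = "http://www.example.com/ns/pronounce/";

class LexemeElement extends ASMLElement {
  onplayenter(event) {
    // Read the pronunciation from the p:pronounce attribute...
    const pronunciation = this.getAttributeNS(PRONOUNCE_NS, "pronounce");
    // ...and map each phonetic segment onto articulatory keyframes.
    for (const segment of pronunciation.split("-")) {
      event.timeline.append(asml.model.keyframesFor(segment));
    }
  }
}

customElements.defineNS(PRONOUNCE_NS, "lexeme", LexemeElement);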

Parse trees could be of use for structuring content so that it can be adorned with various parameters for articulatory animation:

<asml xmlns="https://www.w3.org/community/synthetic-media/asml/" version="0.1"
  xmlns:parse="http://www.example.com/ns/parse/">
  <head>
    <script type="text/javascript" src="parse.js" />
  </head>
  <body>
    <seq>
      <parse:sentence ... >
        <parse:noun-phrase ... >
          ...
        </parse:noun-phrase>
        <parse:verb-phrase ... >
          ...
        </parse:verb-phrase>
      </parse:sentence>
    </seq>
  </body>
</asml>

Discussion Topics

Animation Blending

Two types of animation blending to consider are inter-animation blending (blend transitions) and intra-animation blending (parametric animation).

Inter-animation blend transitions are important for smoothly transitioning between animations.

Intra-animation blending combines multiple articulatory animations into a resultant articulatory animation. A modifier on the <par> element, e.g. <par blend="...">, or a dedicated ASML element for blending scenarios, e.g. <blend>, could be useful for blending together child animations. More granular forms of intra-animation blending could also be of use, e.g. placing scalar blendweight attributes on the keyframes of the animations which are to be dynamically blended or combined together.
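
A hypothetical markup sketch of these options follows; the blend and blendweight names are the provisional names suggested above, and the attribute values are illustrative only.

<par blend="..."> <!-- blend together the child animations -->
  <seq>
    <key blendweight="0.75" ... > <!-- per-keyframe blend weights for granular control -->
      <p ... />
    </key>
  </seq>
  <seq>
    <key blendweight="0.25" ... >
      <p ... />
    </key>
  </seq>
</par>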

3D Audio Effects

The dynamic spatial positioning of speech audio is important for a number of scenarios, including digital entertainment, AR, and VR. A topic for discussion is whether ASML should support specifying and animating the spatial positions of speech audio outputs (e.g. for 3D audio, binaural audio, and surround sound) and/or whether client-side audio libraries (such as DirectSound and OpenAL) should be used to spatially position and animate speech audio streams.
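
Should ASML support this directly, a purely hypothetical sketch might resemble the following, with a spatialization node in the post-processing graph whose position parameter is animated from a keyframe; every element and attribute name below is illustrative only.

<asml xmlns="https://www.w3.org/community/synthetic-media/asml/" version="0.1">
  <head>
    <processing>
      <node type="spatializer" name="spatial" /> <!-- hypothetical 3D audio node -->
    </processing>
  </head>
  <body>
    <seq>
      <key ... >
        <p name="spatial.position" value="1.0 0.0 -2.0" ... /> <!-- animate the spatial position -->
      </key>
    </seq>
  </body>
</asml>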