Math Speech Annotations
Background
MathML makes it possible to generate speech corresponding to the MathML. But sometimes users have specific needs and want to specify exactly what should be spoken. This might happen in high-stakes testing, where the speech should not give away the answer. An example would be scientific notation: you don't want the system to be clever and read a number with lots of zeros in it as scientific notation if the question is asking what the scientific notation is.
MathML allows for the addition of alttext to the Math element, but there are several drawbacks to using that:
- The entire expression must have an equivalent, not just the part that needs it
- The alttext is limited to plain text, so speech cues such as pausing and pitch changes can't be given
- There is no connection to the MathML, so sync highlight or navigation is not possible
Use cases
- Testing
- in some test situations, you want to be careful about the speech. E.g., "Is 1,000,000 a) one thousand, b) one million, c) one billion"? [This example probably wouldn't be in MathML, but it shows the idea]
- eBook or other electronic document
- I see two main ways the speech annotation can be handled:
- At document creation time, an automated tool adds the speech annotation by inspection of the MathML.  Some tools may permit the author to interactively guide the speech annotation.  The resulting document would contain both the MathML and the speech annotation.
- At document rendering time, the presentation system can create the speech annotation from the MathML directly. Hence, the document would not need the speech annotation as a static item. The human reader of the document may interactively select various speech or reading options and the presentation system would automatically produce the required speech annotation.
Of course a hybrid approach is also possible, where the speech annotation is a static item in the document but the rendering system can add dynamic annotation. In some cases, the author may need to manually adjust the speech annotation as part of the document finalization.
Requirements
- whatever is done must be valid MathML with respect to the current schema
- prosody information (pauses, pitch/rate changes) should be supported, along with forced pronunciation (at least the long "a" for English)
- support for multiple languages (MathML supports xml:lang)
- support for multiple speech options (e.g., ETS wants to say it one way and the College Board another way)
- allowing for cross-linking of subparts so that navigation picks up the corresponding part of the exact speech (this could be done via node numbering rather than explicit href/ids)
- flag to say whether author generated or machine generated (particularly if it is the whole expression). Actually, there is a chunk of info that is metadata for speech that should be included somewhere.
- Should speech be limited to correspond to MathML elements, or can it begin/end in the middle of them (e.g., "+b" in "a+b")?
In the above, if a language or target (e.g., SAT or "Learning Disability") is supported, that information needs to be in the markup so that the specific case can be pulled out when needed.
Should these be requirements?
Recorded Audio
- Dennis
- maybe we should support recorded audio
- Neil
- I'm not keen on linking to recorded audio because that seems like something that should be at a higher level (one would record the entire expression or maybe even the entire document, not just some subexpression). On the other hand, maybe recorded audio is the mglyph of the audio realm... For those systems (like EPUB) where recorded audio is allowed/important, hopefully the math "just works". A drawback to recording just the math is that the speech won't match that of the rest of the document (rate, voice, ...).
- Dennis
- Yes, using recorded audio can be a bit dicey. But it looks like SSML supports recorded audio through its <audio> element, so I think we can lean on that feature.
Ideas
SSML
Simple SSML Example
One possible way to specify speech is to put SSML in a math annotation. E.g.,
<math xmlns='http://www.w3.org/1998/Math/MathML'>
  <mrow>
    <semantics>
      <mrow> <mi>a</mi> <mo>+</mo> <mi>b</mi> </mrow>
      <annotation-xml name="exactspeech" encoding="application/ssml+xml">
        <speak xmlns="http://www.w3.org/2001/10/synthesis">
          <phoneme alphabet="ipa" ph="eɪ"> a </phoneme>
          added to b
        </speak>
      </annotation-xml>
    </semantics>
    <mo>=</mo>
    <mi>c</mi>
  </mrow>
</math>
SSML with synchronized highlighting
SSML has a <mark> element that I believe we can leverage for synchronized highlighting. Here's an example that also contains a second annotation pointing to recorded audio. SSML itself supports audio through its <audio> element, and I think using that feature is probably better than the separate audio annotation shown in this example.
Using IDs for cross-referencing
<?xml version="1.0" encoding="UTF-8"?>
<m:math id="gh10" xmlns:m="http://www.w3.org/1998/Math/MathML">
<m:semantics>
  <m:mrow id="gh11">
    <m:mi id="gh12">x</m:mi>
    <m:mo id="gh13">=</m:mo>
    <m:mfrac id="gh14">
      <m:mrow id="gh15">
        <m:mo id="gh16" form="prefix">−<!-- − --></m:mo>
        <m:mi id="gh17">b</m:mi>
        <m:mo id="gh18">±<!-- ± --></m:mo>
        <m:msqrt id="gh19">
          <m:msup id="gh20">
            <m:mi id="gh21">b</m:mi>
            <m:mn id="gh22">2</m:mn>
          </m:msup>
          <m:mo id="gh23">−<!-- − --></m:mo>
          <m:mn id="gh24">4</m:mn>
          <m:mo id="gh25">⁢<!-- ⁢ --></m:mo>
          <m:mi id="gh26">a</m:mi>
          <m:mo id="gh27">⁢<!-- ⁢ --></m:mo>
          <m:mi id="gh28">c</m:mi>
        </m:msqrt>
      </m:mrow>
      <m:mrow id="gh29">
        <m:mn id="gh30">2</m:mn>
        <m:mo id="gh31">⁢<!-- ⁢ --></m:mo>
        <m:mi id="gh32">a</m:mi>
      </m:mrow>
    </m:mfrac>
  </m:mrow>
  <m:annotation encoding="text/plain" xml:lang="en-US">
x equals, Start-Fraction, minus, b plus-or-minus Start-Root b Superscript 2 Baseline minus 4 INVISIBLE TIMES a INVISIBLE TIMES c End-Root, Over, 2  INVISIBLE TIMES a, End-Fraction
  </m:annotation>
  <m:annotation-xml encoding="application/ssml+xml" xml:lang="en-US">
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis">
<s><mark name="gh12"/><say-as interpret-as="spell">x</say-as></s>
<s><mark name="gh13"/>equals</s>
<s><mark name="gh14"/><break time="500ms"/>Start-Fraction</s>
<s><mark name="gh16"/><break time="500ms"/> minus </s>
<s><mark name="gh17"/><say-as interpret-as="spell">b </say-as></s>
<s><mark name="gh18"/> plus-or-minus </s>
<s><mark name="gh19"/><break time="25ms"/>Start-Root</s>
<s><mark name="gh21"/><break time="25ms"/><say-as interpret-as="spell">b </say-as></s>
<s><mark name="gh20"/><break time="25ms"/>Superscript</s>
<s><mark name="gh22"/><break time="25ms"/>2</s>
<s><mark name="gh20"/><break time="25ms"/>Baseline</s>
<s><mark name="gh23"/><break time="25ms"/> minus </s>
<s><mark name="gh24"/>4</s>
<s><mark name="gh25"/> INVISIBLE TIMES </s>
<s><mark name="gh26"/><say-as interpret-as="spell">a </say-as></s>
<s><mark name="gh27"/> INVISIBLE TIMES </s>
<s><mark name="gh28"/><say-as interpret-as="spell">c </say-as></s>
<s><mark name="gh19"/><break time="25ms"/>End-Root</s>
<s><mark name="gh14"/><break time="525ms"/>Over</s>
<s><mark name="gh30"/><break time="500ms"/>2</s>
<s><mark name="gh31"/> INVISIBLE TIMES </s>
<s><mark name="gh32"/><say-as interpret-as="spell">a </say-as></s>
<s><mark name="gh14"/><break time="500ms"/>End-Fraction</s>
</speak>  
  </m:annotation-xml>
  <m:annotation encoding="audio/mpeg" src="audio/quadform1.mp3" xml:lang="en-US"/>
  </m:semantics>
</m:math>
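For the id approach to work, every mark name in the SSML has to match an id on some MathML element. Below is a minimal Python sketch of such a consistency check, using only the standard library. The function name `unresolved_marks`, the shortened sample document, and the deliberately dangling mark `gh99` are illustrative assumptions, not part of the proposal.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"
SSML_NS = "http://www.w3.org/2001/10/synthesis"

# Shortened, hypothetical sample in the style of the example above.
# The mark "gh99" is intentionally left without a matching id.
DOC = """<m:math id="gh10" xmlns:m="http://www.w3.org/1998/Math/MathML">
 <m:semantics>
  <m:mrow id="gh11">
    <m:mi id="gh12">x</m:mi><m:mo id="gh13">=</m:mo><m:mn id="gh14">1</m:mn>
  </m:mrow>
  <m:annotation-xml encoding="application/ssml+xml" xml:lang="en-US">
   <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis">
    <s><mark name="gh12"/>x</s>
    <s><mark name="gh13"/>equals</s>
    <s><mark name="gh99"/>one</s>
   </speak>
  </m:annotation-xml>
 </m:semantics>
</m:math>"""


def unresolved_marks(root):
    """Return SSML mark names that do not match the id of any MathML element."""
    mathml_prefix = "{" + MATHML_NS + "}"
    ids = {
        elem.get("id")
        for elem in root.iter()
        if elem.tag.startswith(mathml_prefix) and elem.get("id")
    }
    marks = [m.get("name") for m in root.iter(f"{{{SSML_NS}}}mark")]
    return [name for name in marks if name not in ids]


print(unresolved_marks(ET.fromstring(DOC)))  # ['gh99']
```

A tool that generates the annotation could run a check like this before writing the document, so that highlighting never points at a missing node.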
Using Depth-first numbering for cross-referencing
Below is markup that differs from the above in that instead of putting id references on all the elements, it uses the implicit depth-first numbering of the tree nodes for the bookmarks.
For both explicit ids and implicit numbering, there is a problem created by mfenced and by token nodes with multiple characters. For mfenced, there is no way to reference the open/close/separators attributes that are displayed. E.g., there is no way to indicate the opening fence. For tokens, there is no way to highlight just a single character in a multi-character token. For speech, that's not a problem, but when you get to navigation, it can be one. What we've done in MathPlayer and prototyped in MathJax is to augment the depth-first numbering to account for those situations. I can provide details if someone wants to get down to that level.
The node numbering for highlighting consists of two 16-bit numbers that indicate the start and end of the numbering so that a range can be referenced. Since (sadly) many MathML authoring tools generate flat <mrow> structure for linear expressions (e.g., for "a=b+c"), being able to refer to a range (e.g., the "b+c" part) is useful. In the following, no ranges are referenced. To decode a bookmark in the following:
- take the decimal number and convert it to an 8-digit hex value
- take the upper and lower 16 bits. Each half is a depth-first node number, and together they give the range.
For example, 262148 converts to 00040004, and so the range is given by the nodes [4,4] which is the '='.
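The decoding steps above can be sketched in Python as follows. The function names `decode_mark` and `encode_mark` are hypothetical, and treating the upper 16 bits as the start of the range and the lower 16 bits as the end is an assumption: the examples in this section only use marks whose two halves are equal, so they do not settle the order.

```python
def decode_mark(value: int) -> tuple[int, int]:
    """Split a 32-bit mark value into two 16-bit depth-first node numbers."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("mark value must fit in 32 bits")
    start = (value >> 16) & 0xFFFF  # upper 16 bits (assumed to be the range start)
    end = value & 0xFFFF            # lower 16 bits (assumed to be the range end)
    return start, end


def encode_mark(start: int, end: int) -> int:
    """Pack a (start, end) node range into a single 32-bit mark value."""
    for number in (start, end):
        if not 0 <= number <= 0xFFFF:
            raise ValueError("node numbers must fit in 16 bits")
    return (start << 16) | end


print(f"{262148:08x}")      # 00040004
print(decode_mark(262148))  # (4, 4)
print(encode_mark(19, 19))  # 1245203
```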
Note: 16 bit numbers are used because other (older but still widely used) voices use standards that limit the mark's value to 32 bits.
Note 2: Node numbering starts at 1. The value 0 is reserved as a special case that means to clear any highlighting. Hence, in the example below, the final mark clears any highlighting.
<math>
 <semantics>
  <mrow>
    <mi>x</mi>
    <mo>=</mo>
    <mfrac>
      <mrow>
        <mo>−<!-- − --></mo>
        <mi>b</mi>
        <mo>±<!-- ± --></mo>
        <msqrt>
          <msup>
            <mi>b</mi>
            <mn>2</mn>
          </msup>
          <mo>−<!-- − --></mo>
          <mn>4</mn>
          <mo>⁢<!--  --></mo>
          <mi>a</mi>
          <mo>⁢<!--  --></mo>
          <mi>c</mi>
        </msqrt>
      </mrow>
      <mrow>
        <mn>2</mn>
        <mo>⁢<!--  --></mo>
        <mi>a</mi>
      </mrow>
    </mfrac>
  </mrow>
  <annotation-xml encoding="application/ssml+xml" xml:lang="en-US">
    <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis">
      <mark name='196611'/> x <break time='60ms'/>
      <mark name='262148'/> equals <break time='1000ms'/>
      the fraction with numerator <break time='80ms'/>
      <mark name='458759'/> negative <break time='60ms'/> 
      <mark name='524296'/> b <break time='150ms'/>
      <mark name='589833'/> plus or minus <break time='600ms'/>
      the square root of
      <mark name='786444'/> b
      <mark name='851981'/> squared <break time='150ms'/>
      <mark name='917518'/> minus <break time='330ms'/>
      <mark name='983055'/> 4 <break time='60ms'/> <break time='60ms'/>
      <mark name='1114129'/> <phoneme alphabet='ipa' ph='æ'> eh</phoneme> <break time='60ms'/> <break time='60ms'/>
      <mark name='1245203'/> c   <break time='600ms'/> <break time='200ms'/>
      and denominator <break time='100ms'/>
      <mark name='1376277'/> 2 <break time='60ms'/> <break time='60ms'/>
      <mark name='1507351'/> <phoneme alphabet='ipa' ph='æ'> eh</phoneme> <break time='80ms'/>
      <mark name='0'/>
    </speak> 
 </annotation-xml>
 </semantics>
</math>
An alternative to ids and implicit numbering is to use a standard such as XPointer. Ids can be used, as can positional properties via ``element``, as in element(1/2/1), to get the indexed children from the root of the math expression. This is a sort of compromise, but it doesn't deal with the mfenced problem.
Expanded details
Here is an expanded version of the above example. Below is a partial tree of the <math> element. The red numerals represent the depth first numbering of the nodes.
And here is an expanded version of the SSML text. The decimal values for the 'name' attributes have been split into the two 16-bit values. So name='1245203' in the original example (which is 0x00130013) becomes name='19:19'. This example also includes additional speech for the fraction and square root to illustrate using a range of nodes.
      <mark name='3:3'/> x <break time='60ms'/>
      <mark name='4:4'/> equals <break time='1000ms'/>
      <mark name='7:23'/>the fraction <break time='600ms'/>
      <mark name='7:19'/>with numerator <break time='100ms'/>
      <mark name='7:7'/> negative <break time='60ms'/> 
      <mark name='8:8'/> b <break time='150ms'/>
      <mark name='9:9'/> plus or minus <break time='600ms'/>
      <mark name='12:19'/> the square root of <break time='100ms'/>
      <mark name='12:12'/> b
      <mark name='13:13'/> squared <break time='150ms'/>
      <mark name='14:14'/> minus <break time='330ms'/>
      <mark name='15:15'/> 4 <break time='60ms'/> <break time='60ms'/>
      <mark name='17:17'/> <phoneme alphabet='ipa' ph='æ'> eh</phoneme> <break time='60ms'/> <break time='60ms'/>
      <mark name='19:19'/> c   <break time='600ms'/> <break time='200ms'/>
      <mark name='21:23'/> and denominator <break time='100ms'/>
      <mark name='21:21'/> 2 <break time='60ms'/> <break time='60ms'/>
      <mark name='23:23'/> <phoneme alphabet='ipa' ph='æ'> eh</phoneme> <break time='80ms'/>
      <mark name='0'/>
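The numbering behind these marks can be reproduced with a short depth-first walk. The Python sketch below is an assumption-laden illustration: the function name `number_nodes` is hypothetical, and the rule that <math> is node 1 while the <semantics> wrapper and any annotations are not counted is inferred from the mark values above (x is node 3, '=' is node 4), not stated anywhere in this page. It also ignores the mfenced and multi-character token augmentation mentioned earlier.

```python
import xml.etree.ElementTree as ET

# The quadratic formula from the example, without the annotation.
MATHML = """<math><semantics><mrow>
  <mi>x</mi><mo>=</mo>
  <mfrac>
    <mrow><mo>&#x2212;</mo><mi>b</mi><mo>&#xB1;</mo>
      <msqrt><msup><mi>b</mi><mn>2</mn></msup><mo>&#x2212;</mo><mn>4</mn>
        <mo>&#x2062;</mo><mi>a</mi><mo>&#x2062;</mo><mi>c</mi></msqrt></mrow>
    <mrow><mn>2</mn><mo>&#x2062;</mo><mi>a</mi></mrow>
  </mfrac>
</mrow></semantics></math>"""


def number_nodes(root):
    """Return [(number, element), ...] in depth-first (document) order.

    Assumption: <semantics> is transparent and annotations are skipped.
    """
    numbered = []

    def visit(elem):
        tag = elem.tag.rsplit("}", 1)[-1]  # drop any namespace prefix
        if tag in ("annotation", "annotation-xml"):
            return  # annotations are not part of the displayed tree
        if tag != "semantics":
            numbered.append((len(numbered) + 1, elem))
        for child in elem:
            visit(child)

    visit(root)
    return numbered


NUMBERED = dict(number_nodes(ET.fromstring(MATHML)))
print(NUMBERED[3].text, NUMBERED[4].text, NUMBERED[19].text)  # x = c
```

Under these assumptions the walk yields 23 nodes, with the denominator's "a" as node 23, which agrees with the mark '23:23' above.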
Using XPath for cross-referencing
As Bruce suggested, XPath could also be used to provide the cross-referencing. Here is an example of it. Note that this is not optimized for conciseness of the XPath expressions. Perhaps they could be simplified.
<math>
 <semantics>
  <mrow>
    <mi>x</mi>
    <mo>=</mo>
    <mfrac>
      <mrow>
        <mo>−<!-- - --></mo>
        <mi>b</mi>
        <mo>±<!-- ± --></mo>
        <msqrt>
          <msup>
            <mi>b</mi>
            <mn>2</mn>
          </msup>
          <mo>−<!-- - --></mo>
          <mn>4</mn>
          <mo>&#x2062;<!-- invisible times --></mo>
          <mi>a</mi>
          <mo>&#x2062;<!-- invisible times --></mo>
          <mi>c</mi>
        </msqrt>
      </mrow>
      <mrow>
        <mn>2</mn>
        <mo>&#x2062;<!-- invisible times --></mo>
        <mi>a</mi>
      </mrow>
    </mfrac>
  </mrow>
  <annotation-xml encoding="application/ssml+xml" xml:lang="en-US">
    <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis">
      <mark name='math/semantics/mrow/mi'/> x <break time='60ms'/>
      <mark name='math/semantics/mrow/mo'/> equals <break time='1000ms'/>
      <mark name='math/semantics/mrow/mfrac/descendant::*[self::mi | self::mo | self::mn]'/>the fraction <break time='600ms'/>
      <mark name='math/semantics/mrow/mfrac/mrow[1]/descendant::*[self::mi | self::mo | self::mn]'/>with numerator <break time='100ms'/>
      <mark name='math/semantics/mrow/mfrac/mrow[1]/mo[1]'/> negative <break time='60ms'/> 
      <mark name='math/semantics/mrow/mfrac/mrow[1]/mi'/> b <break time='150ms'/>
      <mark name='math/semantics/mrow/mfrac/mrow[1]/mo[2]'/> plus or minus <break time='600ms'/>
      <mark name='math/semantics/mrow/mfrac/mrow[1]/msqrt/descendant::*[self::mi | self::mo | self::mn]'/> the square root of <break time='100ms'/>
      <mark name='math/semantics/mrow/mfrac/mrow[1]/msqrt/msup/mi'/> b
      <mark name='math/semantics/mrow/mfrac/mrow[1]/msqrt/msup/mn'/> squared <break time='150ms'/>
      <mark name='math/semantics/mrow/mfrac/mrow[1]/msqrt/mo[1]'/> minus <break time='330ms'/>
      <mark name='math/semantics/mrow/mfrac/mrow[1]/msqrt/mn'/> 4 <break time='60ms'/> <break time='60ms'/>
      <mark name='math/semantics/mrow/mfrac/mrow[1]/msqrt/mi[1]'/> <phoneme alphabet='ipa' ph='æ'> eh</phoneme> <break time='60ms'/> <break time='60ms'/>
      <mark name='math/semantics/mrow/mfrac/mrow[1]/msqrt/mi[2]'/> c   <break time='600ms'/> <break time='200ms'/>
      <mark name='math/semantics/mrow/mfrac/mrow[2]/descendant::*[self::mi | self::mo | self::mn]'/> and denominator <break time='100ms'/>
      <mark name='math/semantics/mrow/mfrac/mrow[2]/mn'/> 2 <break time='60ms'/> <break time='60ms'/>
      <mark name='math/semantics/mrow/mfrac/mrow[2]/mi'/> <phoneme alphabet='ipa' ph='æ'> eh</phoneme> <break time='80ms'/>
      <mark name='0'/>
    </speak> 
 </annotation-xml>
 </semantics>
</math>
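As a rough illustration of how a renderer might resolve such mark names, here is a Python sketch using the limited XPath subset in the standard library's xml.etree.ElementTree. The function name `resolve_mark` is hypothetical. The sketch handles only the simple child-step paths; the descendant:: expressions with self:: unions need a full XPath 1.0 engine and are explicitly rejected here. Treating a non-path name such as '0' as "clear highlighting" is carried over from the depth-first example as an assumption.

```python
import xml.etree.ElementTree as ET

MATHML = """<math><semantics><mrow>
  <mi>x</mi><mo>=</mo>
  <mfrac>
    <mrow><mo>&#x2212;</mo><mi>b</mi><mo>&#xB1;</mo>
      <msqrt><msup><mi>b</mi><mn>2</mn></msup><mo>&#x2212;</mo><mn>4</mn>
        <mo>&#x2062;</mo><mi>a</mi><mo>&#x2062;</mo><mi>c</mi></msqrt></mrow>
    <mrow><mn>2</mn><mo>&#x2062;</mo><mi>a</mi></mrow>
  </mfrac>
</mrow></semantics></math>"""


def resolve_mark(root, name):
    """Return the element a simple path mark points at, or None.

    None is returned for names that are not paths from the root element,
    e.g. the '0' mark assumed to mean "clear highlighting".
    """
    if "::" in name:
        # Axis steps are outside ElementTree's XPath subset.
        raise NotImplementedError("needs a full XPath engine")
    prefix = root.tag + "/"
    if not name.startswith(prefix):
        return None
    # ElementTree paths are relative to the element, so drop the leading "math/".
    return root.find(name[len(prefix):])


root = ET.fromstring(MATHML)
target = resolve_mark(root, "math/semantics/mrow/mfrac/mrow[1]/msqrt/mi[2]")
print(target.text)  # c
```

Note that a positional step such as mi[2] counts only siblings with the same tag, so the "b" inside <msup> does not affect which <mi> is selected under <msqrt>.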
MetaData in SSML
- SSML supports xml:lang so language versions can be tagged that way
- SSML supports meta and metadata elements, so metadata such as the parameters used to generate the speech could be added there. The metadata element has a schema associated with it.
Here is a simple example using the SSML metadata element to describe how the speech annotation was generated. Currently, generating speech annotations will likely require custom or proprietary tools. Therefore, a flexible metadata mechanism must be used to permit adequate description of the parameters used by the proprietary tool.
The example shows only a portion of the complete MathML.
 . . .
<m:annotation-xml encoding="application/ssml+xml" xml:lang="en-US">
  <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis">
    <metadata>
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
             xmlns:dc="http://purl.org/dc/elements/1.1/"
             xmlns:mathspeak="http://www.gh-mathspeak.com/speech-annotation">
      <rdf:Description rdf:about="http://www.w3.org/1998/Math/MathML/speech-annotation">
        <dc:Creator>mathspeak</dc:Creator>
        <dc:Publisher>gh, LLC</dc:Publisher>
        <mathspeak:Creator>mathspeak</mathspeak:Creator>
        <mathspeak:Version>2.13.100</mathspeak:Version>
        <mathspeak:Format>SSML</mathspeak:Format>
        <mathspeak:Explicivity>Off</mathspeak:Explicivity>
        <mathspeak:SemanticInterpretation>Off</mathspeak:SemanticInterpretation>
        <mathspeak:Verbosity>Off</mathspeak:Verbosity>
        <mathspeak:Lexicon>Standard American English</mathspeak:Lexicon>
      </rdf:Description>
    </rdf:RDF>
  </metadata>
  <s><mark name="gh12"/><say-as interpret-as="spell">x</say-as></s>
 . . .
This example uses the XML syntax of RDF, as recommended by the SSML W3C Recommendation.
Using the Dublin Core metadata items Creator and Publisher facilitates standardized recognition of some metadata. The "mathspeak" namespace demonstrates how to include custom or proprietary parameters. Interpretation of the "mathspeak" namespace data is left to the particular proprietary tool. This example shows parameters used by the "mathspeak" tool from the publishing organization "gh, LLC".
Here is a similar example showing parameters from MathType by Design Science, which also consumes the ETSSpeech:Matrix parameter.
. . .
<m:annotation-xml encoding="application/ssml+xml" xml:lang="en-US">
  <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis">
    <metadata>
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
             xmlns:dc="http://purl.org/dc/elements/1.1/"
             xmlns:dsi="http://www.dessci.com/speech-annotation"
             xmlns:ETSSpeech="http://www.ets.org/">
      <rdf:Description rdf:about="http://www.w3.org/1998/Math/MathML/speech-annotation">
        <dc:Creator>MathType</dc:Creator>
        <dc:Publisher>Design Science, Inc.</dc:Publisher>
        <dsi:subjectarea>LinearAlgebra</dsi:subjectarea>
        <ETSSpeech:Matrix>SilentColNum</ETSSpeech:Matrix>
      </rdf:Description>
    </rdf:RDF>
  </metadata>
  <mark name='196611'/> x <break time='60ms'/>
 . . .
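A consumer that wants to know how an annotation was produced could read this metadata back. The following Python sketch is one possible way to do that with the standard library; the function name `speech_metadata` is hypothetical, and the sample is a self-contained cut-down of the MathType example above. Keys in the result use ElementTree's {namespace}localname form.

```python
import xml.etree.ElementTree as ET

NS = {
    "ssml": "http://www.w3.org/2001/10/synthesis",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
}

# Cut-down, self-contained version of the metadata example above.
SPEAK = """<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis">
  <metadata>
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:dc="http://purl.org/dc/elements/1.1/"
             xmlns:dsi="http://www.dessci.com/speech-annotation">
      <rdf:Description rdf:about="http://www.w3.org/1998/Math/MathML/speech-annotation">
        <dc:Creator>MathType</dc:Creator>
        <dc:Publisher>Design Science, Inc.</dc:Publisher>
        <dsi:subjectarea>LinearAlgebra</dsi:subjectarea>
      </rdf:Description>
    </rdf:RDF>
  </metadata>
  <mark name="196611"/> x
</speak>"""


def speech_metadata(speak):
    """Return {qualified property name: text} from the first rdf:Description."""
    description = speak.find("ssml:metadata/rdf:RDF/rdf:Description", NS)
    if description is None:
        return {}
    return {child.tag: (child.text or "").strip() for child in description}


meta = speech_metadata(ET.fromstring(SPEAK))
print(meta["{http://purl.org/dc/elements/1.1/}Creator"])  # MathType
```

Because the tool-specific properties live in their own namespaces, a reader that only understands Dublin Core can ignore the rest without parsing errors.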
ARIA
Another idea is to use ARIA to provide a textual description of the math. ARIA provides three ways one could do that:
- aria-label
- aria-labelledby
- aria-describedby
The latter two link an element to text on the page and are therefore not appropriate unless that text is made invisible. For MathML, the [ARIA spec] recommends using aria-label. At the moment, if used in an XHTML context, ARIA attributes should use the http://www.w3.org/ns/wai-aria/ namespace.
Using this feature on a math element requires a schema modification for XHTML but does not require one for HTML. One possible use is to add it to a <math> element as in:
<math xmlns='http://www.w3.org/1998/Math/MathML' aria-label='a plus b equals c'>
  <mrow>
    <mrow> <mi>a</mi> <mo>+</mo> <mi>b</mi> </mrow>
    <mo>=</mo>
    <mi>c</mi>
  </mrow>
</math>
As shown above, this offers no advantage over using alttext, other than that some AT might be more likely to process aria-label than alttext. aria-label could also be used on each element. That would allow text to be spoken when navigating, but at the cost of having to supply speech text for each node.
Another alternative is to use aria-describedby with SSML as above and hide the SSML with some technique such as making it invisible (I have not verified that this works). Another option would be to include it in an annotation-xml element (which would not normally be spoken or displayed). That element could contain multiple <span>s with ids corresponding to the MathML subelements. For example:
<math>
<semantics>
  <mrow>
    <mrow>
       <mi aria-describedby="gh1">a</mi>
       <mo aria-describedby="gh2">+</mo>
       <mi aria-describedby="gh3">b</mi>
    </mrow>
    <mo aria-describedby="gh4">=</mo>
    <mi aria-describedby="gh5">c</mi>
  </mrow>
  <annotation-xml encoding="text/html">
     <span id="gh1"> a </span>
     <span id="gh2"> plus </span>
     <span id="gh3"> b </span>
     <span id="gh4"> equals </span>
     <span id="gh5"> c </span>
  </annotation-xml>
</semantics>
</math>
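To make the linkage concrete, here is a small Python sketch of how a consumer might follow the aria-describedby references to assemble the spoken text. It is only an illustration: the function name `described_text` is hypothetical, and real AT would go through the browser's accessibility tree rather than parse the markup like this.

```python
import xml.etree.ElementTree as ET

DOC = """<math>
<semantics>
  <mrow>
    <mrow>
       <mi aria-describedby="gh1">a</mi>
       <mo aria-describedby="gh2">+</mo>
       <mi aria-describedby="gh3">b</mi>
    </mrow>
    <mo aria-describedby="gh4">=</mo>
    <mi aria-describedby="gh5">c</mi>
  </mrow>
  <annotation-xml encoding="text/html">
     <span id="gh1"> a </span>
     <span id="gh2"> plus </span>
     <span id="gh3"> b </span>
     <span id="gh4"> equals </span>
     <span id="gh5"> c </span>
  </annotation-xml>
</semantics>
</math>"""


def described_text(root):
    """Join the span text referenced by aria-describedby, in document order."""
    by_id = {
        span.get("id"): (span.text or "").strip()
        for span in root.iter("span")
        if span.get("id")
    }
    words = []
    for elem in root.iter():
        # aria-describedby may hold a space-separated list of ids.
        for ref in (elem.get("aria-describedby") or "").split():
            if ref in by_id:
                words.append(by_id[ref])
    return " ".join(words)


print(described_text(ET.fromstring(DOC)))  # a plus b equals c
```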
The pros and cons are:
- Pros
- might work out of the box with some AT that have implemented ARIA (didn't work with JAWS or NVDA with either aria-label or aria-describedby)
- Cons
- it requires a schema change for XHTML if we don't want to require namespaces
- can't add prosody information
Discussion on above
Dennis: should the ability to write/edit annotations by hand be a consideration in choosing from the above? Should we support two different approaches based on ease of hand authoring and ease of machine generation?
Neil: I think both are so complex, some machine generation will be required. Where I can see hand editing possibly being done is having someone add an annotation (that was machine generated) by hand.
Dennis: The range idea (in "Using Depth-first numbering for cross-referencing") is more powerful, but it introduces a micro syntax and it would be good to avoid that. An alternative to having a range is to say that if that feature is needed, then there needs to be an mrow corresponding to the range so it can be pointed to.
Neil: I agree having well structured MathML is desirable, but few editors generate it. <mrow>s could be hand-added when needed, but I like the flexibility of a range more.