W3C

- DRAFT -

HTML Speech Incubator Group Teleconference

03 Nov 2011

See also: IRC log

Attendees

Present
DanB, Michael, Glen, Matt, Robert, Patrick, Avery, Nagesh, Debbie, Bertha, Milan, Rahul, DanD
Regrets
Chair
Daniel_Burnett, Michael_Bodell
Scribe
ddahl_, ddahl

Contents


<smaug> hi

<smaug> well, who am I then o_O

<smaug> pong

<burn> trackbot, start telcon

<trackbot> Date: 03 November 2011

<Milan> ScribeNick: Milan

Review recently sent examples

<DanD> http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Oct/att-0064/speechwepapi_1_.html#introduction

<mbodell> http://bantha.org/~mbodell/speechxg/example1.html

Michael: Speech Web Search Markup only

Robert: Found that addGrammarFrom() is awkward
... really a hint

Glen: True that input has no grammar

Michael: It's a builtin grammar

Robert: What about deriveGrammarFrom

Glen: It's an append grammar

DanD: Option might be a better example

Michael: Text is a grammar

Robert: Assume q is an object from which a grammar can be derived

<smaug> Nit, <button name="mic" onclick="speechClick()"> is a submit button, so when you click it, the form is submitted. type="button" would fix the problem

DanB: addDerivedGrammar

Debbie: Figure out semantics first

Robert: AddDerivedGrammarFromID

Glen: Also rename q to 'inputField'
... Also change from text input type to date or something more constrained
... Need to specify the lack of grammars
... Is this dictation?

Robert: improve example by defaulting to UTF-8

<glen> Section 5.1: when no grammar specified, defaults to builtin:dictation

Robert: Base 64 encoding is ugly
... to the point where it is unusable

Michael: Worried about directly inserting XML due to 8th bit

DanB: Are there already common protocols for inserting strings derived from URLs into local variables?

Glen: Should only be a W3C standard; implementation is orthogonal

Robert: AddFromString() would be nice

Glen: addStringGrammar() and addElementGrammar()

Avery: Prefer the longer name because it's truer to form

<smaug> Couldn't you just prepend "data:application/srgs+xml," to the serialized XML. But anyway, using data urls is kind of hackish, IMO.

Robert: Too many dots to get the interpretation

Milan: Propose addGrammarFromURI()

Robert: Newing up a speech grammar is better approach
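smaug's suggestion above amounts to wrapping the serialized grammar in a plain data: URI rather than base64-encoding it. A minimal sketch (the helper name is illustrative, not a proposed API):

```javascript
// Sketch of smaug's suggestion: prepend the data: scheme and media type
// to the serialized SRGS XML instead of base64-encoding it.
// encodeURIComponent escapes the characters (including non-ASCII,
// the "8th bit" concern) that would otherwise break the URI.
function grammarToDataUri(srgsXml) {
  return "data:application/srgs+xml," + encodeURIComponent(srgsXml);
}

const xml = '<grammar version="1.0" xml:lang="en-US"><rule id="city">Boston</rule></grammar>';
const uri = grammarToDataUri(xml);
// The result is a readable data: URI; no base64 step needed.
console.log(uri.startsWith("data:application/srgs+xml,")); // true
```

The percent-encoded form round-trips losslessly through decodeURIComponent, which is what makes it usable where base64 was judged too ugly.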

Michael: Let's just raise issues now rather than solve them

Debbie: Example is complex, and gets mixed up with the argument that JS is complex


Michael: Next example from Bjorn

Robert: The example lacks a grammar

<smaug> s/onclick="startSpeech"/onclick="startSpeech(event)"/

<DanD> http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Nov/att-0008/web-speech-sample-code.html

Robert: Need to define what happens when lacking a grammar

Avery: Is there a policy against comments in the examples?

Michael: Planning on adding examples to an appendix

Avery: It's a decent example, as long as it is clear that this instance lacks a grammar

Robert: Example shows default behavior

Rahul: Could also delete the button as a means of shortening the example

<glen> per Avery's suggestion: add a comment "since no grammar is specified and no element is bound, uses default grammar builtin:dictation"

Rahul: Two different ways to perform same array access

Glen: Should make it consistent in example

<mbodell> In Bjorn's second example need sr.maxNBest = 2;

<glen> use same notation: s/q.value = event.result.item(0).interpretation;/q.value = event.result[0].interpretation;/

Robert: Intent is to get a text transcript of the user's input
... why are we accessing the interpretation instead of tokens?

Milan: Need to bring this up in protocol team

<all agreed> to use "utterance" in place of interpretation

Milan: Last two comments should apply here as well
... Should we have company-specific references?

Michael: Prefer example.org

Robert: Is there speech recognition in turn-by-turn?

Michael: Speech recognition is just destination capture

<smaug> Again, s/onclick="startSpeech"/onclick="startSpeech(event)"/

Robert: Prefer that speaking the next instruction cancels the last instruction

Glen: Thought the purpose of example was to show interplay between speech and tts?

Michael: TTS play resumes where last left off

Glen: Way to stop prior play is a good feature
... we should change this example

<glen> change example to show how to stop, by persisting the tts object and calling stop before adding .text and .play

Michael: Ollie example next

<mbodell> http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Nov/att-0009/htmlspeech_permission_example.html

Michael: First example is just removing unauthorized elements?
... but second example doesn't allow speech input to start

Ollie: Yes

Michael: Can you transition from not authorized to authorized?

Ollie: Should be possible, but example doesn't do that
... but could also just reload the page

* Going on break now

<inserted> scribe:ddahl_

<scribe> scribe:ddahl

Robert's example

robert: two recognitions in a row, you want to pick your cities based on what state you're in.

<Avery> Actually I think it's based on what state is specified in the first reco, not necessarily what state you're in. A minor nit.

robert: it really should say "interpretation.state", not just "interpretation"
... used push instead of adding things to the array of speech grammars
... a bug on result, should be city, also, sr.onMatch should be sr.onResult
... second example is rereco
... gives grammars to speechInputRequest, then classifies, then does rereco with a specific grammar

glenn: this seems to be a strange use of "interpretation"

robert: there is a huge universe of grammars

rahul: this is identifying one grammar as different from the others

robert: using the attribute "modal" to activate and deactivate grammars
... would change the example to get interpretation.classification
... strange to have multiple "modals" as true, think modal might be a bad idea

speech-enabled email

michael: one interesting thing is that you might get notifications that you would want to speak to, but without clicking

robert: was mostly thinking about things like "reply", but you could also imagine saying "read it to me" after notification
... made up a method to cancel TTS

michael: you could just delete the element

robert: what if you set up the element with stuff in it?

glenn: destroy should not be to only way to cancel

Milan's example of protocol

milan: will augment with API calls that trigger protocols
... need a result index of some kind
... then recognizer decides to change its mind and reorders results
... strange to get a "complete" result in the middle of a long dictation
... result index 0 is the first fragment, then halfway through the second fragment, the recognizer says the first one is done
... different from MRCP, because in MRCP that means it's the end of it
... then retracts a result, not sure how to represent this, maybe an "IN_PROGRESS" message with no payload
... we will put this in the larger document as an example of the protocol

michael johnston's multimodal use case

<smaug> Could you please paste links to the example here

michael: "I want to go from here to there" is the use case

<smaug> ( would be then easier to read minutes later )

<mbodell> Michael's example: http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Nov/att-0020/multimodal_example.html

<mbodell> You can walk through the examples from: http://bantha.org/~mbodell/speechxg/f2f.html which links to http://bantha.org/~mbodell/speechxg/examples.html which then walks through the examples

glenn: it would be good to have a "state" attribute
... the "nomatch" state is more of a result, not a state
... we may need more than one attribute to get results of speech processing

michael: this also has the EMMA so that you can see the mapping from EMMA
... this example makes use of a remote speech service

<glen> http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Nov/att-0020/multimodal_example.html

michael: the EMMA shows the combined speech and gui input

robert: this should be wss:, that is, the WebSocket protocol, but what should we do if someone uses http?

michael: you could get the command right but not the person if you didn't do the "clickInfo"

Charles Hemphill's example

<glen> http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Nov/0024.html

danD: we should start with the simplest example

Michael Bodell's example 8, translation

<glen> http://bantha.org/~mbodell/translate.html

<glen> view-source:http://bantha.org/~mbodell/translate.html

michael: different example of translation
... there's from and to languages, you choose, and then click on microphone to talk
... there's a progress bar that gets updated
... we're grabbing our language from the selector, we're using a dictation grammar for whatever language we're using
... where are we doing capture?

glen: wouldn't that be the microphone?

michael: not necessarily, there could be other things like media streams

glen: is capture necessary or does it just provide more features?

michael: we didn't have any examples of capture from other places, like from Web RTC
... right now there's no standard for accessing microphone

glen: would like to see default example where we don't have to explicitly do capture

michael: all examples assume that there's magic for capturing audio

glen: can't we make it so that the magic is what happens by default?

dan: there are many security and privacy issues
... different permissions for getting access to media but also to do something to the media

michael: this is also raised in some of our issues, we only have a two sentence note now
... can TTS work on Web Sockets?

robert: yes

michael: on audio start, etc. are in our spec. another issue is that payload of start, stop events isn't defined

robert: : do we have VU meter events?

michael: no

dan: that came up in Web RTC, they don't have that, but they could create it

michael: we do have speech-x events for custom extensions

robert: most speech apps have one

michael: is that part of the UA or the app?

Debbie's example

multi-slot filling

<mbodell> Debbie's: http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Nov/att-0031/Multi-slotSpeech1.html

debbie: in this example you have to pull out the slot values from the EMMA

robert: is this the same as saying "interpretation.booking"?

debbie: not sure
... we don't know what's in "interpretation"

robert: we could get rid of "interpretation"

michael: it could be a useful pointer into the EMMA
... that is available in VXML

<mbodell> Issue: we should make sure it is clear what the interpretation points to

<trackbot> Created ISSUE-1 - We should make sure it is clear what the interpretation points to ; please complete additional details at http://www.w3.org/2005/Incubator/htmlspeech/track/issues/1/edit .

michael: should do an if to make sure that you really got a value

debbie: could add the EMMA
... would there be value in some kind of convenience syntax so that you don't need the full DOM generality to manipulate the EMMA result?

<mbodell> Charles' example: http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Nov/0033.html

another example from Charles Hemphill

michael: the same example as before but with an external grammar

avery: what's the advantage of having "reco" element as a child under "input"

michael: there are two different ways to do the same thing, with "reco" under as a "child" under <input> you don't need an id

<smaug> <input> element can't have child elements

actually, input is a child of reco in the proposal

<smaug> My comments to example 3 http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Nov/0034.html

michael: another example with a real inline grammar so that you don't have to do data uri
... we would have to define a "grammar" tag

robert: we would have to define for browsers how to interpret SRGS

avery: like putting script in page vs an external reference

<smaug> Milan: remember, we're talking about HTML here, not XML

<smaug> (I assume that was Milan)

milan: could we say "as long as this is valid XML ignore it and pass it to us"?

robert: why wrap the whole thing with the grammar element?

michael: if there's an SRGS 1.1, you wouldn't know what version it was, for example
... would like to have inline grammar, if any, be full SRGS with <grammar> element
... that is the end of the examples

<Milan> * Good point Ollie

<glen> scribenick: glen

issues

burnett: if can't agree, depends on importance. If important, capture different opinions in doc.
... (not required to resolve everything in incubator group)

<mbodell> First issue to discuss: http://bantha.org/~mbodell/speechxg/issuep1.html

1. What Content-Type do we want to use on an empty message? Use case was nulling out previous candidate recognition.

milan: do we have to specify? can it be assumed?
... empty means no payload?

robert: protocol doesn't require a body
... in which case I don't think it needs a content type. Example getParams

michael: what about interim results?
... if no content and not content type then nulls out corresponding result. Example: an interim result gets replaced with no result (e.g. if a <cough> is initially recognized as some text)

Protocol Issues

2. I am skeptical about changing established MRCP event/method names. I sort of agree that LISTEN is better than RECOGNIZE, but do not think the reasons are good enough to warrant ensuing churn.

Robert: Microsoft doesn't care if similar to MRCP, rather that it's compatible with our web sockets protocol

burnett: web sockets is just a transport
... violates many types of protocol design
... if standards track, IETF is a logical place

robert: so naming doesn't matter much at this point.

all: agree

burnett: some talk of using SIP to setup, would have to separate signaling and data...which is one thing wrong with this.

robert: this is more to illustrate a point that it can be done

burnett: companies could implement today, and may not be completely interoperable (as is often the case on first implementations)

michael: we agree, not to change names right now. Names will likely be re-evaluated in a standards track.
... minor syntax issues can be called out as a note in the doc.

burnett: when it gets into a standards group, they look at requirements and take ideas into consideration, but they consider MANY other factors, e.g. security, that drive the design

3. We need a way to index the recognition results. I suggest using a Result-Index header

all: agree to add. if a one-shot recognition, it's only [0] and still optional

4. It was awkward to use a RECOGNITION-COMPLETE message presumably with a COMPLETE status during continuous speech. Instead, I used INTERMEDIATE-RESULT with a new Result-Status header set to final.

robert: just rename RECOGNITION-COMPLETE as RECOGNITION-RESULT
... it's an intermediate, unless it's a final response type.

burnett: MRCP has separate status code and completion code

Milan: we need a complete flag, not sure it was defined. We haven't stated which status codes correspond to which messages.

burnett: in MRCP, status is about communication (like 200 OK). In MRCP, the completion code indicates what happened (e.g. successful reco)

robert: so status indicates "sending more", so status should be in-progress for continuous reco case.
... need request state?

burnett: request has been made, has it been completed yet? status is success, illegal method, illegal value, unsupported header

robert: reco result, 200 OK, in progress

5. Perhaps Source-Time should also be required on final results

all: yes, everything's fine, more to come

Milan: by time have final result, should know start time.

all: agree, require only reco result

Milan: could be reco result with type = pending

michael: pending implies have already started

robert: in progress more accurate

all: agree to leave as is

6. Wanted to confirm that channel identification is being handled by the WebSocket container

robert: handled by web socket
... if two separate recos, then two web sockets and two audio streams. (Can have 2 grammars active in one reco)

milan: continuous hotword case

robert: that's continuous reco
... start session with hotword and command-control grammar, all is continuous results

michael: hard if change over time
... because have to pause to change
... so not continuous

robert: don't want to transmit audio twice, but with two sessions, you must

avery: does the emma result specify which grammar?

michael: yes

7. I noticed that Completion-Cause was missing from Robert's spec example in section 4.2.

robert: accidental omission, need to add

Web API Issues

1. To get the reco result I think i have to write "e.result.item(0).interpretation". This is a lot of dots and an index just to get the top result.

robert: I want to write e.interpretation -- because most of the time that's what I want (but still could use the verbose way as well)

<mbodell> Here is the link to where the event is defined: http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Oct/att-0064/speechwepapi_1_.html#speechinputresult

milan: e.result.interpretation

michael: can already use e.result[0].interpretation

glen: we should change utterance to match

all: e.interpretation and e.utterance
... agreed
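The agreed shorthand, next to the verbose form it abbreviates; the event shape here is a plain-object stand-in for the draft interface, not spec text:

```javascript
// Stand-in for the draft result event: e.interpretation and e.utterance
// mirror e.result[0].interpretation / .utterance, per the agreement above.
function makeResultEvent(nbest) {
  return {
    result: nbest,
    get interpretation() { return nbest[0].interpretation; },
    get utterance() { return nbest[0].utterance; }
  };
}

const e = makeResultEvent([
  { interpretation: "flights to Boston", utterance: "flights to boston" }
]);
console.log(e.interpretation === e.result[0].interpretation); // true
```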

2. "utterance" has a couple of different meanings in the doc. It's alternatively the recording of what the person said, or the transcript returned by the recognizer.

michael: transcript? text? tokens?

text and token are overused and confusing

robert: but it is text, so not overloading the concept
... (unlike token)

burnett: transcript, closest to what's actually happening, laymen get it

glen: text is not descriptive: interpretation is text, whereas transcript vs interpretation is clear

all: agree: rename utterance to transcript

5. The "modal" attribute on SpeechGrammar is unnecessarily restrictive

Discussion: There are cases where I'll want to have multiple grammars active, but not all, and not just one. Developers would be better off with a boolean enabled attribute on each grammar. Would be useful to clarify the behavior when there is more than 1 grammar with this set to true (only the first in the list is active?) Is this even useful at all? What is the case for having grammars which aren't active in the reco? Can we change the state of the modal/u

robert: fewer lines of code if just set one to true

milan: alternatively, could add/remove from grammars array

glen: sending all at once allows caching
... of grammars
... what about continuous case, can grammars change on the fly

michael: we decided to simplify by re-calling .start to change grammars or anything else

milan: should have a separate way to preload

burnett: voicexml has defineGrammar

milan: grammar set object on the SpeechInputRequest
... I'm proposing sets of grammars

robert: I'd like it flatter, get rid of enabled/disabled -- just delete -- and don't allow preload

michael: already have .open that allows preloading

<mbodell> See web api: http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Oct/att-0064/speechwepapi_1_.html#dfn-open

<scribe> scribenick: glen

burnett: nervous about this, we discussed this for a long time and considered many edge-cases

robert: alternative: get rid of modal and enable, and just use a bunch of grammars

avery: if .open has already been called, .start doesn't call it (.start only calls .open if hasn't been opened yet)

burnett: wondering if there are performance advantages in reco engine if can call enable/disable as opposed to calling .open multiple times?

<smaug> start might need to re-call open if authorizationState has changed from not-authorized to authorized

milan: MRCP didn't solve this, why should we?
... all good MRCP clients do what you're saying automatically, they automatically check for the deltas

robert: this runs at web-scale, distributed
... big difference between telephony and web

michael: options: eliminate modal, or keep and define what happens if multiple set to true

avery: easier to add later than remove

agree: eliminate .modal

5. The interpretation attribute is completely opaque. That may be necessary given that SISR can pretty much return anything. But it'll need some examples to show how to use it.

burnett: there was support for a flat array of interpretations
... I didn't like that, Nuance and their customers didn't like it,

debbie: use emma to define layout

michael: different reco engines may use emma in different ways
... fundamentally, .interpretation points to somewhere in emma, which simplifies (and a corresponding .transcript)

all: agree, specify which part of emma holds the interpretation

michael: mapped to a DOM object, emma literal or node
... like debbie's slot-filling example
... I will send text for this

5. The array of SpeechGrammar objects is too cumbersome

<smaug> something happened to the audio. it is all just noise

<smaug> though, getting late here

Discussion: Robert: The array of SpeechGrammar objects is too cumbersome. In most cases I'd like to write something simple like: mySR.speechGrammars.push("USstates.grxml","majorUScities.grxml","majorInternationalCities.grxml"); But I can't. I have new-up a separate object for each one then add it to the array, even when I don't care about the other attributes. Better to just make it an array of URI strings, and add functions for the edge cases. e.g. void ena

void setWeight(in DOMString grammarUri, in float weight); And yeah, I remember arguing the opposite on the phone call. But that's before I tried writing sample code. Glen: "The uri of a grammar associated with this reco. If unset, this defaults to the default builtin uri." Presumably using the grammar attribute overwrites the default grammar, so if a developer wishes to add a grammar that supplements the default grammar, then this alternative should work: re

would add clarity. Michael: If you view source on the web api document you'll see the grammar functions and descriptions are there commented out as I anticipated, and agree, with this comment. We should have both functions and array/collections and this makes the things that Robert and Glen describe much easier/better.

michael: grammar spec after ? are hints, before builtin: are required and errors if not supported
...example: builtin:contacts may recognize names in smartphone
... require builtin:generic

burnett: builtin:generic means I'll take anything you've got: if it's just a date grammar, I'll take it.

<mbodell> We are talking about http://bantha.org/~mbodell/speechxg/issuew5.html but really more about what happens with no grammar

milan: builtin:generic could respond with failure, builtin:dictation could also respond with failure

robert: builtin:generic should be builtin:default
... and none specified is builtin:default

burnett: what if want to use both default and another grammar

glen: then add builtin:default and builtin:foo

michael: default is not user default, but service or ua default

milan: want a way to record without a grammar

michael: we define builtin:default, encourage vendors to implement, and state that when none is specified, it's on by default (and when other grammars are specified, it can also be added).
... I like .addGrammar(url, weight) as a simplification from creating an object and then setting it

robert: .addGrammarFromUrl(url, weight)
... .addGrammarFromElement(element, weight) .addGrammarFromString(string, weight)
... better yet: .addUrlGrammar .addElementGrammar .addStringGrammar
... but advantage for objects to be alphabetical order, grouped together in docs

glen: .addGrammarUrl .addGrammarElement .addGrammarString
... remove is a JavaScript array operation

michael: also .addCustomParameter(name, value)

all: agree: .addGrammarUrl .addGrammarElement .addGrammarString .addCustomParameter
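A sketch of how the agreed method names might read in application code. SpeechRecoStub is a stub standing in for the draft interface (no implementation existed at the time); only the method names come from the discussion, the internals are invented for illustration:

```javascript
// Stub illustrating the agreed method names; the real draft interface
// was never finalized, so this implementation is hypothetical.
class SpeechRecoStub {
  constructor() {
    this.grammars = [];
    this.parameters = {};
  }
  addGrammarUrl(url, weight) {
    this.grammars.push({ src: url, weight });
  }
  addGrammarString(srgsText, weight) {
    // Inline SRGS carried as a data: URI, per the earlier discussion.
    this.grammars.push({
      src: "data:application/srgs+xml," + encodeURIComponent(srgsText),
      weight
    });
  }
  addCustomParameter(name, value) {
    this.parameters[name] = value;
  }
}

const sr = new SpeechRecoStub();
sr.addGrammarUrl("http://example.org/USstates.grxml", 1.0);
sr.addGrammarString("<grammar/>", 0.5);
sr.addCustomParameter("com-example-mode", "fast");
console.log(sr.grammars.length); // 2
```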

<smaug> I think this is enough for me. I'll read the minutes tomorrow and send comments

<smaug> It is midnight here

<smaug> dark? it has been dark here for the last 6 hours

<rahul> scribenick: rahul

Issue 6

<mbodell> Link to the current issue: http://bantha.org/~mbodell/speechxg/issuew6.html

<glen> 6. The names are a bit long.

<glen> Discussion: e.g. "new SpeechInputRequest()" vs "new SpeechIn()" . e.g. "mySR.speechGrammars.push("foo")" vs "mySR.grammars.push("foo")" . e.g. "resultEMMAXML" vs "EMMAXML" or just "EMMA" (call the other one "EMMAText" ) e.g. "inputWaveformURI" vs "inputURI"

Milan: how about SpeechRequest instead of SpeechInputRequest?

Robert: SpeechRecognizer?

Milan: AudioSynthesizer?

Glen: SpeechReco?

<Milan> Milan: AudioSynth


Resolution: We will use SpeechReco instead of SpeechInputRequest

<matt> Parkinson's Law of Triviality

<scribe> ACTION: Editing team to update to SpeechReco [recorded in http://www.w3.org/2011/11/03-htmlspeech-minutes.html#action01]

<trackbot> Sorry, couldn't find user - Editing

Issue 7

7. SpeechInputRequest.outputToElement() should be an attribute, perhaps 'forElement'

<matt> Issue 7

Resolution: Replace outputToElement() function with the outputElement attribute

Issue 8

<inserted> Issue 8

8. SpeechInputResult has a getter "item(index)". SpeechInputResultEvent has an array "SpeechInputResult[] results".

Discussion: Can we change both to be collections similar to http://www.w3.org/TR/FileAPI/#dfn-filelist (accessible via [] operator and optionally with a .item() method)?

Resolution: Accepted
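The FileList-style access pattern the resolution adopts (readable via `[]` and via `.item()`) can be sketched in plain JavaScript; this is only an illustration of the pattern, not spec text:

```javascript
// Minimal FileList-like collection per issue 8: readable via results[i]
// and results.item(i), mirroring http://www.w3.org/TR/FileAPI/#dfn-filelist.
function makeResultList(items) {
  const list = {
    length: items.length,
    item(index) {
      // item() returns null for out-of-range indices, like FileList.
      return index >= 0 && index < items.length ? items[index] : null;
    }
  };
  items.forEach((it, i) => { list[i] = it; });
  return list;
}

const results = makeResultList([
  { interpretation: "boston" },
  { interpretation: "austin" }
]);
console.log(results[0].interpretation);      // "boston"
console.log(results.item(1).interpretation); // "austin"
console.log(results.item(5));                // null
```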

Issue 9

<matt> Issue 9

9. The <reco> element should probably be a void element with no content on its own

Discussion: Satish: http://dev.w3.org/html5/spec/Overview.html#void-elements. I just noticed this in the for attribute's description, missed it in earlier reads: "If the for attribute is not specified, but the reco element has a recoable element descendant, then the first such descendant in tree order is the reco element's reco control." Is there a benefit to doing this over requiring the 'for' attribute to be set and making reco a void element? Charles: I a

<glen> resolution: can specify with either descendent or with for= attribute

Resolution: Agreed to leave it as-is using either the for or the descendant pattern

Issue 10


<inserted> Issue 10

10. TTS is hard

<matt> http://bantha.org/~mbodell/speechxg/issuew8.html

Discussion: Bjorn: I can't see any easy way to do programmatic TTS. The TTS element is at least missing the attributes @text and @lang. Without those, it's pretty hard to do the very simple use case of generating a string and speaking it. It's possible, but you need to build a whole SSML document. For use cases, see the samples I sent earlier today. Dominic: For TTS, I don't understand where the content to be spoken is supposed to go if it's not specified in

Michael: @lang is not missing since it could be inherited
... there is no @text there

Glen: content within <tts></tts> will show up within older browsers

<mbodell> Discussion is <tts src="data:text/plain,Hello, world"/> versus <tts value="Hello, world"/> versus something else. Note in JS we could define a function so it is pretty similar, but from Markup a little harder to get the function creating the data uri (probably still possible)
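mbodell's note that "in JS we could define a function so it is pretty similar" can be sketched as a tiny helper that builds the data: URI form of the markup example; the helper name is hypothetical:

```javascript
// Hypothetical helper making <tts src="..."> equivalent to a plain value:
// builds the data:text/plain URI form discussed above.
function ttsSrcFor(text) {
  return "data:text/plain," + encodeURIComponent(text);
}

console.log(ttsSrcFor("Hello, world")); // "data:text/plain,Hello%2C%20world"
```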

<glen> 72<tts value="fahrenheit">F</tts>

<glen> michael: tts as a markup may render visually a control (play, stop, etc)

<glen> ...other dom can interact

<glen> glen: most uses of tts need dynamic control -- that is require javascript

<glen> michael: because tts inherits from media-element, it requires a src attribute

<glen> glen: <img alt="text">

<glen> michael: <tts> is not used as an alternative fallback

Dan: usecase for <tts> element is to facilitate easy generation as part of markup rather than generating script


Michael: the @lang inherited from the <media> element should be passed as a parameter to the synthesizer

Resolution: Add a @text attribute to <tts>.

Issue 11

-> http://bantha.org/~mbodell/speechxg/issuew11.html Issue 11

11. How does binding to button work

Discussion: Satish: "When the recoable element is a button then if the button is not disabled, then the result of a speech recognition is to activate the button." "For button controls (submit, image, reset, button) the act of recognition just activates the input." "For type checkbox, the input should be set to a checkedness of true. For type radiobutton, the input should be set to a checkedness of true, and all other inputs in the radio button group must b

Michael: propose to have an issue note that this needs further thought

Robert: define what we can, and for others say there is no binding

Resolution: Add issue note that more work to be done on bindings

Issue 12

12. What about meter, progress, and output elements?

Discussion: Satish: The meter, progress and output elements all seem to be aimed at displaying results and not for taking user input. Is there a reason why these are included as recoable elements?Michael: This is specified at Reco Bindings. A person could want to be able to speak and have it change a progress bar or meter or output element. The primary reason is matching what is done with label. These are all labelable elements and thus ended up as recoable

Glen: suggest we not talk about bindings to these

Dan: we need to decide which ones to leave out, I agree since these are not even <input> elements

Resolution: Remove these from the recoable elements and bindings

Issue 13

13. grammars and parameters should be collections

Discussion: Satish: Similar to issue 8, SpeechInputRequest attributes 'grammars' and 'parameters' should probably be turned into a collection as well

Resolution: Accepted

Issue 14

14. rename language to lang

Discussion: SpeechInputRequest.language should probably be changed to 'lang' to match lang attributes.

Resolution: Accepted

Issue 15

15. rename interimResults to interimResultsInterval

Discussion: SpeechInputResult.interimResults should probably be renamed to interimResultsInterval to indicate its usage similar to how other attributes have 'Timeout' in their names

Resolution: Turn into boolean property, name does not change

Issue 16

16. drop enum prefixes

Discussion: SPEECH_AUTHORIZATION_ prefix could be dropped for the enums and just have 'UNKNOWN', 'AUTHORIZED' & 'NOT_AUTHORIZED' (similar to XHR States). Same for SPEECH_INPUT_ERR_* and other such enums.

Resolution: Accepted (given Satish's input and expertise)

Issue 17

17. A way to uncheck automatically by speech?

Discussion: Glen: "For type checkbox, the input should be set to a checkedness of true." It would be nice to have a way to allow user to say something to set it to false, but I can't think of a good convention for this other than adding an attribute or grammar. Perhaps this could/should only be possible via scripting. (I don't like the idea of toggling the checkbox because some users may not be able to easily observe what state the checkbox is currently in.)

Resolution: See resolution to issue 11

<smaug> mbodell: I'm kind of online

<smaug> what enum conflicts?

<smaug> if the const is in an interface, then no

Issue 18

<inserted> Issue 18

18. Binding hints versus requirements

Discussion: Glen: "For date and time types ... type of color ... type of range the assignment is only allowed if it is a valid ..." On our call we discussed how these grammars are hints, and in particular how pattern may be difficult to implement. We discussed that showing an output response, even an invalid one, may be more valuable than no response. Michael: We can do hints for patterns on text, and for numbers out of range, but for other types HTML5 is jus

Resolution: See resolution to issue 11

<glen> satish provides this example of two sets of enums, with no prefixes.

<glen> https://developer.mozilla.org/en/DOM/HTMLMediaElement

Issue 19

19. Does reco and TTS need to be on a server as opposed to client side?

Discussion: Dominic: The spec for both reco and TTS now allow the user to specify a service URL. Could you clarify what the value would be if the developer wishes to use a local (client-side) engine, if available? Some of the spec seems to assume a network speech implementation, but client-side reco and TTS are very much possible and quite desirable for applications that require extremely low latency, like accessibility in particular. Is there any possibility…

<matt> Issue 19

<glen> satish continues: HTMLMediaElement.LOADED so no clashes

<glen> (above refers to issue 16)

Resolution: The service does not need to be remote, UAs may define URIs to local engines. We should add clarifying text specifying this. Also, the serviceURI does not need to be remote. We will clarify this as well.
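The resolution says a UA may define URIs that name local engines. An illustrative sketch of the two cases (serviceURI is the draft's attribute name, but the recognizer object below is a plain-object stand-in, and the local-engine scheme shown is hypothetical, not specified):

```javascript
// Stand-in object; the draft's real constructor is not assumed here.
const reco = { serviceURI: "" };

// Case 1: a remote network recognition service.
reco.serviceURI = "https://example.org/speech/reco";

// Case 2: a UA-defined URI for a local, on-device engine. The scheme
// and value are hypothetical; each UA would define its own.
reco.serviceURI = "x-local:default";

console.log(reco.serviceURI); // "x-local:default"
```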

Issue 20

20. Set lastMark?

Discussion: Dominic: An earlier draft had the ability to set lastMark, but now it looks like it's read-only, is that correct? That actually may be easier to implement, because many speech engines don't support seeking to the middle of a speech stream without first synthesizing the whole thing. Michael: Actually the speech xg version has never supported setting a lastMark. You can control playback using the normal media controls (setting currentTime, seekable…

Resolution: Leave as-is right now. Add issue note about also making it writeable.

Issue 21

21. More frequent callbacks?

Discussion: Dominic: When I posted the initial version of the TTS extension API on the chromium-extensions list, the primary feature request I got from developers was the ability to get sentence, word, and even phoneme-level callbacks, so that got added to the API before we launched it. Having callbacks at ssml markers is great, but many applications require synchronizing closely with the speech, and it seems really cumbersome and wasteful to have to add an s…

Resolution: Leave as-is. Suggest as enhancement to SSML.
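The workaround the resolution implies: an author who needs word-level callbacks today must insert an SSML mark element before each word and listen for the mark events, which is exactly the cumbersome pattern Dominic describes. A minimal SSML 1.0 fragment showing the idea (mark names are arbitrary):

```xml
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <mark name="w1"/>Hello <mark name="w2"/>world
</speak>
```

Built-in word or phoneme callbacks would remove the need to pre-mark every token, hence the suggestion to route this as an SSML enhancement.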

Issue 22

22. How do we fit with capture/input/MediaStream?

Discussion: Michael: Our spec has: attribute MediaStream input; but we have nearly no explanation of it and our examples don't show how to use it. Can we do better?

<mbodell> spec link: http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Oct/att-0064/speechwepapi_1_.html#dfn-input

Resolution: This XG probably can't do better. We should have an issue note and include the assumption that media stream input somehow happens. This seems to be of interest to numerous groups (Audio, DAP, Web RTC, HTML Speech XG ...), Debbie will follow up as part of the HCG.

<scribe> ACTION: ddahl2 to set up follow-up via HCG [recorded in http://www.w3.org/2011/11/03-htmlspeech-minutes.html#action02]

<trackbot> Created ACTION-4 - Set up follow-up via HCG [on Deborah Dahl - due 2011-11-11].

Issue 23

23. How does speechend and related events do timing?

<matt> Issue 23

Discussion: Michael: Our spec is missing explanations around the timing and how the information is reflected.

<scribe> Meeting: HTML Speech Incubator Group - 2011 TPAC F2F, Day 1

Resolution: define the data to reflect the source-time back into the events. Do it on all events that accept time (including result and speech-x). Note this timing is always relative to the "stream-time" and real time may be faster or slower than that.

<kaz> [ Thursday meeting adjourned ]

Summary of Action Items

[NEW] ACTION: ddahl2 to set up follow-up via HCG [recorded in http://www.w3.org/2011/11/03-htmlspeech-minutes.html#action02]
[NEW] ACTION: Editing team to update to SpeechReco [recorded in http://www.w3.org/2011/11/03-htmlspeech-minutes.html#action01]
 
[End of minutes]

Minutes formatted by David Booth's scribe.perl version 1.133 (CVS log)
$Date: 2011/11/04 13:53:33 $
