HTML Speech Incubator Group Teleconference -- 29 Sep 2011

<burn> Date: 29 September 2011

<burn> Scribe: Dan_Druta

<burn> ScribeNick: DanD

Web API

<mbodell> Discussing SpeechInputResults

<mbodell> Bjorn's mail: http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Aug/0033.html

<mbodell> Satish's proposal: http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Sep/0034.html

mbodell: speech results discussion

<mbodell> My mail: http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Sep/0043.html

bringert: I did not understand what the semantics were

mbodell: Three interfaces, like in Bjorn's proposal
... You get a speech result and inside you get an array of results

bringert: so the results are a history

burn: So the result number will not decrease

mbodell: It can decrease if you get corrections

bringert: what is the benefit of having everything?

burn: It's not the history. It's the combined one

mbodell: in a simple case you get a concatenation

bringert: How do you know if this is preliminary?

mbodell: we can add a boolean

Charles: Don't you want to know if it's final?

burn: when you get a Boolean in the event that is marked as final it means that particular indexed value in the results array is final
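
A minimal sketch of the semantics discussed so far, assuming each event names one index of the results array and carries a final flag for that entry. The names `ResultChunk`, `ResultUpdate`, `applyUpdate` and `combinedTranscript` are hypothetical and do not come from any of the proposals.

```typescript
// Hypothetical shapes for illustration only; none of these names come from the proposals.
interface ResultChunk {
  transcript: string;
  final: boolean; // true once this indexed value is final
}

interface ResultUpdate {
  index: number; // which entry of the results array this event refers to
  chunk: ResultChunk;
}

// Return a new results array with the indexed entry replaced or appended.
// Assumes indexes arrive without gaps.
function applyUpdate(results: readonly ResultChunk[], update: ResultUpdate): ResultChunk[] {
  const next = results.slice();
  next[update.index] = update.chunk;
  return next;
}

// The "simple case": the combined result is the concatenation of all entries.
function combinedTranscript(results: readonly ResultChunk[]): string {
  return results.map((r) => r.transcript).join(" ");
}

let recoResults: ResultChunk[] = [];
recoResults = applyUpdate(recoResults, { index: 0, chunk: { transcript: "send mail", final: false } });
recoResults = applyUpdate(recoResults, { index: 0, chunk: { transcript: "send email", final: true } });
recoResults = applyUpdate(recoResults, { index: 1, chunk: { transcript: "to Bjorn", final: false } });
console.log(combinedTranscript(recoResults)); // "send email to Bjorn"
```

Here the first entry is revised once and then marked final, while the second entry is still preliminary; a UI could render the concatenation on every event.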

Milan: I can see results coming in finalized but the recognizer shuffling them around

bringert: It's up to recognizer

Milan: let's have an example that shows that chunking

bringert: Why can't we have a preliminary that replaces the previous one until it hits final?

Milan: this is more powerful and gives more flexibility
... I thought the proposal was simpler, as was discussed in the F2F
... Why not have the recognizer send another complete hypothesis, replacing all preceding results?
... The combined result would be more efficient (less headers)

mbodell: result chunk size is not necessarily the same as finalizing chunk size

Milan: If you are to dictate an email, are we expecting to have lots of indexes?
... we have an unrealistic example here

bringert: Michael, can you explain the chunking a bit better?
... From a single result you get a single piece of semantics

mbodell: What can't you do with the array?
... in a non-continuous world it's not a problem

bringert: how do you see this from the UI point of view?

mbodell: you concatenate them each time
... The intent was to have the exact sequence
... if you have 3 results and one gets modified you get them
... you get a normal result anyway

ddahl: how does it work with nbest?

mbodell: I can see an API where you have results and one is wrong and gets replaced
... I agree it solves the simple use cases and it is more complicated for others

bringert: How do I know when to interpret the actions? We should get a final for everything

Charles: maybe the UI wants to show just finals. You want finals to come at a reasonable pace

mbodell: there's a tradeoff

Milan: I know it's not going to change but maybe there's an external input that might change it
... if the user changes one word or another it might trigger more changes and might send an updated array

burn: the reason I like final is that I can archive them

glen: If you have a command based on what the user said it might not be undoable
... there's the use case where the user doesn't care about preliminary. They just need the final

mbodell: this is more for online improvements

glen: the rerecognize should solve our problem
... if you have an 8-hour dictation and you have a correction in the first sentences, are you going to send the whole 8 hours?

mbodell: we should put a limit. We can solve this in the protocol

Milan: All I'm asking is a definition of final

bringert: we should define final as something that never changes

Milan: if we add a proprietary API it would be a spec violation

mbodell: I'm not sold on "final is final" but I'm OK with it

Milan: what would be the language in the spec then?

bringert: Final will be final and in the future we would add a correction event
... we would add another call back
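
A minimal sketch of "final is final" from the application side, assuming an entry marked final may never be overwritten and that any later change would have to arrive through a separate correction mechanism. The class `FinalResultStore` and its methods are hypothetical, not part of any proposal.

```typescript
// Hypothetical helper; illustrates the agreed definition of "final" only.
type ResultSlot = { transcript: string; final: boolean };

class FinalResultStore {
  private slots: ResultSlot[] = [];
  private archive: string[] = [];

  // Apply an update to one index. Returns false if the entry at that index
  // is already final, since a final entry never changes.
  update(index: number, slot: ResultSlot): boolean {
    const existing = this.slots[index];
    if (existing !== undefined && existing.final) {
      return false; // would need a future correction event instead
    }
    this.slots[index] = slot;
    if (slot.final) {
      this.archive.push(slot.transcript); // safe to archive: it cannot change
    }
    return true;
  }

  // Transcripts that have been finalized, in the order they became final.
  archived(): readonly string[] {
    return this.archive;
  }
}

const store = new FinalResultStore();
store.update(0, { transcript: "hello", final: false });       // accepted
store.update(0, { transcript: "hello world", final: true });  // accepted and archived
store.update(0, { transcript: "yellow world", final: true }); // rejected
console.log(store.archived()); // ["hello world"]
```

This also matches the earlier point that finals can be archived as soon as they arrive.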

mbodell: it is possible we can represent the result array as a read-only array in the result event

<smaug> evt.results would become evt.target.results

bringert: I'm fine with the way it is proposed right now
... would this be only in continuous?

mbodell: in one shot the index would not be larger than one

bringert: Maybe we should send different events for continuous and for one shot
... for one shot it works, and we have to have a boolean
... this is more the state of the request
... the nice thing about having it in the request is that you don't have to look in the event
... do we have any outstanding issues?

glen: what's outstanding is what the reco object would look like. Working on the proposal

TTS

<mbodell> TTS element section of proposal: http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Sep/att-0008/speechwepapi.html#tts-section

<mbodell> TTS JS API (not really filled in): http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Sep/att-0008/speechwepapi.html#speechoutputrequest-section

bringert: this is basically extending the HTML5 media element

mbodell: the media element is missing the mark

bringert: in the proposal we submitted I added another attribute "last mark"

<bringert> Bjorn's TTS proposal: http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Feb/att-0022/htmltts-draft.html

mbodell: the time mark event is something new?

bringert: the last mark is new
... you seek by time. Should last mark update then?

mbodell: in VoiceXML there's something similar

bringert: you will only update when you get a last mark
... if you want clock time you can store the marks yourself
... What if I want to send text? I guess I can use a data URI
... in my proposal I had a value element
... I would propose to keep the value element in the spec
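
A minimal sketch of the data URI idea mentioned above, assuming the text is UTF-8 and percent-encoded. The helper name `textToDataUri` is hypothetical, and whether a TTS element would accept such a URI as its source is an assumption, not something settled in this discussion.

```typescript
// Hypothetical helper: wrap plain text or SSML in a data URI so it could be
// handed to an element that only accepts a source URL.
function textToDataUri(text: string, mimeType: string = "text/plain"): string {
  // encodeURIComponent percent-encodes the UTF-8 bytes of the text.
  return `data:${mimeType};charset=utf-8,${encodeURIComponent(text)}`;
}

console.log(textToDataUri("Hello world"));
// "data:text/plain;charset=utf-8,Hello%20world"

// SSML would use its own media type.
console.log(textToDataUri("<speak>Hello</speak>", "application/ssml+xml"));
```

A value attribute or element, as in the earlier proposal, would avoid this encoding step for authors.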

mbodell: the issue is that media requires a source

bringert: it's a media element we extend
... we have the use case where we need to send the text
... I'm fine with the way it is right now

mbodell: If we can avoid it, that would be great

bringert: we should loop in the HTML5 WG for advice
... there's more language that has to be added

<mbodell> add "Implementations should support at least UTF-8 encoded text/plain and application/ssml+xml." and other likewise text

mbodell: do we need a SpeechOutput object?

burn: SSML 1 or SSML 1.1?

bringert: 1.1 is an extension, right?
... We should state that implementations should support 1 and 1.1

burn: 1.1 gives more flexibility and has more intelligence
... when is a good time to start the sync discussion between API and protocol?

mbodell: next week

Robert: it would be better if people send questions beforehand
