London Face-to-face Meeting

HTML Speech Incubator Group
23-24 May 2011

Schedule

Local logistics (member only)

Location: 4th floor
Google
Belgrave House
76 Buckingham Palace Road
London SW1W 9TQ

Minute Takers

        Monday          Tuesday
AM1     Milan Young     Marc Schröder
AM2     Dan Druta       Paolo Baggia
PM1     Debbie Dahl     Satish Sampath
PM2     Robert Brown    Michael Johnston

Monday 23 May

AM1

0830-0845 Welcome, Introductions, Logistics, Minute takers

0845-0900 Welcome from our host(s)

Goal of this meeting:

To establish the general direction

0900-1000 Working session: Establish "crucial decisions list"

A "crucial decision" is one that could a have sigificant impact on the architecture and thus should be completed before starting work on the API itself. A strawman partitioning of the discussion topics is:

Crucial decisions
  • Do apps need to work consistently across engines, and if so, what does "consistent" mean? If this is about the user experience, what does "consistent user experience" mean? Who is the user -- the webapp author or the end user?
  • What is the fallback behavior, if any, when capabilities are not supported?
  • What is the mechanism to use capabilities outside of the mandated interoperable set, and where (default engine, remote engine, or both) can they be used?
  • List the areas and/or situations in which interoperability is not guaranteed. An initial list is: reco lang, synth lang, synth voice characteristics, reco performance, synth performance, exact timings
  • What to do about continuous (ongoing) recognition. Is this different from ongoing keyword or hotword identification?
  • Are ASR and TTS separate APIs? What sharing is there between the two, and how does the linkage for, say, barge-in handling work?
  • Should we support multiple simultaneous grammars?
  • Do we support simulated reco input (semantic analysis of text)?
  • Do we support re-recognition?
Crucial decisions partially discussed
  • How to get audio/microphone/capture access -- discussed 5 May 2011
  • Which, if any, audio codecs are mandatory to support -- partially discussed 12 May 2011
  • Do we support audio streaming and how? -- partially discussed 12 May 2011
  • What is meant by "start of speech", "end of speech", and endpointing in general? How do transmission delays affect the definitions and what we want in terms of APIs? -- partially discussed 5 May 2011
  • How do we talk to a remote speech service, and how much do we specify? -- discussed 19 May 2011
  • How much of the detail of the API between the user agent and a speech service should be visible to the web application API and/or author?
Non-crucial (but still important) decisions
  • Do we support recognition from file?
  • Do we need a recognition context block capability?
  • How TTS fits in (building off media playing versus its own thing)
  • User preferences, privacy, and consent
  • How will eventing work in Javascript? What about event bubbling?
  • Review CSS3 for relationship to our work, if any; perhaps schedule joint call. See the initial email threads on this topic from Doug Schepers, TV Raman, and Daniel Weck.
  • Parameterization of speech recognizers. How does it work?
  • Which ASR parameters should we have (timeouts, speed vs. accuracy, etc.)? Which are optional?
  • Can app author provide feedback on reco result? For example, which n-best result was correct.
  • API or protocol to speech engines, particularly remote.
  • What are the builtin (standard set of) grammars and how can they be parameterized, if possible?
  • What is the mechanism for authors to include grammars directly within their app HTML documents? Is this inline XML grammars or data URIs, or something else?
  • How do we specify default recognition?
  • How do we specify local recognizers?
  • How do we define "recognizer"? What parameters can an author specify on abilities of a recognizer (TTS also)? What is the boundary between "selecting a recognizer" and "selecting the parameters of a recognizer"?
  • Should it be possible to have a user interface in the web app (in addition to or instead of in the browser chrome)
  • What level of customization of the user interface is allowed? One question is whether customization can affect how audio capture is indicated -- there can be security issues if the app can customize that notification. -- partially discussed 5 May 2011
  • What happens with the *audio*, *sound*, and *speech* events in the case of an error?
  • Sensitivity and timeout parameters for ASR
  • Do we want to support (the audio contained in) video streams?
  • What is the behavior of onspeechend if onsoundend is detected first?
  • Do we permit repeated parameters in the URI, body, or both? What do they mean?
  • What subtype(s) of multipart do we support for POST bodies?
Scribe
Milan

Dan: Review list of critical vs non-architecture changing decisions

Milan: Agreed to move recognition parameters to critical list

Architecture review

Bjorn: Summary from Dan's agreed design principles document

Dan: Web app author does not get to choose audio src

Debbie: If no audio capture API, then what?

Dan: We can still deliver audio, just that we don't want to design the general API

Dave: What about Bjorn's IDL?

Dan: Perhaps too soon given the number of discussion topics in flux

Bjorn: Good way to make consensus concrete

DanD: Other components of the architecture that were missing from the IDL
... for example user privacy

Bjorn: OK as long as the contents of the IDL are accurate

Dave: Time permitting would like to work on the IDL

Debbie: Might be a good break if we get numb

Dave: Let's do 30min tomorrow for IDL
... just to get discussion started

Milan: <agreed>

Do apps need to work consistently across engines?

Michael: What are engines

Milan: I interpreted engines as speech engines

Bjorn: Web app developers need consistency across browsers and engines, depending upon the type
... Let's enumerate consistency
... e.g. results that can be expected

1) Consistency between user agent + default engine

2) Across specified engine + user agent

3) Across specified engines

Robert: Another way to look at it is what developers are allowed to modify

Marc: What about consistency between default engine and remote?

Milan: Agree that default and remote engines should use the same API.

Milan: <agreed>

Robert: Some things like confidence threshold will obviously not be consistent

Dave: In VoiceXML built-in grammars were different
... because grammars were so small
... but large grammars are similar (like search or dictation)

Dan: In favor of an API for developers who don't know speech, but also useful once developers figure out speech

Robert: Perhaps better way to rephrase is to enumerate what could be different

Milan: The types of things that will be consistent will be the same across the categories

Milan: The categories are:
... 1) Consistency between different UAs using the default engine
... 2) Consistency between different UAs using a web-app specified engine
... 3) Consistency between different UAs using different web-app specified engines
... 4) Consistency between the default engine and a specified engine

1000-1030 Morning break

AM2

1030-1150 Working session: Make crucial decisions

Scribe
Dan_Druta

<bringert> restaurant: http://kazan-restaurant.com/

Previous session wrap up

Milan: How are we going to define the parameterization?

mbodell: It's undefined

bringert: ... Will be in the API

mbodell: eventually will be standardized

Milan: we're trading off functionality for consistency

bringert: not necessary. It will be supported

burn: I thought we want to specify a feature without specifying what engine

Robert: Number 2 is hypothetical.

bringert: Cases are : 1. Use default 2. specify feature but use default 3. Use specific engine

Raj: You should be able to specify URI and let everything default

Robert: you have to specify language

bringert: language should be a parameter in the API

mbodell: Why not use the web app language?

burn: Any language you specify is specific to an instance of an engine (in Voxeo)

satish: so it's an implementation detail

Robert: Engine and language are tied
... There are multi language services

burn: the engine is a service and should switch behind the scenes for the language
... We really only have two choices: default or specified

Robert: Use the default if you don't want control of the service, and specify a service when the web developer wants control
... A dispatcher is required anyway
... Service is what will expose the engine

bringert: Cross site might be an issue. Allow access only from the developer's web site

<smaug> http://www.w3.org/TR/cors/
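
For illustration, a minimal sketch of the cross-site restriction being discussed, using XMLHttpRequest against a hypothetical speech-service URL: under CORS, the browser only exposes the response if the service sends an Access-Control-Allow-Origin header matching the web app's origin.

  // Hypothetical speech-service URL; the CORS behaviour itself is standard.
  var xhr = new XMLHttpRequest();
  xhr.open("GET", "https://speech.example.com/capabilities");
  xhr.onload = function () {
    // Reached only if the service opted in via Access-Control-Allow-Origin.
    console.log("service reachable:", xhr.responseText);
  };
  xhr.onerror = function () {
    // Fired for network failures and cross-origin rejections alike.
    console.log("service not accessible from this origin");
  };
  xhr.send();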

mbodell: the default engine might be a proxy itself

bringert: We don't have agreement on the language
... Privacy is an issue in itself

burn: We should not have a topic on privacy

<raj> Correction to my comment: Developer should be able to specify characteristics of the engines/services such as language, any app-specific preloads etc...in addition to having the ability to specify the URI of the engine if known a priori

Milan: if we have one that specifies a service and another one with a different service with a different protocol are they supposed to be consistent?

bringert: No.

mbodell: it is possible you can code it so it is consistent

List the areas where interoperability is not guaranteed

Robert: Max number of grammars to support

Bringert: the list is fine

<ddahl> performance can include grammar size, number of grammars supported, accuracy, latency

<ddahl> performance might also include maximum size of nbest max

burn: are we talking about the list of inconsistencies or the list of guaranteed interoperability
... Just want to make sure we're talking about the same thing

bringert: the web developer writes an app and should know what happens in the user agent

mbodell: We're really trying to capture expected inconsistencies

satish: The same language being used across services does not mean they are interoperable

burn: I would not be surprised if you want, on purpose, TTS voices that have an accent

mbodell: Size and complexity should be added to the list

Milan: that's more a performance thing

Robert: Sensitivity is something over which we don't have control

bringert: exact semantics of parameters, including sensitivity and threshold, are expected to be on that list

Michael: however it happens in the browser, it is not expected that the performance will be the same

Milan: I don't agree; if it's not listed as inconsistent, it is assumed to be consistent

bringert: there should be an api for fallback and that should be consistent
... Time it takes to return and latency should be on the performance list
... We have not specified any optional events

mbodell: we should not specify the events yet

Milan: synthesis results should be in the performance list

Michael: One company builds their web site using one provider, but it will depend on the user agent, and it's not guaranteed that the input format is the same

bringert: the user agent might use a different codec and send a different format to the same engine on different platforms

PM1

1330-1530 Working session: Make crucial decisions (continued)

Scribe
Debbie_Dahl, ddahl

logistics page http://www.w3.org/2005/Incubator/htmlspeech/group/2011/05/f2f_logistics.html#MondayDinner

continuous recognition

burn: what is that?

bjorn: you want to dictate, continuous translation

robert: open mike scenarios, e.g. driving and issuing commands

bjorn: then you don't need streaming

milan: but you need to be listening all the time

satish: if there are disjoint utterances you could do that transparently

bjorn: in the command case there are natural pauses
... there's "always listening" vs. "streaming recognition"

burn: results streamed back is stronger than we actually need
... the results themselves are actually discrete, may be a sequence, that is, incremental results
... audio you have to send at regular intervals

glenn: audio is streamed, but results are not

bjorn: you could be sending complete results, longer and longer each time

dan: you really need to send the whole stream

bjorn: there's a web app API "on partial result"

michaelB: what does that partial result look like

robert: that doesn't necessarily make sense for dictation, which might be long

bjorn: that's the simplest API, might not always work

burn: one alternative might be a diff

michaelB: a structure like an nbest

bjorn: what if the web app can say, "snapshot"

robert: in most cases you don't want hypotheses, just the final result, whenever the recognizer decides: two events, "hypothesis" and "final"

burn: "final" is fixed

bjorn: it's an addition

burn: in a two-hour call, at some point you'll send a "final1", what is "final2"?

(everyone): everything since last "final"

michaelB: "final" or "hypothesis" is everything since the last "final"

paolo: it's possible you could change the past

burn: then you shouldn't send "final"

michaelB: it's a tradeoff between how often you send "final"

robert: you could do a rereco with a different LM if the topic changes

glenn: at some point you have to prune the nbest, don't want to keep appending

bjorn: you should get a final nbest with a final, and it doesn't change

robert: the recognizer has context, it doesn't necessarily report to the app
... what is the use case for sending hypotheses

michaelJ: you could have other ways to index results, e.g. id's, or we could just rely on ordering

bjorn: would prefer just to use ordering, point of "i's" is just to show interim feedback.

milan: suggest removing the "i's"

robert: why do you need "i's"?

paolo: user feedback

milan: users find interim results confusing, we removed it

bjorn: we don't have to require that "interim results" are fired

robert: for dictation, we should worry about correction, e.g. should we expect to look at the nbest, or just overtype?

michael: there is a purpose of defining them even if the web app doesn't display them, for talking about how much data needs to be sent.

robert: any decent recognizer will be able to decide when to send the "f"

milan: could send in a parameter to the recognizer that says how often you want results

michaelB: there are some use cases for getting intermediate results

milan: maybe we should reserve i's for v2

danD: another use case is to know that the recognizer is processing, the events may be useful

robert: then you could send events like "I heard speech", or "paused", or "got noisy", that would seem to be of a higher priority

michaelB: in API there may be structures that are built up over time
... e.g. for correction

bjorn: or web app could do this

michaelB: doing automatically would make it easier

bjorn: we all agree on "f's", not sure about "i's"

robert: the "i's" don't even need to look like "f's"

michaelB: if you only have "f's", then you might need to be able to correct results from previous "f's".

satish: i's don't need an nbest

bjorn: we don't need to specify "i's" at all

michaelJ: say you have dictation. say something, then pause, at the end you send "overall message, here's my final answer"

michaelB: if we only have "f's", they'll be narrower in scope, and you may need to correct "f's"

milan: "i's" were incremental, would be ok if you could replace an "f"

michael: the thing that drives the simple "f" story means that you don't have to have a structure that lets you correct things

milan: you can explicitly correct results in previous sentences

michaelJ: people have done things like change the name of a restaurant after you get the city and state

danD: what if the recognizer sends "correction available"

milan: that's complicated

robert: you're reserving the right to change previous "f's"

paolo: in any case, timestamps are needed, for example to coordinate display

burn: I would like to see more concrete examples of what's going into the f's.

milan: should not give that responsibility to the web app, too complicated

michaelB: it's up to the recognizer to say what's changed

bjorn: every result event has an id, so that recognizer can refer to them

michaelB: every time a recognition happens it changes the result array

burn: what if you can't allow results to change? e.g. live closed captioning, it's gone before you can change it

bjorn: web app could just ignore those changes in this use case

burn: only the recognizer knows what to change

marc: web app knows what to do about the change

bjorn: we have result, correction, and array, then web app knows what to correct

michaelB: that array would be useful, even if the web app could construct its own array

bjorn: the array is just a convenience

michaelB: it would be very convenient

satish: how is the web app going to get the array?

michaelB: in most cases you'll want corrections to be displayed

robert: we have to discuss how much we want the UA to do on the web app's behalf. What about using hotwords to make corrections, since many users will not be able to touch the keyboard?

bjorn: there could be "enable correction commands"

robert: it's hard to write the app that does corrections

glenn: this gets into maintaining the user experience across apps

michaelJ: what use cases are we including? do we include recognizing broadcast news, or rerecognizing people in a meeting? because those use cases bring up new requirements

danD: I would rather not get into complicated use cases

robert: we have to decide if we care about dictating documents; in that case we have to think about correction

michaelB: textArea could be long

bjorn: we have to limit the scope, we could say that correction is up to the UA

satish: would you want corrections to be different for different web apps?
... the hotword should be similar across web apps

bjorn: two use cases -- I want a field that users can speak in vs. a complete app

satish: you need to have a correction method across web apps

michaelB: you could have a hotword and then speak correct text

marc: correction isn't the only use for hotwords

danD: i would like to pass context as a hotword to the application

milan: you could have a manual process, or the fighting between the UA and the engine doing the correction, the speech engine should do it

burn: if you make the service responsible, when does the UA know that the user is done?

michaelB: in a multimodal context, correction could be done by the user in a different modality
... feedback from user might improve future recognition

milan: Google proposal had a feedback mechanism

satish: instead of making it voice-driven, you could allow the user to specify correction multimodally

bjorn: in the simple case you don't have voice correction

burn: your web app can always write some completely other thing

bjorn: in the single-shot case the result won't change, we can feed it back to the recognizer, but that won't change the result, what if we just made continuous recognition a sequence of these?
... the app keeps sending audio, it sends back results

michaelB: it makes sense for it to be the same structure, but not sure that they shouldn't change. in continuous reco, i think the results should be able to change

satish: we could work out the simple case and then add continuous case

burn: we want to make sure that we don't prevent the more complex use case, but we shouldn't spend two days working out the details.

milan: would you get the same events in the continuous case?

bjorn: you get a sequence of events, what can't we do with that? you wouldn't have continuous dictation support, but we could add that
... user corrections should be out of scope

milan: or engine should implicitly understand how to make voice correction

glenn: this is a simple API that we could build on

marc: how often would you send array of recognized results
... after every sentence?

michaelB: or more often

bjorn: simplest way is to eliminate corrective frames

robert: what about punctuation?
... not in SRGS
... service probably isn't going to add punctuation

bjorn: it's up to the engine to decide if you want to speak punctuation

marc: it depends on the grammar

milan: or add a parameter

glenn: does this matter for API?
... probably not

burn: this is a case where someone, perhaps bjorn, should write up a description of how continuous recognition should work in this simple case

michaelB: there are interesting parameters, such as whether you're more interested in semantics or utterance, also should I try to clean up text or just transcribe literally

bjorn: we have utterance and interpretation already

burn: it should be up to the service to return utterance or interpretation

glenn: should be a parameter to the recognizer, not two recognizers

burn: I don't think all recognizers will support this

paolo: found a use case for the "i", in the case of dictating a document, you might use commands, but the recognizer isn't sure if it should understand command or do continuous recognition, but has to wait until web app decides

glenn: have two grammars

milan: "final" are not final until the end

danD: after you receive the final "final" you clear the buffers and then you're done

michaelB: a different use case is speech-enabling a common web form, you may have both text areas and single field inputs. you want to be able to speak to the page, not click on each box individually
... for example, flower delivery application
... this would be complicated

satish: speech-enabling existing forms will be difficult

bjorn: someone could write a Javascript library to do this, and you could do a different one if you wanted

michaelB: also need to handle bindings. I want to have the user just have a natural speech experience, and it just figures out how to fill out fields.
... if each of the fields have a grammar, then app could figure out how to do bindings, this could be done by standard

paolo: if there's an open text then there's a problem

milan: at Nuance we reflect the DOM in the speech server

bjorn: this could be done by a Javascript library

robert: I would be skeptical if we thought that we could do this as part of the standard. we should just provide API for someone to build that app

michaelB: your continuous recognition case doesn't just mean speaking into a text field; it could be a combination of single shot case and continuous dictation

bjorn: do we get the SISR semantics for continuous, it needs to divide it into semantically meaningful units

michaelB: could have the same text with different semantics

bjorn: what if you have a grammar for the continuous case? you could have a grammar as well as an SLM

robert: that would be another special ruleref

michaelJ: you'll get all these events for fields, when doing dictation what is the grammar

michaelB: I think there should be a lot of default behavior but you can override
... if you want to have multiple grammars active, the match could be tied to a grammar

satish: what if you have two forms in the same field?

bjorn: you can't speak into multiple fields at the same time

michaelJ: whatever other mechanism that we have that would enable you to replace results, we should handle the simple case and not have to think about structure of complex events
... you have to be able to do simple recognition without having to think about the complex

satish: filling multiple forms on the same page isn't that simple, you might have to use multimodal interaction to choose the form that you're filling out.

glenn: how much does the web developer enable

burn: any other continuous recognition topics?

raj: continuous recognition isn't for dictation only

robert: could have continuous dictation for SRGS grammars, e.g. games, and I would want to activate and deactivate grammars continuously
... all time-locked in some manner. stream of audio needs to be synched up to when grammars are active

milan: could have race conditions

robert: yes, would have to be a best effort

michaelB: this could be the case of a change in any parameter

burn: every chunk has its own set of active grammars, it doesn't change in the middle of a result

paolo: even the application can decide that it wants to add more constraints

milan: you need to know which grammars are active

michaelB: next version of EMMA will have this

michaelJ: does this include recognition of e.g. broadcast stream, or is this just a single user in front of a screen

burn: we are assuming that everything comes from a single speaker, but we don't expect our services to know that

michaelB: this might change in the case of microphone arrays

raj: you could have a game with several participants

burn: overlapping speech is the problem

michaelJ: if you have a service that's capable of differentiating speakers, you wouldn't want to stop anyone from doing that

marc: if you have a dialog between the machine and the user, you have TTS output, with an open microphone, how do you stop the engine from attempting to recognize the TTS?
... is this in scope?

michaelB: two aspects, do we have to worry about this for TTS from our API, or what if the UA is producing sounds?

robert: we have to leave this up to the UA

glenn: are you going to feed audio from, e.g. a teleconference, back to the speech service?

burn: maybe the UA doesn't even know what's going on

robert: it's not us

danD: this might be a reason not to separate ASR from TTS
... if it knows about what it said, it could separate the TTS from what it hears

michaelB: we have to figure out how to change parameters. we talked about both open mike and dictation, is there anything we have to talk about with the open mike?

burn: have not captured any agreements

bjorn: writing up different kinds of events

robert: we might want to change parameters over the course of a recognition

milan: that information would be transferred to server.

michaelB: but the parameters would not change in the middle of a result

michaelJ: there's what we would like people to be able to do, and there's how to make it happen, can you be streaming audio, buffer it a bit, change parameters?
... we have to have that discussion

michaelB: back to open mike/hotword. there will be long periods of time when you're not getting matches, what happens when you don't get a match?
... do you actually fire nomatches?

burn: in VoiceXML you don't fire nomatch

michaelB: how do you handle nomatch?

glenn: what if multiple results, one on hotword and one on dictation? would that be helpful for corrections? in hotword case it might be interesting to know there was speech

michaelB: it would be more convenient for results to be together

michaelJ: we have that in EMMA

michaelB: it's still interesting to consider nomatch in the open mike case

milan: maybe you should just ignore it

burn: agreements -- for continuous recognition we must support changing parameters

milan: didn't we say we didn't want to deal with echo cancellation?

???: not sure we agreed on that?

milan: we agreed that we wanted to handle continuous SRGS

1530-1600 Afternoon break

PM2

1600-1800 Working session: Make crucial decisions (continued)

Scribe
Robert_Brown

fallback behaviour

Bjorn: what are the possibilities: specify other things, or get an error?
... different from errors
... and stuff

Raj: fallbacks for specified language and specified speech service

DanD: who's in charge of the fallback?

Burn: fallbacks should be to things we know actually exist
... where there are a set of potential fallbacks, they should be author selectable
... fallback in SSML where a requested language isn't available is one example of a precedent

MichaelB: codecs don't necessarily have fallbacks due to other restrictions

Burn: SSML always tries to say something even when the requested resource isn't available. this is partly a result of VXML's queueing

MichaelB: not only might the requested speech service not be available, but also the default service

Bjorn: but in the HTML IMG tag you don't say that if this image doesn't work here's the fallback
... better to throw an error

DanD: part of the API is to have a discovery API

Burn: if you throw an error, you need a discovery API. if you have a fallback, you don't

MichaelB: the fallback for a specific service may not necessarily be appropriate
... if you want to fall back to the default, you can catch the event and swap services. the default should be to not have speech

Bjorn: Propose: if you request a service, and it is not available, an event fires that the app can catch and select another service

MichaelB: there is no fallback to default service

MichaelJ: would there be an automatic behaviour that greys out the microphone icon? or should it only happen through script?

DanD: it's not fallback on the service, it's fallback on the user experience

Olli: querying a service beforehand to see if it's available could expose privacy issues.

DanD: so enumerate all service capabilities

MichaelB: except that in most cases it'll just discover the user's preferred language if used against the default recognizer

Burn: propose: the web app API should provide a way to determine if the service is available before trying to use the service

Raj: what does that buy you? it could still go down

Bjorn: it could remove the UI
... do we ask "what languages do you have?" or "do you support this language?"

MichaelB: or just try it and see if it fails, although that would be bad

Burn: what other potential fallbacks should we discuss?

Milan: protocol? continuous reco may not be available

DanD: checking for capabilities kills two birds with one stone, since it also enumerates capabilities

<smaug> Would API like this work: Speech.getService(ServiceURL, { configuration: foobar}, success_callback, error_callback);
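
Building only on smaug's strawman above (none of these names are agreed API), a sketch of how an app might handle an unavailable service by falling back to a speech-free UI:

  // Speech.getService, the URI, and the configuration are all hypothetical.
  function onServiceReady(service) {
    showMicrophoneButton();            // app-defined: enable the speech UI
  }
  function onServiceError(error) {
    hideMicrophoneButton();            // app-defined fallback: no speech UI
    // Alternatively, retry here with the default service instead.
  }
  Speech.getService("https://asr.example.com/reco",
                    { language: "en-GB" },
                    onServiceReady,
                    onServiceError);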

Satish: is there a precedent in HTML to change the UI when a service isn't available?

Bjorn: the google translate app only shows UI when TTS or SR are available

Debbie: the difficulty is that you can't see what failed

Milan: ok if we have defined error codes

DanD: inefficient to keep requesting over and over until finding a parameter set that works

MichaelB: because it's a method off the object, it can test with anything on that object that you can set. e.g. the grammars

Debbie: is a service obliged to tell the truth about its capabilities?
... e.g. you want en-au, but it only has en-uk, but it decides this is okay

Michael: return value could give some sort of qualitative capabilities measure

Burn: easier to do vendor specific stuff through polling than through a get capabilities/languages/etc API

DanD: boxing ourselves in if we try to enumerate a specific set of capabilities. need something extensible

Bjorn: HTML APIs normally have separate methods, rather than one huge structure

MichaelB: the protocol could still just query capabilities in one chunk, and the API could break it up

MichaelJ: so if you want to check whether a grammar is supported, should the service download your huge grammar?

Burn: could be a prepare/prefetch function

Bjorn: grammar should be from the same site as the page

MichaelB: disagree. just like images, etc, apps can fetch grammars from 3rd party sites

MichaelJ: will have a rich set of defined errors

Burn: the API must provide a way to query the availability of particular capabilities of a service
... and the API must be able to enumerate the capabilities of a service

Milan: since querying language is only an issue in the local case, maybe we should only restrict this locally, but allow it remotely

Bjorn: or there's a setting in preferences, or not in incognito mode

Satish: don't we already need to get the user's permission prior to doing reco? if so, why not require permission before returning capabilities?
... should be governed by the same privacy policy

Bjorn: don't want to pop a permission UI just to know whether or not to render a microphone button
... up to the user agent how to interpret the request

Milan: if the user agent doesn't want to reveal the private info, it should say so

Burn: web app only talks to UA. UA discovers capabilities of a service, and decides what to expose to the app

Olli: use same permission/security/policy API as whatever gets specified somewhere

<burn> tentative text: The API must provide a way to ask the user agent for the capabilities of a service. In the case of private information that the user agent may have when the default service is selected, the user agent may choose to give incorrect information to the web app or decline to answer.

MichaelJ: still worried about error codes. there's an infinite number of ways something can fail

Satish: error codes are specific, text isn't because it needs to be localized

Milan: the user agent should not provide incorrect information. better for it to say "no comment"

Marc: is it just the language that's a privacy issue? e.g. list of installed voices

Debbie: haven't spoken about speaker dependent recognition, but exactly who the person is may be private

MichaelB: there are also things in the recognition result that may be private. e.g. a guess at the speaker's gender

Raj: why enable the user to trust only a specific set of speech services?

MichaelB: because the web app doesn't necessarily have access to the audio

Bjorn: if we give an app reco results, we could give them the audio, gender, age, etc, given that there are APIs coming that will allow direct access to the mic

MichaelJ: there are subpoenable issues. having a transcription of somebody allegedly saying something is different from having the audio recording

Bjorn: user agent can hide capabilities or deny the use of capabilities it doesn't like
... or services

MichaelB: or web apps

Debbie: different user agents could deny different things, which would affect consistency

MichaelB: consent could be parameterized on different dimensions

Burn: comparable to flash on ipad

multiple simultaneous grammars

Bjorn: wouldn't it be nicer to have one grammar that refers to the others

Milan: more convenient to activate/deactivate specific grammars

Bjorn: how do you specify weights?
... how about semantics from each?

MichaelJ: weights don't work the same

Debbie: or you want to reuse somebody else's grammar, or a built-in

MichaelB: can't ruleref to an SLM
... but can run them both at the same time
... and if you have grammars for each field, could enable/disable grammars depending on what's in focus

Burn: multiple simultaneous recognitions is not the same as multiple simultaneous grammars

Bjorn: would really like to ruleref to SLMs

Burn: these topics have come up repeatedly in voice browser working group (new rulerefs, and SLM format)
... SRGS doesn't say that it doesn't allow SLMs for builtins

MichaelB: it sounds like we all agree that multiple simultaneous grammars is okay

Bjorn: no impractical limit on number of SLMs & SRGSs

Milan: and we agree it should support weights

MichaelJ: multiple requests to different services are different recognitions

Milan: and we must support multiple simultaneous requests to services (same or different service)

Satish: wouldn't be hard to implement in Chrome

Glen: what's the use case for this?

Satish: there's no requirement that the audio for both requests start at the same time

Burn: nobody objects to doing this in the API as long as they are logically independent

DanD: what's the limit?

Burn: practically, there's a limit, but the spec doesn't need to state it

Glen: does this complicate consent?

MichaelB: just need to indicate that something you consented to is listening, not exactly which ones

MichaelJ: need to discuss how we specify weights
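
A hypothetical sketch of the agreement on multiple simultaneous grammars with weights, anticipating Bjorn's later suggestion (Tuesday PM1) that grammars be a sequence of objects each with its own weight; all names here are placeholders, not agreed API.

  // 'SpeechInputRequest', 'grammars', 'onresult' and 'start' are placeholders.
  var reco = new SpeechInputRequest();
  reco.grammars = [
    { src: "https://example.com/commands.grxml", weight: 1.0 },  // SRGS grammar
    { src: "https://example.com/cities.grxml",   weight: 0.5 }
  ];
  reco.onresult = function (event) {
    // The EMMA result could indicate which grammar produced the match.
    console.log(event.result);
  };
  reco.start();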

Tuesday 24 May 2011

AM1

0830-1000 Working session: Make crucial decisions (continued)

Are ASR and TTS separate APIs?

Robert: what does it mean?

bjorn: we discussed on the call that it should be possible to implement ASR without TTS or vice versa
... Microsoft proposal used some shared parts of the API
... reason for linking them is barge-in
... but not much more

burn: there are common use cases for only-tts output
... not so many use cases for only speech input

bjorn: voice search

robert: the scenarios where they work together are where we need to find out if there are any weird interactions
... so asr and tts should be part of the same api family

danD: where should they be closer together: on the developer side, or on the engine side?
... is it just a convenience for the developer?

robert: hopefully the apis are loosely coupled

bjorn: except for barge-in or starting speech output after speech input, are there any reasons for coupling the two?

satish: not different from playing non-speech audio, or video, --- what is special about tts?

burn: noise cancellation, expectation that people have that when you speak to a machine, that it will stop speaking when you speak to it.
... that's different for a youtube video playing at the same time.

robert: as a developer, what do I need to build a good interface?

michaelB: maybe too early to discuss the topic

bjorn: what do we want to be able to do? I start speaking, something that is playing stops

burn: ... within a certain time frame

michaelB: and we need to know the marks from SSML

danD: should asr and tts be allowed to be different services?

burn: yes, can be completely different URIs

michaelB: you may want to coordinate asr and tts, for example the tts output should wait before it starts until a grammar is loaded.

burn: no requirement to have the same service for asr and tts.

bjorn: the service, or the UA could coordinate the two. shouldn't have to be the webapp to do this.

danD: so if the URI for asr and for tts is the same, should the two be configured separately?

burn, bjorn: yes

bjorn: would the barge-in be already solved using current events and javascript interfaces?

burn: in mrcp, we had a requirement on ordering of events
... where the same service did asr and tts, the service could do a faster coordination between the two internally.
... but I don't think we should impose such a requirement here.

robert: usually when thinking about barge-in, we think about low-latency contexts. but on the web, we have long latencies usually
... so just coordinating events might not solve our issues.

bjorn: we talked about requiring a low-latency speech detector in the client. would that be sufficient for barge-in?

michaelB: for certain cases of barge-in, you need the more advanced speech detector in the speech service.

robert: what is the scenario?
... 1. while the tts is speaking, interrupt as soon as there is any sound
... 2. a lot of background speech all the time, but then there is a keyword

bjorn: what is an acceptable latency?

burn: as fast as possible

milan: a round-trip to the server and back to the client is unacceptable

burn: agree

bjorn: if the speech server is taking the decision to stop output, that round-trip is inevitable

joint clarification: two round-trips would be unacceptable, but stopping it locally after the speech service takes the decision to stop is fine.

milan: need to preserve timing.
... asr service needs to tell tts when barge-in occurred, so tts can check how that relates to its ssml marks

glen: the only thing that matters is what audio the client has played, not what the tts service has sent.

robert: for both output and speech input can keep track of times, put both on the same clock.

michaelB: if client means UA, I'm fine with that. shouldn't have to be the webapp.

bjorn: how fine-grained a timing do we need?

burn: in voicexml 2.0, we had mark; in voicexml 2.1, we added offset from marks, in milliseconds.

milan: relevant user perception is around 30ms precision

glen: what is the use case for knowing the timing?

burn: it would be nice to know which word that was when the user barged in.

michaelB: if you hear a list, you want to know if you are 2 ms into the list item, or 2997 ms into the item

glen: don't see the reason for wanting to state "1.7 seconds after the mark"

burn, michaelB: disagree strongly. this is exactly how humans work.

burn: what should be the granularity?
... for the client it would be easy to specify to the millisecond.

bjorn: ok, so what API do we need?

robert: asr provides a client-adjusted time stamp for a recognition result.
... then it should be possible to ask tts, "give me the mark and offset for this timestamp"

michaelB: this could be done by the UA rather than the webapp

robert: agree, if webapp author makes it explicit that "this is the tts I care about", this can happen behind the scenes.

bjorn: this could be done in a javascript library, on top.
... say you have two tts...

burn: as long as the timing information is available, this can be done.
... client needs to tell speech service, "in my local time, this is when it started"

michaelB: not convinced, the simple case of one-stream-one-stream barge-in coordination should be part of the API

burn: I could live with either choice.

robert: tend to agree with bjorn, because we have the two types of barge-in. the UA cannot be expected to keep track of the hot word case.

burn: it would be very easy to tell the UA: on speech start, stop tts output.

michaelB: agreement that we want to be able to support the complex interactions of multiple asrs and ttss
... would this complicate the API?

bjorn: it might complicate the API but simplify the apps

robert: have implemented this API quite a while ago.
... relatively simple api.

michaelB: make the 99.5%-of-the-cases scenario really simple.
bjorn: but this would be a single line of javascript code...
... do I understand correctly that this would be a convenience, not offer new functionality that you couldn't do otherwise?
... yes, to make web app developers' lives easier.
... probably doesn't make the architecture much different.

bjorn: right, so this is no longer a crucial decision.

burn: it's like a "play-and-recognize"

olli: similar case happened in other cases: method appeared first in script libraries, then the browser vendors implemented it.

bjorn: different design strategies: I prefer providing the simple basic stuff as a starting point.

robert: when we (MS) prepared our proposal, we wrote asr and tts parts separately, then discovered some parts are exactly the same.

(dan is capturing agreements reached)

<burn> Decisions added during this session were:

<burn> - We disagree about whether there needs to be direct API support for a single ASR request and single TTS request that are tied together.

<burn> - It must be possible to individually control ASR and TTS.

<burn> - It must be possible for the web app author to get timely information about

<burn> recognition event timing and about TTS playback timing. It must be possible for the web app author to determine, for any specific UA local time, what the previous TTS mark was and the offset from that mark.

<burn> - It must be possible for the web app to stop/pause/silence audio output directly at the client/user agent.

<burn> - When audio corresponding to TTS mark location begins to play, a Javascript event must be fired, and the event must contain the name of the mark and the UA timestamp for when it was played.
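
A sketch of what the mark decision above could look like in script; 'TTSRequest', 'onmark' and the event field names are assumptions, only the requirement (mark name plus UA timestamp when its audio starts playing) comes from the decisions.

  // All names are placeholders for whatever API is eventually specified.
  var lastMark = null;
  var tts = new TTSRequest();
  tts.text = "<speak>Flights to <mark name='city'/> London</speak>";
  tts.onmark = function (event) {
    // Per the decision: the event carries the mark name and the UA-local
    // timestamp at which the corresponding audio began to play.
    lastMark = { name: event.markName, time: event.timestamp };
  };
  tts.play();

  // When a recognition result arrives with a UA-local start time, the app
  // (or the UA) can compute the barge-in position relative to the last mark:
  function bargeInOffset(recoStartTime) {
    return lastMark ? recoStartTime - lastMark.time : null;
  }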

1000-1030 Morning break

AM2

1030-1150 Working session: Make crucial decisions (continued)

Scribe
Paolo_Baggia

In Crucial Decision:

"What is the mechanism to use capabilities outside of the mandated interoperable set, and where (default engine, remote engine, or both) can they be used?"

Milan: Protocol impact

Dave: design API first then protocol

Michael: We wrote something on the POST body

DanB: Yes, but we can have more
... we touched this in other discussion

Bjorn: One thing is parameters
... or we can put them in the URI ...

DanB: I don't like only there

Milan: In POST body
... Naming scheme?

Bjorn: If it is in the URI it doesn't matter

Satish: It will be server specific

DanB: I don't like calling them 'extensions'; I prefer vendor- or server-specific

MikeB: More interesting is to figure event in return

ddahl: Important to define the whole set of options

MichaelJ: If it is around ASR, emotion detection, prosody detection ...

Bjorn: Most of them can be in EMMA, if they are at the end

MikeB: If you need event-based coordination, it is difficult to see how to have that in EMMA

ddahl: You can configure server

MichaelJ: In EMMA you can add more in the content

Marc: We talked about filtering EMMA

MichaelJ: Filtering is to do ???

MikeB: For the extensions you need coordination and passing, so the UA is free not to use them, like emotions or other stuff
... The UA has a role in being active or accepting extensions

Olli: Handling events will be trivial, speech service specific events are more complex

Satish: Add extensions to events to handle to the service
... extension properties to be sent to the service

DanB: Aside from naming, are there other questions

MikeB: There was some talk of them being strings, but sometimes they might be binary or structured data.
... not clear strings are enough

Bjorn: Do we need vendor specific events?

DanD: you have components that you want to parameterize ... how do you specify them for a standard service
... there are multiple types; for certain ones a URI might be best, especially for initialization

Bjorn: I don't say URI only ...

DanB: We should not limit extensibility

Bjorn: How much provision do we put in the UA API for parameterizing the server?

MikeB: vendor specific handler

DanB: We have the API and we have events that are received by the web app author

Milan: What about sending a grammar, or a property
... if we allow any parameter at any time

Bjorn: Is it enough to allow any parameter with a free name and value, but not events back or calling any method?

Olli: Generic extension events might be useful, then you can have binary

Bjorn: We propose three:
... 1. set parameter with name and value
... there might be standard names and vendor specific "x-"

Discussion on agreement on names

scribe: about values: objects or strings?

DanD: the actor is web developer

Olli: value will be JS object

<agreement>
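
A minimal sketch of the parameter agreement just reached; 'reco' and 'setParameter' are placeholder names, but the shape follows the discussion: free-form names (vendor-specific ones prefixed, e.g. with "x-") and arbitrary JavaScript objects as values.

  // 'reco' stands for some recognition request object; names are illustrative.
  reco.setParameter("maxnbest", 5);                 // a standard parameter
  reco.setParameter("x-acme-noise-model", {         // vendor-specific, prefixed
    profile: "car",
    aggressiveness: 0.8
  });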

Robert: standard way to serialize the value

MichaelJ: The action and method is recognize; if we do verification the answer will come back in EMMA ...
... after speaker recognition I look in a DB and put it in the EMMA results ...
... return HTML in the panel to change the interface

DanB: We don't need extension for EMMA, because it is extensible

Bjorn: number 2
... ok for speech service to add whatever extra in EMMA

Milan: Must be in EMMA?

MikeB: Yes

MichaelJ: there are different ways to do it in EMMA: put it in interpretations and use emma:info for additional annotation
... or have derivation chains
... For recognition, it might return emma:literal, but not in the API. This is bad.

Bjorn: We can put it in the protocol or leave it to the UA
... The web app doesn't need to look at EMMA

MikeB: The web app might not reflect emotion as easily as confidence or other simpler stuff

MichaelJ: Should the service return both EMMA and special fields?

Bjorn: I don't mind

DanB: You can put application-specific content in EMMA today, so we are recommending that

MikeB: We are not specifying the exact EMMA; use EMMA in the way you like

DanB: It doesn't add beyond EMMA

MichaelB: If you have EMMA ...

Paolo: is there agreement to have EMMA and also other parameters like utterance and confidence?

Bjorn: yes it is

Milan: Put EMMA in EMMA

Bjorn: Yes

Milan: it seems asymmetrical not to have multipart
... you have to encode

Bjorn: You have to encode in any case
... it seems a special case to add binary data, but it is doable

<agreement on point 2>

Bjorn: point 3
... event that come from the service to web app a part from standard ones
... we need to have register event listener

Olli: No normal DOM event
... it is arbitrary

Bjorn: Browser might reject

Satish: Allow browser to register a name and an event

Bjorn: to avoid conflicts in future

MikeB: What if, through the extension events, I want to send a standard event?

Satish: only difference of a single listener

DanB: All listening to the same event and parsing

Bjorn: You set multiple event handlers to look at the type ...

Olli: It is annoying

Bjorn: If we have a method like set-extension-event handler

Olli: You have events in the DOM tree; in Mozilla most are prefixed events ...

Bjorn: We can use that

Olli: We can use CustomEvent in DOM3 Events
... very flexible

<smaug> http://dev.w3.org/2006/webapi/DOM-Level-3-Events/html/DOM3-Events.html#interface-CustomEvent

Olli: you can specify any event

<bringert> http://www.w3.org/TR/2009/WD-DOM-Level-3-Events-20090908/#extending_events-Vendor_Extensions

Correct URI is: http://dev.w3.org/2006/webapi/DOM-Level-3-Events/html/DOM3-Events.html#interface-CustomEvent
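
For reference, a minimal example of the DOM 3 CustomEvent interface Olli points to. CustomEvent and initCustomEvent are standard; the event name, payload, and 'recoElement' (any DOM EventTarget standing in for the speech object) are made up.

  // Web app listens for a vendor-prefixed event from the speech service:
  recoElement.addEventListener("x-vendor-emotion", function (event) {
    console.log("service-specific payload:", event.detail);
  }, false);

  // The UA (or a shim) would dispatch it roughly like this:
  var ev = document.createEvent("CustomEvent");
  ev.initCustomEvent("x-vendor-emotion", true /*bubbles*/, false /*cancelable*/,
                     { emotion: "happy", confidence: 0.72 } /*detail*/);
  recoElement.dispatchEvent(ev);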

Milan: We might want to send timings
... tracking timing in the client might be more robust

Bjorn: It is more a protocol matter; when the protocol starts recognition, the UA should send the speech service the time audio capture began

DanB: it seems a protocol detail

Bjorn: I think it is important to agree on it
... Agreement on point 3 is to use DOM3 extension events

Marc: Very fine-grained events tied to the audio

Bjorn: you could send event with timestamps, sent in advance

Marc: you have two independent streams in the case of barge-in

MikeB: It is doable, but the bar is raised for doing it.
... At this time fire this event

Bjorn: Fine
... Agreement: speech service should be able to instruct the UA to fire a vendor specific event at a specified time in the future

Marc: not specific time, but on streaming audio

Bjorn: It has to be relative to a client time, or multiple absolute times

MichaelJ: The audio and video might be done by a single object, to avoid problems

Marc: Ideally yes

Bjorn: Requires having EMMA extra stuff for TTS

Marc: Has to be possible to be played through something else

Robert: we can agree to fire an event at an offset of the streamed audio

MichaelJ: Use TTS to point at URI which mixes timestamps, etc
... ship arbitrary requests

Bjorn: You don't use TTS API
... use Audio tag

MichaelJ: I have two more:
... a. do audio/recognition with string back, user corrects by MMI, do we provide support for text?

Bjorn: There is a point: "Do we support simulated reco input (semantic analysis of text)?"

MichaelJ: second
... b. You want to send other stuff besides audio; people talk and click to send.

Bjorn: Could you hijack the click in place
... you could put it in a parameter

MikeB: Use EMMA

MichaelJ: But EMMA is for input
... for instance the UA captures InkML

Bjorn: to be sent to the service with voice

MichaelJ: We can send parameters not only at the beginning for that.

Bjorn: We can do that to intersperse things ...

DanB: It seems off topic

Bjorn: We can do a protocol to do that

Milan: timing information needs to be sent

Bjorn: This is not related to number 1, this is a protocol issue

DanB: Any parameters sent from UA to service should include a timestamp

<agreement>

MikeB: A parameter if it is at the beginning, an event if it is in the middle of something

MichaelJ: In the JS API there is a set-parameter to send an encoded image

MikeB: yes

MichaelJ: Maybe sending it as multipart ...

Bjorn: You can chunk the audio and intersperse with that. It is not standard, but doable.

DanB: A good topic for the PM is to talk about the protocol

Bjorn: I'd like all the crucial Web app API first

DanB: yes if we have time

DanD: I have classes of parameters: authentication and privacy (who is able, at what time, ...)

<ddahl> this general parameter-setting mechanism could be very general, e.g. you could send a parameter like "the sun is shining", and since parameters can be sent at any time, you have a general eventing mechanism

Bjorn: I don't see any privacy issues to standardize

MikeB: It might be the case that I don't want to communicate with services that are not secure ...

Bjorn: It seems a UA specific setting

Milan: https URI

Bjorn: User can check box, like block images in pages

MikeB: or cookies secure

Bjorn: Web apps loaded over a secure connection should use only a secure connection to the speech service
... there should be some recommendation

Robert: The bar should be the same as in other contexts

MichaelJ: We have to say, we must support https in addition to http

<agreement>

DanD: If the web app uses https ...
... if the web app uses https and the URI isn't https, there should be a pop-up

Bjorn: We cannot standardize it because it is part of browser
... important to capture:
... 1. must support https
... 2. the default speech service implementation shouldn't use unencrypted communication when the web app uses secure communication

<agreement on point 1>

Discussion on point 2

Milan: Should not use unencrypted

<burn_> discussion agreements during this session were:

<burn_> - It must be possible to specify service-specific parameters in both the URI and the message body. It must be clear in the API that these parameters are service-specific, i.e., not standard.

<burn_> - Every message from UA to speech service should send the UA-local timestamp.

<burn_> - API must have ability to set service-specific parameters using names that clearly identify that they are service-specific, e.g., using an "x-" prefix. Parameter values can be arbitrary Javascript objects.

<burn_> - EMMA already permits app-specific result info, so there is no need to provide other ways for service-specific information to be returned in the result.

<burn_> - The API must support DOM 3 extension events as defined (which basically require vendor prefixes). See http://www.w3.org/TR/2009/WD-DOM-Level-3-Events-20090908/#extending_events-Vendor_Extensions. It must allow the speech service to fire these events.

<burn_> - The protocol must send its current timestamp to the speech service when it sends its first audio data.

<burn_> - It must be possible for the speech service to instruct the UA to fire a vendor-specific event when a specific offset to audio playback start is reached by the UA. What to do if audio is canceled, paused, etc. is TBD.

<burn_> - HTTPS must also be supported.

<burn_> - using web app in secure communication channel should be treated just as when working with all secured sites (e.g., with respect to non-secured channel for speech data). -

<burn_> - default speech service implementations are encouraged not to use unsecured network communication when started by a web app in a secure communication channel

1150-1230 Lunch in Cafe Royle (5th floor)

1230-1330 Break, coffee in the Atrium (4th floor)

PM1

1330-1530 Working session: Make crucial decisions (continued)

Scribe
satish

IDL

Bjorn: started writing the IDL because it seemed like we agreed on a lot of things and an IDL would make it clearer.

<satish_> MikeB: are we reviewing the proposal and making corrections to it?

Bjorn: for the JS api, grammar should be a sequence of objects and each with its own weight.

Dave: grammar should not be optional and there should be a built-in value

DanB: perhaps better not to get too much into implementation details

Bjorn: agree with Milan that there should be a language attached to grammar so that you can specify a language for built-in grammars too

DanB: similar to VoiceXML

Robert: should built-ins include language?

Bjorn: perhaps better to not have it as part of the URI

Robert: If we are making up a scheme for built-in URIs, we could include language in there too

MikeB: In the HTML API it may be understood implicitly based on which element the binding is done to.
... we need a sequence of objects, and each object has properties etc.

Robert: I see 'maxresults' property and there is potentially more like that.

Bjorn: should we have a 'setParameter' method or have them as separate properties?

Robert: prefer typed programming paradigm
... for well known properties have them as named attributes and have an additional setProperty method for other attributes.

Milan: could have a properties object which has all properties on it.

Robert: typical to have a flatter API

MikeB: supports property type apis as well instead of a setProperty method

Bjorn: agreement that standard parameters would be settable as dot-properties but they can also be passed in to setParameter method.
... Should we discuss other standard parameters to include in this IDL?
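
In other words (names illustrative, taken from the 'maxresults' example above), the agreement is that these two forms would be equivalent for standard parameters:

  reco.maxresults = 5;                  // standard parameter as a dot-property
  reco.setParameter("maxresults", 5);   // same parameter via the generic method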

DDahl: easier to think about after this f2f and discuss over later.

Bjorn: the start/stop/cancelSpeechInput methods were in general agreement

Robert: the names seem verbose, could shorten in a later iteration

Bjorn: <describes the events>

Dave: is there a strong requirement for onsoundstart and if not could we trim it?

DanB: these were agreed upon in past discussions

MikeJ: could we have an event for energy level changes?

Robert: seems like inventing a microphone api and we should rather use an existing api if available

Bjorn: once such an api is available we plug that into the SpeechInputRequest object etc.

Robert: Is a start guaranteed to also fire an end eventually?

Bjorn: yes, eventually
... the email I sent yesterday expands on the 'onresults' event and the event object received.

MikeB: We haven't decided how to handle the NO_MATCH case; we could either fire an error event or an onresult event with an empty list.

Bjorn: the result event has an EMMA document for each result item, as discussed earlier.

Dave: the IDL looks quite comprehensive, we perhaps need a pruning exercise to reach a core api instead of solving all the issues

DanB: I have the opposite concern; I don't want a perfect api, but I also don't want a toy api which only works for very few cases and would have to be revisited soon.

Dave: worried that browser vendors may not adopt a complicated api

Bjorn: my only issue now is when we get to the protocol which may be tricky
... had the same concern as Dave initially but things have been going smoothly

Discussion about Bjorn's email writeup about continuous recognition

Bjorn: 3 different events described, 'result', 'intermediate' and 'replace' events.

MikeJ: should the replace event id be a range or an individual id?

Bjorn: send separate replace events for each id

MikeJ: problematic if the service mistakenly segmented into 2 utterances when the final result should be a single utterance

MikeB: could use empty utterances to achieve this, send one full utterance and then an empty utterance

Bjorn: that wouldn't work if we were to segment into more pieces, e.g. 3 utterances instead of 2.

MikeB: having more options than a simple replace is perhaps necessary

DanB: the current replace api works fine for the common case, could add insert later if required.

Bjorn: one wrinkle: if you change parameters such as grammar in the middle, audio since the last result needs to be buffered and recognized again.

Milan: there is no feedback mechanism/method in this IDL

Bjorn: Yes it is missing from this IDL and needs to be added, perhaps to the Result object

DanB: perhaps this is something which should be done using vendor-specific extensions and tested

Bjorn: after this discussion, do we want the intermediate and replace events?

Robert: no

DanB: useful for demos but perhaps not required in real use cases

MikeB: ok to not have intermediates if we have replace

Bjorn: ok to have replace if the web app doesn't have to always handle them. e.g. services sending empty results and then replacing them

Robert: if we have partial/intermediate results then we don't need replace.

Satish: replace makes state management heavier

Robert: if we are looking to cut something at a later stage, replace is perhaps a candidate.

<agreed to keep all 3 events>
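
A hypothetical sketch of the three continuous-recognition events just agreed; the handler names and event fields are illustrative (the scribed notes use both 'onresult' and 'onresults'), and only the replace-per-id behaviour below comes from the discussion:

    // Illustrative only: event shapes are not agreed.
    var request = new SpeechInputRequest();

    request.onresult = function (e) {
      // Final result for one utterance, identified by some id.
      console.log('final', e.resultId, e.results);
    };

    request.onintermediate = function (e) {
      // Unstable hypothesis that may still change.
      console.log('intermediate', e.resultId, e.results);
    };

    request.onreplace = function (e) {
      // Per the discussion above, one replace event is sent per affected id;
      // an empty result list would retract that utterance.
      console.log('replace', e.resultId, e.results);
    };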

Do we support simulated reco input?

DanB: I support it

Bjorn: seems simple to support

DanB: there is a difference between simulated reco and semantic analysis of text

Robert: some use cases such as video games have made-up words; languages like Japanese have a lot of imported words, etc.

Bjorn: why is this a problem for testing/analysis of text?

MikeB: could have a separate call to pass the text or just pass it through the input stream to the service

Robert: this affects the api

Bjorn: we should be concerned about things which are affecting the web app api

MikeB: if there is a web app specific text such as product name which should be pronounced in a particular fashion and the grammar is designed for that, the web app should take the typed text and convert it to how it should be pronounced in that domain before giving to the speech api.

MikeJ: focus on the simple use cases which cater for most web apps and punt on the pronunciation issues

DanD: user has low expectations about reco if they were typing it in the first place

MikeB: don't agree that adding support for pronunciation in this case doesn't complicate the api

Bjorn: can we specify a different grammar for simulated reco with text?

Robert: we don't want to have different grammars for speech and text

Bjorn: most of the use cases heard so far are for test tools, better not addressed via our api
... it is up to the recognizer to decide how to take audio and text input and return results, let the web api not provide anything specific for it.

General agreement that there would be a method to give text input to the request object and a parameter to specify the type of text matching - fuzzy or exact
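
A hypothetical sketch of that agreement; only the capability (a text-input method plus a fuzzy/exact matching parameter) was agreed, so the method name dispatchTextInput and the parameter name 'textmatch' below are placeholders:

    // Illustrative only: names are placeholders for the agreed capability.
    var request = new SpeechInputRequest();

    request.setParameter('textmatch', 'fuzzy');   // or 'exact'
    request.dispatchTextInput('two large pepperoni pizzas');

    // The recognizer treats the text as simulated recognition input and
    // returns results through the same result events as audio input.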

We had a short discussion about protocol and decided to look deeper into the latest draft of WebSockets. We'll revisit the discussion after reviewing the draft, possibly over email or with a smaller group.
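
Since the group agreed only to review the WebSockets draft, the following is no more than a sketch of what streaming audio to a speech service over a WebSocket could look like; the endpoint URL, subprotocol name, and message framing are all assumptions:

    // Illustrative only: endpoint, subprotocol, and framing are assumptions.
    var socket = new WebSocket('wss://speech.example.org/reco', 'speech-protocol');
    socket.binaryType = 'arraybuffer';

    socket.onopen = function () {
      // Per the requirement noted this morning, send the current timestamp
      // along with the first audio data.
      socket.send(JSON.stringify({ type: 'start', timestamp: Date.now() }));
      var firstAudioChunk = new ArrayBuffer(3200);   // stand-in for encoded audio
      socket.send(firstAudioChunk);
    };

    socket.onmessage = function (e) {
      // Recognition results (e.g. EMMA documents) would arrive here.
      console.log('service message', e.data);
    };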

Do we support re-recognition?

Bjorn: what are the use cases?

Milan: change language and re-recognize because past recognition returned no results

MikeB: change grammar and re-recognize

Bjorn: can't we use parallel recognitions?

Robert: wouldn't work if we want to change grammar based on previous recognition, e.g. if London was identified then re-recognize with London specific grammar to recognize the local address.

Bjorn: isn't this just a fancy language model?

MikeB: no, this is about keeping the audio around and doing better recognition

DanB: very similar to text input, but where instead of text we send audio
... could implement both recognize from file and recognize from past session through this api

Bjorn: if we have a rerecognize() method in the SpeechInputRequest object and if that re-recognizes the previously recorded audio, that would cover the common case.
... web app can change grammar and other parameters before re-recognizing

Robert: requires state, better to have web app specify earlier that it may re-recognize with a setParameter call so that most common cases don't end up storing state and audio which doesn't get used

Olli: if the api is based on Streams where the recorded audio can be obtained as a stream, don't need to set this property in advance

Robert: if network failure could even try again with the local recognizer with this method.

DanB: The client must store the audio in this case if the web app asked for it but as an optimisation it may not retransmit the audio if the server had indicated that it has a copy of the audio.

MikeB: we should also think about the case where stored audio e.g. voice mail is played out and re-recognized

Bjorn: we could wait for a microphone capture and audio processing api before supporting this

DanB: if the client should already store audio locally when requested, why is it difficult to support this scenario?

Bjorn: because there is no standard api to handle and process such stored audio today and we should wait until the audio WG's api proposal gets adopted

DanB: I'd like to think about this use case and design the api with it in mind

<Raj> Are there any privacy issues that need to be considered with providing access to voicemail etc. and submitting it to a WebApp?

Bjorn: is re-recognition important enough to think in detail today and not wait until an audio api is available from the audio WG?

Robert: yes, we do a lot of re-recognition

DanB: it is not the highest priority item but it is one of the most requested items in VoiceXML

Bjorn: I'd like us to do less than voice xml, not more

DanB: Lets leave it as discussed above with the rerecognize() method in the request object.

Bjorn: could potentially pass in a stored data from local files to the rerecognize() method.

Robert: should we also think about URIs and not just file/blob data?

Bjorn: raises concerns if the URI points to data on the intranet and the web app is outside the intranet
... should the client download or the service download?
... won't work if the download of data requires a cookie and such data can't be sent to the remote recognizer.
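
A hypothetical sketch of the re-recognition flow discussed above: rerecognize() on the request object and changing the grammar beforehand come from the discussion, while the 'keepaudio' parameter name is only a placeholder for the opt-in Robert suggested:

    // Illustrative only.
    var request = new SpeechInputRequest();

    request.setParameter('keepaudio', true);   // ask the UA to retain the audio
    request.startSpeechInput();

    request.onresult = function (e) {
      if (e.results.length === 0) {
        // No match: change the grammar (or language) and re-run recognition
        // over the retained audio instead of asking the user to repeat.
        request.grammars = [{ src: 'https://example.org/fallback.grxml' }];
        request.rerecognize();
      }
    };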

Fetching data from given URIs for re-recognition or other context such as grammar

Robert: Any URI specified must be accessible by the service without cookies and other related stuff

DanB: don't see this working. In MRCP the cookies were forwarded to the service, e.g. rule refs in the grammar would break without that since the service can't process the full grammar.

Robert: probably ok; if we say the service can access the data, that should be sufficient

Bjorn: perhaps a unique URL with a key should be given to the service which can let it get the resource

DanD: sees two types of cases - one where the grammar is only accessible to the service, the other requires user authentication

Robert: a third case is when the grammar has embedded refs to user data such as contacts

<Raj> Yes, here again, on mobile devices user action is required for providing access to such client data.

<Raj> besides technical certifications for access to privileged info

DanB: unlike the web api, this is perhaps a candidate for a smaller group to go and figure out how the authentication would work

MikeB: also need to decide on how the dynamism works

Robert: example is a huge SRGS which has rule refs to the user's contact list, department names etc.

Bjorn: how is this not authentication?

Robert: could be that you trust the service and have given permission already

DanD: wouldn't you have to compile for the user anyway?

MikeB: you can have the large grammar in a precompiled state and have the smaller grammars compiled at runtime and referenced.

Bjorn: would it be sufficient to pass a substitution map?

<Raj> But the user agent cannot automatically provide access to contacts etc. on the device to a web app without some privileged access to such a Device API, often involving explicit user consent.

DanB: this adds a requirement to the speech services

Robert: not that complicated, every engine has a resource loader and that needs to support this substitution map
... have to come up with some sort of virtual rule ref mapping to a token

DanB: does the web app have to parse the grammar?

Bjorn: no, the web app only needs to know the place holders and values for them

DanB: may have pieces of grammar authored by different people and hosted in different locations and the person using the grammar may not know enough about all levels to replace

MikeB: either standardise the URIs on how to replace, or the web app need not provide all the values; there could be defaults.

Robert: simplest way to do it is to take a rule ref and add a segment 'virtual=PLACEHOLDER'

Bjorn: better to add a new URI scheme that can be used in rule refs and grammars

DanB: we should also think about feeding this into IETF as a draft and registering it
... and if this is so generic, why doesn't it exist already?

Bjorn: because there are not a lot of other contexts where there are replacements

MikeB: will send a proposal
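
Pending that proposal, a purely hypothetical sketch of the substitution-map idea discussed above; neither the 'x-grammar-placeholder:' scheme nor the 'grammar-substitutions' parameter name was agreed (Bjorn's suggestion was only "a new URI scheme"):

    // Illustrative only.
    var request = new SpeechInputRequest();
    request.grammars = [{ src: 'https://example.org/directory.grxml' }];

    // The grammar could contain rule refs with placeholders, e.g.
    //   <ruleref uri="x-grammar-placeholder:CONTACTS"/>
    // and the web app would supply concrete values for each placeholder.
    request.setParameter('grammar-substitutions', {
      CONTACTS: 'https://example.org/users/42/contacts.grxml'
    });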

MikeJ: we are talking about getting people's contacts... should we talk about the security and privacy concerns due to the composition of these?

Bjorn: that's not speech specific; if they can access your contacts they can send them to random services already.

1530-1600 Afternoon break

PM2

1600-1800 Working session: Wrap-up and next steps

Scribe
Michael_Johnston

Planning discussion!

DanB: (presenting on board)

To complete:

- remaining important topics

- protocol between UA and Service

- Web API

- Markup API

DanB: Laying out calendar
... Have until 25 Aug (end of XG)

<Raj> Co-ordination with any Device API activities as well

<Raj> as access to the microphone and a lot of other device attributes depends on such access

DanB: Forthcoming weeks' impact on telecons: 26 May no meeting, 23 June VBWG, 21 Jul IETF 81, 11 Aug STK NY
... about 10 telecons to go
... Not enough time for Protocol / Web / Markup
... Need to complete discussion and design agreements

Bjorn: Remaining topics?

DanB: As listed in Dan's topic list

DaveB: Hoping to end up with draft proposals of the Web API and markup
... Is it still possible to get these out?

MichaelB: Need to address the protocol also, to the extent it impacts the API

Satish: 2 Level API?

MichaelB: No,

DanD: Needs to be interoperable

Satish: Do default case?

RobertB: Already had this conversation

Bjorn: Finish web API, start protocol

MichaelB: Close to proposal

RobertB: Let's see if there is a way to finish these

DanB: Reach point where group agrees on level of specification

Bjorn: Want to get to the point that experimental implementations can be completed

DanB: Concern about a press release describing Chrome as implementing this standard

DaveB: Marketing snafu

MichaelB: More work needed on protocol as it will influence the web api and markup

Bjorn: Want to experiment to see if the protocol works

Robert: Speech service providers need to start on implementing backend

DanD: Web api and binding will impact the protocol, natural order

DaveB: Challenge is web API
... Protocol as subset of MRCP

<Raj> Yes...subset of MRCP is the fastest way to REC

<Raj> otherwise, it will become chicken & egg with both parties (UA & SS) waiting on each other to finalize

MichaelB: Need to deal with the protocol in order to deal with web API

RobertB: Suggest breakout to groups

DanB: 4 Aug to decide what we do after this XG

MichaelB: Options: end, 1-year extension, or recharter as a working group

DanB: Recommendation from W3C that this would be better in its own working group
... HTML is too big
... better to have groups on separate topics

DaveB: Agree,

MichaelB: Get proposal

Olli: HTML not the best match; WebApps perhaps; more comments
... Seems like we need a separate working group

<Raj> perhaps aligning with MMI WG can also be considered

DanB: Need to work this out by Aug 11

Milan: Could have additional calls

DanB: need to keep the group public, take minutes etc

Milan: Minutes?

MichaelB: Can be summary

Robert: Need to assign leaders to each group

MichaelB: Report document is in place

DanB: Fill in sections,

MichaelB: Editor for each section

DaveB: interdependencies between javascript api and html binding, need to be one group

Satish: Need to be single group

Olli: concerns about separating markup binding and web api

Bjorn: Want html markup binding

MichaelB: Report could list issues

Robert: markup binding is remaining topic

Milan: Wants to be in protocol team

MarkS: TTS not a focus; if split out this way, won't get to it

Bjorn: There is agreement on TTS between the draft proposals, TTS is separate

MichaelB: Subsections on asr and tts, not separate

<Raj> I can volunteer for UA to device access & protocol sections work..

DanB: show of hands for participation: 6 protocol, 7 JS API, 4 markup

DanD: other topics not listed, e.g. security and privacy considerations
... End to end architecture

<Raj> Dan..Hope you included my show of hand remotely for protocol

Milan: Have the important-item discussion of markup binding before making it a group?

Robert: Plan to come back by 9th of June to indicate how to handle

DanB: straw poll on choosing one
... Planning dates: 2 Jun markup binding, etc.
... 9 Jun plan for both, 16 Jun topics

30 June Web API

DanB: 7 July, protocol

DanB: Robert as protocol lead

Bjorn: would like to lead the web api, but heading out on paternity leave

DanB: MichaelB agrees to lead the web API, Bjorn supports
... Delay decision on markup binding group

Robert: Wants people to put time into doing the work

<Raj> Protocol work needs speech engine companies to work closely (so Loquendo and Nuance), so that we can start doing the work as Robert suggested

DanB: Requested slot at tech plenary for fall
... for web API: MichaelJ, MichaelB, Bjorn ...

DanB: for protocol: Robert, Milan, Michael, ...

Bjorn: Issue of request for comments from HTML5

DanB: Left off because of time limits

MichaelB: Specific request

DanB: Nothing for us to say because of timing

MichaelB: If there are things we know are an issue, we could say now

DanB: Send official response if there are things we run into in the subgroups that relate to html5

<paolo_> Timeline:

<paolo_> 26 May - No meeting

<paolo_> 2 Jun - markup binding

<paolo_> 9 Jun - plan for both

<paolo_> 16 Jun - topics

<paolo_> 23 Jun - No VBWG f2f

<paolo_> 30 Jun - WebAPI

<paolo_> 7 Jul - protocol

<paolo_> 14 Jul -

<paolo_> 21 Jul - (IETF 81)

<paolo_> 28 Jul -

<paolo_> 4 Aug -

<paolo_> 11 Aug - (STK NY)

<paolo_> 18 Aug -

<paolo_> 25 Aug -

<paolo_> End of XG

MichaelJ: Is the goal to determine what we can accomplish by the end of the time, or what meets the requirements?

DanB: Date, features, time
... Plan to have meetings and discuss topics

Milan: additional topic for discussion: preloading resources

<Raj> Isn't preloading resources part of the messaging between WebApp and SpeechService?

DanD: Any expectation of an interoperable example?

MichaelB: More at level of implementation reports

Robert: Will go and develop proposals

MichaelB: Won't have interoperability by end of XG

<Raj> Agree with Robert on developing interop and implementation proposals

Markup binding

MichaelB: Useful things, do default actions
... Several benefits to connecting to markup, most things could be done with javascript directly, but useful to support common interaction patterns with html elements

Olli: How to bind to all the different input element types

Satish: Lots of comments on this

Robert: Text easier, less on how to use checkboxes, drop downs etc
... input patterns are also an issue, patterns and types are not necessarily compatible
... Can we reliably generate grammars

Bjorn: Security is the only reason, patterns etc, just so can click and start speech input

DaveB: Action based permission
... Pushback about info bubbles; more and more apis, more and more info bars
... Does not scale

Robert: Up to UA to decide how to enforce security, may involve popping up

Bjorn: translate on google.com, want to just click and then gather speech

Robert: click on it, onclick, browser policy checks whether this is permitted

Bjorn: Clickjacking is a problem

DaveB: UI burned in, so will have some indication

Bjorn: Need action and UI in order to avoid the system starting to listen when you are away from the computer

Satish: Want to avoid confirming the confirmation

Robert: web page designer wants to change the UI

MichaelB: But if the javascript api is allowed, one could get past this anyway

Bjorn: have additional security on javascript api
... even if have info bar for js api, avoid for markup case

Olli: how does the button generated by the html markup matter
... can call click on an input element, special case for input type file

DaveB: You can't
... not in WebKit

Olli: Mozilla has for backwards compatibility

Robert: still haven't uploaded a file, not a security problem

DanD: Another argument for markup is to be less dependent on the browser chrome,
... add this to discussion of markup?

Bjorn: clicking on file upload didn't work in Chrome

Robert: why not design a better policy experience around this

Bjorn: has not been possible
... different browsers could handle this differently

DaveB: Automatic filling text box is interesting

Robert: scripting api only works in installable extensions
... does not sound good

Bjorn: need Chrome team to be happy with the security model or can't add it

DaveB: straw poll - IE, Mozilla: how to deal with this? ok with pop-ups, infobars?

All: bizarre discussion about cats and mice

<silence>

Robert: general warning when go to page

Bjorn: concern is don't want this for everyone who goes to translate.google.com
... want to go to the page and be able to use speech without a pop-up dialog
... Chrome designers unhappy with info bars

DanB: same folks have security concerns

Olli: what about geolocation input element

Bjorn: geolocation has a pop up

Robert: want to see the alternative better proposal
... if go to site once and have to permit speech

Bjorn: want a way to never trigger infobar

Olli: ... why do you need the button?

Bjorn: to make sure is user

Olli: trusted events
... Button can be hidden, set size or opacity

Bjorn: click, now recording speech thing comes up

Olli: agree notification pop up is bad

Bjorn: at least speech requires action compared to geo

Robert: more of a policy working group issue

DaveB: should have design that facilitates multiple approaches

MichaelB: more interested in grammar derivations from element etc, less convinced by security
... concerned about interoperability, js will only work in sandbox

DaveB: more about annoyance in UI, scalability

Bjorn: want a way to not trigger the info bar
... fine if some browsers do security differently

DaveB: impressive to very easily speech enable a web page

MichaelB: How hard is it to define the markup binding?

Bjorn: tried different approaches, e.g. speech attribute, input type=speech etc

Satish: this is a new kind of interface, divided responses, limits adding to text areas

Robert: need input type=speech
... how about textareas?

Bjorn: speech element that is just the button, speech input

MichaelB: can't send it a javascript click event

Satish: if js api, could bind to other types

MichaelB: use a 'for' attribute

Robert: like label's 'for' attribute

Bjorn: could also have a drawing area on a touchscreen

DanB: Where are we on markup binding?

Bjorn: speech input element

Olli: concerned about clickjacking

Bjorn: have different policy

Olli: don't want things different

Robert: if not avoiding putting rest in sandbox, not as concerned

DaveB: Strong use case for starting speech when first go to page

Bjorn: open app

Robert: don't want to have to push a button for each speech input

Debbie: app with several pages, enable speech, don't want to have buttons on different pages

Bjorn: assuming will apply

DaveB: word spotting

MichaelB: often want to control the button
... new element that provides the onclick, maybe for linkage, maybe not
... Olli skeptical of that also

Bjorn: Gmail - have it ask you whether you want it to read, say yes, then have it read, no button click

Robert: video games in browsers, speech as side channel, a use case

DanD: accessibility issues

DanB: may not need full call, but need to check on sandboxing issues

Bjorn: don't have agreement on having an element

DaveB: want to have low key way to add speech

Robert: roll into the web api discussion
... why need element for tts?

DaveB: elegance of specialization of html media element

Robert: some commonalities, also parts are irrelevant
... semantics of different input formats, how to apply media element to ssml

Bjorn: if similar to audio, should behave like audio

Robert: Copied bits from media element that apply

DaveB: No superclass to base it on, javascript API

MarkS: fine with just having js api

Bjorn: controls

Robert: but why needed (shuttle controls)?

MichaelB: play now, or render,

MarkS: pause is useful but could do button for that

Robert: in existing MS Translate example

All: agreement on not having a tts element