HTML Speech XG TPAC 2010 Lyon minutes

Initial discussion is a review of the charter and goals of the group. This group is an incubator group and will not produce recommendation-track documents, but will produce proposals for recommendations.

Requirements

Requirement R29. "Web application may only listen in response to user action"

Bjorn: the point of this is security and privacy, don't want the browser snooping without the user's knowledge. There's been discussion of what sort of user action is needed.

Dan Druta: How long does the consent last for? Notification might be better, that let's the user know that speech is activated.

Dave Burke: There is precedent in file access, camera access, microphone access, for how to do this. We should think about how it fits into the web platform. Can't be too annoying to the user.

Michael: We still need to think about what the user action is. Is it browsing to a page, clicking a link? There are cases where there aren't many visual cues possible.

Satish: could we leave it up to the browser what the user action is? In a handsfree app, the user action might be quite different from that in a normal web browser.

Debbie: But we do want to keep random websites from recording your speech. I had assumed that this was clicking a button.

Dan B: is this a requirement on the web application or on the browser? The web app may not begin recording speech without explicit permission from browser, where browser is the proxy for the user.

Bjorn: The requirement is on the web app. It is not allowed to start listening on its own.

Michael: If my home page is a voice search page, I don't want to get asked for permission each time I go there.

Michael: We may have agreement on the following: The user agent may not allow the application to record without user consent. We may want to say more than this, there may be requirements on how the user agent gets user consent, but we don't agree on this yet.

Dave: file access is a good example. A web page cannot open a file unless user clicks on button to indicate he wants file access. Don't need explicit permission beyond this.

Robert: In a medical records application, you will want voice control over everything. The app is built for you to talk to it.

Bjorn: So the question is whether you have to give consent each time, or is once enough? How long does consent persist. 3 levels: Have to click for every utterance, have to click once each time you load the page, or click once and grant access for ever. Another question is whether there is a button in the page, or one in the browser chrome.

Robert: Is this different from all the other things we have to set policy on it browsers?

Michael: Speech is a hybrid: it's sort of a resource like a cookie, but it's also a form of input.

Debbie: Consider case where you give a page permission to listen to me, you don't want it to listen to you when you're talking to your friend.

Bjorn: Only page that's in focus should be able to record. A background tab shouldn't be able to record. Could have a requirement that user sees recognized input before it is submitted to page, but that could get clumsy.

Dan Druta: We should have another use case covering initiation of speech including user granting permission.

Dan Burnett: To add another requirement, have to present specific use cases to go with it.

Requirement 27. "Grammars, TTS, media composition, and recognition results should all use standard formats."

Dan Burnett: Checking that "No one believes that we need more than what SSML provides for audio."

Milan: Maybe we should say "use standards when available" becuase there is no standard for statistical LMs, and you don't want to be banned from using them.

Dan B: yes we may have to add a "where available" clause in 27a and 27b.
We may want to say: implementations _must_ support SRGS and SISR and the API must not prohibit the use of non-standard formats.

Dan B: 27a: "implementations must support the XML format of SRGS and must support SISR"
27b: "implementations must support SSML"

Session 2 (raw minutes by Paolo Baggia)

Google API demo

Source code explanation

element input with attribute and activation of translate

Satish: There is a TTS API
... which returns audio

Rahul: The microphone picture is related to special attributes?

Bjorn: API add attribute to input
... speech boolean, continue boolean, grammar URL, maxresults for nbest, nospeechtimeout
... two events are added: onspeechchange and onspeecherrore;

Bjorn: two input event: stopSpeechInput, as user stop speaking, cancelSpeechInput, in case of an error;
... first for speech to talk

Bjorn: You get events and you can do all implementations. You don't have a dialog

Robert: There is an example at the end.

Olli: In which document there is the microphone ...

Bjorn: You have to click button to start speech.

Olli: If you have two iFrames with speech ..

Bjorn: What if you have multiple form in the same page.

MikeB: You may activate different speech.

Robert: you have a text box, why not define another button for start speech?

Satish: We had a discussion earlier, open to explore different, but this is clear
... same paradigm

Robert: You are speaking to a page, so speech is a different paradigm. you might be not using a text/input

Bjorn: Open to other options

Rahul: Why type is search?

Bjorn: It is an HTML5 new thing
... for instance type equal speech or text or email might have different SLM

Dan: Might be to specify an ontology of semantic types

Robert: You have pattern like dates, etc, to standardize them.

Bjorn: It might be grammars or SLM as a builtin. It makes sense

Satish: Doing text input you can constranit or to do free form recognition.

MikeB: you might have a pattern for builtin
... in HTML5 they specify types, sometimes very precise at word level, other are generic

Olli: About continue, how that work? Privacy issues

Satish: User can stop recording,

Bjorn: It is not solved issues

MikeB: about sproofing ...

Bjorn: this proposal doesn't solve all use cases, but many.

Debbie: Can process to do barge-in?

Bjorn: No TTS at all, it is a press to talk.

DaveB: Might add an event when pressing the button.

Milan: and barge-in false?

DaveB: You can disable the press button

Debbie: Is it only press to talk?

Bjorn: It is only the first time, then continue on will continue.
... If there is not only the button, but also a start event. It is an extension.
... One problem is that it requires a UI element.

Debbie: Stop TTS by typing, instead of speaking.

MikeB: Touch you can stop, type as well and also with speech

DaveB: annoying to have audio

Requirements - R27 (cnt'd)

Dan: Recognition result should be based upon a standard such as EMMA
... Are there concerns with EMMA?

DaveB: EMMA might brings a lot, inject XML in the DOM is an unnecessary code
... Prefer JSON

Robert: EMMA is extensible
... in the XML sense

MikeB: We have a desire to use EMMA and to have live easier to Web developers.
... Some might be JSON or HTTPrequest (?) or there might be multiple way to discuss it.

Bjorn: Seems more complicated to analyse

DaveB: It is an interchange format

Dan: EMMA is a format for representing input.

Bjorn: It should be easy to access from the web application.
... this should be easy other must be possible
... We won't agree on the full set, but on same.

DanB: Replace 27c and d
... access specific to recognizer

Olli: Argue to avoid to know a recognizer

DanB: reach consensus on 27 first

MikeB: I'll keep 27c should be EMMA and 27d an easy to process format

Satish: might not agree on a format today

Debbie: keep EMMA is fine

Bjorn: I'm fine to allow, but not to require it.

DanB: The recognition results must include information that will be available via EMMA

Bjorn: 27c It should be possible for the web application to get the recognition results in a standard format such as EMMA
... 27d It should be easy for the web appls to get access to the most common pieces of recognition results such as utterance, confidence, nbests

DanB: Even in VXML, you can get EMMA results, but we still like to have shadow variables for normal uses.

Bjorn: There should be a standard way to get that ...
... Whether complex stuff should be in EMMA

DanB: We have agreement on R27.
... R31. End Users not web application authors should be the ones to select speech recognition resouces
... Discussion might be open to who ...

Bjorn: The authors should not be forced to choose
... in email discussion for R31, it might sense that browser specifies the default one,
... on top we might or might not to allow to specify another one.

DanB: There are two different ways: (a) it might be possible for browser to specify, (b) the browser must specify a default

MikeB: R1, R15, R16, R22 are all correlated

Bjorn: The browser must have one, the author can give a hint or browser must implement

MikeB: Calling hint is implying ...

Bjorn: The browser gives the audio to the author, who does whatever he wants.

Robert: You won't be able to do good applications.

DanB: I prefer to start on the agreement part:
... We expect the browser have a recognizer to use

Bjorn: like 15

Paolo: This imply you can change if you want

Bjorn: that is anothe requirement
... It might be reworded as: R15 The browser must supply a speech service that web applications can use

DanB: Agreement reached

MikeB: So far there is agreement, but not sure for other

DanB: R16 agreement on wording?

Bjorn: what does it mean "to exclude"?

DanB: That the audio is accessibe to author and results are treated the same.

MikeB: One is to run another or to select among a list.

Olli: ??

Robert: You might not trust to send a piece of voice, application can do misuse, a third is there are big data implied

Bjorn: example of CSS: default, specify by author, and by user

MikeB: My experience for large grammars, you might have different grammar for different accents and tuning recognizer

DanB: Some authors don't care of the performance, but other care very much of the experience they present.
... They need to customize the customer experience and regardless the browser in use.

Dave: Most web developers don't care.

Bjorn: Example of Google search and bing.

DanB: Is it he user agent or the appl developer?

Satish: Don't want to have different experience on different applications

JimB: The user can state a different one and also make it mandatory, then the author can also modify it, ...

Bjorn: It is a variation of CSS

DanD: In current model you don't need a plug-in, if becomes more obvious the power for more accurate ...
... why local component talks to remote one

Bjorn: SRGS is small, but not on SLM large and proprietary

<burn> what is currently being said (but need to confirm that we understand and/or agree):

<burn> 1. Browser must provide default 2. Web apps should be able to request speech service different from default 3. User agent (browser) can refuse to use requested speech service 4. If browser refuses, it must inform the web app 5. If browser uses speech services other than the default one, it must inform the user which one(s) it is using..

Bjorn: Amazon example

Debbie: asking clarification on refusal by the browser the user selection

DaveB: You can in Chrome specify the search, similar the recognition client might test to interoperable
... It doesn't require a protocol

DanB: Go back. Which are the replacement of for?
... Current wording replaces: R15 (with new wording already captured);

Bjorn: objects R1 is inconsistent with (3)
... the browser can refuse to connect on network, etc

MikeB: 1-5 are agreed, but R1 is not covered competely.

DanB: Try to focus on substitutions:
... R16 covered by 1-5
... R31 covered
... R22 covered, but take content for 1-5
... Some of the content of the removed R should be captured for new 1-5
... R1 should be re-phrased
... R15 (new) is replacement of old R15
... Attempt to address R1, at least agree on changes to it.

Bjorn: If the web appls specify speech services, it should be possible to specify parameters.

MikeB: But you can specify parameters also in the default one speech services.

Bjorn: other difference is "network recognizer"

Robert: We discussed on local / remote

MikeB: There were use cases to be capture, there are things that might happen also in the network case

Dave: concerns on the need of the requirement
... seems to be redundant

Bjorn: New reqs: Speech services that can be specified by web appls must include network speech services
... Remove R1 because difference is capture by the last two new requirements

Thursday meeting at 8:30 in Level 2 - Saint Clair!

Thursday, 4 November 2010

Session 1 (raw minutes by Marc Schröder)

Requirements (continued)

Picking up discussion from email: Same-domain and cross-domain issues: sending data across domains is a privacy concern, also security or DOS attack issue. This is an area that will require more discussion.

R3: Ability to bind results to specific input fields.

Raj: If everything can be done in scripts, what do we need these requirements for?

Bjorn: Requirements about the output of this group; impossible to list everything one could do by speech, such as changing the colour of the page.

Dan: Maybe it's more about making sure that it's easy to do so, rather than "it's possible to do".

MichaelB: Agree. The list was drawn up for things that should be easy and direct.

Bjorn: We could distinguish "easy" and "possible". Would filling one field be "easy", and filling multiple fields be "possible"?

Robert: The demo you (Bjorn) showed the other day showed it was easy to fill multiple fields.

Robert: Is this related to mixed initiative (Use case 5)?
... maybe that's a different requirement.

MichaelB: Compare to keyboard input. Easy to fill one input field by typing on the keyboard. It could be done, using scripts, to type and thereby fill several input fields. Maybe speech could work in the same way.

Bjorn: I am fine with any wording here that doesn't require us to do what VoiceXML does, multiple binding to slots.

Olli: Don't want ASR results to be bound to any input field. In my experience, X+V was terrible because it required ASR results to be bound to input fields. It's about the API. Think of events coming from the server; scripts can do something with that.

Dan: Would it be ok for you to have an "automatic" binding, if it is not the only one?

Dan: I think R3 needs to be split into two different requirements, representing different use cases: Single field vs. multiple fields. Let's talk about a single input field first.

Bjorn: (1) It should be easy to assign recognition rsults to a single input field.
(2) It should not be required to fill an input field every time there is a recognition result.
(3) It should be possible to use recognition results to multiple input fields.

R33: User agents need a way to enable end users to grant permission to an application to listen to them.

On Tuesday, we agreed that it is up to the UA, and then the user just chooses the UA they want.

Dan: So maybe we don't need this requirement at all, it is already covered by R29.

Michael: As I remember it, we discussed R29 + R33, and decided it was up to the UA policy somehow.

Dan: We did change the wording of R29 to "capture". And R33 is about ways in which the user can give consent.

Wording proposed by Bjorn: "Web applications must not capture audio without the user's consent."

Separate issue is the *mechanism* how the user gives consent. There are different possible ways how the user can give that consent, e.g. by clicking ok every time, or by downloading the UA that does capture, or a range of options in between.

Bjorn: I don't want google.com to pop up a window saying "do you want to allow this page to capture your speech", because 99.9% of users will not want to, and I don't want to inconvience them.

Dan: Example of "file upload" button is good: It cannot be initiated by the web app.

Bjorn: Yes, but then once the user clicks the button, there is no window popping up asking "do you want to allow this?".

Michael: I may install a firefox plugin that warns me every time a character I type is sent to a server. I wouldn't want the spec to forbid that.

Jeremy: This seems specific to the UA to me. Let different UAs to experiment with different options.

Michael: Agree: The requirement is: It must be possible for the UA to prompt the user whether to allow this.

Bjorn: And, it must be possible to start speech input in response to user action such as clicking the microphone button.

Marc: Think of "informed consent" as in psychological experiments, as needed to get an experiment design through the ethics committee. If you click the button "send me 1000$", that should not trigger speech input. On the other hand, clicking a microphone button could be considered informed consent.

Dan+Bjorn: Notification and cancelling speech input must be possible, but that are different requirements.

Dan: "It must be possible for the user to revoke consent at any time, including while capturing".

Bjorn proposes wording: "While capture is happening, there must be an obvious way for the user to abort the capture and recognition process."

Abort should be defined in the spec, explanation such as "as soon as you can, stop capturing, stop processing for recognition, and stop processing any recognition results". Two aspects to this: privacy; and the app gets as little as possible.

Michael: Separate from this: "It must be possible for the user to revoke consent" (in general, not for the specific capture event).

Bjorn: And web apps needs a way to inform that UA that they support speech input.

Bjorn: "The spec should not unnecessarily restrict the UA's choice in privacy policy."

Raj: That's ok, as long as the word "unnecessarily" does not end up in the spec.

R4: Web application must be notified when recognition occurs

Relevant events (order not yet specified): audio capture starts; speech starts; end of speech; end of capture; recognition results available

Session 2 (raw minutes by Raj Tumuluri)

Planning

Jeremy: Decisions and rationale behind them need to be minuted for the benefit of the people who're not physically present at F2F..

Dan: Expected to read minutes carefully and seek corrections as necessary..
Detailed summary will serve better.

Raj: Email responses back-and-forth also are as unreadable..after few rejoinders...
So, at some logical point ( since we don't know when the end-of-discussion really is)..
a detailed summary of email-discussions also needed..

DanD: Seeking lots of proposals will imply spending lot of time normalizing them

RajT: Implementations are NOT part of the XG deliverables..so we don't need to push for that..

DanB: Raj makes a good point...implementations are not required by August 31, 2011..

DanB & RobertB: Conf. calls will be needed, following email discussions followed by F2F as needed to ensure progress..

HTML Speech XG meeting composite minutes

Tuesday, 2 November 2010

Session 1 (raw minutes by Jim Barnett)

Welcome and Introductions