HTML Speech Incubator Group Teleconference -- 28 Apr 2011

<burn> trackbot, start telcon

<trackbot> Date: 28 April 2011

<burn> Scribe: Robert Brown

<burn> ScribeNick: Robert

<burn> Agenda: http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011Apr/0059.html

F2F logistics

Bjorn: nothing new logistically

Burn: will send revised schedule

updated report draft

<burn> final report draft: http://www.w3.org/2005/Incubator/htmlspeech/live/NOTE-htmlspeech-20110426.html

Burn: no new comments

new design decisions

Bjorn: previously we only looked at the intersection of the proposals. Is there anything that's in two proposals but not the third, e.g. continuous recognition?

Milan: any requirement that we support this?

Burn: will add continuous recognition to the list of topics to discuss

Bjorn: only removed it from the Google proposal because it's difficult to do, and we may want to do it in a later version

Michael: recapped two scenarios stated by Bjorn: 1) continuous speech; 2) open mic

Bjorn: proposed that we all agree this is a requirement

Milan: we were vague about what the interim events requirement meant, whether it included results

<bringert> burn: satish is trying to join, but zakim says the conference code isn't valid

Burn: [after discussion] proposes Michael adds this as a new requirement (or requirements) to the report

Michael: sure, but will also check to see whether we just need to clarify an existing requirement

Bjorn: this is also a design topic

<satish> burn: will do

Bjorn: Robert, is there anything else in the Microsoft proposal that should be considered as a design decision?

Robert: nothing apparent, will review again in coming week

Bjorn: should we start work on a joint proposal then?

Burn: proposes that we now go to the list of issues to discuss and discuss them

Bjorn: more items for discussion from Microsoft proposal
... MS proposal supports multiple grammars, but Google & Mozilla support only one

Olli: Mozilla proposal allows multiple parallel recognitions, each with its own grammar

MichaelJohnston: can't reference an SLM from SRGS, so multiple grammars are required

Bjorn: proposes topic: Should we support multiple simultaneous grammars?
... proposes topic: which timeout parameters should we have?

<smaug_> yeah, Mozilla proposal should have some timeouts

Bjorn: emulating speech input is a requirement, but it's only present in the Microsoft proposal

Michael: proposes topic: some way for the application to provide feedback information to the recognizer

Bjorn: does anybody disagree that this is a requirement we agree on?

Burn: proposes requirement: "it must be possible for the application author to provide feedback on the recognition result"

Debbie: need to discuss the result format

Michael: seems like general agreement on EMMA, with notion of other formats available

Olli: EMMA as a DOM document? Or as a JSON object?

MichaelJohnston: multimodal working group has been discussing JSON representations of EMMA
... there are some issues, such as losing element/attribute distinction
... straight translation to JSON is a little ugly

Michael: existing proposals include simple representations as alternatives to EMMA

MichaelJohnston: For more nuanced things, let's not reinvent solutions to the problems EMMA already solves

Milan: would rather not have EMMA mean XML, since that implies the app needs a parser

Debbie: sounds like we agree on EMMA, but need to discuss how it's represented, simplified formats, etc.

Milan: it would be a good idea to agree, as a baseline, that an EMMA result is available through a DOM object

Bjorn: it's okay to provide the EMMA DOM, but we should also have the simple access mechanism that all three proposals have

Burn: would rather have XML or JSON, but not the DOM

Michael: if you have XML, you can feed it into the DOM

Burn: it's a minor objection, if everybody else agrees on the DOM, I'm okay with that

Bjorn: maybe just provide both

MichaelJohnston: EMMA will also help with more sophisticated multimodal apps, for example using ink. The DOM will be more convenient to work with.

Burn: proposed agreement: "both DOM and XML text representations of EMMA must be provided"
... haven't necessarily agreed that that is all

Bjorn: we already appear to agree, based on proposals: "recognition results must also be available in the JavaScript objects where the result is a list of recognition result items containing utterance, confidence and interpretation."

Michael: may need to be tweaked to accommodate continuous recognition

Burn: add "at least" to Bjorn's proposed requirement
... added a statement "note that this will need to be adjusted based on any decision regarding support for continuous recognition"
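
The two result forms discussed above (EMMA as XML text fed into a DOM, plus a simple list of items with at least utterance, confidence and interpretation) can be sketched as follows. This is only an illustration under stated assumptions: the sample EMMA document, the `destination` element and the `simple_results` helper are hypothetical, and none of the proposals prescribes this mapping.

```python
from xml.dom.minidom import parseString

EMMA_NS = "http://www.w3.org/2003/04/emma"  # EMMA 1.0 namespace

# Hypothetical EMMA result with two N-best hypotheses (illustration only).
EMMA_TEXT = """<emma:emma version="1.0" xmlns:emma="http://www.w3.org/2003/04/emma">
  <emma:one-of id="nbest">
    <emma:interpretation id="int1" emma:confidence="0.82" emma:tokens="flights to boston">
      <destination>Boston</destination>
    </emma:interpretation>
    <emma:interpretation id="int2" emma:confidence="0.41" emma:tokens="flights to austin">
      <destination>Austin</destination>
    </emma:interpretation>
  </emma:one-of>
</emma:emma>"""


def simple_results(emma_text):
    """Parse EMMA XML text into a DOM, then flatten it into a simple result list."""
    dom = parseString(emma_text)  # the XML text representation becomes a DOM
    items = []
    for node in dom.getElementsByTagNameNS(EMMA_NS, "interpretation"):
        # Application-specific child elements become the interpretation payload.
        interpretation = {
            child.tagName: child.firstChild.data
            for child in node.childNodes
            if child.nodeType == child.ELEMENT_NODE
        }
        items.append({
            "utterance": node.getAttributeNS(EMMA_NS, "tokens"),
            "confidence": float(node.getAttributeNS(EMMA_NS, "confidence")),
            "interpretation": interpretation,
        })
    return items


results = simple_results(EMMA_TEXT)
```

Each entry in `results` carries the three fields named in the proposed requirement; a continuous-recognition design would likely need more than this flat list.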

Milan: would like to add a discussion topic around generic parameters to the recognition engine

Burn: related to existing topic on the list, but will add

Milan: also need to agree on standard parameters, such as speed-vs-accuracy

Burn: will generalize the timeouts discussion to include other parameters

MichaelJohnston: which parameters should be expressed in the JavaScript API, and what can go in the URI? What sorts of conflicts could occur?

Bjorn: URI parameters are engine specific

MichaelJohnston: for example, if we agreed that the way standard parameters are communicated is via the URI, they could come from the URI, or from the JavaScript

Michael: need to discuss the API/protocol to the speech engine, and how standard parameters are conveyed
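
The kind of conflict raised here, where the same parameter arrives both in the service URI and through the API, can be sketched as below. Everything in it is an assumption for illustration: the `speedvsaccuracy` and `lang` parameter names, the example URI, and the policy that the API value wins are hypothetical, since the group has not decided any of these.

```python
from urllib.parse import urlsplit, parse_qsl


def merge_parameters(service_uri, api_params):
    """Combine URI query parameters with API-supplied ones and report conflicts.

    Assumed policy (not agreed by the group): the API value overrides the URI value.
    """
    uri_params = dict(parse_qsl(urlsplit(service_uri).query))
    api_as_text = {key: str(value) for key, value in api_params.items()}

    # A conflict is a parameter set in both places with different values.
    conflicts = {
        key: (uri_params[key], value)
        for key, value in api_as_text.items()
        if key in uri_params and uri_params[key] != value
    }
    merged = {**uri_params, **api_as_text}
    return merged, conflicts


merged, conflicts = merge_parameters(
    "https://asr.example.com/reco?speedvsaccuracy=0.2&lang=en-US",  # hypothetical engine URI
    {"speedvsaccuracy": 0.9},
)
```

Here `conflicts` records that `speedvsaccuracy` was given as `0.2` in the URI and `0.9` through the API, which is the situation the protocol discussion would need to resolve.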

Bjorn: we need to discuss the protocol, it's not in the list

Burn: will add it to the list

Milan: are the grammars referred to by HTTP URI?

Burn: existing requirement says "uri" which was intended to represent URLs and URNs

Milan: would like to mandate that HTTP is definitely supported. There are lots of other schemes that may work.

Robert: should we have a standard set of built-in grammars/topics?

Bjorn: in the Google proposal we had "builtin:" URIs

Burn: "a standard set of common tasks/grammars should be supported. details TBD"
... need a discussion topic about what these are

Robert: what about inline grammars?

Bjorn: data URIs would work for that, and perhaps we should agree about that

Charles: would like to see inline grammars remain on the table

Burn: will add a discussion about inline grammars
... we all agree on the functionality that inline grammars would give

MichaelJohnston: one target user is "mom & pop developers" who would provide simple grammars

Burn: discussion topic: "what is the mechanism for authors to directly include grammars within their HTML document? Is this inline XML, data URI or something else?"

Robert: use case: given that HTML5 supports local storage, the data from which a grammar is constructed may only be located on the local device

Bjorn: proposes that we mandate data URIs, just for consistency with the rest of HTML

Burn: no objections, so will record as an agreement
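
As an illustration of the data URI approach, combined with Robert's use case of a grammar built from data that exists only on the local device, a minimal sketch follows. The `build_grammar`, `grammar_data_uri` and `decode_data_uri` helpers and the sample phrases are hypothetical; `application/srgs+xml` is the registered media type for XML-form SRGS, and base64 encoding is one choice among those a data URI allows.

```python
import base64
from xml.sax.saxutils import escape


def build_grammar(phrases):
    """Build a small SRGS XML grammar whose root rule accepts any one of the phrases."""
    items = "".join(f"<item>{escape(phrase)}</item>" for phrase in phrases)
    return (
        '<grammar xmlns="http://www.w3.org/2001/06/grammar" version="1.0" '
        'xml:lang="en-US" root="main">'
        f'<rule id="main"><one-of>{items}</one-of></rule></grammar>'
    )


def grammar_data_uri(grammar_xml, media_type="application/srgs+xml"):
    """Wrap grammar text in a base64 data URI so it can be passed wherever a grammar URI is expected."""
    payload = base64.b64encode(grammar_xml.encode("utf-8")).decode("ascii")
    return f"data:{media_type};base64,{payload}"


def decode_data_uri(uri):
    """Recover the grammar text from a base64 data URI produced by grammar_data_uri."""
    header, payload = uri.split(",", 1)
    if not header.endswith(";base64"):
        raise ValueError("expected a base64 data URI")
    return base64.b64decode(payload).decode("utf-8")


# Hypothetical locally stored contact names, never uploaded to a server.
uri = grammar_data_uri(build_grammar(["Ann", "Bob & Co"]))
round_trip = decode_data_uri(uri)
```

The resulting `uri` needs no separate fetch, so the same grammar-by-URI mechanism covers both remote and inline grammars.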

Michael: need to discuss the ability to do re-recognition

Burn: related to the topic of recognition from a file

Bjorn: both are fine discussion topics

Burn: [discussion about whether there's anything to discuss around endpointing], already implied in existing discussion topic

Bjorn: context block?

Burn: discussion topic: "do we need a recognition context block capability?" and if we end up deciding yes, we'll discuss the mechanism

Milan: how do we specify a default recognizer?

Bjorn: don't specify it at all
... since it's the default

Michael: need some canonical string to specify user agent default, so we could switch back to it (could be empty string)
... Whereas how we specify a local one may be similar to the way to specify the remote engine

Bjorn: for local engines do we need to specify the engine or the criteria?

Burn: SSML does it this way

Bjorn: is there a use case for specifying criteria?

Burn: in Tropo API, language specification can specify a specific engine
... this is a scoping issue. e.g. in SSML a voice is used in the scope of the enclosing element
... in HTML could say that the scope is the input field, or the entire form

Bjorn: in all the proposals, scoping is to a JavaScript object
... are there any other criteria for local recognizers than speed-vs-accuracy?

Charles: different microphones will have different profiles

Raj: how do we discover characteristics of installed engines?

Michael: selection = discovery?

Burn: in SSML, some people wanted discovery

Bjorn: use cases?

Michael: selection of existing acoustic and language models

Robert: there's a blurry line between what a recognizer is, and what a parameter is

Michael: topic: "how to specify default recognition"
... topic: "how to specify local recognizers"
... topic: "do we need to specify engines by capability?"

Raj: or "how do we specify the parameters to the local recognizer?"

Burn: want to back up to "what is a recognizer, and what parameters does it need?"
... call something a recognizer, and call other things related to that parameters of a recognizer

Bjorn: the API probably doesn't need to specify a recognizer. speech and parameters go somewhere and results come back

Burn: what is the boundary between selecting a recognizer and selecting the parameters of a recognizer?

Milan: we need to discuss audio streaming

Burn: topic: "do we support audio streaming and how?"

<Milan> Milan: Let's discuss audio streaming
