HTML Speech Incubator Group Teleconference -- 05 May 2011

<burn> trackbot, start telcon

<trackbot> Date: 05 May 2011

<burn> Scribe: Charles_Hemphill

<burn> ScribeNick: Charles

<burn> Agenda: http://lists.w3.org/Archives/Public/public-xg-htmlspeech/2011May/0001.html

F2F Logistics: Any updates on attendance, hotel bookings, and questions or details from Bjorn.

Bjorn: no updates on F2F

Burn: will send out schedule in the next few days.

Review new text in updated "Final Report" document [1] to ensure it matches what people think we agreed upon in our last teleconference.

<burn> document is http://www.w3.org/2005/Incubator/htmlspeech/live/NOTE-htmlspeech-20110503.html

Burn: comments on the document - added general design decision - 17 new discussion bullets.

Determine if we already have other agreed-upon design decisions.

Bjorn: discussion topic about mic capture access. Propose design agreement - should be possible to start speech reco without selecting mic - just pick default.

Burn: default vs. what you can do - two things.

Bjorn: There should be a default mic. Perhaps the only option.

Burn: saying explicit determination of mic should not be required.

Bjorn: Should not need to enumerate mics before starting.

Robert: Think we let you pick other mics.
... that's a reasonable interpretation.
... By default, mic provided by user agent default device.

Bjorn: Need to discuss second sentence later - picking a mic.
... should be able to start reco without selecting mic - confirming agreement.

Robert: Assuming that the default will be used for mic.

Burn: notion of default mic.

Robert: Issue of user interface. Shows speaker activity. Is there a default user interface? Can the application override.

Bjorn: Have that requirement for default user interface.

Robert: RE: default user interface - shows it's listening and lets user cancel.

Olli: What is the default user interface. Something in the browser.

Bjorn: Should only use browser user interface. No Web app user interface.

Olli: More security or privacy concerns otherwise.

DanD: worried about limitations of only in the browser.

Robert: Don't think that's true.
... Default user interface. Can it be overridden. Where does it live. 3 discussions.
... Google right in the Web page where the user clicks. Up to user agent to decide how to render.

Bjorn: Have a default interface now.

MichaelJ: Fine for default. Want APIs to allow someone to build their own. Different user experience. Allow this. Useful to have default. But not always appropriate.

Bjorn: Have agreement on default. Have disagreement on your own due to security reasons, etc.

MichaelJ: Very limiting otherwise.

Bjorn: Should start speech by custom ways including JavaScript. Can hide that you're capturing audio if custom UI.

Robert: Compromise - default UI parameterized? Provide feedback to the user. Style sheet. Look at customizations.

MichaelB: Up to user agent to allow customization. Part of permissions API.

Burn: Should be a default user interface.
... Should there be customization and what level.

DanD: Not all use cases in browsers. Different security concerns if rendering engine used. Should not be forced by HTML spec to have a particular UI.
... Don't want to prevent animated character app that is listening to you.

Bjorn: Talk about browser case. Need to be clear that the browser is capturing the audio.

DanD: Could be a matter of security settings.

Bjorn: Don't say that we disallow customization, but don't require this.

DanD: End up with fragmentation. Won't work cross browser.

Bjorn: Allow for non-browser apps.
... Note for future discussion.
... Allow customization of the user interface that shows audio capture is happening.

Burn: Have a discussion topic of the level of customization allowed.

Bjorn: Should have customization for the UI for starting recognition. Have discussion topic: customize UI for showing that audio is being captured.

MichaelJ: Waveform, traffic lights?

Bjorn: Can app customize what the app looks like?

MichaelJ: Can customize one that shows up in the UI.
... Multimodal tap and talk API. Want creativity. Activate recognition button. Don't want to rule out certain kinds of APIs. Don't want built-in browser feedback to interfere.

Burn: come back to this discussion later.

Begin discussing issues listed in the Appendix.

Burn: Have time to discuss a serious topic. Can work out serious issues at FTF.
... Determine which topics have more meat. Start with audio.
... 3 audio related topics. How to get audio capture access. Mandatory audio codecs. Audio streaming support and how.

Bjorn: 1st unrelated to 2nd two. 1st is API. 2nd two how audio is sent from browser to implementation.

Burn: How to get audio mic capture access.

Bjorn: MS proposal has mic selection. What are use cases?

<burn> "audio mic capture" is "audio/mic/capture"

Robert: Browser going to have mic API anyway. Avoid 2 mic APIs. 1 in speech and another unrelated (explicit). Want speech API to integrate with browser API.
... Many devices will have multiple mics. Important to select the one you want. Maybe app or user through preferences.
... May want to configure mic settings. Use for things other than speech. E.g. video app that does speech reco.
... MS API allows this. Can get audio stream to reco. Look at multimodal scenarios. Need for integrated API there. Speech API should integrate.

Bjorn: Can buy most of that.
... If there is one there, should be able to use for speech. But no such standard API yet.

Robert: Pushing capture API heavily. With michael. IE team thinks this is a sound approach.

Burn: Agree ability to select diff. audio sources.

Robert: Not quite it. If browser has mic API - we should be able to use it.

Bjorn: Agree. But if not one, don't want to come up with one ourself.

Olli: agree.

Bjorn: If HTML standard has one, we should be able to use it.

Robert: Fine with HTML rather than browser.

Burn: Meta decision. Use HTML if exists, but not create one.

Robert: Have requirements for such an API?
... Latest draft doesn't have notion of stream of endpointing. And we care deeply about these for mic API.

Bjorn: Why does mic API need endpointing?

Robert: Can be a long way between mic and endpointer.

<burn> should "stream of endpointing" be "stream or endpointing"?

Bjorn: Requirement that endpointing be available for things other than speech.

Michael: Hopefully, have agreement - will work with people designing the API and express requirements.

Bjorn: Seems fair.

Olli: Capture API in HTML draft or draft working group.

Robert: Mean the one in the DAP working group.

Bjorn: Think we should work with HTML.

Burn: 2nd one tricky. Wrote we will capture an express requirement on a capture API to relevant groups.

Bjorn: Seems reasonable. Avoid "capture".

Burn: requirements on audio capture APIs.
... requirements on all audio capture APIs.

Bjorn: seems fine.

<mbodell> Olli, is there a capture API in the w3c HTML draft? I don't see it at http://dev.w3.org/html5/spec/Overview.html

<smaug> mbodell: I don't read that version of HTML spec ;)

Bjorn: If no HTML audio capture API. Propose that we proceed even without a mic API.

<smaug> mbodell: http://www.whatwg.org/specs/web-apps/current-work/multipage/dnd.html#video-conferencing-and-peer-to-peer-communication is an early draft

<burn> (now Robert is speaking)

Robert: Concern - browsers will need to implement privacy and security policies. Weird to have for speech alone, but not audio capture in general. May be messy.

Bjorn: Forge ahead, and consider audio capture in general.

Burn: Agreement that's important.

Bjorn: Having control over audio capture does not have to be in the first proposal.

Burn: Is that the consensus?

Bjorn: OK to have speech API if there is not an audio capture API.

Robert: Not create one, and shouldn't be blocked from moving forward.

Burn: Not create one and not block while waiting for one.

Michael: May design suboptimal if no audio capture API and may not fit well once it's there.
... Premature to jump to say we can make total progress without that.

DanD: Goal for group to submit the requirements to the other working groups. Accelerating the capture API for audio may be one of the recommendations. AT&T member of DAP. Recognize needs.

Bjorn: Agree we should not block this progress while waiting.

DanD: May create fragmentation.
... Unless abstracted completely to "get mic".

Bjorn: Agreed that we should start reco without specifying mic.

DanD: Concerned that we should avoid fragmentation.

Burn: Good to get agreement.

DanD: API for capture, if we are able to capture the audio without web developer going through coding, then we are fine.
... If anything specific in the web application to retrieve the audio handle, then we're looking for if-then-else statements.

Bjorn: We would like to do the former.

Burn: What is meant by "start of speech", "end of speech", and endpointing in general? How do transmission delays affect the definitions and what we want in terms of APIs?

Robert: Divide into smaller topics. Distributed env., with speech services remote. 2 notions of endpointing: by reco or cheap on client (responsiveness and reduced network IO). Look at these 2 as separate.

Bjorn: Throw out proposal. Require client-side simple endpointer?

Robert: Has my vote.

Burn: No endpointer on my computer.

Bjorn: Browser could do simple energy-based end pointing.
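
For illustration only (not discussed on the call): a minimal sketch of what a simple energy-based endpointer might look like, assuming audio arrives as frames of PCM samples in [-1, 1]. The names `rms`, `detectEndpoints`, `threshold`, and `hangover` are hypothetical and not part of any proposal.

```typescript
// Hypothetical sketch; names and parameters are assumptions, not agreed API.
type EndpointEvent = { type: "start" | "end"; frame: number };

// Root-mean-square energy of one frame of samples.
function rms(frame: number[]): number {
  if (frame.length === 0) return 0;
  let sum = 0;
  for (const s of frame) sum += s * s;
  return Math.sqrt(sum / frame.length);
}

// Emit "start" when a frame's energy first reaches `threshold`, and "end"
// after `hangover` consecutive frames below it.
function detectEndpoints(
  frames: number[][],
  threshold: number,
  hangover: number
): EndpointEvent[] {
  const events: EndpointEvent[] = [];
  let active = false;
  let quiet = 0; // consecutive quiet frames seen while active
  frames.forEach((frame, i) => {
    const loud = rms(frame) >= threshold;
    if (!active && loud) {
      active = true;
      quiet = 0;
      events.push({ type: "start", frame: i });
    } else if (active && loud) {
      quiet = 0;
    } else if (active && !loud) {
      quiet += 1;
      if (quiet >= hangover) {
        active = false;
        events.push({ type: "end", frame: i });
      }
    }
  });
  return events;
}
```

A detector like this is cheap and low latency but reacts to any loud sound, which is one reason a recognizer-side endpointer may still be wanted.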

Robert: Lots of options. GSM encoder has endpointer. Can have local reco and use for endpointer.

Burn: API needs to assume client as well as server-side endpointer. client could be null op?

Bjorn: Stronger: has to be something in the client that does tell start and end of speech. even if not good.

Michael: Can see recommending. Don't know how web author can know. requirement is low latency. doesn't matter after that.

Bjorn: Agree with that. But if app points to specific recognizer, can interact.

Burn: Why concerned. Reco can get finicky about input based on training. Endpointing is mostly done in advance. Be careful about requiring local endpointing. If bad, can affect reco.

Bjorn: Avoid bad endpointers.
... Low latency speech detection should always be available.

MichaelJ: But not forced to use it. FedEx example: some query - using endpointing from reco - want them to be able to use the standard. Client endpointing could cause errors.

Bjorn: Have some parameters. Make it easier for the app. Think you're speaking.

Burn: Ongoing recognition case - won't use local endpointer.
... plenty of open mic apps - listen for keywords.

Bjorn: Should be one, but should be possible for app to turn off.

Robert: probably want app to turn it on if it needs it.

Michael: Set a parameter and get it that way.

Bjorn: Hello world app.

Charles: Level for feedback - good to be local.

Burn: Low latency endpoint detector should be available.

Bjorn: Don't have agreement if on or off by default.

MichaelJ: Talking about detection of end of speech or start too?

Burn: may be big difference.
... Want low latency to turn on speech to reco - but don't want it to stop.

Bjorn: we do the opposite.
... Start streaming right away, server endpoints, but need to stop streaming at some point.

Robert: very scenario dependent. Need start stop speech event. Start when click of button, end matters a lot. Need to have options available.

Burn: Forwarding audio to expensive recognizers. Want high accuracy on end pointing. Don't want to send audio unless we have to due to expense.

Bjorn: Cutting off audio vs. endpointer. Can not listen for the event. Control if endpointing cuts off audio.

MichaelJ: Need to control when start sending audio to recognizer.

Burn: Start speech and reco can be different.

MichaelJ: If reco on for a long time, may want to do something to delay until there is certainty of speech.

Bjorn: Agree that a low latency endpointer is available. Should be possible for app to decide if audio is started or stopped on endpointer.

Burn: Audio start/stop separate from speech start/stop. Separately controllable.
... Detector detects both start/end of speech and fires an event in each case.
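
For illustration only: a rough sketch of keeping speech start/end events separate from the decision to cut audio. It assumes a per-frame speech verdict from a local detector. The function `gateAudio` and the flag `cutAudioOnEndpoint` are hypothetical names, not anything agreed by the group.

```typescript
// Hypothetical sketch; all names are assumptions for illustration.
interface GateResult {
  events: string[];     // speech start/end events, fired in every case
  sentFrames: number[]; // indices of frames forwarded to the recognizer
}

function gateAudio(isSpeech: boolean[], cutAudioOnEndpoint: boolean): GateResult {
  const events: string[] = [];
  const sentFrames: number[] = [];
  let inSpeech = false;
  isSpeech.forEach((speech, i) => {
    // The detector fires events regardless of how audio is handled.
    if (speech && !inSpeech) events.push(`speechstart@${i}`);
    if (!speech && inSpeech) events.push(`speechend@${i}`);
    inSpeech = speech;
    // Only the app-controlled flag decides whether non-speech audio is dropped.
    if (!cutAudioOnEndpoint || speech) sentFrames.push(i);
  });
  return { events, sentFrames };
}
```

With the flag off, the app still receives the events but every captured frame reaches the recognizer, which would suit the open-mic case mentioned earlier.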

Bjorn: Separate issue of cutting off audio.

Burn: Audio to the reco process as opposed to TTS.
... Audio start and stop to reco server (resource)...

Bjorn: Control over which audio is used for speech recognition.
... which part of the captured audio.

DanD: Make sure we carefully agree that we are not forcing the application into using the predefined environment engine of the browser and still allow developer which engine to use.

DanD: have a flag. If use optimized endpointing in application or not.

Bjorn: Separate from how you choose the engine.

MichaelJ: Related - if turned on, give some sort of event for local prediction of begin/end of speech, is that the resolution we want? If level detector, can also get level?

Bjorn: Should be a more precise way to get actual events from recognizer. Level part of mic API?

MichaelJ: Could be raw energy detector, limited reco listening for "silence", etc. for the local part. The browser, client side, can have best that it can. Not saying anything about how it's done.

Burn: May be a difference when there are multiple endpointers. (1) low latency - prefilter to decide if goes to reco, (2) high quality in engine.
... Would want recognizer's endpoint detector. But preprocess one is the low latency one.

Bjorn: 2 events: 1 probable vs. actual start/end of speech.

MichaelJ: Talking now vs. not. More going on underneath. Get complicated to expose underneath if varies by implementation. Energy level might drive aspects of the API.

Burn: Why want distinction? Mic open is one option. Another is that engine is paying attention. Another is that engine found something important.
... Might decide that it's not hearing anything.

Bjorn: Started capture, think starting, actually starting. 1st 2 go in the UI. Good to have last for timing.

Burn: In VXML2, have hot word detection. Concluded it doesn't act as if speech is detected until something happens. Acts as if nothing happened if nothing reco'd. May collapse 2nd and 3rd states.

Bjorn: Thought we had agreement earlier.

Burn: Agreed we had some sort of start and end. Knew we needed to discuss it.

Bjorn: 3.3.3. - onspeechstart/end/error. Need to add more to this list.
... propose adding onaudiostart onaudioend, and split onspeechstart to detected vs. actual (reco).

MichaelJ: energy vs. reco? split

Bjorn: Could be confusing.

MichaelB: onsoundstart?

Bjorn: sounds like a good name.

MichaelJ: Issues of calibration? Sensitivity parameters? Used on mobile phones or elsewhere. Might need calibration to work well.

???: Sensitivity and timeout parameters.

Burn: Whole topic to discuss parameters.

Bjorn: Discuss parameters in context.
... Agree on adding these events?

Burn: We will add onaudiostart/end ... Dan will cut and paste here?

Bjorn: onsoundstart/end should be low latency. Also say something about order.

Burn: OK. audiostart, soundstart, speechstart, speechend, soundend, audioend

Bjorn: Might not get soundstart or speechstart.

Burn: onsoundstart, require soundend.
... soundend optional?

Bjorn: not true.
... Can't have onspeechstart without the preceding two.

Charles: Want end events with start events.

Burn: Can have ends all at the same time.
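
For illustration only: a sketch of the ordering as discussed, treating audio, sound, and speech as nested levels. It assumes that every start has a matching end and that sound and speech may be absent; the call left both the optionality of end events and the error case open. `isValidSequence` and the `EVENTS` table are hypothetical names.

```typescript
// Hypothetical sketch; the nesting rule is an assumption drawn from the
// order listed on the call, not an agreed design.
type SpeechEventName =
  | "audiostart" | "soundstart" | "speechstart"
  | "speechend" | "soundend" | "audioend";

// Level 0 is audio (outermost), 1 is sound, 2 is speech (innermost).
const EVENTS: Record<SpeechEventName, { level: number; start: boolean }> = {
  audiostart: { level: 0, start: true },
  soundstart: { level: 1, start: true },
  speechstart: { level: 2, start: true },
  speechend: { level: 2, start: false },
  soundend: { level: 1, start: false },
  audioend: { level: 0, start: false },
};

function isValidSequence(events: SpeechEventName[]): boolean {
  let depth = 0; // number of levels currently open
  for (const name of events) {
    const { level, start } = EVENTS[name];
    if (start) {
      // A level may only open directly inside its parent level.
      if (level !== depth) return false;
      depth += 1;
    } else {
      // An end must close the innermost open level.
      if (level !== depth - 1) return false;
      depth -= 1;
    }
  }
  return depth === 0; // every start was matched by an end
}
```

Under these assumptions `audiostart, audioend` is valid (no sound or speech detected), while `audiostart, speechstart` is rejected because soundstart did not come first.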

Bjorn: what if onerror?

Burn: Great topic.
... capture that as issue for discussion.
... what happens to audio, sound, and speech events in case of error.

Bjorn: Also sensitivity discussion point. And timeout parameters for ASR.

Burn: Meeting next week. Can have call after that. Meeting after that. 2 days of meeting.
