Local logistics (member only)
Location: 4th floor
Belgrave House
76 Buckingham Palace Road
London SW1W 9TQ
 | Monday | Tuesday |
---|---|---|
AM1 | Milan Young | Marc Schröder |
AM2 | Dan Druta | Paolo Baggia |
PM1 | Debbie Dahl | Satish Sampath |
PM2 | Robert Brown | Michael Johnston |
A "crucial decision" is one that could a have sigificant impact on the architecture and thus should be completed before starting work on the API itself. A strawman partitioning of the discussion topics is:
- Crucial decisions
- Crucial decisions partially discussed
- Non-crucial (but still important) decisions
Dan: Review list of critical vs non-architecture changing decisions
Milan: Agreed to move recognition parameters to critical list
Bjorn: Summary from Dan's agreed design principles document
Dan: Web app author does not get to choose audio src
Debbie: If no audio capture API, then what?
Dan: We can still deliver audio, just that we don't want to design the general API
Dave: What about Bjorn's IDL?
Dan: Perhaps too soon given the number of discussion topics in flux
Bjorn: Good way to make consensus concrete
DanD: Other components of architecture that were missing from the IDL
... for example user privacy
Bjorn: OK as long as the contents of the IDL are accurate
Dave: Time permitting would like to work on the IDL
Debbie: Might be a good break if we get numb
Dave: Let's do 30min tomorrow for IDL
... just to get discussion started
Milan: <agreed>
Michael: What are engines?
Milan: I interpreted engines as speech engines
Bjorn: For web app developers, need consistency across browsers and engines depending upon type
... Let's enumerate consistency
... e.g. results that can be expected
1) Consistency between user agent + default engine
2) Across specified engine + user agent
3) Across specified engines
Robert: Another way to look at it is what developers are allowed to modify
Marc: What about consistency between default engine and remote?
Milan: Agree that default and remote engines should use the same API.
Milan: <agreed>
Robert: Some things, like confidence threshold, will obviously not be consistent
Dave: In VoiceXML builtin grammars were different
... because grammars were so small
... but large grammars are similar (like search or dictation)
Dan: In favor of an API for developers who don't know speech, but also useful when developers figure out speech
Robert: Perhaps better way to rephrase is to enumerate what could be different
Milan: The types of things that will be consistent will be the same across the categories
Milan: The categories are:
... 1) Consistency between different UAs using default engine
... 2) Consistency between different UAs using web-app specified engine
... 3) Consistency between different UAs using different web-app specified engine
... 4) Consistency between default engine and specified engine
<bringert> restaurant: http://kazan-restaurant.com/
Milan: How are we going to define the parameterization?
mbodell: It's undefined
bringert: ... Will be in the API
mbodell: eventually will be standardized
Milan: we're trading off functionality for consistency
bringert: not necessarily. It will be supported
burn: I thought we want to specify a feature without specifying what engine
Robert: Number 2 is hypothetical.
bringert: Cases are: 1. use default; 2. specify a feature but use the default; 3. use a specific engine
Raj: You should be able to specify URI and let everything default
Robert: you have to specify language
bringert: language should be a parameter in the API
mbodell: Why not use the web app language?
burn: Any language you specify is specific to an instance of an engine (in Voxeo)
satish: so it's an implementation detail
Robert: Engine and language are tied
... There are multi-language services
burn: engine is a service and should switch behind the scenes for the language
... We really only have two choices: default or specified
Robert: Use default when you don't want control of the service, and specify when the web developer wants control
... A dispatcher is required anyway
... Service is what will expose the engine
bringert: Cross site might be an issue. Allow access only from the developer's web site
<smaug> http://www.w3.org/TR/cors/
mbodell: the default engine might be a proxy itself
bringert: We don't have agreement on the language
... Privacy is an issue in itself
burn: We should not have a topic on privacy
<raj> Correction to my comment: Developer should be able to specify characteristics of the engines/services such as language, any app-specific preloads etc...in addition to having the ability to specify the URI of the engine if known a priori
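A rough sketch of the kind of call Raj is describing, seen from the web app side; SpeechInputRequest, serviceURI, language, setParameter, and the x-preload name are all hypothetical placeholders, not agreed API.

    // Hypothetical sketch only; none of these names are agreed API.
    var reco = new SpeechInputRequest();
    reco.serviceURI = "https://asr.example.com/reco";   // engine URI, if known a priori
    reco.language = "en-AU";                            // requested service characteristic
    reco.setParameter("x-preload",                      // app-specific preload, vendor-prefixed
                      "https://app.example.com/contacts.grxml");
    reco.start();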
Milan: if we have one that specifies a service and another one with a different service with a different protocol are they supposed to be consistent?
bringert: No.
mbodell: it is possible you can code it so it is consistent
Robert: Max number of grammars to support
Bringert: the list is fine
<ddahl> performance can include grammar size, number of grammars supported, accuracy, latency
<ddahl> performance might also include maximum size of nbest max
burn: are we talking about the list of inconsistencies or the list of guaranteed interoperability
... Just want to make sure we're talking about the same thing
bringert: the web developer writes an app and should know what happens in the user agent
mbodell: We're really trying to capture expected inconsistencies
satish: The same language used across services does not mean it is interoperable
burn: I would not be surprised if you deliberately want some TTS voices to have an accent
mbodell: Size and complexity should be added to the list
Milan: that's more a performance thing
Robert: Sensitivity is something over which we don't have control
bringert: exact semantics of parameters, including sensitivity and threshold, are expected to be on that list
Michael: however it happens in the browser, it is not expected that the performance will be the same
Milan: I don't agree; if it's not listed as inconsistent, it is assumed to be consistent
bringert: there should be an api for fallback and that should be consistent
... Time it takes to return and latency should be on the performance list
... We have not specified any optional events
mbodell: we should not specify the events yet
Milan: synthesis results should be in the performance list
Michael: One company builds their web site using one provider, but it will depend on the user agent and there's no guarantee that the input format is the same
bringert: the user agent might use a different codec and send a different format to the same engine on different platforms
logistics page http://www.w3.org/2005/Incubator/htmlspeech/group/2011/05/f2f_logistics.html#MondayDinner
burn: what is that?
bjorn: you want to dictate, continuous translation
robert: open mike scenarios, e.g. driving and issuing commands
bjorn: then you don't need streaming
milan: but you need to be listening all the time
satish: if there are disjoint utterances you could do that transparently
bjorn: in the command case there are natural pauses
... there's "always listening" vs. "streaming recognition"
burn: results streamed back is stronger than we actually need
... the results themselves are actually discrete, may be a sequence, that is, incremental results
... audio you have to send at regular intervals
glenn: audio is streamed, but results are not
bjorn: you could be sending complete results, longer and longer each time
dan: you really need to send the whole stream
bjorn: there's a web app API "on partial result"
michaelB: what does that partial result look like
robert: that doesn't necessarily make sense for dictation, which might be long
bjorn: that's the simplest API, might not always work
burn: one alternative might be a diff
michaelB: a structure like an nbest
bjorn: what if the web app can say, "snapshot"
robert: in most case you don't want hypotheses, just final result, whenever the recognizer decides, two events, "hypothesis" and "final"
burn: "final" is fixed
bjorn: it's an addition
burn: in a two-hour call, at some point you'll send a "final1", what is "final2"?
(everyone): everything since last "final"
michaelB: "final" or "hypothesis" is everything since the last "final"
paolo: it's possible you could change the past
burn: then you shouldn't send "final"
michaelB: it's a tradeoff between how often you send "final"
robert: you could do a rereco with a different LM if the topic changes
glenn: at some point you have to prune the nbest, don't want to keep appending
bjorn: you should get a final nbest with a final, and it doesn't change
robert: the recognizer has context, it doesn't necessarily report to the app
... what is the use case for sending hypotheses
michaelJ: you could have other ways to index results, e.g. id's, or we could just rely on ordering
bjorn: would prefer just to use ordering, point of "i's" is just to show interim feedback.
milan: suggest removing the "i's"
robert: why do you need "i's"?
paolo: user feedback
milan: users find interim results confusing, we removed it
bjorn: we don't have to require that "interim results" are fired
robert: for dictation, we should worry about correction, e.g. should we expect to look at the nbest, or just overtype?
michael: there is a purpose of defining them even if the web app doesn't display them, for talking about how much data needs to be sent.
robert: any decent recognizer will be able to decide when to send the "f"
milan: could send in a parameter to the recognizer that says how often you want results
michaelB: there are some use cases for getting intermediate results
milan: maybe we should reserve i's for v2
danD: another use case is to know that the recognizer is processing, the events may be useful
robert: then you could send events like "I heard speech", or "paused", or "got noisy", that would seem to be of a higher priority
michaelB: in the API there may be structures that are built up over time
... e.g. for correction
bjorn: or web app could do this
michaelB: doing automatically would make it easier
bjorn: we all agree on "f's", not sure about "i's"
robert: the "i's" don't even need to look like "f's"
michaelB: if you only have "f's", then you might need to be able to correct results from previous "f's".
satish: i's don't need an nbest
bjorn: we don't need to specify "i's" at all
michaelJ: say you have dictation. say something, then pause, at the end you send "overall message, here's my final answer"
michaelB: if we only have "f's", they'll be narrower in scope, and you may need to correct "f's"
milan: "i's" were incremental, would be ok if you could replace an "f"
michael: the thing that drives the simple "f" story means that you don't have to have a structure that lets you correct things
milan: you can explicitly correct results in previous sentences
michaelJ: people have done things like change the name of a restaurant after you get the city and state
danD: what if the recognizer sends "correction available"
milan: that's complicated
robert: you're reserving the right to change previous "f's"
paolo: in any case, timestamps are needed, for example to coordinate display
burn: I would like to see more concrete examples of what's going into the f's.
milan: should not give that responsibility to the web app, too complicated
michaelB: it's up to the recognizer to say what's changed
bjorn: every result event has an id, so that recognizer can refer to them
michaelB: every time a recognition happens it changes the result array
burn: what if you can't allow results to change? e.g. live closed captioning, it's gone before you can change it
bjorn: web app could just ignore those changes in this use case
burn: only the recognizer knows what to change
marc: web app knows what to do about the change
bjorn: we have result, correction, and array, then web app knows what to correct
michaelB: that array would be useful, even if the web app could construct its own array
bjorn: the array is just a convenience
michaelB: it would be very convenient
satish: how is the web app going to get the array?
michaelB: in most cases you'll want corrections to be displayed
robert: we have to discuss how much we want the UA to do on the web app's behalf. What about using hotwords to make corrections, because many users will not be able to touch the keyboard
bjorn: there could be "enable correction commands"
robert: it's hard to write the app that does corrections
glenn: this gets into maintaining the user experience across apps
michaelJ: what use cases are we including? do we include recognizing broadcast news, or rerecognizing people in a meeting? because those use cases bring up new requirements
danD: I would rather not get into complicated use cases
robert: we have to decide if we care about dictating documents; in that case we have to think about correction
michaelB: textArea could be long
bjorn: we have to limit the scope, we could say that correction is up to the UA
satish: would you want corrections to be different for different web apps?
... the hotword should be similar across web apps
bjorn: two use cases -- I want a field that users can speak in vs. a complete app
satish: you need to have a correction method across web apps
michaelB: you could have a hotword and then speak correct text
marc: correction isn't the only use for hotwords
danD: i would like to pass context as a hotword to the application
milan: you could have a manual process, or the fighting between the UA and the engine doing the correction, the speech engine should do it
burn: if you make the service responsible, when does the UA know that the user is done?
michaelB: in a multimodal context, correction could be done by the user in a different modality
... feedback from user might improve future recognition
milan: Google proposal had a feedback mechanism
satish: instead of making it voice-driven, you could allow the user to specify correction multimodally
bjorn: in the simple case you don't have voice correction
burn: your web app can always write some completely other thing
bjorn: in the single-shot case the result won't change, we can feed it back to the recognizer, but that won't change the result, what if we just made continuous recognition a sequence of these?
... the app keeps sending audio, it sends back results
michaelB: it makes sense for it to be the same structure, but not sure that they shouldn't change. in continuous reco, i think the results should be able to change
satish: we could work out the simple case and then add continuous case
burn: we want to make sure that we don't prevent the more complex use case, but we shouldn't spend two days working out the details.
milan: would you get the same events in the continuous case?
bjorn: you get a sequence of events, what can't we do with that? you wouldn't have continuous dictation support, but we could add that
... user corrections should be out of scope
milan: or engine should implicitly understand how to make voice correction
glenn: this is a simple API that we could build on
marc: how often would you send the array of recognized results
... after every sentence?
michaelB: or more often
bjorn: simplest way is to eliminate corrective frames
robert: what about punctuation?
... not in SRGS
... service probably isn't going to add punctuation
bjorn: it's up to the engine to decide if you want to speak punctuation
marc: it depends on the grammar
milan: or add a parameter
glenn: does this matter for the API?
... probably not
burn: this is a case where someone, perhaps bjorn, should write up a description of how continuous recognition should work in this simple case
michaelB: there are interesting parameters, such as whether you're more interested in semantics or utterance, also should I try to clean up text or just transcribe literally
bjorn: we have utterance and interpretation already
burn: it should be up to the service to return utterance or interpretation
glenn: should be a parameter to the recognizer, not two recognizers
burn: I don't think all recognizers will support this
paolo: found a use case for the "i", in the case of dictating a document, you might use commands, but the recognizer isn't sure if it should understand command or do continuous recognition, but has to wait until web app decides
glenn: have two grammars
milan: "final" are not final until the end
danD: after you receive the final "final" you clear the buffers and then you're done
michaelB: a different use case is speech-enabling a common web form, you may have both text areas and single field inputs. you want to be able to speak to the page, not click on each box individually
... for example, flower delivery application
... this would be complicated
satish: speech-enabling existing forms will be difficult
bjorn: someone could write a Javascript library to do this, and you could do a different one if you wanted
michaelB: also need to handle bindings. I want to have the user just have a natural speech experience, and it just figures out how to fill out fields.
... if each of the fields has a grammar, then the app could figure out how to do bindings, this could be done by the standard
paolo: if there's an open text then there's a problem
milan: at Nuance we reflect the DOM in the speech server
bjorn: this could be done by a Javascript library
robert: I would be skeptical if we thought that we could do this as part of the standard. we should just provide API for someone to build that app
michaelB: your continuous recognition case doesn't just mean speaking into a text field; it could be a combination of single shot case and continuous dictation
bjorn: do we get the SISR semantics for continuous, it needs to divide it into semantically meaningful units
michaelB: could have the same text with different semantics
bjorn: what if you have a grammar for the continuous case? you could have a grammar as well as an SLM
robert: that would be another special ruleref
michaelJ: you'll get all these events for fields; when doing dictation, what is the grammar?
michaelB: I think there should be a lot of default behavior but you can override
... if you want to have multiple grammars active, the match could be tied to a grammar
satish: what if you have two forms in the same field?
bjorn: you can't speak into multiple fields at the same time
michaelJ: whatever other mechanism that we have that would enable you to replace results, we should handle the simple case and not have to think about the structure of complex events
... you have to be able to do simple recognition without having to think about the complex
satish: filling multiple forms on the same page isn't that simple, you might have to use multimodal interaction to choose the form that you're filling out.
glenn: how much does the web developer enable
burn: any other continuous recognition topics?
raj: continuous recognition isn't for dictation only
robert: could have continuous dictation for SRGS grammars, e.g. games, and i would want to activate and deactivate grammars continuously
... all time-locked in some manner. stream of audio needs to be synched up to when grammars are active
milan: could have race conditions
robert: yes, would have to be a best effort
michaelB: this could be the case of a change in any parameter
burn: every chunk has its own set of active grammars, it doesn't change in the middle of a result
paolo: even the application can decide that it wants to add more constraints
milan: you need to know which grammars are active
michaelB: next version of EMMA will have this
michaelJ: does this include recognition of e.g. broadcast stream, or is this just a single user in front of a screen
burn: we are assuming that everything comes from a single speaker, but we don't expect our services to know that
michaelB: this might change in the case of microphone arrays
raj: you could have a game with several participants
burn: overlapping speech is the problem
michaelJ: if you have a service that's capable of differentiating speakers, you wouldn't want to stop anyone from doing that
marc: if you have a dialog between the machine and the user, you have TTS output, with an open microphone, how do you stop the engine from attempting to recognize the TTS?
... is this in scope?
michaelB: two aspects, do we have to worry about this for TTS from our API, or what if the UA is producing sounds?
robert: we have to leave this up to the UA
glenn: are you going to feed audio from, e.g. a teleconference, back to the speech service?
burn: maybe the UA doesn't even know what's going on
robert: it's not us
danD: this might be a reason not to separate ASR from TTS
... if it knows about what it said, it could separate the TTS from what it hears
michaelB: we have to figure out how to change parameters. we talked about both open mike and dictation, is there anything we have to talk about with the open mike?
burn: have not captured any agreements
bjorn: writing up different kinds of events
robert: we might want to change parameters over the course of a recognition
milan: that information would be transferred to server.
michaelB: but the parameters would not change in the middle of a result
michaelJ: there's what we would like people to be able to do, and there's how to make it happen, can you be streaming audio, buffer it a bit, change parameters?
... we have to have that discussion
michaelB: back to open mike/hotword. there will be long periods of time when you're not getting matches, what happens when you don't get a match?
... do you actually fire nomatches?
burn: in VoiceXML you don't fire nomatch
michaelB: how do you handle nomatch?
glenn: what if multiple results, one on hotword and one on dictation? would that be helpful for corrections? in hotword case it might be interesting to know there was speech
michaelB: it would be more convenient for results to be together
michaelJ: we have that in EMMA
michaelB: it's still interesting to consider nomatch in the open mike case
milan: maybe you should just ignore it
burn: agreements -- for continuous recognition we must support changing parameters
milan: didn't we say we didn't want to deal with echo cancellation?
???: not sure we agreed on that?
milan: we agreed that we wanted to handle continuous SRGS
Bjorn: what are the possibilities, specify other things or get an error?
... different from errors
... and stuff
Raj: fallbacks for specified language and specified speech service
DanD: who's in charge of the fallback?
Burn: fallbacks should be to things we know actually exist
... where there are a set of potential fallbacks, they should be author selectable
... fallback in SSML where a requested language isn't available is one example of a precedent
MichaelB: codecs don't necessarily have fallbacks due to other restrictions
Burn: SSML always tries to say something even when the requested resource isn't available. this is partly a result of VXML's queueing
MichaelB: not only might the requested speech service not be available, but also the default service
Bjorn: but in the HTML IMG tag you don't say that if this image doesn't work here's the fallback
... better to throw an error
DanD: part of the API is to have a discovery API
Burn: if you throw an error, you need a discovery API. if you have a fallback, you don't
MichaelB: the fallback for a specific service may not necessarily be appropriate
... if you want to fall back to the default, you can catch the event and swap services. the default should be to not have speech
Bjorn: Propose: if you request a service, and it is not available, an event fires that the app can catch in order to select another service
MichaelB: there is no fallback to default service
MichaelJ: would there be an automatic behaviour that greys out the microphone icon? or should it only happen through script?
DanD: it's not fallback on the service, it's fallback on the user experience
Olli: querying a service beforehand to see if it's available could expose privacy issues.
DanD: so enumerate all service capabilities
MichaelB: except that in most cases it'll just discover the user's preferred language if used against the default recognizer
Burn: propose: the web app API should provide a way to determine if the service is available before trying to use the service
Raj: what does that buy you? it could still go down
Bjorn: it could remove the UI
... do we ask "what languages do you have?" or "do you support this language?"
MichaelB: or just try it and see if it fails, although that would be bad
Burn: what other potential fallbacks should we discuss?
Milan: protocol? continuous reco may not be available
DanD: checking for capabilities gets two birds with one stone, since it also enumerates capabilities
<smaug> Would API like this work: Speech.getService(ServiceURL, { configuration: foobar}, success_callback, error_callback);
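A possible expansion of smaug's one-liner into a usage sketch; Speech.getService, the option names, and the callback shapes are hypothetical and only illustrate the ask-then-handle-failure pattern being discussed, with showMicButton/hideMicButton standing in for app code.

    // Hypothetical usage of smaug's proposed call; not an agreed API.
    Speech.getService("https://asr.example.com/reco",
        { language: "en-GB", continuous: false },      // requested configuration
        function (service) {                           // success: the service is usable
            showMicButton(service);
        },
        function (error) {                             // failure: unreachable, unsupported
            hideMicButton();                           // language, permission denied, ...
        });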
Satish: is there a precedent in HTML to change the UI when a service isn't available?
Bjorn: the google translate app only shows UI when TTS or SR are available
Debbie: the difficulty is that you can't see what failed
Milan: ok if we have defined error codes
DanD: inefficient to keep requesting over and over until finding a parameter set that works
MichaelB: because it's a method off the object, it can test with anything on that object that you can set. e.g. the grammars
Debbie: is a service obliged to tell the truth about its capabilities?
... e.g. you want en-au, but it only has en-uk, but it decides this is okay
Michael: return value could give some sort of qualitative capabilities measure
Burn: easier to do vendor specific stuff through polling than through a get capabilities/languages/etc API
DanD: boxing ourselves in if we try to enumerate a specific set of capabilities. need something extensible
Bjorn: HTML APIs normally have separate methods, rather than one huge structure
MichaelB: the protocol could still just query capabilities in one chunk, and the API could break it up
MichaelJ: so if you want to check whether a grammar is supported, should the service download your huge grammar?
Burn: could be a prepare/prefetch function
Bjorn: grammar should be from the same site as the page
MichaelB: disagree. just like images, etc, apps can fetch grammars from 3rd party sites
MichaelJ: will have a rich set of defined errors
Burn: the API must provide a way to query the availability of a particular capability of a service
... and the API must be able to enumerate the capabilities of a service
Milan: since querying language is only an issue in the local case, maybe we should only restrict this locally, but allow it remotely
Bjorn: or there's a setting in preferences, or not in incognito mode
Satish: don't we already need to get the user's permission prior to doing reco? if so, why not require permission before returning capabilities?
... should be governed by the same privacy policy
Bjorn: don't want to pop a permission UI just to know whether or not to render a microphone button
... up to the user agent how to interpret the request
Milan: if the user agent doesn't want to reveal the private info, it should say so
Burn: web app only talks to UA. UA discovers capabilities of a service, and decides what to expose to the app
Olli: use same permission/security/policy API as whatever gets specified somewhere
<burn> tentative text: The API must provide a way to ask the user agent for the capabilities of a service. In the case of private information that the user agent may have when the default service is selected, the user agent may choose to give incorrect information to the web app or decline to answer.
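One way burn's tentative text might look in script: the web app asks the user agent (not the service) for capabilities, and the UA is free to decline or blur its answer. Every name here is hypothetical, including the render helpers.

    // Hypothetical capability query; the UA may answer, decline, or be deliberately vague.
    speech.queryCapabilities("default", function (caps) {
        if (!caps) {
            // UA declined to reveal anything (e.g. privacy mode)
            renderGenericMicButton();
        } else if (caps.languages.indexOf("fr-FR") !== -1) {
            renderMicButton("fr-FR");
        }
    });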
MichaelJ: still worried about error codes. there's an infinite number of ways something can fail
Satish: error codes are specific, text isn't because it needs to be localized
Milan: the user agent should not provide incorrect information. better for it to say "no comment"
Marc: is it just the language that's a privacy issue? e.g. list of installed voices
Debbie: haven't spoken about speaker dependent recognition, but exactly who the person is may be private
MichaelB: there are also things in the recognition result that may be private. e.g. a guess at the speaker's gender
Raj: why enable the user to trust only a specific set of speech services?
MichaelB: because the web app doesn't necessarily have access to the audio
Bjorn: if we give an app reco results, we could give them the audio, gender, age, etc, given that there are APIs coming that will allow direct access to the mic
MichaelJ: there are subpoenable issues. having a transcription of somebody allegedly saying something is different from having the audio recording
Bjorn: user agent can hide capabilities or deny the use of capabilities it doesn't like
... or services
MichaelB: or web apps
Debbie: different user agents could deny different things, which would affect consistency
MichaelB: consent could be parameterized on different dimensions
Burn: comparable to flash on ipad
Bjorn: wouldn't it be nicer to have one grammar that refers to the others
Milan: more convenient to activate/deactivate specific grammars
Bjorn: how do you specify weights?
... how about semantics from each?
MichaelJ: weights don't work the same
Debbie: or you want to reuse somebody else's grammar, or a built-in
MichaelB: can't ruleref to an SLM
... but can run them both at the same time
... and if you have grammars for each field, could enable/disable grammars depending on what's in focus
Burn: multiple simultaneous recognitions is not the same as multiple simultaneous grammars
Bjorn: would really like to ruleref to SLMs
Burn: these topics have come up repeatedly in voice browser working group (new rulerefs, and SLM format)
... SRGS doesn't say that it doesn't allow SLMs for builtins
MichaelB: it sounds like we all agree that multiple simultaneous grammars is okay
Bjorn: there should be no impractically low limit on the number of SLMs & SRGSs
Milan: and we agree it should support weights
MichaelJ: multiple requests to different services are different recognitions
Milan: and we must support multiple simultaneous requests to services (same or different service)
Satish: wouldn't be hard to implement in Chrome
Glen: what's the use case for this?
Satish: there's no requirement that the audio for both requests start at the same time
Burn: nobody objects to doing this in the API as long as they are logically independent
DanD: what's the limit?
Burn: practically, there's a limit, but the spec doesn't need to state it
Glen: does this complicate consent?
MichaelB: just need to indicate that something you consented to is listening, not exactly which ones
MichaelJ: need to discuss how we specify weights
Robert: what does it mean?
bjorn: we discussed on the call that it should be possible to implement ASR without TTS or vice versa
... Microsoft proposal used some shared parts of the API
... reason for linking them is barge-in
... but not much more
burn: there are common use cases for only-tts output
... not so many use cases for only speech input
bjorn: voice search
robert: the scenarios where they work together are where we need to find out if there are any weird interactions
... so asr and tts should be part of the same api family
danD: where should they be closer together: on the developer side, or on the engine side?
... is it just a convenience for the developer?
robert: hopefully the apis are loosely coupled
bjorn: except for barge-in or starting speech output after speech input, are there any reasons for coupling the two?
satish: not different from playing non-speech audio or video; what is special about tts?
burn: noise cancellation, expectation that people have that when you speak to a machine, it will stop speaking when you speak to it.
... that's different for a youtube video playing at the same time.
robert: as a developer, what do I need to build a good interface?
michaelB: maybe too early to discuss the topic
bjorn: what do we want to be able to do? I start speaking, something that is playing stops
burn: ... within a certain time frame
michaelB: and we need to know the marks from SSML
danD: should asr and tts be allowed to be different services?
burn: yes, can be completely different URIs
michaelB: you may want to coordinate asr and tts, for example the tts output should wait before it starts until a grammar is loaded.
burn: no requirement to have the same service for asr and tts.
bjorn: the service, or the UA could coordinate the two. shouldn't have to be the webapp to do this.
danD: so if the URI for asr and for tts is the same, should the two be configured separately?
burn, bjorn: yes
bjorn: would the barge-in be already solved using current events and javascript interfaces?
burn: in mrcp, we had a requirement on ordering of events
... where the same service did asr and tts, the service could do a faster coordination between the two internally.
... but I don't think we should impose such a requirement here.
robert: usually when thinking about barge-in, we think about low-latency contexts. but on the web, we have long latencies usually
... so just coordinating events might not solve our issues.
bjorn: we talked about requiring a low-latency speech detector in the client. would that be sufficient for barge-in?
michaelB: for certain cases of barge-in, you need the more advanced speech detector in the speech service.
robert: what is the scenario?
... 1. while the tts is speaking, interrupt as soon as there is any sound
... 2. a lot of background speech all the time, but then there is a keyword
bjorn: what is an acceptable latency?
burn: as fast as possible
milan: a round-trip to the server and back to the client is unacceptable
burn: agree
bjorn: if the speech server is taking the decision to stop output, that round-trip is inevitable
joint clarification: two round-trips would be unacceptable, but stopping it locally after the speech service takes the decision to stop is fine.
milan: need to preserve timing.
... asr service needs to tell tts when barge-in occurred, so tts can check how that relates to its ssml marks
glen: the only thing that matters is what audio the client has played, not what the tts service has sent.
robert: for both output and speech input can keep track of times, put both on the same clock.
michaelB: if client means UA, I'm fine with that. shouldn't have to be the webapp.
bjorn: how fine-grained a timing we need?
burn: in voicexml 2.0, we had mark; in voicexml 2.1, we added offset from marks, in milliseconds.
milan: relevant user perception is around 30ms precision
glen: what is the use case for knowing the timing?
burn: it would be nice to know which word that was when the user barged in.
michaelB: if you hear a list, you want to know if you are 2 ms into the list item, or 2997 ms into the item
glen: don't see the reason for wanting to state "1.7 seconds after the mark"
burn, michaelB: disagree strongly. this is exactly how humans work.
burn: what should be the granularity?
... for the client it would be easy to specify to the millisecond.
bjorn: ok, so what API do we need?
robert: asr provides a client-adjusted time stamp for a recognition result.
... then it should be possible to ask tts, "give me the mark and offset for this timestamp"
michaelB: this could be done by the UA rather than the webapp
robert: agree, if webapp author makes it explicit that "this is the tts I care about", this can happen behind the scenes.
bjorn: this could be done in a javascript library, on top.
... say you have two tts...
burn: as long as the timing information is available, this can be done.
... client needs to tell speech service, "in my local time, this is when it started"
michaelB: not convinced, the simple case of one-stream-one-stream barge-in coordination should be part of the API
burn: I could live with either choice.
robert: tend to agree with bjorn, because we have the two types of barge-in. the UA cannot be expected to keep track of the hot word case.
burn: it would be very easy to tell the UA: on speech start, stop tts output.
michaelB: agreement that we want to be able to support the complex interactions of multiple asrs and ttss
... would this complicate the API?
bjorn: it might complicate the API but simplify the apps
robert: have implemented this API quite a while ago.
... relatively simple api.
michaelB: make the 99.5%-of-the-cases scenario really simple.
bjorn: but this would be a single line of javascript code...
... do I understand correctly that this would be a convenience, not offer new functionality that you couldn't do otherwise?
... yes, to make web app developers' lives easier.
... probably doesn't make the architecture much different.
bjorn: right, so this is no longer a crucial decision.
burn: it's like a "play-and-recognize"
olli: a similar thing happened in other cases: the method appeared first in script libraries, then the browser vendors implemented it.
bjorn: different design strategies: I prefer providing the simple basic stuff as a starting point.
robert: when we (MS) prepared our proposal, we wrote asr and tts parts separately, then discovered some parts are exactly the same.
(dan is capturing agreements reached)
<burn> Decisions added during this session were:
<burn> - We disagree about whether there needs to be direct API support for a single ASR request and single TTS request that are tied together.
<burn> - It must be possible to individually control ASR and TTS.
<burn> - It must be possible for the web app author to get timely information about recognition event timing and about TTS playback timing. It must be possible for the web app author to determine, for any specific UA local time, what the previous TTS mark was and the offset from that mark.
<burn> - It must be possible for the web app to stop/pause/silence audio output directly at the client/user agent.
<burn> - When audio corresponding to TTS mark location begins to play, a Javascript event must be fired, and the event must contain the name of the mark and the UA timestamp for when it was played.
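A sketch of how the last two decisions might look to a web app author: a hypothetical "mark" event carrying the mark name and UA timestamp, and a recognition timestamp used to locate the barge-in point relative to that mark. The tts/asr objects and all event and property names are illustrative only.

    // Hypothetical objects and event names; only the decision text above is agreed.
    var lastMark = null;
    tts.addEventListener("mark", function (e) {
        // e.name: the SSML <mark> name; e.timestamp: UA-local time its audio started playing
        lastMark = { name: e.name, time: e.timestamp };
    });
    asr.addEventListener("speechstart", function (e) {
        tts.pause();                                   // stop output directly at the client
        if (lastMark) {
            var offsetMs = e.timestamp - lastMark.time;
            console.log("barge-in at " + lastMark.name + " + " + offsetMs + " ms");
        }
    });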
In Crucial Decision:
"What is the mechanism to use capabilities outside of the mandated interoperable set, and where (default engine, remote engine, or both) can they be used?"
Milan: Protocol impact
Dave: design API first then protocol
Michael: We wrote something on POST body
DanB: Yes, but we can have more
... we touched this in other discussion
Bjorn: One thing is parameters
... or we can put them in the URI ...
DanB: I don't like having them only there
Milan: In POST body
... Naming scheme?
Bjorn: If it is in the URI it doesn't matter
Satish: It will be server specific
DanB: I don't like calling them 'extensions', I prefer vendor- or server-specific
MikeB: More interesting is to figure out events in return
ddahl: Important to define the whole set of options
MichaelJ: If it is around ASR: emotion detection, prosody detection ...
Bjorn: Most of them can be in EMMA, if they are at the end
MikeB: If you need event-based coordination, it is difficult to see how to have that in EMMA
ddahl: You can configure the server
MichaelJ: In EMMA you can add more in the content
Marc: We talked about filtering EMMA
MichaelJ: Filtering is to do ???
MikeB: For the extensions you need coordination and passing, so the UA is free not to use them, like emotions, or other stuff
... The UA has a role to be active or accept extensions
Olli: Handling events will be trivial, speech service specific events are more complex
Satish: Add extensions to events to handle to the service
... extension properties to be sent to the service
DanB: Aside from naming, are there other questions
MikeB: There was some talk of these being strings, but sometimes they might be binary or structured data.
... not clear strings are enough
Bjorn: Do we need vendor specific events?
DanD: you have components that you want to parametrize ... how do you specify them for a standard service
... there are multiple types, for certain a URI might be best, especially for initialization
Bjorn: I don't say URI only ...
DanB: We should not limit extensibility
Bjorn: How much provision should the API put in the UA for parametrizing the server
MikeB: vendor specific handler
DanB: We have API and we have events that are received from web app author
Milan: What about sending a grammar, or a property
... if we allow any parameter at any time
Bjorn: Is it enough to allow any parameter with a free name and value, but no events back or calling any method?
Olli: Generic extension events might be useful, then you can have binary
Bjorn: We propose three:
... 1. set parameter with name and value
... there might be standard names and vendor specific "x-"
Discussion on agreement on names
scribe: about values: objects or strings?
DanD: the actor is web developer
Olli: value will be JS object
<agreement>
Robert: standard way to serialize the value
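A sketch of the parameter convention just agreed: free-form names (vendor-prefixed when non-standard) with values that may be arbitrary JavaScript objects. The request object and the vendor parameter are made up.

    // Hypothetical request object; only the naming/value convention reflects the agreement.
    req.setParameter("maxresults", 5);                          // standard parameter
    req.setParameter("x-acme-noise-model",                      // vendor-specific, "x-" prefix
                     { profile: "car", aggressiveness: 0.7 });  // value is a JS object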
MichaelJ: The action and methods
is recognize, if we do verification the answer will come back
in EMMA ...
... after speaker recognition I look in a Db and put in EMMA
results ...
... return HTML in the panel to change the interface
DanB: We don't need extension for EMMA, because it is extensible
Bjorn: number 2
... ok for speech service to add whatever extra in EMMA
Milan: Must be in EMMA?
MikeB: Yes
MichaelJ: different ways to do it in EMMA: put it in Interpretations and use emma:info for additional annotation
... or to have derivation chains
... For recognition, it might return emma:literal, but not in the API. This is bad.
Bjorn: We can put it in the protocol or leave it to the UA
... Web app doesn't need to look at EMMA
MikeB: Web app might not reflect emotion as easily as confidence or other simpler stuff
MichaelJ: Service return both EMMA and special fields?
Bjorn: I don't mind
DanB: You can put application-specific data in EMMA today, so we are recommending that
MikeB: We are not specifying this as the specific EMMA, use EMMA in the way you like
DanB: It doesn't add beyond EMMA
MichaelB: If you have EMMA ...
Paolo: is there agreement to have EMMA and also other parameters like utterance and confidence?
Bjorn: yes it is
Milan: Put EMMA in EMMA
Bjorn: Yes
Milan: it seems asymmetrical not to have multipart
... you have to encode
Bjorn: You have to encode in any case
... it seems a special case to add binary data, but it is doable
<agreement on point 2>
Bjorn: point 3
... events that come from the service to the web app, apart from the standard ones
... we need to be able to register event listeners
Olli: No normal DOM event
... it is arbitrary
Bjorn: Browser might reject
Satish: Allow browser to register a name and an event
Bjorn: to avoid conflicts in future
MikeB: If through the extension events, I want to send a standard event
Satish: only difference of a single listener
DanB: All listening to the same event and parsing
Bjorn: You set multiple event handlers to look to type ...
Olli: It is annoying
Bjorn: If we have a method like set-extension-event handler
Olli: You have in DOM tree events, in Mozilla most prefix events ...
Bjorn: We can use that
Olli: We can use CustomEvent in DOM3 Events
... very flexible
<smaug> http://dev.w3.org/2006/webapi/DOM-Level-3-Events/html/DOM3-Events.html#interface-CustomEvent
Olli: you can specify any event
<bringert> http://www.w3.org/TR/2009/WD-DOM-Level-3-Events-20090908/#extending_events-Vendor_Extensions
Correct URI is: http://dev.w3.org/2006/webapi/DOM-Level-3-Events/html/DOM3-Events.html#interface-CustomEvent
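For reference, the DOM3 CustomEvent mechanism Olli points to; CustomEvent, dispatchEvent and addEventListener are real DOM APIs, while the event type and payload below are invented to show how service-specific data could ride in the detail field.

    // Real DOM CustomEvent plumbing; the "x-acme-emotion" event and its payload are made up.
    document.addEventListener("x-acme-emotion", function (e) {
        console.log("emotion guess:", e.detail.label, e.detail.confidence);
    });
    // A UA or library relaying a service-specific message might do:
    document.dispatchEvent(new CustomEvent("x-acme-emotion", {
        detail: { label: "frustrated", confidence: 0.62 }
    }));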
Milan: We might want to send timings
... tracking timing in the client might be more robust
Bjorn: It is more a protocol matter: when the protocol starts recognition, the UA should send to the speech service the beginning of audio capture
DanB: it seems a protocol detail
Bjorn: I think it is important to agree on it
... Agreement on point 3 is to use DOM3 Extension Events
Marc: Very tiny events tied to audio
Bjorn: you could send event with timestamps, sent in advance
Marc: you have two independent streams in case of barge-in
MikeB: It is doable, but the bar is raised for doing it.
... At this time fire this event
Bjorn: Fine
... Agreement: speech service should be able to instruct the UA to fire a vendor specific event at a specified time in the future
Marc: not specific time, but on streaming audio
Bjorn: It has to be relative to a client time, or multiple absolute times
MichaelJ: The audio and video might be done by a single object, to avoid problems
Marc: Ideally yes
Bjorn: Requires having EMMA extra stuff for TTS
Marc: Has to be possible to be played through something else
Robert: we can agree to fire event at offset of streamed audio
MichaelJ: Use TTS to point at a URI which mixes timestamps, etc
... ship arbitrary requests
Bjorn: You don't use the TTS API
... use the Audio tag
MichaelJ: I have two more:
... a. do audio/recognition with a string back, user corrects by MMI, do we provide support for text?
Bjorn: There is a point: "Do we support simulated reco input (semantic analysis of text)?"
MichaelJ: second
... b. You want to send other stuff beyond audio, people talk and click to send.
Bjorn: Could you hijack the click in place
... you could put it in a parameter
MikeB: Use EMMA
MichaelJ: But EMMA is for input
... for instance the UA captures InkML
Bjorn: to be sent to the service with voice
MichaelJ: We can send parameters not only at the beginning for that.
Bjorn: We can do that to intersperse things ...
DanB: It seems off topic
Bjorn: We can do a protocol to do that
Milan: timing information need to be sent
Bjorn: This is not related to number 1, this is a protocol issue
DanB: Any parameters sent from UA to service should include a timestamp
<agreement>
MikeB: A parameter if it is at the beginning, an event if it is in the middle of something
MichaelJ: In JS API there is a set-parameter to send encoded image
MikeB: yes
MichaelJ: Maybe sending it as multipart ...
Bjorn: You can chunk the audio and intersperse with that. It is not standard, but doable.
DanB: A good topic for PM is to talk about the protocol
Bjorn: I'd like all the crucial Web app API first
DanB: yes if we have time
DanD: I have classes of parameters: authentication and privacy (who is able, at what time, ...)
<ddahl> this general parameter-setting mechanism could be very general, e.g. you could send a parameter like "the sun is shining", and since parameters can be sent at any time, you have a general eventing mechanism
Bjorn: I don't see any privacy issue to standardize
MikeB: It might be the case that I don't want to communicate with services that are not secure ...
Bjorn: It seems a UA specific setting
Milan: https URI
Bjorn: User can check box, like block images in pages
MikeB: or cookies secure
Bjorn: Web apps loaded on a secure connection should use only a secure connection to the speech service
... there should be some recommendation
Robert: The bar should be the same as in other contexts
MichaelJ: We have to say, we must support https in addition to http
<agreement>
DanD: If the web apps use https ...
... if the web app is using https and the URI isn't https, there should be a pop-up
Bjorn: We cannot standardize it because it is part of the browser
... important to capture:
... 1. must support https
... 2. the default speech service implementation shouldn't use unencrypted communication when the web app uses secure communication
<agreement on point 1>
Discussion on point 2
Milan: Should not use unencrypted
<burn_> discussion agreements during this session were:
<burn_> - It must be possible to specify service-specific parameters in both the URI and the message body. It must be clear in the API that these parameters are service-specific, i.e., not standard.
<burn_> - Every message from UA to speech service should send the UA-local timestamp.
<burn_> - API must have ability to set service-specific parameters using names that clearly identify that they are service-specific, e.g., using an "x-" prefix. Parameter values can be arbitrary Javascript objects.
<burn_> - EMMA already permits app-specific result info, so there is no need to provide other ways for service-specific information to be returned in the result.
<burn_> - The API must support DOM 3 extension events as defined (which basically require vendor prefixes). See http://www.w3.org/TR/2009/WD-DOM-Level-3-Events-20090908/#extending_events-Vendor_Extensions. It must allow the speech service to fire these events.
<burn_> - The protocol must send its current timestamp to the speech service when it sends its first audio data.
<burn_> - It must be possible for the speech service to instruct the UA to fire a vendor-specific event when a specific offset to audio playback start is reached by the UA. What to do if audio is canceled, paused, etc. is TBD.
<burn_> - HTTPS must also be supported.
<burn_> - using web app in secure communication channel should be treated just as when working with all secured sites (e.g., with respect to non-secured channel for speech data). -
<burn_> - default speech service implementations are encouraged not to use unsecured network communication when started by a web app in a secure communication channel
Bjorn: started writing the IDL because it seemed like we agreed on a lot of things and an IDL would make it clearer.
<satish_> MikeB: are we reviewing the proposal and making corrections to it?
Bjorn: for the JS api, grammar should be a sequence of objects and each with its own weight.
Dave: grammar should not be optional and there should be a built-in value
DanB: perhaps better not to get too much into implementation details
Bjorn: agree with Milan that there should be a language attached to grammar so that you can specify a language for built-in grammars too
DanB: similar to voice xml
Robert: should built-ins include language?
Bjorn: perhaps better to not have part of the URI
Robert: If we are making up a scheme for built-in URIs, we could include language in there too
MikeB: In the HTML API it may be understood implicitly based on which element the binding is done to.
... we need a sequence of objects, the objects have properties etc.
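A sketch of the grammar list being described: a sequence of grammar objects, each with its own weight and, per the earlier point, a language, including a built-in. The property names and the built-in URI scheme are placeholders.

    // Hypothetical shape for the grammar list; not agreed IDL.
    req.grammars = [
        { src: "https://app.example.com/pizza.grxml", weight: 1.0, lang: "en-US" },
        { src: "builtin:digits",                      weight: 0.5, lang: "en-US" }
    ];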
Robert: I see 'maxresults' property and there is potentially more like that.
Bjorn: should we have a 'setParameter' method or have them as separate properties?
Robert: prefer typed programming paradigm
... for well known properties have them as named attributes and have an additional setProperty method for other attributes.
Milan: could have a properties object which has all properties on it.
Robert: typical to have a flatter API
MikeB: supports property type apis as well instead of a setProperty method
Bjorn: agreement that standard parameters would be settable as dot-properties but they can also be passed in to the setParameter method.
... Should we discuss other standard parameters to include in this IDL?
DDahl: easier to think about after this f2f and discuss later.
Bjorn: the start/stop/cancelSpeechInput methods were in general agreement
Robert: the names seem verbose, could shorten in a later iteration
Bjorn: <describes the events>
Dave: is there a strong requirement for onsoundstart and if not could we trim it?
DanB: these were agreed upon in past discussions
MikeJ: could we have an event for energy level changes?
Robert: seems like inventing a microphone api and we should rather use an existing api if available
Bjorn: once such an api is available we plug that into the SpeechInputRequest object etc.
Robert: Is a start guaranteed to also fire an end eventually?
Bjorn: yes, eventually
... the email I sent yesterday expands on the 'onresults' event and the event object received.
MikeB: We haven't decided on how to handle NO_MATCH case, could either throw an error event or an onresult event with an empty list.
Bjorn: result event has emma document for each result item as discussed earlier.
Dave: the IDL looks quite comprehensive, we perhaps need a pruning exercise to reach a core api instead of solving all the issues
DanB: I have the opposite concern, don't want a perfect api but also don't want a toy api which only works for very few cases and having to revisit soon.
Dave: worried that browser vendors may not adopt a complicated api
Bjorn: my only issue now is when we get to the protocol which may be tricky
... had the same concern as Dave initially but things have been going smoothly
Bjorn: 3 different events described, 'result', 'intermediate' and 'replace' events.
MikeJ: should the replace event id be a range or individual?
Bjorn: send separate replace events for each id
MikeJ: problematic if the service segmented by mistake into 2 utterances and if the final result should be a single utterance
MikeB: could use empty utterances to achieve this, send one full utterance and then an empty utterance
Bjorn: that wouldn't work if we were to segment into more pieces, e.g. 3 utterances instead of 2.
MikeB: having more options than a simple replace is perhaps necessary
DanB: the current replace api works fine for the common case, could add insert later if required.
Bjorn: one wrinkle: if you change parameters such as grammar in the middle, audio since the last result needs to be buffered and recognized again.
Milan: there is no feedback mechanism/method in this IDL
Bjorn: Yes it is missing from this IDL and needs to be added, perhaps to the Result object
DanB: perhaps this is something which should be done using vendor specific extensions and test
Bjorn: after this discussion, do we want the intermediate and replace events?
Robert: no
DanB: useful for demos but perhaps not required in real use cases
MikeB: ok to not have intermediates if we have replace
Bjorn: ok to have replace if the web app doesn't have to always handle them. e.g. services sending empty results and then replacing them
Robert: if we have partial/intermediate results then we don't need replace.
Satish: replace makes state management heavier
Robert: if we are looking to cut something at a later stage, replace is perhaps a candidate.
<agreed to keep all 3 events>
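A sketch of the three events just agreed to keep; the handler names, the id-indexed results array, and the replace mechanism follow the discussion but are not a final IDL, and render/showInterim stand in for app code.

    // Hypothetical event shapes reflecting the discussion above.
    var results = [];
    req.onresult = function (e) {          // a new final result, indexed by id
        results[e.id] = e.result;
        render(results);
    };
    req.onintermediate = function (e) {    // interim feedback; apps may ignore it
        showInterim(e.result);
    };
    req.onreplace = function (e) {         // service revises an earlier final result by id
        results[e.id] = e.result;
        render(results);
    };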
DanB: i support it
Bjorn: seems like simple to support
DanB: there is a difference between simulated reco and semantic analysis of text
Robert: some use cases such as video games have all words made up, languages like japanese have a lot of imported words etc.
Bjorn: why is this a problem for testing/analysis of text?
MikeB: could have a separate call to pass the text or just pass it through the input stream to the service
Robert: this affects the api
Bjorn: we should be concerned about things which are affecting the web app api
MikeB: if there is a web app specific text such as product name which should be pronounced in a particular fashion and the grammar is designed for that, the web app should take the typed text and convert it to how it should be pronounced in that domain before giving to the speech api.
MikeJ: focus on the simple use cases which cater for most web apps and punt on the pronunciation issues
DanD: user has low expectations about reco if they were typing it in the first place
MikeB: don't agree that adding support for pronunciation in this case doesn't complicate the api
Bjorn: can we specify a different grammar for simulated reco with text?
Robert: we don't want to have different grammars for speech and text
Bjorn: most of the use cases
heard so far are for test tools, better not addressed via our
api
... it is up to the recognizer to decide how to take audio and
text input and return results, let the web api not provide
anything specific for it.
General agreement that there would be a method to give text input to the request object and a parameter to specify the type of text matching - fuzzy or exact
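A sketch of that agreement: a method on the request object that feeds typed text instead of audio, plus a parameter selecting fuzzy or exact matching. Both names are placeholders.

    // Hypothetical names; only the capability itself was agreed.
    req.setParameter("text-match", "fuzzy");          // or "exact"
    req.inputText("two large pepperoni pizzas");      // simulated recognition from text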
We had a short discussion about protocol and decided to look deeper into the latest draft of WebSockets. We'll revisit the discussion after reviewing the draft, possibly over email or with a smaller group.
Bjorn: what are the use cases?
Milan: change language and re-recognize because past recognition returned no results
MikeB: change grammar and re-recognize
Bjorn: can't we use parallel recognitions?
Robert: wouldn't work if we want to change grammar based on previous recognition, e.g. if London was identified then re-recognize with London specific grammar to recognize the local address.
Bjorn: isn't this just a fancy language model?
MikeB: no, this is about keeping the audio around and doing better recognition
DanB: very similar to text input, where instead of text input you send audio
... could implement both recognize-from-file and recognize-from-past-session through this api
Bjorn: if we have a rerecognize() method in the SpeechInputRequest object and if that re-recognizes the previously recorded audio, that would cover the common case.
... web app can change grammar and other parameters before re-recognizing
Robert: requires state, better to have web app specify earlier that it may re-recognize with a setParameter call so that most common cases don't end up storing state and audio which doesn't get used
Olli: if the api is based on Streams where the recorded audio can be obtained as a stream, don't need to set this property in advance
Robert: if there's a network failure, could even try again with the local recognizer with this method.
DanB: The client must store the audio in this case if the web app asked for it but as an optimisation it may not retransmit the audio if the server had indicated that it has a copy of the audio.
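A sketch of the re-recognition flow as discussed: the app opts in to audio retention up front (Robert's point), then swaps the grammar and calls rerecognize() on the stored audio after the first result. All names, including the parameter and grammar URIs, are illustrative.

    // Hypothetical sketch of rerecognize(); names are not agreed API.
    req.setParameter("retain-audio", true);            // opt in so the UA keeps the audio
    req.grammars = [{ src: "builtin:city" }];
    var rerun = false;
    req.onresult = function (e) {
        if (!rerun && e.result.utterance === "London") {
            rerun = true;
            req.grammars = [{ src: "https://app.example.com/london-streets.grxml" }];
            req.rerecognize();                         // re-run over the same stored audio
        } else {
            showAddress(e.result.utterance);           // stand-in for app code
        }
    };
    req.start();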
MikeB: we should also think about the case where stored audio e.g. voice mail is played out and re-recognized
Bjorn: we could wait for a microphone capture and audio processing api before supporting this
DanB: if the client should already store audio locally when requested, why is it difficult to support this scenario?
Bjorn: because there is no standard api to handle and process such stored audio today and we should wait until the audio WG's api proposal gets adopted
DanB: I'd like to think about this use case and design the api with it in mind
<Raj> Are there any privacy issues that need to be considered with providing access to VM etc. and submitting it to a web app?
Bjorn: is re-recognition important enough to think in detail today and not wait until an audio api is available from the audio WG?
Robert: yes, we do a lot of re-recognition
DanB: it is not the highest priority item, but it is one of the most requested features in VoiceXML
Bjorn: I'd like us to do less than VoiceXML, not more
DanB: Let's leave it as discussed above, with the rerecognize() method in the request object.
Bjorn: could potentially pass in stored data from local files to the rerecognize() method.
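A rough sketch of how the rerecognize() behaviour discussed above might be exercised; the "grammar" parameter name, the optional audio argument and the stub implementation are assumptions, not agreed design:

    // Sketch only: rerecognize() on the request object was agreed in principle;
    // the parameter names, the example URLs and this stub are illustrative.
    class SpeechInputRequest {
      private params = new Map<string, string>();
      private lastAudio: Blob | null = null;

      setParameter(name: string, value: string): void {
        this.params.set(name, value);
      }

      start(): void {
        // A real UA would capture microphone audio here and keep a copy if
        // re-recognition was requested; this stub just fakes it.
        this.lastAudio = new Blob(["fake-audio"]);
        console.log(`recognize with grammar ${this.params.get("grammar")}`);
      }

      // Re-run recognition over the previously captured audio, or over audio
      // supplied explicitly (e.g. stored voicemail), using current parameters.
      rerecognize(audio?: Blob): void {
        const source = audio ?? this.lastAudio;
        console.log(`re-recognize ${source?.size ?? 0} bytes with grammar ${this.params.get("grammar")}`);
      }
    }

    const request = new SpeechInputRequest();
    request.setParameter("grammar", "https://example.org/grammars/cities.grxml");
    request.start();
    // ...after "London" comes back, narrow the grammar and reuse the audio...
    request.setParameter("grammar", "https://example.org/grammars/london-streets.grxml");
    request.rerecognize();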
Robert: should we also think about URIs and not just file/blob data?
Bjorn: raises concerns if the URI points to data on the intranet and the web app is outside the intranet
... should the client download or the service download?
... won't work if the download of data requires a cookie and such data can't be sent to the remote recognizer.
Robert: Any URI specified must be accessible by the service without cookies and other related stuff
DanB: don't see this working. In MRCP the cookies were forwarded to the service, e.g. rule refs in the grammar would break without that since the service can't process the full grammar.
Robert: it should probably be sufficient if we say the service can access the data
Bjorn: perhaps a unique URL with a key should be given to the service which can let it get the resource
DanD: sees two types of cases - one where the grammar is only accessible to the service, the other requires user authentication
Robert: a third case is when the grammar has embedded refs to user data such as contacts
<Raj> Yes, here again, on mobile devices user action is required for providing access to such client data
<Raj> besides technical certifications for access to privileged info
DanB: unlike the web api, this is perhaps a candidate for a smaller group to go and figure out how the authentication would work
MikeB: also need to decide on how the dynamism works
Robert: example is a huge SRGS which has rule refs to the user's contact list, department names etc.
Bjorn: how is this not authentication?
Robert: could be that you trust the service and given permission already
DanD: wouldn't you have to compile for the user anyway?
MikeB: you can have the large grammar in a precompiled state and have the smaller grammars compiled at runtime and referenced.
Bjorn: would it be sufficient to pass a substitution map?
<Raj> But the user agent cannot automatically provide a web app access to contacts etc. on the device without some privileged access to such a Device API, often involving explicit user consent
DanB: this adds a requirement to the speech services
Robert: not that complicated, every engine has a resource loader and that needs to support this substitution map
... have to come up with some sort of virtual rule ref mapping to a token
DanB: does the web app have to parse the grammar?
Bjorn: no, the web app only needs to know the placeholders and values for them
DanB: may have pieces of grammar authored by different people and hosted in different locations, and the person using the grammar may not know enough about all the levels to replace
MikeB: either standardise the URIs for how to replace, or the web app need not provide all the values; there could be defaults.
Robert: simplest way to do it is to take a rule ref and add a segment 'virtual=PLACEHOLDER'
Bjorn: better to add a new URI scheme that can be used in rule refs and grammars
DanB: we should also think about feeding this into IETF as a draft and registering it
... and if this is so generic, why doesn't it exist already?
Bjorn: because there are not a lot of other contexts where there are replacements
MikeB: will send a proposal
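Purely as illustration of the substitution-map idea above; neither Robert's 'virtual=PLACEHOLDER' segment nor Bjorn's new URI scheme has been chosen, so the "virtual:" scheme, the placeholder names and the map below are all hypothetical:

    // Sketch only. A large, precompiled SRGS grammar might contain rule refs
    // such as:
    //   <ruleref uri="virtual:CONTACTS"/>
    //   <ruleref uri="virtual:DEPARTMENTS"/>
    // The web app never parses the grammar; it only supplies values for the
    // placeholders it knows about, and the service's resource loader resolves
    // them (possibly falling back to defaults).
    type SubstitutionMap = Record<string, string>;

    const substitutions: SubstitutionMap = {
      // Placeholder -> URI of the user-specific sub-grammar the service
      // should load in its place (example URIs, not real resources).
      CONTACTS: "https://example.org/user123/contacts.grxml",
      DEPARTMENTS: "https://example.org/org/departments.grxml",
    };

    // Hypothetical hand-off to the speech request, e.g. as a JSON parameter.
    console.log("substitution map:", JSON.stringify(substitutions));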
MikeJ: we are talking about getting people's contacts; should we talk about the security and privacy concerns due to the composition of these?
Bjorn: that's not speech specific; if they can access your contacts they can send them to random services already.
DanB: (presenting on board)
To complete:
- remaining important topics
- protocol between UA and Service
- Web API
- Markup API
DanB: Laying out calendar
... Have until 25 Aug (end of XG)
<Raj> Co-ordination with any Device API activities as well,
<Raj> as access to the microphone and a lot of other device attributes depends on such access
DanB: Forthcoming weeks' impact on telecons: 26 May no meeting, 23 Jun VBWG, 21 Jul IETF 81, 11 Aug STK NY
... about 10 telecons to go
... Not enough time for Protocol / Web / Markup
... Need to complete discussion and design agreements
Bjorn: Remaining topics?
DanB: As listed in Dan's topic list
DaveB: Hoping for draft proposals of the Web API and markup
... Is it still possible to get these out?
MichaelB: Need to address the protocol also to extent it impacts the API
Satish: 2 Level API?
MichaelB: No,
DanD: Needs to be interoperable
Satish: Do default case?
RobertB: Already had this conversation
Bjorn: Finish web API, start protocol
MichaelB: Close to proposal
RobertB: Lets see if there is a way to finish these
DanB: Reach point where group agrees on level of specification
Bjorn: Want to get to the point that experimental implementations can be completed
DanB: Concern about press release of chrome as implementing this standard
DaveB: Marketing snafu
MichaelB: More work needed on protocol as it will influence the web api and markup
Bjorn: Want to experiment to see if the protocol works
Robert: Speech service providers need to start on implementing backend
DanD: Web api and binding will impact the protocol, natural order
DaveB: Challenge is web API
... Protocol as subset of MRCP
<Raj> Yes, a subset of MRCP is the fastest way to REC
<Raj> otherwise it will become chicken & egg, with both parties (UA & SS) waiting on each other to finalize
MichaelB: Need to deal with the protocol in order to deal with web API
RobertB: Suggest breakout to groups
DanB: 4 Aug to decide what we do after this XG
MichaelB: Options: end, 1-year extension, or recharter as a working group
DanB: Recommendation from W3C is that it would be better to be in its own working group
... HTML is too big
... better to have groups on separate topics
DaveB: Agree,
MichaelB: Get proposal
Olli: HTML not the best match; WebApps, more comments
... Seems like we need a separate working group
<Raj> perhaps aligning with MMI WG can also be considered
DanB: Need to work this out by Aug 11
Milan: Could have additional calls
DanB: need to keep the group public, take minutes etc
Milan: Minutes?
MichaelB: Can be summary
Robert: Need to assign leaders to each group
MichaelB: Report document is in place
DanB: Fill in sections,
MichaelB: Editor for each section
DaveB: interdependencies between javascript api and html binding, need to be one group
Satish: Need to be single group
Olli: concerns about separating markup binding and web api
Bjorn: Want html markup binding
MichaelB: Report could list issues
Robert: markup binding is remaining topic
Milan: Wants to be in protocol team
MarkS: TTS not a focus; if split out this way, won't come to it
Bjorn: There is agreement on TTS between the draft proposals, TTS is separate
MichaelB: Subsections on asr and tts, not separate
<Raj> I can volunteer for UA to device access & protocol sections work..
DanB: show of hands for participation: 6 protocol, 7 JS API, 4 markup
DanD: other topics not listed, e.g. security and privacy considerations
... end-to-end architecture
<Raj> Dan, hope you included my remote show of hands for protocol
Milan: Have the important discussion of the markup binding before making it a group?
Robert: Plan to come back by 9th of June to indicate how to handle
DanB: straw poll on choosing one
... Planning dates: 2 Jun markup binding, etc.
... 9 Jun plan for both, 16 Jun topics, 30 Jun Web API, 7 Jul protocol
DanB: Robert as protocol lead
Bjorn: would like to lead the web API, but heading off on paternity leave
DanB: MichaelB agrees to lead the web API; Bjorn supports
... Delay decision on the markup binding group
Robert: Wants people to put time into doing the work
<Raj> Protocol work needs the speech engine companies to work closely (so Loquendo and Nuance), so that we can start doing the work as Robert suggested
DanB: Requested slot at tech plenary for fall
... for web API: MichaelJ, MichaelB, Bjorn ...
DanB: for protocol: Robert, Milan, Michael, ...
Bjorn: Issue of request for comments from HTML5
DanB: Left off because of time limits
MichaelB: Specific request
DanB: Nothing for us to say because of timing
MichaelB: If there are things we know are an issue, we could say so now
DanB: Send official response if there are things we run into in the subgroups that relate to html5
<paolo_> Timeline:
<paolo_> 26 May - No meeting
<paolo_> 2 Jun - markup binding
<paolo_> 9 Jun - plan for both
<paolo_> 16 Jun - topics
<paolo_> 23 Jun - No VBWG f2f
<paolo_> 30 Jun - WebAPI
<paolo_> 7 Jul - protocol
<paolo_> 14 Jul -
<paolo_> 21 Jul - (IETF 81)
<paolo_> 28 Jul -
<paolo_> 4 Aug -
<paolo_> 11 Aug - (STK NY)
<paolo_> 18 Aug -
<paolo_> 25 Aug -
<paolo_> End of XG
MichaelJ: Is the goal to determine what we can accomplish by the end of the time, or what meets the requirements?
DanB: Date, features, time
... Plan to have meetings and discuss topics
Milan: additional topic for discussion: preloading resources
<Raj> Isn't preloading resources part of the messaging between WebApp and SpeechService?
DanD: Any expectation of an interoperable example?
MichaelB: More at level of implementation reports
Robert: Will go and develop proposals
MichaelB: Won't have interoperability by end of XG
<Raj> Agree with Robert on developing interop and implementation proposals
MichaelB: Useful things: do default actions
... Several benefits to connecting to markup; most things could be done with JavaScript directly, but useful to support common interaction patterns with HTML elements
Olli: How to bind to all the different input element types
Satish: Lots of comments on this
Robert: Text is easier, less so for how to use checkboxes, drop-downs etc.
... input patterns are also an issue; patterns and types are not necessarily compatible
... Can we reliably generate grammars?
Bjorn: Security is the only reason; patterns etc. just so you can click and start speech input
DaveB: Action-based permission
... Kickback about info bubbles; more and more APIs, more and more info bars
... Does not scale
Robert: Up to UA to decide how to enforce security, may involve popping up
Bjorn: translate on google.com, want to just click and then gather speech
Robert: click on it, onclick, browser policy checks whether this is permitted
Bjorn: Clickjacking is a problem
DaveB: UI burned in, so will have some indication
Bjorn: Need action and UI in order to avoid the system starting to listen when you are away from the computer
Satish: Want to avoid confirming the confirmation
Robert: web page designer wants to change the UI
MichaelB: But if allowing the javascript could get past this anyway
Bjorn: have additional security on the JavaScript API
... even if we have an info bar for the JS API, avoid it for the markup case
Olli: how does the button generated by the HTML markup matter?
... can call click() on an input element; special case for input type=file
DaveB: You can't
... not in WebKit
Olli: Mozilla has for backwards compatibility
Robert: still haven't uploaded a file, not a security problem
DanD: Another argument for markup is to be less dependent on the browser chrome
... add this to the discussion of markup?
Bjorn: clicking on file upload didn't work in Chrome
Robert: why not design a better policy experience around this
Bjorn: has not been possible
... different browsers could handle this differently
DaveB: Automatic filling text box is interesting
Robert: scripting API only works in installable extensions
... does not sound good
Bjorn: need the Chrome team to be happy with the security model or we can't add it
DaveB: straw poll, i.e. Mozilla: how to deal with this, OK with pop-ups, infobars?
All: bizarre discussion about cats and mice
<silence>
Robert: general warning when go to page
Bjorn: concern is we don't want this for everyone who goes to translate.google.com
... want to go to the page and be able to use speech without a pop-up dialog
... Chrome designers unhappy with info bars
DanB: same folks have security concerns
Olli: what about geolocation input element
Bjorn: geolocation has a pop up
Robert: want to see the alternative, better proposal
... if you go to a site once and have to permit speech
Bjorn: want a way to never trigger infobar
Olli: ... why do you need the button?
Bjorn: to make sure it is the user
Olli: trusted events
... Button can be hidden, set size or opacity
Bjorn: click, now recording speech thing comes up
Olli: agree notification pop up is bad
Bjorn: at least speech requires action compared to geo
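A small sketch of the "require a genuine user action" idea discussed here; Event.isTrusted is an existing DOM property (false for script-generated events such as element.click()), while startSpeechInput() and the button id are hypothetical placeholders:

    // Hypothetical placeholder for whatever the eventual speech API exposes.
    function startSpeechInput(): void {
      console.log("recording started (placeholder)");
    }

    const button = document.querySelector<HTMLButtonElement>("#speech-button");
    button?.addEventListener("click", (event) => {
      // Ignore synthetic clicks so a page cannot start listening on its own;
      // the UA may additionally show its own recording indicator.
      if (!event.isTrusted) {
        return;
      }
      startSpeechInput();
    });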
Robert: more of a policy working group issue
DaveB: should have design that facilitates multiple approaches
MichaelB: more interested in grammar derivations from elements etc., less convinced by security
... concerned about interoperability; JS will only work in a sandbox
DaveB: more about annoyance in UI, scalability
Bjorn: want a way to not trigger the info bar
... fine if some browsers do security differently
DaveB: impressive to very easily speech enable a web page
MichaelB: How hard is it to define the markup binding?
Bjorn: tried different approaches, e.g. speech attribute, input type=speech etc
Satish: this is a new kind of interface, divided responses, limits adding to text areas
Robert: need an input type=speech
... how about textareas?
Bjorn: a speech element that is just the button, for speech input
MichaelB: can't send it a JavaScript click event
Satish: if js api, could bind to other types
MichaelB: use a 'for' attribute
Robert: like label's 'for'
Bjorn: could also have a drawing area on a touchscreen
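Purely illustrative, since none of these shapes is agreed: the markup variants mentioned above might look like the following (all names such as type="speech", the boolean speech attribute and <speechinput for=...> are hypothetical), shown as strings to keep a single example language:

    // Hypothetical markup shapes from the discussion, not agreed design.
    const variants: Record<string, string> = {
      // A new input type, by analogy with type="file".
      inputType: `<input type="speech" name="q">`,
      // A boolean attribute on an existing text field.
      attribute: `<input type="text" name="q" speech>`,
      // A separate element that is "just the button", bound to another
      // control via a for attribute, like <label for=...>.
      element: `<input type="text" id="q"><speechinput for="q"></speechinput>`,
    };

    for (const [name, markup] of Object.entries(variants)) {
      console.log(`${name}: ${markup}`);
    }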
DanB: Where are we on markup binding?
Bjorn: speech input element
Olli: concerned about click jacking
Bjorn: have different policy
Olli: don't want things different
Robert: if we are not avoiding putting the rest in a sandbox, not as concerned
DaveB: Strong use case for starting speech when first go to page
Bjorn: open app
Robert: don't want to have to push a button for each speech input
Debbie: app with several pages, enable speech; don't want to have buttons on different pages
Bjorn: assuming will apply
DaveB: word spotting
MichaelB: often want to control the button
... new element that provides the onclick, maybe for linkage, maybe not
... Olli skeptical of that also
Bjorn: Gmail: have it ask you whether you want it to read, say yes, then have it read, no button click
Robert: video games in browsers, speech as side channel, a use case
DanD: accessibility issues
DanB: may not need full call, but need to check on sandboxing issues
Bjorn: don't have agreement on having an element
DaveB: want to have a low-key way to add speech
Robert: roll into the web API discussion
... why do we need an element for TTS?
DaveB: elegance of specialization of html media element
Robert: some commonalities, but also parts that are irrelevant
... semantics of different input formats, how to apply the media element to SSML
Bjorn: if similar to audio, it should behave like audio
Robert: Copied bits from media element that apply
DaveB: No superclass to base it on, javascript API
MarkS: fine with just having js api
Bjorn: controls
Robert: but why is it needed (shuttle controls)?
MichaelB: play now, or render,
MarkS: pause is useful but could do button for that
Robert: in the existing MS Translate example
All: agreement on not having a tts element