(No specific discussion during this session)
Question from JL: will the concerns about privacy cause public backlash that could negatively impact the industry?
Valene: public perception is good. How the biometric is used and handled is what is most relevant.
JC: the problem is that police or government often have a different intention for the technology than the scientists do, and that leads to battles.
JM: this kind of battle is already beginning. Parents are concerned about how the technology is being used to authenticate their children in schools.
General question: what can the speech industry do?
JC: need to include the intelligence/law enforcement community in the discussion. The biggest issue is theft of a biometric ID.
There was brief discussion of a court case in Michigan in which forensic investigators claimed that they could tell from small amounts of speech precisely who the individuals were; it was thrown out of consideration, but this could be non-binding on future cases.
Should SIV capabilities in VoiceXML be designed to also support non-IVR applications? In discussing this question, the group asked whether SIV belongs in VoiceXML 3 at all. Is it an IVR technology?
RS: "Whatever we do in VoiceXML 3 must be compatible with what's done elsewhere." We consider this to be in the security camp and not IVR. We will likely use it over the web.
DB: The key is determining which pieces are "ripe for standardization", and whether there are enough pieces that it is worth the effort.
KA: there is a Security activity within W3C that we should be working with. Judith says we've already contacted them.
We got into a discussion when we reached Ken's security slides. Some key points: we need to enable any developer to use biometrics, without setting policy for how they do it. And yes, they may do it badly.
RS: we have completely separated security components from everything else. Key points:
- the browser does the collection
- the comparison with the security token: there is modest agreement that the process that does this is separate from the browser
- the decision incorporating the result of one or more comparisons: there are many places where this decision can occur
During the break we then had a discussion where Bruce suggested a different way of thinking about this. He was volunteered to write up this model for us to shoot at later today :)
(Minutes TBD)
* Listed uses of identification: customizations (seat height, radio presets), right to drive, etc.
* Ways to identify:
- Driver-only settings
- Keyfob (identifies the owner of the fob, but not the driver). Isn't useful for multiple drivers of the same vehicle
- Identifying the user by the cellphone in their pocket
- Biometrics (could identify the driver and the passengers). Ideally it would be nice if the car identified you by your weight when you sit in it. You could ask users to say a phrase, or swipe their finger over a card reader
* There is a requirement for multiple modalities
* Might not be required for customization. Definitely required for ability-to-drive/lockout decisions
- Connected vs. onboard biometric engines both need to be considered. Network-only isn't good, but they think there are advantages to network, for example if a single user has multiple cars
* There are synchronization issues to consider, since the person could use speech or touch to change the radio station
* The remainder of the discussion focused primarily on ways to implement these multi-modal systems and did not touch on biometrics
* The basic ask from GM was that any standard that is adopted consider multi-modal interaction
Context: IVR centred. Core issue: letting data interact.
Today's call center: the lowest-cost processing is touch tone and ASR. Second chance - ID&V - is the next-lowest-cost processing. Callers that cannot be handled here are handled entirely by an operator.
Layers of processing:
- Bottom layer: voice biometric response and ASR response. This is the technology layer, which is uncertain. The result is a high raw error rate, e.g. false matches.
- Real-world layer: being able to combine information. We know the problems that can occur and know, from a security perspective, the vulnerabilities. We also have the context and suspicion rules. E.g., when users speak their own date of birth they tend not to have disfluencies, whereas someone reading a date of birth is more likely to. These are just probabilities and are therefore uncertain.
- Business layer: this is the level that the designer really wants to be at. Not concerned about being right or wrong, rather about behaving correctly. The interest is not EER, but how penalised will I be if I behave incorrectly.
The interception model. Assume I have no concern for self service, but I have to track the person on the way to the agent. This means that you do not have to deal with security in the IVR. You can just take the failures (a small but expensive population, probably with lots of fraud users in it), and process these before they present to the agent. The other (majority of) users are not handled through this process. The machine asks questions and passes answers to the agent. The agent, hearing those customers speak, can make a judgement. The next time the caller calls, they get the 'sock' and 'a sock' and 'a sock' treatment. The next stage is to move up the line to the next most expensive group, and so on, until you get to the full screen-pop callers, say over a period of 9 to 12 months. The cost saving is the reduction in call handling time (not in security). The increase in security (and privacy) is a side effect.
This model trades off calendar time for bootstrapping capacity for the (large) user population.
- The interception model: the call goes into the IVR, then the sentry selects some of those calls, then enrolment is bootstrapped as the call goes to the agent.
- The front-door model: the sentry is (logically) integrated with the IVR. You can then give custom menus to the caller. This is about customising the experience, and thereby streamlining the process. In this model, the sentry is a voice biometrics box. A voice biometrics engine is really a 'nearness' engine.
- The eavesdropping model: the sentry sits with the agent (rather than the IVR). This model has many issues, but also has merit. This process elicits confidence. In addition to the automated authentication, the agent gets to hear the user and their issues. In this model, the sentry may, but probably will not, issue the challenges. That is more likely to be done by the agent. This model readily allows step-up. Note that details about interaction are glossed over in this model explanation. The idea is to stop the human agent from hearing the security credentials.
All of the models involve a lot of background work in terms of collecting data for enrolment (and authentication). Agreements or warnings would be spoken: many of the explanations are not legally binding, but are knee-jerk reactions on the part of the agent. However, for the biometric model there are not yet many court-case tests of what you need to inform the caller of. In the 'ROI grows over time' picture, to start with there are no biometrics involved, but the minute they are, the caller needs to be informed that recordings are being made. Once you have enough data, you have to work out whether the data could be used for enrolment. Controls will be influenced by the privacy laws of the countries involved.
Bag of issues: do we need guidelines?
Service-oriented dialogue box and security-oriented dialogue box models: by uncoupling them, you get a better approach, leading to the development of a better migration strategy. Often fraud is not a cost-effective argument for increases in the self-service authentication level. We need to bear in mind that there may be already-existing things that can be leveraged to help us out. Reverse tagging of data becomes important: there may be events in the future that redefine events in the past.
We must be able to invoke SIV functions from a VoiceXML dialogue. But there are functions that may be required of SIV outside of VoiceXML. VoiceXML is about the voice user interface.
Streaming: support for analysis part way through the delivery of a speech stream. There are issues with speaking too soon, etc., and knowing where you are relative to the mark. It is specified in VXML 2. You can use this for changing thresholds, for example. There are issues with double barge-in: "I wan" - barge-in stops the speech - then the user says "I want to ..." - the recogniser has serious issues with this (but people don't). The suggestion is that there is streaming ASR and streaming SIV in parallel, potentially. (Different to the end-pointing environment.) If you are using the same engine for ASR and SV, then they can leverage off each other.
In the 2.x model right now, there is a wait for the results to come back from both systems. Should there be a wait in VXML V3? E.g., if we know that the caller is the wrong gender from one SV engine, why bother to wait for the second engine to come back? Options: 1. First one wins 2. Bayesian 3. Dempster-Shafer (theory of evidence). This decision needs to be pulled out of the engines, due to the complexity. There are different kinds of decisions that need to be made, e.g. what person, what word, etc. In ASR, there is some decision making going on at the device level, e.g. match or nothing heard or poor audio quality, which leads to actions taking place. This would translate to SIV.
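The fusion options listed in the minutes (first one wins, Bayesian, Dempster-Shafer) can be sketched for the two simplest cases. This is an illustrative sketch only: the function names, the prior, and the use of likelihood ratios as engine outputs are assumptions, not anything the discussion specified.

```python
# Sketch of two of the fusion options for combining results from multiple
# SIV engines. All names and inputs here are illustrative assumptions.

def first_one_wins(decisions: list[str]) -> str:
    """Option 1 from the minutes: take the first engine that decides."""
    for d in decisions:
        if d in ("accept", "reject"):
            return d
    return "undecided"

def bayes_fuse(prior: float, likelihood_ratios: list[float]) -> float:
    """Option 2: combine per-engine likelihood ratios
    (P(score|target) / P(score|impostor)) with a prior probability
    that the caller is the claimed speaker, in the odds domain."""
    odds = prior / (1.0 - prior)
    for lr in likelihood_ratios:
        odds *= lr
    return odds / (1.0 + odds)
```

First-one-wins is what lets the application stop early (e.g. the wrong-gender example above), while the Bayesian combination is what you would use when you wait for all engines.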
For speech recognition processing, we have moved to keeping the decision fuzzy much longer (than was once the case). We need to allow that sort of movement in SIV. Larry's perspective is to leave the decision making to the engine, not defer it to a later stage.
How do these different entities send info to one another? What should be standardized: audio formats, recognition results, and the integration with engines.
Goals:
- basic list of formats
- open source
- ease of adoption
- stability and quality
- need to have a small number of formats (Jim - no more than 3)
Sampling process: periodic and multi-rate (really means variable bit rate coding)
Coding scenarios: lossless, compressed, multi-rate
o not WAV
o LPCM
o G.711 alaw, ulaw
o not ADPCM - too many variations
o not MP3 - not open
o Ogg Vorbis
o Ogg Media Stream
Jim: now we have more than 3. HB: this list is really only 3. Ogg Vorbis is very easy to implement.
John: Google is using the Speex speech codec for their speech search. Ogg can stream many types of media/formats.
What is not covered: interchange requirements. Missing features. Compelling reasons to
Q/A
----
JP: Customers say "this is the audio that I am going to give you" [in our own particular format]
HB: You can convert other formats to the standard - the standard is for interchange and storage
Dan: There are many new types of codecs coming online all the time
HB: This is the audio standard for SIV engines
Ken: who is responsible for conversions - the server?
HB: If this is a standard, then the audio that is sent to the SV engine will conform to the standard
John: There are good motivations for new codecs. All compression codecs are non-linear.
Transcoding of non-linear codecs is crazy.
HB: The users need to convert to the standard
HB: The engine does not really have different front ends for all the different types of audio formats
HB: Do the conversion at the proper place - at/on the client is the best place
Dan: engines use third-party libraries to convert to their internal standards
Judith: Government agencies that we are dealing with are prompting/requiring standards for audio codecs - if there is not an audio standard, then the DEFF could not become a standard for data exchange
Jim: Is this really necessary?
Ken: I have pushed this rope [standards for audio formats] for many years
Ken: There is now wideband audio for many types of phones - Skype. Requirements for new codecs are all over the map
Val: There are security issues with file formats
Jim: what are our options here? (1) Try to please everyone - snort [not] (2) Have a small # of things [in the standard] that are useful to a large number of users
Jim to Judith: should we wait for the standards groups to render a decision?
Judith: the groups have already accepted the [our] format standard
Jim: should we encourage the stds groups to decide?
Dan: "We should go to the VBWG and tell them about this (proposed) standard and ask whether the VBWG wants to use this standard as either the complete set or minimum set required for one or more of its specifications."
Ken: I want convergence - I want it to be correct - our original work was not specific enough to be useful. I buy into what Dan is saying. I personally do not think that we should have the other groups set the standards for us. I am discouraged by the prospect of setting a narrow set of standards.
My experience is that there is not going to be a standard.
Dan: there is a reason that the IETF came up with media type registration - allow the market to decide what formats are going to survive
Dan: if you [the customer] are worried about transcoding - then you should use the best internal format
HB: How many SV errors do you get from transcoding due to incompatibility?
HB: If we have a standard, then you, the customer, if you want the best performance, send the audio to the SV server in one of the supported (standard) formats. It will give you everything that you need to support audio transfer.
Jim: Let's go in strong and set a minimal set for the audio formats
Ken: Headers may not be compatible. JP: Too many versions of RIFF headers
Ken: There are too many and too broad of an interpretation/implementation of RIFF headers
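As a concrete illustration of why G.711 mu-law is attractive as a baseline interchange format: expanding it to linear PCM is a few lines of integer arithmetic. This follows the standard ITU-T G.711 mu-law expansion; the function name is ours.

```python
def ulaw_to_linear(byte: int) -> int:
    """Expand one G.711 mu-law byte to a 16-bit linear PCM sample."""
    byte = ~byte & 0xFF            # mu-law bytes are stored complemented
    sign = byte & 0x80             # top bit is the sign
    exponent = (byte >> 4) & 0x07  # 3-bit segment number
    mantissa = byte & 0x0F         # 4-bit step within the segment
    sample = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -sample if sign else sample
```

Transcoding between two non-linear codecs (say A-law to mu-law) has to pass through this linear domain, which is exactly where the quantization mismatch that Dan and John warn about creeps in.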
Dan gave a historical view of MRCP:
- created by the IETF
- The protocol only carries control messages (no media)
- MRCP v1 was developed in 2001 by Cisco, Nuance, and SpeechWorks, but the IETF disagreed with some of the RTSP tunneling aspects of it
- It was broadly adopted though, so it was released as Informational RFC 4463
- It included only ASR/TTS as the engines
- MRCP v2 was done as a standards-track document in the SPEECHSC working group
o Has its own protocol, not tunneling
- MRCP v2 has the concept of allowing the ability to use data recorded in earlier sessions
- attempts to define protocols for controlling
o ASR
o TTS
o Speaker Verification
o Speaker Identification
Verification Resource in MRCP
- Has a concept of session
- Audio buffering, which can be enabled/disabled
- Simultaneous ASR/SIV can be used, or SIV alone
- Supports both Enrollment and Verification modes
- The result structure is the same as for ASR (NLSML result; EMMA is also available)
- Small-scale Speaker Identification is available via a "GROUP" voiceprint
Homayoon asked if SVAPI was considered
- SVAPI included identification and classification
- SVAPI is much more complicated than MRCP
- Dan indicated that a standard is only going to work if all the participants in its creation are comfortable that they can work within the framework
Security Model
- Audio is expected to be secured via a channel-specific mechanism, e.g. SRTP
- Voiceprints are handled as references that are meaningful only to the underlying engine
- Cookies can be used to pass credential information from client to server, for server use downstream (i.e. to get the voiceprint out of some type of secure database)
Dan showed an example of the messages between client and server in an example call.
- It included a sample result
- Homayoon asked if engines could include results not specifically defined in the standard. Dan said that if the namespace is done correctly this is OK
- JP asked how the verifier knows when to stop processing audio.
Dan said the expectation is that endpointing occurs in the engine. Dan described that the MRCP standards group is thinking of breaking out the standard so that there is a server which handles the control messages but that there will also be underlying protocols specific to each engine.
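To make the client/server exchange concrete: an MRCPv2 request is a text message whose start-line carries the total message length (which includes the length digits themselves). A minimal, hypothetical composer is sketched below; the header names used in the test are examples, not a statement of what the final spec requires.

```python
def mrcp_request(method: str, request_id: int, headers: dict) -> str:
    """Compose an MRCPv2-style request message. The message-length field
    counts the entire message, including its own digits, so it is found
    by fixed-point iteration."""
    hdr = "".join(f"{k}: {v}\r\n" for k, v in headers.items())
    length = 0
    while True:
        msg = f"MRCP/2.0 {length} {method} {request_id}\r\n{hdr}\r\n"
        if len(msg) == length:
            return msg
        length = len(msg)
```

A client would send such a `VERIFY` request over the control channel while the audio itself travels separately over RTP, which is the "control messages only, no media" split Dan described.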
[Notes by JP Shipherd]
Architecture discussion based on Bruce's updated picture
The team looked at a proposed architecture that puts the SIV engine "under" the browser and accessible via VXML (much like the ASR and TTS resources are). JP asked us to weigh the pros and cons of the SIV-via-VXML approach:
Pros:
* Allows for streaming authentication
* Allows for ASR/SIV in parallel
- lower caller-perceived latency
- potentially improved ASR/SIV accuracy
* Simpler credential collection (all done in VXML)
Cons:
* "Security concerns" -- need to ask the group to enumerate these, but a few might be:
- audio in transit could be purloined and used for break-in attempts
- the comparison has now moved to the IVR/call center environment, where other comparisons are happening in the application environment
* The standard will likely:
- include functionality that is not available from some vendors
- not include functionality that is available from some vendors
[Notes by JP Shipherd]
Ingmar - DT Labs
* DT Labs is both the technical arm of DT and affiliated with a university
* DT's VoiceIdent product conforms to the "Common Criteria at EAL2" (which I think is analogous to an ISO standard, but for Europe)
* SIV in VoiceXML needs to support:
- SIV only, SIV in parallel with ASR, SIV in the same engine as the ASR
- Event management issues are in play when two engines are running in parallel
- Text dependent, text independent, text prompted
- Enrollment, verification and identification
- Rollback of the last turn
- Query SIV results, catch SIV events (noinput, nomatch)
- Voiceprint management (query, delete, copy) should be outside the scope of VoiceXML
* Judith asked if the browser interface needs to know details such as whether text dependent or text independent is necessary.
* DT wants the ability to be vendor independent and wants to ensure that there is a decoupling between the VXML browser and the underlying resources.
* DT wants to be able to move voiceprints from storage via HTTP/HTTPS. They see that the SIV engine would make calls to this interface, but so would administrative functions. Homayoon says that this opens a security hole. Ingmar indicated that his big issue is that he doesn't want to rewrite administrative tools for each SIV vendor's data storage mechanism. He wants the application to only have to pass references to a voiceprint location and have the engine be able to fetch a voiceprint whether it is remote or local.
* He described a few use cases:
- SIV without ASR. In this model SIV must do endpointing to support barge-in, end of speech, etc. He also wants the vendor to support the concept of a session, so that multiple turns can be taken and cumulative scores can be provided to the app
- SIV with ASR. In this model two engines are running in parallel and they could provide different sets of endpointing events and even success/failure conditions. Whose job is it to "normalize" these events?
Dan said that this would only work if the MRCP server takes care of this.
- A third use case: ASR with SIV from buffered audio.
[Notes by Ken Rehor]
Ingmar Kliche, T-Systems Laboratories
What SIV functions should be supported in V3?
- SIV only
- SIV in parallel with ASR (separate resources)
- SIV integrated with ASR in one combined resource
SIV types:
- Text independent
- Text dependent
- Text prompted
Decision control:
- Either the SIV engine or the application may control decisions (regarding acceptance/rejection)
SIV core functionality:
- Enrollment
- Verification
- Identification
==> requires: save voiceprints (after enrollment) and load voiceprints (before verification/identification)
V3 should load/store voiceprints implicitly, without explicit markup
Further basic/core functions:
- adaptation
- buffering utterances for later use
- rollback/undo of the last turn
- query SIV results
- catch SIV events
- Query, copy, delete (and other administrative operations) of voiceprints are out of scope for V3
Architecture
What should the interface be between the SIV engine and the voice model database? What administrative functions should we provide?
- Support MRCPv2 for integration of SIV engines
- Extend MRCP vs. limited SIV functionalities
- Use EMMA for representation of SIV results
- Use web protocols for voiceprint transport
Use cases:
#1 Standalone SIV without ASR
- SIV needs to support endpointing (like ASR)
- SIV needs to support timeouts (like ASR)
- SIV should provide barge-in
- SIV may need multiple turns (within one SIV session)
#2 SIV + ASR, e.g. a single utterance of an account number used for both ASR and SIV; issues:
- SIV and ASR might independently return events such as noinput, so the application must deal with them
- rollback/undo must be supported
#3 ASR + SIV from buffer; requires a new ASR function -- the ability to accept an utterance from a buffer
#4 ASR + SIV from file
Sessions: need to define a session that comprises one or more dialog turns.
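The multi-turn "session" notion from use case #1 (cumulative scores across turns, with rollback/undo of the last turn) can be sketched as a small state holder. The thresholds and the averaging rule here are assumptions of the sketch, not anything Ingmar specified.

```python
# Sketch of a multi-turn SIV session: per-turn scores accumulate into one
# session-level decision. Thresholds and averaging are assumed, not specified.

class SIVSession:
    def __init__(self, accept_threshold: float = 0.8,
                 reject_threshold: float = 0.2):
        self.scores: list[float] = []
        self.accept_threshold = accept_threshold
        self.reject_threshold = reject_threshold

    def add_turn(self, score: float) -> None:
        """Record the verification score for one dialog turn."""
        self.scores.append(score)

    def rollback(self) -> None:
        """Undo the last turn, e.g. after a noinput/nomatch event."""
        if self.scores:
            self.scores.pop()

    def decision(self) -> str:
        """Cumulative session decision over all accepted turns."""
        if not self.scores:
            return "undecided"
        avg = sum(self.scores) / len(self.scores)
        if avg >= self.accept_threshold:
            return "accepted"
        if avg <= self.reject_threshold:
            return "rejected"
        return "undecided"
```

An "undecided" outcome is what would drive the application to take another turn within the same session.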
[Notes by JP Shipherd]
Chuck Johnson - Configuration Management Issues
* The results that come back differ from engine vendor to engine vendor.
- Normalized scores (different engines have different scales)
- Confidence scores (this is the confidence of how well the normalized score fits within the configured operating point)
- Raw scores are also sometimes returned
- It would be hard for all vendors to have the same meaning for a given score, but perhaps we could standardize on a scale (e.g. 1-100)
- It would also be nice to recommend a set of "error conditions"
- Should engines return a binary decision?
* Database management
- Database setup is clearly outside the scope of an application
- Applications may be precluded from being able to delete voiceprints. If this is the case, apps need to have some way of "cleaning up" partially enrolled voiceprints
* Enrollment
- Apps need to have control of enrollment and manage the flow of the dialog that creates voiceprints
- Engine vendors need to have a defined set of return/error codes for enrollment too
* Background models
- Out-of-the-box applications come with background models that might not be appropriate for the calling population
o Children and high-frequency women's voices are particularly difficult
- Should apps be able to specify background models on the fly?
o Engines would need to be able to load them on the fly
o Obviously vendors would need to expose the ability to create these
* Different classes of users
- Apps should be able to adapt thresholds based on knowledge about the user or the task they are doing. (Isn't this almost universally available today, since it is out of the purview of the engine?)
- Apps should be able to dynamically set supervised/unsupervised adaptation. JP asked if there is a real use case for unsupervised adaptation. Chuck said there definitely is in circumstances where the engine is making a yes/no decision.
- Judith said adaptation can also be used to "extend an enrollment" over the first couple of calls into the system.
* Voice model adaptation
- Clients should be able to roll back adaptation.
- JP raised the question of "how much rollback do you want", since single-utterance, multiple-utterance and multiple-session rollback all have different costs from a provisioning and data storage standpoint.
[Notes by Ken Rehor]
========================================================================
Chuck Johnson, iBiometrics, Inc.
Configuration and Management of SIV Engines
The VoiceXML Forum Speaker Biometrics Committee surveyed major SIV vendors regarding features and functions. All results were different - no consistency in interfaces or in values of data such as confidence scores.
Some possible standards:
- normalized scores
- consistent minimum set of error codes
- etc.
Need to work with engine vendors to explore which areas might be standardized.
Voice model database management
--> Applications shouldn't be permitted to create a voice model database
Enrollment - voice model creation
--> in scope for the application, assuming the database is set up properly and securely
Need a minimum set of return results, data, and error codes.
Distinct user populations: world models might be tailored for specific application scenarios
--> gender, age, regional differences
--> An application may need to dynamically load and update a specific world/background model
Different classes of users, e.g. corrections applications, finance applications - classes of users based on risk profile
Voice model adaptation
--> Adaptation is necessary: enable/disable, set thresholds, query adaptation outcome, roll back adaptation (from the last turn)
Rollback requires more storage -- is that a problem? One rollback is simple; across sessions is more complex: how far back should this be supported? Single-utterance rollback? Entire-session rollback?
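One way to realize the "standardize on a scale" idea from both sets of notes: each vendor publishes its raw-score range, and a thin shim maps raw scores onto the shared 1-100 scale. The range parameters and clamping behaviour are assumptions of this sketch; real engines would supply their own calibration.

```python
def normalize_score(raw: float, engine_min: float, engine_max: float) -> int:
    """Map a vendor-specific raw score onto a shared 1-100 scale.
    Scores outside the declared engine range are clamped."""
    if engine_max <= engine_min:
        raise ValueError("engine_max must exceed engine_min")
    clamped = min(max(raw, engine_min), engine_max)
    frac = (clamped - engine_min) / (engine_max - engine_min)
    return 1 + round(frac * 99)
```

This only standardizes the scale, not the meaning of a given value, which is exactly the harder problem the committee's survey identified.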
[Quick reminder that there is security terminology in the glossary]
Privacy, Security and Risk Management considerations
- Incorporate security and privacy in the development model: Confidentiality, Integrity, Availability; see the NIST security software development lifecycle standard
What are the 5 most important things needed to make an SIV application secure? SIV applications will be subject to many security breaches. The V3 environment is more complex (multimodal, open source, multiple networks, etc.) and thus subject to more vulnerabilities.
5 steps to take to make an SIV application secure:
1. Envision - identify threats/risks
2. Plan - profile, threat/risk modeling, generate requirements
3. Develop - control check
4. Release - handle threat/risk
5. Stabilize - learn and educate
---> Recommendations to the industry to foster secure SIV software design, development & deployment
ANSI M1 produced a document on threats. See the free report on the ANSI/INCITS website: Study Report on Biometrics in eAuthentication, http://www.incits.org/tc_home/m1htm/m1070185rev.pdf
How to keep my database of voice models secure?
- Utilize standard methods and best practices which are consistent with the organization's security framework
How to keep my voice models and other data secure when I transmit them to others?
- ditto
Should voice models be encrypted? What's the value of getting access to a voice model? Biometric information is categorized as personally identifiable information, and thus may require encryption. Depends on overall system requirements and deployment scenario.
Choices for specification of security in V3:
- specify nothing
- require specific encryption and security standards
Facilitate strong security but don't require it. How to ensure V3 doesn't preclude security standards?
See OMB M-04-04, E-Authentication Guidance for Federal Agencies, http://www.whitehouse.gov/omb/memoranda/fy04/m04-04.pdf
How can we structure V3 SIV support so that it can be governed by the security and privacy policies of an organization?
- ISO 19092 What are security and privacy regulations of which we must be cognizant? --> MANY and expect more!! Who should specify and/or recommend policies and regulations? Complementary activities: --> V3 language development in W3C VBWG, aligned with recommendations from 3rd parties --> VoiceXML Forum develops collection of recommendations, policies, standards to be used in V3 Need to make sure other related standards are considered, as well as entire application environment -- Security isn't only specific to SIV Additional references - Other security standards - in progress - ISO/IEC 24745, "Biometric Template Protection", WD3 - ISO/IEC 24761, "Authentication Context for Biometrics (ACBio)", FCD - ISO/IEC 19792, "Biometric Security Evaluation", CD3 - ITU-T Study Group 17 Question 8, "Telebiometrics System Mechanism (TSM)" - ITU-T X.1081, Telebiometric Multimodal Model Framework (TMMF), Q.8/17
Cathy Tilton
CBEFF (Common Biometric Exchange Formats Framework) - a set of metadata elements for exchanging biometric data, particularly over time
History of CBEFF: started in Feb. 1999 in a Biometric Consortium workshop -> NISTIR 6529 -> NISTIR 6529-A -> INCITS 398-2005 ...
Structure:
- metadata header (SBH)
- biometric data block (BDB)
- security block (SB) -- optional
CBEFF defines abstract data elements used to describe the data; registration of biometric data formats is via the IBIA; allows for new adaptations.
The mandatory header is a format owner and type (4 bytes). It allows you to recognize the format - could be vendor-specific or standard, raw, intermediate or processed. The header indicates standard or proprietary, published or unpublished. The latest version has a format ID for the security block.
Format owner (1st two bytes of the mandatory header):
INCITS M1 0x001b
ISO/IEC 0x0101
NIST 0x000f
The IBIA website has a place where vendors can register to become format owners.
The format type is the 2nd 2-byte field, e.g.:
0x201 minutiae (basic) INCITS...
0x0701-6 signatures (various)
......
The actual data goes in the BDB (data block) - see the CBEFF minutes for an example.
The specification covers mandatory and optional header elements, registration authority procedures, patron formats, and security block formats. A patron format is a secondary specification that implements CBEFF in a specific way; OASIS XCBF is based on an older version.
CBEFF is being used in: ICAO e-passports (logical data structure, LDS), PIV federal employee credentials, the Transportation Worker Identification Credential (TWIC), Registered Traveler (RT) cards, other standards, ...
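The 4-byte mandatory header (2-byte format owner followed by 2-byte format type) is easy to illustrate. The big-endian byte order below is an assumption of the sketch; check the actual CBEFF encoding rules before relying on it.

```python
import struct

# Format-owner IDs as listed in the talk.
INCITS_M1 = 0x001B
ISO_IEC = 0x0101
NIST = 0x000F

def pack_format_id(owner: int, fmt_type: int) -> bytes:
    """Pack the CBEFF mandatory header: 2-byte owner + 2-byte type."""
    return struct.pack(">HH", owner, fmt_type)

def unpack_format_id(blob: bytes) -> tuple:
    """Recover (owner, type) from the first 4 bytes of a record."""
    return struct.unpack(">HH", blob[:4])
```

A recipient checks the owner/type pair against the IBIA registry to decide whether (and how) it can parse the BDB that follows.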
Judith Markowitz
INCITS 456 Speaker Recognition Format for Raw Data Interchange (SIVR-1)
- Used for storage and exchange of speech or spoken data
- Concentrates only on the raw data block (BDB) of CBEFF - but it does not have to be part of CBEFF; it could be used with a different wrapper -- e.g., the FBI has a wrapper it uses around it
- Draft standard which passed the first public review and has received some comments
- 1st BDB for spoken data and the 1st one in XML
- It was developed jointly by ANSI/INCITS M1 and the VoiceXML Forum's Speaker Biometrics Committee
- In Australia they have been using XML in the security community for a decade
- Does not capture features or models like some other BDBs may -- this one uses only raw data
- The goal is to provide information that will enable the recipient to analyze the data: audio format, input device and channel, language/dialect. It does include speaker attributes (sex, age), but not the claim; it also does not say whether it is part of a verification or identification session
- Data sharing can benefit: watch list creation, internal system audits, automatic enrollment of users, multi-biometric fusion, product/algorithm testing, an SIV registry/service (BT was interested in using this in ISO for this purpose)
- ISO/SC37 is doing raw and feature data, in both binary and XML, and they are considering the same data format as INCITS 456
Two levels:
- Session header (one session per BDB): things that do not change during the session, such as start and end time, date of session, sex, device and channel, audio format, etc.
- Instance header (at least one instance should exist, but there can be more): information that changes from turn to turn of a dialogue - prompt used, utterance, quality rating - audioformatheader, audio, quality (score, algorithmvendorid, etc.)
- EMMA may be able to use the DEFF format; it could also use it as part of a derivation (further processing of one or more inputs).
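The two-level structure (one session header, one or more instance headers) can be sketched as XML. The element and attribute names below are invented for illustration only; the real INCITS 456 schema defines its own names and required fields.

```python
import xml.etree.ElementTree as ET

# Illustrative two-level raw-speech BDB: session-level metadata that does
# not change during the session, plus one element per dialogue turn.
# All element names here are hypothetical, not from the INCITS 456 schema.

def build_bdb(session: dict, instances: list) -> str:
    root = ET.Element("SpokenDataBDB")
    hdr = ET.SubElement(root, "SessionHeader")
    for key, value in session.items():
        ET.SubElement(hdr, key).text = str(value)
    for inst in instances:
        el = ET.SubElement(root, "Instance")
        for key, value in inst.items():
            ET.SubElement(el, key).text = str(value)
    return ET.tostring(root, encoding="unicode")
```

A recipient that only needs, say, the channel and audio format for algorithm testing can read the session header without touching the per-turn audio.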
[Review of terminology — multi-factor and multi-modal including multi-SIV engine]
Session 8: Data format for Multi-modal and Multi-Factor Applications
Jim Larson: Extensible Multi-Modal Annotation Markup Language (EMMA)
Slides: http://www.w3.org/2008/08/siv/Slides/Intervoice/EMMAjl.pdf
Jim Larson:
- EMMA has been developed by the MMI working group of the W3C
- Reference: http://www.w3.org/TR/emma
- EMMA defines various annotations, such as confidence, timestamps or medium
- EMMA represents user input
Valene Skerpac: Is "video" also supported as a medium?
Jim Larson: Yes.
Jim Larson:
- Verification might also be represented using EMMA
- Example on page 8: shows standardized EMMA annotations (within the EMMA namespace) and a possible annotation of claim and verification result.
Chuck Johnson: Does EMMA also define the <claim> and <result> tags shown in the example (see example on page 8)?
Dan Burnett: No, <claim> and <result> are within an application-specific namespace (i.e. they would have to be defined by the application)
Jim Larson:
- EMMA may also be used to represent fused user input
- The fusion process is the responsibility of the application; EMMA is just for representation
Cathy Tilton:
- In December we discussed how to combine EMMA with CBEFF
- The action item from the teleconference to develop a joint use case has not been looked at again
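Dan's point that `<claim>` and `<result>` live in an application-specific namespace, while the annotations (confidence, medium) live in the EMMA namespace, can be sketched with a small generator. The application namespace URI and the element names are invented for this illustration.

```python
import xml.etree.ElementTree as ET

EMMA_NS = "http://www.w3.org/2003/04/emma"
APP_NS = "http://example.com/siv"  # hypothetical application namespace

def emma_verification_result(claim: str, decision: str,
                             confidence: float) -> str:
    """Build an EMMA document carrying an SIV verification result.
    emma:confidence and emma:medium are standard EMMA annotations;
    app:claim and app:result are application-defined."""
    ET.register_namespace("emma", EMMA_NS)
    ET.register_namespace("app", APP_NS)
    root = ET.Element(f"{{{EMMA_NS}}}emma", {"version": "1.0"})
    interp = ET.SubElement(
        root, f"{{{EMMA_NS}}}interpretation",
        {f"{{{EMMA_NS}}}confidence": str(confidence),
         f"{{{EMMA_NS}}}medium": "acoustic"})
    ET.SubElement(interp, f"{{{APP_NS}}}claim").text = claim
    ET.SubElement(interp, f"{{{APP_NS}}}result").text = decision
    return ET.tostring(root, encoding="unicode")
```

Because the SIV-specific content is namespaced by the application, the same EMMA container can also carry fused multi-modal input without any change to the EMMA side.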
-------------------------------
BioAPI and BIAS - Cathy Tilton
-------------------------------
BioAPI
-------
(Presentation)
cathy: explains a high-level summary of BioAPI
* history, basic architecture, functions
* operations: basic functions and primitive ones
* handling security concerns
* technology modules
* implementations: Windows, Linux
* BioAPI BIR (data format): includes the Biometric Data Block (BDB)
* related projects: US, ISO
(Q&A)
--- jim: asks about the data format
cathy: explains BIR is a binary data format, but the next version
... would be able to contain XML as well
--- homayoon: for speech recognition?
cathy: there should be some implementation. maybe L&H?
homayoon: streaming applications?
cathy: static vs. dynamic usage...
... not easy, though
homayoon: any plan for updates?
cathy: the ISO version might be updated with several new operations
... the BioAPI interoperable framework should be useful for server/client
... applications
... a Japanese committee uses BioAPI for a dynamic search system
... also conformance tests
... the US version is based on BioAPI Ver. 2.0
--- dan: what does "framework" mean?
cathy: it communicates with BSPs and applications
dan: that's why there are two levels of API: (1) between app and framework and
... (2) between framework and BSP
--- ken: the framework could be a voice browser
cathy: right
--- homayoon: what's the registry?
cathy: a schema of which kinds of BSP are supported
homayoon: device support info as well?
cathy: in v2.0, there is a device schema as well
ken: see the "Biometric operations" slide
--- homayoon: a specific device for a BSP?
cathy: rather, query what's available
--- jordan: is anybody using that?
cathy: good question
... quite a few products
... v2.0, which is much more powerful, doesn't have many reference
... implementations, though

BIAS
-----
(Presentation)
cathy: explains BIAS using the short version of the slides
* many biometrics standards, but what's missing?
* BIAS - driving requirements
* INCITS (spec) & OASIS (web service binding) collaboration for Web apps
* BIAS system context (INCITS vs. OASIS)
* OASIS defines the SOAP profile and their work includes XML schema
* BIAS operations
* Representing biometric data: XML, CBEFF, and metadata
* BIRs are binary, though
* two methods: a set of individual data items, or existing formats
* status: INCITS 442, OASIS BIAS SOAP Profile
* plan for next meetings: INCITS on Apr. 14-15, OASIS on Mar. 17
* possible relationships: architectural relationship to W3C's SIV
* communicate with bigger apps using BIAS while talking with the voice MC using SIV?
* data/organizational relationship?
(Q&A)
--- jim: how to combine these with EMMA?
cathy: compatibility problem
jim: compatible with what, then?
cathy: BIAS can invoke operations
... it can use EMMA itself or converted data
--- jim: would suggest to Debbie Dahl to work with you to consider
... that compatibility
cathy: don't see any downside to doing that
... EMMA may be able to contain any data format within it
... and could be used for BIAS
--- judith: BioAPI is an existing biometric API
... even though it's now using a binary data format
... to what extent can we import their ideas into our SIV spec?
ken: has some ideas and would like to show his slides (slides TBD)
ken: 1. VoiceXML 2.X SIV integration via the BIAS web service
... BIAS is used for communication between the VoiceXML app and the verification
... app
ken: 2. VoiceXML SIV integration where the VoiceXML browser directly
... communicates with the verification app
ken: 3. VoiceXML SIV integration where the SIV engine is not under the
... verification app but under the VoiceXML browser
... and communicates with the VoiceXML browser
--- cathy: mentions the official comment period for BIAS at OASIS is ending
... tomorrow
... but we'll accept comments anytime
... so please give your feedback
can forward the information on the "call for review"
-------------------------------------
Additional session on SRI's research
-------------------------------------
sri_guy: explains SRI STAR Lab's work
... on speech/speaker recognition
sri_guy: Anatomy of TalkPrinting
* not only voiceprint but also prosody, based on anatomy, etc.
* pitch, energy, speech-style context (e.g., schwas, pauses)
* does not consider the speaker's culture
* NIST SRE08 test results
[lunch]
Define what we are suggesting to change and then go into pros and cons.
Slide: typical architecture for a call center. Use of the record/record-utterance tag. Usually feeds an application server that talks to the SIV engine, usually through a proprietary interface. All we are really talking about is moving (a) where the SIV engine lives and (b) how you talk with it -- accessed directly from the VoiceXML interpreter via extensions. Will vary with platform implementation.
Jim: how does the concept of a challenge dialogue fit? Challenge dialogue data should be held where the voice models are.
JP: that is not true. Credential database.
Ross: collect biometric data separately and store it separately.
Jim: does the application access the credential database and do the challenge dialogue?
Dan: it could be done that way, but if you are using an existing VoiceXML environment, you usually have a grammar that includes the data you want to get. In that sense the credential is not secret.
Jim: do we need to have controls over it?
Ken: it may not be secret.
Dan: in either case the recognizer recognizes and gives a string to the authentication process. You are collecting what the person said, but not audio.
JP: the way you deal with it does not matter whether you use this architecture.
Jim: want a uniform way to deal with models and challenge information.
Ken: the bank already has to deal with the security of that information.
Ross: for us, security is a separate layer. Security is dictated to us by the security director. We have layers within our security. This, being security, must be a separate layer from the application layer. The data collection is at the application layer, but SIV is at one of the security layers. Once you hand data to the security layer it is out of the application.
JP: collection is on the left side and interpretation on the right side of the picture. If you move SIV interpretation to the left side you may have security issues -- even though you may get better performance and it may be easier.
Ken: the left side has not had much security concern. You still need to apply security to things like logs.
You have to keep logs because of regulation. This is a security hole for both the audio and the text log.
JP: two ways to plug the hole: (a) put the ASR where there is no log of the utterance, so there is no trail -- it must be clean all the way, and audit is a problem; (b) apply security.
Ross: we have requirements to log, so we must log. Write-only memory. And there are read-only methods to access it, to show it is only going one way and is tamper resistant.
JP: the next logical step: is there anything different between SIV and ASR?
Ross: it is comparable for SIV. Very careful about how secure the entire system is.
Ken: I have seen specific requirements that are the opposite of that: tap it at the incoming point and record everything, and you cannot blank anything out, so that you have a complete audit trail. Financial and healthcare.
Ross: we have recording requirements as well. We don't record the taking of a credential, but we log/record the results.
Ken: recording requirements are being developed to record everything or to record specific things. There are many conflicting requirements, usually driven by regulations throughout the world that conflict.
Valene: not only requirements.
Ross: do the threats vary depending on the two models you presented?
Chuck: yes, because of access to the VoiceXML interface. You could have the VoiceXML interpreter on the left.
Dan: the interpreter is not doing the interaction with the SIV engine. It is possible for the application to be assured that the VoiceXML interpreter has not interfered with the data -- except for the utterance issue. This applies to the control.
JP: the utterance issue exists in both scenarios: trusting that the interpreter is collecting the utterance properly and giving it unaltered.
Dan: the utterance issue is the same either way, so look at the control issue.
JP: voice model storage changes.
Alfred: the question of how to delete and adapt a model.
JP: don't allow an interpreter to do that. You must manage it through another interface. That leads into the issue of who is using this. The second framework is not specific to call centers.
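Ross's "write-only, tamper-resistant" logging requirement is commonly met with a hash chain: each log record carries a digest that covers the previous record's digest, so any after-the-fact edit breaks every later link. A minimal stdlib sketch of that idea, not any particular product's mechanism:

```python
import hashlib
import json

def append_record(log, record):
    """Append a record whose digest covers the previous record's digest,
    making retroactive edits detectable."""
    prev = log[-1]["digest"] if log else "0" * 64
    body = json.dumps(record, sort_keys=True)
    digest = hashlib.sha256((prev + body).encode()).hexdigest()
    log.append({"record": record, "digest": digest})

def verify_chain(log):
    """Recompute every digest; returns False if any record was altered."""
    prev = "0" * 64
    for entry in log:
        body = json.dumps(entry["record"], sort_keys=True)
        if hashlib.sha256((prev + body).encode()).hexdigest() != entry["digest"]:
            return False
        prev = entry["digest"]
    return True

log = []
append_record(log, {"event": "SIV verify", "result": "accept"})
append_record(log, {"event": "SIV verify", "result": "reject"})
assert verify_chain(log)
log[0]["record"]["result"] = "reject"   # tamper with an earlier entry
assert not verify_chain(log)
```

This gives tamper evidence (you can prove the log is intact), which complements, rather than replaces, the physical write-only storage Ross describes.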
Talk to an agent and do scoring using the framework, or log on using it. Other components in the enterprise can access it. Could still have the SIV engine on the left or the right.
Ken: what if I have multiple modalities in this architecture?
JP: another issue is that the call center could be outsourced.
Ross: most call centers in Australia are outsourced. We are an exception.
Bruce: we also see the opposite: the service providing security is hosted.
JP: you could do a hash on the identity so the outsourcer doesn't know who the person is.
Dan: several different pieces of information flow across the line: audio information, identity claim, control information, result. I don't think there is too much concern about control, except that MRCP allows you to do a delete. We haven't talked about the identity claim and the result flowing over the horizontal line.
Ross: we encrypt the data for the result. We would not allow the VoiceXML interpreter to see the actual result. Our stuff goes through the IVR services, but it is in an encrypted packet. The engine does the encryption.
Dan: if you do that then the application will never see the result.
JP: could be two separate binaries.
Ken: the interpreter box and the SIV box are going to be close to each other. You could put that into a secure network.
Ross: we have the security stuff physically locked. Our engine is physically locked up. All physically and logically secure, but it is close together.
Ken: this isn't unusual.
Dan: you can have the SIV engine on the left give encoded result information. The safest thing is dynamic generation of new pages so that it sees only the current page. Even in the application model the interpreter only gets pages where the decision is made or not made. The interpreter is a slave to the application. It varies with how you build the application and with the security requirements. There are implications for the syntax of pages. You might have syntax to run the SIV process. When a result comes back and is sent to the application, it is encrypted. This gets to the heart of the issue of having SIV syntax in VoiceXML.
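JP's suggestion of hashing the identity so the outsourcer never learns who the caller is works best with a keyed hash: a plain unkeyed hash of a guessable identity (phone number, account ID) can be reversed by brute force, whereas HMAC requires the enterprise-held key. A small illustrative sketch; the key material and identities here are made up:

```python
import hmac
import hashlib

# The enterprise keeps this key; the outsourced call center sees only tokens.
ENTERPRISE_KEY = b"keep-me-server-side"   # hypothetical key material

def pseudonymize(identity: str) -> str:
    """Stable, non-reversible token for an identity claim."""
    return hmac.new(ENTERPRISE_KEY, identity.encode(), hashlib.sha256).hexdigest()

token = pseudonymize("alice@example.com")
assert token == pseudonymize("alice@example.com")   # stable across calls
assert token != pseudonymize("bob@example.com")
```

The token is stable, so the outsourcer can still correlate calls from the same person and route them to the right voice model, without ever seeing the identity itself.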
Uncomfortable having the interpreter involved.
Ken: should be able to do it either way.
Dan: what kind of thresholding information might be sent to the SIV engine?
JP: SIV doesn't know much about thresholds unless it is sending back normalized scores.
Chuck: in the Nuance engine you can set thresholds of several kinds. Some are runtime settable.
Dan: should we send an encrypted configuration block to the engine? Someone would ask about that -- to make it entirely opaque. You could run it this way or not.
JP: there is a configuration file which must be hidden.
Dan: if it is not a runtime-settable file it could be verified by the SIV engine. The problem is the runtime-settable stuff that must pass through the interpreter. That is worrisome.
Chuck: do you see the ability of the interpreter to modify parameters of the engine as a security issue?
Dan: Cathy suggested that the ability to change such things is a security issue. Example: if I have different places in the application where I need to change thresholds for authentication, I might set some thresholds to 50. Someone could set them to 0. The interpreter would be compromised.
Valene: this would be true of other things as well. You could have denial of service.
JP: someone could substitute my voice model for anyone else's.
JP: why are we doing all of this?
1. Simpler credential collection -- so it isn't distributed across a lot of different things. If you could do it in one call, do all of the authentication. Ease of use.
2. Big customer latency problems with existing applications. Call the ASR engine and get a result. The app gets the result and sends it to the SIV engine. That has latencies. The result is returned. You get a 4-5 second delay. In the proposal the interpreter is working on it immediately.
3. Potential improved accuracy of ASR/SIV, because the engines share things.
4. Ken: benefit to application developers, so they don't have to buy, maintain, etc. an SIV engine.
5. Benefit to platform vendors, because they use the same engine across apps. A service in a network you can access. SIV is just another resource.
Question: what if an automobile must do SIV but has no access to a network? Offline processing. Running over a cellular network.
Valene: from a security view this could offer security to the ASR side. Potential to make security overflow.
JP: would that be vendors?
Dan: vendors -- it is end-to-end encryption.
Valene: however you do it there is a benefit. It would be consistent and include ASR. More streamlined as well.
JP: disadvantages. The big question: is market demand such that platform and engine providers would be on board?
Jim: could do.
Dan: one of the things to take away from this: what needs to be enabled as possible in VoiceXML in order to provide security? It may be simple -- an encrypted result and an encrypted configuration block sent to the engine. It may be that simple. Allowing those two to be possible, but not requiring them, enables as tight security as you want. Make sure you can send and accept an encoded block without knowing what is in it. I think it affects architecture. One issue in EMMA is how to represent results. MRCP v2 does not talk about sending back an encrypted block rather than filling in fields. That is a modification that may need to be made. An encrypted configuration block as well. Same for VoiceXML. The implication is mild -- it allows a different kind of data to be sent.
Dan: the effort for browsers would be just changing the kind of data. Very little work. The work is on the vendor side. The vendor needs to support this only if customers are asking for it. The vendor can say: if you want an encoded encryption block, we could add it if you want to do it.
JP: vendors would like this because it gives vendors more of a role. Before, we were concerned about our role in all of this.
Dan: I can now go to my CEO and say: here is how we can add SIV securely to our platform.
Ken: the ability to support multiple engines is helped. It makes SIV look more like ASR. This is from the browser implementer's standpoint. If you have SIV, ASR, and other engines, you have not yet made those decisions.
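Dan's proposal that the browser merely relay an opaque block can be illustrated with a signed result token: the SIV engine attaches a MAC, the interpreter passes the bytes through untouched, and the application server verifies them before trusting the decision. A real deployment would also encrypt the block (e.g., AES-GCM) so the interpreter cannot read it; this stdlib sketch shows only the tamper-evidence half, and all names and key material are hypothetical.

```python
import hmac
import hashlib
import json

SHARED_KEY = b"engine-and-app-server-only"   # never given to the interpreter

def engine_emit_result(result: dict) -> bytes:
    """SIV engine: serialize the result and attach an HMAC tag."""
    body = json.dumps(result, sort_keys=True).encode()
    tag = hmac.new(SHARED_KEY, body, hashlib.sha256).digest()
    return tag + body          # opaque blob from the interpreter's viewpoint

def app_server_accept(blob: bytes) -> dict:
    """Application server: verify the tag before trusting the result."""
    tag, body = blob[:32], blob[32:]
    expected = hmac.new(SHARED_KEY, body, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("result block was modified in transit")
    return json.loads(body)

blob = engine_emit_result({"decision": "accept", "score": 0.87})
# The VoiceXML interpreter just relays `blob`; it cannot alter it undetected.
assert app_server_accept(blob)["decision"] == "accept"
```

The same pattern applied in the other direction (application server signs a configuration block, engine verifies it) addresses the worry about runtime-settable thresholds passing through a possibly compromised interpreter.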
Assume we wait until all results are in, but it is not necessary to do it that way.
Valene: if I'm prompting for something I might use TTS, and someone hacks it so the system asks the person to call a fraudulent phone number, or changes something that is requested.
The following were discussed and agreed to at the Menlo Park meeting. These comments are not endorsed by the W3C or the Voice Browser Working Group and do not appear in order of importance. Each person is asked to provide one idea, topic, or insight from the meeting.
1. Ability to delay tagging of audio and other structures based on downstream information. Example: a machine asks the caller to answer a challenge question; I whisper an answer to an agent, who then can pass the answer to the system. Associates meta-data with the recording in non-real time.
2. Obtaining a CBEFF format ID for EMMA data. Figure out how to integrate EMMA with CBEFF. Cathy will offer to take the action to get a format ID at the M1 meeting in April.
3. There are multiple ways of integrating SIV with VoiceXML, including adding SIV functions as native features to VoiceXML 3 (e.g., enroll, identify, verify) or as Web services.
4. Support the transfer of a secure, encrypted control block from the application server to the SIV engine through the browser, and the return of a secure, encrypted result from the SIV engine to the application server through the browser. This reduces the risk of tampering by the browser or the VoiceXML code. In a hosted environment this can increase the trust of the hosted configuration.
5. Recommend that language specifiers, platform developers, and application developers use and promote a secure lifecycle, leveraging existing standards and methods where applicable.
6. We did not explicitly discuss speaker identification. There needs to be more discussion of speaker identification.
7. Standards will be built bearing in mind good security practices.
8. The information-fusion algorithm is not a solved issue in the scientific community. Standards should not prescribe the decision-making algorithm.
9. The browser does the collection of the biometric sample. The process that does the comparison with the reference (voiceprint/voice model) is separate from the browser.
The decision can incorporate the result of one or more comparisons. There are many places where this decision can occur.
10. In addition to enroll, identify, and verify, we agreed that some administrative functions (e.g., copy) are necessary. We did not know whether they belong in VoiceXML 3 or not.
11. We should coordinate with other organizations, notably ANSI/INCITS M1, OASIS/BIAS, and the Speaker Biometrics Committee of the VoiceXML Forum. This includes work on developing a standards framework. We should reference existing work rather than re-creating it. This includes security and policies/best practices.
12. We agreed to collaborate on security activities with other W3C working groups, notably the security working group.
13. Public relations is an important issue to be considered during the development and use of the standard. There is an ISO publication related to this (ISO 24714-1).
14. There is a specification that includes a minimal set of audio formats (ANSI/INCITS 456). The VBWG should consider using or including this specification.

Discussion: is government the only real market for SIV?
- A Canadian phone company and IDZ bank in Australia are deploying.
- Governments and banks will need to know that they are dealing with the best/right technology.
GM: what am I doing to make things easier? This is important for making life easier. Parameters are different.
Jim: wants SIV to replace passwords. Ken agreed. When someone calls, I want to know who is calling.
Jordan: put things on the device. On calls, we want to know who is talking. Agree that SIV has markets beyond government. A market is opportunity + money. I haven't seen the money yet.
JP: anytime you have an enterprise where someone must authenticate, you need SIV.
Frankie: anytime you have an enterprise where people call about once a year, you need SIV.
JP: anyone who needs to know who you are could be interested in SIV.
-----------------------------
JP: ARE WE MISSING ANYTHING?
-----------------------------
Why SIV functionality should be added to VoiceXML:
1. Performance benefits / customer-perceived latency. The system is more responsive.
2. Easier for developers. The programming interface is consistent with the way they use other VoiceXML resources and shields application developers from low-level operations.
3. Efficiencies of scale in hosted environments. Can share services.
4. From the business side: makes it portable. Standardization of an easy-to-use API will grow the market. If you can show a standards-based implementation you can grow the market.
5. Facilitates integration with the Web model.
6. Makes SIV applications consistent with the Web model.
7. Drives towards service portability across implementation platforms.
8. Minimizes the level of vendor lock-in.
9. Support in VoiceXML enables SIV use (without the application server) with intermittent/offline connectivity.
10. Standards are a sign of technology maturity.
Ken: we need to do a better job of marketing what we are doing with VoiceXML 3. Some people feel VoiceXML is dead. The VBWG needs to work on marketing.
Cathy: Biometric Consortium conference -- can try to get a speaking spot at the Biometric Consortium.
Ken: also do so in the contact center community.
---------------
Big take-aways
---------------
1. We have actually identified the main hidden security issues that people have concerns about, and discussed ways in which they can be realistically addressed. Those issues don't disappear, but we can address them.
2. Look in more detail at configuration possibilities, but we need to examine more of the security threats.
3. VoiceXML exists.
4. SIV fits into the VoiceXML space.
5. There is an avenue for multimodal applications.
6. Mechanisms for configuration of the engine; we can standardize some engine parameters.
7. How VoiceXML fits into the larger standards picture.
8. A lot of things were discussed, but some of them are outside the scope of an engine vendor.
A good standard that would create a nice wrapper around engines would be needed. It does not exist.
9. VoiceXML 3 is one of the interaction managers of MMI. Collaboration with other working groups.
10. Higher-level speaker-recognition functions (e.g., classification, segmentation) were not discussed. The VBWG needs to determine whether to include them in VoiceXML 3.
11. The VBWG should address the synchronization and markup integration of multiple modalities.
12. There are likely to be multiple modalities/factors involved in an interaction using VoiceXML. Consequently, developers need a way to not completely separate those modalities. Integrated use of multiple modalities/factors should be made easy.
The Call for Participation, the Logistics, the Presentation Guideline and the Agenda are also available on the W3C Web server.
Judith Markowitz, Ken Rehor and Kazuyuki Ashimura, Workshop co-Chairs