Workshop on Speaker biometrics and VoiceXML 3.0 — Minutes

Menlo Park Workshop

Jordan Cohen - SRI International
Homayoon Beigi - Recognition Technologies, Inc.
Judith Markowitz - J. Markowitz, Consultants
Jim Larson - Intervoice
Bruce Balentine - EIG
Ingmar Kliche - Deutsche Telekom AG, Laboratories
Ross Summerfield - Centrelink
Chuck Johnson - iBiometrics, Inc.
Cathy Tilton - Daon
Ken Rehor - Cisco Systems, Inc.
Valene Skerpac - iBiometrics, Inc.
Dan Burnett - Voxeo
JP Shipherd - Nuance
Frankie James - General Motors
Alfred Tom - General Motors
Kaz Ashimura - W3C

Thursday, 5 March 2009

8:45-9:15  Session1: Welcome

Jim Larson
Jim Larson
Welcome and logistics — Jordan Cohen, SRI
Welcome from the W3C — Kazuyuki Ashimura, W3C [ Slides ]
(No specific discussion during this session)

9:15-10:30  Session2: Background

Kaz Ashimura
Dan Burnett
Overview of applications — Judith Markowitz, J. Markowitz, Consultants [ Paper / Slides ]
Topics to discuss:
Question from JL: will the concerns about privacy cause public
backlash that could negatively impact the industry?

Valene: public perception is good.  How the biometric is used and
handled is what is most relevant.

JC: problem is that police or government often have a different
intention for the technology than the scientists do, and that leads to

JM: this kind of battle is already beginning.  Parents are concerned
about how the technology is being used to authenticate their children
in schools.

General question:  what can the speech industry do?

JC: need to include intelligence/law enforcement community in
discussion.  Biggest issue is stealing of biometric id.

There was brief discussion of a court case in Michigan where forensic
investigators claimed that they could tell from small amounts of
speech precisely who the individuals were, and it was thrown out of
consideration, but this could be non-binding on future cases.

Should SIV capabilities in VoiceXML be designed to also support
non-IVR applications?  In discussion of this question, the group asked
the question of whether SIV belongs in VoiceXML 3 at all.  Is it an
IVR technology?

RS: "Whatever we do in VoiceXML 3 must be compatible with what's done
elsewhere".  We consider this to be in the security camp and not IVR.
We will likely use it over the web.

DB: The key is determining which pieces are "ripe for
standardization", and whether there are enough pieces that it is worth
the effort.

KA: there is a Security activity with W3C that we should be working

Judith says we've already contacted them.
SIV in VoiceXML 3.0 — Ken Rehor, Cisco Systems [ Slides ]
Topics to discuss:
We got into a discussion when we reached Ken's security slides.  Some
key points: we need to enable any developer to use biometrics, without
setting policy for how they do it.  And yes, they may do it badly.

RS:  we have completely separated security components from everything else.

Key points:
- browser does the collection
- the comparison with the security token.  There is modest agreement
  that the process that does this is separate from the browser.
- decision incorporating result of one or more comparisons.  There are
  many places where this decision can occur.

10:30-11:00 BREAK

During the break we then had a discussion where Bruce suggested a
different way of thinking about this.  He was volunteered to write up
this model for us to shoot at later today :)

11:00-12:00  Session3: SIV Use Cases

Judith Markowitz
Bruce Balentine
Topics to discuss:
Speaker Verification in a Multi-Vendor Environment — Ross Summerfield, Centrelink   [ Paper / Slides ]
(Minutes TBD)
Position Paper: VoiceXML Speaker Biometrics Workshop — Alfred Tom & Frankie James, General Motors   [ Paper / Slides (TBD) ]
* Listed uses of identification: customizations (seat height, radio
  presets), right to drive, etc.

* Ways to Identify

  - Driver only settings

  - Keyfob (identifies owner of fob, but not the driver).  Isn’t
    useful for multiple drivers of the same vehicle

  - Identifying the user by the cellphone in their pocket

  - Biometrics (could identify the drivers and the passengers).
    Ideally it would be nice if the car identifies you by your weight
    when you sit in the car.  You could ask users to say a phrase, or
    swipe their finger over a card reader

* There is a requirement for multiple modalities

* Might not be required for customization.  Definitely required for
  ability to drive/lockout decisions

  - Connected vs Onboard biometric engines both need to be considered.
    Network only isn’t good, but they think there are advantages to
    network, for example if a single user has multiple cars.

* There are synchronization issues to consider since the person could
  use speech or touch to change the radio station.

* The remainder of the discussion focused primarily on ways to
  implement these multi-modal systems and did not touch on biometrics.

* The basic ask of GM was that if any standard is adopted it considers

12:00-12:30  Session4: Users

Kaz Ashimura
Ross Summerfield
Speaker Identification & Verification — Bruce Balentine, EIG   [ Paper / Slides (TBD) ]
Topics to discuss:
Context: IVR centred.

Core issue: letting data interact.

Today's call center:
The lowest cost processing - touch tone and ASR.
Second chance - ID&V - next low cost processing.
Callers who cannot be handled at either stage are handled entirely by
an operator.

Layers of processing:

bottom layer - voice biometric response and ASR response.  This is the
technology layer, which is uncertain.  The result is a high raw error
rate, e.g. false matches.

real-world layer - being able to combine information.  We know the
problems that can be had and know, from a security perspective, the

Also have the context and suspicion rules.  E.g., when users speak
their own date of birth, they tend not to have disfluencies, whereas
someone reading a date of birth is more likely to.  These are just
probabilities and are therefore uncertain.

business layer - this is the level that the designer really wants to
be at.  Not concerned about being right or wrong, rather about
behaving correctly.  Interest is not EER, but how penalised will I be
if I behave incorrectly.
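
The business-layer view above (the penalty for behaving incorrectly, rather than EER) can be sketched as an expected-cost comparison.  This is a minimal illustration only: the function names, the impostor probability and the cost figures are assumptions made for the example, not something presented at the workshop.

```python
def expected_costs(p_impostor, cost_false_accept, cost_false_reject):
    """Expected penalty of each action for one caller.

    p_impostor is an assumed estimate (from the lower layers) of the
    probability that the caller is not the claimed person.
    """
    cost_if_accept = p_impostor * cost_false_accept
    cost_if_reject = (1.0 - p_impostor) * cost_false_reject
    return cost_if_accept, cost_if_reject


def choose_action(p_impostor, cost_false_accept, cost_false_reject):
    """Pick the action with the lower expected penalty."""
    accept, reject = expected_costs(p_impostor, cost_false_accept, cost_false_reject)
    return "accept" if accept <= reject else "reject"


# Same caller and same evidence, but different business stakes.
print(choose_action(0.05, cost_false_accept=100.0, cost_false_reject=20.0))
print(choose_action(0.05, cost_false_accept=1000.0, cost_false_reject=20.0))
```

With the hypothetical figures above, the same 5% impostor estimate leads to acceptance for a low-value transaction and rejection (or step-up) for a high-value one, which is the sense in which the designer cares about behaving correctly rather than about the raw error rate.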

The interception model.  Assume I have no concern for self service.
But I have to track the person on the way to the agent.  This means
that you do not have to deal with security in the IVR.

You can just take the failures (a small but expensive population,
probably with lots of fraudulent users) and process these before they
present to the agent.  The other (majority of) users are not
handled through this process.  The machine is asking questions and
passing answers to the agent.  The agent, hearing those customers
speak, can make a judgement.  The next time the caller calls, they get
the 'sock' and 'a sock' and 'a sock' treatment.

Next stage is to move up the line to the next most expensive group,
and so on, until you get to the full screen pop callers, say over a
period of 9 to 12 months.  Cost saving is the reduction in call
handling time (not in security).  Security (and privacy) increase is a
side effect.

This model trades off calendar time for boot strapping capacities for
the (large) user population.

The interception model - the call goes into the IVR, then the sentry
selects some of those calls and bootstraps the enrolment as the call
goes to the agent.

The front door model - the sentry is (logically) integrated with the IVR.
You can then give custom menus to the caller.  This is about
customising the experience, and thereby streamlining the process.  In
this model, the sentry is a voice biometrics box.  A voice biometrics
is really a 'nearness' engine.

The eaves-dropping model - the Sentry sits with the agent (rather than
the IVR).  This model has many issues, but also has merit.  This
process elicits confidence.  In addition to the automated
authentication, the agent gets to hear the user and their issues.  In
this model, the sentry may, but probably will not, issue the
challenges.  That is more likely to be done by the agent.  This model
readily allows step-up.  Note that details about interaction are
glossed over in this model explanation.  The idea is to stop the human
agent from hearing the security credentials.

All of the models have a lot of background work in terms of collecting
data for enrolment (and authentication).  Agreements or warnings would
be said: many of the explanations are not legally binding, but are
knee-jerk reactions on the part of the agent.  However for the
biometric model, there are not a lot of court case tests yet on what
you do need to inform the caller of.

In the ROI Grows Over Time, to start with, there are no biometrics
involved, but the minute they are, then the caller needs to be
informed that recordings are being made.  Once you have enough data,
then you have to work out whether the data could be used for
enrolment.  Controls will be influenced by the privacy laws of the

Bag of issues: do we need guidelines?

Service oriented dialogue box and security oriented dialogue box
models - by uncoupling, you get a better approach, leading to
development of a better migration strategy.  Often fraud is not a cost
effective argument for self service authentication level increases.

We need to think in the way that there may be already existing things
that can be leveraged off to help us out.  The reverse tagging of data
becomes important.  There may be events in the future that redefine
events in the past.

We must be able to invoke SIV functions from a VoiceXML dialogue.  But
there are functions that may be required of SIV outside of VoiceXML.
VoiceXML is about the voice user interface.

Streaming - support for analysis part way through the delivery of a
speech stream.  Issues with speaking too soon, etc., knowing where you
are relative to the mark.  It is specified in VXML 2.  You can use
this for changing thresholds, for example.  Issues for double
barge-in, "I wan", barge-in stops speech, then user "I want to ..." -
the recogniser has serious issues with this (but people don't).

The suggestion is that streaming ASR and streaming SIV could
potentially run in parallel.  (Different to the end-pointing environment.)
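
As a rough illustration of analysis part way through a speech stream, the sketch below keeps a running score over audio chunks and stops as soon as it is confident either way.  Everything here is assumed for the example: the per-chunk scores, the threshold values and the function name are hypothetical and do not come from VoiceXML or any engine.

```python
def streaming_decision(chunk_scores, accept_at=0.9, reject_at=0.2, min_chunks=2):
    """Consume per-chunk SIV scores and decide as early as possible.

    Returns (decision, chunks_used), where decision is "accept",
    "reject" or "undecided" if the stream ended without a clear result.
    """
    total = 0.0
    count = 0
    for score in chunk_scores:
        count += 1
        total += score
        running = total / count
        # Require a minimum amount of audio before committing.
        if count >= min_chunks:
            if running >= accept_at:
                return "accept", count
            if running <= reject_at:
                return "reject", count
    return "undecided", count


print(streaming_decision([0.95, 0.93, 0.2, 0.1]))
print(streaming_decision([0.5, 0.6, 0.4]))
```

The thresholds are parameters so that an application could change them mid-dialogue, in the spirit of the "changing thresholds" remark above.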

If you are using the same engine for doing ASR and SV, then they can
leverage off each other.

In the 2x model right now, there is a wait for the results to come
back from both systems.  Should there be a wait for VXML V3?  E.g., if
we know that the caller is the wrong gender from one SV engine, why
bother to wait for the second engine to come back?

1.  First one wins
2.  Bayesian
3.  Dempster-Shafer (theory of evidence).
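
The three combination strategies listed above can be sketched as follows.  This is a minimal illustration under stated assumptions: the function names, the decision labels and the dictionary layout for belief masses are invented for the example, and the Bayesian variant assumes the engines are conditionally independent.

```python
import math


def first_one_wins(decisions):
    """Return the first conclusive decision in arrival order.

    decisions: iterable of "accept", "reject" or None (no opinion yet).
    """
    for decision in decisions:
        if decision is not None:
            return decision
    return None


def bayesian_fusion(prior_genuine, likelihood_ratios):
    """Naive Bayes fusion: prior odds times each engine's likelihood
    ratio P(evidence | genuine) / P(evidence | impostor)."""
    odds = prior_genuine / (1.0 - prior_genuine) * math.prod(likelihood_ratios)
    return odds / (1.0 + odds)


def dempster_combine(m1, m2):
    """Dempster's rule of combination for the frame {genuine, impostor}.

    Each mass function is a dict with keys "genuine", "impostor" and
    "unknown" (mass left on the whole frame); the values sum to 1.
    """
    conflict = m1["genuine"] * m2["impostor"] + m1["impostor"] * m2["genuine"]
    if conflict >= 1.0:
        raise ValueError("engines are in total conflict")
    norm = 1.0 - conflict
    combined = {}
    for h in ("genuine", "impostor"):
        combined[h] = (m1[h] * m2[h] + m1[h] * m2["unknown"] + m1["unknown"] * m2[h]) / norm
    combined["unknown"] = m1["unknown"] * m2["unknown"] / norm
    return combined


# Engine A already rejects (e.g. wrong gender); engine B has not answered.
print(first_one_wins(["reject", None]))
print(bayesian_fusion(0.5, [4.0, 3.0]))
engine_a = {"genuine": 0.6, "impostor": 0.1, "unknown": 0.3}
engine_b = {"genuine": 0.7, "impostor": 0.1, "unknown": 0.2}
print(dempster_combine(engine_a, engine_b))
```

Only the first strategy avoids waiting for every engine; the other two need all results (or a default for the missing ones) before a combined value exists.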

This decision needs to be pulled out of the engines, due to the

There are different kinds of decisions that need to be made.
E.g. what person, what word, etc.  In ASR, there is some decision
making going on at the device level, e.g. match or nothing heard or
poor audio quality, which leads to actions taking place.  This would
translate to SIV.

For processing speech recognition, we have moved to keeping the
decision fuzzy much longer (than was once the case).  We need to allow
that sort of movement in SIV.  Larry's perspective is to leave the
decision making to the engine, not to later.

12:30-2:00 LUNCH

2:00-2:45  Session5: Audio Formats

Ken Rehor
Chuck Johnson
A Standard Audio Encapsulation Method — Homayoon Beigi, Recognition Technologies   [ Paper / Slides ]
Topics to discuss:
How do these different entities send info to one another

What should be standardized - audio formats, recog results, and the
integration with engines

- basic list of formats
- open source
- ease of adoption
- stability and quality
- need to have a small number of formats (Jim - no more than 3)

Sampling process - periodic and multi-rate (really means variable bit
rate coding)

Coding scenarios - lossless, compressed, multi-rate
o not Wav
o G.711 alaw, ulaw
o not ADPCM - too many variations
o not MP3 - not open
o OGG Vorbis
o OGG Media Stream

Jim - now we have more than 3, HB - this list is really only 3

OGG Vorbis is very easy to implement

John - Google is using the Speex speech codec for their speech search

OGG Vorbis can stream many types of media/formats

What is not covered -

Interchange requirements
Missing features
Compelling reasons to


JP: Customers say "this is the audio that I am going to give you" [in
    our own particular format]
HB: You can convert other formats to the standard - the standard
    for interchange and storage

Dan: There are many new types of codecs coming online all the time
HB: This is the audio standard for SIV engines

Ken: who is responsible for conversions - the server?
HB: If this is a standard, then the audio that is sent to the SV
    engine will conform to the standard

John: There are good motivations for new codecs.  All compression codecs
      are non-linear.  Transcoding of non-linear codecs is crazy
HB: The users need to convert to the standard
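
John's point about non-linear codecs can be illustrated with mu-law companding, the idea behind G.711 ulaw.  The sketch below uses the continuous mu-law formula with a simple 8-bit quantizer; it is an assumption-laden approximation for illustration, not the bit-exact segmented encoding that G.711 specifies, and the function names are hypothetical.

```python
import math

MU = 255.0  # companding constant used by mu-law


def mulaw_compress(x):
    """Map a sample in [-1, 1] to the companded domain [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def mulaw_expand(y):
    """Inverse of mulaw_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def quantize(y, bits=8):
    """Round a companded value to a signed grid of the given width."""
    levels = 2 ** (bits - 1) - 1
    return round(y * levels) / levels


def round_trip(x):
    """Encode and decode one sample, as one transcoding step would."""
    return mulaw_expand(quantize(mulaw_compress(x)))


for sample in (0.01, 0.3, 0.9):
    print(sample, abs(round_trip(sample) - sample))
```

Because the quantization step is uniform only in the companded domain, the absolute error grows with amplitude, and each further conversion to a different non-linear codec adds its own error on top.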

HB: Engine does not really have different front ends for all the
    different types of audio formats
HB: Do the conversion at the proper place - at/on the client is the
    best place

Dan: engines use 3rd-party libs to convert to their internal

Judith: Government agencies that we are dealing with are
        prompting/requiring standards for audio codecs - if there is
        not an audio standard, then the DEFF could not become a
        standard for data exchange

Jim: Is this really necessary

Ken: I have pushed this rope [standards for audio formats] for many

Ken: There is now wide band audio for many types of phones - skype.
     Requirements for new codecs are all over the map

Val: There are security issues with file formats

Jim: what are our options here?
(1) Try to please everyone - snort [not]
(2) Have a small # of things [in the standard] that are useful to a
    large number of users

Jim to Judith: should we wait for the standards groups to render a

Judith - the groups have already accepted the [our] format standard

Jim: should we encourage the stds groups to decide?

Dan: "We should go to the VBWG and tell them about this (proposed)
     standard and ask whether the VBWG wants to use this standard as
     either the complete set or minimum set required for one or more
     of its specifications."

Ken: I want convergence - I want it to be correct - our
     original work was not specific enough to be useful.  I buy into
     what Dan is saying.  I personally do not think that we should
     have the other groups set the standards for us.  I am
     discouraged by the prospect of setting a narrow set of
     standards.  My experience is that there is not going to be a standard

Dan: there is a reason that the IETF came up with a media registration -
     allow the market to decide what formats are going to survive
Dan: if you [the customer] are worried about transcoding - then you
     should use best internal format

HB: How many SV errors do you get from transcoding due to
HB: If we have a standard, then you the customer, if you want the best
    performance, then send it to the SV server in one of the supported
    (standard) format.  It will give you everything that you need to
    support audio transfer.

Jim: Let's go in strong and set a minimal set for the audio formats

Ken: Headers may not be compatible.

JP: Too many versions of RIFF headers

Ken: There are too many and too broad of an
     interpretation/implementation of RIFF headers

2:45-3:30  Session6: Relationship of SIV Specification with MRCP V2

Judith Markowitz
JP Shipherd
SIV in MRCP — Dan Burnett, Voxeo   [ Slides ]
Topics to discuss:
Dan gave a historical view of MRCP

- created by the IETF

- Protocol only carries control messages (no media)

- MRCP v1 developed in 2001 by Cisco, Nuance, and SpeechWorks, but IETF
  disagreed with some of the RTSP tunneling aspects of it

- It was broadly adopted though so it was released as an Informational RFC #4463

- It included only ASR/TTS as the engines

- MRCP v2 was done as a standards track document in SPEECHSC Working group

  o Has its own protocol, not tunneling

- MRCP v2 has the concept of allowing the ability to use data recorded
  in earlier sessions

- attempts to define protocols for controlling

  o ASR

  o TTS

  o Speaker Verification

  o Speaker Identification

Verification Resource in MRCP

- Has a concept of session

- Audio Buffering which can be enabled/disabled

- Simultaneous ASR/SIV can be used or SIV alone

- Supports both Enrollment and Verification mode

- The Result structure is the same for ASR   (NLSML result.   EMMA is also available)

- Small Scale Speaker Identification is available via a “GROUP” voiceprint

Homayoon asked if the SVAPI was considered

- SVAPI included identification and classification

- SVAPI is much more complicated than MRCP

- Dan indicated that a standard is going to only work if all the
  participants in the creation are comfortable that they can work
  within the framework

Security Model

- Audio is expected to be secured via channel-specific mechanism, eg sRTP

- Voiceprints are handled as references that are meaningful only to
  the underlying engine

- Cookies can be used to pass credential information from client to
  server, for server use downstream (ie: to get the voiceprint out of
  some type of secure database)

Dan showed an example of the messages between client/server in an
example call.

- It included a sample result

- Homayoon asked if engines could include results not specifically
  defined in the standard.
  Dan said that if the namespace is done correctly this is OK

- JP asked how the Verifier knows when to stop processing audio.  Dan
  said the expectation is that endpointing occurs in the engine.

Dan described that the MRCP standards group is thinking of breaking
out the standard so that there is a server which handles the control
messages but that there will also be underlying protocols specific to
each engine.

3:30-4:00 BREAK

4:00-5:30  Session7: Architecture and functionality

Jim Larson
Ken Rehor, JP Shipherd
Topics to discuss:

A possible SIV architecture discussed in Menlo Park (Menlo Park Model) — Bruce Balentine, EIG   [ Slides ]
[Notes by JP Shipherd]

Architecture Discussion based on Bruce's updated picture

The team looked at a proposed architecture that puts the SIV engine
"under" the browser and accessible via VXML (much like ASR and TTS
resources are).

JP asked us to weigh the pros and cons of the SIV via VXML approach:

* Allows for streaming authentication
* Allows for ASR/SIV in parallel
  - lower caller perceived latency
  - potentially improved ASR/SIV accuracy
* Simpler credential collection (all done in VXML)

* "Security Concerns" -- need to ask the group to enumerate these, but
  a few might be:
  - audio in transit could be purloined and used for break-in attempts
  - comparison has now moved to the IVR/Call Center environment, where
    other comparisons are happening in the application environment

* Standard will likely:
  - include functionality that is not available in some vendors
  - not include functionality that is available with some vendors

Proposal of an SIV Architecture and Requirements — Ingmar Kliche, Deutsche Telekom   [ Paper / Slides ]
[Notes by JP Shipherd]

Ingmar - DT Labs

* DT Labs is both the technical arm of DT, but also affiliated with a

* DT's VoiceIdent product conforms to the "Common Criteria of the
  EAL2" (which I think is analogous to an ISO standard, but for

* SIV in VoiceXML needs to support

  - SIV Only, SIV in Parallel with ASR, SIV in the same engine as the

  - Event management issues are in play when two engines are running
    in parallel

  - Text Dependent, Text Independent, Text Prompted

  - Enrollment, Verification and Identification

  - Rollback of last turn

  - Query SIV results, Catch SIV events (noinput, nomatch)

  - Voiceprint management (query, delete, copy) should be outside the
    scope of VoiceXML

* Judith asked if the browser interface needs to know details such as
  if Text Dependent or Text Independent is necessary.


* DT wants the ability to be vendor independent and want to ensure
  that there is a decoupling between the VXML browser and the
  underlying resources.

* DT wants to be able to move voiceprints from storage via HTTP/HTTPS.
  They see that the SIV engine would make calls to this interface, but
  so would administrative functions.  Homayoon says that this opens a
  security hole.  Ingmar indicated that his big issue is that he doesn’t
  want to rewrite administrative tools for each SIV vendor’s data
  storage mechanisms.  He wants the application to only have to pass
  references to a voiceprint location and have the engine be able to
  fetch a voiceprint whether it be remote or local.


* He described a few use cases:

  - SIV without ASR.  In this model SIV must do endpointing to do
    barge-in, End of Speech, etc.  He also wants the vendor to support
    the concept of session, so that multiple turns can be taken and
    cumulative scores can be provided to the app

  - SIV with ASR.  In this model two engines are running in parallel
    and they could provide a different set of endpointing events and
    even success failure conditions.  Whose job is it to "normalize"
    these events?  Dan said that this would only work if the MRCP
    server takes care of this.

  - ASR with SIV from buffered audio.
[Notes by Ken Rehor]

Ingmar Kliche
T-Systems Laboratories
What SIV functions should be supported in V3?
- SIV only
- SIV in parallel with ASR (separate resources)
- SIV integrated with ASR in one combined resource
SIV types
- Text independent
- Text dependent
- Text prompted
Decision Control
- Either the SIV engine or the application may control decisions
(regarding acceptance/rejection)
SIV core functionality
- Enrollment
- Verification
- Identification
==> requires: Save voiceprints (after enrollment) and Load voiceprints
    (before verification/identification)
V3 should load/store voiceprints implicitly without explicit markup
Further basic/core functions
- adaptation
- buffering utterances for later use
- rollback/undo of last turn
- query SIV results
- catch SIV events
- Query, copy, delete (and other administrative operations) of
  voiceprints are out of scope of V3
What should the interface be between the SIV engine and voice model

What administrative functions should we provide?
- Support MRCPv2 for integration of SIV engines
- Extend MRCP vs. limited SIV functionalities
- Use EMMA for representation of SIV results
- Use web protocols for voice print transport
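
DT's request that the application pass only a voiceprint reference, with the engine fetching it whether local or remote, might look roughly like the sketch below.  The reference formats, the function names and the example host are assumptions for illustration; the minutes do not define a reference syntax.

```python
from urllib.parse import urlparse


def voiceprint_location(reference):
    """Classify a voiceprint reference as "local" or "remote".

    Plain paths and file: URIs are treated as local storage;
    http: and https: URIs are treated as a remote repository.
    """
    scheme = urlparse(reference).scheme
    if scheme in ("http", "https"):
        return "remote"
    if scheme in ("", "file"):
        return "local"
    raise ValueError(f"unsupported voiceprint reference: {reference!r}")


def uses_secure_transport(reference):
    """True unless the voiceprint would travel over unencrypted HTTP."""
    return urlparse(reference).scheme != "http"


for ref in ("voiceprints/alice.vp",
            "https://vp.example.com/store/alice",
            "http://vp.example.com/store/alice"):
    print(ref, voiceprint_location(ref), uses_secure_transport(ref))
```

The second helper reflects the concern Homayoon raised: an application or policy layer could refuse references that would move a voiceprint over an unencrypted channel.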
Use Cases
#1 Standalone SIV without ASR
- SIV needs to support endpointing (like ASR)
- SIV needs to support timeouts (like ASR)
- SIV should provide bargein
SIV may need multiple turns (within one SIV session)
#2 SIV + ASR
e.g. single utterance of account number used for ASR and SIV
- SIV and ASR independently might return events such as noinput, thus
  the application must deal with them
- rollback/undo must be supported
#3 ASR + SIV from buffer
requires new ASR function -- able to accept an utterance from a buffer
#4 ASR + SIV from file
Sessions: need to define a session that comprises one or more
dialog turns.
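
A session that spans several dialog turns, reports a cumulative score and supports rollback of the last turn could be sketched as below.  The class, its method names and the use of a plain mean as the cumulative score are assumptions for the example, not a proposed V3 or MRCP interface.

```python
class SivSession:
    """Hypothetical SIV session holding one score per dialog turn."""

    def __init__(self):
        self._turn_scores = []

    def add_turn(self, score):
        """Record the score of one more utterance."""
        self._turn_scores.append(score)

    def rollback_last_turn(self):
        """Discard the most recent turn; return its score, or None."""
        if not self._turn_scores:
            return None
        return self._turn_scores.pop()

    def cumulative_score(self):
        """Mean of the turn scores so far, or None before any turn."""
        if not self._turn_scores:
            return None
        return sum(self._turn_scores) / len(self._turn_scores)


session = SivSession()
session.add_turn(0.8)
session.add_turn(0.2)   # e.g. a noisy utterance the application distrusts
print(session.cumulative_score())
session.rollback_last_turn()
print(session.cumulative_score())
```

Rolling back one turn only needs the last score; rolling back across sessions would require keeping earlier state, which is the storage question raised later in the minutes.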

Configuration and Management of Speaker Verification Systems — Chuck Johnson, iBiometrics   [ Paper / Slides ]
[Notes by JP Shipherd]
Chuck Johnson - Configuration Management Issues

* Results that come back differ from engine vendor to engine vendor.

  - Normalized Scores (different engines have different scales)

  - Confidence Scores (this is the confidence of how well the
    normalized score fits within the configured operating point)

  - Raw scores are also sometimes returned

  - It would be hard for all vendors to have the same meaning for a
    given score, but perhaps we could standardize on a scale (ie:

  - It would also be nice to recommend a set of “error conditions”

  - Should engines return a binary decision

* Database Management

  - Database setup is clearly outside of the scope of an application

  - Applications may be precluded from being able to delete
    voiceprints.  If this is the case apps need to have some way of
    "cleaning up" partially enrolled voiceprints

* Enrollment

  - Apps need to have control of enrollment and manage the flow of the
    dialog that creates voiceprints

  - Engine vendors need to have a defined set of return/error codes
    for enrollment too

* Background Models

  - Out-of-the-box applications come with background models that might
    not be appropriate for the calling population

    o Children, high frequency women’s voices are particularly

  - Should apps be able to specify background models on the fly

    o Engines would need to be able to load them on the fly

    o Obviously vendors would need to expose the ability to create

* Different Classes of Users

  - Apps should be able to adapt threshold based on knowledge about
    the user or the task they are doing.  (Isn’t this almost
    universally available today since it is out of the purview of the

  - Apps should be able to dynamically set supervised/unsupervised
    adaptation.  JP asked if there is a real use case for unsupervised
    adaptation.  Chuck said there definitely is in circumstances where
    the engine is making a yes/no decision.

  - Judith said adaptation can also be used to “extend an enrollment”
    over the first couple of calls into the system.

* Voice Model Adaptation

  - Clients should be able to rollback adaptation.

  - JP raised the question of “how much rollback do you want”, since
    single utterance, multiple utterance and multiple session rollback
    all have different costs from a provisioning and data storage
[Notes by Ken Rehor]
Chuck Johnson
iBiometrics, Inc.
Configuration and Management of SIV Engines
VoiceXML Forum Speaker Biometrics Committee surveyed major SIV vendors
regarding features and functions

All results were different - no consistency in interfaces or values of
data such as confidence scores
Some possible standards:
- normalized scores
- consistent minimum set of error codes
- etc.
Need to work with engine vendors to explore which areas might be standardized
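
One way to picture the "normalized scores" idea is a linear mapping from each engine's native range onto a shared scale.  This is a sketch under assumptions: the 0 to 100 target scale, the engine ranges and the function name are hypothetical, and a shared scale does not by itself give scores from different vendors the same meaning.

```python
def normalize_score(raw, engine_min, engine_max, scale=100.0):
    """Linearly map an engine's raw score onto a common 0..scale range.

    Values outside the engine's declared range are clipped.
    """
    if engine_max <= engine_min:
        raise ValueError("engine_max must be greater than engine_min")
    clipped = min(max(raw, engine_min), engine_max)
    return scale * (clipped - engine_min) / (engine_max - engine_min)


# Two hypothetical engines with very different native ranges.
print(normalize_score(0.0, engine_min=-5.0, engine_max=5.0))
print(normalize_score(750, engine_min=0, engine_max=1000))
```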
Voice Model Database Management
--> Applications shouldn't be permitted to create a voice model database
Enrollment - Voice Model Creation
--> in scope of the application, assuming the database is setup properly and securely
Need a minimum set of return results, data, and error codes
Distinct User Populations
World models might be tailored for specific application scenarios
--> gender, age, regional differences
--> Application may need to dynamically load and update a specific
    world/background model
Different Classes of Users
- Corrections applications, finance applications
- classes of users based on risk profile
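
Application-controlled thresholds per class of user could be as simple as the lookup below.  The class names, threshold values and function name are invented for illustration, and scores are assumed to lie on a common 0 to 1 scale.

```python
# Hypothetical acceptance thresholds per risk profile.
THRESHOLDS = {"low": 0.50, "medium": 0.70, "high": 0.90}


def accept(score, risk_class):
    """Accept the caller only if the score meets the class threshold."""
    return score >= THRESHOLDS[risk_class]


# The same verification score gives different outcomes by risk class.
print(accept(0.75, "medium"))
print(accept(0.75, "high"))
```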
Voice Model Adaptation
--> Adaptation is necessary
enable/disable, set thresholds, query adaptation outcome, rollback
adaptation (from last turn)
Rollback requires more storage -- is that a problem?
One rollback is simple, across sessions is more complex: how far back
should this be supported?
Single-utterance rollback?
Entire-session rollback?

5:30 Day1 ends

7:00- DINNER

Friday, 6 March 2009

8:30-9:20  Session8: Security and identity management

Cathy Tilton
Ken Rehor

[Quick reminder that there is security terminology in the glossary]

Security, Identity Management, and Risk Management — Valene Skerpac, iBiometrics   [ Paper / Slides ]
Topics to discuss:
Privacy, Security and Risk Management considerations
- Incorporate Security and Privacy in Development Model
Confidentiality, Integrity, Availability
see NIST Security Software Development Lifecycle standard
What are the 5 most important things needed to make an SIV application secure?

SIV applications will be subject to many security breaches

V3 environment is more complex (multimodal, open source, multiple
networks, etc) and thus subject to more vulnerabilities

5 steps to take to make an SIV application secure
1. Envision - Identify Threats/Risks
2. Plan - Profile, threat/risk modeling, generate requirements
3. Develop - control check
4. Release - Handle threat/risk
5. Stabilize - Learn and Educate
---> Recommendations to the industry to foster secure SIV software
     design, development & deployment
ANSI M1 produced a document on threats

See free report on ANSI/INCITS website Study Report on Biometrics in
How to keep my database of voice models secure?
- Utilize standard methods and best practices which are consistent
  with the organization's security framework
How to keep my voice models and other data secure when I transmit them
to others?
- ditto
Should voice models be encrypted?  What's the value of getting access
to a voice model?
Biometric information is categorized as personally identifiable
information, thus may require encryption. Depends on overall system
requirements and deployment scenario.
Choices for specification of security in V3:
- specify nothing
- require specific encryption and security standards
Facilitate strong security but don't require it
How to ensure V3 doesn't preclude security standards
OMB M-04-04, E-Authentication Guidance for Federal Agencies,
How can we structure V3 SIV support so that it can be governed by
security and privacy policies of an organization?
- ISO 19092
What are security and privacy regulations of which we must be
--> MANY and expect more!!
Who should specify and/or recommend policies and regulations?
Complementary activities:
--> V3 language development in W3C VBWG, aligned with recommendations
    from 3rd parties
--> VoiceXML Forum develops collection of recommendations, policies,
    standards to be used in V3
Need to make sure other related standards are considered, as well as
entire application environment -- Security isn't only specific to SIV
Additional references -
Other security standards - in progress

- ISO/IEC 24745, "Biometric Template Protection", WD3

- ISO/IEC 24761, "Authentication Context for Biometrics (ACBio)", FCD

- ISO/IEC 19792, "Biometric Security Evaluation", CD3

- ITU-T Study Group 17 Question 8, "Telebiometrics System Mechanism (TSM)"

- ITU-T X.1081, Telebiometric Multimodal Model Framework (TMMF), Q.8/17

9:20-10:00  Session9: Relationship with CBEFF and INCITS 456

Valene Skerpac
Homayoon Beigi
Common Biometric Exchange Format Framework (CBEFF) — Catherine Tilton, Daon   [ Slides ]
Topics to discuss:
INCITS 456 – Data Interchange format for SIV — Judith Markowitz, J. Markowitz, Consultants   [ Slides ]
Topics to discuss:
Cathy Tilton
CBEFF (Common Biometric exchange formats framework)
 - a set of metadata elements for exchanging biometric data particularly over

History of CBEFF:
  started in Feb. 1999 in a bio-consortium workshop
   -> NISTIR 6529 
     -> NISTIR 6529-A
       -> INCITS 398-2005 ...

  metadata header (SBH)
  biometric data  (BDB)
  security information BLK (SB) -- optional
  defines abstract data elements used to describe the data
    registration of biometric data via IBIA
     allows for new adaptations

Mandatory header is a format owner and type (4 bytes)
  - It allows you to recognize the format - could be vendor specific or
    standard, processed
    The header - standard or proprietary
                 published or unpublished
                 raw, intermediate or processed
                 latest version has a format ID for the security block

Format Owner (1st two bytes of the mandatory header)
  INCITS M1   0x001b
  ISO/IEC     0x0101
  NIST        0x000f

  IBIA website has place where vendors can register to become format owners.

format type is the 2nd 2 byte field
  0x0201    minutiae (basic)       INCITS...
  0x0701-6  Signatures (various)   ......
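
The 4-byte format identifier described above (2-byte format owner followed by 2-byte format type) can be packed and parsed as in this sketch.  The big-endian byte order and the function names are assumptions for the example; the owner values are the ones listed in these minutes.

```python
import struct

# Format owner values as listed in the minutes.
FORMAT_OWNERS = {0x001B: "INCITS M1", 0x0101: "ISO/IEC", 0x000F: "NIST"}


def pack_format_id(owner, format_type):
    """Build the 4-byte owner/type field (big-endian assumed)."""
    return struct.pack(">HH", owner, format_type)


def unpack_format_id(data):
    """Return (owner, format_type) from the first 4 bytes of data."""
    return struct.unpack(">HH", data[:4])


field = pack_format_id(0x001B, 0x0201)
owner, format_type = unpack_format_id(field)
print(field.hex(), FORMAT_OWNERS.get(owner, "unregistered"), hex(format_type))
```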

The actual data goes in the BDB (Data Block)
  - See CBEFF minutes for an example

examples of header elements are mandatory and optional

  registration authority procedures
  patron formats
  security block formats

patron (a secondary specification that implements  CBEFF in a specific way)

OASIS XCBF is an older version

CBEFF is being used in 

ICAO - E-passports, (logical data structure, LDS)
PIV Federal employee credentials
transportation worker identification credential (TWIC)
registered traveler (RT) cards
other standards,

Judith Markowitz
INCITS 456 Speaker recognition format for raw data interchange (SIVR-1)

- Used for storage and exchange of speech or spoken data
  - only concentrates on the raw data block (BDB) of CBEFF
  - but it does not have to be a part of CBEFF, it could be used with a
    different wrapper -- e.g., FBI has a wrapper it uses around it
  - Draft standard which passed the first public review and has received
    some comments
  - 1st BDB for spoken data and 1st one in XML
  - It was developed jointly by ANSI/INCITS M1 and VoiceXML forum's speaker
    biometrics committee
  - In Australia they have been using XML in the security community for a
  - does not capture features or models like some other BDBs may -- this
    one only uses raw data
  - Goal is to provide information that will enable recipient to analyze the data
     - audio format
     - input device and channel
     - Does include Speaker (sex, age), but not claim, also does not talk
       about whether it is a part of a verification or identification session
     - language/dialect

  - Data sharing can benefit
  - used for watch list creation
  - internal system audits
  - automatic enrollment of users
  - multi-biometric fusion
  - product/algorithm testing
  - SIV registry/service
    (BT was interested in using this is ISO for this purpose)

- ISO/SC37 is working on raw and feature data formats, in both binary and
  XML, and is considering the same data format as INCITS 456

Two levels
  - session header (one session per BDB)
    - things that do not change during the session, such as sex, date of
      session, start and end time, device and channel, audio format, etc.
  - instance header (at least one instance should exist, but more can be there)
    - information that changes from turn to turn of a dialogue
    - prompt used, utterance, quality rating
    - audioformatheader, audio, quality (score, algorithmvendorid, etc.)
    - EMMA may be able to use the DEFF format, it could also use it as a part
      of a derivation (further processing of a single input or more).
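
The two-level layout described above can be pictured with a small Python
sketch. This is only an in-memory illustration: INCITS 456 itself is an XML
format, and every class and field name below is hypothetical, chosen to mirror
the items listed in these notes rather than the actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Instance:
    """Per-turn information: changes from turn to turn of the dialogue."""
    prompt: str
    audio: bytes
    quality_score: Optional[int] = None


@dataclass
class Session:
    """Per-session information: fixed for the whole session (one per BDB)."""
    start_time: str
    end_time: str
    channel: str
    audio_format: str
    speaker_sex: Optional[str] = None
    instances: List[Instance] = field(default_factory=list)

    def validate(self) -> None:
        # The notes say at least one instance should exist.
        if not self.instances:
            raise ValueError("a session needs at least one instance")


session = Session(
    start_time="2009-03-06T10:00:00",
    end_time="2009-03-06T10:02:00",
    channel="landline",
    audio_format="pcm-8khz",
    instances=[Instance(prompt="Please say your passphrase", audio=b"\x00\x01")],
)
session.validate()
```

Note that there is no field for an identity claim or for the kind of session
(verification or identification), matching the remark above that the format
does not carry them.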

10:00-10:05 SHORT BREAK

10:05-10:30  Session10: Data format for Multi-modal and Multi-Factor Applications

Dan Burnett
Ingmar Kliche

[Review of terminology — multi-factor and multi-modal including multi-SIV engine]

Extensible Multi-Modal Annotation Markup Language (EMMA) — Jim Larson, Intervoice   [ Slides ]
Topics to discuss:

Jim Larson: Extensible Multi-Modal Annotation Markup Language (EMMA)

Slides: http://www.w3.org/2008/08/siv/Slides/Intervoice/EMMAjl.pdf

Jim Larson: 
- EMMA has been developed by the MMI working group of the W3C
- Reference: http://www.w3.org/TR/emma
- EMMA defines various annotations, such as confidence, timestamps or medium
- EMMA represents user input

Valene Skerpac:
Is "video" as medium also supported?

Jim Larson:

Jim Larson:
- Verification might also be represented using EMMA
- Example on Page 8: shows standardized EMMA annotations (within EMMA
  namespace) and possible annotation of claim and verification result.

Chuck Johnson:
Does EMMA also define the <claim> and <result> tags shown in the
example (see example on page 8)?

Dan Burnett:
No, <claim> and <result> are within an application-specific namespace
(i.e. would have to be defined by the application)
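
To make the namespace point concrete, here is a minimal Python sketch that
builds an EMMA document whose annotations (confidence, medium, mode) live in
the EMMA namespace while <claim> and <result> live in a separate application
namespace. It is not a reproduction of the slide example. The application
namespace URI, the "siv" prefix, the id value and the function name are all
made up for illustration.

```python
import xml.etree.ElementTree as ET

EMMA_NS = "http://www.w3.org/2003/04/emma"
APP_NS = "http://example.com/siv"  # hypothetical application namespace

ET.register_namespace("emma", EMMA_NS)
ET.register_namespace("siv", APP_NS)


def build_verification_result(claim: str, decision: str, confidence: float) -> ET.Element:
    """Wrap an application-defined claim/result pair in an EMMA interpretation."""
    root = ET.Element(f"{{{EMMA_NS}}}emma", {"version": "1.0"})
    interp = ET.SubElement(
        root,
        f"{{{EMMA_NS}}}interpretation",
        {
            "id": "siv1",
            # Standard EMMA annotations, in the EMMA namespace.
            f"{{{EMMA_NS}}}confidence": str(confidence),
            f"{{{EMMA_NS}}}medium": "acoustic",
            f"{{{EMMA_NS}}}mode": "voice",
        },
    )
    # Application-specific payload, defined by the application, not by EMMA.
    ET.SubElement(interp, f"{{{APP_NS}}}claim").text = claim
    ET.SubElement(interp, f"{{{APP_NS}}}result").text = decision
    return root


doc = build_verification_result("user-1234", "accepted", 0.87)
xml_text = ET.tostring(doc, encoding="unicode")
```

A consumer that understands only EMMA can still read the confidence and mode
annotations. Interpreting the claim and result requires knowledge of the
application namespace.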

Jim Larson:
- EMMA may also be used to represent fused user input
- fusion process is responsibility of application, EMMA is just for

Cathy Tilton:
- In December we discussed how to combine EMMA with CBEFF
- action item from the teleconference to develop a joint use case, has
  not been looked at again

10:30-11:30  Session11: Relationship with the BIAS Project and BIOAPI

Ken Rehor
Kaz Ashimura
The Biometric Identity Assurance Services (BIAS) Project and BioAPI — Catherine Tilton, Daon   [ Paper / Slides on BioAPI / Slides on BIAS ]
Topics to discuss:
BioAPI and BIAS - Cathy Tilton


cathy: explains high level summary of BioAPI
* history,  basic architecture, functions
* operations: basic functions and primitive ones
* handling security concerns
* technology modules
* implementations: Win, Linux
* BioAPI BIR (data format): includes Biometric Data Block (BDB)
* related projects: US, ISO

jim: asks about the data format

cathy: explains BIR is a binary data format, but the next version
... would be able to contain XML as well

homayoon: for Speech recognition?

cathy: should be some implementation. maybe L&H?

homayoon: streaming application?

cathy: static vs. dynamic usage...
... not easy, though

homayoon: any plan for update?

cathy: ISO version might be updated for several new operations
... BioAPI interoperable framework should be useful for server/client
... applications
... Japanese committee uses BioAPI for some dynamic search system
... also conformance tests
... US version is based on the BioAPI Ver. 2.0

dan: what does "framework" mean?

cathy: communicate with BSP and applications

dan: that's why two levels of API (1) between app and framework and
... (2) between framework and BSP

ken: framework could be a Voice Browser

cathy: right

homayoon: what's registry?

cathy: schema, what kind of BSP supported

homayoon: device support info as well?

cathy: in v2.0, there is device scheme as well

ken: see "Biometric operations" slide

homayoon: specific device for BSP?

cathy: rather query what's available

jordan: is anybody using that?

cathy: good question
... quite a few products
... v2.0, which is much more powerful, doesn't have many reference
... implementations, though


cathy: explains BIAS using short version slides
* many biometrics standards but what's missing?
* BIAS - driving requirements
* INCITS (spec) & OASIS (web service binding) collaboration for Web
* BIAS system context (INCITS vs. OASIS)
* OASIS defines SOAP profile and their work includes XML schema
* BIAS operations
* Representing biometric data: XML, CBEFF, and Metadata
* BIRs are binary, though
* two methods: set of individual data items, existing formats
* status: INCITS 442, OASIS BIAS SOAP Profile
* plan for next meetings: INCITS on Apr. 14-15, OASIS on Mar. 17

* possible relationships: architectural relationship to W3C's SIV
* communicate with bigger apps using BIAS while talking with voice MC
 using SIV?
* data/organizational relationship?

jim: how to combine these with EMMA?

cathy: compatibility problem

jim: compatible with what then?

cathy: BIAS can invoke operation
... can use EMMA itself or converted data

jim: would suggest to Debbie Dahl to work with you to consider
... that compatibility

cathy: don't see any downside to doing that
... EMMA may be able to contain any data format within it
... and could be used for BIAS

judith: BioAPI is an existing biometric API
... even though it's now using binary data format
... to what extent can we import their ideas into our SIV spec?

ken: have some idea and would like to show his slides
(slides TBD)

ken: 1. VoiceXML 2.X SIV Integration via BIAS web service
... BIAS is used for communication between VoiceXML app and verification
... app

ken: 2. VoiceXML SIV Integration via 
... VoiceXML browser directly communicates with verification app
... using  and 

ken: 3. VoiceXML SIV Integration
... SIV engine is not under verification app
... but under VoiceXML browser
... and communicates with the VoiceXML browser

cathy: mentions the official comment period for BIAS OASIS is ending
... tomorrow
... but we'll accept comments anytime
... so please give your feedback
... can forward the information on the "call for review"

Additional session on SRI's research

sri_guy: explains SRI STAR Lab's work
... on speech/speaker recognition

sri_guy: Anatomy of TalkPrinting
* not only voice print but also prosody based on anatomy, etc.
* pitch, energy, speech style context (e.g., schwas, pause)
* not considering speaker's culture
* NIST SRE08 test result


11:30-12:00 LUNCH

12:00-12:45  Session12: Proposed SIV Architecture

Jim Larson
Judith Markowitz
VoiceXML 3 SIV Extensions: Pros and Cons — JP Shipherd   [ Slides ]
Define what we are suggesting to change and then go into pros and cons. 

Slide - Typical architecture for call center. Use of record/record
utterance tag. Usually feeds application server that talks to SIV
engine usually thru proprietary interface.

All we are really talking about is moving
a. where the SIV engine lives
b. how you talk with it

access directly from VoiceXML interpreter. extensions

Will vary with platform implementation. 

Jim: how does concept of challenge dialogue fit?  Challenge dialogue
data should be held where voice models are.

JP: that is not true. Credential database

Ross: collect biometric data separately and store it separately.

Jim: does application access credential database and do challenge

Dan: it could be done that way, but if you are using an existing VoiceXML
environment you usually have a grammar that includes the data you want to
get. In that sense the credential is not secret.

Jim: do we need to have controls over it?

Ken: it may not be secret. 

Dan: in either case recognizer recognizes and gives string to
authentication process. Collecting what person said but not audio.

JP: the way you deal with it does not matter whether use this

Jim: want a uniform way to deal with models and challenge information.

Ken: bank already has to deal with the security of that information. 

Ross: for us, security is a separate layer. Security is dictated to
us by the security director. We have layers within our security. This,
being a security function, must be a separate layer from the application
layer. The data collection is application layer but SIV is at one of
security layers. Once hand data to security layer it is out of

JP: collections on left side and interpretations on right side of
picture. If move SIV interpretation to left side you may have security
issues - even though you may get better performance and it may be

Ken: the left side has not had much security concern. You still need to
apply security to things like logs. You have to keep logs because of
regulation. This is a security hole for both the audio and the text

JP: two ways to plug hole

  a. Put ASR where there is no log of the utterance. So, no trail. Must be
  clean all the way. Audit is a problem.

  b. Apply security

Ross: we have requirements to log. So we must log. Write-only
memory. And there are read-only methods to access it to show it is
only going one way and is tamper resistant.

JP: to take the next logical step, is there anything different for SIV and ASR?

Ross: it is comparable for SIV. Very careful about how secure entire

Ken: I have seen specific requirements that are the opposite to
that. Tap it at incoming and record everything and cannot blank
anything out so that have a complete audit trail. Financial and

Ross: have recording requirements, as well. Don’t record taking of a
credential but log/record the results.

Ken: recording requirements are being developed to record all or record
specific things. Have many conflicting requirements. Usually driven by
regulations throughout the world that conflict.

Valene: not only requirements. 

Ross: do threats vary depending on the two models you presented?

Chuck: yes, it does because of access of VoiceXML interface. Could
have VoiceXML interpreter on left

Dan: interpreter is not doing interaction w SIV engine. It is possible
for the application to be assured that the VoiceXML interpreter has
not interfered with the data. Except for the utterance issue. This
applies to the control

JP: the utterance issue exists in both scenarios. Trusting interpreter
is collecting utterance properly and giving it unaltered.

Dan: utterance issue is the same either way so look at control issue.

JP: voice model storage changes. 

Alfred: question of how to delete and adapt model

JP: don’t allow an interpreter to do. Must manage that thru another
interface. That leads into an issue of who is using this

Second framework is not specific to call center. Talk to agent and
scoring using framework or logon using it. Other components in
enterprise can access. Could still have SIV engine on left or right.

Ken: if I have multiple modalities in this architecture. 

JP: another issue is that the call center could be outsourced. 

Ross: Most call centers in Australia are outsourced. We are an exception. 

Bruce: also see opposite. See service providing security is hosted.

JP: could do a hash on the identity so outsourcer doesn’t know who the person is.
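
A minimal sketch of the hashing idea, in Python. It uses a keyed hash (HMAC
with SHA-256) rather than a plain hash, on the assumption that identities such
as account numbers are easy to guess and a plain hash could be reversed by
trying candidates. That choice, the function name and the key are
illustrative assumptions, not something proposed in the meeting.

```python
import hashlib
import hmac


def pseudonymize(identity: str, key: bytes) -> str:
    """Return a stable pseudonym for an identity claim.

    The outsourced call center sees only the pseudonym; the key stays with
    the enterprise, which can recompute the mapping.
    """
    return hmac.new(key, identity.encode("utf-8"), hashlib.sha256).hexdigest()


enterprise_key = b"example-key-held-only-by-the-enterprise"  # hypothetical
token = pseudonymize("account-1234", enterprise_key)
```

The same identity always maps to the same token, so the outsourcer can still
select the right voice model by token without learning who the caller is.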

Dan: several different pieces of information that flow across the line
  Audio information
  Identity claim
  Control information

Don't think there is too much concern about control except that MRCP
allows you to do a delete.

Haven't talked about identity claim and result flowing over horizontal

Ross: we encrypt the data for result. We would not allow VoiceXML
interpreter see the actual result. But, our stuff goes thru the IVR
services but it is in an encrypted packet. The engine does the

Dan: if do that then application will never see the result. 

JP: could be two separate binaries. 

Ken: interpreter box and SIV box are going to be close to each
other. Could put that into a secure network

Ross: we have security stuff physically locked. Our engine is
physically locked up. All physically secure and logically but it is
close together.

Ken: this isn’t unusual

Dan: can have SIV engine on left that gives encoded results
information. The safest thing is dynamic generation of new pages so
that it sees only current page. Even in application model the
interpreter only gets pages where the decision is made or not
made. Interpreter is a slave to the application. Varies with how build
application and security requirements

There are implications for syntax of pages. Might have syntax to run
SIV process. When a result comes back that is sent to the application
- it is encrypted.

This gets to the heart of the issue of having SIV syntax in
VoiceXML. Uncomfortable having interpreter involved.

Ken: should be able to do it either way.

Dan: what kind of thresholding information might be sent to the SIV engine?

JP: SIV doesn’t know much about thresholds unless it is sending back normalized scores.

Chuck: in Nuance engine can set thresholds of several types. Some are runtime settable.

Dan: should we send an encrypted configuration block to engine?
Someone would ask about that - to make it entirely opaque. You could
run it this way or not.

JP: there is a configuration file which must be hidden. 

Dan: if it is not a runtime settable file it could be verified by the
SIV engine. The problem is the runtime settable stuff that must pass
thru the interpreter. That is worrisome

Chuck: do you see the ability of the interpreter to modify parameters
of the engine is a security issue?

Dan: Cathy suggested that the ability to change such things is a security issue.

Example: If I have different places in application where need to
change thresholds for authentication. I might set some thresholds to
50. Someone could set to 0. The interpreter would be compromised.
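
The threshold example can be sketched in Python as follows. The application
server seals the configuration with a key shared only with the SIV engine, and
the engine rejects any block that was changed on the way through the
interpreter. This sketch protects integrity only (HMAC with SHA-256); it does
not encrypt the block, so it is weaker than the opaque encrypted block
discussed in this session. The function names, the block layout and the key
are assumptions for illustration.

```python
import hashlib
import hmac
import json


def seal_config(config: dict, key: bytes) -> dict:
    """Application server side: attach an integrity tag to the configuration."""
    payload = json.dumps(config, sort_keys=True)
    tag = hmac.new(key, payload.encode("utf-8"), hashlib.sha256).hexdigest()
    return {"payload": payload, "tag": tag}


def open_config(block: dict, key: bytes) -> dict:
    """SIV engine side: accept the configuration only if the tag still matches."""
    expected = hmac.new(key, block["payload"].encode("utf-8"), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, block["tag"]):
        raise ValueError("configuration block was modified in transit")
    return json.loads(block["payload"])


shared_key = b"key-shared-by-app-server-and-engine"  # hypothetical
block = seal_config({"threshold": 50}, shared_key)

# A compromised interpreter rewrites the threshold but cannot recompute the tag.
tampered = {"payload": block["payload"].replace("50", "0"), "tag": block["tag"]}
```

Calling open_config on the untouched block returns the original settings,
while the tampered block raises an error instead of silently lowering the
threshold to 0.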

Valene: this would be true of other things as well. Could have denial of service. 

JP: could substitute my voice model for anyone else’s

JP: why are we doing all of this?

1. simpler credential collection - so it isn’t distributed across a
   lot of different things. If you could do it in one call do all of
   authentication. Ease of use

2. big customer latency problems with existing applications. Call ASR
   engine and get result. App gets results and sends it to SIV
   engine. That has latencies. Result is returned. Get 4-5 seconds. In
   the proposal the interpreter is working on it immediately.

3. potential improved accuracy of ASR/SIV - because engines share

4. Ken - benefit to application developers so don’t have to buy,
   maintain, etc. SIV engine.

5. to platform vendors because use same engine across apps. Service in
   a network you can access. SIV is just another resource.

Question: if automobile must do SIV but has no access to
network. Offline processing. Running over a cellular network.

Valene - from a security view could offer security to ASR
side. Potential to make security overflow.

JP: would that be vendors?

Dan: vendors - it is end-to-end encryption.

Valene: however you do it there is a benefit. It would be consistent
and include ASR. More streamlined as well.

JP: disadvantages. 

big question - is market demand such that platform and engine
providers would be onboard?

Jim: could do 

Dan: one of things taking away from this - what needs to be enabled as
possible in VoiceXML in order to provide security. May be simple -
encrypted results and encrypted configuration block sent to engine. It
may be that simple. Allowing those two to be possible but not
requiring them to do it. That enables as tight security as you
want. Make sure you can send and accept an encoded block. Not know
what is in it?

I think it affects architecture

One issue in EMMA is how represent results. MRCP V2 does not talk
about sending back an encrypted block rather than filling in
fields. That is a modification may need to be done. Encrypted
configuration block as well. Same as for voicexml. Implication is mild
- allows different kind of data be sent.

Dan: effort for browsers would be just change kind of data. Very
little work. Work is on vendor side. Vendor needs to support this only
if customers are asking for it. Vendor can say if you want an encoded
encryption block could add it if want to do it.

JP: vendors would like this because gives vendors more of a
role. Before concerned about our role in all of this.

Dan: I can now go to my CEO and say here is how we can add SIV
securely in our platform

Ken: the ability to support multiple engines is helped. It makes SIV
look more like ASR. This is from browser implementer standpoint. If
have SIV, ASR, and other engines have not yet made those
decisions. Assume wait until all results in but not necessary to do it
that way.

Valene: if I'm prompting for something I might use TTS and someone
hacks and system asks person to call a fraudulent phone number or
changes something that is requested.

12:45-1:00 SHORT BREAK

1:00-2:00  Session13: SUMMARY: WHAT HAVE WE DECIDED?

Jim Larson
Judith Markowitz
Summarization — all participants
Topics to discuss:
The following were discussed and agreed to at the Menlo Park meeting.
These comments are not endorsed by the W3C or the Voice Browser
Working Group and do not appear in order of importance.

Each person is asked to provide one idea or topic or insight from the meeting

1. Ability to delay tagging of audio and other structures based on
   downstream information.  Example: machine that asks caller to
   answer a challenge question. I whisper an answer to agent who then
   can pass the answer to the system. Associates meta-data with
   recording in non real time.

2. Obtaining CBEFF format ID for EMMA data. Figure out how to
   integrate EMMA with CBEFF. Cathy will offer to take the action to
   get a format ID in the M1 meeting in April.

3. There are multiple ways of integrating SIV with VoiceXML, including
   adding SIV functions as native features to VoiceXML 3 (e.g.,
   enroll, identify, verify) or as Web services.

4. Support the transfer of secure encrypted control block from the
   application server to the SIV engine through the browser and the
   return of a secure, encrypted result from the SIV engine to the
   application server by the browser. This reduces the risk of
   tampering by the browser or the VoiceXML code.  If you are in a
   hosted environment this can increase the trust of the hosted

5. Recommend that the language specifiers, platform developers, and
   the application developers use and promote a secure lifecycle,
   leveraging existing standards and methods, where applicable.

6. We did not explicitly discuss speaker identification. There needs
   to be more discussion of speaker identification.

7. Standards will be built bearing in mind good security practices.

8. The information-fusion algorithm is not a solved issue in
   scientific community. Standards should not prescribe the
   decision-making algorithm.

9. The browser does the collection of the biometric sample. The
   process that does the comparison with the reference
   (voiceprint/voicemodel) is separate from the browser. The decision
   can incorporate the result of one or more comparisons. There are
   many places where this decision can occur.

10. In addition to enroll, identify, and verify we agreed that some
    administrative functions (e.g., copy) are necessary. We did not
    know whether they belong in VoiceXML 3 or not.

11. We should co-ordinate with other organizations, notably
    ANSI/INCITS M1, OASIS/BIAS and Speaker Biometrics Committee of
    VoiceXML Forum.  This includes work on developing a standards
    framework. We should reference existing work rather than
    re-creating it. This includes security and policies/best

12. We agreed to collaborate on security activities with other W3C
    working groups, notably the security working group.

13. Public relations is an important issue to be considered during the
    development and use of the standard.  There is an ISO publication
    related to this (ISO 24714-1).

14. There was a specification that includes a minimal set of audio
    formats (ANSI/INCITS 456). The VBWG should consider using or
    including this specification.

Discussion: is government the only real market for SIV?

- Canadian phone company

- IDZ bank in Australia deploying

- Government and banks will need to know that they are dealing w the

GM – what am I doing to make things easier. This is important for
making life easier

Parameters are different 

Jim: wants SIV to replace passwords. Ken agreed. When someone calls I
want to know who is calling

Jordan: put things on the device. On calls want to know who is talking.

Agree that SIV has markets beyond government. 
A market is opportunity + money. I haven’t seen the money yet. 
JP – anytime you have an enterprise where someone must authenticate you need SIV. 
Frankie: anytime have an enterprise where people call about once a year you need SIV
JP: anyone who needs to know who you are could be interested in SIV


Why SIV functionality should be added to VoiceXML:

1. Performance benefits / Customer-perceived latency. The system is
   more responsive.

2. Easier for developers. Programming interface is consistent with the
   way they use other VoiceXML resources and shields application
   developers from low-level operations.

3. Efficiencies of scale in hosted environment. Can share services.

4. From a business side. Makes it portable. Standardizations of easy
   to use API will grow the market. If you can show a standards-based
   implementation you can grow the market.

5. Facilitates integration with web model integration. 

6. Makes SIV applications consistent with the Web model. 

7. Drive towards service portability across implementation platforms

8. Minimizes the level of vendor lock-in.

9. Support in VoiceXML enables SIV use (without the application
   server) with intermittent/offline connectivity.

10. Standards are a sign of technology maturity. 

Ken: We need to do a better job of marketing what we are doing with
VoiceXML 3. Some people feel VoiceXML is dead. In VBWG need to work on

Cathy: Biometric Consortium conference – can try to get speaking spot
in Biometric Consortium.

Ken: Also do in contact center

Big take aways

1. We have actually identified the main hidden security issues that
   people have concern about and discussed ways in which they can be
   realistically addressed. Those issues don’t disappear but we can
   address them

2. look at more detail about configuration possibilities but need to
   examine more of security threats

3. VoiceXML exists

4. SIV fits into the VoiceXML space. 

5. There is an avenue for multimodal applications

6. mechanisms for configurations of the engine and can standardize
   some engine parameters

7. How VoiceXML fits into the larger standards picture

8. A lot of things were discussed but some of them are outside of the
   scope of an engine vendor. A good standard that would create a nice
   wrapper around engines would be needed. It does not exist.

9. VoiceXML V3 is one of the interaction managers of
   MMI. Collaboration with other working groups.

10. Higher level speaker-recognition functions (e.g., classification,
    segmentation) were not discussed. The VBWG needs to determine
    whether to include them in VoiceXML V3.

11. The VBWG should address the synchronization and markup integration
    of multiple modalities.

12. There are likely to be multiple modalities/factors involved in an
    interaction using VoiceXML. Consequently, developers need a way to
    not completely separate those modalities. Integrated use of
    multiple modalities/factors should be made easy.
Goodbye — Kazuyuki Ashimura, W3C

2:00 Workshop ends

The Call for Participation, the Logistics, the Presentation Guideline and the Agenda are also available on the W3C Web server.

Judith Markowitz, Ken Rehor and Kazuyuki Ashimura, Workshop co-Chairs

$Id: minutes.html,v 1.13 2009/06/16 14:36:54 ashimura Exp $