This project involves the design, implementation, and testing of a unique voiceprint biometric system.
Key Idea: Common-Phrase Speech Utterance
Voiceprint systems can base user authentication on four possible types of passphrases:
- a user-specified phrase, such as the user's name
- a specified phrase common to all users
- a random phrase that the computer displays on the screen
- a random phrase chosen at the user's discretion
This study will focus primarily on a specified, specially-designed passphrase, "My name is", which is common to all users.
If time permits, as an alternative and for contrast, the study will also briefly explore using the user's spoken name as the passphrase.
For this reason the recorded utterances will include the spoken user names -- for example, "My name is John Smith" --
and the user names could be used in an additional authentication stage.
Thus, for increased performance, a dual authentication system could be investigated.
The second portion of the system, operating on the user names, would likely not attempt to segment the names into phonetic units.
Also, to obtain imposter utterances, some participants would need to practice and then record utterances of the authentic users' names.
The main reasons for using a common authentication phrase are as follows:
- It simplifies the segmentation problem.
Earlier work indicated that features extracted from the individual phonetic units of an utterance increased authentication performance (Trilok, 2004),
and this requires segmenting the utterance into phonetic units. Segmenting one known utterance into its phonetic units is much easier than segmenting many unknown utterances into their phonetic units.
- It allows for the careful selection of the common phrase to optimize the variety of phonetic units for their authentication value. The "My name is" phrase contains seven phonetic units: three nasal sounds, the two [m]'s and one [n]; three vowel sounds, [aI], [eI], and [I]; and one fricative [z]. The nasal and vowel sounds characterize the user's nasal and vocal tracts, respectively, and the fricative characterizes the user's teeth and front portion of the mouth.
- It facilitates testing for imposters since the common utterance spoken by non-authentic users can be employed as imposter utterances.
- It permits the measurement of true voice authentication biometric performance. For example, in the area of keystroke biometrics for password authentication, the manufacturers of the password "hardening" systems typically claim performance levels of 99% or higher. However, Killourhy & Maxion (2009) used a common password to show that the best actual keystroke biometric performance for password authentication is only about 90%.
- It avoids potential experimental flaws. The combination of using the same authentication utterance for all users, and one that consists of frequently used words that are easy to pronounce, avoids many of the experimental flaws that Maxion (2011) describes in measuring the performance of biometric systems.
Speech Utterance Data Samples
We will obtain 20 sample utterances from each of about 100 experimental participants.
The utterance samples (wav files) should preferably be collected in groups of five samples over a period of a week or more --
for example, collect five samples per day on each of four different days.
In the worst case, record ten samples from a participant per day, requiring data collection on two different days.
For recording the utterances, the built-in microphone can be used with a laptop or an inexpensive microphone with a desktop computer.
The instructions to the participants should be: "please speak naturally but clearly in producing the utterance samples."
Each participant should practice recording the utterance about ten times,
and the participant and the experimenter should review the practice utterances for clarity and naturalness.
The experimenter will keep a record of the date the recordings were made, the microphone used,
and participant demographic information (gender, age, nationality, first language).
Some utterances for this work have already been recorded (Morris, 2012); see SpeechSamples.
The following would be nice but is not required.
A database of the speech recordings (.wav files) could be made accessible through
a Web interface so users can input new recordings or listen to selected existing recordings.
An example of a similar database for speech samples can be found at
George Mason University's Speech Database.
Because the biometric system will operate on the initial "My name is" portion of the utterance,
that portion must be separated (isolated) from the background noise at the beginning of the recording and the person's name at the end.
The speech portion of the signal will first be extracted from the background noise by thresholding the signal's energy function.
Then, the [z] sound in the word "is" will be located to mark the end of the common portion of the utterance,
and the signal's energy function will be used to detect the end of the person's name.
Each utterance will therefore be segmented into the common portion, "My name is", and the person's name portion of the utterance.
Two processing tools are therefore required:
- A tool to segment the speech portion of the signal from the background noise
by thresholding the signal's energy function
- A tool or function to locate the [z] sound in the word "is", marking the end of the fixed portion of the utterance
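The first of these tools, energy-based endpoint detection, can be sketched as follows. This is a minimal illustration: the 16 kHz sampling rate, the 30 ms frame / 10 ms hop sizes, and the threshold set relative to the peak frame energy are all assumptions for the sketch, not values fixed by this proposal.

```python
import numpy as np

def short_time_energy(signal, frame_len=480, hop=160):
    """Short-time energy per frame (30 ms frames, 10 ms hop at 16 kHz)."""
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.array([
        np.sum(signal[i * hop : i * hop + frame_len] ** 2)
        for i in range(n_frames)
    ])

def speech_endpoints(signal, threshold_ratio=0.05, frame_len=480, hop=160):
    """Return (start, end) sample indices of the speech portion, keeping
    frames whose energy exceeds a fraction of the peak frame energy."""
    energy = short_time_energy(signal, frame_len, hop)
    threshold = threshold_ratio * energy.max()
    voiced = np.where(energy > threshold)[0]
    if voiced.size == 0:
        return 0, len(signal)
    return voiced[0] * hop, min(len(signal), voiced[-1] * hop + frame_len)
```

In practice the threshold ratio would be tuned on the practice recordings, since microphone and room noise levels will differ across participants.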
The initial speech processing of the utterance samples will consist of a standard spectral analysis.
The spectrogram is a standard speech visualization tool
that typically gives a grey-scale plot of frequency bands as a function of time.
The spectrographic tool must provide access to the actual numerical data,
which is usually represented as a matrix of frequency bands versus time intervals.
These data will be used by both the elastic matching and feature extraction components of the system.
One possibility is as follows (Trilok, 2004).
Compute the 13 lowest Mel-frequency Cepstral coefficients (MFCC) from 40 Mel-spaced filters:
13 spaced linearly with 133.33 Hz between center frequencies, and 27 spaced logarithmically by a frequency factor of 1.07 between adjacent filters.
The spectral analysis time frame could be a 30 msec Hamming window with 10 msec overlap between adjacent windows.
The number of time windows per utterance will vary because the windows are of fixed size and the lengths of the voice samples vary.
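The filter spacing specified above can be computed directly. A minimal sketch follows; it assumes the first center frequency sits at 133.33 Hz and that the logarithmic centers continue from the last linear center, neither of which is stated explicitly above.

```python
import numpy as np

def mel_center_frequencies(start_hz=133.33, n_linear=13, lin_spacing=133.33,
                           n_log=27, log_factor=1.07):
    """Center frequencies of the 40 Mel-spaced filters: 13 linearly spaced
    (133.33 Hz apart), then 27 spaced by a constant 1.07 frequency ratio."""
    linear = start_hz + lin_spacing * np.arange(n_linear)
    # Logarithmic centers continue upward from the last linear center.
    log = linear[-1] * log_factor ** np.arange(1, n_log + 1)
    return np.concatenate([linear, log])

centers = mel_center_frequencies()
```

With these assumptions the top filter lands near 10.8 kHz, which suggests the recordings would be sampled at 22.05 kHz or higher.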
Dynamic Time Warping (DTW)
The standard "elastic matching" DTW algorithm (Deller, Hansen and Proakis, 2000)
will be used to locate the seven phonetic sounds ([m], [aI], [n], [eI], [m], [I], [z]) in the utterance.
This will be performed by aligning each sample speech signal of the utterance with one that has been pre-segmented into the seven sounds.
Note that this step can be omitted if features are not obtained from the individual sounds.
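The alignment step can be sketched as follows. This is a simplified illustration of the standard algorithm, not the book's exact formulation: the symmetric step pattern (diagonal, vertical, horizontal moves) and the Euclidean frame distance are illustrative choices. Once the warping path is found, the pre-segmented boundaries in the reference are projected through it onto the sample.

```python
import numpy as np

def dtw_align(ref, sample):
    """Dynamic time warping between two feature sequences (frames x dims).
    Returns the total alignment cost and the warping path as
    (ref_frame, sample_frame) index pairs."""
    n, m = len(ref), len(sample)
    # Local cost: Euclidean distance between every pair of frames.
    cost = np.linalg.norm(ref[:, None, :] - sample[None, :, :], axis=2)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    # Backtrack the minimum-cost path from the end to the start.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return acc[n, m], path[::-1]
```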
Features (data measurements) will be designed and code written to extract them from each sample utterance.
The output of the feature extractor will be a fixed-length vector of measurements appropriate for input to the Pace University biometric authentication system.
Several feature sets will be explored, and all features will be normalized over the varying lengths of the speech utterances.
An understanding of phonetics is important to select information that is characteristic to each user as input to the biometric system.
Features will then be extracted from the numeric values of the spectral analysis.
For example, one feature set could consist of the means and variances of each of the 13 frequency bands over the entire utterance, for a total of 26 features per utterance.
Additional features will be extracted from each of the seven sound regions of the utterance.
For example, each utterance could be divided into its 7 speech sounds,
and the energy in the 13 frequency bands averaged within each sound, for a total of 91 (13 x 7) features.
The first Cepstral component might be omitted because it represents the energy of the signal and is probably not speaker specific.
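Assembling the two example feature sets above (26 global plus 91 per-segment features, 117 in all) could look like the sketch below. The function name is hypothetical, and `boundaries` is assumed to be the 8 frame indices delimiting the 7 sounds obtained from the DTW alignment.

```python
import numpy as np

def extract_features(mfcc, boundaries):
    """Build a fixed-length feature vector from an utterance's MFCC matrix
    (frames x 13) and the 7 phonetic-segment boundaries.
    Returns 26 global + 91 per-segment features = 117 values."""
    # Global set: mean and variance of each band over the whole utterance.
    global_feats = np.concatenate([mfcc.mean(axis=0), mfcc.var(axis=0)])
    # Per-segment set: mean of each band within each of the 7 sounds.
    segment_feats = np.concatenate([
        mfcc[start:end].mean(axis=0)
        for start, end in zip(boundaries[:-1], boundaries[1:])
    ])
    return np.concatenate([global_feats, segment_feats])
```

Because the statistics are computed per band and per segment, the vector length is fixed regardless of how many time windows the utterance spans, which satisfies the normalization requirement above.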
Authentication Classification System
The Pace University Biometric Authentication system (Tappert et al., 2010; Zack, Tappert & Cha, 2010; Stewart et al., 2011) will be used in this study.
This vector-difference authentication model transforms a multi-class problem into a two-class problem (Figure 1).
The resulting two classes are within-class ("you are authenticated") and between-class ("you are not authenticated").
This is a strong inferential statistics method found to be particularly effective in large open biometric systems
and in multidimensional feature-space problems (Cha & Srihari, 2000; Yoon et al., 2005).
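The core of the vector-difference transformation can be sketched as follows. This is a simplified illustration of the dichotomy idea, not the actual Pace University implementation: every pair of feature vectors becomes one absolute-difference vector, labeled within-class when both samples came from the same user and between-class otherwise, so any two-class classifier can then be trained on the differences.

```python
import numpy as np
from itertools import combinations

def dichotomy_transform(samples, labels):
    """Transform a multi-class biometric problem into a two-class one:
    each pair (x_i, x_j) yields |x_i - x_j|, labeled 'within' if both
    samples came from the same user, else 'between'."""
    diffs, pair_labels = [], []
    for i, j in combinations(range(len(samples)), 2):
        diffs.append(np.abs(samples[i] - samples[j]))
        pair_labels.append('within' if labels[i] == labels[j] else 'between')
    return np.array(diffs), pair_labels
```

A useful property of this formulation is that users unseen at training time can still be authenticated, since the classifier models difference vectors rather than individual users.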
Key references to previous work/ideas on this problem
Rationale for choosing the same phrase for all users
- Maxion, Roy A. (2011).
Making Experiments Dependable.
The Next Wave / NSA Magazine, Vol. 19, No. 1, pp. 13-22, March 2012, National Security Agency, Ft. Meade, Maryland.
Reprinted from Dependable and Historic Computing, LNCS 6875, pp. 344-357, Springer, Berlin, 2011.
- Maxion, Roy A. and Killourhy, Kevin S. (2010).
Keystroke Biometrics with Number-Pad Input.
In IEEE/IFIP International Conference on Dependable Systems & Networks (DSN-10), pp. 201-210, Chicago, Illinois, 28 June to 01 July 2010.
IEEE Computer Society Press, Los Alamitos, California, 2010.
- Killourhy, Kevin S. and Maxion, Roy A. (2009).
Comparing Anomaly-Detection Algorithms for Keystroke Dynamics.
In International Conference on Dependable Systems & Networks (DSN-09), pp. 125-134,
Estoril, Lisbon, Portugal, 29 June to 02 July 2009. IEEE Computer Society Press, Los Alamitos, California, 2009.
Pace University biometric authentication system
- Cha, S. & Srihari, S.N. (2000). Writer Identification: Statistical Analysis and Dichotomizer. Proc. SPR and SSPR 2000, LNCS - Advances in Pattern Recognition, v. 1876, 123-132.
- Yoon, S., Choi, S-S., Cha, S-H., Lee, Y., & Tappert, C.C. (2005).
On the individuality of the iris biometric.
Proc. Int. J. Graphics, Vision & Image Processing, 5(5), 63-70.
- Zack, R.S., Tappert, C.C., & Cha, S.-H. (2010).
Performance of a Long-Text-Input Keystroke Biometric Authentication System Using an Improved k-Nearest-Neighbor Classification Method.
Proc. IEEE 4th Int Conf Biometrics: Theory, Apps, and Systems (BTAS 2010), Wash. D.C.
- Tappert, C.C., Cha, S.-H., Villani, M., & Zack, R.S. (2010).
A Keystroke Biometric System for Long-Text Input,
invited paper, Int. J. Info. Security and Privacy (IJISP), Vol 4, No 1, 2010, pp 32-60.
- Stewart, J.C., Monaco, J.V., Cha, S., & Tappert, C.C. (2011).
An Investigation of Keystroke and Stylometry Traits.
Proc. Int. Joint Conf. Biometrics (IJCB 2011), Wash. D.C., October 2011.
Links to some speech processing tools (these are very old)