Thursday, May 13, 2010

Fricative phonemes and interdental fricatives.

As a fairly typical member of our species, I occasionally wonder how stuff works. I was one of the people who gave up their BlackBerrys last year and bought an Android phone, in my case the Motorola Droid.

I have been mostly satisfied with my decision, although the two-year contract term might be problematic, considering how fast technology is changing.

One area where the Droid is a clear winner is in the voice search capability. The technology has improved considerably, at least for me. Voice Search allows you to unlock the many caverns of the web and Google Maps, hands-free. It works remarkably well.

I am not the sharpest tool in the shed, so I am trying to figure out what makes the technology go. Perhaps there is somebody out there who can help shed some light for me.

My friend BigDave came down last week with his new HTC Incredible, an Android cousin of my phone without a physical keyboard but with a superior camera and some cooler pinch and grab functionality. His voice search didn't seem to be quite as accurate as my own. He said that he had only had the phone for two days. I tried to press him on this and he said that he believed that the VRS might be a self-adaptive technology. He may have been in the initial dating, voice library building portion of his relationship with his phone. Holding hands but no necking.

I thought about this for a few days and got to wondering: how do you create an adaptive technology if the user is not able to give back positive or negative cues? The phone has no way of knowing if it has been successful or not. Does a VRS tailor itself to every individual customer? Perhaps my friend is not absolutely correct in his understanding.

I started researching Voice Recognition today and here is what I have found.

Automatic Speech Recognition (ASR) was first successfully achieved in 1952 and then reappeared with IBM's Shoebox at the 1962 World's Fair in Seattle. Accuracy is generally rated by Word Error Rate (WER), with Single Word Error Rate (SWER) and Command Success Rate (CSR) as additional measures; speed is rated separately by the Real Time Factor. Voice recognition accuracy has reportedly improved past 80% today from a paltry 10% in 1993. Google reportedly now has a trillion-word corpus.
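For anyone who wants to see what a Word Error Rate actually measures, here is a rough sketch. It assumes the usual definition: the number of word substitutions, deletions and insertions needed to turn what the phone heard into what you actually said, divided by the number of words you actually said. The function name and the sample phrases are my own inventions for illustration, not anything taken from Google's system.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / words in reference."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word was dropped (deletion)
                curr[j - 1] + 1,     # extra word was heard (insertion)
                prev[j - 1] + cost,  # match, or one word swapped for another
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One wrong word out of four spoken.
    print(word_error_rate("find pizza near me", "find pizza hear me"))  # 0.25
```

Note that a WER can exceed 100% if the recognizer hears many more words than were spoken, which is one reason it is not simply "100% minus accuracy".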

Language modeling and acoustic modeling are the two statistical pieces that scientists combine to make voice recognition work. The primary approach uses Hidden Markov Models, which treat speech as a sequence of short, roughly stationary signals (e.g. 10 millisecond frames). There is another method called Dynamic Time Warping, which is less widely applied and measures the similarity between two sequences that may differ in timing or speed.
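Dynamic Time Warping is the easier of the two to show in a few lines. The sketch below is the textbook version, under the assumption that each utterance has already been reduced to a plain list of numbers (real systems compare vectors of features, not single values). The function name and the toy sequences are hypothetical.

```python
def dtw_distance(a, b):
    """Textbook dynamic time warping distance between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)

    # cost[i][j] = cheapest way to align the first i items of a
    # with the first j items of b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances while b holds (b is slower here)
                cost[i][j - 1],      # b advances while a holds (a is slower here)
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m]


if __name__ == "__main__":
    fast = [1, 2, 3]
    slow = [1, 1, 2, 2, 3, 3]  # same shape, spoken at half the speed
    print(dtw_distance(fast, slow))  # 0.0
```

The point of the example is that the same "word" spoken twice as slowly still comes out as a perfect match, which is exactly the timing problem the method was designed to absorb.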

I reprint a passage on Hidden Markov Modeling from Wikipedia:


In speech recognition, the Hidden Markov model would output a sequence of n-dimensional real-valued vectors (with n being a small integer, such as 10), outputting one of these every 10 milliseconds. The vectors would consist of cepstral coefficients, which are obtained by taking a Fourier transform of a short time window of speech and decorrelating the spectrum using a cosine transform, then taking the first (most significant) coefficients. The hidden Markov model will tend to have in each state a statistical distribution that is a mixture of diagonal covariance Gaussians which will give a likelihood for each observed vector. Each word, or (for more general speech recognition systems), each phoneme, will have a different output distribution; a hidden Markov model for a sequence of words or phonemes is made by concatenating the individual trained hidden Markov models for the separate words and phonemes.
Described above are the core elements of the most common, HMM-based approach to speech recognition. Modern speech recognition systems use various combinations of a number of standard techniques in order to improve results over the basic approach described above. A typical large-vocabulary system would need context dependency for the phonemes (so phonemes with different left and right context have different realizations as HMM states); it would use cepstral normalization to normalize for different speaker and recording conditions; for further speaker normalization it might use vocal tract length normalization (VTLN) for male-female normalization and maximum likelihood linear regression (MLLR) for more general speaker adaptation. The features would have so-called delta and delta-delta coefficients to capture speech dynamics and in addition might use heteroscedastic linear discriminant analysis (HLDA); or might skip the delta and delta-delta coefficients and use splicing and an LDA-based projection followed perhaps by heteroscedastic linear discriminant analysis or a global semitied covariance transform (also known as maximum likelihood linear transform, or MLLT). Many systems use so-called discriminative training techniques which dispense with a purely statistical approach to HMM parameter estimation and instead optimize some classification-related measure of the training data. Examples are maximum mutual information (MMI), minimum classification error (MCE) and minimum phone error (MPE).
Decoding of the speech (the term for what happens when the system is presented with a new utterance and must compute the most likely source sentence) would probably use the Viterbi algorithm to find the best path, and here there is a choice between dynamically creating a combination hidden Markov model which includes both the acoustic and language model information, or combining it statically beforehand (the finite state transducer, or FST, approach).
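To make the decoding step a little less abstract, here is a toy version of the Viterbi algorithm mentioned at the end of that passage. It is a heavily simplified sketch: instead of Gaussian mixtures over cepstral vectors, it assumes each 10 millisecond frame has already been labeled as either "tone" or "hiss", and it uses just two hidden states, "vowel" and "fricative". All the names and probabilities are made up for illustration.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden state sequence for the observations.

    Works in log space to avoid underflow; assumes every probability
    used here is strictly greater than zero.
    """
    # best[t][s] = log-probability of the best path that ends in state s at time t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
             for s in states}]
    back = [{}]  # back[t][s] = previous state on that best path

    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev_state, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + math.log(emit_p[s][observations[t]])
            back[t][s] = prev_state

    # Start from the best final state and follow the back-pointers.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path


# Hypothetical two-state model, numbers invented for the example.
STATES = ("vowel", "fricative")
START_P = {"vowel": 0.6, "fricative": 0.4}
TRANS_P = {
    "vowel": {"vowel": 0.7, "fricative": 0.3},
    "fricative": {"vowel": 0.4, "fricative": 0.6},
}
EMIT_P = {
    "vowel": {"tone": 0.9, "hiss": 0.1},
    "fricative": {"tone": 0.2, "hiss": 0.8},
}

if __name__ == "__main__":
    frames = ["hiss", "hiss", "tone", "tone"]
    print(viterbi(frames, STATES, START_P, TRANS_P, EMIT_P))
    # ['fricative', 'fricative', 'vowel', 'vowel']
```

A real recognizer does the same thing on a vastly larger scale, with thousands of states and a language model deciding which word sequences are plausible.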


It's a mouthful to be sure. I started reading about fricative phonemes, that elusive bunch. English has somewhere around 42 phonemes, depending on who is counting, and roughly nine of them are fricatives - the f in fish, the v in very, the th in this. These sounds are what tend to trip voice recognition up. Fricatives are consonants produced by forcing air through a narrow channel made by two articulators. Interdental fricatives are produced with the tongue between the upper and lower teeth rather than against the backs of the teeth, as with dental consonants. Sibilants are the loud members of the fricative family. Obstruent, sonorant or palato-alveolar, this stuff gets confusing fast.

So I will start dialing back. Can someone please tell me if the system is indeed self-adaptive? Or truly break this down for me? Is there a break-in period during which it conforms to your individual voice, and if so, how does it self-correct without signals? Here is a link to a PDF report by some of the sages of the subject. And if you come from a family where it is not considered polite to discuss linear predictive coding and Benford's Law of Scale Invariance, please accept my most sincere apology.