See also: ASR
Speech recognition, or automatic speech recognition (ASR), is a technology by which a machine transcribes human speech into a usable form. Coupled with speech synthesis, voice command, speaker identification, and speech understanding, it forms one element of an ideal human-machine interface (about 10 times the quantity of information of keyboard entry, comfort…). Speech recognition belongs to the field of speech processing.
Field of research
Speech recognition draws on many branches of science:
- natural language processing, linguistics, formal language theory,
- information theory,
- signal processing, neural networks,
- artificial intelligence,…
History
Work on speech recognition dates from the beginning of the 20th century. The first system that can be regarded as performing speech recognition dates back to 1952. This electronic system, developed by Davis, Biddulph, and Balashek at Bell Labs, was composed primarily of relays, and its performance was limited to recognizing isolated digits (see reference). Research then increased considerably during the 1970s with the work of Jelinek at IBM (1972-1993). Today, speech recognition is a fast-growing field thanks to the surge in embedded systems.
Basic principle
A recorded and digitized sentence is given to the speech recognition program. In the ASR formalism, the functional breakdown is as follows:
- acoustic processing (front-end) digitizes the speech signal into acoustic vectors, which constitute the observation data for the recognition system and give the most significant acoustic image possible of the signal. It relies on signal processing techniques: the signal is cut into frames of about 30 ms, each shifted by 10 ms from the previous one (Hamming windowing), so that each vector contributes 10 ms of new data. Each frame is then parameterized by a frequency analysis based on the Fourier transform (for example MFCC, Mel-Frequency Cepstral Coefficients). The result is a feature vector (feature extraction) of 10 to 15 principal components, to which the first- and second-order differences are added, for a final size of 30 to 45.
- machine learning, which associates the elementary speech segments with the lexical elements. This association relies, among other things, on statistical modeling by hidden Markov models (HMM) and/or artificial neural networks (ANN).
- recognition (back-end), which reconstructs the most probable utterance by concatenating the previously learned elementary speech segments. This is a temporal pattern matching problem, often solved with the dynamic time warping (DTW) algorithm.
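As an illustration of the front-end and back-end steps listed above, here is a minimal sketch in plain Python. It assumes a signal given as a list of samples, and it uses a toy one-dimensional feature (log energy per frame) in place of MFCC; the function names `hamming`, `frames`, `log_energy`, and `dtw` are hypothetical and chosen only for this example.

```python
import math

def hamming(n):
    """Hamming window of length n."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def frames(signal, sample_rate, frame_ms=30, shift_ms=10):
    """Cut the signal into overlapping frames and weight each by a Hamming window."""
    size = int(sample_rate * frame_ms / 1000)   # samples per 30 ms frame
    step = int(sample_rate * shift_ms / 1000)   # samples per 10 ms shift
    window = hamming(size)
    out = []
    for start in range(0, len(signal) - size + 1, step):
        chunk = signal[start:start + size]
        out.append([s * w for s, w in zip(chunk, window)])
    return out

def log_energy(frame):
    """Toy feature standing in for MFCC (an assumption of this sketch)."""
    return math.log(sum(x * x for x in frame) + 1e-10)

def dtw(a, b):
    """Dynamic time warping distance between two sequences of scalar features."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

A word template and an incoming utterance could each be turned into a feature sequence with `[log_energy(f) for f in frames(samples, 8000)]` and compared with `dtw`; the template with the smallest distance would be the recognized word. A real system would use multi-dimensional vectors and a vector distance instead of `abs`.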
Models
Such a system relies on three main models:
- Acoustic model: starting from the acoustic signal, or more precisely from the output of the acoustic processing, this model gives the probability that the signal corresponds to each possible phoneme of the target language.
- Pronunciation model: for each word of the vocabulary, this model gives the possible pronunciations at the phonetic level, with associated probabilities.
- Language model: for each sequence of words, this model gives its probability in the target language.
The combination of these three models makes it possible to calculate, for any sequence of words, the probability that the acoustic signal corresponds to it. Performing recognition, often called decoding, consists of finding the word sequence with the highest probability.
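The combination described above can be sketched as a sum of log-probabilities followed by an argmax. The example below is only a toy: it assumes a small, fixed list of candidate word sequences, and every score in the three tables is invented for illustration, not taken from any real model.

```python
import math

# Hypothetical scores for two candidate transcriptions of the same audio.
RECOGNIZE = ("recognize", "speech")
WRECK = ("wreck", "a", "nice", "beach")

ACOUSTIC = {RECOGNIZE: math.log(0.20), WRECK: math.log(0.25)}        # signal given phonemes
PRONUNCIATION = {RECOGNIZE: math.log(0.9), WRECK: math.log(0.8)}     # phonemes given words
LANGUAGE = {RECOGNIZE: math.log(0.01), WRECK: math.log(0.0001)}      # word sequence prior

def combined_log_prob(words):
    """Sum the log-probabilities given by the three models."""
    return ACOUSTIC[words] + PRONUNCIATION[words] + LANGUAGE[words]

def decode(candidates):
    """Return the candidate word sequence with the highest combined score."""
    return max(candidates, key=combined_log_prob)
```

Here the second candidate has a slightly better acoustic score, but the language model makes the first one far more probable overall, so `decode([RECOGNIZE, WRECK])` selects it. Working in log-probabilities avoids numerical underflow when many small probabilities are multiplied.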
Classification
A speech recognition system is classified by a small number of parameters, called recognition modes, which correlate with the following difficulties:
- Inter- and intra-speaker variability: speaker-dependent systems are trained in situ on the words. Speaker-independent systems can recognize a fixed corpus (about 50 words) whatever the speaker. Speaker-dependent systems are widespread, and tend to spread further thanks to text-to-speech synthesis, which avoids the training phase.
- Nature of the speech: systems can work on continuous speech, isolated words, or keywords (keyword spotting).
- Size of the vocabulary
- Environment
Performance
The raw performance of a speech recognition engine is often measured by word error rate (WER). Conversely, one can evaluate the success rate. Here are some results in terms of error rate, for French:
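For reference, the word error rate is usually computed as the number of word substitutions, deletions, and insertions needed to turn the reference transcription into the recognized one, divided by the number of words in the reference. A minimal sketch, assuming whitespace-separated words and a hypothetical function name `word_error_rate`:

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (r != h)
            deletion = prev[j] + 1
            insertion = cur[j - 1] + 1
            cur.append(min(substitution, deletion, insertion))
        prev = cur
    return prev[-1] / len(ref)
```

Because insertions are counted, the rate can exceed 1 when the hypothesis is much longer than the reference.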
Existing software
- HTK: software developed at CUED
- Sphinx 4: software developed at CMU
- teliSpeech: professional software from Telisma
- Professional software from G2 Speech
- Dragon NaturallySpeaking: professional software from ScanSoft
- Crescendo speech recognition: professional software dedicated to the medical sector
- MacSpeech: speech recognition for Macintosh
- ALIZE: platform developed at the Laboratoire Informatique d'Avignon (LIA)
- Microsoft Windows Vista: speech recognition built into Microsoft Windows Vista
See also
- VoiceXML: voice interaction standard
- Computational linguistics
- Linguistics, the science of language
- Speech synthesis, the reverse process
- Voice command
- G2 Speech
Further reading
- "Automatic Recognition of Spoken Digits": the historical article on the first speech recognition system
- Slides: FPSL of the GTPB on speech processing
- CSLU Survey of the State of the Art in Human Language Technology
- Jean-Paul Haton, Automatic speech recognition: from the signal to its interpretation, Dunod, Paris, 2006
- "Blog": the blog of voice technologies