Speech Recognition

REFERENCES

Sphinx-4: A Flexible Open Source Framework for Speech Recognition. Willie Walker, Paul Lamere, Philip Kwok, Bhiksha Raj, Rita Singh, Evandro Gouvea, Peter Wolf, Joe Woelfel [2005]

Spoken Language Processing: A Guide to Theory, Algorithm and System Development. Xuedong Huang, Alex Acero, Hsiao-Wuen Hon [2001]

Creating a Mexican Spanish Version of the CMU Sphinx-III Speech Recognition System. Armando Varela, Heriberto Cuayáhuitl, and Juan Arturo Nolazco-Flores [2003]


http://wiki.audacityteam.org/wiki/

http://cmusphinx.sourceforge.net/wiki

THE PROBLEM

The task of spoken language communication requires a system to recognize, interpret, execute and respond to a spoken query.

Difficulties arise when the speech signal is corrupted by noise from many sources.

In addition, the system has to cope with the non-grammaticality of spoken communication and the ambiguity of natural language.


Spoken language processing is a diverse subject that relies on knowledge of many levels, including acoustics, phonology, phonetics, linguistics, semantics, pragmatics, and discourse. The diverse nature of spoken language processing requires knowledge in computer science, electrical engineering, mathematics, syntax, and psychology.



History note

Since the early 1970s, researchers at AT&T, BBN, CMU, IBM, Lincoln Labs, MIT, and SRI have made major contributions to spoken language understanding research. In 1971, the Defense Advanced Research Projects Agency (DARPA) initiated an ambitious five-year, $15 million, multisite effort to develop speech-understanding systems.








A researcher or a professional in the field

MUST ACHIEVE a thorough grounding in:

probability, statistics and information theory

pattern recognition, Bayesian theory and machine learning

digital signal processing and stochastic processes

speech signal representation and hidden Markov models (a minimal sketch of the latter follows this list)

acoustic and language modeling

spoken language system architecture
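
Since hidden Markov models sit at the core of the list above, here is a minimal sketch of the HMM forward algorithm, the recurrence used to score an observation sequence against a model. Every number below is a toy value, not taken from any Sphinx model.

import numpy as np

# Toy HMM with 2 hidden states and 3 discrete observation symbols.
# All numbers are illustrative, not from any real acoustic model.
pi = np.array([0.6, 0.4])        # initial state distribution
A = np.array([[0.7, 0.3],        # A[i, j] = P(next state j | state i)
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],   # B[i, k] = P(observe symbol k | state i)
              [0.1, 0.3, 0.6]])

def forward(obs):
    """Return P(obs | model) using the forward recurrence."""
    alpha = pi * B[:, obs[0]]            # initialization
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]    # induction: sum over previous states
    return alpha.sum()                   # termination

print(forward([0, 1, 2]))   # likelihood of the toy observation sequence 0, 1, 2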

Spoken language processing refers to technologies related to speech recognition, text-to-speech, and spoken language understanding.


Spoken Language Structure

Speech signals are composed of analog sound patterns that serve as the basis for a discrete, symbolic representation of the spoken language – phonemes, syllables, and words.


Building blocks: phonemes

To convey meaning, phonemes group into cohesive spans with characteristic patterns: syllables and words (for example, the word "speech" maps to the phoneme sequence S P IY CH).

Larger units: phrases and sentences


Semantics, Lexicology, Syntax, Pragmatics, Etymology


Sphinx-4

It has been jointly designed by Carnegie Mellon University, Sun Microsystems Laboratories, Mitsubishi Electric Research Labs, and Hewlett-Packard's Cambridge Research Lab, with contributions from UC Santa Cruz and the Massachusetts Institute of Technology.


Mobility and scalability: pocketsphinx & Hadoop



Perl, C, Java, Shell, XML, Python

CONFIGURATION



Recognizer

 Decoder

Acoustic model

 Dictionary 

Grammar/Language model
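
Sphinx-4 wires these components together through an XML configuration file. As a compact sketch of the same pieces, the snippet below uses the pocketsphinx Python bindings instead; the model paths are placeholders, and the exact configuration API varies between pocketsphinx releases.

from pocketsphinx import Decoder

# Wire up the components listed above.
# Paths are placeholders; point them at a real model directory.
config = Decoder.default_config()
config.set_string('-hmm', 'model/en-us')                # acoustic model
config.set_string('-dict', 'model/cmudict-en-us.dict')  # pronunciation dictionary
config.set_string('-lm', 'model/en-us.lm.bin')          # language model

decoder = Decoder(config)

# Decode a raw 16 kHz, 16-bit mono audio file.
decoder.start_utt()
with open('utterance.raw', 'rb') as f:
    while True:
        buf = f.read(4096)
        if not buf:
            break
        decoder.process_raw(buf, False, False)
decoder.end_utt()

if decoder.hyp() is not None:
    print(decoder.hyp().hypstr)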






Performance


Dual-CPU machine running at 1015 MHz with 2 GB of memory


Test    WER % (Sphinx 3.3)  WER % (Sphinx-4)  RT (Sphinx 3.3)  RT (Sphinx-4, 1 thread)  RT (Sphinx-4, 2 threads)  Vocabulary  Language model
WSJ5K   7.323               6.97              1.36             1.22                     0.96                      5,000       trigram
HUB4    18.845              18.756            3.06             ~4.4                     3.95                      60,000      trigram

(Column labels per the Sphinx-4 paper cited above. WER = word error rate; RT = real-time factor, processing time divided by audio duration.)




MEXICAN SPANISH MODEL


Dr. Arturo Nolazco Flores

Electronic Systems Engineering (ITESM, 1986)

Master's in Control Engineering (ITESM, 1986)

Master's in Electronic Processing of Speech and Language (Cambridge, 1991)

Ph.D. in Automatic Recognition of Speech Highly Contaminated by Ambient Noise (1994)


Mexican Spanish Acoustic Model (2004)

MODEL IMPROVEMENTS

ACOUSTIC MODEL ADAPTATION:  N/A


 ACOUSTIC AND LANGUAGE MODEL TRAINING


50 hrs of recordings from 200 speakers for many-speaker dictation

30 hrs of recordings from 2,000 speakers

Knowledge of the phonetic structure of the language (already done by somebody)

Time to train the model and optimize parameters

(min. 1 month)






DATA PREPARATION

Learn the parameters of the sound-unit models from a set of sample speech signals: the training database. The database contains the information required to extract statistics from the speech, in the form of the acoustic model. The trainer needs to be told which sound units it should learn the parameters of, and at least the sequence in which they occur in every speech signal in the training database.


speech signals

transcripts for the database (in a single file)

two dictionaries: one in which legitimate words in the language are mapped to sequences of sound units (or sub-word units),

and another, the filler dictionary, in which non-speech sounds are mapped to corresponding non-speech or speech-like sound units (format examples follow this list)

a language model for the testing stage
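
As an illustration, here are minimal excerpts in the formats the SphinxTrain tutorial describes; the utterance, words, and file ID are made up.

Transcript file (one utterance per line, file ID in parentheses):

  <s> HELLO WORLD </s> (utt_0001)

Main dictionary (each word mapped to a phone sequence, CMUdict style):

  HELLO  HH AH L OW
  WORLD  W ER L D

Filler dictionary (non-speech events mapped to non-speech units):

  <s>    SIL
  </s>   SIL
  <sil>  SIL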


GOOD WAYS

Manually segment audio recordings that have existing transcriptions (podcasts, news, etc.)


Set up automated collection on VoxForge


Record your friends, family, and colleagues


You have to design database prompts and post-process the results to ensure that the audio actually corresponds to the prompts (a screening sketch follows).
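
One hedged way to post-process, assuming a recognizer pass has already produced a hypothesis per recording: compare hypothesis and prompt word by word and flag poor matches. The 0.8 threshold is an arbitrary starting point, not a standard.

import difflib

def matches_prompt(hypothesis: str, prompt: str, threshold: float = 0.8) -> bool:
    """Flag recordings whose recognized text diverges too far from the prompt.

    `hypothesis` would come from a recognizer pass over the audio;
    the 0.8 threshold is a placeholder to tune on your own data.
    """
    ratio = difflib.SequenceMatcher(
        None, hypothesis.lower().split(), prompt.lower().split()
    ).ratio()
    return ratio >= threshold

# Example: a clean take vs. a take where the speaker strayed from the prompt.
print(matches_prompt("hello world", "hello world"))           # True
print(matches_prompt("hello there everyone", "hello world"))  # False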

HARD TIMES

Audio Extraction and Metadata (ffmpeg)


Noise Decontamination/Reduction (sox/alsa/python)

(accuracy improvement ~15%)


Audio Segmentation (sox/?)

Transcript Segmentation (?)
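
A minimal sketch of the extraction and cleanup steps above with ffmpeg and sox, driven from Python; file names, the noise-only clip, and the 0.21 reduction amount are placeholders to adapt.

import subprocess

def run(cmd):
    """Run a command line, raising if the tool reports failure."""
    subprocess.run(cmd, check=True)

# 1. Extract mono 16 kHz WAV audio from a video/podcast file (ffmpeg).
run(["ffmpeg", "-i", "episode.mp4", "-vn", "-ac", "1", "-ar", "16000",
     "episode.wav"])

# 2. Noise reduction (sox): build a noise profile from a noise-only clip,
#    then subtract it. The 0.21 amount is a placeholder to tune by ear.
run(["sox", "noise_only.wav", "-n", "noiseprof", "noise.prof"])
run(["sox", "episode.wav", "cleaned.wav", "noisered", "noise.prof", "0.21"])

# 3. Cut one training segment by start time and duration (sox trim);
#    the timestamps would come from your transcript segmentation.
run(["sox", "cleaned.wav", "utt_0001.wav", "trim", "12.5", "4.0"])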


TRAIN (+2 months)
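
The training run itself follows the CMU Sphinx tutorial flow: set up a task skeleton, copy in audio, transcripts, and dictionaries, then run all stages. A sketch driven from Python, with my_task as a placeholder task name:

import os
import subprocess

# Placeholder task name; sphinxtrain creates the skeleton inside the
# task directory, and all stages are run from there.
os.makedirs("my_task", exist_ok=True)
subprocess.run(["sphinxtrain", "-t", "my_task", "setup"], check=True, cwd="my_task")

# After copying wav files, transcripts, and dictionaries into my_task/etc,
# run all training stages (feature extraction through model building).
subprocess.run(["sphinxtrain", "run"], check=True, cwd="my_task")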
