Speech Recognition
REFERENCES
Willie Walker, Paul Lamere, Philip Kwok, Bhiksha Raj, Rita Singh, Evandro Gouvea, Peter Wolf, Joe Woelfel. "Sphinx-4: A Flexible Open Source Framework for Speech Recognition" [2005]
Xuedong Huang, Alex Acero, Hsiao-Wuen Hon. "Spoken Language Processing: A Guide to Theory, Algorithm and System Development" [2001]
Armando Varela, Heriberto Cuayáhuitl, Juan Arturo Nolazco-Flores. "Creating a Mexican Spanish Version of the CMU Sphinx-III Speech Recognition System" [2003]
http://wiki.audacityteam.org/wiki/
http://cmusphinx.sourceforge.net/wiki
THE PROBLEM
The task of spoken language communication requires a system to recognize, interpret, execute and respond to a spoken query.
Difficulties arise when the speech signal is corrupted by noise from many sources.
In addition, the system has to cope with the non-grammaticality of spoken communication and the ambiguity of language.
Spoken language processing is a diverse subject that relies on knowledge at many levels, including acoustics, phonology, phonetics, linguistics, semantics, pragmatics, and discourse. The diverse nature of the field requires knowledge of computer science, electrical engineering, mathematics, syntax, and psychology.
History note
A researcher or professional in the field must achieve a thorough grounding in:
probability, statistics, and information theory
pattern recognition, Bayesian theory, and machine learning
digital signal processing and stochastic processes
speech signal representation and hidden Markov models
acoustic and language modeling
SPOKEN LANGUAGE SYSTEM ARCHITECTURE
Spoken language processing refers to technologies related to speech recognition, text-to-speech, and spoken language understanding.
Spoken Language Structure
Speech signals are composed of analog sound patterns that serve as the basis for a discrete, symbolic representation of the spoken language – phonemes, syllables, and words.
Building blocks: Phonemes
To convey meaning, phonemes are grouped into cohesive phonetic spans with characteristic patterns: Syllables and Words (see the sketch below).
Major units: Phrases and Sentences
Semantics, Lexicology, Syntax, Pragmatics, Etymology
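As a small illustration (a minimal sketch in Python; the ARPAbet-style pronunciations below follow CMU Pronouncing Dictionary conventions but are written from memory, so treat them as assumptions):

```python
# Minimal sketch: words as sequences of phonemes (ARPAbet-style symbols,
# as used by the CMU Pronouncing Dictionary). Pronunciations here are
# illustrative assumptions, not authoritative dictionary entries.
PRONUNCIATIONS = {
    "speech": ["S", "P", "IY", "CH"],
    "recognition": ["R", "EH", "K", "AH", "G", "N", "IH", "SH", "AH", "N"],
}

def phoneme_sequence(words):
    """Flatten a word sequence into the phoneme sequence a recognizer models."""
    return [ph for w in words for ph in PRONUNCIATIONS[w]]

print(phoneme_sequence(["speech", "recognition"]))
```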
SPHINX-4
Sphinx-4 was jointly designed by Carnegie Mellon University, Sun Microsystems Laboratories, Mitsubishi Electric Research Labs, and Hewlett-Packard's Cambridge Research Lab, with contributions from the University of California at Santa Cruz and the Massachusetts Institute of Technology.
Mobility and scalability: PocketSphinx & Hadoop
Perl, C, Java, Shell, XML, Python
CONFIGURATION
Recognizer
Decoder
Acoustic model
Dictionary
Grammar/Language model
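A minimal sketch of wiring these components together from Python, using the older SWIG-style pocketsphinx bindings rather than Sphinx-4's XML/Java configuration; the model, dictionary, and language-model paths are placeholders/assumptions:

```python
# Minimal sketch: configuring a CMU Sphinx decoder via the pocketsphinx
# Python bindings. Paths are placeholders; point them at a real acoustic
# model, pronunciation dictionary, and language model.
from pocketsphinx import Decoder

config = Decoder.default_config()
config.set_string('-hmm', 'model/en-us')                 # acoustic model directory
config.set_string('-dict', 'model/cmudict-en-us.dict')   # pronunciation dictionary
config.set_string('-lm', 'model/en-us.lm.bin')           # statistical language model
# For a grammar instead of a language model, pass a JSGF file with '-jsgf'.

decoder = Decoder(config)

# Decode a mono 16 kHz, 16-bit PCM raw file.
decoder.start_utt()
with open('utterance.raw', 'rb') as f:
    while True:
        buf = f.read(4096)
        if not buf:
            break
        decoder.process_raw(buf, False, False)
decoder.end_utt()

hyp = decoder.hyp()
print(hyp.hypstr if hyp else '(no hypothesis)')
```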
Performance
Dual-CPU machine running at 1015 MHz with 2 GB of memory
| Test | Sphinx-3.3 WER (%) | Sphinx-4 WER (%) | Sphinx-3.3 RT | Sphinx-4 RT (1 thread) | Sphinx-4 RT (2 threads) | Vocabulary | Language model |
|---|---|---|---|---|---|---|---|
| WSJ5K | 7.323 | 6.97 | 1.36 | 1.22 | 0.96 | 5,000 | trigram |
| HUB4 | 18.845 | 18.756 | 3.06 | ~4.4 | 3.95 | 60,000 | trigram |
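The WER figures above are word error rates. As a reminder of how WER is computed, here is a minimal sketch: word-level edit distance (substitutions, insertions, deletions) divided by the reference length.

```python
# Minimal sketch: word error rate via word-level edit distance.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1   # substitution
            dp[i][j] = min(dp[i - 1][j] + 1,              # deletion
                           dp[i][j - 1] + 1,              # insertion
                           dp[i - 1][j - 1] + cost)
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # ~0.167
```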
MEXICAN SPANISH MODEL
Dr. Arturo Nolazco Flores
Electronic Systems Engineering (ITESM 1986)
Master's in Control Engineering (ITESM, 1986)
Master's in Electronic Speech and Language Processing (Cambridge, 1991)
Ph.D. in automatic recognition of speech highly contaminated by ambient noise (1994)
Mexican Spanish acoustic model (2004)
MODEL IMPROVEMENTS
ACOUSTIC MODEL ADAPTATION: N/A
ACOUSTIC AND LANGUAGE MODEL TRAINING
50 hours of recordings from 200 speakers for many-speaker dictation
30 hours of recordings from 2,000 speakers
Knowledge of the phonetic structure of the language (possibly already done by somebody else)
Time to train the model and optimize parameters (at least 1 month)
DATA PREPARATION
The trainer learns the parameters of the models of the sound units from a set of sample speech signals: the training database. The database contains the information required to extract statistics from the speech in the form of the acoustic model. The trainer needs to be told which sound units it should learn parameters for, and at least the sequence in which they occur in every speech signal in the training database.
speech signals and transcripts for the database (in a single file)
two dictionaries: one in which legitimate words in the language are mapped to sequences of sound units (or sub-word units),
and a filler dictionary in which non-speech sounds are mapped to corresponding non-speech or speech-like sound units
a language model for the testing stage (a helper sketch follows this list)
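A minimal sketch of generating the control (fileids) and transcription files from a directory of paired .wav/.txt files; the directory layout and file names are assumptions, while the line formats follow the CMU SphinxTrain tutorial conventions:

```python
# Minimal sketch: build SphinxTrain-style .fileids and .transcription files
# from a directory of paired recordings (utt_001.wav) and transcripts
# (utt_001.txt). Directory layout and names are assumptions for illustration.
import os

WAV_DIR = "db/wav"
OUT_PREFIX = "db/etc/my_db_train"

entries = sorted(f[:-4] for f in os.listdir(WAV_DIR) if f.endswith(".wav"))

with open(OUT_PREFIX + ".fileids", "w") as fileids, \
     open(OUT_PREFIX + ".transcription", "w") as trans:
    for utt_id in entries:
        with open(os.path.join(WAV_DIR, utt_id + ".txt")) as f:
            text = f.read().strip().lower()
        # One utterance id per line (path relative to the wav dir, no extension).
        fileids.write(utt_id + "\n")
        # Transcription line format: <s> words </s> (utterance_id)
        trans.write("<s> %s </s> (%s)\n" % (text, utt_id))
```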
GOOD WAYS
Manually segment audio recordings that have existing transcriptions (podcasts, news, etc.)
Set up automated collection on VoxForge
Record your friends, family, and colleagues
You have to design database prompts and post-process the results to ensure that the audio actually corresponds to the prompts (a simple check is sketched below)
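One crude post-processing check, sketched under the assumption of mono WAV recordings and a hypothetical file layout: flag recordings whose duration is implausible for the prompt length, which catches empty, truncated, or mismatched takes.

```python
# Minimal sketch: sanity-check that recorded audio plausibly matches its
# prompt by comparing duration against a rough speaking rate. Thresholds
# and file names are assumptions; forced alignment would be more rigorous.
import wave

def duration_seconds(wav_path):
    with wave.open(wav_path, "rb") as w:
        return w.getnframes() / float(w.getframerate())

def plausible(prompt, wav_path, min_sec_per_word=0.15, max_sec_per_word=1.5):
    words = len(prompt.split())
    dur = duration_seconds(wav_path)
    return words * min_sec_per_word <= dur <= words * max_sec_per_word

if not plausible("close the window please", "rec/utt_042.wav"):
    print("rec/utt_042.wav looks suspicious; re-record or discard")
```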
HARD TIMES
Audio Extraction and Metadata (ffmpeg)
Noise Decontamination/Reduction (sox/alsa/python)
(accuracy improvement ~ 15%)
Audio Segmentation (sox/?)
Transcript Segmentation (?)
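A minimal sketch of the audio steps above as subprocess calls; the ffmpeg/sox flags are standard options, but the file names, noise-sample window, and segment times are assumptions for illustration:

```python
# Minimal sketch: wrap ffmpeg and sox for extraction, noise reduction,
# and segmentation of training audio.
import subprocess

def run(cmd):
    subprocess.run(cmd, check=True)

# 1. Extract mono 16 kHz WAV from a source recording (podcast, video, ...).
run(["ffmpeg", "-i", "episode.mp4", "-ac", "1", "-ar", "16000", "full.wav"])

# 2. Noise reduction with sox: build a noise profile from a quiet stretch,
#    then apply it to the whole file.
run(["sox", "full.wav", "-n", "trim", "0", "0.5", "noiseprof", "noise.prof"])
run(["sox", "full.wav", "clean.wav", "noisered", "noise.prof", "0.21"])

# 3. Cut one utterance out of the cleaned file (start time, duration in seconds).
run(["sox", "clean.wav", "utt_001.wav", "trim", "12.3", "4.7"])
```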
TRAIN (+2 months)