Sphinx-4: A Flexible Open Source Framework for
Speech Recognition
Willie Walker, Paul Lamere, Philip Kwok, Bhiksha Raj, Rita Singh, Evandro Gouvea, Peter Wolf, Joe Woelfel [2005]
Xuedong Huang, Alex Acero, Hsiao-Wuen Hon, Spoken Language Processing: A Guide to Theory, Algorithm and System Development [2001]
Creating a Mexican Spanish Version of the CMU Sphinx-III Speech Recognition System, Armando Varela, Heriberto Cuayáhuitl, and Juan Arturo Nolazco-Flores [2003]
http://wiki.audacityteam.org/wiki/
http://cmusphinx.sourceforge.net/wiki
The task of spoken language communication requires a system to recognize, interpret, execute and respond to a spoken query.
Difficulties arise when the speech signal is corrupted by noise from many sources.
In addition, the system has to cope with the non-grammaticality of spoken communication and the ambiguity of language.
Spoken language processing is a diverse subject that relies on knowledge of many levels, including acoustics, phonology, phonetics, linguistics, semantics, pragmatics, and discourse. The diverse nature of spoken language processing requires knowledge in computer science, electrical engineering, mathematics, syntax, and psychology.
Speech signals are composed of analog sound patterns that serve as the basis for a discrete, symbolic representation of the spoken language – phonemes, syllables, and words.
Building blocks : Phonemes
To convey meaning, phonemes combine into cohesive phonetic spans with characteristic patterns : Syllables and Words
Major units : Phrases and Sentences
Semantics, Lexicology, Syntax, Pragmatics, Etymology
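The hierarchy above can be made concrete with a small sketch. The phoneme sequences below follow the ARPAbet style used by the CMU Pronouncing Dictionary (stress markers omitted); the syllable split is hand-segmented for illustration only.

```python
# Illustrative decomposition of a phrase into the units described above:
# phonemes -> syllables -> words -> phrase.
phrase = "speech recognition"

# word -> phoneme sequence (CMUdict-style, stress markers omitted)
phonemes = {
    "speech": ["S", "P", "IY", "CH"],
    "recognition": ["R", "EH", "K", "AH", "G", "N", "IH", "SH", "AH", "N"],
}

# word -> syllables (hand-segmented, for illustration)
syllables = {
    "speech": ["speech"],
    "recognition": ["re", "cog", "ni", "tion"],
}

# Flatten the phrase into its phoneme string, the level a recognizer models.
flat = [p for w in phrase.split() for p in phonemes[w]]
print(len(flat))  # prints 14
```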
Sphinx-4 was jointly designed by Carnegie Mellon University, Sun Microsystems Laboratories, Mitsubishi Electric Research Labs, and Hewlett-Packard's Cambridge Research Lab, with contributions from UC Santa Cruz and the Massachusetts Institute of Technology.
Mobility and scalability, pocketsphinx & Hadoop
Perl, C, Java, Shell, XML, Python
Dual-CPU machine running at 1015 MHz with 2 GB of memory
Test | S3.3 WER (%) | S4 WER (%) | S3.3 RT | S4 RT (1 thread) | S4 RT (2 threads) | Vocabulary | Language Model
---|---|---|---|---|---|---|---
WSJ5K | 7.323 | 6.97 | 1.36 | 1.22 | 0.96 | 5,000 | trigram
HUB4 | 18.845 | 18.756 | 3.06 | ~4.4 | 3.95 | 60,000 | trigram
Dr. Arturo Nolazco Flores
Electronic Systems Engineering (ITESM, 1986)
Master's in Control Engineering (ITESM, 1986)
Master's in Electronic Processing of Speech and Language (Cambridge, 1991)
Ph.D. in Automatic Recognition of Speech Highly Contaminated by Noise (1994)
Mx Spanish Acoustic Model (2004)
ACOUSTIC MODEL ADAPTATION: N/A
ACOUSTIC AND LANGUAGE MODEL TRAINING
50 hours of recordings from 200 speakers for many-speaker dictation
30 hours of recordings from 2,000 speakers
Knowledge of the phonetic structure of the language
Already done by somebody
Time to train the model and optimize parameters
(min 1 month)
Training learns the parameters of the models of the sound units from a set of sample speech signals: the training database. The database contains the information required to extract statistics from the speech in the form of the acoustic model. The trainer needs to be told which sound units it should learn the parameters of, and at least the sequence in which they occur in every speech signal in the training database.
speech signals and their transcripts for the database (transcripts collected in a single file)
two dictionaries: one in which legitimate words in the language are mapped to sequences of sound units (or sub-word units),
and a filler dictionary in which non-speech sounds are mapped to corresponding non-speech or speech-like sound units
a language model for the testing stage
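As an illustration, entries in these files might look as follows. The layouts follow the SphinxTrain conventions documented on the CMU Sphinx wiki (transcript lines bracketed by `<s>`/`</s>` with the utterance id in parentheses); the Spanish words and phone labels are made up for illustration:

```
# transcript file: one utterance per line, file id in parentheses
<s> hola mundo </s> (spk1_utt001)

# pronunciation dictionary: word -> sequence of sound units
HOLA   O L A
MUNDO  M U N D O

# filler dictionary: non-speech events -> filler units
<s>        SIL
</s>       SIL
<sil>      SIL
++NOISE++  +NOISE+
```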
Manually segment audio recordings that already have transcriptions (podcasts, news broadcasts, etc.)
Set up automated collection on VoxForge
Record your friends, family, and colleagues
You have to design the database prompts and post-process the results to ensure that the audio actually corresponds to the prompts
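One part of that post-processing can be automated: checking that every prompt was actually recorded and that no stray recordings lack a prompt. A minimal sketch, assuming utterance ids are used as the `.wav` file names (an assumption of this example, not a SphinxTrain requirement):

```python
import os

def check_prompts(prompts, audio_dir):
    """Compare a prompt set against a directory of recordings.

    `prompts` maps utterance id -> prompt text; recordings are assumed
    to be named <utterance id>.wav. Returns (ids with no recording,
    recordings with no prompt).
    """
    recorded = {os.path.splitext(f)[0] for f in os.listdir(audio_dir)
                if f.endswith(".wav")}
    missing_audio = sorted(set(prompts) - recorded)
    orphan_audio = sorted(recorded - set(prompts))
    return missing_audio, orphan_audio
```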
Audio Extraction and Metadata (ffmpeg)
Noise Decontamination/Reduction (sox/alsa/python)
(accuracy improvement ~ 15%)
Audio Segmentation (sox/¿?)
Transcript Segmentation (¿?)
TRAIN (+2 months)
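The preprocessing steps above can be sketched as command lines assembled in Python. The tool names (ffmpeg, sox) and the 16 kHz 16-bit mono target format come from the pipeline above; the file names, noise-reduction amount, and silence thresholds are illustrative assumptions that would need tuning on real data:

```python
# Sketch of the preprocessing pipeline as argv lists, ready for
# subprocess.run(). All paths and thresholds are placeholders.

def extract_audio(src, dst):
    # ffmpeg: drop video (-vn), downmix to mono, resample to 16 kHz,
    # 16-bit PCM -- the input format Sphinx models typically expect.
    return ["ffmpeg", "-i", src, "-vn", "-ac", "1",
            "-ar", "16000", "-acodec", "pcm_s16le", dst]

def noise_reduce(noise_sample, src, dst, amount=0.21):
    # sox two-step noise reduction: profile a noise-only clip,
    # then subtract that profile from the recording.
    return [
        ["sox", noise_sample, "-n", "noiseprof", "noise.prof"],
        ["sox", src, dst, "noisered", "noise.prof", str(amount)],
    ]

def segment_on_silence(src, prefix):
    # sox silence + newfile: split the recording at pauses into
    # one utterance per output file (prefix001.wav, prefix002.wav, ...).
    return ["sox", src, prefix + ".wav",
            "silence", "1", "0.3", "1%", "1", "0.3", "1%",
            ":", "newfile", ":", "restart"]

print(" ".join(extract_audio("lecture.mp4", "lecture.wav")))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`; keeping the commands as data makes the pipeline easy to log and re-run per file.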