Sphinx-4: A Flexible Open Source Framework for
Speech Recognition
Willie Walker, Paul Lamere, Philip Kwok, Bhiksha Raj, Rita Singh, Evandro Gouvea, Peter Wolf, Joe Woelfel [2005]
Xuedong Huang, Alex Acero, Hsiao-Wuen Hon, Spoken Language Processing: A Guide to Theory, Algorithm and System Development [2001]
Creating a Mexican Spanish Version of the CMU Sphinx-III Speech Recognition System, Armando Varela, Heriberto Cuayáhuitl, and Juan Arturo Nolazco-Flores [2003]
http://wiki.audacityteam.org/wiki/
http://cmusphinx.sourceforge.net/wiki
The task of spoken language communication requires a system to recognize, interpret, execute and respond to a spoken query.
Difficulties arise when the speech signal is corrupted by noise from many sources.
In addition, the system has to cope with the non-grammaticality of spoken communication and the ambiguity of language.
Spoken language processing is a diverse subject that relies on knowledge of many levels, including acoustics, phonology, phonetics, linguistics, semantics, pragmatics, and discourse. The diverse nature of spoken language processing requires knowledge in computer science, electrical engineering, mathematics, syntax, and psychology.
Speech signals are composed of analog sound patterns that serve as the basis for a discrete, symbolic representation of the spoken language – phonemes, syllables, and words.
Building blocks : Phonemes
To convey meaning, phonemes combine into cohesive phonetic spans with characteristic patterns : Syllables and Words
Major units : Phrases and Sentences
Semantics, Lexicology, Syntax, Pragmatics, Etymology
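The hierarchy above can be made concrete with a small sketch. The phoneme sequences below follow the ARPAbet style used by the CMU Pronouncing Dictionary (stress markers omitted); the syllable split is hand-segmented for illustration only.

```python
# Illustrative decomposition of a phrase into the units described above:
# phonemes -> syllables -> words -> phrase.
phrase = "speech recognition"

# word -> phoneme sequence (CMUdict-style, stress markers omitted)
phonemes = {
    "speech": ["S", "P", "IY", "CH"],
    "recognition": ["R", "EH", "K", "AH", "G", "N", "IH", "SH", "AH", "N"],
}

# word -> syllables (hand-segmented, for illustration)
syllables = {
    "speech": ["speech"],
    "recognition": ["re", "cog", "ni", "tion"],
}

# Flatten the phrase into its phoneme string, the level a recognizer models.
flat = [p for w in phrase.split() for p in phonemes[w]]
print(len(flat))  # prints 14
```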
Sphinx-4 was jointly designed by Carnegie Mellon University, Sun Microsystems Laboratories, Mitsubishi Electric Research Labs, and Hewlett-Packard's Cambridge Research Lab, with contributions from UC Santa Cruz and the Massachusetts Institute of Technology.
Mobility and scalability, pocketsphinx & Hadoop
Perl, C, Java, Shell, XML, Python
Dual-CPU machine running at 1015 MHz with 2 GB of memory
Test | S3.3 WER (%) | S4 WER (%) | S3.3 RT | S4 RT (1 thread) | S4 RT (2 threads) | Vocabulary | Language Model
---|---|---|---|---|---|---|---
WSJ5K | 7.323 | 6.97 | 1.36 | 1.22 | 0.96 | 5,000 | trigram
HUB4 | 18.845 | 18.756 | 3.06 | ~4.4 | 3.95 | 60,000 | trigram
Dr. Arturo Nolazco Flores
Electronic Systems Engineering (ITESM, 1986)
Master's in Control Engineering (ITESM, 1986)
Master's in Electronic Processing of Speech and Language (Cambridge, 1991)
Ph.D. in Automatic Recognition of Speech Highly Contaminated by Noise (1994)
Mx Spanish Acoustic Model (2004)
ACOUSTIC MODEL ADAPTATION: N/A
ACOUSTIC AND LANGUAGE MODEL TRAINING
50 hours of recordings from 200 speakers for many-speaker dictation
30 hours of recordings from 2,000 speakers
Knowledge of the phonetic structure of the language
Already done by somebody
Time to train the model and optimize parameters
(min 1 month)
Training learns the parameters of the models of the sound units from a set of sample speech signals: the training database. The database contains the information required to extract statistics from the speech in the form of the acoustic model. The trainer needs to be told which sound units it should learn the parameters of, and at least the sequence in which they occur in every speech signal in the training database.
speech signals and their transcripts for the database (transcripts collected in a single file)
two dictionaries: one in which legitimate words in the language are mapped to sequences of sound units (or sub-word units),
and a filler dictionary in which non-speech sounds are mapped to corresponding non-speech or speech-like sound units
a language model for the testing stage
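As an illustration, entries in these files might look as follows. The layouts follow the SphinxTrain conventions documented on the CMU Sphinx wiki (transcript lines bracketed by `<s>`/`</s>` with the utterance id in parentheses); the Spanish words and phone labels are made up for illustration:

```
# transcript file: one utterance per line, file id in parentheses
<s> hola mundo </s> (spk1_utt001)

# pronunciation dictionary: word -> sequence of sound units
HOLA   O L A
MUNDO  M U N D O

# filler dictionary: non-speech events -> filler units
<s>        SIL
</s>       SIL
<sil>      SIL
++NOISE++  +NOISE+
```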
Manually segment audio recordings that already have transcriptions (podcasts, news broadcasts, etc.)
Set up automated collection on VoxForge
Record your friends, family, and colleagues
You have to design the database prompts and post-process the results to ensure that the audio actually corresponds to the prompts
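One part of that post-processing can be automated: checking that every prompt was actually recorded and that no stray recordings lack a prompt. A minimal sketch, assuming utterance ids are used as the `.wav` file names (an assumption of this example, not a SphinxTrain requirement):

```python
import os

def check_prompts(prompts, audio_dir):
    """Compare a prompt set against a directory of recordings.

    `prompts` maps utterance id -> prompt text; recordings are assumed
    to be named <utterance id>.wav. Returns (ids with no recording,
    recordings with no prompt).
    """
    recorded = {os.path.splitext(f)[0] for f in os.listdir(audio_dir)
                if f.endswith(".wav")}
    missing_audio = sorted(set(prompts) - recorded)
    orphan_audio = sorted(recorded - set(prompts))
    return missing_audio, orphan_audio
```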
Audio Extraction and Metadata (ffmpeg)
Noise Decontamination/Reduction (sox/alsa/python)
(accuracy improvement ~ 15%)
Audio Segmentation (sox/¿?)
Transcript Segmentation (¿?)
TRAIN (+2 months)
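The preprocessing steps above can be sketched as command lines assembled in Python. The tool names (ffmpeg, sox) and the 16 kHz 16-bit mono target format come from the pipeline above; the file names, noise-reduction amount, and silence thresholds are illustrative assumptions that would need tuning on real data:

```python
# Sketch of the preprocessing pipeline as argv lists, ready for
# subprocess.run(). All paths and thresholds are placeholders.

def extract_audio(src, dst):
    # ffmpeg: drop video (-vn), downmix to mono, resample to 16 kHz,
    # 16-bit PCM -- the input format Sphinx models typically expect.
    return ["ffmpeg", "-i", src, "-vn", "-ac", "1",
            "-ar", "16000", "-acodec", "pcm_s16le", dst]

def noise_reduce(noise_sample, src, dst, amount=0.21):
    # sox two-step noise reduction: profile a noise-only clip,
    # then subtract that profile from the recording.
    return [
        ["sox", noise_sample, "-n", "noiseprof", "noise.prof"],
        ["sox", src, dst, "noisered", "noise.prof", str(amount)],
    ]

def segment_on_silence(src, prefix):
    # sox silence + newfile: split the recording at pauses into
    # one utterance per output file (prefix001.wav, prefix002.wav, ...).
    return ["sox", src, prefix + ".wav",
            "silence", "1", "0.3", "1%", "1", "0.3", "1%",
            ":", "newfile", ":", "restart"]

print(" ".join(extract_audio("lecture.mp4", "lecture.wav")))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`; keeping the commands as data makes the pipeline easy to log and re-run per file.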