Full multicondition training for robust i-vector based speaker recognition

Paper authors: Dayana Ribas,
Emmanuel Vincent, Jose Ramon Calvo

Quick Revision

What were i-vectors used for again?

v1 = [x1, y1, z1]
v2 = [x2, y2, z2]

Since we know only Bob and Dave, which one of them is it? Or is it some stranger?

This is Bob:

And this is Dave:

v3 = [...]

Who is this?
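A toy sketch of how that decision could be made: score the unknown vector against each enrolled speaker with cosine similarity and accept the best match only if it beats a threshold. Everything below (names, vectors, threshold) is made up for illustration; real systems also use other scoring back-ends such as PLDA.

    import numpy as np

    def cosine(a, b):
        # Cosine similarity: 1.0 = same direction, 0.0 = orthogonal.
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Toy enrolled "i-vectors" (real ones have 300+ dimensions).
    enrolled = {
        "Bob":  np.array([0.9, 0.1, 0.2]),
        "Dave": np.array([0.1, 0.8, 0.3]),
    }
    v3 = np.array([0.85, 0.15, 0.25])  # the unknown speaker

    scores = {name: cosine(v3, vec) for name, vec in enrolled.items()}
    best = max(scores, key=scores.get)
    threshold = 0.7  # made-up decision threshold
    print(best if scores[best] > threshold else "a stranger")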

Remember!

Unlike in the previous example, i-vector dimensionality usually ranges from 300 up to even 1000!

So, how did we get that i-vector?

These guys helped.

Those were the Gaussian components of a GMM, representing the Universal Background Model

We took their means...

m_1 = [0.1, 0.7, ...]
m_2 = [0.9, 0.4, ...]
...
m_n = [0.5, 0.2, ...]

...and stacked them into a supervector!

m_super = [0.1, 0.7, ..., 0.9, 0.4, ..., 0.5, 0.2, ...]
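As a rough sketch of that stacking step, assuming the UBM is a fitted GaussianMixture from scikit-learn (toy data, tiny model):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Toy frames standing in for acoustic features (e.g. 13-dim MFCCs).
    frames = np.random.randn(5000, 13)

    # A tiny UBM; real systems use hundreds or thousands of components.
    ubm = GaussianMixture(n_components=8, covariance_type="diag").fit(frames)

    # Stack the component means m_1, m_2, ..., m_n into one long supervector.
    m_super = ubm.means_.reshape(-1)
    print(m_super.shape)  # (8 * 13,) = (104,)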

However, it may have as many as
13,000 dimensions...

And that's why we performed a linear transformation!

our i-vector
(300 dimensions)

our T matrix (maps the ~13,000 dimensions to the 'most important' 300 ones)

our supervector
(13,000 dimensions)
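Taking the slide's simplified picture literally (i-vector = projection of the supervector), a minimal sketch looks like the code below. This is only an illustration: in the real i-vector model the supervector is written as m_ubm + T*w, w is estimated from its posterior given Baum-Welch statistics, and T is trained with EM; the matrix here is just random.

    import numpy as np

    supervector_dim = 13000  # e.g. 1000 components x 13-dim MFCC means
    ivector_dim = 300

    # Placeholder "T" matrix; in a real system it is learned with EM.
    T = np.random.randn(ivector_dim, supervector_dim) * 0.01
    m_super = np.random.randn(supervector_dim)

    # Simplified view from the slide: project 13,000 dims down to 300.
    w = T @ m_super
    print(w.shape)  # (300,)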

Don't forget

Usually, we also use LDA in order to project our i-vector into another vector space which maximises discriminability.

Something like that...
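A minimal LDA sketch with scikit-learn; the i-vectors and speaker labels below are random placeholders, only the shapes matter:

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    n_speakers, per_speaker, dim = 20, 10, 300
    ivectors = np.random.randn(n_speakers * per_speaker, dim)
    labels = np.repeat(np.arange(n_speakers), per_speaker)

    # Project 300-dim i-vectors onto at most (n_speakers - 1) discriminative axes.
    lda = LinearDiscriminantAnalysis(n_components=n_speakers - 1)
    projected = lda.fit_transform(ivectors, labels)
    print(projected.shape)  # (200, 19)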

Summary of this part

We've got different representations so far:

  • Universal Background Model (the big GMM)
  • T matrix (maps the big GMM's stacked means, the supervector, to a much smaller vector)
  • i-vector (the low dimensional representation of speaker-adapted GMM)
  • i-vector after LDA (or one of its variants)

The Problem

We haven't accounted much for channel variability so far.

We can train a speaker recognition system on a clean set of data...

... especially the kind of clean data we usually get in lab conditions...

... but will it generalise well to real-life data?

(no)

Solution?

Multicondition training! (MCT)

Add some reverberation and noise to make the data dirty...

... add the dirty data to the training set...

... voila!

(basically, that's it)
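A rough sketch of that "make the data dirty" step: convolve clean speech with a room impulse response and add noise at a chosen SNR. The signals below are synthetic placeholders; the paper takes its impulse responses and noises from the 2nd CHiME challenge.

    import numpy as np
    from scipy.signal import fftconvolve

    def make_dirty(clean, rir, noise, snr_db=10.0):
        # Reverberate the clean speech with the room impulse response.
        reverberant = fftconvolve(clean, rir)[: len(clean)]
        noise = noise[: len(reverberant)]
        # Scale the noise so the speech-to-noise power ratio matches snr_db.
        speech_power = np.mean(reverberant ** 2)
        noise_power = np.mean(noise ** 2) + 1e-12
        gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10.0)))
        return reverberant + gain * noise

    # Synthetic stand-ins for real recordings (3 seconds at 16 kHz).
    clean = np.random.randn(16000 * 3)
    rir = np.exp(-np.linspace(0, 8, 4000)) * np.random.randn(4000)
    noise = np.random.randn(16000 * 3)
    dirty = make_dirty(clean, rir, noise, snr_db=5.0)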

Their Dataset

  • NIST-SRE 2004
  • NIST-SRE 2005
  • NIST-SRE 2008
  • Room impulse responses and additive noise from
    Track 1 of the 2nd CHiME Challenge

 

In total,

  • 3285 five-minute speech signals for training
    [approx. 1 min of speech each]
  • 470 speech signals for enrollment
  • 671 speech signals for test
  • 121 room impulse responses
  • Several hours of background noise (TV, voices, games, etc.)

Other observations of the authors

Multichannel systems

Speaker recognition databases are still mostly single-channel. But today, with multi-microphone devices everywhere, we could record multiple channels and apply powerful multichannel denoising algorithms to improve our input!
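This is not what the paper itself does, but as a minimal illustration of the multichannel idea, here is a naive delay-and-sum beamformer: time-align each channel to a reference via cross-correlation, then average. All signals below are synthetic.

    import numpy as np

    def delay_and_sum(channels):
        # Align each channel to channel 0 by the lag that maximises
        # cross-correlation, then average the aligned channels.
        ref = channels[0]
        aligned = [ref]
        for ch in channels[1:]:
            corr = np.correlate(ch, ref, mode="full")
            lag = corr.argmax() - (len(ref) - 1)  # positive lag: ch arrives late
            aligned.append(np.roll(ch, -lag))
        return np.mean(aligned, axis=0)

    # Two synthetic channels: the same signal, one delayed, both noisy.
    sig = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 8000))
    ch0 = sig + 0.3 * np.random.randn(8000)
    ch1 = np.roll(sig, 40) + 0.3 * np.random.randn(8000)
    enhanced = delay_and_sum([ch0, ch1])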

Partial vs full MCT

The authors observed that the common tendency is to apply MCT only during LDA training; they claim to be the first to also use it while training the UBM and the T matrix.

Clean Dataset Performance

Experimental Results

Conclusions

  • MCT works better when applied during UBM
    and T matrix training
  • Using multichannel denoising algorithms helps the performance even more
    (as opposed to single-channel denoising algorithms)
  • No model parameters had to be changed - the only thing that changed was the data

Thanks for Your Attention!

Full multicondition training for robust i-vector based speaker recognition

By Piotr Żelasko


Presentation about a paper by Dayana Ribas, Emmanuel Vincent and Jose Ramon Calvo from Interspeech 2015 (the paper is not mine; all copyright belongs to the authors)
