Full multicondition training for robust i-vector based speaker recognition

Paper authors: Dayana Ribas,
Emmanuel Vincent, Jose Ramon Calvo

Quick Revision

What were i-vectors used for again?

v1 = [x1, y1, z1]
v2 = [x2, y2, z2]

Since we know only Bob and Dave, which one of them is it? Or is it some stranger?

This is Bob:

And this is Dave:

v3 = [...]

Who is this?
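A toy sketch of how that decision could be made: score the unknown vector against each enrolled speaker with cosine similarity and accept the best match only if it beats a threshold. Everything below (names, vectors, threshold) is made up for illustration; real systems also use other scoring back-ends such as PLDA.

    import numpy as np

    def cosine(a, b):
        # Cosine similarity: 1.0 = same direction, 0.0 = orthogonal.
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Toy enrolled "i-vectors" (real ones have 300+ dimensions).
    enrolled = {
        "Bob":  np.array([0.9, 0.1, 0.2]),
        "Dave": np.array([0.1, 0.8, 0.3]),
    }
    v3 = np.array([0.85, 0.15, 0.25])  # the unknown speaker

    scores = {name: cosine(v3, vec) for name, vec in enrolled.items()}
    best = max(scores, key=scores.get)
    threshold = 0.7  # made-up decision threshold
    print(best if scores[best] > threshold else "a stranger")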

Remember!

Unlike in the previous example, i-vector dimensionality usually ranges from 300 up to even 1000!

So, how did we get that i-vector?

These guys helped.

Those were the Gaussian components of a GMM, representing the Universal Background Model

We took their means...

m_1 = [0.1, 0.7, ...]
m_2 = [0.9, 0.4, ...]
...
m_n = [0.5, 0.2, ...]

...and stacked them into a supervector!

m_super = [0.1, 0.7, ..., 0.9, 0.4, ..., 0.5, 0.2, ...]
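As a rough sketch of that stacking step, assuming the UBM is a fitted GaussianMixture from scikit-learn (toy data, tiny model):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Toy frames standing in for acoustic features (e.g. 13-dim MFCCs).
    frames = np.random.randn(5000, 13)

    # A tiny UBM; real systems use hundreds or thousands of components.
    ubm = GaussianMixture(n_components=8, covariance_type="diag").fit(frames)

    # Stack the component means m_1, m_2, ..., m_n into one long supervector.
    m_super = ubm.means_.reshape(-1)
    print(m_super.shape)  # (8 * 13,) = (104,)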

However, it may have as many as
13,000 dimensions...

And that's why we performed a linear transformation!

our i-vector
(300 dimensions)

our T matrix (maps the ~13,000 dimensions to the 'most important' 300 ones)

our supervector
(13,000 dimensions)
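Taking the slide's simplified picture literally (i-vector = projection of the supervector), a minimal sketch looks like the code below. This is only an illustration: in the real i-vector model the supervector is written as m_ubm + T*w, w is estimated from its posterior given Baum-Welch statistics, and T is trained with EM; the matrix here is just random.

    import numpy as np

    supervector_dim = 13000  # e.g. 1000 components x 13-dim MFCC means
    ivector_dim = 300

    # Placeholder "T" matrix; in a real system it is learned with EM.
    T = np.random.randn(ivector_dim, supervector_dim) * 0.01
    m_super = np.random.randn(supervector_dim)

    # Simplified view from the slide: project 13,000 dims down to 300.
    w = T @ m_super
    print(w.shape)  # (300,)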

Don't forget

Usually, we also use LDA in order to project our i-vector into another vector space which maximises discriminability.

Something like that...
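A minimal LDA sketch with scikit-learn; the i-vectors and speaker labels below are random placeholders, only the shapes matter:

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    n_speakers, per_speaker, dim = 20, 10, 300
    ivectors = np.random.randn(n_speakers * per_speaker, dim)
    labels = np.repeat(np.arange(n_speakers), per_speaker)

    # Project 300-dim i-vectors onto at most (n_speakers - 1) discriminative axes.
    lda = LinearDiscriminantAnalysis(n_components=n_speakers - 1)
    projected = lda.fit_transform(ivectors, labels)
    print(projected.shape)  # (200, 19)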

Summary of this part

We've got different representations so far:

  • Universal Background Model (the big GMM)
  • T matrix (maps the big GMM's stacked means, the supervector, to a much smaller vector)
  • i-vector (the low dimensional representation of speaker-adapted GMM)
  • i-vector after LDA (or one of its variants)

The Problem

We haven't accounted much for channel variability so far.

We can train a speaker recognition system on a clean set of data...

... especially the kind of clean data we usually get in lab conditions...

... but will it generalise well to real-life data?

(no)

Solution?

Multicondition training! (MCT)

Add some reverberation and noise to make the data dirty...

... add the dirty data to the training set...

... voila!

(basically, that's it)
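A rough sketch of that "make the data dirty" step: convolve clean speech with a room impulse response and add noise at a chosen SNR. The signals below are synthetic placeholders; the paper takes its impulse responses and noises from the 2nd CHiME challenge.

    import numpy as np
    from scipy.signal import fftconvolve

    def make_dirty(clean, rir, noise, snr_db=10.0):
        # Reverberate the clean speech with the room impulse response.
        reverberant = fftconvolve(clean, rir)[: len(clean)]
        noise = noise[: len(reverberant)]
        # Scale the noise so the speech-to-noise power ratio matches snr_db.
        speech_power = np.mean(reverberant ** 2)
        noise_power = np.mean(noise ** 2) + 1e-12
        gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10.0)))
        return reverberant + gain * noise

    # Synthetic stand-ins for real recordings (3 seconds at 16 kHz).
    clean = np.random.randn(16000 * 3)
    rir = np.exp(-np.linspace(0, 8, 4000)) * np.random.randn(4000)
    noise = np.random.randn(16000 * 3)
    dirty = make_dirty(clean, rir, noise, snr_db=5.0)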

Their Dataset

  • NIST-SRE 2004
  • NIST-SRE 2005
  • NIST-SRE 2008
  • Room impulse responses and additive noise from
    Track 1 of the 2nd CHiME Challenge

 

In total,

  • 3285 five-minute speech signals for training
    [approx. 1 min of speech each]
  • 470 speech signals for enrollment
  • 671 speech signals for test
  • 121 room impulse responses
  • Several hours of background noise (TV, voices, games, etc.)

Other observations of the authors

Multichannel systems

Speaker recognition databases are still mostly single-channel. But today, with multi-microphone devices everywhere, we could record multiple channels and apply powerful multichannel denoising algorithms to improve our input!
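This is not what the paper itself does, but as a minimal illustration of the multichannel idea, here is a naive delay-and-sum beamformer: time-align each channel to a reference via cross-correlation, then average. All signals below are synthetic.

    import numpy as np

    def delay_and_sum(channels):
        # Align each channel to channel 0 by the lag that maximises
        # cross-correlation, then average the aligned channels.
        ref = channels[0]
        aligned = [ref]
        for ch in channels[1:]:
            corr = np.correlate(ch, ref, mode="full")
            lag = corr.argmax() - (len(ref) - 1)  # positive lag: ch arrives late
            aligned.append(np.roll(ch, -lag))
        return np.mean(aligned, axis=0)

    # Two synthetic channels: the same signal, one delayed, both noisy.
    sig = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 8000))
    ch0 = sig + 0.3 * np.random.randn(8000)
    ch1 = np.roll(sig, 40) + 0.3 * np.random.randn(8000)
    enhanced = delay_and_sum([ch0, ch1])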

Partial vs full MCT

The authors observed that the common tendency is to apply MCT only during LDA training; they claim to be the first to also use it while training the UBM and the T matrix.

Clean Dataset Performance

Experimental Results

Conclusions

  • MCT works better when applied during UBM
    and T matrix training
  • Using multichannel denoising algorithms helps the performance even more
    (as opposed to single-channel denoising algorithms)
  • No model parameters had to be changed - the only thing that changed was the data

Thanks for Your Attention!

Full multicondition training for robust i-vector based speaker recognition

By Piotr Żelasko


Presentation about a paper by Dayana Ribas, Emmanuel Vincent and Jose Ramon Calvo from Interspeech 2015 (the paper is not mine; all copyright belongs to the authors)
