zero-shot learning (ZSL)

Can you say, which of the creatures, presented in these photos
is an "Aye-aye"?

I think you probably have never seen them. And because of this, you cannot give an answer.

Just like typical ML model.
Never seen - can't predict.
(much better than random)

But what if i give you some knowledge about "Aye-aye"?

Let's say, it lives on land.
In this case, only 4 answers remain.
It's relatively small.
Then there are 2 options left.
And it doesn't have a long neck
Ta-daaaa, we've found it!

What just happened? You gained knowledge about the animal and only then you were able to give a prediction.

Just to consolidate and move on

"Zero-shot learning is a method that allows to predict something we've never seen."

Nikita Detkov

"Zero-shot learning is being able to solve a task despite not having received any training examples of that task..."

Ian Goodfellow

So how is it working in terms of classification?

Given:

We have pictures with class labels - it is our train set.
We have pictures without class labels - it is our zero-shot set.

Result:

We want to find class labels for zero-shot dataset even if these classes did not occur in the train set.

But... How? :

Remember previous example. Imagine that you have never seen a single cat in your life. And then someone gave you his detailed description. From that moment, if you see a cat, you will recognize him. We can implement this logic with image embeddings and semantic relationship between classes.

Get image embeddings

You can use any backbone to get vector representation of picture in latent space: VGG, ResNET, etc. - as you like (BUT THERE IS A LITTLE DETAIL).

Word2Vec
fastText
GloVe
etc.

GET CLASS LATENT SPACE

Get function mapping embedding space to class space

So, for example,
embedding space \(\in R^{1024}\), class space \(\in R^{300}\) and 15 classes in train set.

Then we can do something like this and train it:

After all, we need just to drop last layer and your mapping function is done!

pretty good example OF HOW IT WORKS

or at least should

and So, again, approximate algorithm

Get embeddings of pics from train set
Create class latent space, where distance between classes depends on their semantic relationship
Find some mapping function from embedding space to classes latent space.
Get embeddings of pics from zero-shot set
Get vectors corresponding to embeddings from previous step in classes latent space with mapping function
Label received vectors with, lets say, 1-NN algorithm or use some clustering method with label propagation

Now you have class labels for pics from zero-shot set.

Other applications

other ways to implement

Applications
CV:

Classification
Detection
Multi-class Labeling
Image Captioning
etc.;

NLP,
Audio Recognition,
and so on

Implementations

Word2vec and others are unsupervised methods, but we can use supervised ones. For example, if we have attributes for each class, we can use them as class latent space.

Also we can get an embeddings in a different way with some other alorithms

Some problems

Two major problems faced by ZSL algorithms are the hubness problem and the bias towards the seen classes.

Questions?

https://t.me/NikitaDetkov