Natural Language Processing Techniques

CBOW and Skip-Gram

Robson Cruz

Word Embeddings?

A word embedding is a "learned representation" of text in which words with similar meanings have similar representations.

It is currently regarded as one of the major advances driving progress in Natural Language Processing (NLP).

The general idea is to represent each word of a vocabulary as a real-valued vector in a predefined vector space.

The key point of the approach is the use of dense vectors whose dimensionality is much smaller than the size of the vocabulary.

associate with each word in the vocabulary a distributed word feature vector … The feature vector represents different aspects of the word: each word is associated with a point in a vector space. The number of features … is much smaller than the size of the vocabulary

Bengio, Y. & Ducharme, Réjean & Vincent, Pascal. (2000). A Neural Probabilistic Language Model. Journal of Machine Learning Research.

This offers advantages over Bag-of-Words (BoW).

Bag-of-Words?

A model based on how often words occur. The intuition is that documents are similar if they have similar content.

Vocabulary
A measure of occurrence frequency (e.g. TF-IDF)

Data collection

Vocabulary construction

Creation of document vectors

It was the best of times,
it was the worst of times,
it was the age of wisdom,
it was the age of foolishness

“it”
“was”
“the”
“best”
“of”
“times”
“worst”
“age”
“wisdom”
“foolishness”

"it was the worst of times"

[1, 1, 1, 0, 1, 1, 1, 0, 0, 0]
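The three steps above can be sketched in a few lines of Python. This is only a minimal illustration: the helper names (`tokenize`, `build_vocabulary`, `binary_vector`) are hypothetical, and the vector marks presence or absence of each word rather than a TF-IDF weight.

```python
# Minimal bag-of-words sketch; helper names are illustrative assumptions.
corpus = [
    "It was the best of times,",
    "it was the worst of times,",
    "it was the age of wisdom,",
    "it was the age of foolishness",
]

def tokenize(text):
    """Lowercase the text, drop commas and periods, split on whitespace."""
    return [token.strip(".,").lower() for token in text.split()]

def build_vocabulary(documents):
    """Collect unique tokens in order of first appearance."""
    vocabulary = []
    for document in documents:
        for token in tokenize(document):
            if token not in vocabulary:
                vocabulary.append(token)
    return vocabulary

def binary_vector(text, vocabulary):
    """Return 1 for each vocabulary word present in the text, else 0."""
    present = set(tokenize(text))
    return [1 if word in present else 0 for word in vocabulary]

vocabulary = build_vocabulary(corpus)
vector = binary_vector("it was the worst of times", vocabulary)
print(vocabulary)
print(vector)
```

Note that the vector has one position per vocabulary word, so it grows with the vocabulary and is mostly zeros for short documents.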

Problems?

Vocabulary management
Sparse representations
Loss of meaning

Algorithms

Different techniques are used to learn word embeddings

Embedding layer

Word2Vec

GloVe

Word2Vec

A statistical method published in 2013 by Tomas Mikolov et al. (Google)

Based on the idea that words appearing in similar contexts have similar meanings

is a two-layer neural network that takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space

Mikolov, Tomas; et al. (2013). "Efficient Estimation of Word Representations in Vector Space"

One-Hot encoding

King, Queen, Man, Woman, Child

[0, 1, 0, 0, 0]

King

Queen

Woman

Man

Child
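A small sketch of one-hot encoding for the five words above, assuming the word order in which they are listed. The helper names `one_hot` and `dot` are hypothetical. It also shows why this encoding carries no notion of similarity: the dot product of any two different words is zero.

```python
# One-hot encoding sketch; the word order follows the list on the slide.
words = ["King", "Queen", "Man", "Woman", "Child"]

def one_hot(word, vocabulary):
    """Return a vector of zeros with a single 1 at the word's index."""
    vector = [0] * len(vocabulary)
    vector[vocabulary.index(word)] = 1
    return vector

def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

king = one_hot("King", words)
queen = one_hot("Queen", words)
print(queen)
print(dot(king, queen))  # different words are always orthogonal
```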

Distributed Encoding

The idea is to represent meaning abstractly, spreading it across the dimensions of a dense vector

Mikolov, Tomas; et al. (2013). "Efficient Estimation of Word Representations in Vector Space"
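To make the contrast with one-hot vectors concrete, the sketch below uses hand-picked three-dimensional vectors. The values are purely hypothetical (loosely readable as royalty, masculinity and youth). Real embeddings are learned from data and their dimensions are not directly interpretable.

```python
import math

# Hypothetical, hand-assigned vectors for illustration only.
embeddings = {
    "King":  [0.9, 0.9, 0.1],
    "Queen": [0.9, 0.1, 0.1],
    "Man":   [0.1, 0.9, 0.1],
    "Woman": [0.1, 0.1, 0.1],
    "Child": [0.1, 0.5, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Vector arithmetic: King - Man + Woman
analogy = [
    k - m + w
    for k, m, w in zip(embeddings["King"], embeddings["Man"], embeddings["Woman"])
]

# Find the word whose vector points in the most similar direction.
nearest = max(embeddings, key=lambda word: cosine(analogy, embeddings[word]))
print(nearest)
```

With these made-up values the nearest word to King - Man + Woman is Queen, which mirrors the kind of regularity reported for learned embeddings.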

Using Context

CBOW

Continuous Bag of Words

Uses the context of a word (its surrounding words) to predict the "output" word

The training objective is to maximize the conditional probability of observing the actual output word (the focus word) given the input context words, with regard to the weights.

Input (1 \times V):

[0, 1, 0]

Weights (N \times V):

\begin{bmatrix} w_{11} & w_{12} & \ldots & w_{1V} \\ \vdots & \vdots & \ddots & \vdots \\ w_{N1} & w_{N2} & \ldots & w_{NV} \end{bmatrix}

Output:

[w_1, w_2, w_3]
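A rough sketch of the CBOW forward pass in plain Python, under several assumptions not stated on the slide: the input weights are stored as N x V (as labelled above), a second V x N output matrix is used, the context vectors are averaged, and the weights are random rather than trained. All names (`W_in`, `W_out`, `cbow_forward`) are hypothetical.

```python
import math
import random

random.seed(0)
vocabulary = ["it", "was", "the", "best", "of", "times"]
V, N = len(vocabulary), 3  # N = 3 is an arbitrary size for this sketch

# Input weights, N x V: column j holds the vector of word j.
W_in = [[random.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(N)]
# Output weights, V x N: row k scores word k against the hidden vector.
W_out = [[random.uniform(-0.5, 0.5) for _ in range(N)] for _ in range(V)]

def word_vector(word):
    """Multiplying a one-hot input by the weights just selects one column."""
    j = vocabulary.index(word)
    return [W_in[i][j] for i in range(N)]

def cbow_forward(context_words):
    """Average the context vectors, score every word, apply softmax."""
    vectors = [word_vector(w) for w in context_words]
    hidden = [sum(v[i] for v in vectors) / len(vectors) for i in range(N)]
    scores = [sum(W_out[k][i] * hidden[i] for i in range(N)) for k in range(V)]
    max_score = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = cbow_forward(["was", "best"])  # context around the focus word "the"
loss = -math.log(probs[vocabulary.index("the")])  # negative log-likelihood
print(probs)
print(loss)
```

Training would adjust both weight matrices to lower this loss, which is the same as maximizing the conditional probability of the focus word given its context.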

Skip-Gram

The idea behind Skip-Gram is to work in the opposite direction to CBOW.

Given an input word, the model produces output vectors for the words likely to appear around it

This implies that the link (activation) function of the hidden layer units is simply linear (i.e., directly passing its weighted sum of inputs to the next layer).
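As an illustration of the inverted direction, the sketch below builds Skip-Gram training pairs of the form (input word, context word). The function name `skipgram_pairs` and the `window` parameter are assumptions made for this example. The slides do not specify a window size.

```python
def skipgram_pairs(tokens, window=2):
    """Pair each word with every neighbour within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "it was the best of times".split()
pairs = skipgram_pairs(tokens, window=1)
for center, context in pairs:
    print(center, "->", context)
```

Each pair is one training instance: the first word is the input and the second is a word the model should assign high probability to.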

Optimization

Updating every output vector for each training instance is computationally expensive...

To solve this problem, an intuition is to limit the number of output vectors that must be updated per training instance. One elegant approach to achieving this is hierarchical softmax; another approach is through sampling.

The main advantage is that instead of evaluating V output nodes in the neural network to obtain the probability distribution, it is needed to evaluate only about log2(V) words… In our work we use a binary Huffman tree, as it assigns short codes to the frequent words which results in fast training.
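The effect of the Huffman tree mentioned in the quote can be seen with a small sketch built on Python's standard `heapq` module. It only constructs the binary codes from word counts. It does not implement hierarchical softmax itself, and the function name `huffman_codes` is hypothetical.

```python
import heapq
from collections import Counter
from itertools import count

def huffman_codes(frequencies):
    """Build a Huffman tree and return a binary code string per word."""
    tie = count()  # tie-breaker so the heap never compares tree nodes
    heap = [(freq, next(tie), word) for word, freq in frequencies.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tie), (left, right)))
    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):  # internal node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:  # leaf: a word
            codes[node] = prefix or "0"

    walk(heap[0][2], "")
    return codes

text = ("it was the best of times, it was the worst of times, "
        "it was the age of wisdom, it was the age of foolishness")
frequencies = Counter(text.replace(",", "").split())
codes = huffman_codes(frequencies)
for word, code in sorted(codes.items(), key=lambda item: len(item[1])):
    print(word, frequencies[word], code)
```

The code length of a word is the number of tree nodes visited to reach it, so frequent words such as "it" need fewer evaluations than rare words such as "best".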

Subsampling of frequent words to reduce the imbalance between frequent and rare words
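A minimal sketch of subsampling, assuming the discard rule from Mikolov et al.'s follow-up paper "Distributed Representations of Words and Phrases and their Compositionality" (2013): a word w is dropped with probability 1 - sqrt(t / f(w)), where f(w) is its relative frequency and t is a threshold. The paper suggests t around 1e-5 for large corpora. The value 0.05 below is an arbitrary choice so that the effect is visible on a tiny corpus, and implementations differ in the exact formula.

```python
import math
from collections import Counter

def discard_probability(relative_frequency, threshold):
    """Probability of dropping a word; rare words are never dropped."""
    return max(0.0, 1.0 - math.sqrt(threshold / relative_frequency))

text = ("it was the best of times, it was the worst of times, "
        "it was the age of wisdom, it was the age of foolishness")
tokens = text.replace(",", "").split()
counts = Counter(tokens)
total = len(tokens)

threshold = 0.05  # assumption for this toy corpus, not a recommended value
discard = {
    word: discard_probability(n / total, threshold)
    for word, n in counts.items()
}
for word, p in sorted(discard.items(), key=lambda item: -item[1]):
    print(f"{word:12s} count={counts[word]} p_discard={p:.3f}")
```

Frequent words such as "it" and "the" get a sizeable discard probability, while words that occur once are always kept.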

Demo

CBOW and Skip-Gram