Data mining from text?

  • Extract entities (NER)
  • Extract parts of speech (POS)
  • Classification (Sentiment Analysis, etc.)
  • Natural language understanding (Co-reference resolution, relation extraction)

Classic approach

Use parsers and narrowly specialized algorithms.

Works well with small data.

A more recent way

Just throw it into a big neural network that was already pre-trained on a huge amount of data.

It isn't clear how it works, but it works, and better than the classic approach (to my knowledge)

Generic Pipeline Overview

How to use ML to conquer text?

  • Use text vectors (TF-IDF -> Word2Vec -> Smth2Vec)
  • Or, use a full-blown neural network (LSTM and variants)
  • Or, use a pre-trained model (BERT, OpenAI's GPT-2, etc)

Need a way to make text understandable for algorithms



  • Term Frequency - Inverse Document Frequency (TF-IDF)
  • Reflects the importance of a word in a document corpus
  • One of the most used and robust methods in practice
\left(0.5 + 0.5\,\frac{f_{t,d}}{\max_{t' \in d}{f_{t',d}}}\right)\cdot{\log{\frac{N}{n_{t}}}}
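A minimal sketch of this weighting in plain Python, reading the formula as the augmented term frequency (0.5 + 0.5 · f / max f) multiplied by log(N / n_t). The function name `tf_idf` and the toy corpus are made up for illustration, and every document is assumed to be a non-empty list of tokens.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document (hypothetical helper)."""
    n_docs = len(docs)                                   # N
    # n_t: number of documents that contain term t
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)                            # f_{t,d}
        max_count = max(counts.values())                 # max over terms in d
        weights.append({
            term: (0.5 + 0.5 * count / max_count)
                  * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Toy corpus, already tokenized
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs".split(),
]
scores = tf_idf(corpus)
```

In the first document "cat" ends up with a higher weight than "the": "the" is more frequent inside the document, but it also appears in two of the three documents, so its log(N / n_t) factor is smaller.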


Skip-gram and CBoW

CBoW - continuous bag of words

  • Tries to predict the middle word given a context window
  • Predicts a probability for every word in the vocabulary
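To make the "middle word given a context window" idea concrete, here is a small sketch of how CBoW training examples could be built from a sentence. It only forms the (context, middle word) pairs and does not train a model; `cbow_pairs` and the `window` parameter are illustrative names, not part of any library.

```python
def cbow_pairs(tokens, window=2):
    """Build (context_words, middle_word) pairs for CBoW (hypothetical helper)."""
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]       # up to `window` words before
        right = tokens[i + 1:i + 1 + window]      # up to `window` words after
        context = left + right
        if context:                               # skip one-word sentences
            pairs.append((context, target))
    return pairs

sentence = "the quick brown fox jumps".split()
pairs = cbow_pairs(sentence, window=1)
# pairs[1] is (["the", "brown"], "quick")
```

Skip-gram uses the same windows the other way round: the middle word is the input and each context word is a prediction target.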

Mikolov et al. - "Distributed Representations of Words and Phrases and their Compositionality"




Transformers use a mechanism called attention


Attention is a mechanism for creating context vectors: sums of hidden states weighted by alignment scores


Initially created for NMT (neural machine translation), attention helps propagate important signals further in time and also provides some interpretability

\begin{aligned}
\mathbf{c}_t &= \sum_{i=1}^n \alpha_{t,i} \boldsymbol{h}_i & \small{\text{; Context vector for output }y_t}\\
\alpha_{t,i} &= \text{align}(y_t, x_i) & \small{\text{; How well two words }y_t\text{ and }x_i\text{ are aligned.}}\\
&= \frac{\exp(\text{score}(\boldsymbol{s}_{t-1}, \boldsymbol{h}_i))}{\sum_{i'=1}^n \exp(\text{score}(\boldsymbol{s}_{t-1}, \boldsymbol{h}_{i'}))} & \small{\text{; Softmax of some predefined alignment score.}}\\
\text{score}(\boldsymbol{s}_{t-1}, \boldsymbol{h}_i) &= \mathbf{v}_a^\top \tanh(\mathbf{W}_a[\boldsymbol{s}_{t-1}; \boldsymbol{h}_i])
\end{aligned}
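A rough numeric sketch of the first three lines of the formula: alignment scores go through a softmax to give the weights α, and the context vector is the α-weighted sum of the hidden states. The scores and hidden states below are made-up numbers; in a real model the scores would come from the learned score function.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def context_vector(scores, hidden_states):
    """c_t = sum_i alpha_{t,i} * h_i, with alpha = softmax(scores)."""
    alphas = softmax(scores)
    dim = len(hidden_states[0])
    context = [
        sum(a * h[j] for a, h in zip(alphas, hidden_states))
        for j in range(dim)
    ]
    return alphas, context

# Three encoder hidden states h_1..h_3 (made-up values)
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical alignment scores for one output step t
scores = [2.0, 0.0, 0.0]
alphas, c_t = context_vector(scores, hidden)
```

The first hidden state has the highest score, so it gets the largest weight and dominates the context vector.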
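import torch
import torch.nn.functional as F
from torch import nn


class TransformerBlock(nn.Module):
  def __init__(self, k, heads):
    super().__init__()  # required before registering submodules

    self.attention = SelfAttention(k, heads=heads)

    self.norm1 = nn.LayerNorm(k)
    self.norm2 = nn.LayerNorm(k)

    # Position-wise feed-forward network with a 4x wider hidden layer
    self.ff = nn.Sequential(
      nn.Linear(k, 4 * k),
      nn.ReLU(),  # without a nonlinearity the two Linear layers collapse into one
      nn.Linear(4 * k, k))

  def forward(self, x):
    # Self-attention, then residual connection and layer norm
    attended = self.attention(x)
    x = self.norm1(attended + x)

    # Feed-forward, then residual connection and layer norm
    fedforward = self.ff(x)
    return self.norm2(fedforward + x)
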
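
class SelfAttention(nn.Module):
  def __init__(self, k, heads=8):
    super().__init__()  # required before registering submodules
    self.k, self.heads = k, heads
    # One k-dimensional projection per head, computed as a single wide layer
    self.tokeys    = nn.Linear(k, k * heads, bias=False)
    self.toqueries = nn.Linear(k, k * heads, bias=False)
    self.tovalues  = nn.Linear(k, k * heads, bias=False)
    # Projects the concatenated heads back down to k dimensions
    self.unifyheads = nn.Linear(heads * k, k)

  def forward(self, x):
    b, t, k = x.size()  # batch size, sequence length, embedding size
    h = self.heads

    queries = self.toqueries(x).view(b, t, h, k)
    keys    = self.tokeys(x).view(b, t, h, k)
    values  = self.tovalues(x).view(b, t, h, k)

    # Fold the heads into the batch dimension: (b, t, h, k) -> (b * h, t, k)
    keys    = keys.transpose(1, 2).contiguous().view(b * h, t, k)
    queries = queries.transpose(1, 2).contiguous().view(b * h, t, k)
    values  = values.transpose(1, 2).contiguous().view(b * h, t, k)

    # Scaling queries and keys by k ** (1/4) each divides their dot product by sqrt(k)
    queries = queries / (k ** (1/4))
    keys    = keys / (k ** (1/4))

    # Raw attention scores, shape (b * h, t, t)
    dot = torch.bmm(queries, keys.transpose(1, 2))

    # Normalize over the key dimension so each row sums to 1
    dot = F.softmax(dot, dim=2)
    out = torch.bmm(dot, values).view(b, h, t, k)

    # Unfold the heads and concatenate them: (b, h, t, k) -> (b, t, h * k)
    out = out.transpose(1, 2).contiguous().view(b, t, h * k)
    return self.unifyheads(out)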
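
One detail in the forward pass that is easy to miss: queries and keys are each divided by k ** (1/4) before the dot product. A quick check in plain Python, with made-up vectors, that this matches the usual scaled dot-product form where the dot product is divided once by sqrt(k):

```python
import math

def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

k = 4                                   # embedding size of the toy vectors
query = [1.0, 2.0, 3.0, 4.0]
key = [0.5, -1.0, 2.0, 0.0]

# Scale queries and keys separately by k ** (1/4), as in the forward pass
scale = k ** (1 / 4)
scaled_separately = dot([q / scale for q in query],
                        [x / scale for x in key])

# Scale the raw dot product once by sqrt(k)
scaled_once = dot(query, key) / math.sqrt(k)
```

Both values agree up to floating-point error. Scaling the two operands separately keeps the intermediate values smaller, which is presumably why the code does it that way.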

Think that's it?


*entering the cutting edge of the domain*

Multi-task learning

Experiments show that having a single deep learning model learn to perform multiple related tasks enhances its performance

For example, BERT uses two pre-training tasks: filling in missing words (Cloze test) and next sentence prediction
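
A simplified sketch of how Cloze-style training examples could be generated from raw tokens. This is not BERT's exact procedure (BERT selects about 15% of tokens and does not always replace the selected ones with the mask token); `mask_tokens`, `mask_prob` and `seed` are illustrative names.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide random tokens and return (inputs, labels) for a Cloze task."""
    rng = random.Random(seed)          # fixed seed keeps the example repeatable
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(tok)         # the model must recover this word
        else:
            inputs.append(tok)
            labels.append(None)        # position is ignored by the loss
    return inputs, labels

tokens = "the cat sat on the mat".split()
inputs, labels = mask_tokens(tokens, mask_prob=0.5)
```

The labels come straight from the original text, so no manual annotation is needed.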


Self-supervision - extracting a supervision signal from the data itself. Feels like unsupervised learning, works like supervised

BERT, in fact, uses self-supervision for pre-training.

Language modeling, back-translation, next sentence prediction and the Cloze test are some self-supervised methods for text
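
As an illustration of where the supervision signal comes from in language modeling, here is a minimal sketch that turns unlabeled text into (prefix, next word) training pairs. `next_token_examples` is a made-up helper name.

```python
def next_token_examples(tokens):
    """Turn a token sequence into (prefix, next_token) training pairs."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

examples = next_token_examples("to be or not to be".split())
# examples[0] is (["to"], "be")
```

Every position in the text yields a labeled example for free, which is what makes pre-training on huge unlabeled corpora practical.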

