BERT:

Bidirectional Encoder Representations from
Transformers

Language model pre-training has been shown to be effective for improving many natural language processing tasks.

  • sentence-level tasks : predict the relationships between sentences

    • natural language inference, paraphrasing

  • token-level tasks : produce fine-grained
    output at the token level

    • entity recognition, question answering

Existing strategies for applying pre-trained language representations to downstream
tasks:

1. feature-based

  • ELMo

2. fine-tuning

  • Generative Pre-trained Transformer (OpenAI GPT)

Limitations in learning general language representations:

  • standard language models are unidirectional

  • this limits the choice of architectures that can be used during pre-training

Parsing / composition is a framework for language understanding.

How are humans able to understand language?

Compositionality principle: the meaning of word compounds is derived from the meaning of the individual words, and the manner in which those words are combined.

Hierarchical structure of language: through analysis, sentences can be broken down into simple structures such as clauses; clauses can be broken down into verb phrases and noun phrases, and so on.

Let us take a quick look at two different parse trees derived from the sentence "Bart watched a squirrel with binoculars".

Successive compositions yield the meaning of the sentence.

Example:

composing “a” with “squirrel”, then “watched” with “a squirrel”, and then “watched a squirrel” with “with binoculars”


Composition relies on the result of parsing to determine what ought to be composed.

Composition and parsing are both hard tasks, and they need one another.

"Bart watched a squirrel with binoculars"

"Bart watched a squirrel with binoculars"

Parse tree 1

Parse tree 2
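To make the ambiguity concrete, the sketch below represents the two readings as nested Python tuples: in one reading “with binoculars” attaches to the verb phrase (Bart uses the binoculars), in the other it attaches to “a squirrel”. The category labels, and which reading corresponds to the slides' “Parse tree 1” versus “Parse tree 2”, are assumptions made only for illustration.

```python
# Two readings of "Bart watched a squirrel with binoculars",
# written as nested tuples standing in for parse trees.
# All category labels are illustrative.

# Reading 1: the PP "with binoculars" attaches to the verb phrase
# (Bart used binoculars to watch).
parse_1 = ("S",
           ("NP", "Bart"),
           ("VP",
            ("V", "watched"),
            ("NP", ("Det", "a"), ("N", "squirrel")),
            ("PP", ("P", "with"), ("NP", ("N", "binoculars")))))

# Reading 2: the PP attaches to "a squirrel"
# (the squirrel has the binoculars).
parse_2 = ("S",
           ("NP", "Bart"),
           ("VP",
            ("V", "watched"),
            ("NP",
             ("Det", "a"), ("N", "squirrel"),
             ("PP", ("P", "with"), ("NP", ("N", "binoculars"))))))

def leaves(tree):
    """Read the words back off a tree, left to right."""
    if isinstance(tree, str):
        return [tree]
    _label, *children = tree
    return [word for child in children for word in leaves(child)]

# Both trees cover exactly the same words; only the structure
# (and therefore the composed meaning) differs.
assert leaves(parse_1) == leaves(parse_2)
print(" ".join(leaves(parse_1)))
```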

BERT is a new language representation model trained in two steps:

  1. pre-training

  2. fine-tuning

 

Several models have tried to put the combination of parsing and composition into practice.

  • To alleviate the unidirectionality constraint, BERT is pre-trained on two different, but related, NLP tasks: Masked Language Modeling and Next Sentence Prediction. 

  • BERT is also the first NLP technique to rely solely on the self-attention mechanism, which is made possible by the bidirectional Transformers at the center of BERT's design.

Pre-training:

  • model is trained on unlabeled data over different pre-training tasks.

 

Fine-tuning:

  • the BERT model is first initialized with the pre-trained parameters, and all of the parameters are fine-tuned using labeled data from the downstream tasks.

  • unified architecture across different tasks.

 

BERT is a bidirectional model, more powerful than either a left-to-right model or the shallow concatenation of a left-to-right and a right-to-left model.

Let us consider a question-answering example as the downstream task, using WordPiece embeddings (Wu et al., 2016) with a 30,000-token vocabulary.

BERT Input representation: 

  • A “sentence” can be an arbitrary span of contiguous text, rather than an actual linguistic sentence.
  • A “sequence” refers to the input token sequence to BERT, which may be a single sentence or two sentences packed together.

Example: a <Question, Answer> pair packed into a single sequence

[CLS]: a special classification token which is the first token of every sequence.

[SEP]: a special token to separate sentence pairs packed together in a single sequence

E: input embedding

C: final hidden vector of the special [CLS] token, \( C \in \mathbb R^H\)

\(T_i\): final hidden vector for the \(i^{th}\) input token, \( T_i \in \mathbb R^H\)

For a given token, its input representation is constructed by summing the corresponding token, segment, and position embeddings.

BERT input representation
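As a concrete illustration of this sum, here is a minimal PyTorch-style sketch (not the actual BERT implementation) that packs a <Question, Answer> pair as [CLS] question [SEP] answer [SEP] and adds token, segment, and position embeddings; the layer sizes and the integer token ids are made up for illustration.

```python
import torch
import torch.nn as nn

# Illustrative sizes: 30,000-token WordPiece vocabulary, hidden size H = 768.
vocab_size, max_len, num_segments, H = 30000, 512, 2, 768

token_emb = nn.Embedding(vocab_size, H)      # WordPiece token embeddings
segment_emb = nn.Embedding(num_segments, H)  # sentence A vs. sentence B
position_emb = nn.Embedding(max_len, H)      # learned position embeddings

# A packed pair: [CLS] q1 q2 [SEP] a1 a2 a3 [SEP]  (ids are made up)
token_ids = torch.tensor([[101, 2054, 2003, 102, 1037, 3835, 3185, 102]])
segment_ids = torch.tensor([[0, 0, 0, 0, 1, 1, 1, 1]])       # 0 = sentence A, 1 = sentence B
position_ids = torch.arange(token_ids.size(1)).unsqueeze(0)  # 0, 1, 2, ...

# E: input embedding for every token, shape (batch, sequence length, H)
E = token_emb(token_ids) + segment_emb(segment_ids) + position_emb(position_ids)
print(E.shape)  # torch.Size([1, 8, 768])
```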

 Pre-training BERT

We pre-train BERT using two unsupervised tasks.

Task #1: Masked Language Model (MLM)

Task #2: Next Sentence Prediction (NSP)

Task #1: Masked LM

  • A language model predicts a token using the context on its left.
  • To encode context bidirectionally for representing each token, BERT randomly masks tokens and uses tokens from the bidirectional context to predict the masked tokens in a self-supervised fashion.
  • The final hidden vectors corresponding to the mask tokens are fed into an output softmax over the vocabulary.
  • We only predict the masked words rather than reconstructing the entire input.

Input: "this movie is great"

  • a special “<mask>” token for 80% of the time
    • (e.g., “this movie is <mask>”);

If “great” is selected to be masked and predicted, then it will be replaced with:

  • a random token for 10% of the time
    • (e.g., “this movie is drink”);
  • the unchanged label token for 10% of the time
    • (e.g., “this movie is great”).
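A minimal sketch of this 80% / 10% / 10% corruption rule is given below, assuming the 15% token-selection rate used in the BERT paper; the function name and the toy vocabulary are made up for illustration.

```python
import random

MASK_TOKEN = "<mask>"  # the paper writes "[MASK]"; "<mask>" matches the example above

def mask_tokens(tokens, vocab, select_prob=0.15, seed=None):
    """Corrupt a token sequence for masked language modeling.

    Each selected token is replaced by <mask> 80% of the time, by a random
    token 10% of the time, and left unchanged 10% of the time. Returns the
    corrupted tokens and a dict of {position: original token} to predict.
    """
    rng = random.Random(seed)
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() >= select_prob:       # ~15% of tokens are selected
            continue
        targets[i] = tok                      # the model must recover the original token
        r = rng.random()
        if r < 0.8:
            corrupted[i] = MASK_TOKEN         # 80%: special mask token
        elif r < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: random token (e.g. "drink")
        # else: 10% keep the token unchanged ("this movie is great")
    return corrupted, targets

vocab = ["this", "movie", "is", "great", "drink", "watched"]
print(mask_tokens(["this", "movie", "is", "great"], vocab, seed=0))
```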


Task #2: Next Sentence Prediction (NSP)

  • A language model does not explicitly model the logical relationship between text pairs.
  • In order to train a model that understands sentence relationships,
    we pre-train for a binarized next sentence prediction task.

Input: sentences A and B

When generating sentence pairs for pre-training,

  • 50% of the time B is the actual next sentence that follows A
    • labeled as IsNext
  • 50% of the time B is a random sentence from the corpus
    • labeled as NotNext
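The following is a minimal sketch of how such labeled pairs could be generated; the helper name and the toy sentences are made up for illustration.

```python
import random

def make_nsp_example(doc_sentences, index, corpus_sentences, rng=random):
    """Build one next-sentence-prediction example.

    Sentence A is taken from a document; 50% of the time B is its true
    successor (IsNext), otherwise B is a random corpus sentence (NotNext).
    """
    sentence_a = doc_sentences[index]
    if rng.random() < 0.5:
        sentence_b, label = doc_sentences[index + 1], "IsNext"
    else:
        sentence_b, label = rng.choice(corpus_sentences), "NotNext"
    return sentence_a, sentence_b, label

doc = ["the man went to the store .", "he bought a gallon of milk ."]
corpus = ["penguins are flightless birds .", "the weather was cold ."]
print(make_nsp_example(doc, 0, corpus))
```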

 Fine-tuning BERT

The self-attention mechanism in the Transformer allows BERT to model many downstream tasks.

For each task, we simply plug the task-specific inputs and outputs into BERT and fine-tune all the parameters end-to-end.

At the input, sentence A and sentence B from pre-training are analogous to

  • sentence pairs in paraphrasing
  • hypothesis-premise pairs in entailment
  • question-passage pairs in question answering
  • a degenerate text-∅ pair (a single text with no second sentence) in text classification or sequence tagging

At the output,

  • the token representations are fed into an output layer for token-level tasks, such as sequence tagging or question answering
  • the [CLS] representation is fed into an output layer for classification, such as entailment or sentiment analysis.
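For the classification case, here is a minimal PyTorch sketch (not the reference implementation): the final hidden vector \(C\) of [CLS] feeds a small task-specific output layer, and the loss is backpropagated so that all parameters can be fine-tuned. The random tensor stands in for BERT's actual final hidden states, and the sizes are illustrative.

```python
import torch
import torch.nn as nn

H, num_labels = 768, 2                    # e.g. a two-class sentiment task (illustrative)
classifier = nn.Linear(H, num_labels)     # the only new parameters introduced for the task

# Stand-in for BERT's final hidden states: (batch, sequence length, H).
final_hidden = torch.randn(1, 8, H, requires_grad=True)
C = final_hidden[:, 0]                    # final hidden vector of the [CLS] token

logits = classifier(C)                                         # classification scores
loss = nn.functional.cross_entropy(logits, torch.tensor([1]))  # labeled downstream data
loss.backward()                           # in real fine-tuning this updates all BERT parameters
print(logits.shape, loss.item())
```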

 BERT Model Architecture

  • BERT’s model architecture is a multi-layer bidirectional Transformer encoder.
  • Apart from output layers, the same architectures are used in both pre-training and fine-tuning.
  • The same pre-trained model parameters are used to initialize models for different downstream tasks.
  • During fine-tuning, all parameters are fine-tuned.

There are two model sizes, based on:

  • L - number of layers (i.e., Transformer blocks)
  • H - hidden size
  • A - number of self-attention heads

\(\mathbf {BERT_{BASE}}\)

  1. L=12
  2. H=768
  3. A=12
  4. Total Parameters = 110M

\(\mathbf {BERT_{LARGE}}\)

  1. L=24
  2. H=1024
  3. A=16
  4. Total Parameters = 340M
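The two configurations can be instantiated as in the sketch below, assuming the Hugging Face transformers library (whose BertConfig uses its own argument names rather than the paper's L, H, A); the printed counts should come out close to 110M and 340M.

```python
# Sketch assuming the Hugging Face `transformers` library is installed.
from transformers import BertConfig, BertModel

bert_base = BertConfig(num_hidden_layers=12, hidden_size=768,
                       num_attention_heads=12, intermediate_size=3072)
bert_large = BertConfig(num_hidden_layers=24, hidden_size=1024,
                        num_attention_heads=16, intermediate_size=4096)

for name, config in [("BERT_BASE", bert_base), ("BERT_LARGE", bert_large)]:
    model = BertModel(config)                             # randomly initialized encoder
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: L={config.num_hidden_layers}, H={config.hidden_size}, "
          f"A={config.num_attention_heads}, ~{n_params / 1e6:.0f}M parameters")
```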

These models are fine-tuned on the General Language Understanding Evaluation (GLUE) benchmark.
