BERT TLDR

BERT is an encoder-only model that is pre-trained on unlabeled text by conditioning on both the left and right context (vs. just the text to the left). You can then fine-tune the learned model for other tasks.

Abstract

  • Pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers (as opposed to GPT which does left-to-right only).
  • The pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models (sketched after this list).
  • Obtains new state-of-the-art results on eleven natural language processing tasks.
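
As a rough illustration of what "one additional output layer" means in practice, here is a minimal sketch (PyTorch; the encoder interface, hidden size, and class count are illustrative assumptions, not the paper's code). It puts a single linear classifier on top of the aggregate representation of the first token (the [CLS] token described under Method):

import torch.nn as nn

class BertClassifier(nn.Module):
    # Pre-trained BERT encoder plus one linear output layer for a downstream task.
    def __init__(self, bert_encoder, hidden_size=768, num_classes=2):
        super().__init__()
        self.bert = bert_encoder                              # pre-trained encoder, fine-tuned end to end
        self.classifier = nn.Linear(hidden_size, num_classes) # the "one additional output layer"

    def forward(self, input_ids, segment_ids):
        hidden = self.bert(input_ids, segment_ids)  # assumed shape: (batch, seq_len, hidden_size)
        cls_repr = hidden[:, 0]                     # final hidden state of the first ([CLS]) token
        return self.classifier(cls_repr)            # task logits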

Method

  • BERT Transformer uses bidirectional self-attention, while the GPT Transformer uses constrained self-attention where every token can only attend to context to its left
  • Uses WordPiece embeddings (WordPiece model) with a 30,000 token vocabulary to embed sequences as vectors.
  • The first token of every sequence is always a special classification token ([CLS]). The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks.
  • Sentence pairs are packed together into a single sequence. The sentences are differentiated in two ways:
    • First, we separate them with a special token ([SEP]).
    • Second, we add a learned embedding to every token indicating whether it belongs to sentence A or sentence B.

For a given token, its input representation is constructed by summing the corresponding token, segment, and position embeddings.
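
A minimal sketch of this sum of embeddings (PyTorch; the vocabulary size matches the 30,000-token WordPiece vocabulary above, while the hidden size and maximum length here follow BERT-base defaults and are otherwise illustrative):

import torch
import torch.nn as nn

class BertEmbeddings(nn.Module):
    # Input representation = token embedding + segment embedding + position embedding, summed per token.
    def __init__(self, vocab_size=30000, max_len=512, hidden_size=768):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden_size)  # WordPiece token embeddings
        self.segment_emb = nn.Embedding(2, hidden_size)          # sentence A (0) vs. sentence B (1)
        self.position_emb = nn.Embedding(max_len, hidden_size)   # learned position embeddings

    def forward(self, input_ids, segment_ids):
        # input_ids / segment_ids: (batch, seq_len), e.g. [CLS] sentence A [SEP] sentence B [SEP]
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        positions = positions.unsqueeze(0).expand_as(input_ids)
        return (self.token_emb(input_ids)
                + self.segment_emb(segment_ids)
                + self.position_emb(positions))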

Pre-training

  • BERT is not pre-trained with a traditional left-to-right or right-to-left language model.
  • Instead, the two pre-training tasks below are used in conjunction, and their losses are summed. Ex:
# criterion here is a cross-entropy loss; the model returns vocabulary logits for the
# masked positions (logits_lm) and an IsNext/NotNext classification logit (logits_clsf)
logits_lm, logits_clsf = model(input_ids, segment_ids, masked_pos)
loss_lm = criterion(logits_lm.transpose(1, 2), masked_tokens)  # for masked LM
loss_lm = (loss_lm.float()).mean()
loss_clsf = criterion(logits_clsf, isNext)  # for sentence classification (next sentence prediction)
loss = loss_lm + loss_clsf  # total pre-training loss = masked LM loss + NSP loss


Masked LM (Masked Language Modeling)

  • Intuition: a deep bidirectional model is strictly more powerful than either a left-to-right model or the shallow concatenation of a left-to-right and a right-to-left model.
  • We simply mask some percentage of the input tokens at random, and then predict those masked tokens.
  • MLM converges marginally slower than a left-to-right model (which predicts every token), but the empirical improvements of the MLM model far outweigh the increased training cost.
  • Cross entropy loss is used as the criterion to evaluate how well the masked words were predicted.
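
A minimal sketch of this mask-then-predict step (PyTorch; the 15% mask rate is the paper's setting, but the [MASK] id and the simplification of always replacing a chosen token with [MASK] are assumptions here — the full recipe sometimes keeps the original token or swaps in a random one):

import torch
import torch.nn.functional as F

MASK_ID = 103        # illustrative id of the [MASK] token in the vocabulary
IGNORE_INDEX = -100  # unmasked positions contribute nothing to the loss

def mask_tokens(input_ids, mask_prob=0.15):
    # Randomly choose ~15% of positions; those become the prediction targets.
    labels = input_ids.clone()
    chosen = torch.rand(input_ids.shape, device=input_ids.device) < mask_prob
    labels[~chosen] = IGNORE_INDEX            # only the chosen (masked) tokens are predicted
    masked_ids = input_ids.clone()
    masked_ids[chosen] = MASK_ID              # simplified: always replace with [MASK]
    return masked_ids, labels

def mlm_loss(logits, labels):
    # logits: (batch, seq_len, vocab_size); cross entropy evaluated only at masked positions.
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           labels.reshape(-1), ignore_index=IGNORE_INDEX)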

Next Sentence Prediction

  • We pre-train for a binarized next sentence prediction task.
  • When choosing the sentences A and B for each pre-training example, 50% of the time B is the actual next sentence that follows A (labeled as IsNext), and 50% of the time it is a random sentence from the corpus (labeled as NotNext). The model then has to predict whether sentence B is IsNext or NotNext.
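
A minimal sketch of building such 50/50 pairs (plain Python; the corpus layout, function name, and the label encoding IsNext = 1 / NotNext = 0 are assumptions for illustration):

import random

def make_nsp_pair(corpus):
    # corpus: list of documents, each a list of sentences; assumes every document has >= 2 sentences.
    doc = random.choice(corpus)
    idx = random.randrange(len(doc) - 1)
    sentence_a = doc[idx]
    if random.random() < 0.5:
        sentence_b = doc[idx + 1]                          # the actual next sentence -> IsNext
        label = 1
    else:
        sentence_b = random.choice(random.choice(corpus))  # a random sentence from the corpus -> NotNext
        label = 0
    return sentence_a, sentence_b, label

Each pair is then packed as [CLS] sentence A [SEP] sentence B [SEP] (as described under Method), and the sentence-classification logit (logits_clsf in the snippet above) is trained against this label.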