Notes on GPT-2 and BERT
2020-12-06
Repost from this page.
GPT-2
-
GPT-2 is a sentence-generative language model that came out of OpenAI in 2019. It caused quite a stir, in particular due to some “too dangerous to humanity” nonsense marketing around it.
-
GPT-2 was one of the hottest developments in ML in 2019. This notebook is an exploration of this algorithm. For some precursor details see the following two notebooks: “Notes on word embedding algorithms” and “Transformer architecture, self-attention”.
-
GPT-2 and BERT are the two leading language models out there at time of writing in early 2020. They are alike in that they are both based on the transformer architecture, but they are fundamentally different in that BERT keeps just the encoder blocks from the transformer, whilst GPT-2 keeps just the decoder blocks.
-
GPT-2 works like a traditional language model in that it takes word vectors as input and produces estimates of the probability of the next word as output. It is auto-regressive in nature: each token in the sentence has the context of the previous words only. Thus GPT-2 works one token at a time. BERT, by contrast, is not auto-regressive. It uses the entire surrounding context all-at-once. (Q: so what?)
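A minimal sketch of the one-token-at-a-time loop this implies; `next_token_probs` is a hypothetical stand-in for the real model, not GPT-2’s actual interface:

```python
import numpy as np

# Sketch of the auto-regressive loop GPT-2-style models run at generation
# time. next_token_probs is a placeholder for the stacked decoder blocks.
VOCAB_SIZE = 50_000

def next_token_probs(prefix_ids):
    # Hypothetical stand-in: a real model would run the transformer here;
    # we just return a uniform distribution for illustration.
    return np.full(VOCAB_SIZE, 1.0 / VOCAB_SIZE)

def generate(prompt_ids, n_new_tokens):
    ids = list(prompt_ids)
    for _ in range(n_new_tokens):
        probs = next_token_probs(ids)      # context = prior tokens only
        ids.append(int(np.argmax(probs)))  # greedy pick of the next token
    return ids

print(generate([464, 3290], n_new_tokens=3))
```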
Architecture details
-
GPT-2 consists solely of stacked decoder blocks from the transformer architecture. In the standard transformer architecture, the decoder is fed a word embedding concatenated with a context vector, both generated by the encoder. In GPT-2 the context vector is zero-initialized for the first word embedding (presumably?).
-
Furthermore, in the standard transformer architecture, self-attention is applied to the entire surrounding context, i.e. all of the other words in the sentence. In GPT-2 masked self-attention is used instead: the decoder is only allowed (via masking out of the remaining word positions) to glean information from the prior words in the sentence (plus the word itself).
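A toy single-head version of that masked self-attention, assuming the standard scaled-dot-product formulation rather than GPT-2’s exact code:

```python
import numpy as np

# Masked (causal) self-attention for one head, toy dimensions.
def masked_self_attention(Q, K, V):
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                    # (n, n) attention logits
    mask = np.triu(np.ones((n, n), dtype=bool), k=1) # True above the diagonal
    scores[mask] = -1e9                              # hide future positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # row i sees tokens <= i

rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(4, 8))                  # 4 tokens, dim 8
print(masked_self_attention(Q, K, V).shape)          # (4, 8)
```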
-
Besides that, GPT-2 is a close copy of the basic transformer architecture.
-
The word vectors used for the first layer of GPT-2 are not simple one-hot tokenizations but byte pair encodings. The byte pair encoding scheme compresses an (arbitrarily large?) tokenized word list into a fixed vocabulary size by recursively keying the most common word components to unique values (e.g. ‘ab’=010010, ‘sm’=100101, ‘qu’=111100, etcetera).
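A rough word-level sketch of the merge loop at the heart of byte pair encoding (GPT-2’s real tokenizer operates on bytes and learns roughly 50,000 merges, so this is only illustrative):

```python
from collections import Counter

# One BPE merge round: find the most frequent adjacent symbol pair in the
# corpus and fuse it into a single new symbol.
def most_frequent_pair(words):
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(pair, words):
    a, b = pair
    return {word.replace(f"{a} {b}", a + b): freq for word, freq in words.items()}

# Words pre-split into characters, with corpus frequencies.
words = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}
for _ in range(3):                       # apply three merge rounds
    pair = most_frequent_pair(words)
    words = merge_pair(pair, words)
    print(pair, words)
```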
-
The corpus used to train GPT-2 is an industrial-scale web scrape.
-
GPT-2 is trained in the standard transformer way, with a batch size of 512, a fixed sequence length, and a vocabulary size of 50,000. At evaluation time, the model switches to expecting input one word at a time. This is done by temporarily saving the necessary past context vectors as object properties.
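A hedged sketch of what I take that caching to mean: the keys and values computed for earlier tokens are stored on the object, so each new token only needs its own projections plus the cache (toy dimensions, one head):

```python
import numpy as np

class CachedAttention:
    def __init__(self, d):
        self.d = d
        self.past_k = np.zeros((0, d))   # cached keys from earlier steps
        self.past_v = np.zeros((0, d))   # cached values from earlier steps

    def step(self, q, k, v):
        # q, k, v: (1, d) projections for the single new token.
        self.past_k = np.vstack([self.past_k, k])
        self.past_v = np.vstack([self.past_v, v])
        scores = q @ self.past_k.T / np.sqrt(self.d)   # (1, t)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ self.past_v                   # (1, d)

attn = CachedAttention(d=8)
rng = np.random.default_rng(0)
for _ in range(3):                       # feed three tokens one at a time
    out = attn.step(*(rng.normal(size=(1, 8)) for _ in range(3)))
print(out.shape)                         # (1, 8)
```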
-
GPT-2 is trained on the standard task: given a sequence of prior words, predict the next word.
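In loss terms that task is just next-token cross-entropy with the targets shifted by one position; a toy version with random logits standing in for the model’s output:

```python
import numpy as np

VOCAB = 50_000
tokens = np.array([464, 3290, 3332, 319, 262, 2603])   # an encoded sentence

inputs, targets = tokens[:-1], tokens[1:]              # shift by one position
rng = np.random.default_rng(0)
logits = rng.normal(size=(len(inputs), VOCAB))         # toy model output

# Numerically stable log-softmax, then pick out the true next tokens.
shifted = logits - logits.max(axis=-1, keepdims=True)
log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
loss = -log_probs[np.arange(len(targets)), targets].mean()
print(loss)   # average negative log-likelihood of the true next tokens
```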
BERT
-
BERT predates GPT-2 slightly; it was released in 2018 (note: before BERT there was ULMFIT and ELMO, and before that there was word2vec).
-
BERT, like GPT-2, uses the transformer architecture. However, it uses the encoder part instead of the decoder part.
-
The transformer decoder is a natural fit for the word embedding learning task because it works backwards only: i.e. the token at a certain position in the sentence only has access to the previous tokens. This is considered a good thing because it’s been shown, in practice, that a sufficiently complex network which naively includes the posterior sequence of words as context when determining a word suffers from target leakage (see “Leakage, especially knowledge leakage”).
AFAICT, this has to do with the fact that the task asked of the model is “predict the next word in the sequence”, not “predict the next word in the sequence given all of the words that come both before and after it”.
“Everybody knows bidirectional conditioning would allow each word to indirectly see itself in a multi-layered context.”
ELMO has access to both prior and posterior information. However, it uses a bidirectional LSTM, which is a weaker context transfer learner (given a sufficiently long sequence) than self-attention is. This is a sort of implicit regularization built into ELMO that BERT, with its over-eager learner, doesn’t have.
BERT gets around this by modifying the task. Whereas GPT-2 learns on the “predict next” task directly, BERT learns on the task “predict the words in a sentence in which 15% of the words are masked out”. The masking is a form of regularization; it withholds just enough to prevent the algorithm from cheating through rote memorization. It will also rarely replace a word with a different (probably wrong) word (basically reverse teacher forcing?).
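A toy word-level sketch of that corruption step, assuming the usual 80% [MASK] / 10% random-word / 10% unchanged split within the selected 15%:

```python
import random

VOCAB = ["the", "cat", "sat", "on", "mat", "dog", "ran"]

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    rng = random.Random(seed)
    corrupted, targets = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_rate:
            continue                          # 85% of positions: leave alone
        targets[i] = tok                      # the model must predict this token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"           # usually: hide the word
        elif roll < 0.9:
            corrupted[i] = rng.choice(VOCAB)  # rarely: swap in a (likely wrong) word
        # else: keep the original word but still predict it
    return corrupted, targets

print(mask_tokens("the cat sat on the mat".split()))
```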
-
To make BERT better at generalizing to two-sentence tasks (e.g. compare two sentences semantically, is this sentence likely to follow this other one, etcetera) BERT also receives training on a “is this sentence likely to follow this other one?” task. For this training task the sentence input is formatted like so: [CLS] …sentence1 [SEP] …sentence2 [SEP]. The prediction is a softmax over the model’s output at the first position, the [CLS] token. In fact! Every sentence inputted to BERT starts with the [CLS] token for precisely this reason.
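A sketch of the pair packing and the [CLS]-based classification head; the dimensions, the `pack_pair` helper, and the random hidden states are placeholders, not BERT’s real encoder:

```python
import numpy as np

def pack_pair(sent_a, sent_b):
    tokens = ["[CLS]"] + sent_a + ["[SEP]"] + sent_b + ["[SEP]"]
    segments = [0] * (len(sent_a) + 2) + [1] * (len(sent_b) + 1)
    return tokens, segments            # segment ids mark sentence A vs B

tokens, segments = pack_pair("the cat sat".split(), "it was tired".split())

rng = np.random.default_rng(0)
hidden = rng.normal(size=(len(tokens), 768))   # stand-in encoder output
W = rng.normal(size=(768, 2))                  # is-next / not-next head

logits = hidden[0] @ W                         # row 0 = the [CLS] position
probs = np.exp(logits) / np.exp(logits).sum()  # softmax over the two classes
print(tokens, probs)
```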