Transformers are the basis of GPT: Giving attention to transformers
NanoGPT
https://github.com/karpathy/nanoGPT
GPT-3
Specific changes GPT makes to the original transformer architecture: further notes and the current status quo
Dropout is used, layer norms are applied before the attention and MLP sub-layers rather than after them (pre-LN instead of the original post-LN formulation), and the encoder is removed entirely: GPT is decoder-only, since it only generates a continuation of its input text rather than translating a source sequence into a target sequence as in the original paper. A sketch of such a block follows.
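A minimal sketch of a pre-LN, decoder-only block in the spirit of nanoGPT. The hyperparameter names (`n_embd`, `n_head`) follow nanoGPT's conventions, but the code itself is illustrative and uses PyTorch's built-in `nn.MultiheadAttention` rather than nanoGPT's hand-rolled attention.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Illustrative GPT-style decoder block: pre-LayerNorm, causal attention, GELU MLP."""
    def __init__(self, n_embd=768, n_head=12, dropout=0.1):
        super().__init__()
        self.ln_1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, dropout=dropout, batch_first=True)
        self.ln_2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),                       # GPT uses GELU instead of ReLU
            nn.Linear(4 * n_embd, n_embd),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        # LayerNorm is applied *before* each sub-layer (pre-LN), not after as in
        # the original "Attention Is All You Need" post-LN formulation.
        T = x.size(1)
        causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln_1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal_mask, need_weights=False)
        x = x + attn_out                     # residual connection around attention
        x = x + self.mlp(self.ln_2(x))       # residual connection around the MLP
        return x
```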
Training
- Pre-training: The original transformer was trained end-to-end on a supervised machine-translation dataset, whereas the GPT line of models is first pre-trained on a large amount of unlabeled text with a simple next-token-prediction objective; the GPT-3 paper then leans on few-shot prompting of that pre-trained model rather than task-specific fine-tuning.
- This means that after pre-training alone the model simply continues whatever text it is given, in the style of its training data: GPT-3 saw roughly 300 billion tokens, largely filtered web text plus books and Wikipedia, so an under-prompted model behaves like a generic internet document completer and its behavior is only as well-defined as the internet text it was trained on (the objective and sampling loop are sketched below).
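A sketch of the next-token-prediction objective and of sampling from a pre-trained model. The `model` interface here is an assumption: it is taken to map a `(B, T)` batch of token ids to `(B, T, vocab_size)` logits, roughly as nanoGPT's `GPT` module does.

```python
import torch
import torch.nn.functional as F

def pretraining_loss(model, tokens):
    # Predict the token at position t+1 from everything up to position t:
    # cross-entropy between the logits and the one-step-shifted sequence.
    inputs, targets = tokens[:, :-1], tokens[:, 1:]
    logits = model(inputs)                               # (B, T-1, vocab_size)
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))

@torch.no_grad()
def sample(model, prompt_ids, max_new_tokens=50, temperature=1.0):
    # A pre-trained model only knows how to continue text: sampling just keeps
    # appending the next predicted token to the prompt.
    ids = prompt_ids
    for _ in range(max_new_tokens):
        logits = model(ids)[:, -1, :] / temperature      # logits for the last position
        probs = F.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        ids = torch.cat([ids, next_id], dim=1)
    return ids
```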
Regularization
- Dropout: The transformer paper applied dropout to sub-layer outputs and embeddings to prevent overfitting; the GPT-3 paper instead reports only a small amount of decoupled weight decay (0.1) as explicit regularization (see the snippet below).
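A short comparison of the two regularizers. The placeholder model and the learning rate are purely illustrative; the Adam betas and the weight decay of 0.1 are the values reported in the GPT-3 paper.

```python
import torch.nn as nn
from torch.optim import AdamW

# Original transformer: dropout (p=0.1 in the paper) on sub-layer outputs and embeddings.
dropout = nn.Dropout(p=0.1)

# GPT-3 style explicit regularization: decoupled weight decay in the optimizer
# (the paper reports Adam with betas (0.9, 0.95) and weight decay 0.1).
model = nn.Linear(768, 768)          # placeholder model, purely illustrative
optimizer = AdamW(model.parameters(), lr=6e-4, betas=(0.9, 0.95), weight_decay=0.1)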
Computation
- GELU: GPT-3 uses the GELU activation in its MLP blocks instead of the ReLU used in the original transformer (compared in the first snippet after this list).
- Learning rate: The transformer paper used a schedule with linear warmup followed by inverse-square-root decay, while GPT-3 used linear warmup followed by cosine decay down to 10% of the peak learning rate (both schedules are sketched after this list).
- Attention: GPT-3 is decoder-only and uses causal (unidirectional, masked) self-attention, so each token can only attend to tokens at or before its own position; the original transformer's encoder attends bidirectionally over the whole input. GPT-3 also alternates dense and locally banded sparse attention layers (a causal mask is shown after this list).
- Generative vs discriminative: The original transformer paper targeted sequence-to-sequence machine translation, whereas GPT-3 is an autoregressive language model used to generate human-like text.
- Weight initialization: GPT-2/GPT-3 initialize weights from a normal distribution with a small standard deviation (0.02) and scale the residual-path projections down by a factor that depends on the number of layers, which keeps the residual stream stable in very deep models (sketched after this list).
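GELU vs ReLU on a few sample inputs. GELU is a smooth gate that lets small negative values through slightly, rather than clipping hard at zero.

```python
import torch
import torch.nn as nn

x = torch.linspace(-3, 3, 7)
print(nn.ReLU()(x))   # hard gate at zero (original transformer MLP)
print(nn.GELU()(x))   # smooth, slightly negative for small negative inputs (GPT-2/3 MLP)
```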
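The two learning-rate schedules side by side. The transformer formula is the one from the paper; the GPT-3 function follows the warmup-then-cosine-decay shape described in that paper, but the step counts and peak rate here are illustrative, not the paper's exact token budgets.

```python
import math

def transformer_lr(step, d_model=512, warmup=4000):
    # "Attention Is All You Need": linear warmup, then inverse square-root decay.
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

def gpt3_lr(step, max_lr=6e-4, warmup=2000, total=600_000, min_frac=0.1):
    # GPT-3 style: linear warmup, then cosine decay down to 10% of the peak rate.
    if step < warmup:
        return max_lr * step / warmup
    progress = min((step - warmup) / max(total - warmup, 1), 1.0)
    return max_lr * (min_frac + (1 - min_frac) * 0.5 * (1 + math.cos(math.pi * progress)))
```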
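A tiny demonstration of the causal mask that makes GPT's attention unidirectional: scores for future positions are set to minus infinity before the softmax, so each row only distributes attention over positions it is allowed to see.

```python
import torch

T = 5
# Lower-triangular mask: position t may attend to positions <= t only (GPT-style causal
# attention). The original transformer's encoder has no such mask and attends bidirectionally.
mask = torch.tril(torch.ones(T, T))
scores = torch.randn(T, T)                          # stand-in for q @ k.T / sqrt(d)
scores = scores.masked_fill(mask == 0, float("-inf"))
weights = torch.softmax(scores, dim=-1)             # each row sums to 1 over visible positions
print(weights)
```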
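A sketch of the GPT-2-style initialization that nanoGPT also follows: normal(0, 0.02) everywhere, with the output projections of the residual branches scaled down by roughly 1/sqrt(2 · n_layer). The `is_residual_proj` flag is a hypothetical marker used here only to keep the example short.

```python
import math
import torch.nn as nn

def init_weights(module, n_layer=12):
    # GPT-2/GPT-3 style init: normal(0, 0.02), with residual projections scaled
    # down so the residual stream does not grow with depth.
    if isinstance(module, nn.Linear):
        std = 0.02
        if getattr(module, "is_residual_proj", False):   # hypothetical flag for this sketch
            std = 0.02 / math.sqrt(2 * n_layer)
        nn.init.normal_(module.weight, mean=0.0, std=std)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
```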
Efficiency
- Distributed training: GPT-3 was trained on a large cluster of V100 GPUs using a mixture of model and data parallelism, while the original transformer models were trained on a single machine with 8 P100 GPUs.
- Precision: GPT-3 was trained with mixed precision (FP16 compute with FP32 master weights), while the original transformer was trained in standard 32-bit floating point; 8-bit quantization is something typically applied later at inference time, not how GPT-3 itself was trained (a mixed-precision training step is sketched below).
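A minimal mixed-precision training step using PyTorch's `torch.cuda.amp`, as an illustration of the FP16/FP32 idea; it requires a CUDA device, and the model, data, and loss are placeholders rather than anything from GPT-3 or nanoGPT.

```python
import torch
import torch.nn as nn

model = nn.Linear(1024, 1024).cuda()                  # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.1)
scaler = torch.cuda.amp.GradScaler()

x = torch.randn(8, 1024, device="cuda")
with torch.cuda.amp.autocast():           # run forward math in FP16 where it is safe
    loss = model(x).pow(2).mean()         # placeholder loss
scaler.scale(loss).backward()             # scale the loss to avoid FP16 gradient underflow
scaler.step(optimizer)                    # unscale gradients, then step in FP32
scaler.update()
optimizer.zero_grad(set_to_none=True)
```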