Behind GPT-4: Exploring the latest language model


In the world of artificial intelligence, the arrival of GPT-4 has been nothing short of a revelation. Recently, details about this groundbreaking AI model have been leaked, giving us a glimpse into the complexities and intricacies that make it tick. As a software engineer, I've been fascinated by these revelations, and I'm excited to share what I've learned.

Size and Structure
GPT-4 is a behemoth, with a total of approximately 1.8 trillion parameters across 120 layers — more than 10 times the size of its predecessor, GPT-3. The model employs a Mixture of Experts (MoE) approach, with 16 experts, each holding about 111 billion parameters in its MLP layers. Interestingly, only two of these experts are routed to on each forward pass.
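To make the routing idea concrete, here is a minimal sketch of top-2 MoE routing in plain Python. This is my own illustration, not OpenAI's implementation: real MoE layers batch tokens, use learned load-balancing losses, and run on accelerators, but the core idea — a router scores all experts and only the two best actually compute — is the same.

```python
import math

def top2_route(logits):
    """Pick the two highest-scoring experts and softmax their scores."""
    top2 = sorted(range(len(logits)), key=lambda i: logits[i])[-2:]
    exps = [math.exp(logits[i]) for i in top2]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top2, exps)]

def moe_forward(x, gate, experts):
    """x: hidden state (list of floats); gate: one router row per expert;
    experts: list of callables, each mapping a hidden state to a new one."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate]
    out = [0.0] * len(x)
    # Only the two routed experts run; the other 14 stay idle for this token,
    # which is why active parameters are a small fraction of the total.
    for i, weight in top2_route(logits):
        y = experts[i](x)
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out
```

Because each token touches only 2 of 16 experts, the per-token compute scales with the ~280 billion active parameters, not the full 1.8 trillion.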

Data and Training
GPT-4 was trained on a staggering 13 trillion tokens. These tokens aren't all unique; they include multiple epochs over the same data — two epochs for text-based data and four for code-based data. The model also benefited from millions of rows of instruction fine-tuning data, sourced from ScaleAI as well as internally.

Inference and Cost
Each forward-pass inference (generating one token) activates around 280 billion parameters and roughly 560 TFLOPs — a stark contrast to the 1.8 trillion parameters and ~3,700 TFLOPs a purely dense model would need per forward pass. Even so, GPT-4's inference cost is about three times that of the 175-billion-parameter Davinci, largely because of the larger clusters GPT-4 requires and the much lower utilization it achieves.
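Whatever the exact units behind the leaked figures, per-pass compute scales with active parameters, so the MoE savings ratio should match the parameter ratio. A quick consistency check on the leak's own numbers (my arithmetic, not from the leak):

```python
# Compute per forward pass is proportional to active parameters, so the
# FLOPs ratio between the MoE model and a dense model of the same total
# size should match the active-parameter fraction.
active_params = 280e9     # parameters routed to per token (leaked figure)
total_params = 1.8e12     # full parameter count (leaked figure)
moe_tflops = 560          # per forward pass, MoE (leaked figure)
dense_tflops = 3700       # per forward pass, hypothetical dense model

param_ratio = active_params / total_params   # ~0.156
flops_ratio = moe_tflops / dense_tflops      # ~0.151
assert abs(param_ratio - flops_ratio) < 0.01  # the leak's numbers line up
```

So the MoE design buys roughly a 6.5x reduction in per-token compute relative to a dense 1.8-trillion-parameter model.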

Parallelism and Training Cost
To parallelize across their A100 GPUs, OpenAI used 8-way tensor parallelism and 15-way pipeline parallelism. The training compute for GPT-4 was approximately 2.15e25 FLOPs, run on around 25,000 A100s for 90 to 100 days at roughly 32% to 36% MFU (model FLOPs utilization).
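These numbers are roughly self-consistent, which is a useful sanity check on the leak. Taking the A100's published dense BF16 peak of 312 TFLOP/s (the one figure here that comes from NVIDIA's spec sheet, not the leak) and the midpoints of the leaked ranges:

```python
A100_PEAK_FLOPS = 312e12   # dense BF16 FLOP/s per A100 (NVIDIA spec)
gpus = 25_000              # leaked cluster size
days = 95                  # midpoint of the leaked 90-100 day range
mfu = 0.34                 # midpoint of the leaked 32-36% MFU range

total_flops = gpus * A100_PEAK_FLOPS * mfu * days * 24 * 3600
print(f"{total_flops:.2e}")  # prints 2.18e+25
```

That lands within a couple of percent of the leaked 2.15e25 figure, so the GPU count, duration, and MFU ranges all cohere.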

Mixture of Expert Tradeoffs
The use of MoE comes with its own set of tradeoffs. While more experts could achieve a lower loss, models with many experts are harder to generalize across tasks and harder to bring to convergence. OpenAI therefore chose to be conservative with the number of experts.

Vision Multi-Modal and Speculative Decoding
GPT-4 also includes a vision encoder, separate from the text model, connected via cross-attention — an architecture similar to DeepMind's Flamingo. This adds parameters on top of GPT-4's 1.8 trillion, and it is fine-tuned with a further ~2 trillion tokens after the text-only pre-training. Speculative decoding may also be in use: a smaller, faster model decodes several tokens in advance, and these are fed into a larger oracle model as a single batch for verification.
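The speculative decoding loop can be sketched in a few lines. This is a simplified, greedy version of the general technique, under my own toy interfaces (`draft` and `oracle` are stand-ins for the small and large models) — not a claim about OpenAI's actual implementation, and real systems verify token distributions rather than greedy argmaxes:

```python
def speculative_decode(draft, oracle, prefix, k=4):
    """Greedy speculative decoding sketch.

    draft(seq) -> next token (cheap model); oracle(seq) -> next token
    (expensive model). The draft proposes k tokens; the oracle checks them
    (conceptually in one batched forward pass) and keeps the longest
    agreeing prefix, emitting its own token at the first disagreement.
    """
    # Step 1: the cheap draft model proposes k tokens autoregressively.
    proposed, seq = [], list(prefix)
    for _ in range(k):
        t = draft(seq)
        proposed.append(t)
        seq.append(t)
    # Step 2: the oracle verifies the proposals position by position.
    accepted, seq = [], list(prefix)
    for t in proposed:
        o = oracle(seq)
        if o == t:
            accepted.append(t)    # draft guessed right: keep it for free
            seq.append(t)
        else:
            accepted.append(o)    # oracle overrides at the first mismatch
            break
    else:
        accepted.append(oracle(seq))  # all k matched: oracle adds one more
    return accepted
```

When the draft model agrees with the oracle, several tokens are accepted for the price of one expensive verification pass, which is the whole appeal of the trick.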

Dataset Mixture
The model was trained on 13 trillion tokens, with CommonCrawl and RefinedWeb both contributing 5 trillion. There are rumors that parts of the data came from Twitter, Reddit, and YouTube, as well as from a custom, hand-collected dataset of college textbooks. This would explain why GPT-4 seems so knowledgeable across such a wide range of subjects.

In conclusion, the creation of GPT-4 is a testament to the rapid advancements in AI. From its massive size to its complex training process and powerful capabilities, GPT-4 is a remarkable achievement in the field of artificial intelligence. As we continue to uncover more about this impressive model, we can only imagine what the future of AI holds.

Technologies: AI