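The core idea behind GPT-3 is remarkably simple: make the model large enough, and it can pick up new tasks without any gradient updates.
Before GPT-3, the standard NLP workflow was "pre-train → fine-tune", which requires labeled data and gradient updates for every task. GPT-3 popularized a different paradigm, "pre-train → prompt" (in-context learning): give the model a few examples in its input, and it infers the task and carries it out.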

| Specification | GPT-3 (175B) | GPT-2 (1.5B) | BERT-Large |
|---|---|---|---|
| Parameters | 175 billion | 1.5 billion | 340 million |
| Layers | 96 | 48 | 24 |
| Hidden size | 12,288 | 1,600 | 1,024 |
| Attention heads | 96 | 25 | 16 |
| Context window (tokens) | 2,048 | 1,024 | 512 |
| Training data | 570 GB of text | 40 GB | 16 GB |

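As a sanity check on the table, the parameter counts can be roughly recovered from the layer count and hidden size alone. The sketch below uses the common approximation N ≈ 12 · n_layers · d_model², which assumes a 4x feed-forward expansion and ignores embeddings, biases, and layer norms. The function name `approx_params` and the `CONFIGS` dictionary are illustrative, not taken from any paper or library.

```python
# Rough non-embedding parameter estimate for a Transformer stack.
# Assumption: each layer has 4*d^2 attention weights (Q, K, V, output projection)
# and 8*d^2 feed-forward weights (two matrices with a 4x hidden expansion).
# Embeddings, biases, and layer norms are ignored.

CONFIGS = {
    # name: (n_layers, d_model), taken from the table above
    "GPT-3 (175B)": (96, 12288),
    "GPT-2 (1.5B)": (48, 1600),
    "BERT-Large": (24, 1024),
}


def approx_params(n_layers: int, d_model: int) -> int:
    """Return the approximate number of non-embedding parameters."""
    attention = 4 * d_model * d_model
    feed_forward = 8 * d_model * d_model
    return n_layers * (attention + feed_forward)


for name, (n_layers, d_model) in CONFIGS.items():
    billions = approx_params(n_layers, d_model) / 1e9
    print(f"{name}: ~{billions:.2f}B non-embedding parameters")
```

Under these assumptions the estimate lands at about 174 billion for GPT-3 and about 1.47 billion for GPT-2, close to the headline figures. BERT-Large comes out near 300 million, with most of the remainder attributable to its embedding tables.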
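The GPT-3 paper defines three settings that require no gradient updates. They differ only in how many demonstrations appear in the prompt:

Zero-shot (task description only, no demonstrations):

    Translate to French: "Hello world" →

One-shot (one demonstration):

    Translate to French:
    "Hello" → "Bonjour"
    "How are you" →

Few-shot (several demonstrations):

    Translate to French:
    "Hello" → "Bonjour"
    "Goodbye" → "Au revoir"
    "Thanks" → "Merci"
    "Please" →
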
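The three prompt formats above differ only in the number of demonstrations, so they can all be produced by one small helper. This is a minimal sketch: `build_prompt` and `DEMOS` are hypothetical names used for illustration, and the resulting string would still have to be sent to whatever language model is being used.

```python
from typing import Sequence, Tuple


def build_prompt(instruction: str, demos: Sequence[Tuple[str, str]], query: str) -> str:
    """Assemble a zero-, one-, or few-shot prompt as plain text.

    No model weights are touched: the "learning" happens entirely
    through the demonstrations placed in the model's input.
    """
    if not demos:
        # Zero-shot: instruction and query on a single line.
        return f'{instruction} "{query}" →'
    lines = [instruction]
    for source, target in demos:
        lines.append(f'"{source}" → "{target}"')
    # The final line is left open for the model to complete.
    lines.append(f'"{query}" →')
    return "\n".join(lines)


DEMOS = [("Hello", "Bonjour"), ("Goodbye", "Au revoir"), ("Thanks", "Merci")]

zero_shot = build_prompt("Translate to French:", [], "Hello world")
one_shot = build_prompt("Translate to French:", DEMOS[:1], "How are you")
few_shot = build_prompt("Translate to French:", DEMOS, "Please")

print(few_shot)
```

Moving from zero-shot to few-shot changes only the `demos` argument, which is the point of in-context learning: the task is specified through the input text rather than through training.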
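GPT-3's most striking finding is that some abilities appear only once the model reaches a certain scale and are largely absent in smaller models.
Results like these later became standard examples in discussions of "emergent abilities" in large language models, and they reinforced interest in scaling-laws research, which had begun shortly before GPT-3 (Kaplan et al., 2020).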
We demonstrate that scaling up language models greatly improves task-agnostic, few-shot performance, sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches. We train GPT-3, an autoregressive language model with 175 billion parameters, and test its performance on dozens of NLP tasks. For all tasks, GPT-3 is applied without any gradient updates or fine-tuning, with tasks and few-shot demonstrations specified purely via text interaction with the model.