Before 2018, NLP pre-training followed only two paths:
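BERT's core insight: truly bidirectional representations are far more powerful than unidirectional ones. But letting a standard language model see both the left and right context amounts to "cheating", because the word being predicted becomes visible to itself through the other positions. BERT therefore uses pre-training tasks designed specifically to get around this problem.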
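To make the difference between unidirectional and bidirectional context concrete, here is a minimal sketch of the two attention visibility patterns. The function names are illustrative, not from the paper; a `1` at row `i`, column `j` means position `i` may attend to position `j`.

```python
def causal_mask(n):
    """Left-to-right pattern: position i sees only positions j <= i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Encoder-style pattern: every position sees every position."""
    return [[1] * n for _ in range(n)]


tokens = ["the", "cat", "sat"]
for name, mask in [("causal", causal_mask(3)), ("bidirectional", bidirectional_mask(3))]:
    print(name)
    for tok, row in zip(tokens, mask):
        print(f"  {tok:>4}: {row}")
```

With the bidirectional pattern every token sees every other token, which is why a plain next-word objective no longer works and a different training task is needed.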
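Randomly select 15% of the input tokens and have the model predict the original word at each selected position:
Input: "The [MASK] sat on the [MASK]"
Target: predict [MASK] → "cat", "mat"
In detail, of the selected tokens, 80% are replaced with [MASK], 10% are replaced with a random word, and 10% are left unchanged. Because [MASK] never appears during fine-tuning, this mix keeps the model from learning useful representations only at [MASK] positions.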
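Given sentences A and B, decide whether B is the sentence that actually follows A:
Input: [CLS] The cat sat on the mat [SEP] It was happy [SEP]
Label: IsNext ✓
Input: [CLS] The cat sat on the mat [SEP] Stock prices fell [SEP]
Label: NotNext ✗
NSP is meant to teach the model relationships between sentences, which matter for tasks such as question answering and natural language inference.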
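Building NSP training pairs could look roughly like the sketch below. It follows the 50/50 split described in the BERT paper (half the time B is the true next sentence, otherwise a random sentence from another document). The function name `make_nsp_example`, the whitespace tokenization (a stand-in for WordPiece), and the toy documents are assumptions made for illustration.

```python
import random


def make_nsp_example(docs, rng):
    """docs: list of documents, each a list of >= 2 sentences; needs >= 2 documents."""
    doc_idx = rng.randrange(len(docs))
    doc = docs[doc_idx]
    i = rng.randrange(len(doc) - 1)
    sent_a = doc[i].split()
    if rng.random() < 0.5:
        sent_b, label = doc[i + 1].split(), "IsNext"
    else:
        # Draw sentence B from a different document
        other = rng.choice([d for j, d in enumerate(docs) if j != doc_idx])
        sent_b, label = rng.choice(other).split(), "NotNext"
    tokens = ["[CLS]"] + sent_a + ["[SEP]"] + sent_b + ["[SEP]"]
    # Segment ids mark which sentence each token belongs to (A = 0, B = 1)
    segment_ids = [0] * (len(sent_a) + 2) + [1] * (len(sent_b) + 1)
    return tokens, segment_ids, label


toy_docs = [
    ["The cat sat on the mat", "It was happy"],
    ["Stock prices fell", "Investors were worried"],
]
print(make_nsp_example(toy_docs, random.Random(0)))
```

The label is predicted from the final hidden state at the `[CLS]` position, which is why that token is placed at the start of every input.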
BERT uses only the Transformer encoder (no decoder), because it needs bidirectional attention:
| Model | Layers | Hidden size | Attention heads | Parameters |
|---|---|---|---|---|
| BERT-Base | 12 | 768 | 12 | 110M |
| BERT-Large | 24 | 1024 | 16 | 340M |
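As a sanity check on the parameter counts in the table, the totals can be estimated from the layer count and hidden size alone. The sketch below makes several assumptions that are not stated above: a vocabulary of 30,522 WordPiece tokens, 512 position embeddings, 2 segment types, a feed-forward size of 4x the hidden size, and a pooler layer on top. The function name is hypothetical.

```python
def bert_param_count(layers, hidden, vocab_size=30522, max_positions=512, type_vocab=2):
    """Approximate parameter count of a BERT-style encoder (assumed layout, see lead-in)."""
    ffn = 4 * hidden  # assumed feed-forward width
    layer_norm = 2 * hidden  # scale + bias
    # Token, position and segment embeddings, followed by one LayerNorm
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    # Q, K, V and output projections (weights + biases), plus LayerNorm
    attention = 4 * (hidden * hidden + hidden) + layer_norm
    # Two feed-forward projections (weights + biases), plus LayerNorm
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden) + layer_norm
    pooler = hidden * hidden + hidden
    return embeddings + layers * (attention + feed_forward) + pooler


for name, layers, hidden in [("BERT-Base", 12, 768), ("BERT-Large", 24, 1024)]:
    print(f"{name}: {bert_param_count(layers, hidden) / 1e6:.1f}M parameters")
```

Under these assumptions the estimate comes out at roughly 109M and 335M, in line with the rounded figures in the table. The number of attention heads does not appear in the formula: the heads split the hidden dimension, so changing their count leaves the parameter total unchanged.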
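BERT popularized the "pre-train, then fine-tune" paradigm:
Key advantage: the same pre-trained model can be fine-tuned for a wide range of NLP tasks without redesigning the architecture.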
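For a sentence-level classification task, the "one additional output layer" can be as small as a single linear layer plus softmax on top of the final `[CLS]` vector. The sketch below uses plain Python and a random vector as a stand-in for the encoder output; the names `classify`, `weights`, and `bias` are illustrative, and in real fine-tuning the encoder weights are updated together with this head.

```python
import math
import random


def classify(cls_vector, weights, bias):
    """Linear layer + softmax over the [CLS] representation."""
    logits = [
        sum(w * x for w, x in zip(row, cls_vector)) + b
        for row, b in zip(weights, bias)
    ]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
hidden, num_labels = 768, 3
cls_vector = [rng.gauss(0, 1) for _ in range(hidden)]  # stand-in for encoder output
weights = [[rng.gauss(0, 0.02) for _ in range(hidden)] for _ in range(num_labels)]
bias = [0.0] * num_labels

probs = classify(cls_vector, weights, bias)
print(probs)
```

The only task-specific parameters here are `hidden * num_labels` weights and `num_labels` biases, which is tiny next to the pre-trained encoder.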
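| Task | Dataset | BERT score | Absolute improvement |
|---|---|---|---|
| General NLU | GLUE | 80.5% | +7.7 points |
| Natural language inference | MultiNLI | 86.7% | +4.6 points |
| Question answering (v1.1) | SQuAD 1.1 | 93.2 F1 | +1.5 F1 |
| Question answering (v2.0) | SQuAD 2.0 | 83.1 F1 | +5.1 F1 |
BERT set new state-of-the-art results on 11 NLP tasks, which drew wide attention across the NLP community.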
We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications.