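Transformers are powerful, but they have one fundamental weakness: attention has O(n²) complexity in the sequence length n. Doubling the sequence length quadruples the computation. As context windows grow from 2K to 128K or 1M tokens, the compute cost explodes.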
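To make the quadratic growth concrete, here is a small back-of-the-envelope sketch. It counts only the entries of the n × n attention score matrix for a single head and layer; the helper name `attention_score_entries` is made up for this illustration, and real cost also depends on head count, hidden size, and kernel implementation.

```python
def attention_score_entries(n: int) -> int:
    """Entries in the n x n attention score matrix (one head, one layer)."""
    return n * n

base = attention_score_entries(2_048)  # 2K context as the reference point
for n in (2_048, 131_072, 1_048_576):  # 2K, 128K, 1M tokens
    entries = attention_score_entries(n)
    print(f"n={n:>9,}  scores={entries:>17,}  vs 2K: {entries // base:,}x")
```

Going from 2K to 128K makes the sequence 64 times longer but the score matrix 4,096 times larger; at 1M tokens the factor is 262,144.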
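A large body of work has tried to replace the Transformer with linear attention, RNN variants, state space models (SSMs), and so on, but on language tasks these alternatives consistently fell short of attention.
Mamba breaks this deadlock: it keeps linear complexity while matching or even surpassing Transformers on language modeling.
An SSM models a sequence through a hidden state h(t), much like a continuous-time RNN: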
# Continuous time
h'(t) = A·h(t) + B·x(t) # state update
y(t) = C·h(t) # output
# After discretization (can be trained in parallel)
h_t = Ā·h_{t-1} + B̄·x_t
y_t = C·h_t
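The discretized recurrence can be sketched in a few lines of plain Python. This is a minimal illustration, not the S4 or Mamba implementation: it assumes a diagonal A (stored as a list), a scalar input per step, and the zero-order-hold rule Ā = exp(Δ·A), B̄ = (Ā − 1)/A · B, none of which the text above specifies. All function names and numeric values are hypothetical. The second function shows why fixed parameters allow parallel training: the same outputs come from a convolution with the kernel K_k = C·Ā^k·B̄.

```python
import math

def discretize_zoh(a, b, delta):
    """Zero-order hold for a diagonal A: returns (A_bar, B_bar) as lists."""
    a_bar = [math.exp(delta * ai) for ai in a]
    b_bar = [(abi - 1.0) / ai * bi for abi, ai, bi in zip(a_bar, a, b)]
    return a_bar, b_bar

def ssm_recurrent(a_bar, b_bar, c, xs):
    """h_t = A_bar * h_{t-1} + B_bar * x_t ;  y_t = C . h_t"""
    h = [0.0] * len(a_bar)
    ys = []
    for x in xs:
        h = [ab * hi + bb * x for ab, hi, bb in zip(a_bar, h, b_bar)]
        ys.append(sum(ci * hi for ci, hi in zip(c, h)))
    return ys

def ssm_convolutional(a_bar, b_bar, c, xs):
    """Same outputs via the kernel K_k = C . (A_bar^k * B_bar).
    Only valid because the parameters do not change from step to step."""
    kernel = [
        sum(ci * ab ** k * bb for ci, ab, bb in zip(c, a_bar, b_bar))
        for k in range(len(xs))
    ]
    return [sum(kernel[k] * xs[t - k] for k in range(t + 1)) for t in range(len(xs))]

# Hypothetical toy parameters: negative diagonal A keeps the state stable.
a, b, c = [-1.0, -0.5], [1.0, 1.0], [0.5, -0.25]
a_bar, b_bar = discretize_zoh(a, b, delta=0.1)
xs = [1.0, 0.0, 2.0, -1.0]
y_rec = ssm_recurrent(a_bar, b_bar, c, xs)
y_conv = ssm_convolutional(a_bar, b_bar, c, xs)
print(y_rec)
print(y_conv)
```

The two printed lists should agree up to floating-point rounding. The recurrent form is what makes step-by-step inference cheap; the convolutional form is what lets every time step be computed at once during training.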
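The earlier S4 model showed that SSMs have an advantage in long-sequence modeling, but in a traditional SSM the parameters A, B, and C are fixed (independent of the input), so the model cannot perform content-based reasoning.
Mamba's key insight is to make the SSM parameters functions of the input: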
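# Traditional SSM (fixed parameters)
B, C, Δ = fixed parameters
# Mamba (selective SSM, parameters vary with the input)
B(x) = Linear(x) # the input decides "what to look at"
C(x) = Linear(x) # the input decides "what to output"
Δ(x) = softplus(Linear(x)) # the input decides "how much to remember"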
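This lets the model selectively propagate or forget information: when it meets an important token it "remembers" it, and when it meets an irrelevant token it "ignores" it.
Intuition: it is like reading. Important content is read carefully (large Δ = the state is strongly updated with the current input = remembered), while filler is skimmed (small Δ = the state barely changes = the token is ignored). A large Δ also discards more of the older state, which is how the model forgets.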
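The effect of Δ can be checked numerically with a small sequential sketch. Several things here are assumptions made for illustration: a single SSM channel, a one-dimensional state with A = −1, zero-order-hold discretization, and hand-picked weights in which the second feature of each token acts as an "importance flag". The function `selective_scan` and all parameter names are hypothetical, not Mamba's API.

```python
import math

def softplus(v: float) -> float:
    return math.log1p(math.exp(v))

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def selective_scan(tokens, a, W_B, W_C, w_delta, b_delta):
    """Sequential selective SSM for one channel.
    tokens: list of feature vectors; the channel input is u_t = tokens[t][0].
    a: diagonal of A (all negative). W_B, W_C: one weight row per state dim."""
    h = [0.0] * len(a)
    outputs, gates = [], []
    for x in tokens:
        delta = softplus(dot(w_delta, x) + b_delta)   # delta(x) > 0
        B = [dot(row, x) for row in W_B]              # B(x)
        C = [dot(row, x) for row in W_C]              # C(x)
        a_bar = [math.exp(delta * ai) for ai in a]    # how much old state survives
        b_bar = [(ab - 1.0) / ai * bi for ab, ai, bi in zip(a_bar, a, B)]
        u = x[0]
        h = [ab * hi + bb * u for ab, hi, bb in zip(a_bar, h, b_bar)]
        outputs.append(dot(C, h))
        gates.append((delta, a_bar[0]))
    return outputs, gates

tokens = [[1.0, 1.0], [1.0, 0.0]]   # [value, importance flag]: important, then filler
outputs, gates = selective_scan(
    tokens, a=[-1.0],
    W_B=[[1.0, 0.0]], W_C=[[1.0, 0.0]],
    w_delta=[0.0, 6.0], b_delta=-3.0,
)
for (delta, keep), tok in zip(gates, tokens):
    print(f"token={tok}  delta={delta:.3f}  fraction of old state kept={keep:.3f}")
```

With these weights the flagged token gets Δ ≈ 3.05 and keeps only about 5% of the previous state while writing the current input in, whereas the filler token gets Δ ≈ 0.05 and keeps about 95% of the state, so it passes through almost unnoticed.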
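The selection mechanism means the SSM can no longer be computed efficiently as a convolution, because the parameters now depend on the input and change at every step. Mamba instead computes the recurrence with a hardware-aware parallel scan algorithm.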
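The reason a scan still parallelizes is that each step h → a_t·h + b_t is an affine map, and composing affine maps is associative. The sketch below illustrates only that idea on a scalar state, using a simple Hillis-Steele scan chosen for brevity. It is not Mamba's kernel: the hardware-aware parts (fusing the scan on the GPU and managing memory traffic) are not shown, and the function names are made up for this example.

```python
def combine(left, right):
    """Compose two steps h -> a*h + b: apply `left` first, then `right`."""
    a1, b1 = left
    a2, b2 = right
    return (a2 * a1, a2 * b1 + b2)

def sequential_scan(a, b):
    """Reference: h_t = a_t * h_{t-1} + b_t with h_{-1} = 0."""
    h, out = 0.0, []
    for at, bt in zip(a, b):
        h = at * h + bt
        out.append(h)
    return out

def parallel_style_scan(a, b):
    """Inclusive Hillis-Steele scan: about log2(n) rounds. Within a round
    every combine is independent, so real hardware could run them at once."""
    elems = list(zip(a, b))
    step = 1
    while step < len(elems):
        elems = [
            combine(elems[i - step], elems[i]) if i >= step else elems[i]
            for i in range(len(elems))
        ]
        step *= 2
    return [bt for _, bt in elems]

# Input-dependent coefficients, e.g. a_t = A_bar(x_t) and b_t = B_bar(x_t) * x_t.
a = [0.9, 0.5, 0.1, 0.99, 0.7]
b = [1.0, -2.0, 0.5, 0.0, 3.0]
print(sequential_scan(a, b))
print(parallel_style_scan(a, b))
```

Both functions compute the same hidden states; the second one does so in a logarithmic number of rounds instead of a strictly sequential loop, which is what makes training on long sequences practical even though the coefficients differ at every position.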
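Mamba block structure:
Input x
→ linear projection → split into two branches, x and z
→ x branch: Conv1d → SiLU → selective SSM → multiplied by SiLU(z branch)
→ linear projection → output
# No attention! No MLP! Only SSM + convolution + gating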
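The data flow of the block can be traced with a dependency-free forward pass. This is a rough sketch under several assumptions, not the reference implementation: weights are random, biases, normalization, and residual connections are left out, Δ uses a full linear map, discretization is zero-order hold, and the scan is a plain sequential loop. The names `mamba_block`, `init_params`, and every dictionary key are hypothetical.

```python
import math
import random

def silu(v): return v / (1.0 + math.exp(-v))
def softplus(v): return math.log1p(math.exp(v))
def matvec(W, x): return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def rand_matrix(rows, cols, rng, scale=0.5):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def init_params(d_model=4, d_inner=8, n_state=3, kernel=4, seed=0):
    rng = random.Random(seed)
    return {
        "d_inner": d_inner, "n_state": n_state, "kernel": kernel,
        "W_in": rand_matrix(2 * d_inner, d_model, rng),
        "conv_w": rand_matrix(d_inner, kernel, rng),
        "A": [[-(n + 1.0) for n in range(n_state)] for _ in range(d_inner)],
        "W_B": rand_matrix(n_state, d_inner, rng),
        "W_C": rand_matrix(n_state, d_inner, rng),
        "W_delta": rand_matrix(d_inner, d_inner, rng),
        "W_out": rand_matrix(d_model, d_inner, rng),
    }

def mamba_block(seq, p):
    """seq: list of token vectors (d_model each). Returns a list of the same shape."""
    d_inner, n_state, k = p["d_inner"], p["n_state"], p["kernel"]
    # 1) Input projection, split into the x branch and the gate branch z.
    proj = [matvec(p["W_in"], tok) for tok in seq]
    xs = [v[:d_inner] for v in proj]
    zs = [v[d_inner:] for v in proj]
    # 2) Causal depthwise Conv1d over time (past positions only), then SiLU.
    conv = []
    for t in range(len(xs)):
        conv.append([
            silu(sum(p["conv_w"][ch][j] * xs[t - j][ch] for j in range(k) if t - j >= 0))
            for ch in range(d_inner)
        ])
    # 3) Selective SSM: B, C, and delta are recomputed from every token.
    h = [[0.0] * n_state for _ in range(d_inner)]
    out = []
    for u, z in zip(conv, zs):
        B = matvec(p["W_B"], u)
        C = matvec(p["W_C"], u)
        delta = [softplus(v) for v in matvec(p["W_delta"], u)]
        y = []
        for ch in range(d_inner):
            for n in range(n_state):
                a = p["A"][ch][n]
                a_bar = math.exp(delta[ch] * a)
                b_bar = (a_bar - 1.0) / a * B[n]
                h[ch][n] = a_bar * h[ch][n] + b_bar * u[ch]
            y.append(sum(C[n] * h[ch][n] for n in range(n_state)))
        # 4) Gate with SiLU(z), then project back to d_model.
        gated = [yi * silu(zi) for yi, zi in zip(y, z)]
        out.append(matvec(p["W_out"], gated))
    return out

params = init_params()
rng = random.Random(1)
sequence = [[rng.uniform(-1.0, 1.0) for _ in range(4)] for _ in range(6)]
result = mamba_block(sequence, params)
print(len(result), len(result[0]))
```

The output has the same shape as the input, and because both the convolution and the state update only look backward, changing a later token leaves all earlier outputs untouched.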
Foundation models are almost universally based on the Transformer architecture and its core attention module. Many subquadratic-time architectures have been developed to address Transformers' computational inefficiency on long sequences, but they have not performed as well as attention on important modalities such as language. We identify that a key weakness of such models is their inability to perform content-based reasoning, and make several improvements. First, simply letting the SSM parameters be functions of the input addresses their weakness with discrete modalities. We integrate these selective SSMs into a simplified end-to-end neural network architecture without attention or even MLP blocks (Mamba). Mamba enjoys fast inference (5x higher throughput than Transformers) and linear scaling in sequence length.