Bug Report: Incorrect Pre-LN Formula
The text contains the formulation shown in 1); the correct form is given in 2):
1) Current (incorrect):
If module is attn: attn(LN(x) + x)
2) Correct version:
If module is attn: x + attn(LN(x))
Why version 2 is correct:
In Pre-Layer Normalization:
- Layer Norm is applied to the input: LN(x)
- The result passes through the module: module(LN(x))
- The residual connection is added after the module: module(LN(x)) + x
Version 1 places the residual inside the module's argument, which defeats both the purpose of normalization (normalizing the module's input) and the residual connection (allowing gradient flow to bypass the module).
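The difference in ordering can be sketched in code. This is a minimal illustration, not the original implementation: `layer_norm` is a bare NumPy re-implementation, and `module` stands in for the attention (or MLP) sublayer.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each row to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def pre_ln_block(x, module):
    # Correct Pre-LN: normalize the input, apply the module,
    # then add the residual AFTER the module.
    return x + module(layer_norm(x))

def buggy_block(x, module):
    # Incorrect version 1: the residual is added inside the module's
    # argument, so the module no longer receives a normalized input
    # and there is no gradient path that bypasses the module.
    return module(layer_norm(x) + x)
```

With any non-identity module (e.g. `module = lambda h: 2.0 * h`) the two blocks produce different outputs: the correct block yields `x + 2*LN(x)`, while the buggy one yields `2*(LN(x) + x)`, scaling the residual along with the module's output.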