Where is the actual code for LayerNorm (torch.nn.functional.layer_norm)

zeyuyun1 · January 27, 2021, 7:39am

I am looking for the implementation for torch.nn.functional.layer_norm, it links me to this doc, which then link me to this one

But I can’t find where is torch.layer_norm.

According to the documentation, it seems like the math is following:


x = torch.randn(50,20,100)
mean = x.sum(axis = 0)/(x.shape[0])
std = (((x - mean)**2).sum()/(x.shape[0])).sqrt()

LayerNorm = torch.nn.LayerNorm(x.shape, elementwise_affine = True)

torch_layernorm = LayerNorm(x)
My_LayerNorm = (x - mean)/std*LayerNorm.weight+LayerNorm.bias

print(My_LayerNorm)
print(torch_layernorm)

However, the my output and LayerNorm output is different…

ptrblck · January 28, 2021, 8:55am

You can find the (CPU) C++ implementation here.

Xinchengzelin · June 16, 2023, 6:44am

Hi, @ptrblck , could you tell me where I can find native_layer_norm in the line return std::get<0>(at::native_layer_norm(input, normalized_shape, weight, bias, eps));

Ice-Citron · August 4, 2024, 6:55pm

Heres what i written in my class which when I use it to train a 124m parameter LLM model. The loss curve looks the exact same (refer to forward pass method below):

for reference. the run with the layer norm class used below can be seen in the run “radiant-flower-16”. “absurd-glitter-13”, “crisp-flower-10” and “kind-energy-14” are previous runs using torch’s default layernorm class. And hence you can see their loss is very simillar, hence I think my implementation is a direct replica afaik (haven’t tested it beyond this).
P.S. ignore the ethereal-oath-9, that was some bugs with logging.

class LayerNorm(nn.Module):

    __constants__ = ['normalized_shape', 'eps', 'elementwise_affine']
    normalized_shape: Tuple[int, ...]
    eps: float
    elementwise_affine: bool

    def __init__(self, normalized_shape: _shape_t, eps: float = 1e-5, elementwise_affine: bool = True, bias: bool = True, device=None, dtype=None) -> None:
        factory_kwargs = {'device': device, 'dtype': dtype}
        super().__init__()
        if isinstance(normalized_shape, numbers.Integral):
            # mypy error: incompatible types in assignment
            normalized_shape = (normalized_shape,)  # type: ignore[assignment]
        self.normalized_shape = tuple(normalized_shape)  # type: ignore[arg-type]
        self.eps = eps
        self.elementwise_affine = elementwise_affine
        if self.elementwise_affine:
            self.weight = Parameter(torch.empty(self.normalized_shape, **factory_kwargs))
            if bias:
                self.bias = Parameter(torch.empty(self.normalized_shape, **factory_kwargs))
            else:
                self.register_parameter('bias', None)
        else:
            self.register_parameter('weight', None)
            self.register_parameter('bias', None)

        self.reset_parameters()

    def reset_parameters(self) -> None:
        if self.elementwise_affine:
            init.ones_(self.weight)
            if self.bias is not None:
                init.zeros_(self.bias)

    def forward(self, x: Tensor) -> Tensor:
        """Forward pass of LayerNorm."""
        mean = x.mean(dim=-1, keepdim=True)
        mean_x2 = torch.square(x).mean(dim=-1, keepdim=True)
        var = mean_x2 - torch.square(mean)

        x_norm = (x - mean) / torch.sqrt(var + self.eps)

        if self.elementwise_affine:
            x_norm = self.weight * x_norm + self.bias

        return x_norm
        
    def extra_repr(self) -> str:
        return '{normalized_shape}, eps={eps}, elementwise_affine={elementwise_affine}'.format(**self.__dict__)```

Ice-Citron · August 4, 2024, 7:25pm

actually nevermind. There’s a chance that my code isn’t the exact thing. And it might have worked by chance, as I tested again.

yuhsianglin · September 20, 2024, 9:10pm

One would only need to change the calculation by

My_LayerNorm = (x - mean) / (std**2 + epsilon)**0.5 * LayerNorm.weight + LayerNorm.bias

according to the documentation. It will match the output of LayerNorm

rk_evolving · January 17, 2025, 1:58am

if you are dealing with text sequence, the target of layernorm should be the last dimension. in your computing, you choosed the first dimension. this might be the reason why the results are different.