Motivation

LayerNorm is often preferred in models that process sequences, where the sequence length can vary within a batch. Batch Norm, on the other hand, works well in many standard deep learning models with large, fixed-size inputs (such as images), where its ability to stabilize training through batch-wide statistics is beneficial. LayerNorm is widely used in architectures that process sequence data, such as RNNs and Transformers.

Layer Norm

Layer normalization is a technique used to normalize the inputs across the features in a layer. Given an input vector $x = (x_1, x_2, \dots, x_d)$ of length $d$, layer normalization is performed as follows (a short code sketch follows the steps):

  1. Compute the Mean: $\mu = \frac{1}{d} \sum_{i=1}^{d} x_i$
  2. Compute the Variance: $\sigma^2 = \frac{1}{d} \sum_{i=1}^{d} (x_i - \mu)^2$
  3. Normalize: Each feature in the input is normalized by subtracting the mean and dividing by the standard deviation: $\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}}$. Here, $\epsilon$ is a small constant added for numerical stability.
  4. Apply Scaling and Shifting: Finally, the normalized values are scaled and shifted using learned parameters $\gamma$ and $\beta$: $y_i = \gamma \hat{x}_i + \beta$, where $\gamma$ and $\beta$ are trainable parameters that allow the network to maintain its representational power.
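
The four steps above map directly onto a few lines of code. Below is a minimal NumPy sketch; the function name layer_norm, the eps default of 1e-5, and the toy shapes are illustrative assumptions rather than the API of any particular library.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize x over its last (feature) axis, then scale and shift.

    x     : array of shape (..., d), features on the last axis
    gamma : learned scale parameter of shape (d,)
    beta  : learned shift parameter of shape (d,)
    eps   : small constant for numerical stability (illustrative default)
    """
    mu = x.mean(axis=-1, keepdims=True)        # step 1: per-sample mean
    var = x.var(axis=-1, keepdims=True)        # step 2: per-sample variance
    x_hat = (x - mu) / np.sqrt(var + eps)      # step 3: normalize
    return gamma * x_hat + beta                # step 4: scale and shift

# Usage: normalize a toy batch of 4 vectors with 8 features each
x = np.random.randn(4, 8)
gamma, beta = np.ones(8), np.zeros(8)
y = layer_norm(x, gamma, beta)
print(y.mean(axis=-1))  # approximately 0 for every sample
print(y.std(axis=-1))   # approximately 1 for every sample
```

Note that the statistics are computed per sample over the feature axis, not over the batch axis; this is exactly what makes layer normalization independent of batch size and sequence length.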