Activation function

The activation function is an essential component of a neural network. Without activation functions, an ANN would be just a stack of linear layers, which is equivalent to a single linear layer.

A non-linear activation function is what allows the ANN to represent non-linear functions of its input.
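
As a quick illustration, here is a minimal NumPy sketch (array shapes and variable names are arbitrary) showing that two stacked linear layers without an activation function collapse into a single linear layer:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))    # a batch of 4 inputs with 3 features
W1 = rng.normal(size=(3, 5))   # weights of the first linear layer
W2 = rng.normal(size=(5, 2))   # weights of the second linear layer

two_layers = (x @ W1) @ W2     # two linear layers, no activation in between
one_layer = x @ (W1 @ W2)      # a single linear layer with weights W1 @ W2

print(np.allclose(two_layers, one_layer))  # True: the extra depth added nothing
```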



ReLU

ReLU (Rectified Linear Unit) is a commonly used activation function. It is a very simple function.

\[ReLU(x) = \max(0, x)\]


\[ReLU'(x) = \begin{cases} 1 && \text{if } x \gt 0\\ 0 && \text{if } x \leq 0 \end{cases}\]

\(ReLU\) works well in general but is prone to the "dying ReLU" problem: since its gradient is null for all negative inputs, some units can get stuck at zero and stop learning.
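
A minimal NumPy sketch of ReLU and its derivative (function names are illustrative, not a library API):

```python
import numpy as np

def relu(x):
    """ReLU(x) = max(0, x), applied element-wise."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """ReLU'(x) = 1 if x > 0, 0 if x <= 0."""
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # 0.0, 0.0, 0.0, 0.5, 2.0
print(relu_grad(x))  # 0.0, 0.0, 0.0, 1.0, 1.0
```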



Leaky ReLU

Leaky ReLU is a small tweak of the ReLU activation function.

\[LeakyReLU(x) = \max(\alpha x, x)\]


\[LeakyReLU'(x) = \begin{cases} 1 && \text{if } x \gt 0\\ \alpha && \text{if } x \leq 0 \end{cases}\]

Unlike the original ReLU, Leaky ReLU is not prone to dying units, as its gradient is never null: for negative values it is \(\alpha\), a small positive constant (typically 0.01).
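
A minimal NumPy sketch of Leaky ReLU and its derivative (the default \(\alpha = 0.01\) is just a common choice):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """LeakyReLU(x) = max(alpha * x, x) for a small positive alpha."""
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    """LeakyReLU'(x) = 1 if x > 0, alpha if x <= 0."""
    return np.where(x > 0, 1.0, alpha)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(leaky_relu(x))       # -0.02, -0.005, 0.0, 0.5, 2.0
print(leaky_relu_grad(x))  # 0.01, 0.01, 0.01, 1.0, 1.0
```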



ELU

ELU (Exponential Linear Unit) is another alternative to ReLU.

\[ELU(x) = \begin{cases} x && \text{if } x \gt 0\\ \alpha (e^x - 1) && \text{if } x \leq 0 \end{cases}\]


\[ELU'(x) = \begin{cases} 1 && \text{if } x \gt 0\\ ELU(x) + \alpha && \text{if } x \leq 0 \end{cases}\]

Like Leaky ReLU, ELU is not prone to dying units, as its gradient is non-zero for negative values.
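
A minimal NumPy sketch of ELU and its derivative (\(\alpha = 1.0\) here is just a common default):

```python
import numpy as np

def elu(x, alpha=1.0):
    """ELU(x) = x if x > 0, alpha * (exp(x) - 1) if x <= 0."""
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def elu_grad(x, alpha=1.0):
    """ELU'(x) = 1 if x > 0, alpha * exp(x) (= ELU(x) + alpha) if x <= 0."""
    return np.where(x > 0, 1.0, alpha * np.exp(x))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(elu(x))       # negative inputs saturate towards -alpha
print(elu_grad(x))  # the gradient stays strictly positive everywhere
```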



SELU

SELU (Scaled Exponential Linear Unit) is a self-normalizing activation function: it preserves the mean and variance of its input.

It is particularly useful for networks initialized with a Gaussian distribution \(\mathcal{N}(0,1)\), as the activation function will preserve this normalization throughout the network.

It may be used with AlphaDropout, a dropout method that also preserves the normalization of the network.

\[SELU(x) = \lambda \begin{cases} x && \text{if } x \gt 0\\ \alpha e^x - \alpha && \text{if } x \leq 0 \end{cases}\]


\[SELU'(x) = \lambda \begin{cases} 1 && \text{if } x \gt 0\\ \alpha e^x && \text{if } x \leq 0 \end{cases}\]

Where:

\(\lambda \approx 1.0507\) and \(\alpha \approx 1.6733\) are constants computed by the authors of the method in their paper.
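
A minimal NumPy sketch of SELU with the constants from the paper, plus a quick empirical check of the self-normalizing behaviour on \(\mathcal{N}(0,1)\) inputs:

```python
import numpy as np

# Constants from the SELU paper (Klambauer et al., 2017).
LAMBDA = 1.0507009873554805
ALPHA = 1.6732632423543772

def selu(x):
    """SELU(x) = lambda * x if x > 0, lambda * alpha * (exp(x) - 1) if x <= 0."""
    return LAMBDA * np.where(x > 0, x, ALPHA * (np.exp(x) - 1.0))

# Feed N(0, 1) samples through SELU: mean and variance stay close to 0 and 1.
x = np.random.default_rng(0).normal(size=1_000_000)
y = selu(x)
print(round(y.mean(), 3), round(y.var(), 3))  # approximately 0.0 and 1.0
```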



GELU

GELU (Gaussian Error Linear Unit) is an activation function that has been used in the Transformer models BERT and GPT-2. The formula below is the commonly used tanh approximation.

\[GELU(x) = \frac{1}{2} x \left(1 + \tanh \left( \sqrt{\frac{2}{\pi}} \left(x + 0.044715 x^3 \right)\right)\right)\]


\[GELU'(x) = \frac{1}{2} \tanh (0.0356774x^3 + 0.797885 x) + (0.0535161 x^3 + 0.398942 x) \operatorname{sech}^2 (0.0356774 x^3 + 0.797885 x ) + \frac{1}{2}\]
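
A minimal NumPy sketch of the tanh approximation of GELU (the exact form is \(x \Phi(x)\), with \(\Phi\) the standard normal CDF):

```python
import numpy as np

def gelu(x):
    """Tanh approximation of GELU: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(gelu(x))  # slightly negative for negative inputs, close to x for large positive x
```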



Tanh

Tanh is another activation function, though it is used less often in hidden layers.

\[Tanh(x) = \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}\]


\[Tanh'(x) = 1 - \tanh^2(x)\]
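
A minimal NumPy sketch of Tanh and its derivative (NumPy already provides np.tanh):

```python
import numpy as np

def tanh_grad(x):
    """Tanh'(x) = 1 - tanh(x)^2."""
    return 1.0 - np.tanh(x) ** 2

x = np.array([-2.0, 0.0, 2.0])
print(np.tanh(x))    # outputs squashed into (-1, 1)
print(tanh_grad(x))  # gradient is largest at 0 and shrinks quickly away from it (saturation)
```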



Sigmoid

Sigmoid is another activation function that is less commonly used in hidden layers, although it is widely used as an output function to map values between 0 and 1. See also Sigmoid in Logistic Regression.

\[Sigmoid(x) = \frac{1}{1+e^{-x}} = \frac{e^{x}}{e^{x}+1}\]


\[Sigmoid'(x) = Sigmoid(x)\left(1-Sigmoid(x)\right)\]
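
A minimal NumPy sketch of Sigmoid and its derivative:

```python
import numpy as np

def sigmoid(x):
    """Sigmoid(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Sigmoid'(x) = Sigmoid(x) * (1 - Sigmoid(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-4.0, 0.0, 4.0])
print(sigmoid(x))       # values squashed into (0, 1)
print(sigmoid_grad(x))  # the gradient peaks at 0.25 for x = 0
```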


