One of the key advantages of neural networks is their ability to learn complex relationships between inputs and outputs. Most real-life problems involve nonlinear relationships, which are harder to learn than linear ones. The choice of activation function (the transformation applied to the output of each node) is what introduces nonlinearity into the network, and so it directly affects what a neural network can learn. Activation functions fall broadly into two categories: linear and nonlinear. Popular nonlinear activations include ReLU, tanh, and sigmoid. In what follows we discuss why Rectified Linear Unit (ReLU) activations are preferred over linear activations.

Linear activation functions are also called identity activations, or "no activation", because they apply the identity transformation to the output of each neuron: $f(x) = x$, so the output equals the input.

To formalize the impact of using linear activations between the hidden layers of a deep neural network (DNN), we use the following notation for a DNN with $L$ hidden layers.

The intermediate output of the $l$-th hidden layer (before applying the activation function) is $h_l = W_l o_{l-1} + b_l$, where $W_l$ and $b_l$ are the weights and biases of that layer. The output of the layer is $o_l = \Psi(h_l)$, where $\Psi$ is the activation function and $o_0 = x$ is the network input.

The inference in the forward pass with input x is given by

$y = \Psi\bigl(W_L \, \Psi\bigl(W_{L-1} \cdots \Psi(W_1 x + b_1) \cdots + b_{L-1}\bigr) + b_L\bigr)$

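When the activation function is linear, $\Psi(x) = x$, the forward pass reduces to

$y = (W_L W_{L-1} \cdots W_1)\,x + (b_L + W_L b_{L-1} + W_L W_{L-1} b_{L-2} + \cdots + W_L W_{L-1} \cdots W_2 b_1)$

that is, $y = Wx + b$, where $W = W_L W_{L-1} \cdots W_1$ and $b = b_L + W_L b_{L-1} + \cdots + W_L W_{L-1} \cdots W_2 b_1$. Each bias is multiplied by the weight matrices of all the layers after it, so $b$ is not a plain sum of the biases, but it is still a constant vector.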

The resulting equation is linear (strictly, affine) in the input: it is equivalent to a network with a single layer and cannot fit nonlinear mappings. The depth of a DNN is what allows it to extract features at different levels of abstraction, which has made DNNs one of the most widely used solutions for computer vision and natural language processing applications. Using linear activations collapses the network to a single linear layer, eliminating the advantages of a deep architecture.
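The collapse described above can be checked numerically. The following is a minimal sketch in plain Python, assuming small, randomly initialized layers; the layer widths, the helper names (`matvec`, `matmul`, `vadd`, `forward`, `collapse`), and the input vector are all hypothetical choices made for illustration. It runs a three-layer network with identity activations and compares the result with a single layer built from the folded weights and biases.

```python
import random

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def matmul(A, B):
    """Multiply matrix A (m x k) by matrix B (k x n)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def vadd(u, v):
    """Add two vectors elementwise."""
    return [a + b for a, b in zip(u, v)]

def forward(layers, x, activation):
    """Apply o_l = activation(W_l o_{l-1} + b_l) layer by layer."""
    o = x
    for W, b in layers:
        o = [activation(h) for h in vadd(matvec(W, o), b)]
    return o

def collapse(layers):
    """Fold identity-activation layers into one equivalent pair (W, b)."""
    W, b = layers[0]
    for W_l, b_l in layers[1:]:
        b = vadd(matvec(W_l, b), b_l)  # earlier biases pass through W_l
        W = matmul(W_l, W)             # weights compose by matrix product
    return W, b

rng = random.Random(0)
sizes = [3, 4, 4, 2]  # hypothetical widths: input, two hidden layers, output
layers = [
    ([[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)],
     [rng.uniform(-1, 1) for _ in range(n_out)])
    for n_in, n_out in zip(sizes, sizes[1:])
]

x = [0.5, -1.0, 2.0]
deep_out = forward(layers, x, lambda h: h)  # identity activation
W, b = collapse(layers)
single_out = vadd(matvec(W, x), b)          # one affine layer

max_diff = max(abs(a - c) for a, c in zip(deep_out, single_out))
print(max_diff)
```

The two outputs should agree up to floating-point rounding, whatever the depth or the random seed, which is the sense in which the deep linear network carries no more capacity than a single layer.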

On the other hand, ReLU (Rectified Linear Unit) activations allow DNNs to approximate nonlinear functions. ReLU is a piecewise linear function defined as

$f(x) = \max(x, 0)$, that is, the larger of zero and the input.

Using the ReLU activation $\Psi(x) = \max(x, 0)$, the forward pass is given by

$y = \mathrm{ReLU}\bigl(W_L \, \mathrm{ReLU}\bigl(W_{L-1} \cdots \mathrm{ReLU}(W_1 x + b_1) \cdots + b_{L-1}\bigr) + b_L\bigr)$, which is a nonlinear function.
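As a small illustration of this nonlinearity, the sketch below hand-picks the weights of a tiny ReLU network (two hidden units, zero biases) so that it computes the absolute value of a scalar input. The function name `abs_net` and the weight values are assumptions chosen for the example, not something a trained network is guaranteed to learn. No single layer of the form Wx + b can represent this function, since any such function f satisfies f(-1) + f(1) = 2 f(0).

```python
def relu(h):
    """ReLU activation: the larger of zero and the input."""
    return max(h, 0.0)

def abs_net(x):
    """Two-layer ReLU network with hand-picked weights computing |x|."""
    # Hidden layer: W1 = [[1], [-1]], b1 = [0, 0]
    hidden = [relu(1.0 * x), relu(-1.0 * x)]
    # Output layer: W2 = [[1, 1]], b2 = [0]
    return relu(hidden[0] + hidden[1])

outputs = [abs_net(x) for x in (-2.0, 0.0, 3.0)]

# A function of the form Wx + b would make these two quantities equal.
lhs = abs_net(-1.0) + abs_net(1.0)
rhs = 2 * abs_net(0.0)
print(outputs, lhs, rhs)
```

Replacing `relu` with the identity in `abs_net` would give x + (-x) = 0 for every input, which shows how much of the network's expressiveness comes from the activation alone.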

Although ReLU lets the network learn complex structure from data, it is cheap to compute. Its gradient, a key quantity in back-propagation, is simply 1 for positive inputs and 0 for negative inputs, whereas the gradients of sigmoid and tanh require evaluating exponentials. ReLU also outputs zero whenever the input to a neuron is negative, which introduces sparsity into the network: for any given input, only a subset of neurons is active. This sparsity can help combat overfitting, which often occurs in deep architectures with millions of parameters. ReLU's ability to introduce nonlinearity with little computational overhead therefore makes it a preferred choice over linear activations.
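The gradient and sparsity properties can be sketched in a few lines. The example below assumes, purely for illustration, that pre-activations are drawn from a zero-mean Gaussian; real networks will have different distributions, so the measured fraction is not a general result. The name `relu_grad` and the convention of returning 0 at exactly zero (where ReLU is not differentiable) are likewise choices made for this sketch.

```python
import random

def relu(h):
    """ReLU activation: the larger of zero and the input."""
    return max(h, 0.0)

def relu_grad(h):
    """Gradient of ReLU: 1 for positive inputs, 0 otherwise.

    ReLU is not differentiable at 0; returning 0 there is a convention.
    """
    return 1.0 if h > 0 else 0.0

rng = random.Random(0)
# Hypothetical pre-activations h_l for 10,000 neurons (zero-mean Gaussian).
pre_activations = [rng.gauss(0.0, 1.0) for _ in range(10_000)]
activations = [relu(h) for h in pre_activations]

# Fraction of neurons whose output is exactly zero (inactive neurons).
sparsity = sum(a == 0.0 for a in activations) / len(activations)
print(f"fraction of inactive neurons: {sparsity:.3f}")
```

Under this symmetric assumption roughly half of the neurons are inactive, and the gradient needs only a comparison, with no exponentials to evaluate.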