
What is a good activation function for neural nets?

The wish list is:
- cheap to compute
- cheap to train (quick to train & stable with depth)
- expressive (keeps the neural net small, works with fewer parameters)

We want to avoid the function being "flat", because flat regions kill the gradient, and with it learning.

Here is an idea: what about lying about the derivative of a function? The function could have many wide flat regions while still increasing globally in smooth steps, and we would simply pretend that the derivative is strictly non-$0$ on those flat regions.
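In PyTorch this kind of lie is easy to express with a custom `autograd.Function`. A minimal sketch, using `floor` as a stand-in for a function with wide flat regions (the class name is mine, not from any source code):

```python
import torch

class PretendSlopeOne(torch.autograd.Function):
    """Forward pass: a staircase with flat steps. Backward pass: pretend the slope is 1."""

    @staticmethod
    def forward(ctx, x):
        return torch.floor(x)      # truly flat almost everywhere

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output         # the lie: report a derivative of 1 instead of 0

lying_floor = PretendSlopeOne.apply   # usable like any activation: y = lying_floor(x)
```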

This paper says that polynomial activations are not versatile (note that this argument ignores the learning process).

With a polynomial activation $p$, we get the following symmetry: $x↦p(⟨x|x_0⟩+b_0)$ is also a polynomial. This means that along any direction, the function behaves like a 1d polynomial. This doesn't hold for ReLU, which is visible in the following interactive illustrations.
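To make the "along any direction" claim concrete, restrict the input to a line $x = a + t\,v$: then $p(⟨a+t\,v|x_0⟩+b_0) = p(⟨a|x_0⟩+b_0+t⟨v|x_0⟩)$, which is a 1d polynomial in $t$ of the same degree as $p$, whatever the line. The same restriction applied to ReLU gives $\max(0,\,c+t\,d)$ with $c=⟨a|x_0⟩+b_0$ and $d=⟨v|x_0⟩$: piecewise linear, with a kink whose position depends on where the line crosses the hyperplane.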

Contrary to initial intuition, using sin as an activation doesn't make the neural net do a Fourier transform. Indeed, Fourier analysis isn't about feeding the signal into $\exp(iωt)$, but rather about looking at the relationship between the signal and its underlying space. It is the spatial (or temporal) coordinates that are put into the ℂ-exp, and then we project the signal onto the obtained oscillators.

This means we need those coordinates. To give this "Fourier-like" operation a chance to happen, we should feed the neural net those coordinates.
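As a minimal sketch of what "feeding the coordinates" could look like: a coordinate network, here with an optional Fourier-feature lift of the positions (the function names and the frequency choices are mine, not something this post commits to):

```python
import torch
from torch import nn

def fourier_features(coords, n_freqs=6):
    """Lift 2d coordinates to [sin(2^k π c), cos(2^k π c)] features."""
    freqs = 2.0 ** torch.arange(n_freqs) * torch.pi     # (n_freqs,)
    angles = coords.unsqueeze(-1) * freqs               # (N, 2, n_freqs)
    return torch.cat([angles.sin(), angles.cos()], dim=-1).flatten(1)   # (N, 4*n_freqs)

# The network maps a pixel position to the signal value at that position.
net = nn.Sequential(
    nn.Linear(4 * 6, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 1),
)

coords = torch.rand(1024, 2)             # (x, y) positions in [0, 1]^2
pred = net(fourier_features(coords))     # predicted signal values at those positions
```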

Let's look inside a neural net. Here we visualize the state of the neural net over its entire input space, which is chosen to be 2d for pragmatic reasons (even though this is terribly low-dimensional).

To be clear, here are a few equivalent formulations:
- for each image, the $x$ axis corresponds to the first input value, and the $y$ axis to the second input value.
- each image is an atlas of all "2-pixel images"
- the position (coordinates) on the atlas gives the values of the 2 pixels of a 2-pixel image

$x^{(0)}:= A_0x+b_0$
$a^{(0)}:= σ_0(x^{(0)})$
$x^{(1)}:= A_1⋅a^{(0)}+b_1$
$a^{(1)}:= σ_1(x^{(1)})$
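As a sketch of how such an atlas can be computed (with made-up parameters for a 2 → 3 → 1 toy net; the real illustration below is interactive):

```python
import torch

# Made-up parameters matching the equations above.
A0, b0 = torch.randn(3, 2), torch.randn(3)
A1, b1 = torch.randn(1, 3), torch.randn(1)
σ0 = σ1 = torch.relu

# Evaluate the net on a regular grid covering the 2d input space.
lin = torch.linspace(-3, 3, 256)
ys, xs = torch.meshgrid(lin, lin, indexing="ij")
inputs = torch.stack([xs, ys], dim=-1).reshape(-1, 2)   # one row per pixel of the atlas

x0 = inputs @ A0.T + b0        # pre-activations of layer 0
a0 = σ0(x0)
x1 = a0 @ A1.T + b1
a1 = σ1(x1)

# Reshaped to 256×256, each channel of x0, a0, x1, a1 is one image of the atlas.
atlas = a1.reshape(256, 256, -1)
```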
[Interactive illustration: the weights $A_0$, $b_0$, $A_1$, $b_1$, the activations $σ_0$, $σ_1$ and the parameters $α$, $β$, $γ$, $δ$ are editable, with a zoom control.]

Here is a version with more layers.

Visualize the derivative for a given parameter?

In the previous illustration, we can see how ReLU-like activation functions are able to create some independence in the output. Indeed, when a region maps to black (i.e. is constant at $0$) on the image, the output there can no longer be influenced by the data. The whole region is set to be constant, so only the neural net's weights can decide the outcome of that region.

How to analyse and test

Most people seem to run empirical benchmarks on a specific problem or on sets of different problems. I'm personally not a huge fan of this approach, because it makes it hard to understand why something is working, but I'm unable to propose anything better so far.

An analysis on distributions seems quite difficult, as the objects to be manipulated get very complex quickly. Here is an example with the Fourier transform seen as an operator on signal distributions.

Here is my toy problem (source code): I made a convolutional auto-encoder like this:

```python
channels = 128
latent_dim = 16

encoder = nn.Sequential(
    nn.Conv2d(1,        channels,   (3, 3), padding=0, stride=1), σ, nn.BatchNorm2d(channels),
    nn.Conv2d(channels, channels,   (3, 3), padding=0, stride=1), σ, nn.BatchNorm2d(channels),
    nn.Conv2d(channels, channels,   (3, 3), padding=0, stride=1), σ, nn.BatchNorm2d(channels),
    nn.Conv2d(channels, channels,   (3, 3), padding=0, stride=1), σ, nn.BatchNorm2d(channels),
    nn.Conv2d(channels, channels,   (3, 3), padding=0, stride=1), σ, nn.BatchNorm2d(channels),
    nn.Conv2d(channels, channels,   (3, 3), padding=0, stride=3), σ, nn.BatchNorm2d(channels),
    nn.Conv2d(channels, channels,   (3, 3), padding=0, stride=3), σ, nn.BatchNorm2d(channels),
    nn.Conv2d(channels, latent_dim, (2, 2), padding=0, stride=1), σ, nn.BatchNorm2d(latent_dim),
)

decoder = nn.Sequential(
    nn.ConvTranspose2d(latent_dim, channels, (2, 2), padding=0, stride=1), σ, nn.BatchNorm2d(channels),
    nn.ConvTranspose2d(channels,   channels, (3, 3), padding=0, stride=3), σ, nn.BatchNorm2d(channels),
    nn.ConvTranspose2d(channels,   channels, (3, 3), padding=0, stride=3), σ, nn.BatchNorm2d(channels),
    nn.ConvTranspose2d(channels,   channels, (3, 3), padding=0, stride=1), σ, nn.BatchNorm2d(channels),
    nn.ConvTranspose2d(channels,   channels, (3, 3), padding=0, stride=1), σ, nn.BatchNorm2d(channels),
    nn.ConvTranspose2d(channels,   channels, (3, 3), padding=0, stride=1), σ, nn.BatchNorm2d(channels),
    nn.ConvTranspose2d(channels,   channels, (3, 3), padding=0, stride=1), σ, nn.BatchNorm2d(channels),
    nn.ConvTranspose2d(channels,   1,        (3, 3), padding=0, stride=1), nn.Sigmoid(),
)
```

It was designed without pooling, so that we maximize the use of non-linearities and allow a comparison with a fully linear model by setting $σ$ to be the identity. It also makes sure that we get down to a single pixel, so the latent space dimension can be controlled easily.

BatchNorm is there to regularize the neural net.
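For reference, the curves that follow come from a loop of roughly this shape. This is only a sketch: `build_autoencoder` is a hypothetical helper wrapping the encoder/decoder above, the loaders are standard Fashion-MNIST loaders, and an MSE reconstruction loss is assumed; the linked source code is the authoritative version.

```python
import torch
from torch import nn

def test_loss(model, loader):
    """Mean reconstruction MSE over the test set."""
    model.eval()
    total, count = 0.0, 0
    with torch.no_grad():
        for x, _ in loader:
            total += nn.functional.mse_loss(model(x), x, reduction="sum").item()
            count += x.numel()
    return total / count

# for name, σ in activations.items():        # e.g. {"relu": nn.ReLU(), "identity": nn.Identity()}
#     model = build_autoencoder(σ)           # hypothetical helper: encoder + decoder above
#     opt = torch.optim.Adam(model.parameters())
#     for epoch in range(200):
#         model.train()
#         for x, _ in train_loader:
#             opt.zero_grad()
#             nn.functional.mse_loss(model(x), x).backward()
#             opt.step()
#         curves[name].append(test_loss(model, test_loader))
```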

Here are the log(loss) curves over the epochs for many activation functions. We look at the loss on the test set of Fashion-MNIST.

It is interesting to see how some non-linearities score much worse than a simple linear model.

RReLU, which is a randomized ReLU, also produces poor scores in this model. This seems to indicate that consistency in evaluation is important, even though this paper seems to disagree. I wonder whether adding noise to the derivative evaluation (instead of the function evaluation) would work with their approach.
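To make that musing concrete, here is what "noise on the derivative only" could look like, keeping the forward pass deterministic (my sketch, with an arbitrary noise range; not what the paper does):

```python
import torch

class NoisyGradReLU(torch.autograd.Function):
    """Deterministic ReLU forward; multiplicative noise is injected only in the backward pass."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.relu(x)

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        slope = (x > 0).to(grad_output.dtype)
        noise = torch.empty_like(grad_output).uniform_(0.5, 1.5)   # arbitrary range
        return grad_output * slope * noise
```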

Arbitrary activations such as a fixed Perlin noise (a random array with smoothstep interpolation) at different resolutions seem able to do an acceptable job. They are not very consistent during training, but after 200 epochs they get close to the score of sigmoid.
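A sketch of such a "frozen noise" activation, assuming a 1d table of random values interpolated with smoothstep (the knot count, range and clamping behaviour are my guesses, not the exact setup used):

```python
import torch
from torch import nn

class PerlinishActivation(nn.Module):
    """A fixed random table, smoothstep-interpolated; constant (flat) outside [-span, span]."""

    def __init__(self, n_knots=32, span=4.0, seed=0):
        super().__init__()
        g = torch.Generator().manual_seed(seed)
        self.register_buffer("values", torch.randn(n_knots, generator=g))
        self.n, self.span = n_knots, span

    def forward(self, x):
        # Map x from [-span, span] to the knot index space [0, n-1].
        u = (x / (2 * self.span) + 0.5).clamp(0, 1) * (self.n - 1)
        i0 = u.floor().long().clamp(0, self.n - 2)
        t = u - i0.to(u.dtype)
        t = t * t * (3 - 2 * t)                       # smoothstep interpolation weight
        v0, v1 = self.values[i0], self.values[i0 + 1]
        return v0 + t * (v1 - v0)
```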

In my setup, the best-performing activations are the ∗-LU type activations.

The Cantor activation is obviously a lost cause. But I made Cantor2, which pretends the derivative is $1$ instead of $0$. We can see how lying about the derivative by providing a "blurred vision" of it helps significantly.

I think there are more functions to try here: we could design an activation with "flat" regions to cause sparsity (like ReLU or GELU do), while not killing the gradient. An obvious choice would be a staircase-like function with hard-ish corners, whose reported derivative is that of a staircase with soft-ish corners.
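A sketch of that last idea (the names and the sigmoid-based "soft" staircase are my choices):

```python
import torch

class SoftGradStairs(torch.autograd.Function):
    """Hard-cornered staircase forward; backward reports the derivative of a
    sigmoid-smoothed staircase, so the flat steps still let gradient through."""

    K = 4.0  # sharpness of the pretended soft staircase

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.floor(x + 0.5)                  # truly flat steps, jumps at n + 1/2

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        frac = x - torch.floor(x)                    # position inside the unit cell
        s = torch.sigmoid(SoftGradStairs.K * (frac - 0.5))
        return grad_output * SoftGradStairs.K * s * (1 - s)   # slope of the soft staircase
```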

Also, could this be used on regular neural nets to accelerate convergence?

What about a different derivative evaluation depending on whether we want to go left or right? For example:
ReLU'(+x) = {left: 1, right: 1}
ReLU'(-x) = {left: 0, right: 1}

On backprop we would compute both options and choose the direction that fits best. We should be careful with momentum-based optimizers though.

What else could we look at?

1. Where is the activation sampled? Where is the preferred region? Does the learning rate change that? Does it change from run to run?

2. Where is the activation sampled across epochs? Does the preferred region change?

3. What is the distribution of the sampled loss derivatives?