There are a lot of misconceptions regarding neural networks. This article, and subsequent ones on the topic, will present the major ones one needs to take into account.(1)
The learning algorithm of a neural network tries to optimise the neural network's weights until some stopping condition has been met. How do we do it?
In the image above, we have simply written “adjust weights”… but how is this really accomplished? The question is two-fold: when do we stop, and how do we actually adjust the weights?
The stopping condition is typically one of three: the error of the network on the training set falls to an acceptable level, the error of the network on the validation set begins to deteriorate, or the specified computational budget has been exhausted.
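The three stopping conditions can be combined into a single check that is called once per epoch. The following is only a minimal sketch: the function name `should_stop` and the parameters `target_error`, `patience` and `max_epochs` are hypothetical choices for illustration, and "begins to deteriorate" is interpreted here as "the validation error has not improved over the last `patience` epochs".

```python
def should_stop(train_error, val_errors, epoch, *,
                target_error=0.01, patience=5, max_epochs=1000):
    """Return the reason for stopping, or None to keep training.

    train_error: current error on the training set
    val_errors:  list of validation errors, one per completed epoch
    epoch:       zero-based index of the epoch that just finished
    """
    # 1. The training error has reached an acceptable level.
    if train_error <= target_error:
        return "target error reached"
    # 2. The validation error has stopped improving: the best of the last
    #    `patience` epochs is no better than the best seen before them.
    if len(val_errors) > patience:
        recent_best = min(val_errors[-patience:])
        earlier_best = min(val_errors[:-patience])
        if recent_best >= earlier_best:
            return "validation error deteriorating"
    # 3. The computational budget (here: number of epochs) is used up.
    if epoch + 1 >= max_epochs:
        return "budget exhausted"
    return None
```

In a training loop, the function would be called after each epoch and training would end as soon as it returns something other than `None`.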
The most common learning algorithm for neural networks is backpropagation, which computes the gradient of the error with respect to each weight and is usually paired with stochastic gradient descent to update the weights.
Backpropagation consists of two steps:
- The feedforward pass: the training data set is passed through the network, the output of the neural network is recorded, and the error of the network is calculated.
- Backward propagation: the error signal is passed back through the network and the weights of the neural network are adjusted using gradient descent.
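The two steps can be sketched in a few lines of NumPy. This is an illustration under stated assumptions rather than a reference implementation: the network has one hidden layer with sigmoid activations, the error is the mean squared error, the task is XOR, and the whole data set is used for every update (plain rather than stochastic gradient descent) to keep the sketch short. All names (`forward`, `backward`, `train`) are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, X):
    """Feedforward pass: return hidden activations and network output."""
    W1, b1, W2, b2 = params
    h = sigmoid(X @ W1 + b1)
    y_hat = sigmoid(h @ W2 + b2)
    return h, y_hat

def loss_fn(params, X, y):
    """Half the mean (over samples) of the squared error."""
    _, y_hat = forward(params, X)
    return 0.5 * np.mean(np.sum((y_hat - y) ** 2, axis=1))

def backward(params, X, y):
    """Backward propagation: gradients of loss_fn w.r.t. W1, b1, W2, b2."""
    W1, b1, W2, b2 = params
    h, y_hat = forward(params, X)
    n = X.shape[0]
    # Error signal at the output layer (includes the sigmoid derivative).
    delta2 = (y_hat - y) * y_hat * (1.0 - y_hat) / n
    # Error signal passed back to the hidden layer.
    delta1 = (delta2 @ W2.T) * h * (1.0 - h)
    return [X.T @ delta1, delta1.sum(axis=0), h.T @ delta2, delta2.sum(axis=0)]

def train(X, y, hidden=4, lr=0.5, epochs=2000, seed=0):
    rng = np.random.default_rng(seed)
    params = [rng.normal(0, 1, (X.shape[1], hidden)), np.zeros(hidden),
              rng.normal(0, 1, (hidden, y.shape[1])), np.zeros(y.shape[1])]
    losses = []
    for _ in range(epochs):
        losses.append(loss_fn(params, X, y))
        grads = backward(params, X, y)
        # Gradient descent: move every weight against its gradient.
        params = [p - lr * g for p, g in zip(params, grads)]
    return params, losses

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)
params, losses = train(X, y)
```

Note that every weight is updated in the same step, which is exactly the "adjust all the weights at once" behaviour whose drawbacks are discussed in the article.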
There are some problems with this approach.
First, adjusting all the weights at once can result in a significant movement of the neural network in weight space. Second, the gradient descent algorithm is quite slow. Third, it is susceptible to local minima.
Local minima are a problem for specific types of neural networks, including all those in which the connection weights are calculated using products. The first two problems can be addressed by using variants of gradient descent.
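To make "variants of gradient descent" concrete, here is a minimal sketch of three update rules: plain gradient descent, classical momentum, and RMSProp. The test function (an elongated quadratic bowl), the starting point and all hyperparameter values are assumptions chosen only for illustration, and the gradient is exact rather than stochastic.

```python
import numpy as np

def f(w):
    """Hypothetical test function: a bowl that is much steeper in w[1]."""
    return 0.5 * (w[0] ** 2 + 50.0 * w[1] ** 2)

def grad(w):
    return np.array([w[0], 50.0 * w[1]])

def run_gd(w, lr=0.01, steps=500):
    """Plain gradient descent: step against the gradient."""
    w = w.copy()
    for _ in range(steps):
        w -= lr * grad(w)
    return w

def run_momentum(w, lr=0.01, beta=0.9, steps=500):
    """Momentum: accumulate a velocity from past gradients."""
    w = w.copy()
    v = np.zeros_like(w)
    for _ in range(steps):
        v = beta * v - lr * grad(w)
        w += v
    return w

def run_rmsprop(w, lr=0.01, decay=0.9, eps=1e-8, steps=500):
    """RMSProp: scale each step by a running average of squared gradients."""
    w = w.copy()
    s = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w)
        s = decay * s + (1.0 - decay) * g ** 2
        w -= lr * g / (np.sqrt(s) + eps)
    return w

w0 = np.array([1.0, 1.0])
results = {name: f(run(w0)) for name, run in
           [("gd", run_gd), ("momentum", run_momentum), ("rmsprop", run_rmsprop)]}
```

The only difference between the three functions is how the raw gradient is turned into a step: used directly, accumulated into a velocity, or rescaled per weight.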
The following pictures and descriptions show different gradient algorithms, or rather, different families of algorithms.
Long Valley
Algorithms without scaling based on gradient information really struggle to break symmetry here: SGD gets nowhere, and Nesterov Accelerated Gradient (NAG) and Momentum oscillate until they build up velocity in the optimisation direction.
Algorithms that scale step size based on the gradient quickly break symmetry and begin descent.
Beale's function
Due to the large initial gradient, velocity-based techniques shoot off and bounce around; Adagrad almost becomes unstable for the same reason.
Algorithms that scale gradients or step sizes, like Adadelta and RMSProp, proceed more like accelerated SGD and handle large gradients with more stability.
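For reference, Beale's function is a standard two-dimensional optimisation test function with its global minimum at (3, 0.5). The sketch below defines it together with its gradient, so that the size of the initial gradient can be inspected at any starting point. The starting point `start` is a hypothetical choice and not necessarily the one used in the picture.

```python
import numpy as np

def beale(w):
    """Beale's function; the global minimum is f(3, 0.5) = 0."""
    x, y = w
    t1 = 1.5 - x + x * y
    t2 = 2.25 - x + x * y ** 2
    t3 = 2.625 - x + x * y ** 3
    return t1 ** 2 + t2 ** 2 + t3 ** 2

def beale_grad(w):
    """Analytic gradient of Beale's function."""
    x, y = w
    t1 = 1.5 - x + x * y
    t2 = 2.25 - x + x * y ** 2
    t3 = 2.625 - x + x * y ** 3
    dx = 2 * t1 * (y - 1) + 2 * t2 * (y ** 2 - 1) + 2 * t3 * (y ** 3 - 1)
    dy = 2 * t1 * x + 2 * t2 * (2 * x * y) + 2 * t3 * (3 * x * y ** 2)
    return np.array([dx, dy])

start = np.array([1.0, 1.5])           # hypothetical starting point
initial_gradient_norm = np.linalg.norm(beale_grad(start))
```

A large `initial_gradient_norm` means a large first step for any method that uses the raw gradient, which is consistent with the behaviour of the velocity-based techniques described above.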
Saddle Point
Behaviour around a saddle point.
NAG/Momentum again like to explore around, almost taking a different path.
Adadelta/Adagrad/RMSProp proceed like accelerated SGD.
That having been said, these algorithms cannot overcome local minima and are also less useful when trying to optimise both the architecture and weights of the neural network concurrently.
In order to achieve this, global optimisation algorithms are needed. Two popular global optimisation algorithms are Particle Swarm Optimisation (PSO) and the Genetic Algorithm (GA).
Here is how they can be used to train neural networks. To use, for example, a GA, we have to represent the neural network in a form that the GA can work with, namely as a vector.
Using a neural network vector representation - by encoding the neural network as a vector of weights, each representing the weight of a connection in the neural network - we can train neural networks using most meta-heuristic search algorithms, including GAs. The following figure shows how this is accomplished.
This technique does not work well with deep neural networks because the vectors become too large.
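A minimal sketch of the vector representation, assuming the network's weights are held as a list of NumPy arrays (one per layer); the helper names `encode` and `decode` are hypothetical:

```python
import numpy as np

def encode(weights):
    """Flatten a list of weight arrays into a single 1-D vector."""
    return np.concatenate([w.ravel() for w in weights])

def decode(vector, shapes):
    """Rebuild the list of weight arrays from a flat vector."""
    weights, start = [], 0
    for shape in shapes:
        size = int(np.prod(shape))
        weights.append(vector[start:start + size].reshape(shape))
        start += size
    return weights

# Example: a network with a 2x3 and a 3x1 weight matrix becomes a vector of 9 genes.
layers = [np.arange(6, dtype=float).reshape(2, 3), np.ones((3, 1))]
genes = encode(layers)
restored = decode(genes, [w.shape for w in layers])
```

The length of the vector equals the total number of weights, which is why the representation grows quickly as networks get deeper and wider.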
To train a neural network using a genetic algorithm, we first construct a population of vector-represented neural networks. We then apply three genetic operators to that population to evolve better and better neural networks. These three operators are:
- Selection - Using the sum-squared error of each network, calculated after one feedforward pass, we rank the population of neural networks. The top percentage of the population (or a subset chosen in another manner) survives to the next generation and is used for crossover.
- Crossover - The genes of the selected networks are allowed to cross over with one another. This process forms offspring. In this context, each offspring represents a new neural network with weights from both parent neural networks.
- Mutation - This operator is required to maintain genetic diversity in the population. A small percentage of the population is selected to undergo mutation, and some of the weights in these neural networks are adjusted randomly within a particular range.
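The three operators can be put together in a short sketch. Everything specific here is an assumption made for illustration: the XOR task, the tiny 2-3-1 network, the population size, the fraction kept, uniform crossover, and the mutation range. Mutation is also simplified: instead of first picking networks and then some of their weights, each weight of each offspring is perturbed with a small probability, and the surviving parents are left unchanged.

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)
N_HIDDEN = 3
N_GENES = 2 * N_HIDDEN + N_HIDDEN + N_HIDDEN + 1  # W1, b1, W2, b2

def sse(genes):
    """Sum-squared error of the network encoded by `genes` (one feedforward pass)."""
    W1 = genes[:6].reshape(2, N_HIDDEN)
    b1 = genes[6:9]
    W2 = genes[9:12]
    b2 = genes[12]
    h = np.tanh(X @ W1 + b1)
    out = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))
    return float(np.sum((out - y) ** 2))

def evolve(pop_size=40, generations=100, keep=0.25, mutation_rate=0.1, seed=0):
    rng = np.random.default_rng(seed)
    pop = rng.uniform(-1, 1, (pop_size, N_GENES))
    n_keep = max(2, int(keep * pop_size))
    history = []
    for _ in range(generations):
        # Selection: rank by error and keep the top fraction.
        errors = np.array([sse(ind) for ind in pop])
        order = np.argsort(errors)
        history.append(float(errors[order[0]]))
        parents = pop[order[:n_keep]]
        # Crossover: each gene of a child comes from one of two random parents.
        n_children = pop_size - n_keep
        pa = parents[rng.integers(n_keep, size=n_children)]
        pb = parents[rng.integers(n_keep, size=n_children)]
        mask = rng.random((n_children, N_GENES)) < 0.5
        children = np.where(mask, pa, pb)
        # Mutation: randomly adjust a small fraction of the children's genes.
        mutate = rng.random(children.shape) < mutation_rate
        children = children + mutate * rng.uniform(-0.5, 0.5, children.shape)
        pop = np.vstack([parents, children])
    errors = np.array([sse(ind) for ind in pop])
    history.append(float(errors.min()))
    return pop[np.argmin(errors)], history

best, history = evolve()
```

Because the selected parents are copied unchanged into the next generation, the best error recorded in `history` can never get worse from one generation to the next. No gradient is computed anywhere; the only feedback the GA uses is the error after a feedforward pass.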
I will deal with GAs at length in a future series of articles.
A GA can be brought in wherever it is advantageous for the overall algorithm, for example when gradient-based training cannot overcome local minima.
The key takeaway from this article is that there are many ways of training a neural network. Training is, after all, a process that adjusts the weights of the connections between the nodes in the layers, and this can be accomplished in a multitude of ways. Also recall that we may adjust the architecture of the network itself (essentially adding or removing nodes as required) in order to achieve an optimal result.
(1) The inspiration for the misconceptions is adapted from an article by Stuart Reid from 8 May 2014 available at http://www.turingfinance.com/misconceptions-about-neural-networks/.