October 04, 2020

ML misconceptions (4): size matters, but bigger is not always better

by Sam Sandqvist

There are a lot of misconceptions regarding neural networks. This article, and subsequent ones on the topic, will present the major ones one needs to take into account.(1)

Having selected an architecture one must then decide how large or small the neural network should be. How many inputs are there? How many hidden neurons should be used? How many hidden layers should be used, if any? And how many output neurons are required?

A small plate makes you eat less, but maybe not enough; and a large plate too much, more than necessary. What’s the right size?

As William of Ockham said, “Entities must not be multiplied beyond necessity”. Of course, the reverse also applies, as pointed out by Karl Menger, “Entities must not be reduced to the point of inadequacy”.

These questions are important because a neural network that is too large (too small) could potentially overfit (underfit) the data, meaning that the network would not generalise well out of sample.

Let’s look at the various components: inputs, layers, and outputs, as shown below.

[Figure: a neural network with an input layer, hidden layers, and an output layer]

Inputs

The number of inputs depends on the problem being solved, the quantity and quality of available data, and perhaps some creativity.

Inputs are simply variables which we believe have some predictive power over the dependent variable being predicted.

If the inputs to a problem are unclear, you can systematically determine which variables should be included by looking at the correlations and cross-correlations between potential independent variables and the dependent variable. Inputs that have little or no influence on the output are simply discarded.
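As a minimal sketch, assuming the candidate inputs and the target sit in a pandas DataFrame, a simple correlation screen might look like this (the 'target' column name and the 0.1 cut-off are purely illustrative):

    # A minimal sketch of correlation-based input screening. The DataFrame,
    # the 'target' column name and the 0.1 cut-off are illustrative assumptions.
    import pandas as pd

    def screen_inputs(df: pd.DataFrame, target: str = "target", cutoff: float = 0.1):
        """Return the candidate inputs whose absolute Pearson correlation
        with the target exceeds the cut-off."""
        corr = df.corr()[target].drop(target)
        return corr[corr.abs() >= cutoff].index.tolist()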

There are two problems with using correlations to select input variables.

  • using a linear correlation metric, which only captures linear relationships, you may inadvertently exclude useful variables.
  • relatively uncorrelated variables could potentially be combined to produce a strongly correlated variable. If you look at the variables in isolation you may miss this opportunity.

To overcome the second problem you could use principal component analysis to extract useful eigenvectors (linear combinations of the variables) as inputs. That said, a problem with this is that the eigenvectors may not generalise well, and they also assume that the distribution of input patterns is stationary.
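As a rough sketch of the idea using scikit-learn (the random data and the choice of five components are illustrative only):

    # A minimal sketch of deriving combined inputs with PCA, assuming X is a
    # (samples x variables) array of candidate inputs. Five components is an
    # illustrative choice, not a recommendation.
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    X = np.random.rand(200, 12)                      # stand-in for real inputs
    X_scaled = StandardScaler().fit_transform(X)     # PCA is scale-sensitive
    pca = PCA(n_components=5)
    X_components = pca.fit_transform(X_scaled)       # new inputs for the network
    print(pca.explained_variance_ratio_)             # variance kept per component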

Another problem when selecting variables is multicollinearity, which occurs when two or more of the independent variables being fed into the model are highly correlated. In the context of regression models, this may cause regression coefficients to change erratically in response to small changes in the model or the data. Given that neural networks and regression models are similar, this is also a problem for neural networks.
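A quick, if crude, way to flag this is to look for pairs of inputs with very high pairwise correlation; a variance inflation factor check would be more formal. In the sketch below the 0.9 cut-off is illustrative:

    # A minimal sketch of flagging multicollinearity via pairwise correlations,
    # assuming X is a pandas DataFrame of candidate inputs. The 0.9 cut-off is
    # an illustrative assumption.
    import numpy as np
    import pandas as pd

    def correlated_pairs(X: pd.DataFrame, cutoff: float = 0.9):
        corr = X.corr().abs()
        # keep only the upper triangle so each pair is reported once
        upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
        return [(a, b, float(upper.loc[a, b]))
                for a in upper.index for b in upper.columns
                if pd.notna(upper.loc[a, b]) and upper.loc[a, b] >= cutoff]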

Last, but not least, one statistical bias which may be introduced when selecting variables is omitted-variable bias. Omitted-variable bias occurs when a model leaves out one or more important causal variables. The bias arises when the model incorrectly compensates for the missing variable by over- or underestimating the effect of one of the other variables, i.e., the weights on these variables may become too large, or the error will be large.

Hidden layers

The optimal number of hidden units is problem-specific.

That said, as a general rule of thumb, the more hidden units used, the greater the risk of overfitting becomes. Overfitting is when the neural network does not learn the underlying statistical properties of the data, but rather 'memorises' the patterns and any noise they may contain.

This results in neural networks which perform well in sample but poorly out of sample.

So how can we avoid overfitting? There are three popular approaches used in industry: early stopping, regularisation, and global search.

Early stopping involves splitting your training set into the main training set and a validation set.

Then, instead of training the neural network for a fixed number of iterations, you train it until its performance on the validation set begins to deteriorate.

Essentially this prevents the neural network from using all of the available parameters and limits its ability to simply memorise every pattern it sees.
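As a minimal sketch, scikit-learn's MLPRegressor offers this behaviour out of the box via its early_stopping option (all hyperparameters below are illustrative):

    # A minimal sketch of early stopping with scikit-learn's MLPRegressor.
    # With early_stopping=True it holds out validation_fraction of the training
    # data and stops once the validation score has not improved for
    # n_iter_no_change epochs. All hyperparameters here are illustrative.
    from sklearn.datasets import make_regression
    from sklearn.neural_network import MLPRegressor

    X, y = make_regression(n_samples=500, n_features=10, noise=0.5, random_state=0)

    model = MLPRegressor(hidden_layer_sizes=(32,),
                         early_stopping=True,
                         validation_fraction=0.1,   # internal validation split
                         n_iter_no_change=10,       # patience, in epochs
                         max_iter=1000,
                         random_state=0)
    model.fit(X, y)
    print(model.n_iter_)                            # epochs run before stopping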

Regularisation penalises the neural network for using complex architectures. Complexity in this approach is measured by the size of the neural network weights. Regularisation is done by adding a term to the sum-squared error objective function which depends on the size of the weights.

This is the equivalent of adding a term which essentially makes the neural network believe that the function it is approximating is, by some measure, a smooth curve.
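In code, the penalised objective described above looks roughly like this (lam is an illustrative regularisation strength; in scikit-learn's MLP classes the same idea is exposed as the alpha parameter):

    # A minimal sketch of the regularised objective: sum-squared error plus an
    # L2 (weight-decay) penalty on the network weights. lam is illustrative.
    import numpy as np

    def regularised_loss(y_true, y_pred, weights, lam=0.01):
        sse = np.sum((y_true - y_pred) ** 2)                   # data-fit term
        penalty = lam * sum(np.sum(w ** 2) for w in weights)   # complexity term
        return sse + penalty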

The technique I favour, which is also by far the most computationally expensive, is global search.

In this approach a search algorithm is used to try different neural network architectures and arrive at a near optimal choice. This is most often done using genetic algorithms.
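As a toy sketch of the idea, a stripped-down evolutionary search (selection and mutation only, no crossover) over hidden-layer sizes could look like this; population size, generations and the mutation step are all illustrative, and a real search would cover far more of the architecture:

    # A toy sketch of an evolutionary search over hidden-layer architectures.
    # Each candidate is a tuple of hidden-layer sizes, scored by cross-validation.
    # Population size, generations and mutation step are illustrative assumptions.
    import random
    from sklearn.datasets import make_regression
    from sklearn.model_selection import cross_val_score
    from sklearn.neural_network import MLPRegressor

    X, y = make_regression(n_samples=300, n_features=10, noise=0.5, random_state=0)

    def fitness(arch):
        model = MLPRegressor(hidden_layer_sizes=arch, max_iter=300, random_state=0)
        return cross_val_score(model, X, y, cv=3).mean()

    def mutate(arch):
        arch = list(arch)
        i = random.randrange(len(arch))
        arch[i] = max(2, arch[i] + random.choice([-8, 8]))    # perturb one layer
        return tuple(arch)

    random.seed(0)
    population = [(random.randint(4, 64),) for _ in range(6)]  # one hidden layer
    for generation in range(5):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[:3]                                   # keep the fittest
        population = parents + [mutate(random.choice(parents)) for _ in range(3)]

    print("best architecture:", max(population, key=fitness))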

Outputs

Neural networks can be used for either regression or classification. Under a regression model a single value is output, which may be mapped to a set of real numbers, meaning that only one output neuron is required.

Under a classification model an output neuron is required for each potential class to which the pattern may belong. If the classes are unknown, unsupervised neural network techniques such as self-organising maps should be used.
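A small sketch of the difference, using scikit-learn's MLP classes (which infer the output layer from the targets); the synthetic data and hyperparameters are purely illustrative:

    # A minimal sketch of output sizing: one output neuron for regression,
    # one per class for multi-class classification. The synthetic data and
    # hyperparameters are illustrative.
    from sklearn.datasets import make_classification, make_regression
    from sklearn.neural_network import MLPClassifier, MLPRegressor

    Xr, yr = make_regression(n_samples=200, n_features=5, random_state=0)
    reg = MLPRegressor(hidden_layer_sizes=(16,), max_iter=500,
                       random_state=0).fit(Xr, yr)
    print(reg.n_outputs_)   # 1: a single real-valued output neuron

    Xc, yc = make_classification(n_samples=200, n_features=5, n_informative=3,
                                 n_classes=3, random_state=0)
    clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=500,
                        random_state=0).fit(Xc, yc)
    print(clf.n_outputs_)   # 3: one output neuron per class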

In conclusion, the best approach is to follow Ockham’s Razor. Ockham’s razor argues that for two models of equivalent performance, the model with fewer free parameters will generalise better.

On the other hand, one should never opt for an overly simplistic model at the cost of performance. Similarly, one should not assume that just because a neural network has more hidden neurons and maybe more hidden layers it will outperform a much simpler network.

Unfortunately it seems that too much emphasis is placed on large networks and too little emphasis is placed on making good design decisions. In the case of neural networks, bigger isn't always better. 

(1) The inspiration for the misconceptions is adapted from an article by Stuart Reid from 8 May 2014 available at http://www.turingfinance.com/misconceptions-about-neural-networks/.


Sam Sandqvist

Dr Sam Sandqvist is our in-house Artificial Intelligence Guru. He holds a Dr. Sc. in Artificial Intelligence and is a published author. He specialises in AI Theory, AI Models and Simulations. He also has industry experience in FinServ, Sales and Marketing.