ML: Neural Networks
Neural Networks: Motivation
Non-linear Separators
Many practical problems require decision boundaries that are nonlinear, and our linear perceptron model isn't expressive enough to capture this relationship.
Consider the following set of data:
We would like to separate the two colors, and clearly there is no way this can be done in a single dimension (a one-dimensional decision boundary would be a point, separating the axis into two regions).
Tip
To fix this problem, we can add additional (potentially nonlinear) features to construct a decision boundary from.
Consider the same dataset with the addition of \(x^2\) as a feature:
We were able to fix the problem by mapping our data to a higher-dimensional space, manually adding useful features to each data point.
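As a minimal sketch of this idea, appending \(x^2\) as a second feature turns a one-dimensional dataset with no separating point into one that a simple linear rule can separate. The data values and the hand-picked threshold below are made up for illustration:

```python
import numpy as np

# Hypothetical 1-D dataset: the two classes interleave along x,
# so no single threshold (a point) on the x-axis separates them.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = np.array([1, -1, -1, -1, 1])            # class labels: +1 / -1

# Manually add the nonlinear feature x^2.
features = np.stack([x, x**2], axis=1)      # shape (n, 2)

# In the lifted space the classes become linearly separable,
# e.g. by the hand-picked linear rule x^2 > 2.
predictions = np.where(features[:, 1] > 2, 1, -1)
print(np.all(predictions == y))             # True
```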
A natural desire is to learn these featurization or transformation functions as well, perhaps using a nonlinear function class that is capable of representing a wider variety of functions.
Multi-layer Perceptrons
A multi-layer perceptron combines multiple layers of perceptron-like units, with the outputs of one layer feeding into the next as inputs. Such a network is a universal function approximator and can represent any real function, leaving us only with the problem of selecting the best set of weights to parameterize our network.
Theorem - Universal Function Approximators
A two-layer neural network with a sufficient number of neurons can approximate any continuous function to any desired accuracy.
Measuring Accuracy
The accuracy of the binary perceptron after making \(n\) predictions can be expressed as:
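\[
l^{\text{acc}}(\mathbf{w}) = \frac{1}{n}\sum_{i=1}^{n}\mathbf{1}\left[\operatorname{sgn}\left(\mathbf{w}\cdot\mathbf{f}(x^{(i)})\right) = y^{(i)}\right]
\]

Here \(\mathbf{w}\) is the weight vector, \(\mathbf{f}(x^{(i)})\) the feature vector of the \(i\)-th data point, \(y^{(i)}\) its true label, and \(\mathbf{1}[\cdot]\) an indicator that equals \(1\) when the prediction is correct and \(0\) otherwise; this is one standard way to write the perceptron's accuracy.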
In this context, \(\operatorname{sgn}(x)\) is the sign function, which returns \(1\) if \(x\) is positive and \(-1\) otherwise. Note that the accuracy function above is equivalent to dividing the total number of correct predictions by the total number of predictions.
Sometimes, we want an output that is more expressive than a binary label. It then becomes useful to produce a probability for each of the \(N\) classes we want to classify into, reflecting our degree of certainty that the data point belongs to each of the possible classes.
We transition from storing a single weight vector to storing a weight vector \(\mathbf{w}_j\) for each class \(j\), and estimate probabilities with the softmax function. The softmax function defines the probability of classifying \(x^{(i)}\) to class \(j\) as:
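\[
\sigma(x^{(i)})_j = \frac{e^{f_j(x^{(i)})}}{\sum_{k=1}^{N} e^{f_k(x^{(i)})}}
\]

where \(f_j(x^{(i)}) = \mathbf{w}_j \cdot \mathbf{f}(x^{(i)})\) is the score assigned to class \(j\) by its weight vector (this per-class dot-product score is the natural choice given the weight vectors introduced above).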
Given a vector that is output by our function \(f\), softmax performs normalization to output a probability distribution. To come up with a general loss function for our models, we can use this probability distribution to generate an expression for the likelihood of a set of weights:
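\[
\ell(\mathbf{w}_1,\ldots,\mathbf{w}_N) = \prod_{i=1}^{n} P\left(y^{(i)} \mid x^{(i)};\, \mathbf{w}_1,\ldots,\mathbf{w}_N\right) = \prod_{i=1}^{n} \sigma(x^{(i)})_{y^{(i)}}
\]

assuming the \(n\) observations are independent, where \(\sigma(x^{(i)})_{y^{(i)}}\) is the softmax probability assigned to the true class of the \(i\)-th data point.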
This expression denotes the likelihood of a particular set of weights explaining the observed labels and data points. We would like to find the set of weights that maximizes this quantity.
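To make this concrete, here is a small sketch (with made-up scores and labels) of computing the softmax probabilities and the resulting likelihood of a particular weight setting:

```python
import numpy as np

def softmax(scores):
    """Normalize a vector of class scores into a probability distribution."""
    exp = np.exp(scores - np.max(scores))    # shift for numerical stability
    return exp / exp.sum()

# Hypothetical class scores f_j(x^(i)) for three data points and N = 3 classes.
scores = np.array([[ 2.0, 0.5, -1.0],
                   [ 0.1, 1.2,  0.3],
                   [-0.5, 0.0,  2.5]])
labels = np.array([0, 1, 2])                 # true class of each point

probs = np.apply_along_axis(softmax, 1, scores)

# Likelihood of this set of weights: the product of the probabilities
# assigned to the true classes (its log is usually optimized in practice).
likelihood = np.prod(probs[np.arange(len(labels)), labels])
print(probs, likelihood)
```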
Multi-layer Feedforward Neural Networks
Note
This is much like the multi-layer perceptron; however, we choose a different non-linearity to apply after the individual perceptron nodes. Note that it is these added non-linearities that make the network as a whole non-linear and more expressive.
Sigmoid Function
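The sigmoid squashes its input into the range \((0, 1)\):
\[
\operatorname{sigmoid}(x) = \frac{1}{1 + e^{-x}}
\]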
Rectified Linear Unit (ReLU)
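The ReLU zeroes out negative inputs and passes positive inputs through unchanged:
\[
\operatorname{ReLU}(x) = \max(0, x)
\]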
The choice of nonlinearity is a design choice that typically requires some experimentation to select a good one for each individual use case.
Loss Functions and Multivariate Optimization
We can find the optimal values of the parameters using the gradient ascent method described earlier.
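As a minimal sketch of this idea (the one-layer softmax model, synthetic data, and learning rate below are illustrative assumptions rather than the full multi-layer network), gradient ascent repeatedly nudges the weights in the direction of the gradient of the log-likelihood:

```python
import numpy as np

def softmax(scores):
    exp = np.exp(scores - scores.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)

# Hypothetical featurized training data: n points, d features, N classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                        # feature vectors f(x^(i))
y = np.argmax(X @ rng.normal(size=(3, 4)), axis=1)   # synthetic true labels
W = np.zeros((3, 4))                                 # one weight column per class

learning_rate = 0.1
for _ in range(500):
    probs = softmax(X @ W)                  # P(class j | x^(i)) for every point
    onehot = np.eye(4)[y]                   # indicator of each true class
    # Gradient (averaged over the data) of the log-likelihood w.r.t. W
    # for this softmax model.
    grad = X.T @ (onehot - probs) / len(X)
    W += learning_rate * grad               # gradient ASCENT step

log_likelihood = np.log(softmax(X @ W)[np.arange(len(y)), y]).sum()
print(log_likelihood)
```

The same update is often phrased equivalently as gradient descent on the negative log-likelihood, which is the more common convention for neural network loss functions.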