Classification Part I: Logistic Regression

Logistic regression is a classification technique for classifying data points x (a vector) into one of K classes $C_{1:K}$.  It works by (essentially) projecting the datapoints onto a set of (pre-specified) features (which are simply vectors formed of functions of the datapoints’ components), and then finding linear separating (hyper-)planes in this feature-space.  It is widely applicable, and is used as a component in many other classification algorithms (sometimes as a final step after some other algorithm is used to determine the features on which to classify).

Generative Interpretation

There are several ways of looking at how logistic regression works, but one of the most interesting is to think about how to classify a point x, given a probabilistic generative model of the data for each class, i.e. $p(x | C_i)$ (for simplicity, in this section we’ll just consider the two-class case). The probability of x being in class 1 is then given by Bayes’ theorem as $p(C_1 | x) \propto p(x | C_1)p(C_1)$ where $p(C_1)$ is a prior probability of being in class 1. Calculating $p(x | C_i)p(C_i)$ for $i=1, 2$ is sufficient to discriminate between two classes, since we can use the fact that $p(C_2 | x) = 1 - p(C_1 | x)$ to normalize the class probabilities, giving the probability of a point x being in class 1 as

$p(C_1 | x) = \frac{p(x | C_1)p(C_1)}{p(x | C_1)p(C_1) + p(x | C_2)p(C_2)} = \frac{1}{1+exp(-a)} = \sigma(a)$

where $a = \log \frac{p(x | C_1)p(C_1)}{p(x | C_2)p(C_2)}$ and where $\sigma(a)$ is the logistic sigmoid function $\sigma(a) = \frac{1}{1+exp(-a)}$.

The problem is that, in a lot of cases, it is hard to come up with a good generative model for the data.  We might not have much idea how the data was generated, and it might be very hard to design a model that captures the true data-generating process.  But what if we choose a very general generating model: the exponential family?  This is a family of probability distributions that includes a large number of commonly encountered distributions, and which can be expressed in the form $p(x | C_i) = h(x)g(\lambda_i)\exp(\lambda_i^T\phi(x))$.  Here $\lambda_i$ is a vector of parameters of the distribution for class i, and $\phi(x)$ is a set of features of x; h(x) is simply a normalizing function.  Given this not-very-restrictive generative model, the expression for $a$ above is given by

$a = (\lambda_1-\lambda_2)^T\phi(x) + \log g(\lambda_1) - \log g(\lambda_2) + \log p(C_1) - \log p(C_2)$.

In terms of the datapoint x, this is a linear function of the features $\phi(x)$ (plus a constant), so can be written as $a = w^T\phi(x) + b$.

Therefore for a wide class of generative models, we would end up with the same expression for the probability a datapoint being in class 1:

$p(C_1 | x) = \sigma(w^T\phi(x)+b)$.

So, since this covers a wide variety of generative models, why not just use this expression directly to model $p(C_1 | x)$?  This gives a “discriminative model”, i.e. a model of how to directly assign the probabilities of a point being in each class.

The problem then becomes how to find the optimal parameters $w = \lambda_2 - \lambda_1$ and $b$ that lead to the best discrimination of the datapoints into the two classes.  One way to do this is to find values of these parameters that maximize the probability of a set of class labels for some training set of data (the class-label likelihood).  This is the problem that logistic regression solves.

Logistic Regression

Given a set of N training data $x_{1:N}$ with class labels $t_{1:N}$ (where each $t_i\in\{0,1\}$ is an indicator of the training label being in class 1), logistic regression consists in finding the parameter vectors in the above model that maximize the (log) classification likelihood $\log p(t_{1:N} | x_{1:N}) = \sum_{i=1}^N \log p(t_i | x_i)$ (assuming that all the points in the training set are independent).  The likelihood for an individual point is given by  $p(t_i | x_i) = p(C_1 | x_i)^{t_i}(1-p(C_1 | x_1))^{1-t_i}$ and so the overall classification log-likelihood is given by

$\log p(t_{1:N} | x_{1:N}) = \sum_{i=1}^N t_i \log p(C_1 | x_i) + (1-t_i)\log(1-p(C_1 | x_i))$ $= \sum_{i=1}^N t_i \log \phi(w^T\phi_i) + (1-t_i)\log(1-\phi(w^T\phi_i))$

where $\phi_i = [\phi(x_i); 1]$ is the feature vector corresponding to datapoint i augmented with a fixed bias feature with value 1.  This allows the bias term b to be included as the final element in an augmented weight vector $\tilde w=[w; b]$.

Unfortunately, this log-likelihood cannot be analytically optimized with respect to the parameters $w$, so numerical techniques have to be used.  Many of these require the gradient of the log-likelihood with respect to the parameters $\tilde w$, which is given by (using the fact that $\frac{d\phi(x)}{dx} = \phi(x)(1-\phi(x))$):

$\nabla_w\log(t_{1:N} | x_{1:N})=-\sum_{i=1}^N(\sigma(\tilde w^T\phi_i)-t_i)\phi_i$.

The negative of this term is sometimes known as the error or cross-entropy.

Multiclass Case

The arguments above can be extended to the multiclass case by replacing the logistic sigmoid function $\sigma(a)$ with its multiclass equivalent, the softmax function, given by, for class i, by

$s(x, W, i) = \frac{\exp(\tilde w_i^T\phi_i)}{\sum_{k=1}^K \tilde w_k^T\phi_k}$,

where W is a matrix of the weights for each class with $\tilde w_i$ being its i-th column.  Clearly in the two-class case this reduces to the logistic function. When applied to the classification problem, the softmax function gives the probability of datapoint x being in each class, i.e. $p(C_i | x)$ (it is clear that it is normalized correctly).  In order to map from the output of the softmax function to a class label, the maximum value of the softmax function can be used,  (i.e. the estimated label is given by $\hat C = \arg\max_i s(x, W, i)$), which corresponds to maximum-a-posteriori (MAP) class labelling for the generative model that has been learnt.

Optimization of Weights

There are numerous methods available for weight optimization.  Because the objective function here (the negative of the training-data log-likelihood) is actually convex, an efficient iterative scheme based on Newton’s method (called Iteratively Reweighted Least Squares) is available and fairly easy to implement (see, for example, chapter 4 in Pattern Recognition and Machine Learning (2006) by Bishop).  This, like many standard optimization algorithms (for example those included in scipy’s optimization package or Matlab’s fminsearch) are batch algorithms that work on the entire dataset in one go.  These are usually the best options for smallish datasets, but for very large datasets they can be unusable because they require the calculation of quantities over the entire data set.

A commonly used (and easy-to-implement) alternative to batch optimization methods is stochastic gradient descent (SGD).  This calculates the log-likelihood gradient for a single datapoint or small batch (minibatch) and then updates the current estimate of the parameters by taking a step in the direction of the calculated gradient according to

$w:=w-\nabla E_i(w)$

where $E_i(w)$ is the error function we want to minimize (i.e. the negative log-likelihood) calculated for datapoint (or batch) i, rather than for the whole data set, at the current weight values w (for all classes).

Perhaps surprisingly, this can be shown under certain conditions (including a decreasing learning rate $\eta$) to converge to a local minimum (possibly after multiple iterations over the data).  As might be expected, it is not as efficient as batch methods, but it has the huge advantage that not all the data has to be dealt with when calculating each step, allowing very large datasets to be dealt with.

There are lots of references on SGD, but a useful “cheat-sheet” is the paper “Stochastic Gradient Descent Tricks” by Leon Bottou, which gives useful advice about choosing various parameters for the method.

Code

The following code implements logistic regression in Matlab.

I absolutely do not claim that this is the best code for linear regression, merely a fairly simple basic implementation.  There are a couple of interesting features of this code:

• The learning rate is set as $\eta = \frac{a}{b+t}$ where $t$ is the current epoch (pass over the dataset).  This is suggested in the Bottou paper as a fairly effective formula, although the tricky thing is to choose the a and b parameters.
• In order to give some feedback on how the logistic regression is doing (to, for example, help tune the learning rate), 5% of the training data is held back as a validation set.  After each pass through the data, the success of the classifier on the validation set is calculated and reported.  Nothing else is done with it, although it could perhaps be used in some sort of automated process to set the learning rate.
• The Bottou paper also notes that the behaviour of SGD is somewhat independent of the size of the dataset.  Therefore, it suggests small subsets of the training data can be used  as tuning sets to attempt to choose good learning rates.
• (There are a couple of weird comments like %’, which I’ve included because the syntax highlighting is treating Matlab’s transpose operator ‘ as the start of a string).

Results on Artificial Data

In order to see logistic regression in action, I have applied it to the problem of classifying two dimensional data drawn from one of four classes.  The data for each class is generated from a different 2d Gaussian distribution. Sample training data from the classes (coloured by class) is shown as dots on the figures below.

What we would like to do is to learn the weights in a 4 class logistic regression that will best allow points such as these to be classified.  For illustration, I am going to choose the features to simply be the x and y components of the data points.  This will mean that the feature space and datapoint space coincide.

The class boundary between two classes i and j is given by the locus in feature space on which the probabilities for each class $p(C_i | x)$ and $p(C_j | x)$ are equal.  Since these probabilities are given by the softmax function, this occurs at the points $\phi_*$ in (constant-augmented) feature space where $\tilde w_i^T\phi_* = \tilde w_j^T\phi_*$, i.e. where $(\tilde w_i-\tilde w_j)^T\phi_*=0$.  And this is the equation for a plane in feature space with normal $(\tilde w_i-\tilde w_j)$ which passes through the origin (remember that we included the bias term within the set of features).  If you would prefer to think of the feature space as the space over features excluding the constant term, the class divisions will be planes with a non-zero offset from the origin, given by the difference between the bias weights for the two classes (this can be seen by writing the feature weightings for each class as $w_i^T \bar\phi_* + b_i$, with $\bar \phi_*$ being the feature vector not including the bias term.

For linear features such as the ones chosen here, therefore, the divisions between classes will be planes (lines in 2d) in data space (since that coincides with feature space (excluding the bias)).  Using the stochastic gradient descent procedure above the following class boundaries can be learnt:

The multicoloured lines show the boundaries between each pair of classes (the two colours of each line indicate which ones).  Erasing the lines (rather crudely in MS Paint!) where they are not relevant gives the following division of the space into classes:

which looks fairly reasonable.  Testing this on out-of-sample data leads to about a 4% classification error rate, which again looks like the sort of rate you might expect given the boundaries above and the fact that the classes overlap each other.  If nonlinear features had been included in the feature set (for example the quadratic features $x^2$, $xy$ and $y^2$ for the (x,y) coordinates of each point, these class boundaries would very likely be non-linear in data space.

Regularization

One thing that is not mentioned (or implemented) above is the idea of regularization.  This involves introducing a penalty term to penalize elements of the weight vector.  This can be used to control problems with overfitting (which can be severe if the classes are linearly separable in feature space).  A common penalty term is the L2 penalty $0.5\lambda^{2}w^Tw$.  Inclusion of this penalty is easily accomplished in the above example by including its gradient $\lambda w$ in the calculation of the error gradient.  Inclusion of a regularizing term such as this seems to be particularly important for the numerical stability of the Iteratively Reweighted Least Squares algorithm.

Conclusion

Logistic regression is a basic supervised learning technique for classifying points into K classes.  It requires labelled training data.  It also requires a set of “features” to be defined in advance.  Although there is no closed-form solution to the logistic regression problem, it can easily be solved by numerical optimization methods.  Batch optimization methods are more efficient for small datasets, but for large sets of data stochastic gradient descent can be an effective tool.

In the next article, I will use logistic regression to classify the MNIST data set of handwritten digits.