7.3 Basic Models for Supervised Learning

Supervised learning methods take the input features, the target features, and the training data and return predictors, functions on input features that predict values for the target features. Learning methods are often characterized by how the predictors are represented. This section considers some basic methods from which other methods are built. Initially assume a single target feature and the learner returns a predictor on this target.

7.3.1 Learning Decision Trees

A decision tree is a simple representation for classifying examples. Decision tree learning is one of the simplest useful techniques for supervised classification learning.

A decision tree is a tree in which

  • each internal (non-leaf) node is labeled with a condition, a Boolean function of examples

  • each internal node has two branches, one labeled true and the other false

  • each leaf of the tree is labeled with a point estimate.

Decision trees are also called classification trees when the target (leaf) is a classification, and regression trees when the target is real-valued.

To classify an example, filter it down the tree, as follows. Each condition encountered in the tree is evaluated and the arc corresponding to the result is followed. When a leaf is reached, the classification corresponding to that leaf is returned. A decision tree corresponds to a nested if–then–else structure in a programming language.

Refer to caption
Figure 7.8: Two decision trees
Example 7.8.

Figure 7.8 shows two possible decision trees for the examples of Figure 7.1. Each decision tree can be used to classify examples according to the user’s action. To classify a new example using the tree on the left, first determine the length. If it is long, predict skips. Otherwise, check the thread. If the thread is new, predict reads. Otherwise, check the author and predict reads only if the author is known. This decision tree can correctly classify all examples in Figure 7.1.

The left tree corresponds to the program defining UserAction^(e):

define UserAction^(e):
if long(e): return skips
else if new(e): return reads
else if unknown(e): return skips
else: return reads

The tree on the right returns a numerical prediction for reads:

define UserAction^(e):
if long(e): return 0
else new(e): return 0.82

To use decision trees as a target representation, there are a number of questions that arise.

  • Given some training examples, what decision tree should be generated? Because a decision tree can represent any function of discrete input features, the bias that is necessary is incorporated into the preference of one decision tree over another. One proposal is to prefer the smallest tree that is consistent with the data, which could mean the tree with the least depth or the tree with the fewest nodes. Which decision trees are the best predictors of unseen data is an empirical question.

  • How should an agent go about building a decision tree? One way is to search the space of decision trees for the smallest decision tree that fits the data. Unfortunately, the space of decision trees is enormous (see Exercise 7.7). A practical solution is to carry out a greedy search on the space of decision trees, with the goal of minimizing the error. This is the idea behind the algorithm described below.

Searching for a Good Decision Tree

A decision tree can be seen as a branching program, that takes an example and returns a prediction for that example. The program is either

  • a function that ignores its argument and returns a point prediction for all of the examples that reach this point (which corresponds to a leaf), or

  • of the form “if c(e) then t1(e) else t0(e)” where c is a Boolean condition, and t1 and t0 are decision trees; t1 is the tree that is used when the condition c is true of example e, and t0 is the tree used when c(e) is false.

1: procedure Decision_tree_learner(Cs,Y,Es,γ)
2:   Inputs
3:    Cs: set of possible conditions
4:    Y: target feature
5:    Es: set of training examples
6:    γ: minimum improvement needed to expand a node (γ0)   
7:   Output
8:    function to predict a value of Y for an example
9:   c:=select_split(Es,Cs,γ)
10:   if c=None then stopping criterion is true
11:    v:=leaf_value(Es)
12:    define T(e)=v
13:    return T
14:   else
15:    true_examples:={eEs:c(e)}
16:    t1:=Decision_tree_learner(Cs{c},Y,true_examples,γ)
17:    false_examples:={eEs:¬c(e)}
18:    t0:=Decision_tree_learner(Cs{c},Y,false_examples,γ)
19:    define T(e)= if c(e) then t1(e) else t0(e)
20:    return T   
21:   procedure select_split(Es,Cs,γ)
22:    best_val:=sum_loss(Es)γ
23:    best_split:=None
24:    for cCs do
25:      val:=sum_loss({eEsc(e)})+sum_loss({eEs¬c(e)})
26:      if val<best_val then
27:       best_val:=val
28:       best_split:=c         
29:    return best_split best_split=None means stopping criterion is true   
Figure 7.9: Decision tree learner; returns a predicting function

The algorithm Decision_tree_learner of Figure 7.9 mirrors the recursive decomposition of a tree. It builds a decision tree from the top down as follows. The input to the algorithm is a set of input conditions (Boolean functions of examples that use only input features), the target feature, a set of training examples, and a real-valued parameter, γ, discussed below. If the input features are Boolean, they can be used directly as the conditions.

A greedy optimal split is a condition that results in the lowest error if the learner were allowed only one split and it splits on that condition. sum_loss(Es) gives the sum of losses of training examples Es for the loss function assumed, given that the optimal prediction is used for that loss function, as given in Figure 7.5. The procedure select_split returns a greedy optimal split if the sum of losses after the split improves the sum of losses before the split by at least the threshold γ. If there is no such condition, select_split returns None.

Adding a split increases the size of the tree by 1. The threshold γ can be seen as a penalty for increasing the size of the tree by 1. If positive, γ is also useful to prevent val<best_val holding solely due to rounding error.

If select_split returns None, the decision tree learning algorithm creates a leaf. The function leaf_value(Es) returns the value that is used as a prediction of the target for examples Es that reach this node. It ignores the input features for these examples and returns a point estimate, which is the case considered in Section 7.2.2. The decision tree algorithm returns a function that takes an example and returns that point estimate.

If select_split returns condition c, the learner splits on condition c by partitioning the training examples into those examples e with c(e) true and those examples with c(e) false. It recursively builds a subtree for each of these bags of examples. It returns a function that, given an example, tests whether c is true of the example, and then uses the prediction from the appropriate subtree.

Example 7.9.

Consider applying Decision_tree_learner to the classification data of Figure 7.1, with γ=0. The initial call is

decisionTreeLearner({known,new,long,home},User_action,
       {e1,e2,,e18},0)

where known is true when Author=known, and similarly for the other conditions.

Suppose the stopping criterion is not true and the algorithm selects the condition long to split on. It then calls

decisionTreeLearner({known,new,home},User_action,
       {e1,e3,e4,e6,e9,e10,e12},0)

where {e1,e3,e4,e6,e9,e10,e12} is the set of training examples with Length=long.

All of these examples agree on the user action; therefore, the algorithm returns the prediction skips. The second step of the recursive call is

decisionTreeLearner({known,new,home},User_action,
       {e2,e5,e7,e8,e11,e13,e14,e15,e16,e17,e18},0).

Not all of the examples agree on the user action, so assuming the stopping criterion is false, the algorithm selects a condition to split on. Suppose it selects new. Eventually, this recursive call returns the function on example e in the case when Length is short:

if new(e) then reads
else if unknown(e) then skips else reads.

The final result is the first predictor of Example 7.8.

When the loss is log loss with base 2, the mean of the losses, sum_losses(Es)/|Es|, is the entropy of the empirical distribution of |Es|. The number of bits to describe Es after testing the condition c is val, defined on line 25 of Figure 7.9. The entropy of the distribution created by the split is val/|Es|. The difference of these two is the information gain of the split. Sometimes information gain is used even when the optimality criterion is some other error measure, for example, when maximizing accuracy it is possible to select a split to optimize log loss, but return the mode as the leaf value. See Exercise 7.6.

The following example shows details of the split choice for the case where the split is chosen using log loss, and the empirical distribution is used as the leaf value.

Example 7.10.

In the running example of learning the user action from the data of Figure 7.1, suppose the aim is to minimize the log loss. The algorithm greedily chooses a split that minimizes the log loss. Suppose γ is 0.

Without any splits, the optimal prediction on the training set is the empirical frequency. There are nine examples with User_action=reads and nine examples with User_action=skips, and so known is predicted with probability 0.5. The mean log loss is equal to (18log20.5)/18=1.

Consider splitting on Author. This partitions the examples into [e1, e4, e5, e6, e9, e10, e12, e13, e14, e15, e16, e17] with Author=known and [e2, e3, e7, e8, e11, e18] with Author=unknown, each of which is evenly split between the different user actions. The optimal prediction for each partition is again 0.5, and so the log loss after the split is again 1. In this case, finding out whether the author is known, by itself, provides no information about what the user action will be.

Splitting on Thread partitions the examples into [e1, e2, e5, e8, e10, e12, e14, e15, e17, e18] with Thread=new and [e3, e4, e6, e7, e9, e11, e13, e16] with Thread=followup. The examples with Thread=new contains three examples with User_action=skips and seven examples with User_action=reads, thus the optimal prediction for these is to predict reads with probability 7/10. The examples with Thread=followup have two reads and six skips. Thus, the best prediction for these is to predict reads with probability 2/8. The mean log loss after the split is

(3log2(3/10)+7log2(7/10)+2log2(2/8)+6log2(6/8))/18
15.3/180.85.

Splitting on Length divides the examples into [e1, e3, e4, e6, e9, e10, e12] and [e2, e5, e7, e8, e11, e13, e14, e15, e16, e17, e18]. The former all agree on the value of User_action and predict with probability 1. The user action divides the second set 9:2, and so the mean log loss is

(7log21+9log29/11+2log22/11)/187.5/180.417.

Therefore, splitting on Length is better than splitting on Thread or Author, when greedily optimizing the log loss.

Constructing Conditions

In the decision tree learning algorithm (Figure 7.9), Boolean input features can be used directly as the conditions. Non-Boolean input features are handled in a number of ways.

  • Suppose input variable X is categorical, with domain {v1,,vk}. A binary indicator variable, Xi, can be associated with each value vi, where Xi(e)=1 if X(e)=vi and Xi(e)=0 otherwise. For each example e, exactly one of X1(e),,Xk(e) is 1 and the others are 0.

  • When the domain of a feature is totally ordered, the feature is called an ordinal feature. This includes real-valued features as a special case, but might also include other features such as clothing sizes (S, M, L, XL, etc.), and highest level of education (none, primary, secondary, bachelor, etc.).

    For an ordinal input feature X and for a given value v, a Boolean feature can be constructed as a cut: a new feature that has value 1 when X>v and 0 otherwise. Combining cuts allows for features that are true for intervals; for example, a branch might include the conditions X>9 is true and X>17 is false, which corresponds to the interval 9<X17.

    Suppose the domain of input variable X is totally ordered. To select the optimal value for the cut value v, sort the examples on the value of X and sweep through the examples to consider each split value and select the best. See Exercise 7.8.

  • For ordinal features (including real-valued features), binning involves choosing a set of thresholds, creating a feature for each interval between the thresholds. The thresholds α1<α2<<αk, make k+1 Boolean features, one that is true for X when Xα1, one for αk<X, and one for αi<Xαi+1 for each ii<k. A bin of the form αi<Xαi+1 would require two splits to represent using cuts. The αi can be chosen upfront, for example, using percentiles of the training data, or chosen depending on the target.

  • For categorical feature X, there might be a better split of the form XS where S is a set of values, rather than only splitting on a single value, as is done with indicator variables. When the target Y is Boolean, to find an appropriate set S, sort the values of X by the proportion of Y that are true; a greedy optimal split will be between values in this sorted list.

  • It is possible to expand the algorithm to allow multiway splits. To split on a multivalued variable, there would be a child for each value in the domain of the variable. This means that the representation of the decision tree becomes more complicated than the simple if–then–else form used for binary features. There are two main problems with this approach. The first is what to do with values of a feature for which there are no training examples. The second is that for most greedy splitting heuristics, including information gain, it is generally better to split on a variable with a larger domain because it produces more children and so can fit the data better than splitting on a feature with a smaller domain. However, splitting on a feature with a smaller domain keeps the representation more compact. A four-way split, for example, is equivalent to three binary splits; they both result in four leaves.

Alternative Design Choices

The algorithm does not split when select_split returns None. This occurs when there are no examples, when there are no conditions remaining, when all examples have the same value on each condition, when all of the examples have the same target value, and when the improvement of the evaluation is less than the parameter γ. A number of other criteria have been suggested for stopping earlier.

  • Minimum child size: do not split more if one of the children will have fewer examples than a threshold.

  • Maximum depth: do not split more if the depth reaches a maximum.

It is possible that one condition may only work well in conjunction with other conditions, and the greedy method may not work when this occurs. One particularly tricky case is a parity function of k Boolean variables that is true if an odd (or even) number of variables are true; knowing the values of fewer than k of the variables gives no information about the value of the parity function. The simplest parity functions (for k=2) are exclusive-or and equivalence. Parity functions have complicated decision trees.

In some cases, greedy splitting does not find a simplest decision tree and it is often useful to simplify the tree resulting from the top-down algorithm, as shown in the following example.

Example 7.11.

Consider a dataset with inputs x, y, and z and target t. The target is true if x is true and y is true, or x is false and z is true. Figure 7.10 (a) shows a tree representation of this function. This tree can generate the data in the center (b).

Refer to caption
x y z t
0 0 0 0
0 0 1 1
0 1 0 0
0 1 1 1
1 0 0 0
1 0 1 0
1 1 0 1
1 1 1 1
Refer to caption
(a) (b) (c)
Figure 7.10: Generator, training data, and learned tree for “if x then t=y else t=z”. In the decision trees, the left branches correspond to “true” and the right branches to “false”. The prediction for t is at the leaves

Although the simplest tree first splits on x, splitting on x provides no information; there is the same proportion of t true when x is true as when x is false. Instead, the algorithm can split on y. When y is true, there is a larger proportion of t true than when y is false. For the case where y is true, splitting on x perfectly predicts the target when x is true. The resulting tree is given in Figure 7.10(c). Following the paths to t= 1, this tree corresponds to t being true when (xy)(y¬xz)(¬y¬xz), which can be simplified to (xy)(¬xz). This is essentially the original tree.

7.3.2 Linear Regression and Classification

Linear functions provide a basis for many learning algorithms. This section first covers regression, then considers classification.

Linear regression is the problem of fitting a linear function to a set of training examples, in which the input and target features are real numbers.

Suppose the input features, X1,,Xm, are all real numbers (which includes the {0,1} case) and there is a single target feature Y. A linear function of the input features is a function of the form

Y^w¯(e) =w0+w1X1(e)++wmXm(e)
=i=0mwiXi(e)

where w¯=w0,w1,,wn is a vector (tuple) of weights, and X0 is a special feature whose value is always 1.

Suppose E is a set of examples. The mean squared loss on examples E for target Y is the error

error(E,w¯) =1|E|eE(Y^w¯(e)Y(e))2
=1|E|eE(i=0mwiXi(e)Y(e))2. (7.1)

Consider minimizing the mean squared loss. There is a unique minimum, which occurs when the partial derivatives with respect to the weights are all zero. The partial derivative of the error in Equation 7.1 with respect to weight wi is

wierror(E,w¯)=1|E|eE2δ(e)Xi(e) (7.2)

where δ(e)=Y^w¯(e)Y(e), a linear function of the weights. The weights that minimize the error can be computed analytically by setting the partial derivatives to zero and solving the resulting linear equations in the weights (see Exercise 7.11).

Squashed Linear Functions

Consider binary classification, where the domain of the target variable is {0,1}.

A linear function does not work well for such classification tasks; a learner should never make a prediction greater than 1 or less than 0. However, a linear function could make a prediction of, say, 3 for one example just to fit other examples better.

A squashed linear function is of the form

Y^w¯(e) =ϕ(w0+w1X1(e)++wmXm(e))
=ϕ(iwiXi(e))

where ϕ, an activation function, is a function from the real line [,] into some subset of the real line, such as [0,1].

A prediction based on a squashed linear function is a linear classifier.

Refer to caption
Figure 7.11: The sigmoid or logistic function

One differentiable activation function is the sigmoid or logistic function:

sigmoid(x)=11+exp(x)

where exp(v)=ev, where e is Euler’s number (approximately 2.718). The sigmoid function, depicted in Figure 7.11, squashes the real line into the interval (0,1), which is appropriate for classification because you would never want to make a prediction of greater than 1 or less than 0. The sigmoid function can be justified in terms of probabilities. It is also differentiable, with derivative

ddxsigmoid(x)=sigmoid(x)(1sigmoid(x)).

The problem of determining weights for the sigmoid of a linear function that minimize an error on a set of examples is called logistic regression.

The mean log loss for logistic regression is

LL(E,w¯)=1|E|eE(Y(e)logY^(e)+(1Y(e))log(1Y^(e)))

where Y^(e)=sigmoid(i=0mwiXi(e)). To minimize this, consider weight wi. The partial derivative with respect to weight wi is

wiLL(E,w¯)=1|E|eEδ(e)Xi(e) (7.3)

where δ(e)=Y^w¯(e)Y(e). This is very similar to Equation 7.2, the main difference is the definition of the predicted value. Unlike Equation 7.2, this is not a linear function of the parameters (because Y^w¯(e) is not linear in the parameters) and is difficult to solve analytically.

Stochastic Gradient Descent

The problem of finding a set of parameters to minimize errors is an optimization problem; see Section 4.8.

Gradient descent is an iterative method to find a local minimum of a function. To find a set of weights to minimize an error, it starts with an initial set of weights. In each step, it decreases each weight in proportion to its partial derivative:

wi:=wiηwierror(E,w¯)

where η, the gradient descent step size, is called the learning rate. The learning rate, as well as the features and the data, is given as input to the learning algorithm. The partial derivative specifies how much a small change in the weight would change the error.

For linear regression with squared error and logistic regression with log loss, the derivatives, given in Equation 7.2 and Equation 7.3. For each of these (ignoring the constant factor of 2), gradient descent has the update

wi:=wiη1|E|eEδ(e)Xi(e) (7.4)

where δ(e)=Y^w¯(e)Y(e).

A direct implementation of gradient descent does not update any weights until all examples have been considered. This can be very wasteful for large datasets. It is possible to make progress with a subset of the data. This gradient descent step takes a mean value. Often you can compute means approximately by using a random sample of examples. For example, you can get a good estimate of the mean height of a large population of people by selecting 100 or 1000 people at random and using their mean height.

Instead of using all of the data for an update, stochastic gradient descent uses a random sample of examples to update the weights. It is called stochastic because of the random sampling. Random sampling is explored more in Section 9.7. The set of b examples used in each update is called a minibatch or a batch.

The stochastic gradient descent algorithm for logistic regression is shown in Figure 7.12. This returns a function, pred, that can be used for predictions on new examples. The algorithm collects the update for each weight wi for a batch in a corresponding di, and updates the weights after each batch. The learning rate η is assumed to be per example, and so the update needs to be divided by the batch size.

An epoch is |Es|/b batches, which corresponds to one pass through all of the data, on average. Epochs are useful when reporting results, particularly with different batch sizes.

1: procedure Linear_learner(Xs,Y,Es,η,b)
2:   Inputs
3:    Xs: set of input features, Xs={X1,,Xm}. Assume X0=1
4:    Y: target feature
5:    Es: set of training examples
6:    η: learning rate
7:    b: batch size   
8:   Output
9:    function to make prediction on examples
10:   Local
11:    w0,,wm: real numbers
12:    d0,,dm: real numbers   
13:   initialize w0,,wm randomly
14:   define pred(e)=ϕ(iwiXi(e))
15:   repeat
16:    for each i[0,n] do
17:      di:=0    
18:    select batch BEs of size b
19:    for each example e in B do
20:      error:=pred(e)Y(e)
21:      for each i[0,n] do
22:       di:=di+errorXi(e)         
23:    for each i[0,n] do
24:      wi:=wiηdi/b    
25:   until termination
26:   return pred
Figure 7.12: Stochastic gradient descent for linear and logistic regression. For linear regression, ϕ is the identity function. For logistic regression, ϕ is sigmoid
Example 7.12.

Consider learning a squashed linear function for classifying the data of Figure 7.1. One function that correctly classifies the examples is

Reads^(e)=sigmoid(8+7Short(e)+3New(e)+3Known(e)),

where f is the sigmoid function. A function similar to this can be found with about 3000 iterations of stochastic gradient descent with a learning rate η=0.05. According to this function, Reads^(e) is true (the predicted value for example e is closer to 1 than 0) if and only if Short(e) is true and either New(e) or Known(e) is true. Thus, in this case, the linear classifier learns the same function as the decision tree learner.

Smaller batch sizes tend to learn faster as fewer examples are required for an update. However, smaller batches may not converge to a local optimum solution, whereas more data, up to all of the data, will. To see this, consider being at an optimum. A batch containing all of the examples would end up with all of the di being zero. However, for smaller batches, the weights will vary and later batches will be using non-optimal parameter settings and so use incorrect derivatives. It is common to start with small batch size and increase the batch size until convergence, or good enough performance has been obtained.

Incremental gradient descent, or online gradient descent, is a special case of stochastic gradient descent using minibatches of size 1. In this case, there is no need to store the intermediate values in di, but the weights can be directly updated. This is sometimes used for streaming data where each example is used once and then discarded. If the examples are not selected at random, it can suffer from catastrophic forgetting, where it fits the later data and forgets about earlier examples.

Linear Separability

Each input feature can be seen as a dimension; m features results in an m-dimensional space. A hyperplane in an m-dimensional space is a set of points that all satisfy a constraint that some linear function of the variables is zero. The hyperplane forms an (m1)-dimensional space. For example, in a (two-dimensional) plane, a hyperplane is a line, and in a three-dimensional space, a hyperplane is a plane. A Boolean classification is linearly separable if there exists a hyperplane where the classification is true on one side of the hyperplane and false on the other side.

The Logistic_regression_learner algorithm can learn any linearly separable binary classification. The error can be made arbitrarily small for arbitrary sets of examples if, and only if, the target classification is linearly separable. The hyperplane is the set of points where iwiXi=0 for the learned weights w¯. On one side of this hyperplane, the prediction is greater than 0.5; on the other side, the prediction is less than 0.5.

Refer to caption
Figure 7.13: Linear separators for Boolean functions
Example 7.13.

Figure 7.13 shows linear separators for “or” (a) and “and” (b). The dashed line separates the positive (true) cases from the negative (false) cases. One simple function that is not linearly separable is the exclusive-or (xor) function (c). There is no straight line that separates the positive examples from the negative examples. As a result, a linear classifier cannot represent, and therefore cannot learn, the exclusive-or function.

Suppose there are three input features x, y, and z, each with domain {0,1}, and the ground truth is the function “if x then y else z” (represented by t in Figure 7.10). This function is depicted by the cube in Figure 7.13(d) with the origin (x, y, z all zero) at the bottom left and the ground truth for t labelled with + and . This function is not linearly separable.

The following example shows what happens in gradient descent for logistic regression when the data is not linearly separable.

x y z t lin t^
0 0 0 0 3.98 0.02
0 0 1 1 0.08 0.52
0 1 0 0 0.08 0.52
0 1 1 1 4.14 0.98
1 0 0 0 4.10 0.02
1 0 1 0 0.04 0.49
1 1 0 1 0.04 0.49
1 1 1 1 4.14 0.98
(a) (b)
Figure 7.14: Logistic regression for conditional; target t is “if x then t=y else t=z” (a) training data (b) prediction
Example 7.14.

Consider target t from the previous example that is true if x is true and y is true, or x is false and z is true. The prediction of t is not linearly separable, as shown in Figure 7.13(d) – there is no hyperplane that separates the positive and negative cases of t.

After 1000 epochs of gradient descent with a learning rate of 0.05, one run found the following weights (to two decimal points):

lin(e)= 0.12x(e)+4.06y(e)+4.06z(e)3.98
t^(e)= sigmoid(lin(e)).

The linear function lin and the prediction for each example are shown in Figure 7.14(b). Four examples are predicted reasonably well, and the other four are predicted with a value of approximately 0.5. This function is quite stable with different initializations. Increasing the number of iterations makes the predictions approach 0, 1, or 0.5.

Categorical Target Features

When the domain of the target variable is categorical with more than two values, indicator variables can be used to convert the classification to binary variables. These binary variables could be learned separately. Because exactly one of the values must be true for each example, the predicted probabilities should add to 1. One way to handle this is to learn for all-but-one value, and predict the remaining value as 1 minus the sum of the other values. This is effectively how the binary case works. However, this introduces an asymmetry in the values by treating one value differently from the other values. This is problematic because the errors for the other values accumulate, making for a poor prediction on the value treated specially; it’s even possible that the prediction for the remaining value is negative if the others sum to more than 1.

The standard alternative is to learn a linear function for each value of the target variable, exponentiate, and normalize. This has more parameters than necessary to represent the function (it is said to be over-parametrized) but treats all of the values in the same way. Suppose the target Y is categorical with domain represented by the tuple of values (v1,,vk). The softmax function takes a vector (tuple) of real numbers, (α1,,αk), and returns a vector of the same size, where the ith component of the result is

softmax((α1,,αk))i=exp(αi)j=1kexp(αj).

This ensures that the resulting values are all positive and sum to 1, and so can be considered as a probability distribution.

Sigmoid and softmax are closely related:

sigmoid(x) =1exp(x)+1
=exp(x)exp(0)+exp(x)
=softmax((0,x))2

where (0,x) corresponds to the values (false,true) and softmax((0,x))2 is the second component of the pair that results from the softmax. The second equality follows from multiplying the numerator and the denominator by exp(x), and noticing that exp(x)exp(x)=exp(0)=1. Thus, sigmoid is equivalent to softmax where the false component is fixed to be 0.

A softmax, like a sigmoid, cannot represent zero probabilities.

The generalization of logistic regression to predicting a categorical feature is called softmax regression, multinomial logistic regression, or multinomial logit. It involves a linear equation for each value in the domain of the target variable, Y. Suppose Y has domain (vi,,vk). The prediction for example e is a tuple of k values, softmax((u1(e),,uk(e))), where the jth component is the prediction for Y=vj and

uj(e)=w0,j+X1(e)w1,j++Xm(e)wm,j.

This is typically optimized with categorical log loss.

Consider weight wij that is used for input Xi for output value vj, and example e that has Y(e)=vq:

wij logloss(softmax((u1(e),,uk(e))),vq)
=wijlog(exp(uq(e))jexp(uj(e)))
=wij(log(jexp(uj(e)))uq(e))
=((Y^(e))j𝟏(j=q))Xi

where 𝟏(j=q) is 1 if j is the index of the observed value, vq, and (Y^(e))j is the jth component of the prediction. This is the predicted value minus the actual value.

To implement this effectively, you need to consider how computers represent numbers. Taking the exponential of a large number can result in a number larger than the largest number able to be represented on the computer, resulting in overflow. For example, exp(800) will overflow for most modern CPUs. Taking exponentials of a large negative number can result in a number that is represented as zero, resulting in underflow. For example, exp(800) results in zero on many CPUs. Adding a constant to each αi in a softmax does not change the value of the softmax. To prevent overflow and prevent all values from underflowing, the maximum value can be subtracted from each value, so there is always a zero, and the rest are negative. On GPUs and similar parallel hardware, often lower precision is used to represent weights, and so it becomes more important to correct for underflow and overflow.

When there is a large number of possible values, the computation of the denominator can be expensive, as it requires summing over all values. For example, in natural language, we may want to predict the next word in a text, in which case the number of values could be up to a million or more (particularly when phrases and names such as “Mahatma Gandhi” are included). In this case, it is possible to represent the prediction in terms of a binary tree of the values, forming hierarchical softmax. This implements the same function as softmax, just more efficiently for large domains.

Creating Input Features

The definitions of linear and logistic regression assume that the input features are numerical.

Categorical features can be converted into features with domain {0,1} by using indicator variables, as was done for decision tree learning. This is known as a one-hot encoding.

A real-valued feature can be used directly as long as the target is a linear function of that input feature, when the other input features are fixed. If the target is not a linear function, often some transformation of the feature is used to create new features.

For ordinal features, including real-valued features, cuts can be used to define a Boolean feature from a real feature. For input feature x, choose a value v and use a feature that is true if x>v, or equivalently xv>0. It is also common to use binning. Binning involves choosing a set of thresholds, α1<α2<<αk, and using a feature with domain {0,1} for each interval between αi and αi+1. Binning allows for a piecewise constant function. Constructing a feature using max(xv,0) allows for a connected piecewise linear approximation; this is the basis of the rectified linear unit (ReLU), further investigated in the next chapter.

Designing appropriate features is called feature engineering. It is often difficult to design good features. Gradient-boosted trees use conjunctions of input features. Learning features is part of representation learning; see Chapter 8.