10.3 Unsupervised Learning

This chapter has so far considered supervised learning, where target features are observed in the training data. In unsupervised learning, the target features are not given in the training examples.

One general method for unsupervised learning is clustering, which partitions the examples into clusters. Each cluster predicts feature values for the examples in the cluster. The best clustering is the one that minimizes the prediction error, such as squared error or log loss.

The term class is often used for a grouping that is semantically meaningful; while you might want the clusters to be semantically meaningful, they are not always.

Example 10.8.

A diagnostic assistant may want to group treatments into groups that predict the desirable and undesirable effects of the treatment. The assistant may not want to give a patient a drug because similar drugs may have had disastrous effects on similar patients.

A tutoring agent may want to cluster students’ learning behavior so that strategies that work for one member of a cluster may work for other members.

In hard clustering, each example is placed definitively in a cluster. The cluster is then used to predict the feature values of the example. The alternative to hard clustering is soft clustering, in which each example has a probability distribution over clusters. The prediction of the values for the features of an example is the weighted average of the predictions of the clusters the example is in, weighted by the probability of the example being in the cluster. Soft clustering is described in Section 10.3.2.
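Writing $\widehat{X}_j(e)$ for the prediction of feature $X_j$ for example $e$, and $\widehat{X}_j(c)$ for the prediction of cluster $c$ (notation introduced just for this formula), the weighted average is

$\widehat{X}_j(e) = \sum_{c} P(e\text{ is in cluster } c)\; \widehat{X}_j(c).$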

10.3.1 k-Means

The k-means algorithm is used for hard clustering. The training examples, E, and the number of clusters, k, are given as input. It requires the value of each feature to be a (real) number, so that differences in values make sense.

The algorithm constructs k clusters, a prediction of a value for each feature for each cluster, and an assignment of examples to clusters.

Suppose the input features, $X_1, \ldots, X_n$, are observed for each example. Let $X_j(e)$ be the value of input feature $X_j$ for example $e$. Associate a cluster with each integer $c \in \{1, \ldots, k\}$.

The k-means algorithm constructs

  • a function $\mathit{cluster}: E \rightarrow \{1, \ldots, k\}$ that maps each example to a cluster (if $\mathit{cluster}(e) = c$, example $e$ is said to be in cluster $c$)

  • a function $\mathit{prediction}(j, c)$ that returns the predicted value of each element of cluster $c$ on feature $X_j$.

Example $e$ is thus predicted to have value $\mathit{prediction}(j, \mathit{cluster}(e))$ for feature $X_j$.

The aim is to find the functions cluster and prediction that minimize the sum-of-squares loss:

$\sum_{e \in E} \sum_{j=1}^{n} \bigl(\mathit{prediction}(j, \mathit{cluster}(e)) - X_j(e)\bigr)^2.$
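As a concrete rendering of this objective, the following Python sketch evaluates the sum-of-squares loss of a given clustering; the function name sum_squares_loss and its argument conventions are illustrative choices, not taken from the text.

def sum_squares_loss(examples, cluster, prediction):
    """Sum-of-squares loss of a clustering.

    examples:   a list of lists, where examples[e][j] is the value of
                feature j for example e
    cluster:    a function from an example index to a cluster
    prediction: a function from (feature index, cluster) to a value
    """
    return sum((prediction(j, cluster(e)) - examples[e][j]) ** 2
               for e in range(len(examples))
               for j in range(len(examples[e])))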

To minimize the squared loss, the prediction of a cluster for each feature should be the mean of the values of that feature for the examples in the cluster; see Figure 7.5. Finding an optimal clustering is NP-hard. When there are only a few examples, it is possible to enumerate the assignments of examples to clusters. For more than a few examples, there are too many partitions of the examples into $k$ clusters for exhaustive search to be feasible.

The k-means algorithm iteratively improves the squared loss. Initially, it randomly assigns the examples to clusters. Then it carries out the following two steps:

  • For each cluster $c$ and feature $X_j$, make $\mathit{prediction}(j, c)$ be the mean value of $X_j(e)$ over the examples $e$ in cluster $c$:

    $\dfrac{\sum_{e:\, \mathit{cluster}(e) = c} X_j(e)}{\bigl|\{e : \mathit{cluster}(e) = c\}\bigr|}$

    where the denominator is the number of examples in cluster c.

  • Reassign each example to a cluster: assign each example $e$ to a cluster $c$ that minimizes

    $\sum_{j=1}^{n} \bigl(\mathit{prediction}(j, c) - X_j(e)\bigr)^2.$

These two steps are repeated until the second step does not change the assignment of any example.

1: procedure k-means(Xs, Es, k)
2:   Inputs
3:     Xs: set of features, $Xs = \{X_1, \ldots, X_n\}$
4:     Es: set of training examples
5:     k: number of clusters
6:   Output
7:     cluster: function from examples to clusters
8:     prediction: function from feature and cluster to a value for that feature
9:   Local
10:     integer cc[c], cc_new[c]              # old and new cluster count for cluster c
11:     real fs[j, c], fs_new[j, c]              # old and new sum of feature $X_j$ for cluster c
12:     Boolean stable
13:   Initialize fs and cc randomly based on data
14:   define $\mathit{prediction}(j, c) = fs[j, c] / cc[c]$              # estimate of $\widehat{X}_j(c)$
15:   define $\mathit{cluster}(e) = \arg\min_{c} \sum_{j=1}^{n} \bigl(\mathit{prediction}(j, c) - X_j(e)\bigr)^2$
16:   repeat
17:     fs_new and cc_new initialized to be all zero
18:     for each example $e \in Es$ do
19:       $c := \mathit{cluster}(e)$
20:       cc_new[c] += 1
21:       for each feature $X_j \in Xs$ do
22:         fs_new[j, c] += $X_j(e)$
23:     stable := (fs_new = fs) and (cc_new = cc)
24:     fs := fs_new
25:     cc := cc_new
26:   until stable
27:   return cluster, prediction
Figure 10.4: k-Means for unsupervised learning

An algorithm that implements k-means is shown in Figure 10.4. It constructs sufficient statistics to compute the mean of each cluster for each feature, namely

  • cc[c] is the number of examples in cluster c

  • fs[j, c] is the sum of the values of $X_j(e)$ for the examples in cluster $c$.

These are sufficient statistics because they contain all of the information from the data necessary to compute cluster(e) and prediction(j,c). The current values of fs and cc are used to determine the next values (in fs_new and cc_new).

The random initialization could be assigning each example to a cluster at random, selecting k points at random to be representative of the clusters, or assigning some, but not all, of the examples to construct the initial sufficient statistics. The latter two methods may be more useful if the dataset is large, as they avoid a pass through the whole dataset for initialization.
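A minimal Python rendering of Figure 10.4, using the first of these initialization options (assigning each example to a cluster at random), might look like the following sketch; the function name k_means and its calling conventions are choices made here, not taken from the text.

import random

def k_means(examples, k, rng=None):
    """Hard clustering with k-means, following the sufficient-statistics
    formulation of Figure 10.4.  examples is a list of lists of numbers,
    with examples[e][j] being the value of feature j for example e.
    Returns (cluster, prediction): cluster(e) is the cluster of example
    index e, and prediction(j, c) is the mean of feature j in cluster c."""
    rng = rng or random.Random()
    n = len(examples[0])                     # number of features

    # Initialize the sufficient statistics from a random assignment.
    cc = [0] * k                             # cc[c]: number of examples in c
    fs = [[0.0] * k for _ in range(n)]       # fs[j][c]: sum of feature j in c
    for ex in examples:
        c = rng.randrange(k)
        cc[c] += 1
        for j, v in enumerate(ex):
            fs[j][c] += v

    def prediction(j, c):
        # An empty cluster predicts 0; a pragmatic choice for this sketch.
        return fs[j][c] / cc[c] if cc[c] > 0 else 0.0

    def cluster(e):
        return min(range(k),
                   key=lambda c: sum((prediction(j, c) - examples[e][j]) ** 2
                                     for j in range(n)))

    stable = False
    while not stable:
        cc_new = [0] * k
        fs_new = [[0.0] * k for _ in range(n)]
        for e in range(len(examples)):
            c = cluster(e)                   # uses the current fs and cc
            cc_new[c] += 1
            for j in range(n):
                fs_new[j][c] += examples[e][j]
        stable = (cc_new == cc and fs_new == fs)
        cc, fs = cc_new, fs_new              # the closures now see the new stats

    return cluster, prediction

The closures cluster and prediction read the current cc and fs, so each reassignment pass uses the means computed from the previous pass, as in the pseudocode.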

An assignment of examples to clusters is stable if an iteration of k-means does not change the assignment. Stability requires that the argmin in the definition of cluster gives a consistent value for each example in cases where more than one cluster is minimal. The algorithm has reached a stable assignment when each example is assigned to the same cluster in one iteration as in the previous iteration. When this happens, fs and cc do not change, and so the Boolean variable stable becomes true.

This algorithm will eventually converge to a stable local minimum. This is easy to see because the sum-of-squares error never increases and there are only a finite number of possible assignments. The algorithm often converges in a few iterations.

Example 10.9.

Suppose an agent has observed the following $\langle X, Y\rangle$ pairs:

$\langle 0.7, 5.1\rangle$, $\langle 1.5, 6.1\rangle$, $\langle 2.1, 4.5\rangle$, $\langle 2.4, 5.5\rangle$, $\langle 3.1, 4.4\rangle$, $\langle 3.5, 5.1\rangle$, $\langle 4.5, 1.5\rangle$, $\langle 5.2, 0.7\rangle$, $\langle 5.3, 1.8\rangle$, $\langle 6.2, 1.7\rangle$, $\langle 6.7, 2.5\rangle$, $\langle 8.5, 9.2\rangle$, $\langle 9.1, 9.7\rangle$, $\langle 9.5, 8.5\rangle$.

These data points are plotted in Figure 10.5(a). The agent wants to cluster the data points into two clusters (k=2).

[Figure 10.5 shows four plots of the data: (a) the examples, (b) random assignment, (c) first reassignment, (d) stable assignment found.]
Figure 10.5: A trace of the k-means algorithm for $k = 2$ for Example 10.9

In Figure 10.5(b), the points are randomly assigned into the clusters; one cluster is depicted as + and the other as ×. The mean of the points marked with + is $\langle 4.6, 3.65\rangle$, shown with ⊕. The mean of the points marked with × is $\langle 5.2, 6.15\rangle$, shown with ⊗.

In Figure 10.5(c), the points are reassigned according to the closer of the two means. After this reassignment, the mean of the points marked with + is $\langle 3.96, 3.27\rangle$. The mean of the points marked with × is $\langle 7.15, 8.34\rangle$.

In Figure 10.5(d), the points are reassigned to the closest mean. This assignment is stable in that no further reassignment will change the assignment of the examples.

A different initial assignment of the points can give a different clustering. One clustering that arises in this dataset is for the lower points (those with a Y-value less than 3) to be in one cluster, and for the other points to be in another cluster.

Running the algorithm with three clusters (k=3) typically separates the data into the top-right cluster, the left-center cluster, and the lower cluster. However, there are other possible stable assignments that could be reached, such as having the top three points in two different clusters, and the other points in another cluster.

Some stable assignments may be better, in terms of sum-of-squares error, than other stable assignments. To find the best assignment, it is often useful to try multiple starting configurations, using random restarts, and to select a stable assignment with the lowest sum-of-squares error. Note that any permutation of the labels of a stable assignment is also a stable assignment, so there are invariably multiple local and global minima.
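Using the hypothetical k_means and sum_squares_loss sketches from earlier, random restarts might be organized as follows; the points of Example 10.9 are used only to show the calling convention.

import random

def best_of_restarts(examples, k, restarts=20):
    """Run the k_means sketch several times from different random
    initializations and keep the clustering with the lowest
    sum-of-squares loss.  This is a heuristic; it is not guaranteed to
    find the global minimum."""
    best, best_loss = None, float("inf")
    for r in range(restarts):
        cluster, prediction = k_means(examples, k, rng=random.Random(r))
        loss = sum_squares_loss(examples, cluster, prediction)
        if loss < best_loss:
            best, best_loss = (cluster, prediction), loss
    return best, best_loss

# For instance, with the points of Example 10.9:
points = [[0.7, 5.1], [1.5, 6.1], [2.1, 4.5], [2.4, 5.5], [3.1, 4.4],
          [3.5, 5.1], [4.5, 1.5], [5.2, 0.7], [5.3, 1.8], [6.2, 1.7],
          [6.7, 2.5], [8.5, 9.2], [9.1, 9.7], [9.5, 8.5]]
(cluster, prediction), loss = best_of_restarts(points, k=2)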

One problem with the k-means algorithm is that it is sensitive to the relative scale of the dimensions. For example, if one feature is height in centimeters, another feature is age, and another is a binary ({0,1}) feature, the different values need to be scaled so that they can be compared. How they are scaled relative to each other affects the clusters found. It is common to scale the dimensions to between 0 and 1 or with a mean of 0 and a variance of 1, but this assumes that all dimensions are relevant and independent of each other.
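As an illustration, rescaling each feature to the range [0, 1] before clustering could be done with a preprocessing step like the following sketch (the function name is illustrative, and this is only one of the scaling choices mentioned above):

def scale_to_unit_interval(examples):
    """Rescale each feature to [0, 1] using (v - min) / (max - min).
    A feature that is constant across the data is mapped to 0."""
    n = len(examples[0])
    lows = [min(ex[j] for ex in examples) for j in range(n)]
    highs = [max(ex[j] for ex in examples) for j in range(n)]
    return [[(ex[j] - lows[j]) / (highs[j] - lows[j])
             if highs[j] > lows[j] else 0.0
             for j in range(n)]
            for ex in examples]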

Finding an appropriate number of clusters is a classic problem in trading off model complexity and fit to data. One solution is to use the Bayesian information criterion (BIC) score, similar to its use in decision trees, where the number of clusters is used instead of the number of leaves. While it is possible to construct $k+1$ clusters from $k$ clusters, the optimal division into three clusters, for example, may be quite different from the optimal division into two clusters.
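One common form of the BIC score, for a model with $p$ free parameters, maximized likelihood $\hat{L}$, and $N$ training examples, is

$\mathrm{BIC} = p \ln N - 2 \ln \hat{L}$

where lower scores are preferred (some presentations negate the score so that higher is better). The number of clusters affects $p$, and applying the score to k-means requires a probabilistic model of the within-cluster error, such as a Gaussian.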

10.3.2 Expectation Maximization for Soft Clustering

A hidden variable or latent variable is a probabilistic variable that is not observed in a dataset. A Bayes classifier can be the basis for unsupervised learning by making the class a hidden variable.

The expectation maximization (EM) algorithm can be used to learn probabilistic models with hidden variables. Combined with a naive Bayes classifier, it does soft clustering, similar to the k-means algorithm, but where examples are probabilistically in clusters.

As in the k-means algorithm, the training examples and the number of clusters, k, are given as input.

[Figure 10.6 shows a dataset with features $X_1, \ldots, X_4$ and a naive Bayes model in which a hidden class variable $C$ is the parent of each $X_i$; the probabilities of the model are $P(C)$ and $P(X_i \mid C)$ for each feature.]
Figure 10.6: EM algorithm: Bayes classifier with hidden class

Given the data, a naive Bayes model is constructed where there is a variable for each feature in the data and a hidden variable for the class. The class variable is the only parent of the other features. This is shown in Figure 10.6. The class variable has domain $\{1, 2, \ldots, k\}$, where $k$ is the number of classes. The probabilities needed for this model are the probability of the class $C$ and the probability of each feature given $C$. The aim of the EM algorithm is to learn probabilities that best fit the data.

The EM algorithm conceptually augments the data with a class feature, C, and a count column. Each original example gets mapped into k augmented examples, one for each class. The counts for these examples are assigned so that they sum to 1.

For four features and three classes, the example $\langle X_1{=}t, X_2{=}f, X_3{=}t, X_4{=}t\rangle$ is mapped into the three tuples shown in the table on the left of Figure 10.7. EM works by iteratively determining the count from the model, and the model from the count.

X1   X2   X3   X4   C   Count
t    f    t    t    1   0.4
t    f    t    t    2   0.1
t    f    t    t    3   0.5

[Figure 10.7 shows this augmented data on the left; the E-step computes the Count column from the model probabilities $P(C), P(X_1 \mid C), \ldots, P(X_4 \mid C)$, and the M-step computes those probabilities from the augmented data.]
Figure 10.7: EM algorithm for unsupervised learning

The EM algorithm repeats the following two steps:

  • E step. Update the augmented counts based on the probability distribution. For each example $\langle X_1{=}v_1, \ldots, X_n{=}v_n\rangle$ in the original data, the count associated with $\langle X_1{=}v_1, \ldots, X_n{=}v_n, C{=}c\rangle$ in the augmented data is updated to

    $P(C{=}c \mid X_1{=}v_1, \ldots, X_n{=}v_n).$

    This step involves probabilistic inference. If multiple examples have the same values for the input features, they can be treated together, with the probabilities multiplied by the number of examples. This is an expectation step because it computes the expected values.

  • M step. Infer the probabilities for the model from the augmented data. Because the augmented data has values associated with all the variables, this is the same problem as learning probabilities from data in a naive Bayes classifier. This is a maximization step because it computes the maximum likelihood estimate or the maximum a posteriori probability (MAP) estimate of the probability.

The EM algorithm starts with random probabilities or random counts. EM will converge to a local maximum of the likelihood of the data.

This algorithm returns a probabilistic model, which is used to classify an existing or new example. An example is classified using

$P(C{=}c \mid X_1{=}v_1, \ldots, X_n{=}v_n) = \dfrac{P(C{=}c)\,\prod_{i=1}^{n} P(X_i{=}v_i \mid C{=}c)}{\sum_{c'} P(C{=}c')\,\prod_{i=1}^{n} P(X_i{=}v_i \mid C{=}c')}.$
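The following Python sketch computes this posterior from explicit probability tables; the function name class_posterior and the table layout (a list prior for $P(C)$ and nested dictionaries cond for $P(X_i \mid C)$) are illustrative assumptions, not taken from the text.

def class_posterior(values, prior, cond):
    """P(C = c | X_1 = v_1, ..., X_n = v_n) for every class c, for a
    naive Bayes model.

    values: the observed feature values [v_1, ..., v_n]
    prior:  prior[c] = P(C = c)
    cond:   cond[i][c][v] = P(X_i = v | C = c)
    """
    k = len(prior)
    scores = []
    for c in range(k):
        score = prior[c]
        for i, v in enumerate(values):
            score *= cond[i][c][v]
        scores.append(score)
    total = sum(scores)
    return [score / total for score in scores]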

The algorithm does not need to store the augmented data, but can maintain a set of sufficient statistics, which is enough information to compute the required probabilities. Assuming categorical features, sufficient statistics for this algorithm are

  • cc, the class count, a k-valued array such that cc[c] is the sum of the counts of the examples in the augmented data with class=c

  • fc, the feature count, a three-dimensional array; for $i$ from 1 to $n$, for each value $v$ in $\mathit{domain}(X_i)$, and for each class $c$, fc[i, v, c] is the sum of the counts of the augmented examples $t$ with $X_i(t) = v$ and $\mathit{class}(t) = c$.

In each iteration, it sweeps through the data once to compute the sufficient statistics. The sufficient statistics from the previous iteration are used to infer the new sufficient statistics for the next iteration.

The probabilities required of the model can be computed from cc and fc:

$P(C{=}c) = \dfrac{cc[c]}{|E|}$

where $|E|$ is the number of examples in the original dataset (which is the same as the sum of the counts in the augmented dataset), and

$P(X_i{=}v \mid C{=}c) = \dfrac{fc[i, v, c]}{cc[c]}.$
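Continuing the illustrative layout used in the class_posterior sketch, these two probabilities can be read directly off the sufficient statistics; the function name model_from_counts is again an assumption of this sketch.

def model_from_counts(cc, fc, num_examples):
    """Turn the sufficient statistics into the probability tables used by
    the class_posterior sketch above.

    cc: cc[c] is the (expected) count of class c
    fc: fc[i][v][c] is the (expected) count of examples with feature i
        equal to v in class c
    """
    k = len(cc)
    prior = [cc[c] / num_examples for c in range(k)]
    cond = [{c: {v: fc[i][v][c] / cc[c] for v in fc[i]}
             for c in range(k)}
            for i in range(len(fc))]
    return prior, cond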
1: procedure EM(Xs, Es, k)
2:   Inputs
3:     Xs: set of features, $Xs = \{X_1, \ldots, X_n\}$
4:     Es: set of training examples
5:     k: number of classes
6:   Output
7:     sufficient statistics for probabilistic model on X
8:   Local
9:     real cc[c], cc_new[c]              # old and new class count
10:     real fc[i, v, c], fc_new[i, v, c]              # old and new feature count
11:     real dc              # class probability for current example and class
12:     Boolean stable
13:   repeat
14:     cc_new[c] and fc_new[i, v, c] initialized to be all zero
15:     for each example $\langle v_1, \ldots, v_n\rangle \in Es$ do
16:       for each $c \in [1, k]$ do
17:         $dc := P(C{=}c \mid X_1{=}v_1, \ldots, X_n{=}v_n)$
18:         cc_new[c] := cc_new[c] + dc
19:         for each $i \in [1, n]$ do
20:           fc_new[i, $v_i$, c] := fc_new[i, $v_i$, c] + dc
21:     stable := (cc ≈ cc_new) and (fc ≈ fc_new)
22:     cc := cc_new
23:     fc := fc_new
24:   until stable
25:   return cc, fc
Figure 10.8: EM for unsupervised learning

The algorithm of Figure 10.8 computes the sufficient statistics. Evaluating $P(C{=}c \mid X_1{=}v_1, \ldots, X_n{=}v_n)$ in line 17 relies on the counts in cc and fc. This algorithm has glossed over how to initialize the counts. One way is for $P(C \mid X_1{=}v_1, \ldots, X_n{=}v_n)$ to return a random distribution for the first iteration, so the counts come from the data. Alternatively, the counts can be assigned randomly before seeing any data. See Exercise 10.7.

The algorithm will eventually converge; it stops when cc and fc do not change significantly in an iteration. The threshold for the approximately-equal test in line 21 can be tuned to trade off learning time and accuracy. An alternative is to run the algorithm for a fixed number of iterations.
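The following Python sketch is one way to implement Figure 10.8, using the first initialization option above (random responsibilities on the first pass); the function name em, the tolerance parameter, and the fc[i][v][c] data layout (shared with the model_from_counts sketch) are assumptions of this sketch.

import random

def em(examples, k, max_iterations=100, tol=1e-6, rng=None):
    """Soft clustering via EM for a naive Bayes model with a hidden class,
    in the spirit of Figure 10.8.  examples is a list of lists of
    categorical feature values; the function returns the sufficient
    statistics cc and fc, where cc[c] is the expected count of class c and
    fc[i][v][c] is the expected count of feature i having value v in
    class c."""
    rng = rng or random.Random()
    n = len(examples[0])                     # number of features
    N = len(examples)                        # number of examples
    domains = [sorted({ex[i] for ex in examples}) for i in range(n)]

    def zero_stats():
        cc = [0.0] * k
        fc = [{v: [0.0] * k for v in domains[i]} for i in range(n)]
        return cc, fc

    def accumulate(cc, fc, ex, dist):
        # Add one example, weighted by its class distribution, to the stats.
        for c in range(k):
            cc[c] += dist[c]
            for i, v in enumerate(ex):
                fc[i][v][c] += dist[c]

    # First iteration: random responsibilities, so the counts come from data.
    cc, fc = zero_stats()
    for ex in examples:
        weights = [rng.random() for _ in range(k)]
        total = sum(weights)
        accumulate(cc, fc, ex, [w / total for w in weights])

    for _ in range(max_iterations):
        cc_new, fc_new = zero_stats()
        for ex in examples:
            # E step for this example: P(C = c | ex) from the current cc, fc.
            scores = []
            for c in range(k):
                s = cc[c] / N
                for i, v in enumerate(ex):
                    s *= fc[i][v][c] / cc[c]
                scores.append(s)
            total = sum(scores)
            dist = [s / total for s in scores]
            # The new sufficient statistics are the expected counts; the M
            # step amounts to reading the probabilities off them.
            accumulate(cc_new, fc_new, ex, dist)
        # Stop when the statistics barely change (the approx test of line 21).
        change = max(abs(a - b) for a, b in zip(cc, cc_new))
        for i in range(n):
            for v in domains[i]:
                change = max(change, max(abs(a - b)
                             for a, b in zip(fc[i][v], fc_new[i][v])))
        cc, fc = cc_new, fc_new
        if change < tol:
            break

    return cc, fc

The returned statistics can be converted into probabilities with model_from_counts and used for classification with class_posterior; in practice, pseudocounts are often added to cc and fc so that no probability becomes exactly zero.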

Example 10.10.

Consider Figure 10.7.

When the example $\langle x_1, \neg x_2, x_3, x_4\rangle$ is encountered in the dataset, the algorithm computes

$P(C{=}c \mid x_1 \wedge \neg x_2 \wedge x_3 \wedge x_4)$
$\quad \propto P(X_1{=}1 \mid C{=}c)\, P(X_2{=}0 \mid C{=}c)\, P(X_3{=}1 \mid C{=}c)\, P(X_4{=}1 \mid C{=}c)\, P(C{=}c)$
$\quad = \dfrac{fc[1,1,c]}{cc[c]} \cdot \dfrac{fc[2,0,c]}{cc[c]} \cdot \dfrac{fc[3,1,c]}{cc[c]} \cdot \dfrac{fc[4,1,c]}{cc[c]} \cdot \dfrac{cc[c]}{|E|}$
$\quad \propto \dfrac{fc[1,1,c]\, fc[2,0,c]\, fc[3,1,c]\, fc[4,1,c]}{cc[c]^3}$

for each class c and normalizes the results. Suppose the value computed for class 1 is 0.4, class 2 is 0.1, and class 3 is 0.5 (as in the augmented data in Figure 10.7). Then cc_new[1] is incremented by 0.4, cc_new[2] is incremented by 0.1, etc. Values fc_new[1,1,1], fc_new[2,0,1], etc. are each incremented by 0.4. Next, fc_new[1,1,2], fc_new[2,0,2] are each incremented by 0.1, etc.
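Using the same (assumed) 0-based data layout as the earlier Python sketches, the accumulation described here for this one example can be written as:

# Class distribution for the example <x1, not x2, x3, x4> from Example 10.10
# (after normalization), for k = 3 classes.
dist = [0.4, 0.1, 0.5]
values = [1, 0, 1, 1]            # X1 = 1, X2 = 0, X3 = 1, X4 = 1
k, n = 3, 4

# Fresh accumulators for this iteration; feature indices are 0-based here,
# unlike the 1-based indices in the figures.
cc_new = [0.0] * k
fc_new = [{0: [0.0] * k, 1: [0.0] * k} for _ in range(n)]

# E-step accumulation for this one example.
for c in range(k):
    cc_new[c] += dist[c]
    for i, v in enumerate(values):
        fc_new[i][v][c] += dist[c]

# cc_new is now [0.4, 0.1, 0.5]; fc_new[0][1], fc_new[1][0], fc_new[2][1],
# and fc_new[3][1] are each [0.4, 0.1, 0.5] as well.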

Notice the similarity with the k-means algorithm. The E step (probabilistically) assigns examples to classes, and the M step determines what the classes predict.

As long as $k > 1$, EM, like k-means, virtually always has multiple local and global maxima. In particular, any permutation of the class labels will give the same probabilities. To try to find a global maximum, multiple restarts can be tried, and a model with the highest log-likelihood returned.