Why clustering and classification?

10th September 2015

"It is important to recognise if the goal of a study requires a supervised or unsupervised approach." - Charlotte Soneson, Qlucore

When analysing large quantities of scientific data, researchers can choose between supervised and unsupervised methods. Charlotte Soneson outlines the benefits of both approaches in the context of classification and clustering.

When working with machine learning methods it is important to distinguish between supervised and unsupervised methods, since they are used in very different circumstances.

Unsupervised methods do not use any information about the samples (annotations). Instead they try to find the dominating structure and patterns in the data, which can then be interpreted. Clustering is an example of an unsupervised method, where the goal is to find subgroups in the data without using any sample annotation. Principal component analysis (PCA) is also unsupervised.

Supervised methods typically aim to build models that explain or predict some pre-specified sample annotation. This annotation may or may not correspond to the main pattern in the data. Classification, or predictive modelling, is an example of supervised learning. Given some data and a sample annotation, the aim is to build a model that can predict the value of the annotation for a new sample for which we have only the data.

It is important to recognise if the goal of a study requires a supervised or unsupervised approach. For example, if the goal is to build a model that can predict the disease status of a patient, one should use a supervised approach. Using an unsupervised approach like clustering or PCA will likely mix the signal that we are interested in with other, unrelated, signals and give a worse predictor, unless the disease status is the main signal in the data. On the other hand, if the goal is to get an overview of a data set, to see which are the strongest signals and if the samples naturally group into subgroups, an unsupervised method like clustering should be used.

One thing to keep in mind when we use supervised methods is that since we are explicitly looking for patterns that are associated with a given annotation, we will almost certainly find something that can predict the annotation in the current data set. However, this is not what we are interested in (since we already know the annotation values in this data set). We are interested in seeing whether the derived model can predict the value of the annotation in an independent data set, where we have only the data, but no information about the annotation. Thus, supervised models must always be validated in an independent data set; good predictive performance in the data used to build the model says nothing about how well the model generalises. A model that cannot predict the correct annotation values in independent data is not good. Independent validation of this kind is usually not necessary for unsupervised methods, which are mostly used to summarise, explore and describe a data set.
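The point can be illustrated with a small sketch in Python. Everything in it is invented for illustration: the data are pure random noise, the annotation is unrelated to them, and the function names (`best_single_variable_rule`, `predict`, `accuracy`) are hypothetical. The search simply picks the single variable and threshold that best reproduce the annotation in the current data.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible

def predict(rule, row):
    """Apply a rule (variable index, threshold, direction) to one sample."""
    j, threshold, direction = rule
    above = row[j] > threshold
    return int(above) if direction == 1 else int(not above)

def accuracy(rule, data, labels):
    """Fraction of samples whose annotation the rule reproduces."""
    hits = sum(predict(rule, row) == y for row, y in zip(data, labels))
    return hits / len(labels)

def best_single_variable_rule(data, labels):
    """Try every variable, threshold and direction; keep the best fit."""
    candidates = ((j, row[j], direction)
                  for j in range(len(data[0]))
                  for row in data
                  for direction in (1, -1))
    return max(candidates, key=lambda rule: accuracy(rule, data, labels))

def noise(n_samples, n_variables):
    """Generate samples whose variables are all pure Gaussian noise."""
    return [[rng.gauss(0, 1) for _ in range(n_variables)]
            for _ in range(n_samples)]

# 20 samples, 200 variables of pure noise, and an annotation unrelated to them.
current_data = noise(20, 200)
current_labels = [0, 1] * 10

rule = best_single_variable_rule(current_data, current_labels)
current_acc = accuracy(rule, current_data, current_labels)

# Independent data from the same signal-free source.
new_data = noise(1000, 200)
new_labels = [rng.randint(0, 1) for _ in range(1000)]
new_acc = accuracy(rule, new_data, new_labels)

print(f"accuracy in the current data: {current_acc:.2f}")
print(f"accuracy in independent data: {new_acc:.2f}")
```

With many variables and few samples, the search finds a rule that looks convincing in the current data even though there is nothing to find. In independent data, the same rule should perform at about chance level.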


In Qlucore we have two types of clustering methods: hierarchical clustering (in the heatmaps) and k-means clustering. Both are used for the same purpose: to find subgroups among the samples and to see whether the samples naturally distribute themselves into distinct clusters. The difference is that hierarchical clustering builds a ‘cluster tree’ (or dendrogram), which organises the samples hierarchically but does not directly divide them into clusters, while k-means splits the samples into a pre-defined number of groups.
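The difference between the two outputs can be sketched in plain Python. This is a generic, minimal illustration and not a description of how Qlucore implements either method. The six two-dimensional samples are made up, the choice of average linkage is an assumption, and starting k-means from the first k samples is a simplification made for reproducibility (practical implementations usually use several random starts).

```python
from math import dist  # Euclidean distance, Python 3.8+

def kmeans(points, k, n_iter=100):
    """Split the points into k groups; returns one group label per point."""
    centres = [points[i] for i in range(k)]  # simplistic start: first k points
    labels = None
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda c: dist(p, centres[c]))
                      for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centres[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels, centres

def agglomerate(points):
    """Average-linkage hierarchical clustering; returns the list of merges."""
    clusters = [[i] for i in range(len(points))]
    merges = []

    def linkage(a, b):
        # Mean distance over all pairs with one sample from each cluster.
        return sum(dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        d, i, j = min((linkage(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters)))
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        merged = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)] + [merged]
    return merges

# Made-up samples forming two tight groups, listed in alternating order.
samples = [(1.0, 1.0), (5.0, 5.2), (1.2, 0.8), (5.1, 4.9), (0.8, 1.1), (4.8, 5.0)]

labels, centres = kmeans(samples, k=2)
tree = agglomerate(samples)

print("k-means labels:", labels)
for left, right, height in tree:
    print(f"merge {left} + {right} at distance {height:.2f}")
```

k-means returns a group label for every sample directly. The hierarchical method returns only the sequence of merges, that is, the tree. To obtain clusters from it one still has to decide where to cut, for example just below the final merge, which here joins the two groups at a much larger distance than any earlier merge.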

Practical situations where one would like to use a clustering approach include:

* To see whether there are subtypes of a particular disease, ie, if the samples group into different clusters. These clusters may represent different disease types, which have different prognosis and behaviour.

* To explore the data set and look for artifacts. This can be done by clustering the data and seeing whether the clusters agree with the signals that one expects to dominate, or whether they instead correspond to batch effects or other technical artifacts.
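One simple way to carry out the second check is to cross-tabulate the cluster labels against each available annotation. The sketch below assumes that cluster labels have already been obtained from some clustering method. The labels, the batch and disease annotations, and the helper names `cross_tabulate` and `purity` are all made up for illustration.

```python
from collections import Counter

def cross_tabulate(clusters, annotation):
    """Count samples for each (cluster, annotation value) pair."""
    return Counter(zip(clusters, annotation))

def purity(clusters, annotation):
    """Fraction of samples carrying the most common annotation value
    within their own cluster (1.0 means perfect agreement)."""
    table = cross_tabulate(clusters, annotation)
    best = {}
    for (cluster, _), count in table.items():
        best[cluster] = max(best.get(cluster, 0), count)
    return sum(best.values()) / len(clusters)

# Hypothetical cluster labels and annotations for eight samples.
clusters = [0, 0, 0, 1, 1, 1, 0, 1]
batch = ["run1", "run1", "run1", "run2", "run2", "run2", "run1", "run2"]
disease = ["case", "control", "case", "control", "case", "control", "control", "case"]

batch_purity = purity(clusters, batch)
disease_purity = purity(clusters, disease)

print(dict(cross_tabulate(clusters, batch)))
print("agreement with batch:  ", batch_purity)
print("agreement with disease:", disease_purity)
```

In this invented example the clusters follow the batch exactly (purity 1.0), while disease status is split evenly within each cluster (purity 0.5, the lowest possible value for two equally sized groups). A result like this would suggest that the dominating signal is technical rather than biological.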


Classification models consist of two parts: the variables that are used and a rule to combine the values of these variables in order to obtain a predicted value of a given sample annotation. Both are important, and are usually determined together.
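The two parts can be made concrete with a deliberately simple sketch. The assumptions are as follows: variables are chosen by the difference between class means (a crude criterion that ignores variability), the combining rule is 'assign the sample to the class with the nearest centroid', and all data and names are hypothetical.

```python
def fit_nearest_centroid(data, labels, n_vars):
    """Build a model: (selected variable indices, centroid per class)."""
    classes = sorted(set(labels))

    def class_mean(c, j):
        values = [row[j] for row, y in zip(data, labels) if y == c]
        return sum(values) / len(values)

    # Part 1: the variables. Keep those whose class means differ most.
    def spread(j):
        means = [class_mean(c, j) for c in classes]
        return max(means) - min(means)

    variables = sorted(range(len(data[0])), key=spread, reverse=True)[:n_vars]

    # Part 2: the rule. Store one centroid per class over those variables.
    centroids = {c: [class_mean(c, j) for j in variables] for c in classes}
    return variables, centroids

def predict(model, row):
    """Assign a sample to the class whose centroid is closest."""
    variables, centroids = model
    x = [row[j] for j in variables]
    return min(centroids,
               key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))

# Made-up training data: four variables, of which only 1 and 3 carry signal.
train = [
    [5.1, 1.0, 3.3, 9.0], [4.9, 1.2, 3.1, 8.7], [5.0, 0.9, 3.4, 9.2],
    [5.0, 3.1, 3.2, 6.1], [5.2, 2.9, 3.3, 5.8], [4.8, 3.0, 3.3, 6.0],
]
subtype = ["A", "A", "A", "B", "B", "B"]

model = fit_nearest_centroid(train, subtype, n_vars=2)
print("variables used:", sorted(model[0]))
print(predict(model, [5.0, 1.1, 3.2, 8.8]))
print(predict(model, [5.1, 2.8, 3.3, 6.2]))
```

Both parts are derived from the same training data, which is why they are usually determined together: the centroids only make sense for the variables that were selected.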

Practical situations where one would like to use a classification approach include:

* To build a model that can use gene expression data to predict the prognosis of a cancer patient;

* To build a model that can use some numeric data to assign a sample to one of several disease subtypes.

As noted above, it is important that a predictive model is evaluated on independent data, and not on the same data where it was built. Overfitting refers to the situation where a model is ‘too specifically adapted’ to a given data set, and does not generalise to other data sets. Usually this is a sign that the model has exploited the random noise in the training data set in order to fit that particular data well. The noise in an independent data set will likely be different, so the model no longer works there.
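A minimal sketch of overfitting, under invented assumptions: a single variable `x`, a true annotation that is 1 when `x > 0`, and a quarter of all annotations flipped at random to play the role of noise. A one-nearest-neighbour rule, used here only because it memorises its training data, is compared with the simple rule 'predict 1 if x > 0'.

```python
import random

rng = random.Random(1)  # fixed seed so the sketch is reproducible

def noisy_sample(n, flip=0.25):
    """Draw n values; the annotation is 1 if x > 0, flipped with probability `flip`."""
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    ys = [int(x > 0) ^ int(rng.random() < flip) for x in xs]
    return xs, ys

def one_nn(train_x, train_y, x):
    """Predict the annotation of the closest training sample."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

def accuracy(predicted, truth):
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)

train_x, train_y = noisy_sample(40)
new_x, new_y = noisy_sample(2000)  # independent data from the same source

nn_train = accuracy([one_nn(train_x, train_y, x) for x in train_x], train_y)
nn_new = accuracy([one_nn(train_x, train_y, x) for x in new_x], new_y)
simple_new = accuracy([int(x > 0) for x in new_x], new_y)

print(f"1-NN, training data:           {nn_train:.2f}")
print(f"1-NN, independent data:        {nn_new:.2f}")
print(f"simple rule, independent data: {simple_new:.2f}")
```

The nearest-neighbour rule reproduces the training annotations perfectly because each training sample is its own nearest neighbour, flipped annotations included. In independent data that advantage disappears, and the memorised noise typically makes it less accurate than the simple rule.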

Cross-validation is a technique that can be used to evaluate a model based on a single data set. The idea is to split the data set repeatedly into a training part and a test part, each time building the model on the training part and evaluating its performance on the test part (which was not used to build the model).
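The procedure can be sketched as k-fold cross-validation, one common variant in which every sample is used for testing exactly once. The helper names, the toy one-variable data and the threshold classifier below are all assumptions made for illustration. Any `fit` and `predict` pair could be passed in instead.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle the sample indices and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(data, labels, fit, predict, k=5, seed=0):
    """Return the test accuracy obtained on each of the k folds."""
    scores = []
    for test_idx in k_fold_indices(len(data), k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(data)) if i not in held_out]
        # The model sees only the training part...
        model = fit([data[i] for i in train_idx], [labels[i] for i in train_idx])
        # ...and is scored only on samples it has never seen.
        hits = sum(predict(model, data[i]) == labels[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return scores

def fit_threshold(values, labels):
    """Toy model: midpoint between the two class means
    (assumes class 1 has the larger values)."""
    mean0 = sum(v for v, y in zip(values, labels) if y == 0) / labels.count(0)
    mean1 = sum(v for v, y in zip(values, labels) if y == 1) / labels.count(1)
    return (mean0 + mean1) / 2

def predict_threshold(threshold, value):
    return int(value > threshold)

# Made-up, well-separated one-variable data: ten samples per class.
values = [1.0 + 0.05 * i for i in range(10)] + [3.0 + 0.05 * i for i in range(10)]
labels = [0] * 10 + [1] * 10

scores = cross_validate(values, labels, fit_threshold, predict_threshold, k=5)
print("accuracy per fold:", scores)
print("mean accuracy:", sum(scores) / len(scores))
```

Every step that uses the annotation, including any selection of variables, belongs inside `fit`, so that it is repeated on each training part. Otherwise information from the test part leaks into the model and the estimate becomes too optimistic.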

The word classification is usually used to describe predictive modelling where the sample annotation is categorical. To predict a numeric/continuous annotation, one uses regression. 
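For comparison with the categorical case, here is a minimal sketch of regression: an ordinary least-squares straight line fitted to one variable. The data are invented and lie exactly on a line, and the function name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Made-up data with a numeric annotation: y = 1 + 2x exactly.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

intercept, slope = fit_line(xs, ys)
predicted = intercept + slope * 6  # prediction for a new sample with x = 6
print(intercept, slope, predicted)
```

The predicted value is a number on a continuous scale rather than one of a fixed set of categories. The need for validation in independent data applies in the same way.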

Charlotte Soneson is with Qlucore.


© Setform Limited