
Prerequisites

Please install the following packages (we will also use the tidyverse, and e1071 for one of the homework alternatives):

install.packages(c("randomForest", "party", "MASS", "pROC", "tidyverse", "e1071"))

Please load the following three data sets:

m_dat      <- read.csv("https://january3.github.io/Bioinformatics/Datasets/metabo_data.csv", 
                       row.names=1)
m_samples  <- read.csv("https://january3.github.io/Bioinformatics/Datasets/metabo_coldata.csv", 
                       row.names=1)
m_features <- read.csv("https://january3.github.io/Bioinformatics/Datasets/metabo_fdata.csv", 
                       row.names=1)

Machine learning algorithms

There are numerous machine learning algorithms:

  • decision trees
    • with bagging
    • with boosting
    • random forests
  • support vector machines (SVMs)
  • projections to latent structures (PLS)
  • linear discriminant analysis
  • regression models (e.g. logistic regression, ridge regression, lasso regression, elastic nets)
  • neural networks
  • deep learning algorithms

Today, we will first apply LDA to the iris data set, and then we will apply decision trees to a metabolome data set.

Iris data set

Creating a training and test set

data(iris)
n <- nrow(iris) # 150 samples in total
train <- sample(1:n, size=2/3 * n) # a random 2/3 of the samples for training
test  <- setdiff(1:n, train)       # the remaining 1/3 for testing

iris_train <- iris[train, ]
iris_test  <- iris[test,  ]

Exercise. (5 min) Make sure that you understand exactly what is happening above. Ask questions if necessary. How many samples are in the training set? How many in the test set? How many samples of each species are in the test set? Are the group sizes balanced? (What does “balanced” mean?)
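A quick way to check your answers (a sketch only; work it out yourself first):

length(train)            # size of the training set
table(iris_test$Species) # per-species counts in the test set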

LDA with the iris data set

First, we create a model for discriminating between versicolor and virginica iris species only.

data(iris)
library(MASS)
library(tidyverse)
iris_train_vv <- iris_train %>% filter(Species != "setosa") %>%
  mutate(Species=factor(Species))
mod_lda_vv <- lda(Species ~ ., data=iris_train_vv)

The result is an object of class lda, which is a list. The element scaling (mod_lda_vv$scaling) contains a matrix with the coefficients for each of the four features (variables). We can now calculate the values of the LD function for each of the samples. We use the predict function here, but we are not predicting anything yet: this is still the training set.

iris_train_vv_pred <- predict(mod_lda_vv, iris_train_vv)
iris_train_vv_pred <- iris_train_vv_pred$x[,1]
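Under the hood, predict only applies a linear function to the data. Here is a minimal sketch of that calculation, assuming (as in the MASS implementation) that the data are centered at the prior-weighted mean of the group means; ld_manual is our own name:

ctr <- colSums(mod_lda_vv$prior * mod_lda_vv$means) # weighted overall mean
X   <- as.matrix(iris_train_vv[, 1:4])              # the four features
ld_manual <- scale(X, center=ctr, scale=FALSE) %*% mod_lda_vv$scaling
all.equal(as.numeric(ld_manual), as.numeric(iris_train_vv_pred)) # should be TRUE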

The element x of the object returned by predict is a matrix with one column containing the LD values. We can visualize the data with a boxplot:

df <- data.frame(Species=iris_train_vv$Species, LD=iris_train_vv_pred)
ggplot(df, aes(x=Species, y=LD)) + geom_boxplot()

We see that LD is high for virginica and low for versicolor.

Exercise (5 min). The coefficients for the sepal variables in the LD are negative, while the coefficients for the petal variables are positive. The LD is positive for virginica and negative for versicolor. Question: based on these facts, are petals larger in virginica or in versicolor?

Making and testing predictions

We can now make predictions for the test set:

iris_test_vv <- iris_test %>% filter(Species != "setosa") %>%
  mutate(Species=factor(Species))
iris_test_vv_pred <- predict(mod_lda_vv, iris_test_vv)
iris_test_vv_pred <- iris_test_vv_pred$x[,1]
df <- cbind(iris_test_vv, LD=iris_test_vv_pred)
ggplot(df, aes(x=Species, y=LD)) + geom_boxplot()

A natural cutoff seems to be \(LD = 0\). We will now confront the predictions with reality.

pred <- ifelse(iris_test_vv_pred > 0, "virginica", "versicolor")
table(pred, iris_test_vv$Species)
##             
## pred         versicolor virginica
##   versicolor         18         0
##   virginica           0        14

Question. What is the overall error rate?

Homework 6, alternative 1. Repeat the above analysis with the full iris data set (three classes). You will find that the x part of the prediction object now has two columns, LD1 and LD2. Make boxplots for both of them. Which LD discriminates between which species? [Note: instead of this homework, you may choose “alternative 2” (only if we don’t do it together during the practicals) or “alternative 3” below.] Submit an Rmarkdown file.

The metabo data set

The metabo data set contains metabolomic profiles for patients suffering from tuberculosis (TB) and for two classes of healthy controls: LTBI, individuals who are healthy but latently infected with Mycobacterium tuberculosis, and CTRL, individuals who are healthy and uninfected.

The data set consists of three data frames (which we loaded in the “Prerequisites” section).

Exercise (5 min). Inspect the data files and tell me what they are.
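If you are unsure where to start, these standard inspection functions help:

dim(m_dat)       # dimensions
head(m_samples)  # first rows
str(m_features)  # structure and column types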

First, we will remove all biochemicals without a known name:

sel <- !grepl("^X-[0-9]*$", m_features$BIOCHEMICAL.NAME) # FALSE for names of the form "X-<digits>"
m_features <- m_features[sel, ]
m_dat <- m_dat[sel, ] # keep the data and the feature annotation in sync

We will need a data frame containing the features in columns rather than rows (we use GROUP2 for classification because it only has two classes):

m_df <- data.frame(GROUP=factor(m_samples$GROUP2), t(m_dat))

Here is a PCA plot of the data:

pca <- prcomp(t(m_dat), scale.=TRUE)
df <- data.frame(GROUP=m_samples$GROUP2, pca$x[,1:5])
ggplot(df, aes(x=PC1, y=PC2, color=GROUP)) + geom_point()
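To see how much of the variance the first components capture, the summary of a prcomp object reports the proportion of variance per component:

summary(pca)$importance[, 1:5]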

As before, we will create a training and test set selection.

set.seed(12345678)
n <- nrow(m_samples) # number of samples
train <- sample(1:n, size=2/3 * n)
test  <- setdiff(1:n, train)

In the following, we will create various ML models for the data and test their performance.

Decision trees

To understand a decision tree it is best to just visualize it:

library(party)

ct <- ctree(GROUP ~ ., data=m_df[ train, ])
plot(ct)

We will now discuss what is on that tree.

Exercise. (5 min) Which biochemicals were selected by the algorithm?

Now we can apply the tree model to the test data set:

preds <- predict(ct, newdata=m_df[ test, ])
tmp <- table(preds, m_df[ test, ]$GROUP)
print(tmp)
##          
## preds     HEALTHY TB
##   HEALTHY      21  5
##   TB            8 12

Remember: columns correspond to the reality and rows to the predictions. Therefore, we interpret the numbers as follows:

| Abbr | Name           | Description                           | Count |
|:-----|:---------------|:--------------------------------------|------:|
| TN   | true negative  | prediction: HEALTHY, reality: HEALTHY |    21 |
| FP   | false positive | prediction: TB, reality: HEALTHY      |     8 |
| FN   | false negative | prediction: HEALTHY, reality: TB      |     5 |
| TP   | true positive  | prediction: TB, reality: TB           |    12 |

Exercise. (15 min) Calculate the sensitivity, specificity, FPR and FDR. Consult Wikipedia if necessary, in particular the box on the right-hand side.

Running a leave-one-out cross-validation (LOO-CV)

LOO-CV allows us to take advantage of almost all samples in the data set when constructing a model, while avoiding overfitting.

The package caret contains many useful functions for modeling and cross-validation. However, (i) it is useful to see how LOO-CV works under the hood, and (ii) the package is not entirely straightforward to use. Therefore, we will write our own code. With the map functions from the tidyverse it is quite simple:

test_one <- function(df, i) {
  train <- df[-i, ] # all samples except sample i
  test  <- df[ i, ] # the single left-out sample
  ct <- ctree(GROUP ~ ., train)
  pred <- as.character(predict(ct, newdata=test))
  return(pred)
}

## warning! This takes a while
preds <- map_chr(1:nrow(m_df), ~ test_one(m_df, .x))
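
As before, we tabulate the cross-validated predictions against the true classes:

table(preds, m_df$GROUP)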

This results in the following performance:

preds     HEALTHY  TB
  HEALTHY      84   7
  TB            8  39

Random forests

One of the ways to improve the predictions from a decision tree is to combine bootstrapping (training each tree on a random subset of the data) with growing multiple trees, each tree making its own decision. The grown “forest” of trees then casts a vote on each sample, and the majority wins; a minimal sketch of this idea follows below.
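Here is that sketch, using ctree from above. It illustrates only the bagging and voting part; a real random forest additionally samples a random subset of features at each split, and randomForest does all of this internally. B, votes and majority are our own names:

B <- 25 # number of trees (a real forest uses many more)
votes <- replicate(B, {
  idx  <- sample(nrow(m_df), replace=TRUE)   # bootstrap sample
  tree <- ctree(GROUP ~ ., data=m_df[idx, ]) # grow one tree on it
  as.character(predict(tree, newdata=m_df))  # this tree's vote for every sample
})
## majority vote over the B trees for each sample
majority <- apply(votes, 1, function(v) names(which.max(table(v))))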

One of the great advantages of random forests is that they deliver out-of-bag (OOB) error estimates which are nearly as good as those obtained by cross-validation. This is why we are allowed to work here with the full data set rather than with the training set only.

library(randomForest)
rf <- randomForest(GROUP ~ ., data=m_df, importance=TRUE)
print(rf)
## 
## Call:
##  randomForest(formula = GROUP ~ ., data = m_df, importance = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 13
## 
##         OOB estimate of  error rate: 6.52%
## Confusion matrix:
##         HEALTHY TB class.error
## HEALTHY      91  1  0.01086957
## TB            8 38  0.17391304

Note that in the table above, the reality is in rows and the predictions are in columns (the reverse of what we were using before). Thus, the class.error of the HEALTHY row corresponds to 1 - specificity (the FPR), and the class.error of the TB row corresponds to 1 - sensitivity (the FNR).
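To make this concrete, here is a sketch computing specificity and sensitivity from the stored confusion matrix (rf$confusion holds the counts shown above plus the class.error column); spec and sens are our own names:

conf <- rf$confusion
spec <- conf["HEALTHY", "HEALTHY"] / sum(conf["HEALTHY", c("HEALTHY", "TB")])
sens <- conf["TB", "TB"]           / sum(conf["TB",      c("HEALTHY", "TB")])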

We can now check what would happen if we tested the model using the very data set that we used to train it:

### INCORRECT!!!
preds <- predict(rf, m_df)
table(preds, m_df$GROUP)
##          
## preds     HEALTHY TB
##   HEALTHY      92  0
##   TB            0 46

The above is vastly overfit: no errors at all! But it will not work like that on a new data set.

Variable importance

We cannot check individual trees (well, we can, but it's complicated) to see which metabolites play a role in the classification, but we can ask which variables are most important overall:

imp <- rf$importance %>% as.data.frame %>% 
  arrange(-MeanDecreaseAccuracy) %>% head(20)

merge(m_features, imp, by.x="ID", by.y=0) %>%
  dplyr::select(ID, BIOCHEMICAL.NAME, SUB.PATHWAY, HEALTHY, TB, 
         MeanDecreaseAccuracy) %>% knitr::kable()
| ID | BIOCHEMICAL.NAME | SUB.PATHWAY | HEALTHY | TB | MeanDecreaseAccuracy |
|:---|:-----------------|:------------|--------:|---:|---------------------:|
| ID_12 | pyroglutamine* | Glutamate metabolism | 0.0045641 | 0.0165807 | 0.0083925 |
| ID_121 | 3-carboxy-4-methyl-5-propyl-2-furanpropanoate (CMPF) | Fatty acid, furan, dicarboxylate | 0.0050517 | 0.0084333 | 0.0061574 |
| ID_139 | 1-myristoylglycerophosphocholine* (C14:0) | Glycerolipid metabolism | 0.0033308 | 0.0064863 | 0.0043195 |
| ID_159 | taurochenodeoxycholate | Bile acid metabolism | 0.0016939 | 0.0038122 | 0.0024404 |
| ID_169 | cortisol | Sterol/Steroid | 0.0036034 | 0.0091002 | 0.0053039 |
| ID_17 | phenylalanine | Phenylalanine & tyrosine metabolism | 0.0103507 | 0.0130885 | 0.0112540 |
| ID_171 | citrate | Krebs cycle | 0.0036999 | 0.0110992 | 0.0058717 |
| ID_24 | kynurenine | Tryptophan metabolism | 0.0060844 | 0.0064443 | 0.0061928 |
| ID_35 | beta-hydroxyisovalerate | Valine, leucine and isoleucine metabolism | 0.0039891 | 0.0035924 | 0.0038450 |
| ID_4 | threonine | Glycine, serine and threonine metabolism | 0.0016321 | 0.0162220 | 0.0063656 |
| ID_41 | cysteine | Cysteine, methionine, SAM, taurine metabolism | 0.0153378 | 0.0286705 | 0.0199347 |
| ID_49 | citrulline | Urea cycle; arginine-, proline-, metabolism | 0.0104126 | 0.0197633 | 0.0133568 |
| ID_51 | trans-4-hydroxyproline | Urea cycle; arginine-, proline-, metabolism | 0.0021058 | 0.0056389 | 0.0032808 |
| ID_54 | creatine | Creatine metabolism | 0.0041838 | 0.0102168 | 0.0060958 |
| ID_59 | cysteine-glutathione disulfide | Glutathione metabolism | 0.0036727 | 0.0117810 | 0.0064059 |
| ID_70 | ADSGEGDFXAEGGGVR* (fibrinopeptide) | Polypeptide | 0.0057186 | 0.0110439 | 0.0073084 |
| ID_71 | DSGEGDFXAEGGGVR* (fibrinopeptide) | Polypeptide | 0.0028706 | 0.0033082 | 0.0029334 |
| ID_73 | HWESASXX* (complement C3 peptide) | Polypeptide | 0.0036394 | 0.0038200 | 0.0038241 |
| ID_75 | N-acetylneuraminate | Aminosugars metabolism | 0.0078936 | 0.0186283 | 0.0112736 |
| ID_77 | mannose | Fructose, mannose, galactose, starch, and sucrose metabolism | 0.0166245 | 0.0266204 | 0.0197107 |

Fine, but does that make sense? Let us plot some of the molecules.

sel <- c("ID_24", "ID_169", "ID_70", "ID_139")
df <- m_df %>% dplyr::select(all_of(c("GROUP", sel))) %>%
  pivot_longer(starts_with("ID"), names_to="ID", values_to="level")
ggplot(df, aes(x=GROUP, y=level)) + geom_boxplot() +
  facet_wrap(~ ID, ncol=2)

ROC curves

To plot a ROC curve, we need a numeric predictor rather than binary (yes / no) predictions. For many algorithms, such a numeric predictor can be obtained from predict in the form of class probabilities (e.g. with type="prob", as we do for ctree below). However, we cannot do that for the random forest: predictions on the training data would be overfit. Instead, we need the OOB estimates of the probabilities. These are stored in the rf object:

probs <- rf$votes
head(probs)
##              HEALTHY         TB
## MPIB_00173 0.7526882 0.24731183
## MPIB_00174 0.9365079 0.06349206
## MPIB_00175 0.8500000 0.15000000
## MPIB_00176 0.9142857 0.08571429
## MPIB_00177 0.8743169 0.12568306
## MPIB_00178 0.8655914 0.13440860

The two columns contain the fraction of votes each class got from the trees (counting, for each sample, only the trees for which that sample was out of bag). For example, 0.124 in the column “HEALTHY” indicates that 12.4% of the voting trees decided that the given sample comes from a healthy person (and 87.6% that it comes from a TB patient).
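As a sanity check, the vote fractions should sum to 1 in each row, and the class with the majority of votes should match the OOB predictions stored in rf$predicted (up to random tie breaking; max.col picks the column with the largest value):

summary(rowSums(probs)) # each row sums to 1
table(colnames(probs)[max.col(probs)], rf$predicted)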

probs_df <- as.data.frame(probs) %>%
  merge(m_samples, by.x=0, by.y="SAMPLE_NAME") %>%
  arrange(HEALTHY)
head(probs_df)
##    Row.names   HEALTHY        TB GENDER GROUP1 GROUP2
## 1 MPIB_00308 0.1032609 0.8967391      M     TB     TB
## 2 MPIB_00290 0.1473684 0.8526316      F     TB     TB
## 3 MPIB_00277 0.1505376 0.8494624      F     TB     TB
## 4 MPIB_00301 0.1587302 0.8412698      M     TB     TB
## 5 MPIB_00304 0.1739130 0.8260870      M     TB     TB
## 6 MPIB_00285 0.1746032 0.8253968      F     TB     TB

To plot a ROC curve, we use roc from the package pROC. We need to supply the levels argument to tell pROC which level corresponds to the controls (negatives) and which to the cases (positives), in that order:

library(pROC)
roc(response=m_df$GROUP, predictor=probs[,1], 
    levels=c("HEALTHY", "TB"), plot=TRUE)
## 
## Call:
## roc.default(response = m_df$GROUP, predictor = probs[, 1], levels = c("HEALTHY",     "TB"), plot = TRUE)
## 
## Data: probs[, 1] in 92 controls (m_df$GROUP HEALTHY) > 46 cases (m_df$GROUP TB).
## Area under the curve: 0.9889

We can also calculate the area under the curve (AUC) and its confidence intervals:

roc(response=m_df$GROUP, predictor=probs[,1], 
    levels=c("HEALTHY", "TB"), ci=TRUE)
## 
## Call:
## roc.default(response = m_df$GROUP, predictor = probs[, 1], levels = c("HEALTHY",     "TB"), ci = TRUE)
## 
## Data: probs[, 1] in 92 controls (m_df$GROUP HEALTHY) > 46 cases (m_df$GROUP TB).
## Area under the curve: 0.9889
## 95% CI: 0.9746-1 (DeLong)

We can compare it with a ROC curve from the decision tree model. However, for this we need a slight modification of our LOO-CV function:

test_one <- function(df, i) {
  train <- df[-i, ]
  test  <- df[ i, ]
  ct <- ctree(GROUP ~ ., train)
  ## return the predicted probability of the first class (HEALTHY)
  pred <- predict(ct, newdata=test, type="prob")[[1]][1]
  return(pred)
}

## warning! This takes a while
ct_probs <- map_dbl(1:nrow(m_df), ~ test_one(m_df, .x))

And plot the ROC curve:

roc(response=m_df$GROUP, predictor=ct_probs, 
    plot=TRUE, 
    levels=c("HEALTHY", "TB"))
## 
## Call:
## roc.default(response = m_df$GROUP, predictor = ct_probs, levels = c("HEALTHY",     "TB"), plot = TRUE)
## 
## Data: ct_probs in 92 controls (m_df$GROUP HEALTHY) > 46 cases (m_df$GROUP TB).
## Area under the curve: 0.8689
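To go beyond eyeballing the two curves, the two AUCs can be compared formally with roc.test from pROC (a sketch; rf_roc and ct_roc are our own names for the two roc objects):

rf_roc <- roc(response=m_df$GROUP, predictor=probs[,1], levels=c("HEALTHY", "TB"))
ct_roc <- roc(response=m_df$GROUP, predictor=ct_probs, levels=c("HEALTHY", "TB"))
roc.test(rf_roc, ct_roc) # DeLong's test for two correlated ROC curves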

Exercise – optional (if we find time); otherwise Homework 6, alternative 2 (if we don’t manage to run it together during the practicals, you can do it as homework). Use the metabo dataset and random forests, but instead of “GROUP”, use “SEX” as the class variable to be predicted. Which biochemicals found in blood are indicative of biological sex? Submit an Rmarkdown file.

Homework 6, alternative 3. Another machine learning method is the support vector machine (SVM). An SVM algorithm is implemented in the function svm of the package e1071. Train svm on the metabolomic training data set, test it on the test data set (using predict), and calculate the sensitivity and specificity. Submit an Rmarkdown file.
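A possible starting point (a sketch only, not the full solution; it assumes the train and test indices defined above):

library(e1071)
mod_svm <- svm(GROUP ~ ., data=m_df[train, ])
preds   <- predict(mod_svm, newdata=m_df[test, ])
table(preds, m_df[test, ]$GROUP)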