##
## Attaching package: 'crayon'
## The following object is masked _by_ '.GlobalEnv':
##
## num_colors
Please install the following packages:
install.packages(c("randomForest", "party", "MASS", "pROC"))
Please load the following three data sets:
m_dat <- read.csv("https://january3.github.io/Bioinformatics/Datasets/metabo_data.csv",
row.names=1)
m_samples <- read.csv("https://january3.github.io/Bioinformatics/Datasets/metabo_coldata.csv",
row.names=1)
m_features <- read.csv("https://january3.github.io/Bioinformatics/Datasets/metabo_fdata.csv",
row.names=1)
There are numerous machine learning algorithms:
Today, we will first apply LDA to the iris data set, and then we will apply decision trees to a metabolome data set.
data(iris)
n <- nrow(iris) # 150
train <- sample(1:n, size=2/3 * n)
test <- setdiff(1:n, train)
iris_train <- iris[train, ]
iris_test <- iris[test, ]
Exercise. (5 min) Make sure that you understand exactly what is happening above. Ask questions if necessary. How many samples are in the training set? How many in the test set? How many of each species is in the test set? Are the group sizes balanced? (what does “balanced” mean?)
First, we create a model for discriminating between versicolor and virginica iris species only.
data(iris)
library(MASS)
library(tidyverse)
iris_train_vv <- iris_train %>% filter(Species != "setosa") %>%
mutate(Species=factor(Species))
mod_lda_vv <- lda(Species ~ ., data=iris_train_vv)
The result is an object of class lda
, which is a list.
The element scaling
(mod_lda_vv$scaling
)
contains a matrix with the coefficients for each of the four features
(variables). We now can calculate the values of the LD function for each
of the samples. We use here the predict
function, but we
are not predicting anything – this is still the training set.
iris_train_vv_pred <- predict(mod_lda_vv, iris_train_vv)
iris_train_vv_pred <- iris_train_vv_pred$x[,1]
The element x
of the object returned by
predict
is a matrix with one column containing the LD
values. We can visualize the data with a boxplot:
df <- data.frame(Species=iris_train_vv$Species, LD=iris_train_vv_pred)
ggplot(df, aes(x=Species, y=LD)) + geom_boxplot()
We see that LD is high for virginica and low for versicolor.
Exercise (5min). The coefficients for Sepals in the LD are negative, while the coefficients for Petals are positive. LD is positive for virginica and negative for versicolor. Question: based on this fact, are petals larger in virginica or in versicolor?
We can now make predictions for the test set:
iris_test_vv <- iris_test %>% filter(Species != "setosa") %>%
mutate(Species=factor(Species))
iris_test_vv_pred <- predict(mod_lda_vv, iris_test_vv)
iris_test_vv_pred <- iris_test_vv_pred$x[,1]
df <- cbind(iris_test_vv, LD=iris_test_vv_pred)
ggplot(df, aes(x=Species, y=LD)) + geom_boxplot()
A natural cutoff seems to be for \(LD = 0\). We now will confront the predictions with reality.
pred <- ifelse(iris_test_vv_pred > 0, "virginica", "versicolor")
table(pred, iris_test_vv$Species)
##
## pred versicolor virginica
## versicolor 18 0
## virginica 0 14
Question. What is the overall error rate?
Homework 6, alternative 1. Repeat the above analysis
with the full iris data set (three classes). You will find that the
x
part of the prediction object has now two columns,
LD1
and LD2
. Make boxplots for both of them.
Which LD discriminates between which species? [note: instead of this
homework, you are allowed to choose the “alternative 2” (only if we
don’t do it together during practicals) and “alternative 3” below]
Submit an Rmarkdown file.
The metabo data set contains metabolomic profiles for patients suffering from tuberculosis (TB) and of two classes of healthy controls: LTBI, that is individuals who are healthy, but infected with Mycobacterium tuberculosis, and CTRL, who are healthy and uninfected.
The data set consists of three data frames (which we loaded in the “Prerequisites” section).
Exercise (5m). Inspect the data files and tell me what they are.
First, we will remove all biochemicals without a known name:
sel <- !grepl("^X-[0-9]*$", m_features$BIOCHEMICAL.NAME)
m_features <- m_features[sel, ]
m_dat <- m_dat[sel, ]
We will need a data frame containing the features in columns rather than rows (we use GROUP2 for classification because it only has two classes):
m_df <- data.frame(GROUP=factor(m_samples$GROUP2), t(m_dat))
Here is a PCA plot of the data:
pca <- prcomp(t(m_dat), scale.=TRUE)
df <- data.frame(GROUP=m_samples$GROUP2, pca$x[,1:5])
ggplot(df, aes(x=PC1, y=PC2, color=GROUP)) + geom_point()
As before, we will create a training and test set selection.
set.seed(12345678)
n <- nrow(m_samples) # number of samples
train <- sample(1:n, size=2/3 * n)
test <- setdiff(1:n, train)
In the following, we will be creating various ML models for the data and test their performance.
To understand a decision tree it is best to just visualize it:
library(party)
ct <- ctree(GROUP ~ ., data=m_df[ train, ])
plot(ct)
We will now discuss what is on that tree.
Exercise. (5min) What are the biochemicals which were selected by the algorithm?
Now we can apply the model of the tree to the test data set:
preds <- predict(ct, newdata=m_df[ test, ])
tmp <- table(preds, m_df[ test, ]$GROUP)
print(tmp)
##
## preds HEALTHY TB
## HEALTHY 21 5
## TB 8 12
Remember: columns correspond to the reality and rows to the predictions. Therefore, we interpret the numbers as follows:
Abbr | Name | Description | Count |
---|---|---|---|
TN | true negative | prediction: HEALTHY, reality: HEALTHY | 21 |
FP | false positive | prediction: TB, reality: HEALTHY | 8 |
FN | false negative | prediction: HEALTHY, reality: TB | 5 |
TP | true positive | prediction: TB, reality: TB | 12 |
Exercise. (15’) Calculate the sensitivity, specificity, FPR and FDR. Consult Wikipedia if necessary, specifically the frame on the right hand side.
LOO-CV allows us to take advantage of almost all samples in the data set when constructing a model, while avoiding overfitting.
The package caret
contains many useful features for
modeling and cross-validation, however (i) it will be useful to see how
LOO-CV works under the hood and (ii) the package is not very
straightforward to use. Therefore, we will use our own code. With R
vectorization it is quite simple, really:
test_one <- function(df, i) {
train <- df[-i, ]
test <- df[ i, ]
ct <- ctree(GROUP ~ ., train)
pred <- as.character(predict(ct, newdata=test))
return(pred)
}
## warning! This takes a while
preds <- map_chr(1:nrow(m_df), ~ test_one(m_df, .x))
This results in the following performance:
HEALTHY | TB | |
---|---|---|
HEALTHY | 84 | 7 |
TB | 8 | 39 |
One of the ways to improve the predictions from a decision tree is to combine bootstrapping (using only parts of the data) with growing multiple trees, each tree making its own decision. The grown “forest” of the trees then casts a vote regarding a sample, and the majority wins.
One of the great advantages of random forests is that it delivers out-of-bag (OOB) estimates which are nearly as good as the ones that you get when doing cross-validation. This is why we are allowed to work here with the full data set rather than with the training set.
library(randomForest)
rf <- randomForest(GROUP ~ ., data=m_df, importance=TRUE)
print(rf)
##
## Call:
## randomForest(formula = GROUP ~ ., data = m_df, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 13
##
## OOB estimate of error rate: 6.52%
## Confusion matrix:
## HEALTHY TB class.error
## HEALTHY 91 1 0.01086957
## TB 8 38 0.17391304
Note that in the table above, the reality is in
rows, and predictions are in columns (reverse of what we
were using before). Thus, class.error
corresponds to 1 -
specificity (or FNR) and 1 - sensitivity (or FPR).
We now can check what would happen if we actually tested the model using the data set that we used to train it:
### INCORRECT!!!
preds <- predict(rf, m_df)
table(preds, m_df$GROUP)
##
## preds HEALTHY TB
## HEALTHY 92 0
## TB 0 46
The above is vastly overfit – no errors at al! But this will not work like that in a new data set.
We cannot check individual trees (well, we can, but its complicated) to see which metabolites play a role in the classification, but we can ask what variables are most important overall:
imp <- rf$importance %>% as.data.frame %>%
arrange(-MeanDecreaseAccuracy) %>% head(20)
merge(m_features, imp, by.x="ID", by.y=0) %>%
dplyr::select(ID, BIOCHEMICAL.NAME, SUB.PATHWAY, HEALTHY, TB,
MeanDecreaseAccuracy) %>% knitr::kable()
ID | BIOCHEMICAL.NAME | SUB.PATHWAY | HEALTHY | TB | MeanDecreaseAccuracy |
---|---|---|---|---|---|
ID_12 | pyroglutamine* | Glutamate metabolism | 0.0045641 | 0.0165807 | 0.0083925 |
ID_121 | 3-carboxy-4-methyl-5-propyl-2-furanpropanoate (CMPF) | Fatty acid, furan, dicarboxylate | 0.0050517 | 0.0084333 | 0.0061574 |
ID_139 | 1-myristoylglycerophosphocholine* (C14:0) | Glycerolipid metabolism | 0.0033308 | 0.0064863 | 0.0043195 |
ID_159 | taurochenodeoxycholate | Bile acid metabolism | 0.0016939 | 0.0038122 | 0.0024404 |
ID_169 | cortisol | Sterol/Steroid | 0.0036034 | 0.0091002 | 0.0053039 |
ID_17 | phenylalanine | Phenylalanine & tyrosine metabolism | 0.0103507 | 0.0130885 | 0.0112540 |
ID_171 | citrate | Krebs cycle | 0.0036999 | 0.0110992 | 0.0058717 |
ID_24 | kynurenine | Tryptophan metabolism | 0.0060844 | 0.0064443 | 0.0061928 |
ID_35 | beta-hydroxyisovalerate | Valine, leucine and isoleucine metabolism | 0.0039891 | 0.0035924 | 0.0038450 |
ID_4 | threonine | Glycine, serine and threonine metabolism | 0.0016321 | 0.0162220 | 0.0063656 |
ID_41 | cysteine | Cysteine, methionine, SAM, taurine metabolism | 0.0153378 | 0.0286705 | 0.0199347 |
ID_49 | citrulline | Urea cycle; arginine-, proline-, metabolism | 0.0104126 | 0.0197633 | 0.0133568 |
ID_51 | trans-4-hydroxyproline | Urea cycle; arginine-, proline-, metabolism | 0.0021058 | 0.0056389 | 0.0032808 |
ID_54 | creatine | Creatine metabolism | 0.0041838 | 0.0102168 | 0.0060958 |
ID_59 | cysteine-glutathione disulfide | Glutathione metabolism | 0.0036727 | 0.0117810 | 0.0064059 |
ID_70 | ADSGEGDFXAEGGGVR* (fibrinopeptide) | Polypeptide | 0.0057186 | 0.0110439 | 0.0073084 |
ID_71 | DSGEGDFXAEGGGVR* (fibrinopeptide) | Polypeptide | 0.0028706 | 0.0033082 | 0.0029334 |
ID_73 | HWESASXX* (complement C3 peptide) | Polypeptide | 0.0036394 | 0.0038200 | 0.0038241 |
ID_75 | N-acetylneuraminate | Aminosugars metabolism | 0.0078936 | 0.0186283 | 0.0112736 |
ID_77 | mannose | Fructose, mannose, galactose, starch, and sucrose metabolism | 0.0166245 | 0.0266204 | 0.0197107 |
Fine, but does that make sense? Let us plot some of the molecules.
sel <- c("ID_24", "ID_169", "ID_70", "ID_139")
df <- m_df %>% dplyr::select(all_of(c("GROUP", sel))) %>%
pivot_longer(starts_with("ID"), names_to="ID", values_to="level")
ggplot(df, aes(x=GROUP, y=level))+ geom_boxplot() +
facet_wrap(~ ID, ncol=2)
To plot a ROC curve, we need a numeric predictor and not binary (yes
/ no) predictions. To get these numeric predictor, we can use
predict
with option prob=TRUE
(this works for
other algorithms as well!). However we cannot do that for
random forest: predictions would be overfit! We need the OOB
estimates of the probabilities. These are stored in the rf object:
probs <- rf$votes
head(probs)
## HEALTHY TB
## MPIB_00173 0.7526882 0.24731183
## MPIB_00174 0.9365079 0.06349206
## MPIB_00175 0.8500000 0.15000000
## MPIB_00176 0.9142857 0.08571429
## MPIB_00177 0.8743169 0.12568306
## MPIB_00178 0.8655914 0.13440860
The two columns indicate the percentage of votes each class got from
the trees, so 0.124
in column “HEALTHY” indicates that 12%
of the trees decided that this given sample comes from a healthy person
(and 88% trees decided that this is a TB patient).
probs_df <- as.data.frame(probs) %>% merge(m_samples, by.x=0, by.y="SAMPLE_NAME") %>% arrange(HEALTHY)
head(probs_df)
## Row.names HEALTHY TB GENDER GROUP1 GROUP2
## 1 MPIB_00308 0.1032609 0.8967391 M TB TB
## 2 MPIB_00290 0.1473684 0.8526316 F TB TB
## 3 MPIB_00277 0.1505376 0.8494624 F TB TB
## 4 MPIB_00301 0.1587302 0.8412698 M TB TB
## 5 MPIB_00304 0.1739130 0.8260870 M TB TB
## 6 MPIB_00285 0.1746032 0.8253968 F TB TB
To plot a ROC curve, we use roc
from package
pROC
. We need to supply the levels
argument to
tell pROC which one is the positive and which one is the negative:
library(pROC)
roc(response=m_df$GROUP, predictor=probs[,1],
levels=c("HEALTHY", "TB"), plot=TRUE)
##
## Call:
## roc.default(response = m_df$GROUP, predictor = probs[, 1], levels = c("HEALTHY", "TB"), plot = TRUE)
##
## Data: probs[, 1] in 92 controls (m_df$GROUP HEALTHY) > 46 cases (m_df$GROUP TB).
## Area under the curve: 0.9889
We can also calculate the area under the curve (AUC) and its confidence intervals:
roc(response=m_df$GROUP, predictor=probs[,1],
levels=c("HEALTHY", "TB"), ci=TRUE)
##
## Call:
## roc.default(response = m_df$GROUP, predictor = probs[, 1], levels = c("HEALTHY", "TB"), ci = TRUE)
##
## Data: probs[, 1] in 92 controls (m_df$GROUP HEALTHY) > 46 cases (m_df$GROUP TB).
## Area under the curve: 0.9889
## 95% CI: 0.9746-1 (DeLong)
We can compare it with a roc curve from the decision tree model. However, for this we need a slight modification of our LOO-CV function:
test_one <- function(df, i) {
train <- df[-i, ]
test <- df[ i, ]
ct <- ctree(GROUP ~ ., train)
pred <- predict(ct, newdata=test, type="prob")[[1]][1]
return(pred)
}
## warning! This takes a while
ct_probs <- map_dbl(1:nrow(m_df), ~ test_one(m_df, .x))
And plot the ROC curve:
roc(response=m_df$GROUP, predictor=ct_probs,
plot=TRUE,
levels=c("HEALTHY", "TB"))
##
## Call:
## roc.default(response = m_df$GROUP, predictor = ct_probs, levels = c("HEALTHY", "TB"), plot = TRUE)
##
## Data: ct_probs in 92 controls (m_df$GROUP HEALTHY) > 46 cases (m_df$GROUP TB).
## Area under the curve: 0.8689
Exercise – optional (if we find time). Otherwise, Homework 6, alternative 2 (so if we don’t manage to run it during practicals together, you can do it as a homework). Use the metabo dataset and random forests, but instead of “GROUP” take “SEX” as predictor. Which biochemicals found in blood are indicative of biological sex? Submit an Rmarkdown file.
Homework 6, alternative 3. Another method of machine
learning are the Support Vector Machines (learn more here).
An SVM algorithm is implemented in the function svm
of the
package e1071
. Train svm
on the metabolomic
training data set, test it on the testing data set (using
predict
), calculate the sensitivity and specificity. Submit
an Rmarkdown file.