class: center, middle, inverse, title-slide

.title[
# Lecture 10: Machine learning
]
.subtitle[
## BE_22 Bioinformatics SS 21
]
.author[
### January Weiner
]
.date[
### 2024-06-24
]

---

## What is machine learning?

---
class:empty-slide,myinverse
background-image:url(images/person-statistics-machine-learnine-machine-learning-sand.jpeg)

---

.pull-left[
### Unsupervised machine learning:✪

 * algorithm
 * data set
 * no "training data"
 * no classes / results to be predicted
 * validation data set
]

.pull-right[
### Supervised machine learning:✪

 * algorithm
 * training data set
 * classes / results to be predicted
 * test data set
 * validation data set
]

---

## Classification vs regression problems ✪

.pull-left[
### Classification

 * discrete output
 * e.g. "is this a cat or a dog?"
 * e.g. "is this a TB patient or a healthy individual?"
]

.pull-right[
### Regression

 * continuous output
 * e.g. "what is the age of this person?"
 * e.g. "what is the concentration of this metabolite?"
]

---
class:empty-slide,myinverse
background-image:url(images/desert2.jpg)

.mytop[
The number of grains of sand in this picture accurately represents the number of machine learning algorithms.
]

---

## Example: simple machine learning using PCA

The idea:

 * select only versicolor and virginica from the iris data set
 * calculate a PCA from the iris data set
 * determine a threshold value on PC1
 * see what the results are

---

.pull-left[

```r
set.seed(12345)
library(ggplot2)
library(tidyverse)
theme_set(theme_bw())
data(iris)
iris <- iris %>%
  filter(Species != "setosa") %>%
  mutate(Species=factor(Species))
pca <- prcomp(iris[,1:4], scale.=TRUE)
df <- cbind(iris, pca$x)
ggplot(df, aes(x=PC1, y=PC2, color=Species)) +
  geom_point()
```

 * This is not supervised learning yet
 * PCA by itself is unsupervised learning
 * However, we can use it to discriminate between two classes - we need to define a cutoff value
]

.pull-right[
<!-- -->
]

---

.mycenter[Warning, what follows is incorrect!]

---

## INCORRECT!

.pull-left[
To discriminate, define a cutoff between the two groups.

```r
cutoff <- 0.08
ml_guess <- ifelse(df$PC1 < cutoff, "virginica", "versicolor")
tmp <- table(ml_guess, df$Species)
print(tmp)
```

<pre class="r-output"><code>##             
## ml_guess     versicolor virginica
##   versicolor         41         9
##   virginica           9        41
</code></pre>

Error rate = `\((9 + 9)/100 = 18.0\%\)`
]

.pull-right[
<!-- -->
]

---

## Why is that incorrect?

 * We train the model (PCA) on the full data set
 * We *test* the model on the *same* data set
 * Evaluating performance on the data that was used to build the model leads to *overfitting*

---

## Correct approach

Select training and testing sets:

.pull-left[

```r
set.seed(1234)

# training set
train <- sample(1:nrow(iris), size = nrow(iris) * 2/3)

# remainder is test set
test <- setdiff(1:nrow(iris), train)

model <- prcomp(iris[train, 1:4], scale.=TRUE)
df_train <- cbind(iris[train, ], model$x)
cutoff <- -0.1 ## just by looking!
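# A possible data-driven alternative (a sketch, not used in what follows):
# instead of eyeballing the plot, take the midpoint between the two
# species' mean PC1 values in the training set.
cutoff_auto <- mean(tapply(df_train$PC1, df_train$Species, mean))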
```
]

.pull-right[
Results in training set:

```r
train_pred <- ifelse(df_train$PC1 > cutoff, "versicolor", "virginica")
tmp <- table(train_pred, df_train$Species)
print(tmp)
```

<pre class="r-output"><code>##             
## train_pred   versicolor virginica
##   versicolor         29         7
##   virginica           5        25
</code></pre>

Error rate = `\((7 + 5)/66 = 18.2\%\)`
]

---

Now check in the test set:

```r
test_pred_df <- data.frame(predict(model, iris[test, 1:4]))
test_pred <- ifelse(test_pred_df$PC1 > cutoff, "versicolor", "virginica")
tmp <- table(test_pred, iris[test, ]$Species)
print(tmp)
```

<pre class="r-output"><code>##             
## test_pred    versicolor virginica
##   versicolor         13         5
##   virginica           3        13
</code></pre>

Error rate = `\((5 + 3)/34 = 23.5\%\)`

---

## Positive vs negative: Another example

.pull-left[
Metabolomic data set to discriminate between tuberculosis (TB) patients and healthy individuals.
]

.pull-right[
<!-- -->
]

.myfootnote[Weiner 3rd, January, et al. "Biomarkers of inflammation, immunosuppression and stress are revealed by metabolomic profiling of tuberculosis patients." PloS one 7.7 (2012): e40221.]

---

## Linear discriminant analysis

Idea: create a function which is a *linear combination* of the variables (in our case, of the principal components 1 and 2). That is,

`$$f_{LD} = b_1 \cdot PC_1 + b_2 \cdot PC_2$$`

The coefficients are constructed in such a way that they maximize the differences between the groups on this function. In other words, the higher the value of the function, the higher the likelihood that the given sample belongs to a certain class.

`$$f_{LD} = -0.18 \cdot PC_1 + 0.21 \cdot PC_2$$`

Note that the only reason we are running it on the PC variables is to have a simple situation. Normally, you can run the LDA directly on the variables (e.g. the petal and sepal measurements from the iris data set). In the original paper, Fisher applied it to the iris data set.

---

`$$f_{LD} = -0.18 \cdot PC_1 + 0.21 \cdot PC_2$$`

<!-- -->

---

## Bottom line

 * The machine learning algorithm generates a function of the data
 * The function gives a measure of the likelihood that a certain data point belongs to one of the two classes
 * We choose a threshold to determine which of the two classes this data point belongs to

<!-- -->

---

For each of several possible cutoff values on the LD function, we can calculate the error rates. To this end, we only use the **test data!** (46 samples)

```r
mtb_test <- mtb_pca[test,]
mtrain_pred <- predict(mod_lda, mtb_test)
mtb_test$LD <- mtrain_pred$x[,1]
```

.pull-left[
**LD > -1**:

<pre class="r-output"><code>##          
## pr        HEALTHY TB
##   HEALTHY       9  0
##   TB           22 15
</code></pre>

Error rate = 22 / 46 = 48 %

**LD > 0**:

<pre class="r-output"><code>##          
## pr        HEALTHY TB
##   HEALTHY      24  1
##   TB            7 14
</code></pre>

Error rate = 8 / 46 = 17 %
]

.pull-right[
**LD > 1**:

<pre class="r-output"><code>##          
## pr        HEALTHY TB
##   HEALTHY      29  5
##   TB            2 10
</code></pre>

Error rate = 7 / 46 = 15 %

**LD > 2**:

<pre class="r-output"><code>##          
## pr        HEALTHY TB
##   HEALTHY      31 10
##   TB            0  5
</code></pre>

Error rate = 10 / 46 = 22 %
]

---

## FP, FN, TP, TN✪

.pull-left[
**LD > 0**:

<pre class="r-output"><code>##          
## pr        HEALTHY TB
##   HEALTHY      24  1
##   TB            7 14
</code></pre>
]

.pull-right[
Four important abbreviations:

 * **TP** (14) - true positive. These are patients correctly classified as patients.
 * **TN** (24) - true negative. These are healthy individuals correctly classified as healthy.
 * **FP** (7) - false positive. These are healthy individuals incorrectly classified as patients.
 * **FN** (1) - false negative. These are patients incorrectly classified as healthy.
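
A minimal R sketch (not part of the original code) of how these counts can be read off a 2×2 confusion matrix laid out as above, with predictions in rows and the true classes in columns:

```r
# hypothetical confusion matrix with the counts shown above (LD > 0)
cm <- matrix(c(24, 7, 1, 14), nrow = 2,
             dimnames = list(pred = c("HEALTHY", "TB"),
                             truth = c("HEALTHY", "TB")))
TN <- cm["HEALTHY", "HEALTHY"]   # 24
FP <- cm["TB",      "HEALTHY"]   #  7
FN <- cm["HEALTHY", "TB"]        #  1
TP <- cm["TB",      "TB"]        # 14
```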
Total number of patients: `\(P = TP + FN = 14 + 1 = 15\)`

Total number of healthy individuals: `\(N = TN + FP = 24 + 7 = 31\)`
]

---

## Let us take a closer look at LD > 1

**LD > 1**:

<pre class="r-output"><code>##          
## pr        HEALTHY TB
##   HEALTHY      29  5
##   TB            2 10
</code></pre>

Reality is in columns, predictions in rows. There are 2 false positives (FP) and 5 false negatives (FN). At the same time, there are 29 true negatives (TN) and 10 true positives (TP).

--

We can calculate a whole lot of different abbreviations with that! FPR, TNR, PPV, NPV, FNR, FDR, FOR, PT, TS, ...

(The graphic on the next slide comes from [Wikipedia](https://en.wikipedia.org/wiki/Positive_and_negative_predictive_values))

---
class:empty-slide,mywhite
background-image:url(images/tlas.png)

---

## Let us take a closer look at LD > 1✪

**LD > 1**:

<pre class="r-output"><code>##          
## pr        HEALTHY TB
##   HEALTHY      29  5
##   TB            2 10
</code></pre>

There are 2 false positives (FP) and 5 false negatives (FN). At the same time, there are 29 true negatives (TN) and 10 true positives (TP).

FPR✪ describes the proportion of healthy individuals that have been misclassified:

`$$FPR \stackrel{\text{def}}{=} \frac{FP}{FP + TN}$$`

In our case, FPR = 2 / 31 = 6 %

--

Similarly, FNR✪ is the false negative rate and describes the proportion of tuberculosis patients that have been classified as healthy.

`$$FNR \stackrel{\text{def}}{=} \frac{FN}{FN + TP}$$`

In our case, FNR = 5 / 15 = 33 %

---

## Sensitivity and specificity✪

.pull-left[
**LD > -1**:

<pre class="r-output"><code>##          
## pr        HEALTHY TB
##   HEALTHY       9  0
##   TB           22 15
</code></pre>

All TB patients have been classified as positive! Very sensitive. However, specificity is poor: 22 of the healthy individuals have been classified as TB.

**LD > 0**:

<pre class="r-output"><code>##          
## pr        HEALTHY TB
##   HEALTHY      24  1
##   TB            7 14
</code></pre>

Slightly less sensitive, but way more specific.
]

.pull-right[
**LD > 1**:

<pre class="r-output"><code>##          
## pr        HEALTHY TB
##   HEALTHY      29  5
##   TB            2 10
</code></pre>

Way less sensitive, but much more specific.

**LD > 2**:

<pre class="r-output"><code>##          
## pr        HEALTHY TB
##   HEALTHY      31 10
##   TB            0  5
</code></pre>

Best specificity so far (no FPs at all!), but really poor sensitivity.
]

---

## Formal definitions✪

.pull-left[
### Sensitivity

 * recall, TPR, probability of detection, power

`$$SEN = \frac{TP}{TP + FN} = \frac{TP}{P} = 1 - FNR$$`

(where P is the total number of real patients, detected and undetected)

This is the fraction of all the real patients that we were able to detect. Low sensitivity `\(\rightarrow\)` many patients incorrectly classified as healthy individuals.
]

.pull-right[
### Specificity

 * TNR, selectivity

`$$SPC = \frac{TN}{TN + FP} = \frac{TN}{N} = 1 - FPR$$`

(where N is the total number of healthy individuals, correctly and incorrectly classified)

This is the fraction of all healthy individuals that we recognized as healthy. Low specificity `\(\rightarrow\)` many healthy individuals incorrectly classified as patients.
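
A small sketch (not part of the original code, using a hypothetical helper and the LD > 1 counts from the previous slides) of how both quantities follow from the four counts:

```r
# hypothetical helper: sensitivity and specificity from the four counts
sens_spec <- function(TP, FN, TN, FP) {
  c(sensitivity = TP / (TP + FN),   # fraction of real patients detected
    specificity = TN / (TN + FP))   # fraction of healthy recognized as healthy
}
sens_spec(TP = 10, FN = 5, TN = 29, FP = 2)   # counts for LD > 1
```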
]

---

## What if we plot SPC vs SEN for our data?✪

<!-- -->

---

## Receiver operating characteristic (ROC) curves✪

.pull-left[
<!-- -->
]

.pull-right[
*Area under the curve*, AUC = 0.94 [95% CI 0.87-1.00]
]

---

## Some other important TLAs (three-letter acronyms)

 * Prevalence: `\(\frac{TP + FN}{TP + FN + TN + FP} = \frac{P}{P + N}\)`
 * Positive predictive value (PPV), precision: `\(\frac{TP}{TP + FP}\)`
 * False discovery rate (FDR): `\(\frac{FP}{TP + FP} = 1 - PPV\)`
 * Negative predictive value (NPV): `\(\frac{TN}{TN + FN}\)`

Where

 * FN = false negative
 * FP = false positive
 * TN = true negative
 * TP = true positive

---

## PPV and NPV✪

.pull-left[
### PPV - positive predictive value, precision

 * probability that a patient is really a patient given that the test is positive
 * Same as `\(1 - FDR\)`
 * `\(\frac{TP}{TP + FP}\)`
 * in our case (LD > 1), PPV = 10 / 12 = 83 %
]

.pull-right[
### NPV - negative predictive value

 * probability that a healthy individual is really healthy given that the test is negative
 * `\(\frac{TN}{TN + FN}\)`
 * in our case (LD > 1), NPV = 29 / 34 = 85 %
]

PPV and NPV are more useful than sensitivity and specificity in clinical practice, but...

---

## PPV and NPV depend on prevalence✪

If we are to get *real* probabilities that a given person with a positive test result really is sick, then we cannot use the test setting, in which the prevalence of the disease is artificially set to 50%. Instead, we need to use the *actual* prevalence of the disease in the population.

.pull-left[
`\(PPV = \frac{sensitivity \times prevalence}{sensitivity \times prevalence + (1 - specificity) \times (1 - prevalence)}\)`

 * If prevalence is high, PPV is high
 * The lower the prevalence, the lower the PPV
]

.pull-right[
`\(NPV = \frac{specificity \times (1 - prevalence)}{specificity \times (1 - prevalence) + (1 - sensitivity) \times prevalence}\)`

 * If prevalence is high, NPV is low
 * The lower the prevalence, the higher the NPV
]

---

## Cross-validation✪

Splitting the data set into a training and a test set:

 * we cannot evaluate predictions on the same data that was used to construct the model (**overfitting**)
 * the training set is small, so learning efficiency takes a hit
 * the test set is small, so the error estimates are imprecise (wide confidence intervals)

Solution: cross-validation

---

## LOO (leave-one-out)✪

```
# pseudocode
for i in samples:
  train      := samples[ -i ]
  test       := samples[ i ]
  model      := algorithm(train)
  prediction := model(test)
  record(i, prediction)
```

In each iteration, we take out *one* sample. We then use all the remaining samples to create the model, apply it to the sample we left out (which is our test), and record that prediction. After all samples are through, we have as many predictions as we have samples, but we avoided overfitting.

---

## K-fold cross-validation✪

Pseudocode:

```
folds := split(data, 10)
for i in 1:10:
  train      := folds[-i]
  test       := folds[i]
  model      := algorithm(train)
  prediction := model(test)
  record(i, prediction)
```

We split the data into e.g. 10 subsets ("folds"). For each fold, we treat it as a test set and use all the remaining folds to create a model. We then apply the model to the test fold and record the predictions. After all folds have been analysed, we have predictions for all samples without overfitting (a runnable sketch follows on the next slide).
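
---

## K-fold cross-validation: an example in R

A minimal, runnable sketch of the k-fold idea (assuming the two-species `iris` subset from the earlier slides; using `MASS::lda` as the classifier and 10 folds are illustrative choices, not part of the original code):

```r
library(MASS)   # for lda()

set.seed(1)
k     <- 10
folds <- sample(rep(1:k, length.out = nrow(iris)))    # random fold assignment
pred  <- character(nrow(iris))

for (i in 1:k) {
  test_i <- which(folds == i)                         # fold i is the test set
  fit    <- lda(Species ~ ., data = iris[-test_i, ])  # train on the other folds
  pred[test_i] <- as.character(predict(fit, iris[test_i, ])$class)
}

table(pred, iris$Species)   # a prediction for every sample, without overfitting
```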
---

## Bias-variance tradeoff✪

Two important concepts:

 * **bias** results from the model being inappropriate for the data (or too simple) – that is, the model is underfitting the data
 * **variance** results from the model being too close to the training data (or too complex) – that is, the model is overfitting the data

---

## Bias-variance tradeoff

<!-- -->

---

## Bias-variance tradeoff

<!-- -->

---
class:empty-slide,mywhite
background-image:url(images/machine_learning.png)

---

## Some notable (useful) simple algorithms for classification

 * **decision trees** - a tree-like structure where each node is a decision based on a feature, and each leaf is a class
 * **random forests** - a collection of decision trees, each trained on a different subset of the data
 * **support vector machines** (SVMs) - find a hyperplane that separates the classes with the largest margin
 * **naive Bayes** - a probabilistic classifier based on Bayes' theorem
 * **logistic regression** - a regression model for binary outcomes
 * **linear discriminant analysis** - classification based on a linear combination of the features
 * **neural networks** - layered models that can learn complex patterns in the data

This is useful to know!

---

## Decision trees

.pull-left[
<!-- -->
]

*Maertzdorf J, McEwen G, Weiner 3rd J, Tian S, Lader E, Schriek U, Mayanja‐Kizza H, Ota M, Kenneth J, Kaufmann SH. Concise gene signature for point‐of‐care classification of tuberculosis. EMBO molecular medicine. 2016 Feb;8(2):86-95.*

---

## How does ChatGPT work?

--

 * ChatGPT is an interface to a GPT (3.5, 4, ...) model
 * GPT stands for generative pre-trained transformer
 * which is a type of deep learning network
 * which in turn is a type of neural network

---

## Neural networks

.pull-left[
<!-- -->
]

.pull-right[
How it works:

 * each node takes as its value the (weighted) sum of its incoming signals
 * if this sum exceeds a threshold, the node is "activated"
 * an activated node emits signals to the nodes it is connected to
]

---

## Neural networks

<!-- -->

.myfootnote[Amit, Zarola, and Sil Arjun. "Quantification of recent seismicity and a back propagation Neural Network for forecasting of earthquake magnitude in Northeast Region of India." Disaster Advances 10.6 (2017): 17-34.]

---

## How to train your ~~dragon~~ neural network

**Backpropagation**:

 * start with some (e.g. random) weights
 * propagate the signal, calculate the error
 * calculate the derivative of the error with respect to the weights
 * adjust the weights accordingly
 * rinse and repeat

---

## Deep learning: basically complex neural networks✪

<!-- -->

---

## What does DL have that others haven't?

 * automatic feature extraction
 * learning complex patterns in the data
 * low error rates
 * unsupervised learning (autoencoders, generative adversarial networks - GANs)
   * autoencoders
   * GANs: first, train a network that can differentiate between failure and success (e.g. an image that looks like a photograph and an image that doesn't). Then, train another network to generate images, and use the first network to train the second.
 * however: hard to interpret the weights (why does it do what it does? the "black box" problem)

---

## Large language models / GPT

 * Basically, a DL network that predicts one and only one thing: which word to choose next, based on the current sequence of words (the input)
 * however, it uses the weird (and quite complex) transformer architecture, built around a "self-attention" mechanism (paying more attention to some parts of the input sequence than to others)
 * the network has an engineered architecture, but apart from that, everything is due to training on a large data set, BUT
 * when the raw (freshly trained) network starts generating responses, humans evaluate the results. Then, another network is trained to predict these evaluations, and *this* network is used to train a better GPT
 * At any given stage of the "conversation", the network assigns probabilities to a number of words, and chooses (randomly) from the words with the highest probabilities.

.myfootnote[A nice intro to LLMs: [https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-doing-and-why-does-it-work/](https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-doing-and-why-does-it-work/)]

---

## Training and learning

Once the model is trained, the actual "chat" is a relatively simple procedure. The model is, in essence, a huge collection of learned weights which, given the previous words, assigns a probability to each possible next word; the response is generated by choosing the next word based on these probabilities (a toy sketch of this step is shown below).

In fact, while the training requires inordinate amounts of computational power, the actual "chat" is relatively simple, and can even be run on a smartphone! If you want to try, check out [Ollama](https://ollama.com/). While you cannot download the models behind ChatGPT (from GPT-3 onwards), there are many other models you can try.
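
As a toy illustration of the sampling step described above (the words and probabilities below are made up, not taken from any real model):

```r
# toy "next word" distribution: made-up probabilities for a few candidate words
probs <- c(cat = 0.40, dog = 0.25, sat = 0.18, ran = 0.15, banana = 0.02)

# keep only the k most probable words ("top-k") and sample one of them,
# weighted by its probability
k     <- 3
top_k <- sort(probs, decreasing = TRUE)[1:k]
sample(names(top_k), size = 1, prob = top_k)
```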