Practicals 06: Sequence reconstruction with R

Installation

First, we need to install a few packages. We will also install BiocManager (Bioconductor package manager), because we will need it during following practicals.

install.packages(c("phangorn", "seqinr", "BiocManager", "ape"))
install.packages("BiocManager")
BiocManager::install("msa")

This will take a moment, during which we will talk about what the plan for the practicals is.

Loading the multiple sequence alignment

It is possible, of course, to create multiple sequence alignments using R (e.g. with the msa package from Bioconductor). However, it is also possible to read in an existing MSA in the phangorn package. Here, we will start with an amino acid alignment.

As you can see in the script that follows, you can supply an URL to R and it will work just like a file. However, normally you could just use a regular file name.

We will start with our beloved hemoglobin example.

library(phangorn)
library(seqinr)
url <- "https://january3.github.io/Bioinformatics/Practicals/practicals_02/data/hba.aln"
msa <- read.alignment(url, format="clustal")

Take a look at the msa object. It is a list which contains a few elements. You can access them using the $ operator, for example msa$nam. What is stored in these elements?

The msa object has been created by the seqinr package and we need to modify it a bit before proceeding. We will use the gsub function to remove the prefix HBA_ or HBA2_ (and everything that comes before that).

## remove the boring identifiers from the sequence names,
## so "sp|P01942|HBA_MOUSE" becomes "MOUSE". Don't do it unless all
## proteins are HBA_something or HBA2_something
msa$nam <- gsub(".*(HBA_|HBA2_)", "", msa$nam)

## convert to phangorn package object
msa <- as.phyDat(msa, type="AA")

The msa object holds now the multiple sequence alignment. We had to convert it to a data typ that is suitable for the phangorn package, though.

We will now use the dist.ml function from the phangorn package to calculate pairwise distances between the sequences. We will use the Blosum62 matrix to calculate the distances:

dist_mat <- dist.ml(msa, model="Blosum62")
dist_mat

##           BOVIN     HORSE     HUMAN     PANTR     MOUSE     BALAC     CHICK     XENTR     VARAL     POGSC
## HORSE 0.1323650                                                                                          
## HUMAN 0.1278715 0.1308344                                                                                
## PANTR 0.1278715 0.1308344 0.0000000                                                                      
## MOUSE 0.1465878 0.1763106 0.1514488 0.1514488                                                            
## BALAC 0.2198872 0.2292613 0.1849926 0.1849926 0.1975120                                                  
## CHICK 0.3498572 0.3661853 0.3598233 0.3598233 0.3497097 0.4051394                                        
## XENTR 0.5908145 0.5591056 0.5763826 0.5763826 0.5625820 0.5768373 0.5667723                              
## VARAL 0.6105375 0.5969209 0.5808340 0.5808340 0.6638261 0.6292339 0.6592822 0.7607134                    
## POGSC 0.6995809 0.6310327 0.6982297 0.6982297 0.6805210 0.7007189 0.7053563 0.8525964 0.7861120          
## GADMO 0.7844204 0.7747574 0.7689802 0.7689802 0.7629719 0.8111820 0.8182609 1.0256332 0.9962033 0.5327656

Note: For a more advanced analysis, one would first spent quite some time deciding which model to chose. This is especially important for DNA sequences.

We can make a plot visualizing this distance matrix, first converting it to regular square matrix, and then plotting a heatmap:

mtx <- as.matrix(dist_mat)
heatmap(mtx, symm=TRUE)

Exercise 1.

What does symm=TRUE do? (Hint: ?heatmap)
Which two species seem to be closest? Take a look at the mtx and check which distances is smallest.
What is BALAC? (hint: the protein name is HBA_BALAC…)

Reconstructing phylogenies

We now can do the first phylogenetic reconstruction using neighbor joining.

nj_tree <- NJ(dist_mat)

That wasn’t hard. Now, we plot the tree.

## split the window in two
par(mfrow=c(1,2))

plot(nj_tree)
plot(nj_tree, type="unrooted")

Let us see the tree in Newick format as well:

write.tree(nj_tree)

## [1] "(HORSE:0.07082734265,BOVIN:0.06153761293,(((((((POGSC:0.2075487809,GADMO:0.325216782):0.2422300483,VARAL:0.3825448204):0.03135412787,XENTR:0.3729811285):0.02299833482,CHICK:0.1907034805):0.08093428274,BALAC:0.1196633273):0.007489082707,MOUSE:0.07565185031):0.01529600895,(HUMAN:0,PANTR:0):0.05720236637):0.005968134133);"

Exercise 2.

Look up the help for the function plot.phylo which is called when you plot a tree. What are the other types of trees? View them all.

The tree above sucks: clearly, GADMO (Gadus morhua, atlantic cod) and POGSC (Saddleback, also a fish) are the outgroup here. We need to root the tree correctly.

nj_tree <- root(nj_tree, outgroup=c("GADMO", "POGSC"))
plot(nj_tree)

Exercise 3.

Instead of the function NJ, use the function upgma() to create a tree. Compare it with the neighbor-joining tree. What do you notice?

Bootstrapping a tree

Bootstrapping is an important technique to understand which parts of the tree are robust, and which are somewhat shaky. Basically, 100 or 1000 trees are computed, but each time a random subset of the alignment columns is used. If a part of a tree is solid, this will have no influence; but othertimes, selecting a subset of columns will change the phylogeny.

Generating bootstrap values may look like that:

## set up the function to calculate phylogenies
make_phylo <- function(msa) {
  tree <- NJ(dist.ml(msa, model="Blosum62"))
}

boot <- bootstrap.phyDat(msa, make_phylo)

The boot object contains now a 100 of different phylogenetic trees. We can plot some of them.

## the command below allows to put six plots 
## in two rows and three columns on one figure
par(mfrow=c(2,3))

for(i in 1:6) {
  plot(boot[[i]])
}

Can you see the differences?

We can now plot the bootstrap values on the original tree. The numbers indicate how often, among the 100 bootstrap replicates, a given group was found. We use the function plotBS to generate a tree which contains bootstrap values as node labels:

nj_tree_bs <- plotBS(nj_tree, boot, type="none")
plot(nj_tree_bs, show.node.label=TRUE)

A slightly more readable figure can be achieved as follows:

plot(nj_tree_bs)
nodelabels(nj_tree_bs$node.label)

Take a look on how the tree looks like in the Newick format:

write.tree(nj_tree_bs)

Using maximum likelihood for phylogeny reconstruction

As noted, ML methods do not actually reconstruct a tree per se. Rather, they are able to provide a measure called likelihood which allows us to decide which tree is better.

We can now calculate the likelihood function and the AIC (Akaike Information Criterion) on the NJ tree we have constructed. Note that we must use the msa_phydat object:

nj_tree_pml <- pml(nj_tree, msa, model="Blosum62")
nj_tree_pml

## model: Blosum62 
## loglikelihood: -1889.648 
## unconstrained loglikelihood: -670.3362 
## Rate matrix: Blosum62

AIC(nj_tree_pml)

## [1] 3817.296

Note that nj_tree_pml is no longer an object of class phylo: pml added extra information to that tree. The actual tree is stored in nj_tree_pml$tree. Yeah, I know it is confusing. Welcome to R and bioinformatics!

AIC, Akaike Information Criterion is a measure which can be used to optimize a model; it represents the compromise between the number of parameters and the goodness of fit:

\[AIC \stackrel{\text{def}}{=} 2\cdot k - 2\cdot \log{L}\]

Where $L$ is the log-likelihood function and k is the number of parameters.

The lower AIC, the better.

However, this is still our boring NJ-generated tree. We can try to find a better one, using some heuristics which starts with an initial tree and then tries to improve it:

ml_tree <- optim.pml(nj_tree_pml, model="JTT", 
                     rearrangement="NNI")
ml_tree
AIC(ml_tree)

par(mfrow=c(1,2))
plot(nj_tree)
plot(ml_tree)

What differences do you see?

Exercise 4.

There are various models (both for nucleic acids and for protein sequences). For example, for amino acid sequences there are the models Blosum62, Jones-Taylor-Thornton (JTT), Dayhoff (of the PAM matrix fame), Whelan and Goldman (WAG). Try all of them; which of them fits the data best? (the higher the log likelihood, the better; the lower the AIC, the better). You can find out what all the models are via ?optim.pml.
Plot all the trees you have created in the previous point (WAG, JTT, Dayhoff, Blosum62) and compare them to the NJ tree.

Bootstrapping the ML tree (optional)

We are going to use a bootstrapping function from the phangorn package. The plotBS function not only plots the tree, it also returns a tree with the node labels.

boot_ml <- bootstrap.pml(ml_tree)
boot_tree <- plotBS(ml_tree$tree, boot_ml, type="none")
plot(boot_tree)
nodelabels(boot_tree$node.label)

As you can see, ML is pretty sure about its result – all bootstrap values are 100! However, it is also wrong, since horse should be in one group with cattle.

Homework 4.

Use the data from https://january3.github.io/Bioinformatics/Practicals/practicals_02/data/globin.aln to reconstruct phylogeny using a suitable amino acid substitution model. As outgroups, use the sequence of myoglobin (MYG_HUMAN). Just like in case of homework 3, you are expected to provide an Rmd file which can be compiled without errors.