First, we need to install a few packages. We will also install BiocManager (Bioconductor package manager), because we will need it during following practicals.
install.packages(c("phangorn", "seqinr", "BiocManager", "ape"))
install.packages("BiocManager")
BiocManager::install("msa")
This will take a moment, during which we will talk about what the plan for the practicals is.
It is possible, of course, to create multiple sequence alignments
using R (e.g. with the msa
package from Bioconductor).
However, it is also possible to read in an existing MSA in the phangorn
package. Here, we will start with an amino acid alignment.
As you can see in the script that follows, you can supply an URL to R and it will work just like a file. However, normally you could just use a regular file name.
We will start with our beloved hemoglobin example.
library(phangorn)
library(seqinr)
url <- "https://january3.github.io/Bioinformatics/Practicals/practicals_02/data/hba.aln"
msa <- read.alignment(url, format="clustal")
Take a look at the msa
object. It is a list which
contains a few elements. You can access them using the $
operator, for example msa$nam
. What is stored in these
elements?
The msa
object has been created by the seqinr package
and we need to modify it a bit before proceeding. We will use the
gsub
function to remove the prefix HBA_
or
HBA2_
(and everything that comes before that).
## remove the boring identifiers from the sequence names,
## so "sp|P01942|HBA_MOUSE" becomes "MOUSE". Don't do it unless all
## proteins are HBA_something or HBA2_something
msa$nam <- gsub(".*(HBA_|HBA2_)", "", msa$nam)
## convert to phangorn package object
msa <- as.phyDat(msa, type="AA")
The msa
object holds now the multiple sequence
alignment. We had to convert it to a data typ that is suitable for the
phangorn
package, though.
We will now use the dist.ml
function from the
phangorn
package to calculate pairwise distances between
the sequences. We will use the Blosum62 matrix to calculate the
distances:
dist_mat <- dist.ml(msa, model="Blosum62")
dist_mat
## BOVIN HORSE HUMAN PANTR MOUSE BALAC CHICK XENTR VARAL POGSC
## HORSE 0.1323650
## HUMAN 0.1278715 0.1308344
## PANTR 0.1278715 0.1308344 0.0000000
## MOUSE 0.1465878 0.1763106 0.1514488 0.1514488
## BALAC 0.2198872 0.2292613 0.1849926 0.1849926 0.1975120
## CHICK 0.3498572 0.3661853 0.3598233 0.3598233 0.3497097 0.4051394
## XENTR 0.5908145 0.5591056 0.5763826 0.5763826 0.5625820 0.5768373 0.5667723
## VARAL 0.6105375 0.5969209 0.5808340 0.5808340 0.6638261 0.6292339 0.6592822 0.7607134
## POGSC 0.6995809 0.6310327 0.6982297 0.6982297 0.6805210 0.7007189 0.7053563 0.8525964 0.7861120
## GADMO 0.7844204 0.7747574 0.7689802 0.7689802 0.7629719 0.8111820 0.8182609 1.0256332 0.9962033 0.5327656
Note: For a more advanced analysis, one would first spent quite some time deciding which model to chose. This is especially important for DNA sequences.
We can make a plot visualizing this distance matrix, first converting it to regular square matrix, and then plotting a heatmap:
mtx <- as.matrix(dist_mat)
heatmap(mtx, symm=TRUE)
Exercise 1.
symm=TRUE
do? (Hint:
?heatmap
)mtx
and check which distances is smallest.HBA_BALAC
…)We now can do the first phylogenetic reconstruction using neighbor joining.
nj_tree <- NJ(dist_mat)
That wasn’t hard. Now, we plot the tree.
## split the window in two
par(mfrow=c(1,2))
plot(nj_tree)
plot(nj_tree, type="unrooted")
Let us see the tree in Newick format as well:
write.tree(nj_tree)
## [1] "(HORSE:0.07082734265,BOVIN:0.06153761293,(((((((POGSC:0.2075487809,GADMO:0.325216782):0.2422300483,VARAL:0.3825448204):0.03135412787,XENTR:0.3729811285):0.02299833482,CHICK:0.1907034805):0.08093428274,BALAC:0.1196633273):0.007489082707,MOUSE:0.07565185031):0.01529600895,(HUMAN:0,PANTR:0):0.05720236637):0.005968134133);"
Exercise 2.
Look up the help for the function plot.phylo
which is
called when you plot a tree. What are the other types of trees? View
them all.
The tree above sucks: clearly, GADMO (Gadus morhua, atlantic cod) and POGSC (Saddleback, also a fish) are the outgroup here. We need to root the tree correctly.
nj_tree <- root(nj_tree, outgroup=c("GADMO", "POGSC"))
plot(nj_tree)
Exercise 3.
Instead of the function NJ
, use the function
upgma()
to create a tree. Compare it with the
neighbor-joining tree. What do you notice?
Bootstrapping is an important technique to understand which parts of the tree are robust, and which are somewhat shaky. Basically, 100 or 1000 trees are computed, but each time a random subset of the alignment columns is used. If a part of a tree is solid, this will have no influence; but othertimes, selecting a subset of columns will change the phylogeny.
Generating bootstrap values may look like that:
## set up the function to calculate phylogenies
make_phylo <- function(msa) {
tree <- NJ(dist.ml(msa, model="Blosum62"))
}
boot <- bootstrap.phyDat(msa, make_phylo)
The boot
object contains now a 100 of different
phylogenetic trees. We can plot some of them.
## the command below allows to put six plots
## in two rows and three columns on one figure
par(mfrow=c(2,3))
for(i in 1:6) {
plot(boot[[i]])
}
Can you see the differences?
We can now plot the bootstrap values on the original tree. The numbers indicate how often, among the 100 bootstrap replicates, a given group was found. We use the function plotBS to generate a tree which contains bootstrap values as node labels:
nj_tree_bs <- plotBS(nj_tree, boot, type="none")
plot(nj_tree_bs, show.node.label=TRUE)
A slightly more readable figure can be achieved as follows:
plot(nj_tree_bs)
nodelabels(nj_tree_bs$node.label)
Take a look on how the tree looks like in the Newick format:
write.tree(nj_tree_bs)
As noted, ML methods do not actually reconstruct a tree per se. Rather, they are able to provide a measure called likelihood which allows us to decide which tree is better.
We can now calculate the likelihood function and the AIC (Akaike
Information Criterion) on the NJ tree we have constructed. Note that we
must use the msa_phydat
object:
nj_tree_pml <- pml(nj_tree, msa, model="Blosum62")
nj_tree_pml
## model: Blosum62
## loglikelihood: -1889.648
## unconstrained loglikelihood: -670.3362
## Rate matrix: Blosum62
AIC(nj_tree_pml)
## [1] 3817.296
Note that nj_tree_pml
is no longer an object of class
phylo
: pml added extra information to that tree. The actual
tree is stored in nj_tree_pml$tree
. Yeah, I know it is
confusing. Welcome to R and bioinformatics!
AIC, Akaike Information Criterion is a measure which can be used to optimize a model; it represents the compromise between the number of parameters and the goodness of fit:
\[AIC \stackrel{\text{def}}{=} 2\cdot k - 2\cdot \log{L}\]
Where \(L\) is the log-likelihood function and k is the number of parameters.
The lower AIC, the better.
However, this is still our boring NJ-generated tree. We can try to find a better one, using some heuristics which starts with an initial tree and then tries to improve it:
ml_tree <- optim.pml(nj_tree_pml, model="JTT",
rearrangement="NNI")
ml_tree
AIC(ml_tree)
par(mfrow=c(1,2))
plot(nj_tree)
plot(ml_tree)
What differences do you see?
Exercise 4.
There are various models (both for nucleic acids and for protein
sequences). For example, for amino acid sequences there are the models
Blosum62, Jones-Taylor-Thornton (JTT), Dayhoff (of the PAM matrix fame),
Whelan and Goldman (WAG). Try all of them; which of them fits the data
best? (the higher the log likelihood, the better; the lower the AIC, the
better). You can find out what all the models are via
?optim.pml
.
Plot all the trees you have created in the previous point (WAG, JTT, Dayhoff, Blosum62) and compare them to the NJ tree.
We are going to use a bootstrapping function from the
phangorn
package. The plotBS function not only plots the
tree, it also returns a tree with the node labels.
boot_ml <- bootstrap.pml(ml_tree)
boot_tree <- plotBS(ml_tree$tree, boot_ml, type="none")
plot(boot_tree)
nodelabels(boot_tree$node.label)
As you can see, ML is pretty sure about its result – all bootstrap values are 100! However, it is also wrong, since horse should be in one group with cattle.
Homework 4.
Use the data from https://january3.github.io/Bioinformatics/Practicals/practicals_02/data/globin.aln
to reconstruct phylogeny using a suitable amino acid substitution model.
As outgroups, use the sequence of myoglobin (MYG_HUMAN
).
Just like in case of homework 3, you are expected to provide an Rmd file
which can be compiled without errors.