Lecture 4: Multiple alignments, motifs and logos

class: center, middle, inverse, title-slide

.title[
# Lecture 4: Multiple alignments, motifs and logos
]
.subtitle[
## BE_22 Bioinformatics SS 21
]
.author[
### January Weiner
]
.date[
### 2024-04-29
]

---

## Sequence conservation & logos

![:scale 70%](images/logos.png)

---

## What are these bits??

Visualize uncertainty (defined as Shannon entropy) about a given position
using *information bits*.

Entropy at position `$i$` describes the probability with which we can predict
the outcome:

`$H_i = - \sum_{aa=1}^{20} f_{aa,i} \cdot \log_2 f_{aa,i}$`

For example, if at a certain position we have 100 % Ala and 0 % of other amino acids,

`$H_i = -f_{Ala,i}\cdot \log_2 f_{Ala,i} + \sum_{aa \neq Ala} f_{aa,i}\log_2 f_{aa, i} = - 1 \cdot \log_2 1 + 0 = 0$`

But if Ala and His are at equal frequencies, we get

`$H_i = -0.5 \log_2 0.5 - 0.5 \log_2 0.5 = - 0.5 \cdot -1 - 0.5 \cdot -1 = 1$`

And if all amino acids are at the same frequencies, we have

`$H_i = -20 \cdot 0.05 \cdot \log_2 0.05 \approx -20 \cdot 0.05 \cdot -4.3 = 4.3$`

---

## What are these bits??

Bits are now defined as

`$R_i = \log_2 20 - H_i + e_n \approx 4.3 - H_i$`

(where `$e_n$` is correction for small sample sizes (reduces to zero for
large number of sequences)

So for all frequencies equal, `$H_i \approx 4.3$` and `$R_i \approx 0$`; for
one frequency equal to 100%, `$H_i = 0$` and `$R_i \approx 4.3$`.

This gives a scaling factor for the height of the letters:

`$bits = y_{aa,i} = f_{aa,i} \cdot R_i$`

---

![](images/logos_0.png)

---

## Our many hemoglobins...

![](images/hemoglobin_clusters.jpg)

---

## Our many hemoglobins...

![](images/hemoglobin_evolution.jpg)

So, why do we need so many?

`$\rightarrow$` example of subfunctionalization

---

## Zeta hemoglobin function

.pull-left[
![](images/hemoglobin_alpha.png)
]

.pull-right[
![](images/hemoglobin_zeta.png)
]

---

## Zeta hemoglobin function

![](images/zeta_function.png)

---

## Zeta hemoglobin function

![](images/zeta_function_2.png)

---

## Multiple sequence alignments

![:scale 70%](images/hb_align.png)

![:scale 70%](images/hb_align_2.png)

---

## Multiple sequence alignments

What is it good for?

![](images/google.jpg)

---

## Multiple sequence alignments

* Find groups of sequences
 * Determine conserved regions
   * Conserved region: important! `$\rightarrow$` purifying selection
 * Find motifs and domains
 * Find common physical properties
   * Do changes in amino acids cause changes in physical properties?
 * Create profiles / HMM models for sensitive homology search
 * First step to phylogeny reconstruction

---

The alignment clearly shows the two paralogs of hemoglobin (A and B)

![](images/hba_align.png)

---

Alignment showing physical properties of amino acids

![](images/align_phys.png)

---

We can find conserved regions / protein domains

![](images/align_conserved.png)

---

## How to score a multiple alignment

Add scores from all possible pairs of sequences

Score for a column `$i$` in a multiple sequence alignment:

`$$S_i = \sum_{j=1}^{N} \sum_{k=j + 1}^{N} aa_{i,j,k}$$`

Score for the full alignment:

`$$S = \sum_{i=1}^L \sum_{j=1}^{N} \sum_{k=j + 1}^{N} aa_{i,j,k}$$`

where `$L$` is the length of the alignment, `$N$` is the number of sequences,
and `$aa_{i,j,k}$` denotes scores between sequences `$j$` and `$k$` in column
`$i$`.

---

## Extending NW approach

![](images/cube-1.png)

---

## Extending NW approach

![](images/cube-2.png)

---

## Extending NW approach

![](images/cube-3.png)

---

## Extending NW approach

![](images/cube-4.png)

---

## Extending NW approach

![](images/cube-5.png)

---

## Extending NW approach

![](images/cube-6.png)

---

## Extending NW approach

* complexity grows with each sequence in MSA
 * for 3 sequences, `$O(n^3)$`; for 4, `$O(n^4)$`, etc. `$\rightarrow O(n^k)$`
 * naive application completely useless
 * sped up with Carillo-Lipman approach, but still not scalable
 * not realistic for more than 4 sequences

---

## Heuristic

* compromise accuracy vs speed
 * algorithms:
   * progressive algorithms
      * clustal[o,w] `$O(N^2)$` (clustalw), `$O(N\log{N})$` (clustalo)
      * kalign
   * stepwise improvement / genetic
      * MUSCLE `$O(N^3L)$`

---

## ClustalW progressive algorithm

.pull-left[
![](images/clustal_algo.png)
]

.pull-right[
 * Calculate pairwise similarity (distance) between each two sequences
 * Build a similarity matrix
 * Create a "guide tree" (which sequences are next to each other?)
 * Step-wise assembly along the tree
]

---

## Distance and score

* Score is calculated by the "sum of all pairs" approach
 * Using an appropriate function, we can always convert a score into a
   distance

For example, the original clustalw algorithm used the following distance
function:

`$$D_{i,j} \stackrel{def}{=} 1 - P_{i,j}$$`

Where `$P_{i,j}$` is the percentage identity between sequences `$i$` and `$j$`.
The reason for using this simple function was its speed.

---

## Distance and score

Feng and Doolittle method (1996):

* Calculate the pairwise score between two matrices using an approppriate matrix `$S_{real}$`
 * Determine "background" scoring of randomly shuffled sequences (average
   between several sequences, `$S_{rand}$`
 * Determine the "maximum possible" score by aligning first and second
   sequences to itself and determining the average, `$S_{iden} = \frac{S_{1,1} + S_{2,2}}{2}$`

Then calculuate the normalized score:

`$$S = \frac{S_{real} - S_{rand}}{S_{iden} - S_{rand}} \times 100$$`

And the distance

`$D = -\log{S}$`

.footnote[Feng, Da-Fei, and Russell F. Doolittle. "[21] Progressive
alignment of amino acid sequences and construction of phylogenetic trees
from them." Methods in enzymology. Vol. 266. Academic Press, 1996.
368-382.]

---

## ClustalW algorithm

![:scale 80%](images/clustal_algo_0.png)

---

## ClustalW algorithm

![:scale 80%](images/clustal_algo_1.png)

---

## ClustalW algorithm

![:scale 80%](images/clustal_algo_2.png)

---

## ClustalW algorithm

![:scale 80%](images/clustal_algo_4.png)

---

## DNA/RNA sequence motifs

Protein binding motifs – binding transcription factors etc.

![:scale 30%](images/stat3.svg)

STAT3 – member of the JAK/STAT pathway

![:scale 30%](images/batfjun.svg)

BATF::JUN heterodimer

---

## DNA/RNA sequence motifs

Examples:
 
 * Shine-Dalgarno sequence 5'-AGGAGGU-3' in the untranslated region of mRNA, complementary to a sequence in the 16S
   rRNA, 5'-ACCUCCU-3':
 
.pull-left[
 ![:scale 80%](images/ecoli_sd_logo.jpg)
 ]

.pull-right[
 ![:scale 50%](images/shine_dalgarno.jpg)
 ]

.pull-left[

* Kozak sequence in eukaryotes (detected by scanning by the PIC,
   pre-initiation complex, which contains S40).

![:scale 90%](images/kozak_logo.png)
 ]

.pull-right[
 ![:scale 50%](images/marylin_kozak.png)
 ]

???

Lynn Dalgarno (top middle)

John Shine (top left)

---

* Bacteria: Shine Dalgarno seqence or leaderless transcripts (which start
   with ATG)
 * Archaea: Shine Dalgarno, Kozak, leaderless transcripts
 * Eukaryota: Kozak sequence

---

## Position specific scoring matrices (PSSMs)

Creating a BLOSUM-like scoring system which depends on the *position* to
detect motifs.

![](images/pssm/gene-prokaryota.jpg)

---

## Position specific scoring matrices (PSSMs)

* regulators
  * AT-rich upstream region
  * -35 box
  * -10 box (Pribnow box)
  * +1 / ATG

---

## Pribnow box (-10 box)

![](images/pssm/january_1-1.jpg)

* Not all promoters have the same sequence
 * Small differences follow a statistical distribution

---

Battle plan:

For each of the three main regions (-35, -10 and +1) create small
alignments.

In an alignment, count the frequencies of the four nucleotides at each
position and calculate scores as log-odds ratios

`$$S_{nn,i} = \log\frac{f_{o,i}}{f_e}$$`

Where `$f_e$` is the expected frequency of nucletide `$nn$` (given by the GC contents),
`$f_{o,i}$` is the observed frequency at position `$i$`, and `$S_{nn,i}$` is the
position-specific score.

Then, slide the three matrices along a DNA sequence and find the
position(s), where the overall score is maximal (we add the scores of the
three matrices together).

---

![](images/pssm/january_1-1.jpg)

---

![](images/pssm/january_1-2.jpg)

---

![](images/pssm/january_1-3.jpg)

---

![](images/pssm/january_3_1.jpg)

---

![](images/pssm/january_3_2.jpg)

---

![](images/pssm/january_3_3.jpg)

---

![](images/pssm/january_3_4.jpg)

---

![](images/pssm/january_4_1.jpg)

---

![](images/pssm/january_4_2.jpg)

---

![](images/pssm/january_4_3.jpg)

---

![](images/pssm/january_4_4.jpg)

---

![](images/pssm/january_4_5.jpg)

---

![](images/pssm/january_4_6.jpg)

---

![](images/pssm/january_4_7.jpg)

---

![](images/pssm/january_4_8.jpg)

---

![](images/pssm/january_4_9.jpg)

---

![](images/pssm/january_4_10.jpg)

---

![](images/pssm/january_4_11.jpg)

---

![](images/pssm/january_4_12.jpg)

---
class:empty-slide,myinverse
background-image:url(images/cute_puppy.jpg)

---

# Position specific iterated BLAST (PSI-BLAST)

* Find a bunch of hits with regular BLAST
 * run a multiple sequence alignments on selected hits
 * create a PSSM
 * get better hits
 * rinse and repeat

---

![](images/psiblast.png)

---

# Hidden Markov Models

![](images/A_profile_HMM_modelling_a_multiple_sequence_alignment.png)

---

# Hidden Markov Models

![](images/hmms_building.png)