Resources
Sequence Logos: link 1, link 2
Jaspar database: Jaspar
Multiple sequence alignments (MSA): we will use Clustal Omega or
Clustal 2 (= Clustal W). You can use one of the following online
interfaces or download the program clustalx:
- (preferred) Clustal Omega on EBI web page: clustalo
- Clustal W interface: clustalw (requires
an offline viewer like Jalview)
- Clustal W interface #2: clustalw
(requires an offline viewer)
- Alternatively, download and install clustalx from here. This is a GUI
offline program for doing alignments with clustalw.
- You can also download and install Jalview or the much more primitive
Seaview to view the alignments offline.
- If you want to, you can create MSAs in R using the msa
Bioconductor package.
- To view the trees, if you are not using the EBI interface, you can
use this online viewer.
Editing files: See this guide on editing files.
Exercises
Sequence logos
Use the online tool (link 1, link 2) to create
sequence logos from the following sets of sequences:
What are the consensus sequences of the motifs in these two sets?
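A consensus can also be derived by counting bases per position. The following is a minimal Python sketch, not the method used by the online tools: `count_matrix` and `majority_consensus` are hypothetical helper names, the input is assumed to be equal-length DNA sequences, and `toy_sites` is made-up data rather than one of the exercise sets.

```python
from collections import Counter

def count_matrix(seqs):
    """Count each base at each position of equal-length sequences."""
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")
    return [Counter(s[i].upper() for s in seqs) for i in range(length)]

def majority_consensus(seqs):
    """Most frequent base per position; ties are broken alphabetically."""
    consensus = []
    for counts in count_matrix(seqs):
        best = max(counts.values())
        consensus.append(min(b for b, n in counts.items() if n == best))
    return "".join(consensus)

# Hypothetical toy motif, not one of the exercise sets.
toy_sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TTACTCA"]
print(majority_consensus(toy_sites))
```

A majority consensus discards the information about the less frequent bases at each position, which is exactly what a sequence logo makes visible.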
A regular expression for a DNA sequence motif uses the following IUPAC
codes, in which a single letter represents a position where several
bases are possible. E.g. “N” stands for any base, and “W” stands for A
or T. Summarise the sequence logos in the form of a regular expression
(that is, if at position 1 there are only A’s and T’s, write “W”).
| Code | Meaning |
|------|---------|
| A | Adenine |
| G | Guanine |
| C | Cytosine |
| T | Thymine |
| Y | Pyrimidine (C or T) |
| R | Purine (A or G) |
| W | weak (A or T) |
| S | strong (G or C) |
| K | keto (T or G) |
| M | amino (C or A) |
| D | A, G, T (not C) |
| V | A, C, G (not T) |
| H | A, C, T (not G) |
| B | C, G, T (not A) |
| X/N | any base |
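The codes in the table above can be turned into an ordinary regular expression. The sketch below is one possible way to do this in Python, under the assumption that the input sequences are uppercase, equal-length and contain only A, C, G and T. The names `iupac_summary` and `iupac_to_regex` are hypothetical, and `toy_sites` is made-up data. Note that the summary treats every observed base as equally important, however rare it is at that position.

```python
import re

# IUPAC nucleotide codes from the table above, as strings of allowed bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "W": "AT", "S": "CG",
    "K": "GT", "M": "AC",
    "D": "AGT", "V": "ACG", "H": "ACT", "B": "CGT",
    "N": "ACGT",
}
# Reverse lookup: set of bases -> one-letter code.
CODE_FOR = {frozenset(bases): code for code, bases in IUPAC.items()}

def iupac_summary(seqs):
    """Summarise equal-length sequences as one IUPAC code per position."""
    return "".join(CODE_FOR[frozenset(col)] for col in zip(*seqs))

def iupac_to_regex(motif):
    """Translate an IUPAC motif into a regular expression, e.g. W -> [AT]."""
    parts = []
    for code in motif.upper():
        bases = IUPAC["N" if code == "X" else code]  # X is treated like N
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)

# Hypothetical toy motif, not one of the exercise sets.
toy_sites = ["TGACTCA", "TGAGTCA", "TTACTCA"]
motif = iupac_summary(toy_sites)
pattern = re.compile(iupac_to_regex(motif))
print(motif, pattern.pattern, bool(pattern.fullmatch("TGAGTCA")))
```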
What happens if you change the GC content parameter, e.g. if you set it
to 50%, 25% or 75%?
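One way to reason about this question is to compute the height of a logo column as the relative entropy of the column against the background base composition. The Python sketch below assumes that this is how the GC content setting enters the calculation, which may not match the linked tools exactly, and it ignores any small-sample correction. `background` and `column_information` are hypothetical helper names, and the two columns are made-up data.

```python
from collections import Counter
from math import log2

def background(gc):
    """Background base frequencies for a given GC fraction (0 < gc < 1)."""
    return {"G": gc / 2, "C": gc / 2, "A": (1 - gc) / 2, "T": (1 - gc) / 2}

def column_information(column, gc=0.5):
    """Relative entropy (bits) of one alignment column versus the background."""
    q = background(gc)
    counts = Counter(column)
    total = sum(counts.values())
    return sum((n / total) * log2((n / total) / q[b]) for b, n in counts.items())

# Hypothetical columns: one all-G, one all-A.
for gc in (0.25, 0.50, 0.75):
    print(gc, round(column_information("GGGG", gc), 2),
          round(column_information("AAAA", gc), 2))
```

Under this assumption, a base that is rare in the background contributes more information when it is conserved than a base that is common in the background.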
Go to the JASPAR
database and search for the motifs “CTCF”, “BATF3”, “FOS::JUN” and
“STAT3”. Do any of these motifs look similar to the sequence logos you
have created?
Multiple sequence alignments
There are three data sets:
- Data set 1, hba.fasta, contains the hemoglobin alpha sequences from
different species.
- Data set 2, globin.fasta, contains the different human globin
sequences.
- Data set 3, bhlh.fasta, contains different human bHLH proteins. There
are more than 150 such proteins; here is just a handful (mostly from
“group C”, as described here).
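Before aligning, it can help to check what a FASTA file actually contains. The following is a minimal Python sketch of a FASTA reader. `parse_fasta` is a hypothetical helper, not part of any library, and the example assumes that hba.fasta has been downloaded into the working directory.

```python
import os

def parse_fasta(lines):
    """Return a list of (header, sequence) pairs from the lines of a FASTA file."""
    records, header, chunks = [], None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; store the previous one, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

# Assumes hba.fasta has been downloaded into the working directory.
if os.path.exists("hba.fasta"):
    with open("hba.fasta") as handle:
        for header, seq in parse_fasta(handle):
            print(len(seq), header)
```

Printing the sequence lengths next to the headers makes it easy to spot entries that are much shorter or longer than the others.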
Here are the exercises:
- Open the FASTA file for data set 1. The names of the organisms are
in Latin. Do you know any of them? Search for the names you do not know.
What are these organisms? How are they related to humans? Are all
sequences HBA sequences?
- Using one of the online services or an offline program, create an
MSA of data set 1. The main result file has the extension .aln
(download it if you want to view it e.g. with an offline viewer). The
guide tree is in the .dnd file. Both are text files that can be viewed
in your browser.
- Inspect the MSA. Which sequences are close to the consensus? Which
are more distant? Which regions are conserved?
- Inspect the guide tree. Is the real phylogeny reflected by this
tree?
- Run the alignment and repeat steps 2-4 using data set 2 (human globin
sequences). How does the alignment compare to the previous one? Which
proteins cluster together? How does it relate to the alpha/beta
hemoglobins as described in Lecture 4?
- Run the alignment and repeat steps 2-4 using data set 3. Which region
is conserved?
- Pick one of the bHLH protein identifiers and search for it in
Swiss-Prot/UniProt. Go to the “Family & Domains” section of the record
for that protein. Which domain shows the greatest conservation?
- (Hard) If there is a structure present for that protein (section
“Structure”), which fragment of the structure does this conserved region
correspond to? (Note: the structure may show more than one molecule! The
table on the right shows which chain corresponds to the given protein;
hovering the cursor over the structure reveals which chain a residue
belongs to.)
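The conserved regions asked about in the exercises above can also be located programmatically from the .aln file. The sketch below is a minimal Python reader for the Clustal text format, assuming the usual layout: a header line starting with "CLUSTAL", blocks of "name sequence" lines, and a conservation line that starts with whitespace. `parse_clustal` and `conserved_columns` are hypothetical helper names, and `toy_aln` is a made-up alignment, not real output for any of the data sets.

```python
def parse_clustal(lines):
    """Collect {name: aligned sequence} from the lines of a Clustal .aln file."""
    alignment = {}
    for line in lines:
        if not line.strip() or line.startswith("CLUSTAL") or line[0].isspace():
            continue  # blank line, header, or the */:/. conservation line
        name, chunk = line.split()[:2]
        alignment[name] = alignment.get(name, "") + chunk
    return alignment

def conserved_columns(alignment):
    """0-based indices of gap-free columns where all sequences share one residue."""
    return [i for i, col in enumerate(zip(*alignment.values()))
            if len(set(col)) == 1 and col[0] != "-"]

# Made-up alignment in Clustal format, for illustration only.
toy_aln = """CLUSTAL O(1.2.4) multiple sequence alignment

seqA      MVLSPADKTN
seqB      MVLSGEDKSN
seqC      MV-SAADKAN
          ** *  ** *

seqA      VKAA
seqB      IKAA
seqC      VKA-
           **
""".splitlines()

aln = parse_clustal(toy_aln)
print(conserved_columns(aln))
```

To use it on a real result, the lines of the downloaded .aln file can be passed in instead of `toy_aln`, e.g. by opening the file and handing the file object to `parse_clustal`.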
Homework
There are 20,000 human genes, pick one! No, seriously. You must have
heard of one protein or another. An enzyme? A transcription factor?
Anything, but not hemoglobin or bHLH.
- Pick a protein. You are free to choose any of the 20,000 or so human
proteins.
- Using BLASTP, find at least five homologous sequences from other
organisms. You are allowed to pick specific organisms, of
course.
- Create a multiple sequence alignment of these sequences.
- Your homework is the .aln file created by the MSA program.