1 Resources

  • Sequence Logos: link 1, link 2

  • Jaspar database: Jaspar

  • Multiple sequence alignments (MSA): we will use Clustal Omega or Clustal 2 (= ClustalW). You can use one of the following online interfaces or download the program ClustalX:

    • (preferred) Clustal Omega on EBI web page: clustalo
    • Clustal W interface: clustalw (requires an offline viewer like Jalview)
    • Clustal W interface #2: clustalw (requires an offline viewer)
    • Alternatively, download and install clustalx from here. This is an offline GUI program for doing alignments with clustalw.
    • You can also download and install Jalview or the much more primitive Seaview to view the alignments offline.
    • If you want to, you can create MSAs in R using the msa Bioconductor package.
    • To view the trees, if you are not using the EBI interface, you can use this online viewer
  • Editing files: See this guide on editing files.

2 Exercises

2.1 Sequence logos

  1. Use the online tool (link 1, link 2) to create sequence logos from the following sets of sequences:

  2. What is the consensus sequence of the motif in each of these two sets?

  3. DNA sequence motifs can be written as a kind of regular expression using the following IUPAC codes, in which a single letter represents a position where several bases are possible. E.g. “N” stands for any base, and “W” stands for A or T. Summarise each sequence logo in the form of such an expression (that is, if at position 1 there are only A’s and T’s, write “W”).

Code    Stands for
A       Adenine
G       Guanine
C       Cytosine
T       Thymine
Y       Pyrimidine (C or T)
R       Purine (A or G)
W       weak (A or T)
S       strong (G or C)
K       keto (T or G)
M       amino (C or A)
D       A, G, T (not C)
V       A, C, G (not T)
H       A, C, T (not G)
B       C, G, T (not A)
X/N     any base
  4. What happens if you change the GC content parameter, e.g. to 50%, 25% or 75%?

  5. Go to the JASPAR database and search for the motifs “CTCF”, “BATF3”, “FOS::JUN” and “STAT3”. Do any of these motifs look similar to the sequence logos you have created?
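
The IUPAC summary and the GC content question can also be explored with a few lines of Python. The following is a minimal sketch, not the method used by the online tools: it assumes the sequences are already aligned and of equal length, it uses made-up example sequences rather than the sets from the exercise, and it measures the height of each logo column as relative entropy against a background derived from the GC content. Real logo tools may add a small-sample correction or define the background differently. The function names iupac_summary and column_bits are invented for this illustration.

```python
import math

# IUPAC codes from the table above, keyed by the set of bases seen in a column.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("CT"): "Y", frozenset("AG"): "R",
    frozenset("AT"): "W", frozenset("GC"): "S",
    frozenset("TG"): "K", frozenset("CA"): "M",
    frozenset("AGT"): "D", frozenset("ACG"): "V",
    frozenset("ACT"): "H", frozenset("CGT"): "B",
    frozenset("ACGT"): "N",
}

def iupac_summary(seqs):
    """Return one IUPAC letter per column, covering every base seen there."""
    return "".join(IUPAC[frozenset(col)] for col in zip(*seqs))

def column_bits(seqs, gc=0.5):
    """Relative entropy (bits) of each column against a GC-content background."""
    # Assumption: A and T share (1 - gc) equally, C and G share gc equally.
    background = {"A": (1 - gc) / 2, "T": (1 - gc) / 2,
                  "C": gc / 2, "G": gc / 2}
    bits = []
    for col in zip(*seqs):
        total = 0.0
        for base in set(col):
            p = col.count(base) / len(col)
            total += p * math.log2(p / background[base])
        bits.append(total)
    return bits

# Made-up, pre-aligned example sequences (NOT the sets from the exercise).
example = ["TGACTCA", "TGAGTCA", "TGACTCA", "AGAGTCA"]

print(iupac_summary(example))
print([round(b, 2) for b in column_bits(example, gc=0.5)])
print([round(b, 2) for b in column_bits(example, gc=0.25)])
```

Comparing the last two lines of output shows how the same column can carry more or less information depending on the assumed background: a G is more surprising, and therefore taller in the logo, when the background GC content is low.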

2.2 Multiple sequence alignments

There are three data sets:

  1. Data set 1, hba.fasta, contains the hemoglobin alpha sequences from different species.
  2. Data set 2, globin.fasta, contains the different human globin sequences.
  3. Data set 3, bhlh.fasta, contains different human bHLH proteins. There are more than 150 such proteins; this is just a handful (mostly from “group C”, as described here).
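
All three data sets are FASTA files, that is, plain text in which each record starts with a “>” header line followed by one or more sequence lines. If you prefer to inspect them programmatically rather than in a text editor, the following is a minimal sketch of a FASTA reader in plain Python. The function name read_fasta and the toy records are invented for this illustration; to use it on a real file, pass the file contents instead (for example the text read from hba.fasta).

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a dict {header: sequence}."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            header = line[1:]        # drop the leading ">"
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequences may span several lines
    return {h: "".join(parts) for h, parts in records.items()}

# Toy input with invented records; for real data use e.g.
#   text = open("hba.fasta").read()
demo = """>toy1 first example
MKTAYIAK
QR
>toy2 second example
MKTSAYLAKQ
"""

for name, seq in read_fasta(demo).items():
    print(name, len(seq))
```

Listing the headers this way is a quick means of collecting the organism names that appear in a file.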

Here are the exercises:

  1. Open the FASTA file for data set 1. The names of the organisms are in Latin. Do you know any of them? Search for the names you do not know. What are these organisms? How are they related to humans? Are all sequences HBA sequences?
  2. Using one of the online services or an offline program, create an MSA of data set 1. The main result file has an extension .aln (if you want to download it and e.g. view with an offline viewer). The guide tree is in the .dnd file. Both are text files that can be viewed in your browser.
  3. Inspect the MSA. Which sequences are close to the consensus? Which are more distant? Which regions are conserved?
  4. Inspect the guide tree. Is the real phylogeny reflected by this tree?
  5. Run the alignment and repeat steps 2-4 using data set 2 (human globin sequences). How does the alignment compare to the previous one? Which proteins cluster together? How does this relate to the alpha/beta hemoglobins as described in Lecture 4?
  6. Run the alignment and repeat steps 2-4 using data set 3. Which region is conserved?
  7. Pick one of the bHLH protein identifiers and search for it in Swissprot/Uniprot. Go to the “Family & Domains” section of the record for that protein. Which domain shows the greatest conservation?
  8. (Hard) If there is a structure present for that protein (section “Structure”), which fragment of the structure does this conserved region correspond to? (Note: the structure may show more than one molecule! The table on the right shows which chain corresponds to the given protein; hovering the cursor over the structure reveals which chain a residue belongs to.)
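
The question of which regions are conserved can also be approached numerically once you have an .aln file. The following is a minimal sketch, assuming the usual Clustal layout: a header line starting with “CLUSTAL”, blocks of lines of the form “name sequence-chunk”, and a conservation line that begins with whitespace. The function names read_clustal and column_conservation, the toy alignment, and the scoring rule (fraction of sequences sharing the most common residue in a column) are choices made for this illustration, not the scoring used by Clustal or Jalview.

```python
from collections import Counter

def read_clustal(text):
    """Collect aligned rows from Clustal-formatted text into {name: sequence}."""
    rows = {}
    for line in text.splitlines():
        # Skip blank lines, the header, and the conservation line
        # (which starts with whitespace because it has no sequence name).
        if not line.strip() or line.startswith("CLUSTAL") or line[0].isspace():
            continue
        fields = line.split()
        name, chunk = fields[0], fields[1]
        rows[name] = rows.get(name, "") + chunk  # join chunks across blocks
    return rows

def column_conservation(rows):
    """Fraction of sequences sharing the most common residue in each column.

    Columns in which a gap is the most common symbol score 0.
    """
    scores = []
    for col in zip(*rows.values()):
        residue, count = Counter(col).most_common(1)[0]
        scores.append(0.0 if residue == "-" else count / len(col))
    return scores

# Toy alignment with invented sequences; for real data use e.g.
#   text = open("result.aln").read()   # hypothetical file name
demo = """CLUSTAL O(1.2.4) multiple sequence alignment


toyA      MKT-AYIAKQ
toyB      MKTSAYLAKQ
toyC      MRT-AYIAKQ
          *:* **:***

toyA      RQIS
toyB      RQLS
toyC      RQIS
          **:*
"""

rows = read_clustal(demo)
scores = column_conservation(rows)
# 1-based positions of columns in which all sequences agree
print([i + 1 for i, s in enumerate(scores) if s == 1.0])
```

Runs of consecutive fully conserved positions in the output point to conserved regions, which you can then compare with the domain annotation in UniProt.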

3 Homework

There are 20,000 human genes, pick one! No, seriously. You must have heard of one protein or another. An enzyme? A transcription factor? Anything, but not hemoglobin or bHLH.

  1. Pick a protein. You are free to choose any of the 20,000 or so human proteins.
  2. Using BLASTP, find at least five homologous sequences from other organisms. You are allowed to pick specific organisms, of course.
  3. Create a multiple sequence alignment of these sequences.
  4. Your homework is the .aln file created by the MSA program.