What is an alignment? How does it differ from a complementary sequence? What is a local alignment? What is a global alignment?
For the dot plots and alignments you will be using EMBOSS. EMBOSS is a suite of command-line tools for bioinformatics; you can download it from here. If you are brave, feel free to download it and use it from the command line. Alternatively, download Jemboss and use its Java GUI.
However, many sites offer a web interface, which is the way I will show you these tools in the course.
Links to online interfaces (use the one that you like)
EMBOSS reference: Rice, Peter, Ian Longden, and Alan Bleasby. “EMBOSS: the European molecular biology open software suite.” Trends in genetics 16.6 (2000): 276-277.
Note: in previous courses, it occasionally happened that the sites noticed multiple connections from one IP range and the firewall automatically shut us out. This should not happen; if it does, we will have to spend time installing Jemboss. However, installing unfamiliar tools is a large part of life in bioinformatics, so that will also be a valuable exercise.
Use the “dotmatcher” program from EMBOSS.
The program does not merely put a dot on the plot for each match. Instead, the matches are averaged over a number of nucleotides or amino acid residues, and the result is then compared with a threshold value (only values above the threshold are then displayed).
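The windowing and thresholding described above can be sketched in a few lines of Python. This is a simplified illustration, not the actual dotmatcher implementation: it assumes the window score is simply the number of identical positions (the real program scores with a substitution matrix), and the function names dot_plot and render are made up for this example.

```python
def dot_plot(seq_a, seq_b, window=10, threshold=7):
    """Return the (i, j) window start positions whose score is above the threshold.

    Simplifying assumption: the window score is the count of identical
    positions, not a substitution-matrix score.
    """
    dots = []
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            score = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if score > threshold:  # only values above the threshold are kept
                dots.append((i, j))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the plot as text: seq_a along the x axis, seq_b along the y axis."""
    hits = set(dots)
    return "\n".join(
        "".join("*" if (i, j) in hits else "." for i in range(len(seq_a)))
        for j in range(len(seq_b))
    )


if __name__ == "__main__":
    a = b = "ACGTACGT"
    print(render(a, b, dot_plot(a, b, window=4, threshold=3)))
```

With a small window and a low threshold, almost every position pair passes the filter, which is worth keeping in mind when experimenting with the parameters.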
Start with the following two sequences. These sequences are very similar, so you should have no issues seeing the dot plot.
>sequence1
ACTCTTCTGGTCCCCACAGACTCAGAGAGAACCCACCATGGTGCTGTCTCCTGCCGACAAGACCAACGTC
AAGGCCGCCTGGGGTAAGGTCGGCGCGCACGCTGGCGAGTATGGTGCGGAGGCCCTGGAGAGGTGAGGCT
CCCTCCCCTGCTCCGACCCGGGCTCCTCGCCCGCCCGGACCCACAGGCCACCCTCAACCGTCCTGGCCCC
GGACCCAAACCCCACCCCTCACTCTGCTTCTCCCCGCAGGATGTTCCTGTCCTTCCCCACCACCAAGACC
TACTTCCCGCACTTCGACCTGAGCCACGGCTCTGCCCAGGTTAAGGGCCACGGCAAGAAGGTGGCCGACG
CGCTGACCAACGCCGTGGCGCACGTGGACGACATGCCCAACGCGCTGTCCGCCCTGAGCGACCTGCACGC
GCACAAGCTTCGGGTGGACCCGGTCAACTTCAAGGTGAGCGGCGGGCCGGGAGCGATCTGGGTCGAGGGG
CGAGATGGCGCCTTCCTCGCAGGGCAGAGGATCACGCGGGTTGCGGGAGGTGTAGCGCAGGCGGCGGCTG
CGGGCCTGGGCCCTCGGCCCCACTGACCCTCTTCTCTGCACAGCTCCTAAGCCACTGCCTGCTGGTGACC
CTGGCCGCCCACCTCCCCGCCGAGTTCACCCCTGCGGTGCACGCCTCCCTGGACAAGTTCCTGGCTTCTG
TGAGCACCGTGCTGACCTCCAAATACCGTTAAGCTGGAGCCTCGGTGGCCATGCTTCTTGCCCCTTGGGC
CTCCCCCCAGCCCCTCCTCCCCTTCCTGCACCCGTACCCCCGTGGTCTTTGAATAAAGTCTGAGTGGGCG
GCA
>sequence2
AGACTCTGGCCAACACCCCCTGTAAGGCCACAGGAGAGGAACAGGAGTGATAGCCCCCAAACCCCAGTCC
CACCAGGCCCTGAGGGCCCCTTTGTCACTGGATCTGATAAGAAACACCACCCCTGCAGCCCCCTCCCCTC
ACCTGACCAATGGCCACAGCCTGGCTGGGCCCAGCTCCCTGTATATAAGGGGACCCTGGGGGCTGAGCAC
TACCAAGGCCAGTCCTGAGCAGGCCCAACTCCAGTGCAGCTGCCCACCCTGCCGCCATGTCTCTGACCAA
GACTGAGAGGACCATCATTGTGTCCATGTGGGCCAAGATCTCCACGCAGGCCGACACCATCGGCACCGAG
ACTCTGGAGAGGCTCTTCCTCAGCCACCCGCAGACCAAGACCTACTTCCCGCACTTCGACCTGCACCCGG
GGTCCGCGCAGTTGCGCGCGCACGGCTCCAAGGTGGTGGCCGCCGTGGGCGACGCGGTGAAGAGCATCGA
CGACATCGGCGGCGCCCTGTCCAAGCTGAGCGAGCTGCACGCCTACATCCTGCGCGTGGACCCGGTCAAC
TTCAAGCTCCTGTCCCACTGCCTGCTGGTCACCCTGGCCGCGCGCTTCCCCGCCGACTTCACGGCCGAGG
CCCACGCCGCCTGGGACAAGTTCCTATCGGTCGTATCCTCTGTCCTGACCGAGAAGTACCGCTGAGCGCC
GCCTCCGGGACCCCCAGGACAGGCTGCGGCCCCTCCCCCGTCCTGGAGGTTCCCCAGCCCCACTTACCGC
GTAATGCGCCAATAAACCAATGAACGAA
Play with the parameters “window size” and “threshold”. Increase and decrease the threshold: what happens? What happens if you set the threshold to 1 and the window size to 3?
Play with the modified sequences: cut out a fragment from the middle of one of the sequences and display the dot plot. Do you see what happens? Now copy and paste a fragment two or more times. See what happens?
Now try to run the dot plots for the following two sequences.
>Sequence3
ATGGCGAAAAAACCAAAAAAATTAGAAGAAATTTCAAAAAAATTTGGGGCAGAACGTGAAAAGGCCTTGA
ATGACGCTCTTAAATTGATTGAGAAAGACTTTGGTAAAGGATCAATCATGCGTTTGGGTGAACGTGCGGA
GCAAAAGGTGCAAGTGATGAGCTCAGGTTCTTTAGCTCTTGACATTGCCCTTGGCTCAGGTGGTTATCCT
AAGGGACGTATCATCGAAATCTATGGCCCAGAGTCATCTGGTAAGACAACGGTTGCCCTTCATGCAGTTG
CACAAGCGCAAAAAGAAGGTGGGATTGCTGCCTTTATCGATGCGGAACATGCCCTTGATCCAGCTTATGC
TGCGGCCCTTGGTGTCAATATTGACGAATTGCTCTTGTCTCAACCAGACTCAGGAGAGCAAGGTCTTGAG
ATTGCGGGAAAATTGATTGACTCAGGTGCAGTTGATCTTGTCGTAGTCGACTCAGTTGCTGCCCTTGTTC
CTCGTGCGGAAATTGATGGAGATATCGGAGATAGCCATGTTGGTTTGCAGGCTCGTATGATGAGCCAGGC
CATGCGTAAACTTGGCGCCTCTATCAATAAAACCAAAACAATTGCCATTTTTATCAACCAATTGCGTGAA
AAAGTTGGAGTGATGTTTGGAAATCCAGAAACAACACCGGGCGGACGTGCTTTGAAATTCTATGCTTCAG
TCCGCTTGGATGTTCGTGGTAATACACAAATTAAGGGAACTGGTGACCAAAAAGAAACCAATGTCGGTAA
AGAAACTAAGATTAAGGTTGTAAAAAATAAGGTAGCTCCACCGTTTAAGGAAGCCGTAGTTGAAATTATG
TACGGAGAAGGAATTTCTAAGACTGGTGAGCTTTTGAAGATTGCAAGCGATTTGGATATTATCAAAAAAG
CAGGGGCTTGGTATTCTTACAAAGATGAAAAAATTGGGCAAGGTTCTGAGAATGCTAAGAAATACTTGGC
AGAGCACCCAGAAATCTTTGATGAAATTGATAAGCAAGTCCGTTCTAAATTTGGCTTGATTGATGGAGAA
GAAGTTTCAGAACAAGATACTGAAAACAAAAAAGATGAGCCAAAGAAAGAAGAAGCAGTGAATGAAGAAG
TTACGCTTGACTTAGGCGATGAACTTGAAATCGAAATTGAAGAATAA
>Sequence4
ATGGCAATAGATGAAGACAAACAAAAAGCGATTTCTTTAGCGATCAAACAAATTGATAAGGTTTTTGGTA
AGGGGGCGTTGGTGCGCCTTGGGGATAAGCAAGTAGAAAAGATTGACTCTATTTCTACAGGCTCGTTAGG
GTTGGATCTGGCTTTAGGGATTGGGGGCGTTCCAAAGGGTAGGATCATTGAAATTTATGGGCCAGAGTCA
AGCGGGAAGACCACTTTAAGCTTGCATATCATTGCAGAATGCCAAAAAAATGGGGGCGTGTGCGCGTTTA
TTGACGCTGAGCATGCCCTAGATGTGCATTATGCTAAGAGGCTAGGCGTGGATACGGAAAACTTACTCGT
TTCCCAACCTGATACAGGCGAGCAAGCTTTAGAGATTTTAGAAACGATCACCAGAAGCGGAGGGATTGAT
TTAGTGGTGGTGGATTCCGTAGCGGCTCTTACGCCTAAAGCGGAGATTGATGGGGATATGGGCGATCAGC
ATGTGGGCTTGCAAGCAAGGCTTATGAGCCATGCGTTAAGAAAAATCACCGGTATTTTGCACAAGATGAA
CACCACTCTCATTTTTATCAATCAAATCAGAATGAAGATTGGCATGATGGGTTATGGGAGTCCAGAGACC
ACAACCGGAGGTAATGCCTTAAAATTCTATGCGAGCGTTAGGATTGATATTAGAAGGATTGCGGCTTTAA
AACAAAACGAACAGCATATTGGCAATAGGGCTAAAGCCAAAGTGGTTAAAAATAAAGTCGCTCCGCCCTT
TAGAGAAGCGGAATTTGACATCATGTTTGGGGAGGGGATTTCTAAAGAGGGCGAAATCATTGATTATGGC
GTGAAATTAGACATTGTGGATAAGAGTGGGGCATGGCTTAGCTACCAGGATAAAAAGCTAGGGCAAGGCC
GAGAAAATGCTAAAGCCTTACTGAAAGAAGACAAAGCCCTAGCGAATGAAATCACTCTTAAGATTAAAGA
GAGCATTGGCTCTAATGAAGAGATCATGCCCTTACCAGATGAGCCTTTAGAAGAAATGGAATAG
You will create global alignments with the Needleman-Wunsch algorithm (program “needle”) and local alignments with the Smith-Waterman algorithm (program “water”).
Create global and local alignments between the above sequences 1 and 2 as well as 3 and 4.
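To see how the two algorithms differ, here is a minimal Python sketch that computes only the optimal scores, not the alignments themselves. It is a teaching simplification and not what “needle” and “water” do internally: it assumes a fixed match/mismatch score and a linear gap penalty, whereas the EMBOSS programs use a scoring matrix and separate gap opening and extension penalties. The function name and default scores are chosen for this example only.

```python
def alignment_scores(a, b, match=1, mismatch=-1, gap=-2):
    """Return (global_score, local_score) for sequences a and b.

    Assumptions for illustration: identity scoring and a linear gap penalty.
    """
    n, m = len(a), len(b)

    # Global (Needleman-Wunsch): leading gaps are paid for in row/column 0.
    g = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        g[i][0] = i * gap
    for j in range(1, m + 1):
        g[0][j] = j * gap

    # Local (Smith-Waterman): same recurrence, but cells never drop below 0,
    # and the result is the best cell anywhere in the matrix.
    loc = [[0] * (m + 1) for _ in range(n + 1)]
    best_local = 0

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            g[i][j] = max(g[i - 1][j - 1] + s,
                          g[i - 1][j] + gap,
                          g[i][j - 1] + gap)
            loc[i][j] = max(0,
                            loc[i - 1][j - 1] + s,
                            loc[i - 1][j] + gap,
                            loc[i][j - 1] + gap)
            best_local = max(best_local, loc[i][j])

    return g[n][m], best_local


if __name__ == "__main__":
    # A short motif embedded in a longer sequence: the global score is
    # dragged down by the end gaps, the local score is not.
    print(alignment_scores("TTTTACGTTTTT", "ACGT"))
```

The only differences between the two are the initialisation of the first row and column, the floor at zero, and where the final score is read from.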
The following are the protein sequences corresponding to sequences 1 and 2. Run the alignments. Any thoughts?
>sequence1_prot
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHG
KKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTP
AVHASLDKFLASVSTVLTSKYR
>sequence2_prot
MSLTKTERTIIVSMWAKISTQADTIGTETLERLFLSHPQTKTYFPHFDLHPGSAQLRAHG
SKVVAAVGDAVKSIDDIGGALSKLSELHAYILRVDPVNFKLLSHCLLVTLAARFPADFTA
EAHAAWDKFLSVVSSVLTEKYR
Cut out a fragment from the middle of one of the sequences. Run the alignment and take a look. Now set the gap opening penalty and gap extension penalty to 1. What is the difference? How about 25/0.5?
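The effect of the two penalties is easier to predict with a little arithmetic. The sketch below assumes the common affine convention in which the first position of a gap costs the opening penalty and each further position costs the extension penalty; check the documentation of the tool you use, since conventions differ slightly. The 10/0.5 pair is included only as a reference point, and the function name is made up for this example.

```python
def gap_cost(length, gap_open, gap_extend):
    """Penalty for one contiguous gap of `length` positions.

    Assumed convention: gap_open for the first position,
    gap_extend for every additional position.
    """
    if length <= 0:
        return 0.0
    return gap_open + (length - 1) * gap_extend


# Compare one long gap with many single-position gaps of the same total length.
for gap_open, gap_extend in [(10, 0.5), (1, 1), (25, 0.5)]:
    one_long = gap_cost(20, gap_open, gap_extend)
    many_short = 20 * gap_cost(1, gap_open, gap_extend)
    print(f"open={gap_open}, extend={gap_extend}: "
          f"one gap of 20 costs {one_long}, twenty gaps of 1 cost {many_short}")
```

When both penalties are 1, scattering gaps costs the same as one long gap, so the aligner has no reason to keep a deletion in one piece. With a high opening penalty and a low extension penalty, a single long gap is far cheaper than many short ones.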
(hard) There is a conspiracy theory online that the spike protein from SARS-CoV-2 is similar to the human syncytin-1 protein (which is of viral origin). Find the two proteins in the UniProt database (SPIKE_SARS and …) and create global and local alignments and dot plots. Can you see any similarity? Try different scoring matrices (BLOSUM30, BLOSUM62 etc.) and different gap penalties.
As is often the case in bioinformatics, the BLAST family of programs can be used from the command line. You can download and install them on your computer; on Ubuntu, it is as simple as
sudo apt install ncbi-blast+
Similarly, you can download and install the NCBI databases such as NR or RefSeq locally. This is advantageous when you have more than just a few sequences to BLAST.
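For a batch of query files, the command lines could be assembled along the following lines. This is only a sketch: the file names and the database name my_local_db are hypothetical, the helper functions are made up for this example, and it assumes that the BLAST+ program blastn is installed and that a local database has already been set up.

```python
from pathlib import Path


def blastn_command(query_fasta, database, out_path, evalue=1e-5):
    """Build the argument list for one local blastn run (tabular output)."""
    return [
        "blastn",
        "-query", str(query_fasta),
        "-db", database,
        "-evalue", str(evalue),
        "-outfmt", "6",      # tab-separated table, one hit per line
        "-out", str(out_path),
    ]


def batch_commands(query_files, database):
    """One command per query file; output goes to <name>.blast.tsv."""
    return [
        blastn_command(q, database, f"{Path(q).stem}.blast.tsv")
        for q in query_files
    ]


if __name__ == "__main__":
    # Hypothetical inputs; replace with real files and a real database name.
    for cmd in batch_commands(["sample1.fasta", "sample2.fasta"], "my_local_db"):
        print(" ".join(cmd))
        # To actually run it (requires BLAST+ and the database):
        # import subprocess; subprocess.run(cmd, check=True)
```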
However, BLAST features a popular and extremely refined user interface on the NCBI website, which we will use today. The entry point is https://blast.ncbi.nlm.nih.gov/Blast.cgi.
Using blastn, identify the sequences 1, 2, 3 and 4.
Michael Crichton included a DNA sequence in his book “The Lost World”. Use BLAST to identify the origin of the sequence. You might try limiting the search to a group of organisms.
>DinoDNA "Dinosaur DNA" from Crichton LOST WORLD
GAATTCCGGAAGCGAGCAAGAGATAAGTCCTGGCATCAGATACAGTTGGAGATAAGGACG
GACGTGTGGCAGCTCCCGCAGAGGATTCACTGGAAGTGCATTACCTATCCCATGGGAGCC
ATGGAGTTCGTGGCGCTGGGGGGGCCGGATGCGGGCTCCCCCACTCCGTTCCCTGATGAA
GCCGGAGCCTTCCTGGGGCTGGGGGGGGGCGAGAGGACGGAGGCGGGGGGGCTGCTGGCC
TCCTACCCCCCCTCAGGCCGCGTGTCCCTGGTGCCGTGGGCAGACACGGGTACTTTGGGG
ACCCCCCAGTGGGTGCCGCCCGCCACCCAAATGGAGCCCCCCCACTACCTGGAGCTGCTG
CAACCCCCCCGGGGCAGCCCCCCCCATCCCTCCTCCGGGCCCCTACTGCCACTCAGCAGC
GGGCCCCCACCCTGCGAGGCCCGTGAGTGCGTCATGGCCAGGAAGAACTGCGGAGCGACG
GCAACGCCGCTGTGGCGCCGGGACGGCACCGGGCATTACCTGTGCAACTGGGCCTCAGCC
TGCGGGCTCTACCACCGCCTCAACGGCCAGAACCGCCCGCTCATCCGCCCCAAAAAGCGC
CTGCGGGTGAGTAAGCGCGCAGGCACAGTGTGCAGCCACGAGCGTGAAAACTGCCAGACA
TCCACCACCACTCTGTGGCGTCGCAGCCCCATGGGGGACCCCGTCTGCAACAACATTCAC
GCCTGCGGCCTCTACTACAAACTGCACCAAGTGAACCGCCCCCTCACGATGCGCAAAGAC
GGAATCCAAACCCGAAACCGCAAAGTTTCCTCCAAGGGTAAAAAGCGGCGCCCCCCGGGG
GGGGGAAACCCCTCCGCCACCGCGGGAGGGGGCGCTCCTATGGGGGGAGGGGGGGACCCC
TCTATGCCCCCCCCGCCGCCCCCCCCGGCCGCCGCCCCCCCTCAAAGCGACGCTCTGTAC
GCTCTCGGCCCCGTGGTCCTTTCGGGCCATTTTCTGCCCTTTGGAAACTCCGGAGGGTTT
TTTGGGGGGGGGGCGGGGGGTTACACGGCCCCCCCGGGGCTGAGCCCGCAGATTTAAATA
ATAACTCTGACGTGGGCAAGTGGGCCTTGCTGAGAAGACAGTGTAACATAATAATTTGCA
CCTCGGCAATTGCAGAGGGTCGATCTCCACTTTGGACACAACAGGGCTACTCGGTAGGAC
CAGATAAGCACTTTGCTCCCTGGACTGAAAAAGAAAGGATTTATCTGTTTGCTTCTTGCT
GACAAATCCCTGTGAAAGGTAAAAGTCGGACACAGCAATCGATTATTTCTCGCCTGTGTG
AAATTACTGTGAATATTGTAAATATATATATATATATATATATATCTGTATAGAACAGCC
TCGGAGGCGGCATGGACCCAGCGTAGATCATGCTGGATTTGTACTGCCGGAATTC
Now use blastx, which takes a nucleotide sequence as input but searches in protein sequence databases. What do you see? Can you see the hidden message in the protein sequence? Identify the nucleotides corresponding to the hidden message.
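A translated search has to consider all six reading frames of the nucleotide query (three per strand). The following Python sketch shows what such a six-frame translation looks like. It assumes the standard genetic code and is not how BLAST itself is implemented; the function names are chosen for this example.

```python
# Standard genetic code, codons ordered by base T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq, offset=0):
    """Translate from `offset`; '*' marks a stop, 'X' an unrecognised codon."""
    seq = seq.upper()
    codons = (seq[p:p + 3] for p in range(offset, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(seq):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to translations."""
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s, offset)
    return frames


if __name__ == "__main__":
    for label, protein in six_frames("ATGGCCTAA").items():
        print(label, protein)
```

Because each amino acid in frame +1, +2 or +3 corresponds to three consecutive nucleotides starting at a known offset, a position in the translated hit can be mapped back to nucleotide coordinates in the query.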
Using an appropriate alignment tool, create a global alignment between the Crichton sequence and the best hit. How can you get a FASTA sequence from the BLAST output? How can you get a FASTA nucleotide sequence?
Here is another one, this time from “Jurassic Park”. Where does the sequence come from?
>DinoDNA "Dinosaur DNA" from Crichton JURASSIC PARK p. 103 nt 1-1200
GCGTTGCTGGCGTTTTTCCATAGGCTCCGCCCCCCTGACGAGCATCACAAAAATCGACGC
GGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAAGCTCCCTCG
TGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTCCCTTCGGGAAGCGTGGC
TGCTCACGCTGTACCTATCTCAGTTCGGTGTAGGTCGTTCGCTCCAAGCTGGGCTGTGTG
CCGTTCAGCCCGACCGCTGCGCCTTATCCGGTAACTATCGTCTTGAGTCCAACCCGGTAA
AGTAGGACAGGTGCCGGCAGCGCTCTGGGTCATTTTCGGCGAGGACCGCTTTCGCTGGAG
ATCGGCCTGTCGCTTGCGGTATTCGGAATCTTGCACGCCCTCGCTCAAGCCTTCGTCACT
CCAAACGTTTCGGCGAGAAGCAGGCCATTATCGCCGGCATGGCGGCCGACGCGCTGGGCT
GGCGTTCGCGACGCGAGGCTGGATGGCCTTCCCCATTATGATTCTTCTCGCTTCCGGCGG
CCCGCGTTGCAGGCCATGCTGTCCAGGCAGGTAGATGACGACCATCAGGGACAGCTTCAA
CGGCTCTTACCAGCCTAACTTCGATCACTGGACCGCTGATCGTCACGGCGATTTATGCCG
CACATGGACGCGTTGCTGGCGTTTTTCCATAGGCTCCGCCCCCCTGACGAGCATCACAAA
CAAGTCAGAGGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAA
GCGCTCTCCTGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTCCCTTCGGG
CTTTCTCAATGCTCACGCTGTAGGTATCTCAGTTCGGTGTAGGTCGTTCGCTCCAAGCTG
ACGAACCCCCCGTTCAGCCCGACCGCTGCGCCTTATCCGGTAACTATCGTCTTGAGTCCA
ACACGACTTAACGGGTTGGCATGGATTGTAGGCGCCGCCCTATACCTTGTCTGCCTCCCC
GCGGTGCATGGAGCCGGGCCACCTCGACCTGAATGGAAGCCGGCGGCACCTCGCTAACGG
CCAAGAATTGGAGCCAATCAATTCTTGCGGAGAACTGTGAATGCGCAAACCAACCCTTGG
CCATCGCGTCCGCCATCTCCAGCAGCCGCACGCGGCGCATCTCGGGCAGCGTTGGGTCCT
Here is a mystery sequence. What is this protein? Which organism does it come from? (X stands for an unknown amino acid.)
>mystery sequence
GATGAPGIAGAPGFPGARGAPGPQGPSGAPGPKXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXGVQGPPGPQGPR
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXGSAGPPGATGFP
GAAGRXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XXXXXXXXXXXXXXXXXXXXXGVVGLPGQR
Identify the sequence.
There are now several similar sequences in the NR database; use the one that has the identifier starting with “CO1A1”. Please use the sequence with the identifier starting with “TY”. Click on the “Sequence ID” link to view more information about the sequence. Then find and read the paper from which the mystery sequence is derived. Provide a summary of roughly five sentences covering what was done, how it was done, and what the conclusions are.