CS 5263 Bioinformatics

CS 5263 Bioinformatics

CS 5263 Bioinformatics Lecture 7: Heuristic Sequence Alignment Algorithms (BLAST) Roadmap Last lecture review Sequence alignment statistics Today Gene finding by alignment Heuristic alignment algorithms Basic Local Alignment Search Tools Sequence Alignment Statistics Substitution matrices How is the BLOSUM matrix made How to make your own substitution matrix Whats the meaning of an arbitrary substitution

matrix Significance of sequence alignments P-value estimation for Global alignment scores Local alignment scores What is a p-value? The probability of observing an effect as strong or stronger than you observed, given the null hypothesis. i.e., How likely is this effect to occur by chance? Pr(x > S | null) What is a null-hypothesis? A statisticians way of characterizing chance. Generally, a mathematical model of

randomness with respect to a particular set of observations. The purpose of most statistical tests is to determine whether the observed data can be explained by the null hypothesis. For sequence alignment Your null hypothesis is the two sequences are unrelated Your alternative hypothesis is the two sequences are related You obtained a score S how likely that you can obtain such a score if the null hypothesis is true? How to test that null-hypothesis? Randomly generate some pairs of unrelated sequences

See what alignment scores you may get for those unrelated sequences Must keep other factors in mind Your random sequences must be as close as possible to your true sequences Except that they are unrelated (i.e., not from a common ancestor) Possible ways to get unrelated sequences Which is better? Randomly pick some sequences from a database Randomly pick some sequences from a database and truncate to the same length as your real sequences Generate random sequences according to the frequency that each letter is used by your real

sequences Randomly shuffle your sequences Possible ways to get random sequences Which is better? Randomly pick some sequences from a database Randomly pick some sequences from a database and truncate to the same length as your real sequences Generate random sequences according to the frequency that each letter is used by your real sequences Randomly shuffle your sequences Random shuffling is what we do to estimate p-values for global alignment

Mouse HEXA Human HEXA Score = 732 45 40 35 Number of Sequences 30

25 20 15 10 732 5 0 -200 -100

0 100 200 300 400 Alignment Score 500 600 700 800 Distribution of the alignment scores between mouse HEXA

and 200 randomly shuffled human HEXA sequences P-value: less than 1/200 = 0.005 Advantages Easy to implement You dont need to know a lot of theories to do this Disadvantages Slow Cannot estimate small p-values If we had repeated 1,000 times, would we get a score as high 732? Based on what Ive already seen, I would guess probably no What about 1,000,000 times? When theory exists It gets much better You dont really need to go there to know

whats there (I know roughly how many times you can get a score as high as 732 if you repeat your experiments a billion times) That is what happened for local alignment For ungapped alignment between two sequences of lengths M and N E(S) = KMN exp[-S] (Expected value, E-value. ) K, depends on sequence composition and substitution matrix Can be obtained either empirically or analytically S

P xS 1 exp E ( S ) 1 exp KMNe E ( S ) when P is small. Distribution of alignment scores for 1000 random sequence pairs Extreme value distribution 400 350 Number of Sequences 300 250 200

150 100 50 0 Theory says my score distribution should have this shape 9 10 11 12 13 14

15 Alignment Score 16 17 18 My experiment shows me that the theory seems correct Example You are aligning two sequences, each has 1000 bases m = 1, s = -1, d = -inf (ungapped alignment) You obtain a score 20 Is this score significant?

= ln3 = 1.1 E(S) = K MN exp{- S} E(20) = 0.1 * 1000 * 1000 * 3-20 = 3 x 10-5 P-value = 3 x 10-5 << 0.05 The alignment is significant 400 350 Number of Sequences 300

250 200 150 100 20 50 0 9 10 11 12

13 14 15 Alignment Score 16 17 18 Distribution of 1000 random sequence pairs Multiple-testing problem You are searching a 1000-base sequence against a database of 106 sequences (average length 1000 bases) You get a score 20

You are essentially comparing 1000 bases with 1000x106 = 109 bases (ignore edge effect) E(20) = 0.1 * 1000 * 109 * 3-20 = 30 By chance we would expect to see 30 matches P-value = 1 e-30 = 0.9999999999 Not significant at all A better way to understand p-value Your p-value is 0.99999 You have very low confidence (<0.00001) to say that the null hypothesis is wrong Is the null hypothesis true then (i.e., the two sequences are unrelated)? You dont know Your test was not designed to tell you that In practice You search the sequence against a database of

106 sequences You get 35 matches You expect to get about 30 by chance It could be all 35 are real, or none, or some You already reduced your target from 106 sequences to 35 sequences Take all 35 sequences with caution. Look for other evidences Statistics for gapped local alignment Theory not well developed Extreme value distribution works well empirically Need to estimate K and empirically Exercising FSA How do you make an FSA for the

Needleman-Wunsch algorithm? Exercising FSA How do you make an FSA for the Needleman-Wunsch algorithm? (-, yj)/d (xi,yj) / Ix (-, yj) / d (xi,yj) / (xi,-)/d (-, yj) / d

F (xi,-) / d (xi,yj) / Iy (xi,-)/d Simplify (xi,yj) / (xi,-) / d F (xi,yj) / (-, yj) / d

(xi,-) / d I (-, yj) / d Simplify more (xi,yj) / F(i-1, j-1) + (xi, yj) F(i,j) = max F(i-1, j) + d F(i, j-1) + d F (xi,-) / d (-, yj) / d A more difficult alignment problem

(A gene finder indeed!) X is a genomic sequence (DNA) X encodes a gene May contain introns Y is an ORF from another species Contains only exons We want to compare X against Y Conservation is on the level of amino acids DNA intron intron Pre-mRNA

5 UTR exon exon exon 3 UTR Splice Mature mRNA (mRNA) Open reading frame (ORF) Start codon Stop codon We have a predicted gene We know the positions of the start codon and

stop codon But we dont know where are the splicing sites Not even the number of introns intron Start codon exon intron exon exon Stop codon 1. Most splicing sites start at GT and end at AG

2. But there are lots of GT and AG in the sequence 3. Aligning to a orthologous gene with known ORF may help us determine the splicing sites Orthologous genes: two genes evolved from the same ancestor Coding region are likely conserved on amino acid level UUA, UUG encode the same amino acid So do UCA, UCU, UCG, UCC GTAG Mouse putative gene human ORF

The Genetic Code Third letter If know where are the exons Easy Mouse putative gene Remove introns Mouse putative ORF translate Global alignment human ORF translate Or directly align triplets Mouse putative

gene Remove introns Mouse putative ORF Global alignment human ORF Codon substitution scores AAA AAG AAU AAC AAA

4 3 -1 AAG 3 4 AAU -1 AAC

UCU UCC -1 -1 -1 -1

-1 -1 -1 -1 4 3 1 1 -1

-1 3 4 1 1 UCU -1 -1 1

1 4 3 UCC -1 -1 1 1 3

4 64 x 64 substitution matrix FSA for aligning genomic DNA to ORF (xi-2xi-1xi, yj-2yj-1yj) / (xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / e (xi-2xi-1xi, yj-2yj-1yj) / A B

(xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / d Considering only exons 1. We dont know exactly where are the splicing sites 2. Length of introns may not be a multiple of 3 - If convert the whole seq into triplets, may result in ORF shift 17 bases? Mouse putative gene human ORF Model introns 1. Most splicing sites start at GT and end at AG 2. For simplicity, assume length of exon is a multiple of 3 Not true in reality

Only a little more work without this assumption GTAG Mouse putative gene human ORF 126 nt = 42 aa 120 nt = 40 aa Aligning genomic DNA to ORF Fixed cost to have an intron Alignment with Affine gap

penalty FSA for aligning genomic DNA to ORF (xi-2xi-1xi, yj-2yj-1yj) / (xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / e (xi-2xi-1xi, yj-2yj-1yj) / A B (xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / d Considering only exons FSA for aligning genomic DNA to ORF

(xi-2xi-1xi, yj-2yj-1yj) / (xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / e Start an intron (xi-2xi-1xi, yj-2yj-1yj) / (-, GT) / s A B (xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / d C FSA for aligning genomic DNA to ORF

Continue in intron (xi-2xi-1xi, yj-2yj-1yj) / (xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / e Start an intron (xi-2xi-1xi, yj-2yj-1yj) / (-, GT) / s A B (xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / d (-, yi) / 0

C FSA for aligning genomic DNA to ORF Continue in intron (xi-2xi-1xi, yj-2yj-1yj) / (xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / e Start an intron (xi-2xi-1xi, yj-2yj-1yj) / (-, GT) / s A

(-, yi) / 0 B (xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / d (-, AG) / s Close an intron C (xi-2xi-1xi, yj-2yj-1yj) / (xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / e (-, yj) / 0 (xi-2xi-1xi, yj-2yj-1yj) / A

(-, GT) / s (xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / d B (-, AG) / s A(i-3,j-3) + (xi-2xi-1xi, yj-2yj-1yj) A(i, j) = max B(i-3,j-3) + (xi-2xi-1xi, yj-2yj-1yj) C(i, j-2) + s, if yj-1yj == AG C (xi-2xi-1xi, yj-2yj-1yj) /

(xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / e (-, yj) / 0 (xi-2xi-1xi, yj-2yj-1yj) / A (-, GT) / s (xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / d B (-, AG) / s A(i, j-3) + d A(i-3, j) + d B(i, j) = max

B(i, j-3) + e B(i-3, j) + e C (xi-2xi-1xi, yj-2yj-1yj) / (xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / e (-, yj) / 0 (xi-2xi-1xi, yj-2yj-1yj) / A (-, GT) / s (xi-2xi-1xi, - ) or (-, yj-2yj-1yj) / d

B (-, AG) / s B(i, j-2) + s, if yj-1yj == GT C(i, j) = max C(i, j-1) C ACGGATGCGATCAGTTGTACTACGAGCTGACGGTCCTCAGACTTGATTA There is a close relationship between dynamic programming, FSA, regular expression, and regular grammar Using FSA, you can design more complex alignment algorithms If you can draw the state diagram for a problem,

it can be easily formulated into a DP problem In particular, Hidden Markov Models Will discuss more in a few weeks Heuristic Local Aligners BLAST and alike State of biological databases Sequenced Genomes: Human 3109 Yeast 1.2107 Mouse 2.7109 Rat 2.6109 Neurospora 4107 Fugu fish 3.3108 Tetraodon

3108 Mosquito 2.8108 Drosophila 1.2108 Worm 1.0108 Rice 1.0109 Arabidopsis 1.2108 sea squirts 1.6108 Current rate of sequencing: 4 big labs 3 109 bp /year/lab 10s small labs State of biological databases

Number of genes in these genomes: Vertebrate: ~30,000 Insects: ~14,000 Worm: ~17,000 Fungi: ~6,000-10,000 Small organisms: 100s-1,000s Each known or predicted gene has an associated protein sequence >1,000,000 known / predicted protein sequences Some useful applications of alignments Given a newly discovered gene, - Does it occur in other species? - How fast does it evolve?

Assume we try Smith-Waterman: Our new gene 104 The entire genomic database 1010 - 1011 May take several weeks! Some useful applications of alignments Given a newly sequenced organism, - Which subregions align with other organisms? - Potential genes - Other biological characteristics

Assume we try Smith-Waterman: Our newly sequenced mammal 3109 The entire genomic database 1010 - 1011 > 1000 years ??? BLAST Basic Local Alignment Search Tool Altschul, Gish, Miller, Myers, Lipman, J Mol Biol 1990 The most widely used comp bio tool The most cited paper

Which is better: long mediocre match or a few nearby, short, strong matches with the same total score? score-wise, exactly equivalent biologically, later may be more interesting, & is common at least, if must miss some, rather miss the former BLAST is a heuristic emphasizing the later speed/sensitivity tradeoff: BLAST may miss former, but gains greatly in speed BLAST Main idea: 1. Construct a dictionary of all the words in the query 2. Initiate a local alignment for each word match between query and DB

Running Time: O(MN) However, orders of magnitude faster than Smith-Waterman query DB BLAST Original Version Dictionary: All words of length k (~11 for DNA, 3 for proteins) Alignment initiated between words of alignment score T (typically T = k) query

Alignment: Ungapped extensions until score below statistical threshold Output: All local alignments with score > statistical threshold scan DB query BLAST Original Version k = 4, T = 4 The matching word GGTC

initiates an alignment Extension to the left and right with no gaps until alignment falls < 50% Output: GTAAGGTCC GTTAGGTCC C C C T T C C T G G A T T G C G A Example: A C G A A G T A A G G T C C A G T Gapped BLAST

Pairs of words can initiate alignment Extensions with gaps in a band around anchor Output: GTAAGGTCCAGT GTTAGGTC-AGT C T G A T C C T G G A T T G C G A Added features: A C G A A G T A A G G T C C A G T Example Query: gattacaccccgattacaccccgattaca (29 letters)

[2 mins] Database: All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences) 1,726,556 sequences; 8,074,398,388 total letters >gi|28570323|gb|AC108906.9| Oryza sativa chromosome 3 BAC OSJNBa0087C10 genomic sequence, complete sequence Length = 144487 Score = 34.2 bits (17), Expect = 4.5 Identities = 20/21 (95%) Strand = Plus / Plus Query: Sbjct: 4 tacaccccgattacaccccga 24 ||||||| ||||||||||||| 125138 tacacccagattacaccccga 125158 Score = 34.2 bits (17),

Expect = 4.5 Identities = 20/21 (95%) Strand = Plus / Plus Query: Sbjct: 4 tacaccccgattacaccccga 24 ||||||| ||||||||||||| 125104 tacacccagattacaccccga 125124 >gi|28173089|gb|AC104321.7| Oryza sativa chromosome 3 BAC OSJNBa0052F07 genomic sequence, complete sequence Length = 139823 Score = 34.2 bits (17), Expect = 4.5 Identities = 20/21 (95%) Strand = Plus / Plus Query: Sbjct: 4 tacaccccgattacaccccga 24 ||||||| ||||||||||||| 3891 tacacccagattacaccccga 3911

Example Query: Human atoh enhancer, 179 letters [1.5 min] Result: 57 blast hits 1. gi|7677270|gb|AF218259.1|AF218259 Homo sapiens ATOH1 enhanc... gi|22779500|gb|AC091158.11| Mus musculus Strain C57BL6/J ch... gi|7677269|gb|AF218258.1|AF218258 Mus musculus Atoh1 enhanc... gi|28875397|gb|AF467292.1| Gallus gallus CATH1 (CATH1) gene... gi|27550980|emb|AL807792.6| Zebrafish DNA sequence from clo... gi|22002129|gb|AC092389.4| Oryza sativa chromosome 10 BAC O... gi|22094122|ref|NM_013676.1| Mus musculus suppressor of Ty ... gi|13938031|gb|BC007132.1| Mus musculus, Similar to suppres...

2. 3. 4. 5. 6. 7. 8. 355 1e-95 264 4e-68 256 9e-66 78 5e-12 54 7e-05 44 0.068 42 0.27 42 0.27 gi|7677269|gb|AF218258.1|AF218258 Mus musculus Atoh1 enhancer sequence Length = 1517

Score = 256 bits (129), Expect = 9e-66 Identities = 167/177 (94%), Gaps = 2/177 (1%) Strand = Plus / Plus Query: 3 tgacaatagagggtctggcagaggctcctggccgcggtgcggagcgtctggagcggagca 62 ||||||||||||| ||||||||||||||||||| |||||||||||||||||||||||||| Sbjct: 1144 tgacaatagaggggctggcagaggctcctggccccggtgcggagcgtctggagcggagca 1203 Query: 63 cgcgctgtcagctggtgagcgcactctcctttcaggcagctccccggggagctgtgcggc 122 |||||||||||||||||||||||||| ||||||||| |||||||||||||||| ||||| Sbjct: 1204 cgcgctgtcagctggtgagcgcactc-gctttcaggccgctccccggggagctgagcggc 1262 Query: 123 cacatttaacaccatcatcacccctccccggcctcctcaacctcggcctcctcctcg 179 ||||||||||||| || ||| |||||||||||||||||||| ||||||||||||||| Sbjct: 1263 cacatttaacaccgtcgtca-ccctccccggcctcctcaacatcggcctcctcctcg 1318

Different types of BLAST blastn: search nucleic acid database blastp: search protein database blastx: you give a nucleic acid sequence, search protein database Tblastn: you give a protein sequence, search nucleic acid database tblastx: you give a nucleic database, search nucleic acid database, implicitly translate both into protein sequences Variants of BLAST MEGABLAST: Optimized to align very similar sequences Linear gap penalty NCBI-BLAST:

WU-BLAST: (Wash Univ BLAST) Optimized, added features BLAT: Blast-Like Alignment Tool BlastZ: Optimized for aligning two genomes PSI-BLAST: BLAST produces many hits Those are aligned, and a pattern is extracted Pattern is used for next search; above steps iterated Sensitive for weak homologies Slower Pattern hunter Instead of exact matches of consecutive matches of k-mer, we can look for discontinuous matches

My query sequence looks like: ACGTAGACTAGCAGTTAAG Search for sequences in database that match AXGXAGXCTAXC X stands for dont care Seed: 101011011101 Pattern hunter A good seed may give you both a higher sensitivity and higher specificity You may think 110110110110 is the best seed Because mutation in the third position of a codon often doesnt change the amino acid Best seed is actually 110100110010101111

How to design such seed is an open problem May combine multiple random seeds

Recently Viewed Presentations

  • Note to the Students - Ancient Civilizations

    Note to the Students - Ancient Civilizations

    Note to the Students. This Power Point has no guided notes to go along with it. Instead, it is an interactive presentation . The only materials you need for this section of the unit is a map of Italy (provided...
  • Lana Del rey

    Lana Del rey

    Influences: Britany Spears, Thomas Newman, Eminem, Elvis Presley, Anthony and the Johnsons, and last but not least Kurt Cobain. Biography (Continued) At the age of only fifteen years old, Lana started to experiment with alcohol.
  • Emergy & Complex Systems Day 3, Lecture 7.

    Emergy & Complex Systems Day 3, Lecture 7.

    Venezuela Oil Spill. Miami Huricane. Vietnam War. U.S. Civil War Battle. Florida Strip Mining. Area (E9 m2) Emergy Density (E11 sej/m2) Empower Density (E9 sej/m2/day) Ordering Flux. Attracted Ordering EMergy. Duration of Ordering (days) Attracted Flux.
  • PE / DVT Andrea Wilson May 20/ 2004

    PE / DVT Andrea Wilson May 20/ 2004

    Contrast-enhanced spiral computed tomography (CT) and contrast-enhanced magnetic resonance imaging (MRI) are being promoted for use in the diagnosis of pulmonary embolism.13,14 However, neither technique has sufficient correlation with pulmonary arteriography for subsegmental emboli, which occur commonly.15 Moreover, contrast-enhanced spiral...
  • Metric Conversions - Jefferson County Public Schools

    Metric Conversions - Jefferson County Public Schools

    Metric Conversions 7th Grade Math Online Blueprint Skills SPI 7.4.3 Convert from one unit to another within the same system Metric Conversions 7th Grade Math Online Blueprint Skills SPI 7.4.3 Convert from one unit to another within the same system...
  • South African Cultural Observatory Portfolio Committee: Sports, Arts

    South African Cultural Observatory Portfolio Committee: Sports, Arts

    To brief the Honourable Members of the Portfolio Committee of Sports, Arts and Culture on the current and planned activities of the South African Cultural Observatory. ... Define individual research agenda items required to address the research requirements of the...
  • THE HAMILTONIAN METHOD AMANDA BURKE OUTLINE  Historical background

    THE HAMILTONIAN METHOD AMANDA BURKE OUTLINE Historical background

    Time derivative of q. j. is the generalized velocity of each coordinate The Lagrangian is a function of generalized coordinate and generalized velocity Lagrange's equations of motion: generalized momenta
  • Youth Connexions Hertfordshire Topic Group Monday 5 October ...

    Youth Connexions Hertfordshire Topic Group Monday 5 October ...

    YOUTH CONNEXIONS - Organisation Youth Connexions - established 1 April 2008 Hertfordshire County Council, Youth Service Connexions Hertfordshire Organisations merged to form new INTEGRATED YOUTH SUPPORT SERVICE within CSF Delivery restructured to align with District Council boundaries - 10 integrated...