Yeast whole-genome analysis of conserved regulatory motifs

Interpreting the human genome
  Manolis Kellis
  MIT Computer Science and Artificial Intelligence Lab (CSAIL)
  Broad Institute of MIT and Harvard for Genomics in Medicine

The age of comparative genomics
[Figure: sequenced species. 32 mammals (human, chimp, mouse, rat, rabbit, dog, cow, dolphin, bat, opossum, armadillo, lemur, bushbaby, hyrax, pika, elephant, hedgehog, tenrec, llama, tree shrew, pangolin, etc.), 12 flies, 17 yeasts]

Resolving power in mammals, flies, fungi
  10 mammals: Neutral: 2.57 subs/site (opp: 0.62, 32sps: 4.87); Coding: 1.16 subs/site; Detect: 6-mer at FP 10^-6
  12 flies: Neutral: 4.13 subs/site; Coding: 1.65 subs/site; Detect: 6-mer at 10^-11
  17 yeasts: Neutral: 15.5 subs/site (Yeast: 6.5, Candida: 6.5); Coding: 7.91 subs/site; Detect: 3-mer at 10^-21
[Figure: tree of the 17 yeasts: 9 Yeasts and 8 Candida; labels: Post-duplication, Pre-dup, Diploid, Haploid, P]

Comparative Genomics 101: Conservation and function
  Conserved elements are typically functional (and vice versa)
  For example: exons are deeply conserved to mouse, chicken, fish
  Some conserved elements are still uncharacterized
  How do we make sense of them?
  How do we distinguish each type of functional element?
  Answer: evolutionary signatures (Comp. Genomics 201)
  Tell me how you evolve, I'll tell you who you are
  Patterns of change -> selective pressures -> specific function

Gene identification
  Study known genes
  Derive conservation rules
  Discover new genes

Evolutionary signatures
  Tell me how you evolve, I'll tell you who you are
  Each type of functional element evolves in its own specific ways

Distinguishing genes from non-coding regions
[Alignment of 12 fly species across a splice site]
Dmel  TGTTCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
Dsec  TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
Dsim  TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGCCCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
Dyak  TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-GTTAGCCAGGCGGAGTGCCTTCTACCATTACCGTGCGGACGAGCATGT---GGCTCCAGCATCTTC
Dere  TGTCCATAAATAAA-----TTTACAACAGTTAGCTG-CTTAGCCATGCGGAGTGCCTCCTGCCATTGCCGTGCGGGCGAGCATGT---GGCTCCAGCATCTTT
Dana  TGTCCATAAATAAA-----TCTACAACATTTAGCTG-GTTAGCCAGGCGGAGTGTCTGCGACCGTTCATG------CGGCCGTGA---GGCTCCATCATCTTA
Dpse  TGTCCATAAATGAA-----TTTACAACATTTAGCTG-CTTAGCCAGGCGGAATGGCGCCGTCCGTTCCCGTGCATACGCCCGTGG---GGCTCCATCATTTTC
Dper  TGTCCATAAATGAA-----TTTACAACATTTAGCTG-CTTAGCCAGGCGGAATGCCGCCGTCCGTTCCCGTGCATACGCCCGTGG---GGCTCCATTATTTTC
Dwil  TGTTCATAAATGAA-----TTTACAACACTTAACTGAGTTAGCCAAGCCGAGTGCCGCCGGCCATTAGTATGCAAACGACCATGG---GGTTCCATTATCTTC
Dmoj  TGATTATAAACGTAATGCTTTTATAACAATTAGCTG-GTTAGCCAAGCCGAGTGGCGCC------TGCCGTGCGTACGCCCCTGTCCCGGCTCCATCAGCTTT
Dvir  TGTTTATAAAATTAATTCTTTTAAAACAATTAGCTG-GTTAGCCAGGCGGAATGGCGCC------GTCCGTGCGTGCGGCTCTGGCCCGGCTCCATCAGCTTC
Dgri  TGTCTATAAAAATAATTCTTTTATGACACTTAACTG-ATTAGCCAGGCAGAGTGTCGCC------TGCCATGGGCACGACCCTGGCCGGGTTCCATCAGCTTT

  Protein-coding genes have specific evolutionary constraints
    Gaps are multiples of three (preserve amino acid translation)
    Mutations are largely 3-periodic (silent codon substitutions)
    Specific triplets exchanged more frequently (conservative substitutions)
    Conservation boundaries are sharp (pinpoint individual splicing signals)
  Encode as evolutionary signatures
    Computational test for each of them
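The first two constraints listed above lend themselves to direct computational tests. The following is a minimal sketch, not the speaker's implementation: the function names are hypothetical, rows are assumed to be equal-length strings from one multiple alignment with `-` marking gaps, and the reading frame of the reference row is assumed to be known.

```python
import re

def gap_lengths(row):
    """Lengths of each run of '-' in one aligned sequence."""
    return [len(m.group()) for m in re.finditer(r"-+", row)]

def fraction_frame_preserving(rows):
    """Fraction of gap runs (over all rows) whose length is a multiple of three."""
    lengths = [n for row in rows for n in gap_lengths(row)]
    if not lengths:
        return 1.0  # no gaps at all: nothing disrupts the frame
    return sum(n % 3 == 0 for n in lengths) / len(lengths)

def mismatch_by_codon_position(ref, other, frame=0):
    """Count mismatching columns by codon position (0, 1, 2) of the reference.

    Columns with a gap in either row are not counted as mismatches.
    `frame` is the codon position of the first reference base (an assumption
    the caller must supply).
    """
    counts = {0: 0, 1: 0, 2: 0}
    ref_pos = 0
    for a, b in zip(ref, other):
        if a == "-":
            continue  # insertion relative to the reference: no reference base here
        if b != "-" and a != b:
            counts[(ref_pos + frame) % 3] += 1
        ref_pos += 1
    return counts
```

On a coding region one would expect `fraction_frame_preserving` to be close to 1 and the mismatch counts to pile up at codon position 2 (the third base), whereas non-coding regions should show neither pattern.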

    Combine and score systematically

Signature 1: Reading frame conservation
[Figure: RFC scores along aligned regions in each species, alternating between 100% and lower values (20% to 90%)]

              Mutations   Gaps      Frameshifts
  Genes       30%         1.3%      0.14%
  Intergenic  58%         14%       10.2%
  Separation  2-fold      10-fold   75-fold

Signature 2: Distinct patterns of codon substitution
[Figure: two codon substitution matrices (codon observed in species 1 vs. codon observed in species 2), one for genes and one for intergenic regions]
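One common way to turn the two matrices of Signature 2 into a score is a log-odds ratio: how much more often a codon pair is seen in aligned genes than in aligned intergenic sequence. The sketch below assumes that construction; the slides do not spell out how the CSM is built, so the pseudocount, the function names, and the restriction to gap-free, in-frame A/C/G/T sequences are all illustrative assumptions.

```python
import math
from collections import Counter
from itertools import product

CODONS = ["".join(c) for c in product("ACGT", repeat=3)]

def codon_pairs(seq1, seq2):
    """Split two aligned, gap-free, in-frame sequences into (codon1, codon2) pairs."""
    n = min(len(seq1), len(seq2)) // 3 * 3
    return [(seq1[i:i + 3], seq2[i:i + 3]) for i in range(0, n, 3)]

def train_csm(coding_pairs, noncoding_pairs, pseudocount=1.0):
    """Log-odds matrix: log( P(pair | coding) / P(pair | non-coding) )."""
    def freqs(pairs):
        counts = Counter(pairs)
        total = len(pairs) + pseudocount * len(CODONS) ** 2
        return lambda p: (counts[p] + pseudocount) / total

    f_coding, f_noncoding = freqs(coding_pairs), freqs(noncoding_pairs)
    return {(a, b): math.log(f_coding((a, b)) / f_noncoding((a, b)))
            for a in CODONS for b in CODONS}

def csm_score(csm, seq1, seq2):
    """Sum of log-odds over all aligned codon pairs; positive suggests coding."""
    return sum(csm[p] for p in codon_pairs(seq1, seq2))

# Toy training data (made up): a silent GCT->GCC change is typical of genes,
# a first-position GCT->TCT change is typical of intergenic sequence.
toy_csm = train_csm([("GCT", "GCC")] * 10, [("GCT", "TCT")] * 10)
```

With this toy matrix, `csm_score(toy_csm, "GCTGCT", "GCCGCC")` is positive and `csm_score(toy_csm, "GCTGCT", "TCTTCT")` is negative, which is the behavior the gene versus intergenic comparison relies on.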

Codon substitution patterns specific to genes
  Genetic code dictates substitution patterns
  Amino acid properties dictate substitution patterns
[Figure: human vs. mouse Codon Substitution Matrix (CSM), with codons grouped by amino acid class: aliphatic, aromatic, polar, negative, positive]

Signatures 3, 4, 5, 6, 7, etc.
[Figure: a real exon with its acceptor site and donor site, ESEs inside the exon, and ISEs in the flanking introns]
  Mutation patterns of splicing signals
    Real splice acceptor/donor evolve in specific ways
  Evolution of other motifs associated with splicing
    Exonic/Intronic Splicing Enhancers/Silencers (ESE, ESS, ISE, ISS)
    Density of motif clouds surrounding real exons
  Sharp conservation boundaries
    Relative conservation exon vs. surrounding regions
  Length of longest open reading frame
  Frequency of stop codons in each frame / each species

Putting it all together: probabilistic framework
  Hidden Markov Models (HMMs)

    Generative model, learn emission, transition probabilities
    Easy to train, hard to integrate long-range signals
  Conditional Random Fields (CRFs)
    Discriminative dual of HMMs, learn weights on features
    Easy to integrate diverse signals, gradient ascent for training

From HMMs to CRFs
[Figure: hidden sequence y_{i-1}, y_i, y_{i+1}; feature functions F(i-1), F(i), F(i+1); observed X]

From HMMs to CRFs
  $P(X, Y) = \prod_{i=1}^{L} F(y_{i-1}, y_i, i, X)$
  $V(i, y) = \max_{y'} \left( F(y', y, i, X) \cdot V(i-1, y') \right)$
  $\alpha(i, y) = \sum_{y'} F(y', y, i, X) \cdot \alpha(i-1, y')$
  Generative model
    Transition and emission probabilities
    $F(y', y, i, X) = e_{y, x_i} \cdot a_{y', y}$
  Discriminative model
    For example, features can simply be $e_i$ and $a_{ij}$
    Or pretty much anything:
    $f_3(y', y, i, X)$ = % Heads in $x_{i-50} \ldots x_{i+50}$
    $f_9(y', y, i, X)$ = % CpG in $x_{i-50} \ldots x_{i+50}$
    $f_{17}(y', y, i, X)$ = distance to nearest BLAST hit

Running on real genomes
  Obtain optimal weights (from training set)
    Experimentally-defined, genetics, curation, cDNA
  Apply CRF systematically to new genome
    Revisit existing genomes
    Annotate new genomes

Revisiting fly genome annotation
  Species: D. melanogaster, D. simulans, D. erecta, D. persimilis
  10,845 fully confirmed
  579 fully rejected
  1,454 exons (~800 genes)
  +668 exons in 443 genes
  2,499 not aligned
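The Viterbi recurrence on the "From HMMs to CRFs" slide depends on the model only through the feature function F, which is why the same decoder serves both HMMs and CRFs. The sketch below illustrates that under a few assumptions: it works in log space (so the product becomes a sum), the two states and all probabilities are made-up toy values, and `make_hmm_log_F` is a hypothetical helper for the generative special case where F is emission times transition.

```python
import math

def viterbi(x, states, log_F, log_start):
    """Best state path when the score factorises as a product of F(y', y, i, X).

    log_F(y_prev, y, i, x) returns log F; y_prev is None at position 0.
    """
    V = [{y: log_start[y] + log_F(None, y, 0, x) for y in states}]
    back = []
    for i in range(1, len(x)):
        row, ptr = {}, {}
        for y in states:
            # V(i, y) = max over y' of log F(y', y, i, X) + V(i-1, y')
            best = max(states, key=lambda yp: V[-1][yp] + log_F(yp, y, i, x))
            row[y] = V[-1][best] + log_F(best, y, i, x)
            ptr[y] = best
        V.append(row)
        back.append(ptr)
    # Trace back from the best final state.
    path = [max(states, key=lambda y: V[-1][y])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

def make_hmm_log_F(emit, trans):
    """Generative special case: F(y', y, i, X) = e[y][x_i] * a[y'][y]."""
    def log_F(y_prev, y, i, x):
        score = math.log(emit[y][x[i]])
        if y_prev is not None:
            score += math.log(trans[y_prev][y])
        return score
    return log_F

# Toy model: state C (coding) mostly emits 'h' (high conservation),
# state N (non-coding) mostly emits 'l' (low conservation).
STATES = ["C", "N"]
toy_emit = {"C": {"h": 0.9, "l": 0.1}, "N": {"h": 0.2, "l": 0.8}}
toy_trans = {"C": {"C": 0.9, "N": 0.1}, "N": {"C": 0.1, "N": 0.9}}
toy_start = {"C": math.log(0.5), "N": math.log(0.5)}
toy_log_F = make_hmm_log_F(toy_emit, toy_trans)
```

Swapping `toy_log_F` for any other function of `(y_prev, y, i, x)`, for example a weighted sum of window features, leaves `viterbi` unchanged, which is the point the slide makes about moving from HMMs to CRFs.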

Power of evolutionary signatures
  New genes and exons, dubious genes and exons
  Adjust gene boundaries: ATG, frame, splice site, seq errors
  Signatures more powerful than primary signals
  Recognize unusual gene structures: read-through, uORFs, editing
Towards a revised genome annotation
  Curation: FlyBase integrates prediction with cDNA, protein, literature
  Experimentation: BDGP large-scale functional validation of novel exons

Systematic application leads to
  Exon-level changes (figure: Reading Frame Conservation)
    Ex 1: New genes
    Ex 2: New exons
    Ex 3: Dubious genes
  More subtle changes (figure: Codon Substitution Matrix, genes vs. intergenic)
    Ex 4: Start/end adjustments
    Ex 5: Wrong reading frame
    Ex 6: Splice site adjustments
    Ex 7: Sequencing errors fixed
  Unusual gene structures
    W1: Stop-codon read-through
    W2: uORFs & dicistronic
    W3: Internal frame-shifts

Example 1: Known genes stand out
  Sharp conservation boundaries.
  Known exons stand out.
  High sensitivity and specificity.
[Figure legend: conserved, substitution, insertion, frameshift, gap]

Example 2: Novel multi-exon gene
  1,454 novel exons outside known genes
  Many cluster in new multi-exon genes
  Others are isolated high-confidence exons

Example 2b: Novel exons inside known genes
  (sorry, this example is from human, mouse, dog, rat)
  668 cases in fly
  New candidate alternatively spliced gene forms
  New protein domains

Novel genes and exons
  1,454 novel exons outside existing genes
    60% cluster in 300 multi-exon genes
    40% isolated exons
  668 novel exons inside existing genes
    Alternative splicing: Many with cDNA support
    Nested genes: Few known examples
  Human curation
    Collaboration with FlyBase
    Hundreds of changes in release 5.1, more in 5.2
  Systematic experimentation
    Sue Celniker and Berkeley Drosophila Genome Project (BDGP)
    Thousands of new genes in the pipeline

Example 3: Dubious single-exon gene
  Only evidence was an open reading frame
  Comparative information much stronger

579 Dubious Genes
  Classification approach: Yes / No answer
    Closely related species: both genes and intergenic aligned
    Show very different patterns of mutation
  Comparative analysis provides negative evidence
    Alignment is unambiguous, orthologous, spans entire gene
    Sequence shows mutations and indels in every species
  Weak or missing experimental evidence
    100 of these independently rejected by FlyBase
    These are missing from systematic clone collections
    Only 34 (6%) have assigned names (vs. 36% of all fly genes)

Example 4: Start codon adjustment
  Codon substitution patterns suggest new start in 200 genes
  Score each substitution using Codon Substitution Matrix (CSM)
  CG6664/FBtr0100439

[Figure: the annotated start codon (ATG) is followed by poor CSM scores (atypical substitutions); the conserved start codon (ATG) is followed by high CSM scores (protein-like substitutions)]

Example 5: Gene annotated on wrong reading frame
  cDNA evidence supports overlapping reading frames, both open
  Annotation traditionally selects longer one
  Conservation enables distinguishing the two
  Shorter ORF is the correct one
[Figure: mRNA supports both ORFs; annotated ORF (345 nt), CG7738-RA, is incorrect; real ORF (315 nt); conservation only supports shorter ORF]

Example 6: Incorrect splice causes wrong frame
  Second exon annotated in the wrong frame
  Due to splice site boundary error
  Correction is supported by cDNA evidence
[Figure: first exon: correct frame; 2nd exon: incorrect frame; fix exon boundary]

Example 7: Detect seq. errors / strain mutations
  Insertion/deletion causes frameshift
  Conservation signature shifts from frame 1 to frame 2
  All other species disagree with D. melanogaster indel
  Sequencing error or species-specific mutation
chr3R:6,953,865-6,953,927 (Ugt86Dd)
dm      CAGTACATATTTGTGGAGAGTTACTTGAAAG-CTTGGCAGCTAAGGGTCATCAGGTGACCGTTA
droSec  CAGTACATATTTTTGGAGAGCTACTTGAAAGCCTTGGCAGCTAAGGGTCACCAGGTGACCGTTA
droSim  CAGTACATATTTATGGAGAGCTACTTGAAAGCCTTGGCAGCTAAGGGTCACCAGGTGACCGTTA
droYak  CAGTACATTTTTGTGGAGACCTACTTGAAAGCCCTGGCAGCCAAGGGTCACCAGGTGACCGTTA
droEre  CAGTACATTTTTGTGGAGACCTACTTGAAAGCCCTGGCAGCTAGGGGTCACCAGGTGACTGTTA
droAna  CAGTACATCTTTGTGGAGACCTATCTGAAGGCTTTGGCCGACAAAGGTCACCAGGTGACTGTTA
droWil  CAATACATATTCATTGAGGCGTATCTAAAGGCATTGGCTGCCAAAGGACATCAGTTAACTGTGA
droMoj  CAGTACATATTCGCCGAGGCGTATTTGAAGGCGCTAGCAGCCCGGGGCCATGAGGTCACCGTGA
droVir  CAGTATATATTTGCCGAGTCGTATTTGAAGGCCTTGGCAGCGCGGGGTCATGAGGTGACAGTGA
frame   01201201201201201201201201201201 2012012012012012012012012012012
[Figure labels: conservation in correct frame; frame-shift (sequencing error / recent mutation); conservation in 2nd frame]

Example 8: Dubious gene is a miRNA transcript
  Evolutionary signatures reveal specific function

Unusual genes 1: Stop codon read-through
  Method #1 (single exons)
    112 events, 95 extending known genes
    Manual curation: 82
    Enriched in neuronal function
  Method #2 (after splicing)
    256 events, looser cutoff, large overlap, needs manual curation
    Enriched in transcription factors
[Figure: protein-coding conservation up to the stop codon; stop codon read-through; continued protein-coding conservation; no more conservation after the 2nd stop codon]

Unusual genes 2: Polycistronic messages / uORFs
  Method
    High-scoring ORFs with cDNA evidence
    Disjoint from the annotated ORF
  Results
    217 cases
[Figure: protein-coding conservation in the 5' UTR]

Unusual genes 3: Frame-shift in the middle of exons
  Method
    Exons changing high-scoring frame
    Far from splice junctions
  Results
    68 cases in 44 genes
chrX:2,226,518-2,226,639 (CG14047)
dm      GACTATTTCAACAATCAGCAGCGCGAGCGACACTACCAGCTCCGGCGGCAGAGCCAGCGGCAGACC---TCCGAGATTTGTACCGCCGCCACCGCCTCCGCGTCGCTTGCTCC
droSim  GACTATTTCAACAACCAGCAACGCGAGCGACACTACCAGCTCCGGCGGCAGAGCCAGCGGCAGACC---TCCGAGATTTGTACCGCCGCCACCGCCTCCGCGTCGCTTGCTCC
droSec  GACTATTTCAACAACCAACAACGCGAGCGACACTACCAGCTCCGGCGGCAGAGCCAGCGGCAGACC---TCCGAGATTTGTACCGCCGCCACCGCCTCCGCGTCGCTTGCTCC
droYak  GACTACTTCAACAATCAGCAACGCGAGCGACACTACCAGCTCCGGCGGCAGAGCCAGCGGCAGACC---GGCGAGATTTGTACCGCCTCCACCGCCTCCGCGTCGCTTGCTGC
droEre  GACTATTTCAACAATCAGCAACGCGAGCGACACTACCAGCTCCGGCGGCAGAGCCAGCGGCAGACC---GCCGAGATTTGTACCGCCGCCACCGCCTCCGCGTCGCTTGCTTC
droAna  GACTACTACAACAATCAGCAGCGGGAGCGGCACTACCAGCTCCGGCGGCAGAGCCAGCGGCAGGCCAGCGGCGAAGTTCGTCCCTCCTCCGCCGCCTCCGCGACGTTTGCTTC
droPse  GACTACTACAACAACCAGCAGCGGGAGCGACACTACGAGCTCCGGAGGCAGAGCCAGCGGCAGGCC---AGCGAGGTTTATACCACCGCCGCCGCCTCCGCGTCGCTTGCTGC
droPer  GACTACTACAACAACCAGCAGCGGGAGCGACACTACGAGCTCCGGAGGCAGAGCCAGCGGAAGGCC---AGCGAGGTTTATACCACCGCCGCCGCCTCCGCGTCGCTTGCTGC
droWil  GACTACTACAACAATCAGCAGAGGGAGCGACACTACGAGCAACGTCGCCAAAGCCAGCGGCAGGCC---AGCCAAATTTATACCACCGCCACCGCCTCCACGTCGACTGCTGC
droMoj  GACTACTACAACAACCAGCAGCGGGAGCGGCACTACCAGCTGCGCCACCAGAGCCAACGTCAAGCC---ACCGAGATTTATACCACCACCGCCGCCGCCTCGTCGTCTGCTGC
droVir  GACTACTACAACAACCAACAGCGGGAGCGGCACTACCAGCAGCGCCGCCAGAGCCAACGTCAAGCC---ACCGAGATTCATTCCACCGCCGCCGCCGCCTCGTCGTCTGCTGC
droGri  GACTACTACAACAATCAGCAGCGGGAGCGGCACTATCAACAGCGTCGCCAGAGTCATCGTCAAGCC---ACCGAGATTTATACCACCACCACCGCCACCTCGTCGTCTATTGC
frame   012012012012012012012012012012012012012012012012012012012012012012 01201201201201201201201201201201201201201201
[Figure labels: 012: Frame 1 is high-scoring; 120: Frame 2 is high-scoring]

Initial results for the whole human genome
  Species: Human, Dog, Mouse, Rat
  9,862 fully confirmed
  1,065 fully rejected
  454 novel (2,591 exons)
  7,717 refined
  1,919 not aligned
  Fully rejected genes: weak/no evidence
  New exons: existing & novel experimental evidence
  Need: large-scale functional annotation for novel genes

Discriminative framework shows continued increase in power
[Figure: score histograms. Reading frame conservation (RFC) score for 2 species (Dmel, Dpse), 3 species (Dmel, Dyak, Dpse), 5 species (Dmel, Dyak, Dpse, Dwil, Dgri), and 12 flies; codon substitution matrix (CSM) score for 2 species and 12 species]

Overview
  Part 1. Genome interpretation
    Evolutionary signatures of genes
    Revisiting the human and fly genomes
    Unusual gene structures
  Part 2. Gene regulation
    Regulatory motif discovery
    microRNA regulation
    Enhancer identification
  Part 3. Genome evolution
    Phylogenomics
    The two forces of gene evolution
    Accurate gene trees in complete genomes

Who's actually doing the work
  Mike Lin: Gene identification
  Alex Stark: Fly motifs and miRNAs
  Ameya Deoras: Spectral genomics
  Josh Grochow: Network motif discovery
  Pouya Kheradpour: Human enhancers
  Erez Lieberman: Motif evolution
  Matt Rasmussen: Phylogenomics
  Aviva Presser: Network evolution
