De novo motif finding using ChIP-seq

Presenter: Zhizhuo Zhang
Supervisor: Wing-Kin Sung
02/25/2020, Copyright 2009 @ Zhang ZhiZhuo

Outline
- Introduction to ChIP-seq data
- The impact of ChIP-seq's properties on motif finding
- Our proposed algorithm (Pomoda)
- Experimental results
- Exploring the center distribution

ChIP-seq Technique

Comparison with ChIP-chip

What ChIP-seq Means to Us
Sequences -> Motif finding tools -> Motif models
- More data: good news for data mining, but necessary for de novo motif finding
- Higher resolution: the job becomes easier (localization)

How Large Is the Data?
The definition of "large data" keeps changing:
- 10 years ago: tens of sequences (promoter sequences: MEME, AlignACE)
- 5 years ago: hundreds of sequences (ChIP-chip: Weeder)
- 2 years ago: thousands of sequences (higher-throughput ChIP-chip: Trawler, Amadeus)
- Now: tens of thousands of sequences (ChIP-seq: ?)

What Does Higher Resolution Mean?
- Finding the main motif (that of the TF targeted by the antibody) becomes an easy job.
- The main motif is strongly over-represented.
- The peak range is only about 50 bp, so simply aligning all the peak regions yields a good motif.
- Our focus may therefore shift from the main TF to the TFs that work with it.

Localization = Over-representation
[Figure: two histograms, AR and GATA, of frequency against location bins 1 to 61.]

Peak Oriented Motif Discovery
What information about a peak can be helpful?
- Peak intensity
- Peak location
Our targets: not only the main motif, but also the co-motifs sitting around the main motif.

POMODA: Peak Oriented Motif Discovery Algorithm
[Figure: sequences centered on the ChIP-seq peak of the main motif, showing the main motif, a co-motif, and a pattern that should be noise because it shows no distance preference relative to the main motif.]

Motif Modeling
- String motif: smaller search space; enables fast string-matching algorithms.
- PWM motif: a more precise approximation of the real motif; statistically sound. (PWM: Position Weight Matrix)

Background Modeling
- Organism-specific background: hard to capture the negative information in the background.
- Position-specific background: reveals the biological context and makes the negative information easier to capture.

Position-Specific Background
Given the peak position in ChIP-seq, we can identify not only the active position (center) of the master TF but also the active region of its co-motifs.
[Figure: a sequence around the ChIP-seq peak, with an active region at the peak flanked by a background region on each side.]
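One way to make the position-specific background concrete is to count a pattern's occurrences inside a central window (the active region) and in the flanks (the background regions) of peak-centered sequences. The sketch below is only an illustration: the function name `count_center_and_background`, the exact-match search, and the rule of assigning a site to a region by its midpoint are assumptions, not details taken from the slides.

```python
def count_center_and_background(sequences, kmer, window_size):
    """Count exact k-mer sites in the central window and in the flanks.

    Each sequence is assumed to be centered on its ChIP-seq peak.
    A site belongs to the active region if its midpoint falls inside
    the window of `window_size` bases around the sequence midpoint
    (an illustrative convention, not one stated in the slides).
    """
    if not kmer:
        raise ValueError("kmer must be non-empty")
    center_occ = 0
    bg_occ = 0
    half_k = len(kmer) // 2
    for seq in sequences:
        mid = len(seq) // 2
        lo = mid - window_size // 2   # first base of the active region
        hi = lo + window_size         # one past its last base
        start = seq.find(kmer)
        while start != -1:
            if lo <= start + half_k < hi:
                center_occ += 1
            else:
                bg_occ += 1
            # Advance by one base so overlapping sites are also counted.
            start = seq.find(kmer, start + 1)
    return center_occ, bg_occ


# Two toy peak-centered sequences of length 20: one site near the
# center, one site at the left edge.
toy = ["TTTTTTTGGTCATTTTTTTT", "GGTCATTTTTTTTTTTTTTT"]
print(count_center_and_background(toy, "GGTCA", window_size=8))
```

With counts of this kind, a pattern that is frequent near the peak but rare in the flanks stands out even if it is common genome-wide.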

Center Enrichment Score
We do not know the exact size of the active region, and it may vary between motifs. We therefore define an odds-ratio score based on a dynamic window size:

$$\mathrm{Score} = \max_{\mathrm{windowsize}} \left\{ \frac{\mathrm{CenterOcc} \times (\mathrm{Seqlen} - \mathrm{windowsize})}{\mathrm{BgOcc} \times \mathrm{windowsize}} \right\}, \quad \text{where } \mathrm{CenterOcc} \ge \text{minimal support}$$

Algorithm Overview
1. Seed finding
2. PWM extending and refinement
3. Redundant motif filtering

Seed Finding
Enumerate all length-6 patterns, for example:
GGTCAC, CGGTCA, GGGTCA, AGGTCA, AACTTG, ATGACC, CAGGTC, AGGTCG, CGTGAC, CTGACC

Seed PWM (consensus AACTTG):

Pos  1     2     3     4     5     6
A    0.97  0.97  0.01  0.01  0.01  0.01
C    0.01  0.01  0.97  0.01  0.01  0.01
G    0.01  0.01  0.01  0.01  0.01  0.97
T    0.01  0.01  0.01  0.97  0.97  0.01

PWM Extending & Refinement
Encapsulate the core PWM in a wider PWM. For example, we implant the length-6 PWM into a length-26 PWM as follows (core PWM at positions 10 to 15):

Pos  1     2     ...  9     10    11    12    13    14    15    16    ...  25    26
A    0.25  0.25  ...  0.25  0.97  0.97  0.01  0.01  0.01  0.01  0.25  ...  0.25  0.25
C    0.25  0.25  ...  0.25  0.01  0.01  0.97  0.01  0.01  0.01  0.25  ...  0.25  0.25
G    0.25  0.25  ...  0.25  0.01  0.01  0.01  0.01  0.01  0.97  0.25  ...  0.25  0.25
T    0.25  0.25  ...  0.25  0.01  0.01  0.01  0.97  0.97  0.01  0.25  ...  0.25  0.25

PWM Extending & Refinement (continued)
Background instances: AAGGTCACC, TGGGTCAAG, GAGGTCATT, TGGGTCAGG, CTGGTCATA
Center instances: AAGGTCACC, TGGGTCACG, CTGGTCACA
Select the best column to update based on the center PWM and the background PWM.
GGTCANNNNC

Redundant Motif Filtering
1. Positions overlap by more than 5%
2. PWM divergence less than 0.18

PWM divergence:

$$ED(P^1, P^2) = \frac{1}{\sqrt{2}\, l} \sum_{i=1}^{l} \sqrt{\sum_{b \in \{A,C,G,T\}} \left(P^1_{i,b} - P^2_{i,b}\right)^2}$$

Results Comparison
1. Datasets:
   1. MCF7 dataset (ER), 4361 sequences
   2. LNCAP dataset (AR), 10000 sequences
2. Evaluate PWM divergence against Transfac motifs, as in Harbison et al. (2004) and Amadeus (2008)
3. +/- 5000 bases from the peak for Pomoda, and +/- 200 bases from the peak for the other algorithms
4. Each motif finder reports its top 20 results

Results
Columns: Cell, TF, Pomoda, Amadeus, Trawler, Weeder
Legend (PWM divergence): <0.12, <0.18, <0.24
- MCF7: ER, HNF3, GATA, AP1, SP1, BACH1, E2F, OCT1, AP4
- LNCAP: AR, HNF3, NF1, GATA, OCT, ETS
[The per-tool entries of this table are not preserved in the text.]

Comparison

Feature | Pomoda | Amadeus | Trawler | Weeder
Background model | Position-specific | Organism-specific | Organism-specific | Organism-specific
Motif model | PWM (k-mer exact match) | PWM (k-mer with mismatches) | PWM (IUPAC string in initial scan) | k-mer with mismatches
Algorithm | Exhaustive search + PWM column updating | Add mismatches, merge (recursively), EM | Exhaustive search + clustering | Exhaustive search
Motif length | Variable | Fixed | Semi-variable | Semi-variable
Gap detection | Supported | Not supported | Not supported | Not supported
Localization | Center window size | Over-represented bins | Not supported | Not supported
Sequence weighting | Supported | Not supported | Not supported | Not supported
Average running time | 30 min | 93 min | >4 hours | >4 hours
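The center enrichment score from the Center Enrichment Score slide can be sketched in a few lines, reading it as the odds ratio CenterOcc x (Seqlen - windowsize) / (BgOcc x windowsize) maximized over candidate window sizes, subject to CenterOcc reaching a minimal support. This is a sketch under stated assumptions rather than Pomoda's actual code: the function name, the representation of sites as offsets from the peak center, the list of candidate window sizes, and the pseudocount that guards against BgOcc = 0 are all hypothetical.

```python
def center_enrichment_score(offsets, seq_len, window_sizes,
                            min_support=1, pseudocount=1.0):
    """Return (best_score, best_window) for one candidate motif.

    offsets      : site positions relative to the peak center (0 = center),
                   pooled over all peak-centered sequences
    seq_len      : length of each peak-centered sequence
    window_sizes : candidate sizes of the central (active) window
    pseudocount  : added to the background count to avoid division by
                   zero (an assumption, not part of the slide's formula)
    """
    best_score, best_window = 0.0, None
    for w in window_sizes:
        if not 0 < w < seq_len:
            continue  # the window must leave some background region
        center_occ = sum(1 for off in offsets if abs(off) <= w / 2)
        if center_occ < min_support:
            continue  # too few central sites to trust this window
        bg_occ = len(offsets) - center_occ
        # Odds ratio: site density in the window over density outside it.
        score = center_occ * (seq_len - w) / ((bg_occ + pseudocount) * w)
        if score > best_score:
            best_score, best_window = score, w
    return best_score, best_window


# Three sites near the center and two far away, in 2000 bp sequences.
offsets = [0, 5, -10, 400, -900]
score, window = center_enrichment_score(offsets, seq_len=2000,
                                        window_sizes=[50, 200])
print(score, window)
```

Scanning several window sizes is what lets the score adapt to motifs whose active regions differ in width: a tightly localized motif scores best with a narrow window, a loosely localized co-motif with a wider one.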

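The seed PWM, its implantation into a wider PWM, and the PWM divergence used for redundancy filtering can be illustrated together. The following is a minimal sketch, not Pomoda's implementation: the function names and the list-of-dicts PWM representation are hypothetical, the default offset is chosen only so that the core lands at positions 10 to 15 as in the slide's table, and the divergence is written in the form used by Harbison et al. (2004), 1/(sqrt(2) x l) times the sum over columns of the Euclidean distance between columns, which is an assumed reading of the slide's formula.

```python
import math

BASES = "ACGT"


def seed_pwm(kmer, match=0.97, mismatch=0.01):
    """Turn a k-mer into a PWM: 0.97 for the observed base, 0.01 otherwise."""
    return [{b: (match if b == base else mismatch) for b in BASES}
            for base in kmer]


def embed_pwm(core, width=26, offset=9):
    """Implant a core PWM into a wider PWM with uniform (0.25) flanks.

    `offset` is the 0-based start column of the core; 9 places a
    length-6 core at 1-based positions 10 to 15.
    """
    if offset < 0 or offset + len(core) > width:
        raise ValueError("core does not fit inside the wide PWM")
    wide = [{b: 0.25 for b in BASES} for _ in range(width)]
    for i, column in enumerate(core):
        wide[offset + i] = dict(column)
    return wide


def pwm_divergence(p1, p2):
    """Average per-column Euclidean distance, scaled by 1/sqrt(2)."""
    if len(p1) != len(p2):
        raise ValueError("PWMs must have the same length")
    total = sum(math.sqrt(sum((c1[b] - c2[b]) ** 2 for b in BASES))
                for c1, c2 in zip(p1, p2))
    return total / (math.sqrt(2) * len(p1))


core = seed_pwm("AACTTG")
wide = embed_pwm(core)
print(len(wide), wide[9]["A"], wide[8]["A"])
print(pwm_divergence(core, seed_pwm("AACTTG")))
```

The 1/sqrt(2) factor keeps the divergence between 0 and 1, since sqrt(2) is the largest possible distance between two probability columns; under that scaling the slide's 0.18 threshold can be read as a fraction of the maximum.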
Center Distribution
[Figure: center distribution for Foxa1; horizontal axis from -1900 to 1700, vertical axis from 0 to 1600.]
Mixture model: the equation is only partly legible. It involves an exponential term in x, a factor of the form (1 - ...), and a constant c, with the values 0.1503, 1.94, 1.3738 and c = 0.0506. Range: x × binsize = 194 bp.

Thank You!
