Week 1 Notes/Slides/Information for the Purdue Big Data For ...


Week 1 Notes/Slides/Information for the Purdue Big Data For Biologists Workshop 2016
Fleet 2016

Welcome!

DAY 1 Session 1: WELCOME
Encouraging Big Data Thinking in Biomedical Research

Introductions
Who are you?
Where are you from?
What is your research interest?
Why are you interested in big data?

Instructors and Teaching Assistants
Main Instructors:
James C. Fleet, PhD (Nutrition Science)
Wanqing Liu, PhD (Medicinal Chemistry and Molecular Pharmacology)
Pete Pascuzzi, PhD (Libraries)
Min Zhang, PhD (Statistics)
Teaching Assistants:
Chen Chen (Statistics)
Min Ren (Statistics)
Harley Schawadron

Guest Lecturers
Week 1:
Doug Crabill (Purdue University)
Sean Davis (National Cancer Institute)
Nadia Atallah (Purdue University)
Ingenuity Systems Staff
Week 2:
Jennifer Neville (Purdue University)
Shi Li Lin (Ohio State University)
Martin Lundquist (Johns Hopkins University)

Strengths/Weaknesses Of Hypothesis-Driven Science
[Figure: illustration credited to K. Hokusai, with regions labeled A through I]

Disruptive Technologies Propel Scientific Advancement
Sequenced genomes helped scientists see that data generation need not be a barrier to scientific progress.
2001 Breakthrough of the Year
What can we do with all this information?
Genomic: GEO, ENCODE, GTEx
Genetic: WebQTL/JAX, 1000 genomes
Metabolomic: HMDB
Proteomic: ProteomicsDB
Phenotypes: JAX, TCGA

New technologies bring challenges to using and integrating data
[Figure: 12 brain-imaging modalities; axes: brain-imaging modality, subjects. Dinov (2016) J Med Stat Inform]
Requires Multidimensional Thinking and Analytical Tools

What is Biological Big Data Science (BBDS)?
An approach that integrates multidimensional, high-density, variously formatted data types to gain insight into the function or regulation of biological systems.
Data Velocity: batch, periodic, near real time, real time
Data Volume: Mb, Gb, Tb, Pb
Data Variety: physiologic, omic, image, population
Two new Vs: Veracity, Value

Big Data in Biomedicine Has Two Flavors

Omics Driven (Genotyping, Gene expression, NGS data): Basic Biological Understanding and Treatment Discovery
Payer-provider Driven (Electronic Medical Records, Pharmacy information, Insurance Records): Optimize Healthcare Delivery and Economics

Where do traditional biomedical researchers fit?
[Figure labels: Biological Big Data Science (BBDS); Computational Disciplines; Analytical Disciplines; Life Sciences Focus; Analysis; Characterization. Levels of Scale: Gene, Cell, Organism, Community, Global]

Central Dogma of Biology Has Been Reshaped by Big Data
[Figure labels: 1000 Genomes, dbSNP; ENCODE; GTEx, BioGPS; ProteomicsDB; Phenotype. Modified from Doerge (2002) NRG 3:43]

Is Traditional Science Outdated?
"Faced with massive data, this approach to science (hypothesize, model, test) is becoming obsolete." Chris Anderson, Editor, Wired Magazine, 2008
Hypothesis-driven: Needs falsifiable hypothesis; Needs appropriate data; Statistical significance; Small effects in lots of data; Determines causation
vs
Data-driven: No hypothesis needed; Exploratory?; Full data not needed; Post-hoc explanation; Statistical significance?

The Wrong Way to Do Bioinformatics
[Figure labels: Experimental Scientist; Well Designed Experiment; "And then a miracle happens"; Results; Interpretation]

The Challenge of BBDS
[Figure labels: Experimental Scientist; Well Designed Experiment; Core Facility; Sample Analysis; Raw Data; Processed Data 1 through Processed Data n; Functional Analysis; Visualization; Results; Interpretation. Recurring labels at each step: Computing, Algorithms, Statistics]

Big Data Challenges in Biomedicine
From the NIH Big Data to Knowledge (BD2K) program
1) Locating data
2) Getting access to data
3) Organizing, Managing, and Processing data
4) Computational issues: Infrastructure; Speed (computer and algorithms)
5) Analytical methods
6) Training researchers to use BBD

Stages of Big Data Education
Stages (increasing Skill + Independence): NAIVE, AWARENESS, FAMILIARITY, LOW LEVEL ABILITY, HIGH LEVEL ABILITY, EXPERT
Capabilities, in increasing order: Appreciate Value; Vocabulary to Talk with Experts to Define Goals; Use Pre-existing Software; Code, Develop Tools; Creative Work, Novel Approaches

What We'll cover during the workshop
Biological Goals:
Week 1
Unit 1: Microarray
Unit 2: Next Generation Sequencing
Regulation of gene expression
Week 2
Unit 3: Biomarker Discovery
Phenotype prediction
Unit 4: Genetic Variation
Genotype-phenotype

What We will cover during the workshop
Week 1
Unit 1: Microarray
Unit 2: Next Generation Sequencing
Week 2
Unit 3: Biomarker Discovery
Unit 4: Genetic Variation
Technical Goals:
Analysis pipelines
Statistical issues
Visualization
Functional annotation
Databases
Project management
Computation and programming

Day 1 Session 2: Working with the Purdue Computer Infrastructure
Doug Crabill
Department of Statistics
Purdue University

Sites to Understand Computing
UNIX operating system (Learn UNIX): http://www.tutorialspoint.com/unix/index.htm
Linux operating system: http://www.tutorialspoint.com//operating_system/os_linux.htm
R coding: http://bioinformatics.knowledgeblog.org/2011/06/21/using-r-a-guide-for-complete-beginners/
https://www.r-project.org/about.html

Monday BREAK #1

Day 1 Session 3: Data Repositories and Preprocessed Data Sites
James C. Fleet, PhD, Distinguished Professor, Department of Nutrition Science
Pete Pascuzzi, PhD, Assistant Professor, Purdue Libraries

Data Archives
Archive | Web link | Description
Trans-NIH BioMedical Informatics Coordinating Committee (BMIC) | https://www.nlm.nih.gov/NIHbmic/nih_data_sharing_repositories.html | NIH Data Sharing Repositories sites
Gene Expression Omnibus (GEO) | http://www.ncbi.nlm.nih.gov/geo/ | NCBI; transcriptome and ChIP-seq datasets
Array Express | http://www.ebi.ac.uk/arrayexpress/ | EMBL-EBI repository to archive functional genomics data
European Nucleotide Archive (ENA) | http://www.ebi.ac.uk/ena | Comprehensive record of the world's nucleotide sequencing information
The Cancer Genome Atlas (TCGA) | http://cancergenome.nih.gov/ | Multi "omic" phenotype characterization of tumors
Proteomics IDEntifications (PRIDE) | http://www.ebi.ac.uk/pride/archive/ | European proteomics datasets
Metabolomics Workbench | http://metabolomicsworkbench.org/standards/nominatecompounds.php | metabolomic datasets

GEO Datasets
Week 1
GSE15947: Time course of 1,25(OH)2 D treated RWPE1 cells
GSE54783: The Osteoblast to Osteocyte Transition: Epigenetic changes and response to the vitamin D3 hormone
GSE #: Accession number for a complete dataset that is submitted to GEO
GSM #: Accession number for a specific sample within a dataset
GPL #: The platform used to generate a dataset
SRX #: Accession number for a sample generated by NGS that is deposited in the Short Read Archive (SRA)

Data Archives
Archive | Web link | Description
Oncomine | https://www.oncomine.org/resource/login.html | 715 microarray datasets from 19 cancers
Gene Expression across Normal and Tumor tissue (GENT) | http://medical-genome.kribb.re.kr/GENT/ | gene expression patterns in human cancer from Affy Chips (+ 1000 cell lines)
BioGPS | http://biogps.org/#goto=welcome | Human/rat/mouse transcript data by tissue
cBioPortal | http://www.cbioportal.org/ | TCGA cancer genomics
Genotype-Tissue Expression project (GTEx) | http://www.gtexportal.org/home/ | human, multi-tissue gene expression and gene variation for eQTL
Immunological Genome Project (Immgen) | http://www.immgen.org/ | transcriptome data from cultured mouse immune cells
Human Brain Transcriptome | http://hbatlas.org/ | transcriptome and associated metadata for developing and adult human brain
NHLBI Kidney Transcriptome database | https://hpcwebapps.cit.nih.gov/ESBL/Database/Transcriptomic/index.html | Segment-specific expression in rat kidney
Kidney Systems Biology Project | https://hpcwebapps.cit.nih.gov/ESBL/Database/ | Multi-omic database from rat and mouse studies
Saccharomyces Genome Database | http://www.yeastgenome.org/transcriptome-data-in-yeastmine | Integrated biological information on budding yeast
miRBase | http://mirbase.org/ | published miRNA sequences, annotation. Expression dataset links available
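The accession prefixes described in the GEO Datasets notes (GSE, GSM, GPL, SRX) lend themselves to a small lookup. The following is a minimal sketch, not part of the workshop materials; the function name `describe_accession` and the exact wording of the descriptions are illustrative assumptions.

```python
import re

# Prefix meanings as summarized in the GEO Datasets notes.
# The dictionary and function names are hypothetical, for illustration only.
ACCESSION_TYPES = {
    "GSE": "complete dataset submitted to GEO",
    "GSM": "specific sample within a dataset",
    "GPL": "platform used to generate a dataset",
    "SRX": "NGS sample deposited in the Short Read Archive (SRA)",
}

# A recognized prefix followed by one or more digits.
_ACCESSION_PATTERN = re.compile(r"^(GSE|GSM|GPL|SRX)(\d+)$")


def describe_accession(accession: str) -> str:
    """Return a short description of an accession based on its prefix."""
    match = _ACCESSION_PATTERN.match(accession.strip().upper())
    if match is None:
        raise ValueError(f"Unrecognized accession: {accession!r}")
    prefix, number = match.groups()
    return f"{prefix}{number}: {ACCESSION_TYPES[prefix]}"


if __name__ == "__main__":
    # The two Week 1 datasets named in the notes.
    for acc in ["GSE15947", "GSE54783"]:
        print(describe_accession(acc))
```

A check like this can be useful when a sample sheet mixes dataset, sample, and platform identifiers and each needs to be routed to a different download step.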

Day 1 Session 4: Understanding R and Bioconductor
James C. Fleet, PhD, Distinguished Professor, Department of Nutrition Science
Pete Pascuzzi, PhD, Assistant Professor, Purdue Libraries

RStudio is a Graphical User Interface (GUI) that lets you use R more conveniently
[Figure: RStudio screenshot with the Script Window, Bioconductor package Window, and Command Window labeled]

Day 1 Session 5-6: Microarray Overview, Data Processing and QC
James C. Fleet, PhD, Distinguished Professor, Department of Nutrition Science
Pete Pascuzzi, PhD, Assistant Professor, Purdue Libraries

1,25(OH)2 D Induces Gene Transcription through the Vitamin D Receptor

[Figure labels: 25-OH D; CYP27B1; 1,25(OH)2 D. Fleet et al. (2012) Biochem J 441:61]
Classical Role: Regulate whole body calcium metabolism. NGS project: Osteoblast and osteocytes
Novel Role: Prevent and treat cancer. Microarray project: Normal prostate epithelial cell

VDR Deletion Accelerates Cancer Development in Mice
[Figure panels: Colon Cancer (APCmin mice); Prostate Cancer (LPB-Tag mice); Breast Cancer (MMTV-neu mice). Axis labels: tumor size (mm), tumor weight (mg), tumor incidence, tumor-free mice (%), age (wks), age (mo). Genotypes: WT, HT, KO]
Larriba et al., 2011, PLoS One 6:e23524
Mordan-Mcombs et al., 2010, JSBMB 121:368
Zinser and Welsh 2004, Carcinogenesis 25:2361

Strategies of Cancer Prevention
[Figure labels: initiation, promotion, progression; Cancer Prevention; Diagnosis; Cancer Treatment; Survival without Cancer; Survival with Cancer; x-axis: Time (y). Umar et al. (2012) Nat. Rev. Cancer 12:835]

Research Paradigm: Study Gene Expression with DNA Microarray
Kovalenko et al. (2010) 1,25 dihydroxyvitamin D-mediated orchestration of anticancer, transcript-level effects in the immortalized, non-transformed prostate epithelial cell line, RWPE1. BMC Genomics 11:26
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2820456/

Study Design
RWPE1 cells: Human prostate epithelial cell line (ATCC CRL-11609)
54 y old white male
Peripheral zone of normal healthy prostate
HPV18 immortalized
Not tumorigenic in nude mice
GEO ID = GSE15947
Treatment: 60% confluent culture, media change, 100 nM 1,25(OH)2 D or EtOH control
Time points: 6 h, 24 h, 48 h
2 x 3 factorial design, n = 4/group (24 total samples)
Kovalenko et al. (2010) BMC Genomics 11:26

Affymetrix Microarray Analysis Workflow
[Figure: workflow. Steps: Well Designed Study; High Quality RNA; Affymetrix analysis; Affymetrix Raw Data; Affymetrix QC analysis; Process Data (RMA Normalize; MAS5 (P/A)); Additional QC analysis; Normalized Vetted Data; Data Filters (P/A call, Fold); Statistical Analysis; Differentially Expressed Gene List; Clustering and visualization; Pathway and Geneset Analysis; Network building; Interpret Experiment]

Affymetrix Tiling Arrays
11-25 probes or probe pairs per gene
Perfect Match = 25 bp
Mismatch in one base: used to define background
Algorithms from Affymetrix and others

Affymetrix GeneChip Microarrays
[Figure labels: Probe Array Prehybridization; Probe Cell; cDNA library and Biotin Label; Hybridization to Array; Probe Array Post-hybridization; Probe Set; Probe Pair (PM/MM)]
U133 Plus 2 GeneChip: 54,000 probe sets
U133 Plus 2 GeneChip: 11 probe pairs/gene
25 base long probes
40 x 10^7 copies/cell
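For reference, the GSE15947 layout given on the Study Design slide (two treatments, three time points, n = 4 per group) can be enumerated as a sample sheet. This is a minimal sketch; the sample ID format and field names are hypothetical and do not come from the GEO record.

```python
from itertools import product

# Factors taken from the Study Design slide.
TREATMENTS = ["EtOH", "1,25(OH)2D"]  # vehicle control vs 100 nM 1,25(OH)2 D
TIME_POINTS_H = [6, 24, 48]          # hours after treatment
REPLICATES_PER_GROUP = 4


def build_sample_sheet():
    """Enumerate every treatment x time x replicate combination."""
    rows = []
    replicates = range(1, REPLICATES_PER_GROUP + 1)
    for treatment, hours, rep in product(TREATMENTS, TIME_POINTS_H, replicates):
        rows.append(
            {
                # Hypothetical naming scheme, for illustration only.
                "sample_id": f"{treatment}_{hours}h_rep{rep}",
                "treatment": treatment,
                "time_h": hours,
                "replicate": rep,
            }
        )
    return rows


if __name__ == "__main__":
    sheet = build_sample_sheet()
    print(len(sheet), "samples")
    for row in sheet[:4]:
        print(row)
```

Writing the design out explicitly makes it easy to confirm that 2 x 3 groups with 4 replicates each account for all 24 arrays before any statistics are run.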

Residual Plots of Scanned GeneChips
[Figure panels: High quality (mostly white); Scratched; Uneven (edge drying); Uneven; Too Variable (intense red/blue)]
Residuals: Red = (+), Blue = (-), White = 0
http://plmimagegallery.bmbolstad.com/

Relative Log Expression (RLE)
Provides a measure of reproducibility of gene expression data that can be compared across batches, experiments or trials.
Large spread and non-zero = bad
Outside control limits = bad
[Figure labels: Control limits; IQR limits]

Normalized Unscaled Standard Error (NUSE)
Values have no units. Used to assess relative quality of arrays within an analysis set.
Large spread and non-zero = bad
Outside control limits = bad
[Figure labels: Control limits; IQR limits]
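RLE is commonly computed by subtracting, for each probe set, the median log expression across all arrays, and then summarizing the resulting values per array. The sketch below follows that definition using only the standard library; the toy matrix is made-up data, and the function names are illustrative rather than taken from any Bioconductor package.

```python
from statistics import median, quantiles


def relative_log_expression(log_expr):
    """Subtract each probe set's median (across arrays) from its values.

    log_expr: list of rows, one per probe set; each row holds one
    log2 expression value per array.
    """
    return [[value - median(row) for value in row] for row in log_expr]


def summarize_rle(rle):
    """Return the median and interquartile range of RLE values per array."""
    n_arrays = len(rle[0])
    summary = []
    for j in range(n_arrays):
        column = [row[j] for row in rle]
        q1, _, q3 = quantiles(column, n=4)
        summary.append({"array": j, "median": median(column), "iqr": q3 - q1})
    return summary


if __name__ == "__main__":
    # Hypothetical toy data: 4 probe sets x 3 arrays, third array shifted up.
    toy = [
        [8.0, 8.1, 9.0],
        [5.0, 5.2, 6.1],
        [10.0, 9.9, 11.0],
        [7.0, 7.1, 8.2],
    ]
    for entry in summarize_rle(relative_log_expression(toy)):
        print(entry)
```

An array whose RLE median sits well away from zero, or whose spread is much larger than its neighbors, is the kind of array the slide flags as "bad".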

Monday BREAK #2

Day 1 Session 7: Differential Gene Expression
James C. Fleet, PhD, Distinguished Professor, Department of Nutrition Science
Pete Pascuzzi, PhD, Assistant Professor, Purdue Libraries

Statistics for Biologists
Nature Publishing: Collections of short articles on basic statistical concepts for biologists
http://www.nature.com/collections/qghhqm

What is a p-Value?
The probability of obtaining an effect at least as extreme as the one you observed, assuming the null hypothesis is true.
[Figure: Control Mean = 260, Treatment Mean = 330.6. J. Frost, MiniTab Blog]

What is type I error?
Decision | Null Hypothesis is True | Null Hypothesis is False
Reject Null Hypothesis | Type I error (False positive, α) | Correct Inference (True positive)
Fail to Reject Null Hypothesis | Correct Inference (True negative) | Type II error (False negative, β)
Type I error: rejecting the null hypothesis when it is in fact true, i.e. a false positive.

Why is type I error a problem for omics studies?
P(at least one Type I error among m tests) = $1 - (1 - \alpha)^m$
e.g. when α = 0.05 and m = 10, P = 0.401
Large p, small n problem
100 repeats of the same experiment: 5 potential false positives when α = 0.05
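Assuming the m tests are independent, the probability of at least one Type I error is 1 - (1 - α)^m, and the expected number of false positives is α times m. The following minimal sketch works through that arithmetic; the function names are illustrative.

```python
def familywise_error_rate(alpha: float, m: int) -> float:
    """P(at least one Type I error) across m independent tests at level alpha."""
    return 1.0 - (1.0 - alpha) ** m


def expected_false_positives(alpha: float, m: int) -> float:
    """Expected number of false positives when all m null hypotheses are true."""
    return alpha * m


if __name__ == "__main__":
    # Ten tests at alpha = 0.05: about a 40% chance of at least one false positive.
    print(round(familywise_error_rate(0.05, 10), 3))
    # Array-scale example: 25,000 tests at alpha = 0.05.
    print(round(expected_false_positives(0.05, 25_000)))
```

The independence assumption rarely holds exactly for co-regulated transcripts, so these numbers are best read as an order-of-magnitude guide to why uncorrected p-values are unreliable at array scale.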

25,000 transcripts on an array = 1,250 false positives at α = 0.05

How do we control type I error in omics studies?
Familywise Error Rate Correction (FWER): controls the probability of making even one false discovery in a set of comparisons.
e.g. Bonferroni test: α/(# comparisons) = 0.05/25,000 = 0.000002
Very conservative but useful if the goal is to only find the changes that are most reliable (and likely the largest)

How do we control type I error in omics studies?
False Discovery Rate (FDR): e.g. Benjamini and Hochberg procedure
(i/m)Q = (rank/(# comparisons)) * (false discovery rate)
Gene | P value | Rank | FDR threshold (i/m)Q, Q = 0.05
Gene 1 | 0.0001 | 1 | 0.000002
Gene 2 | 0.0004 | 2 | 0.000004
Gene 1500 | 0.049 | 1500 | 0.003
Gene 25,000 | 0.988 | 25000 | 0.05
Accepts a set rate of false positives within a number of comparisons (e.g. 5% FDR means 5/100 significant comparisons are likely false positives)

Data Reduction as a Strategy to Minimize Type I Error Problem
Filter out genes with: Low expression or High variation
54,677 genes on the U133 Plus 2.0 array
MAS5 (use M=P) | # Present
At least 4/24 = P | 28,883
75% total = P | 21,448
75% P in at least 1 group | 25,985
50% total = P | 23,793
50% P in at least 1 group | 29,260
Drop Bottom 25% | # Present
75% total = P | 40,597
75% P in at least 1 group | 45,689
Drop genes SD/mean > 0.25 | # Present
Hackstadt and Hess (2009) BMC Bioinformatics 10:11

Processing and Statistics Influence the DEG List
Processing | Statistic | DEG at 48 h
gcRMA | SAM | 3566
RMA | SAM | 2249
RMA | limma | 1021
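The Bonferroni cutoff and the Benjamini and Hochberg thresholds quoted in this session follow directly from α/m and (i/m)Q. The sketch below reproduces those numbers for m = 25,000 and then applies the step-up rule to a short list of p-values; that list is hypothetical, and the function names are illustrative.

```python
def bonferroni_threshold(alpha: float, m: int) -> float:
    """Per-test significance cutoff that controls the familywise error rate."""
    return alpha / m


def bh_critical_value(rank: int, m: int, q: float) -> float:
    """Benjamini-Hochberg threshold (i/m)Q for the p-value of a given rank."""
    return rank / m * q


def benjamini_hochberg(p_values, q=0.05):
    """Return one boolean per input p-value: True if declared a discovery.

    Step-up rule: find the largest rank i with p_(i) <= (i/m)Q, then
    reject every hypothesis ranked at or below i.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= bh_critical_value(rank, m, q):
            largest_rank = rank
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= largest_rank:
            rejected[idx] = True
    return rejected


if __name__ == "__main__":
    m = 25_000
    print(bonferroni_threshold(0.05, m))
    for rank in (1, 2, 1500, 25_000):
        print(rank, bh_critical_value(rank, m, 0.05))
    # Hypothetical p-values, for illustration only.
    example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    print(benjamini_hochberg(example, q=0.05))
```

Note that the thresholds depend only on rank, m, and Q; whether a given gene is called significant depends on comparing its own p-value to the threshold at its rank and on the step-up rule across the whole list.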
