STATISTICS NETWORKING DAY
Species Distribution Models (SDM) for Presence Only (PO) data
Maria Angelica Lopez-Aldana
Principal Supervisor: Assoc. Prof. Bernd Gruber
Associate Supervisors: Dr. Carlos Gonzalez-Orozco, Prof. Arthur Georges
August 2015
MDBfutures Collaborative Research Network

Outline
- An overview of Species Distribution Models (SDM)
- SDM methods
- SDM for presence-only (PO) data
- Learning resources
- Complexities and recommendations

An overview of Species Distribution Models (SDM)
Ecological question: What is the probability that a species occurs in a given area?

Uses:
- Reserve design and conservation planning:
  - Target areas for protected status
  - Assess threats to protected areas
  - Design reserves
- Ecological restoration
- Risk and impacts of invasive species
- Effects of global warming on biodiversity
- Describing or estimating macroecological patterns such as species richness

Predictive modelling of species' geographic distributions based on environmental conditions (Phillips et al. 2006).

Main assumption: species distributions are predictable from environmental variables.
- Inputs: species occurrences (geographic coordinates X, Y) and environmental data
- Covariates: environmental data
- Response variable: probability of presence
- Output: prediction of occurrence probability

SDM: Methods

Type of data: presence/absence data
Source: systematic biological survey
Methods:
- GLM, logistic regression
- GAM, generalized additive models
- MARS, multivariate adaptive regression splines

Type of data: presence-only data
Source: herbarium or museum data
Methods:
- MaxEnt, maximum entropy
- MaxLike, maximum likelihood

MAXENT vs MAXLIKE

MAXENT
- Machine learning method
- Automatic and flexible set of arrangements (linear, quadratic, product, splines)
- Subject to overfitting
- Not possible to apply the standard statistical inference techniques
- Explores the relative suitability of one place over another using the maximum entropy principle

MAXLIKE
- Maximum likelihood method
- Not as flexible; arrangements need to be specified
- Possible to apply the standard statistical inference techniques (e.g. hypothesis tests, confidence intervals or model selection)
- Logit-linear model, which ensures that the predicted value is a real probability value

Learning resources
- Coursera: Programming in R, by Roger D. Peng, PhD, Johns Hopkins University
- DataCamp
- R-bloggers
- R mailing lists and user groups: there are mailing lists for R users. For more information and to subscribe, see The R Project for Statistical Computing (Mailing Lists). The primary mailing list is called "R-help"; it offers swift and competent answers to problems with R.
- Newsletter: since January 2001, R has had an online newsletter, which in 2009 became the R Journal.

Other learning resources: SDM books
- A. Townsend Peterson, Jorge Soberón, Richard G. Pearson, Robert P. Anderson, Enrique Martínez-Meyer, Miguel Nakamura & Miguel B. Araújo

SDM and R, available online:
- Species distribution modeling with R, Robert J. Hijmans and Jane Elith, March 14, 2015
  https://cran.r-project.org/web/packages/dismo/vignettes/sdm.pdf
- Janet Franklin, San Diego State University

Complexities and Recommendations

Complexities
- Model formulation, model fitting and model evaluation require specific statistical methods:
  - Conceptual model formulation: variable selection
  - Statistical modeling: different methods, model selection
  - Evaluation: model evaluation
- It is necessary to learn a set of software (e.g. ArcGIS and R) and skills, both computational and theoretical.
- Processing time can be very long.
- As these are novel methods, information may be limited and dispersed.

Recommendations
- Learn R first. It is a valuable tool that applies to a wide range of problems.
- As some problems are very specific (e.g. code or software conditions), it is always a good idea to google questions and read forums.
- Do not hesitate to write to the authors of papers.
- Include all PhD students in the network?

THANKS!!

1) How they use the method in their work;
2) How they learned about the method (textbooks, websites, mentors);
3) Complexities they have experienced in applying the method.

Conditions: no absence data.

How to choose the covariates?
- Purpose of the study
- Data availability
- Biology of the species
- Scale

Environmental covariates: climate, topography, land use, soil type, biotic interaction.

Extent         Range
Global         > 10000 km
Continental    2000-10000 km
Regional       200-2000 km
Landscape      10-200 km
Local          1-10 km
Site           10-1000 m
Micro          < 10 m

Species Biology and SDM performance
How does the biology of the species affect model performance? (Franklin, 2009)
Higher accuracy:
- Rare species: better discrimination of suitability
- In plants, obligate seeders
- Site fidelity
- Longevity

Statistical Modeling: Methods
How to choose the method?

Species Distribution Models and Presence Only data (PO)
- Presence-absence survey data are generally not available.
- Huge sampling efforts lie behind museum data collections.
- Urgent decisions are needed for conservation.
- PO data are the only option when the landscape extent to be modeled is very large.
Yet, how can we contrast the environmental conditions of PRESENCE WITHOUT ABSENCES?

MaxEnt
- Follows the Maximum Entropy Principle.
- Developed by Phillips et al. 2006.
- What is the Maximum Entropy Principle? What does it mean in the SDM context?
  - Premise: the best approximation of a distribution is determined by maximum entropy, subject to constraints on its moments.
  - Entropy component: the maximum entropy model aims to find the distribution that is most spread out (i.e. closest to uniform).
  - Constraint component: restraint on the average of the covariates.
- Uses background data: locations where presence/absence is unmeasured.
- Explores the relative suitability of one place over another using the maximum entropy principle: F1(z) / F(z)
  - F1(z): pdf of covariates where the species is present

MaxEnt

Exponential output (raw MaxEnt)
- MaxEnt distribution = Gibbs distribution (exponential function).
- As with every probability distribution, the values sum to 1.
- Cells with environmental variables close to the mean of the presence locations have high values.
- Scale dependent and not intuitive; projections are not easy to interpret.

Cumulative output
- The value assigned to a pixel is the sum of the probabilities at that pixel and all other pixels with equal or lower probability.
- Scale independent and easier to use in projections, but not proportional to probability of presence!!

Logistic output
- This approximation is derived by applying a logistic function to the maximum entropy function.
- Using this approximation, it is assumed that the probability of presence at a typical site is 0.5!!
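A minimal sketch of how the cumulative output relates to the raw output, following the definition on this slide. The five raw values are hypothetical and chosen only so that they sum to 1; this is base R for illustration, not a call into the MaxEnt software.

```r
# Hypothetical raw MaxEnt output for five pixels; raw values sum to 1.
raw <- c(0.05, 0.10, 0.15, 0.30, 0.40)

# Cumulative output: for each pixel, sum the raw values of all pixels
# whose raw value is equal to or lower than that pixel's value.
cumulative <- vapply(raw, function(p) sum(raw[raw <= p]), numeric(1))

print(round(cumulative, 2))
```

The pixel with the highest raw value always receives a cumulative value of 1, whatever the number of pixels, which is one way to see why this output is scale independent.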

MaxEnt: Feature selection. Complexity
- Allows different arrangements.
- The arrangements available depend on the number of presences:
  - Linear (always possible)
  - Quadratic (at least 10 points)
  - Splines (at least 15 points)
  - Product (at least 80 points)
- Too many arrangements: subject to overfitting.
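The thresholds on this slide can be written as a small lookup. This is only a sketch: `allowed_features` is a hypothetical helper name, not a function of the MaxEnt software, and the minimum point counts are the ones listed on the slide.

```r
# Arrangements (feature classes) available for a given number of presences,
# using the minimum point counts listed on the slide.
allowed_features <- function(n_presences) {
  min_points <- c(linear = 0, quadratic = 10, splines = 15, product = 80)
  names(min_points)[n_presences >= min_points]
}

allowed_features(9)    # fewer than 10 presences
allowed_features(64)   # e.g. a training set of 64 presences
allowed_features(713)  # e.g. a training set of 713 presences
```

With few presences only the simplest arrangements are available, which limits overfitting for poorly sampled species.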

MaxEnt
- Most popular method! (even for presence/absence data)
  - Of 108 studies (2008-2012) that used MaxEnt, 36% discarded absence data (Yackulic, 2013).
- Limited customization:
  - Number of background points
  - Default prevalence
  - Output format

MaxLike
- Statistical method. The landscape is divided into x pixels.
- Developed by Royle 2012.
- Random sampling principle: uses random sampling and Bayes' rule to derive the likelihood for the presence-only sample.
- Uses a hypothetical first-stage random sample to create a sample inclusion variable w(x), and describes P(x | w(x) = 1, y(x) = 1), where:
  - w(x) = 1 if x appears in the first-stage sample
  - y(x) = 1 if the pixel is occupied
- Assumption: species detection probability is constant.

MaxLike
- Possible to apply the standard statistical inference techniques (e.g. hypothesis tests, confidence intervals or model selection).
- Logit-linear model, which ensures that the predicted value is a real probability value.
- It has an R package (maxlike) to fit the model. (MaxEnt too!!!)

    # run & predict (in parallel) maxlike models for k randomizations
    # `sets`, `rstrans` and `acaTrain` come from earlier in the script
    # (presumably the covariate rasters and the list of training presences).
    library(foreach)
    library(doParallel)
    registerDoParallel()  # register a parallel backend for %dopar%

    acalikeMods <- foreach(k = 1:sets, .verbose = TRUE, .packages = "maxlike") %dopar%
      maxlike(~ annual_mean_rad + I(annual_mean_rad^2) +
                annual_mean_temp + I(annual_mean_temp^2) +
                annual_precipitation + I(annual_precipitation^2),
              rstrans, acaTrain[[k]],
              control = list(maxit = 10000),
              removeDuplicates = TRUE)

- Not as flexible as MaxEnt

PROGRAM AND DESIGN OF THE RESEARCH INVESTIGATION

Objectives:

PROGRAM AND DESIGN OF THE RESEARCH INVESTIGATION: Methodology
i. Empirical comparison between MaxLike and MaxEnt

Conceptual modeling formulation
Covariates: mean annual radiation, annual temperature and annual rainfall
Presences: 30 Acacia species, grouped by abundance (A, number of records) and coverage (C, number of grids):
- High abundance: A > 556 records
- Low abundance: 205 < A < 361 records
- High coverage: C > 69 grids
- Low coverage: 30 < C < 43 grids

Group 1 (AC, high abundance and high coverage): A. ligulata, A. salicina, A. deanei, A. ramulosa, A. sibirica, A. monticola, A. stenophylla, A. holosericea
Group 2 (aC, low abundance and high coverage): A. paraneura, A. rhodophloia, A. stowardii, A. ayersiana, A. pruinocarpa, A. gonoclada, A. adoxa
Group 3 (Ac, high abundance and low coverage): A. crassa, A. floribunda, A. terminalis, A. rubida, A. mucronata, A. euthycarpa, A. pulchella
Group 4 (ac, low abundance and low coverage): A. latipes, A. alleniana, A. triptera, A. hemiteles, A. lanigera, A. microcarpa, A. halliana, A. dimidiata

Statistical Modeling: MaxEnt vs MaxLike
Response variable
- MaxEnt: suitability index (logistic output)
- MaxLike: probability of occurrence

Covariates: linear and quadratic terms
- Mean annual radiation
- Annual temperature
- Annual rainfall

Calibration & Evaluation
- Cross validation (25/75), repeated 30 times
- AIC, Akaike Information Criterion: lower AIC means lower unexplained deviance. Better model!!
- AUC, Area Under the Receiver Operating Characteristic Curve:
  - AUC > 0.9: very good model!!
  - 0.7-0.9: good model!
  - 0.5-0.7: bad model.
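As an illustration of what AUC measures, it can be computed as the proportion of (presence, background) pairs in which the presence receives the higher predicted value. The sketch below uses base R and the rank (Mann-Whitney) formulation; `auc_rank` is a hypothetical helper and the predicted values are made up for the example, not taken from the models in this talk.

```r
# AUC via ranks: the probability that a randomly chosen presence is ranked
# above a randomly chosen background point (ties count as one half).
auc_rank <- function(pres, bg) {
  r  <- rank(c(pres, bg))          # average ranks handle ties
  n1 <- length(pres)
  n0 <- length(bg)
  (sum(r[seq_len(n1)]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Hypothetical predicted values at test presences and at background points.
pres <- c(0.90, 0.80, 0.60, 0.55)
bg   <- c(0.70, 0.50, 0.40, 0.30, 0.20)

auc_rank(pres, bg)  # 18 of the 20 pairs are ranked correctly, so AUC = 0.9
```

In a repeated cross-validation this value would be computed on the held-out records of each of the 30 splits and then summarized.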

Preliminary Results
i. Empirical comparison between MaxLike and MaxEnt
Selecting 2 species per group, as follows:
- Group 1 (AC): A. ligulata, A. sibirica
- Group 2 (aC): A. stowardii, A. gonoclada
- Group 3 (Ac): A. floribunda, A. euthycarpa
- Group 4 (ac): A. lanigera, A. alleniana

Models Performance: AIC values

Species          Train/test     MaxLike    MaxEnt      MaxEnt - MaxLike
A. alleniana     69 / 206       4127.1     6674.815    2547.699553
A. euthycarpa    245 / 734      13633      26347.6     12714.49252
A. floribunda    152 / 456      8892.3     15595.12    6702.818403
A. gonoclada     85 / 255       6397.3     9805.981    3408.703674
A. lanigera      82 / 247       4781.7     8650.121    3868.431935
A. ligulata      713 / 2140     52648      86249.69    33601.81599
A. sibirica      150 / 450      9624.8     18449.53    8824.750298
A. stowardii     64 / 193       5206       7829.549    2623.598923

- AIC, Akaike Information Criterion: lower AIC means lower unexplained deviance. Better model!!
  MaxLike has lower unexplained deviance than MaxEnt.

- AUC, Area Under the Receiver Operating Characteristic Curve (AUC > 0.9: very good model!!; 0.7-0.9: good model; 0.5-0.7: bad model).
  AUC is consistent with the AIC result: AUC-MaxLike values are always bigger than AUC-MaxEnt values; however, the difference is almost insignificant for species with low coverage.
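The AIC comparison on the Models Performance slide can be re-derived with a few lines of base R. The sketch below uses the AIC values as printed on the slide (MaxLike values are rounded there, so the differences differ slightly from the last column of the slide); the data frame and column names are only for illustration.

```r
# AIC values as reported on the Models Performance slide.
aic <- data.frame(
  species = c("A. alleniana", "A. euthycarpa", "A. floribunda", "A. gonoclada",
              "A. lanigera", "A. ligulata", "A. sibirica", "A. stowardii"),
  maxlike = c(4127.1, 13633, 8892.3, 6397.3, 4781.7, 52648, 9624.8, 5206),
  maxent  = c(6674.815, 26347.6, 15595.12, 9805.981, 8650.121, 86249.69,
              18449.53, 7829.549)
)

# Difference MaxEnt - MaxLike: positive values favor MaxLike (lower AIC).
aic$delta <- aic$maxent - aic$maxlike

all(aic$delta > 0)        # TRUE: MaxLike has the lower AIC for every species
aic[order(aic$delta), ]   # species sorted from smallest to largest difference
```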

- Mean probability of presence.
  Because of the default value of 0.5 in the MaxEnt model, its mean probability of presence is close to this value. The probability of presence for MaxLike is, in most cases, bigger but exhibits wide variation.

MaxLike vs MaxEnt: Mean Predicted Probability
[Figure: mean predicted probability, MaxLike and MaxEnt panels, for A. sibirica (AC) and A. gonoclada (aC)]

MaxLike vs MaxEnt: Mean Predicted Probability
[Figure: mean predicted probability, MaxLike and MaxEnt panels, for A. floribunda (Ac) and A. alleniana (ac)]

Which one is the best model?
- MaxLike has better AUC and AIC values, but exhibits huge variability.
- MaxEnt is more consistent between models (low variability), but maintains a probability of presence of around 0.5.
- We will choose the model that has the best fit, taking into account the research questions, the biology of the species and the influence of omission and commission error.

Taking into account the SDM purpose:
- Case 1. Reserve design. Commission (false positive): false presences, investment in conservation of inappropriate areas. MaxEnt the better option?
- Case 2. Impact of invasive species. Omission (false negative): false absences, areas left uncontrolled!! MaxLike the better option?
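To make the two error types concrete, the sketch below computes omission and commission rates at a given threshold. It assumes, purely for illustration, that the true status of each test site is known (with presence-only data, background points would have to stand in for absences); `error_rates` and all values are hypothetical.

```r
# Hypothetical predicted probabilities at sites with known status.
pred_present <- c(0.92, 0.81, 0.40, 0.65, 0.30)  # species truly present
pred_absent  <- c(0.10, 0.55, 0.30, 0.70, 0.05)  # species truly absent

error_rates <- function(present, absent, threshold) {
  c(omission   = mean(present < threshold),   # presences predicted absent
    commission = mean(absent >= threshold))   # absences predicted present
}

error_rates(pred_present, pred_absent, threshold = 0.50)
error_rates(pred_present, pred_absent, threshold = 0.25)
```

Lowering the threshold reduces omission at the cost of more commission, which is the trade-off behind Case 1 (avoid false presences) and Case 2 (avoid false absences).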

ii. Predict the distribution of species over time.

Conceptual modeling formulation
Covariates: 19 bioclim variables; soil and water temperature? Soil moisture?
Presences: turtle species
- Chelodina longicollis
- Emydura macquarii
- Chelodina expansa (AUC = 0.978)
- Myuchelys bellii
Annual mean radiation; precipitation of driest quarter; lowest period moisture.

Resources and Funding Required
Data requirement: the PO data sets to be used in this project and the collaborators are:
- Aim 1. Acacia species, Carlos Gonzalez-Orozco
- Aim 2. Turtle species, Arthur Georges
- Aims 3 and 4. Plant, fish, amphibian, reptile and mammal data sets, Carlos Gonzalez-Orozco and Margarita Medina
Software requirement: R for programming. The program is free and has already been obtained.

Funding source: the project is supported by the Murray Darling Basin Futures project.

Timetable (PhD duration)
Activities: literature review; R code (MaxEnt/MaxLike); running code, Australia (Acacia); turtle model; running code, MDB (multitaxon); mapping for conservation; writing; conference (to be determined).
Milestones:
- PhD starts: April 2013
- Introductory seminar: December 2013
- Confirmation seminar: June 2014
- Work in progress seminar: 8 July 2015
- Final seminar: 2016
- PhD finishes: April 2016

Acknowledgment:
1. Funding! MDB Futures Collaborative Research Network.
2. Research group:
- Bernd Gruber
- Carlos Gonzalez-Orozco

- Arthur Georges
- Peter Unmack
- Aaron Adamack
- Margarita Medina

Thanks for listening!!!!

AUC. Area under the ROC curve. A statistic generated from a receiver operating characteristic (ROC) plot. AUC is an overall measure of model performance across all thresholds and strengths of prediction. AUC is a non-parametric measure that ranges between 0 and 1. It summarizes the model's ability to rank presence records higher than absence records (or background records in PO methods).

AIC. Akaike Information Criterion. A measure of the relative goodness of fit of a statistical model. It offers a relative measure of the information lost when a given model is used to describe reality. It can be said to describe the tradeoff between bias and variance in model construction or, loosely speaking, between accuracy and complexity of the model. In the general case, the AIC is:

AIC = 2k - 2 ln(L)

where k is the number of parameters in the statistical model, and L is the maximized value of the likelihood function for the estimated model. Given a set of candidate models for the data, the preferred model is the one with the minimum AIC value.

Factors impacting the geographic range of species
- The abiotic environment (fundamental niche): temperature, precipitation, soil type
- The biotic community: food webs and ecological networks
- Movement: history and geography; dispersal

Conceptual modeling formulation: niche theory

Model Selection

- Few parameters: simple; parsimony; generality.
- More flexibility: descriptive accuracy; overfitting; sacrifices predictive performance.

Modelling occurrence probability with MaxLike
Using $y_i = 1$ to denote a presence at grid cell $x_i$, and $P(y_i = 1 \mid x_i, \beta_0, \beta)$ to denote occurrence probability, the likelihood for MaxLike is given by (Royle et al. 2012):

$$L(\beta_0, \beta) = \prod_{i=1}^{N} \frac{P(y_i = 1 \mid x_i, \beta_0, \beta)}{\sum_{j \in B} P(y_j = 1 \mid x_j, \beta_0, \beta)}$$

where $N$ is the total number of presences, $B$ is the background data, $\beta_0$ is an intercept parameter, and $\beta$ is the vector of slope coefficients associated with the environmental covariates. The numerator describes the likelihood at presence cells, while the denominator describes the likelihood at background cells. Often, background cells are taken as a random sample of cells over the landscape (Lele & Keim 2006; Lele 2009; Royle et al. 2012).

How to build the model?
STATISTICAL MODELING USING SDM (Guisan and Zimmermann 2000)

MaxLike vs MaxEnt_LF: Standard Deviation of Predicted Probability

[Figure: panels MaxLike and MaxEnt DF_BC; species A. deanei, A. flexifolia, A. semilunata]

MaxLike vs MaxEnt_LF_BC: Standard Deviation of Predicted Probability
[Figure: panels MaxLike and MaxEnt LF_BC; species A. deanei, A. flexifolia, A. semilunata]

Objectives and Research Questions:
- Make an empirical comparison between MaxEnt (maximum entropy) and MaxLike (maximum likelihood) in the predictions of Acacia in Australia. Which of these methodologies performs better for the Acacia distribution?
- Compare the performance of these methods over other species (Eucalyptus, fish and frogs) in the Murray Darling Basin. Is this performance different between species and scales?
- Integrate the distributions of these important groups in a conservation map of the MDB area. Are the important areas consistent with the already defined conservation ...

Preliminary Results
- MaxLike: lower unexplained deviance than MaxEnt (LF, LF_BC).
- MaxEnt DF shows better performance than MaxEnt LF.
- A. flexifolia (site fidelity sp.) shows a good adjustment in all the different methods.
[Figure: axes Area Proportion and Threshold]

Statistical Modeling: Methods

Statistical Modeling: MaxEnt

$$P(y = 1 \mid z) = \frac{F_1(z)\, P(y = 1)}{F(z)}$$

- P(y=1|z): unknown; the probability of presence of the species, conditioned on the environment.
- P(y=1): prevalence of the species.
- F1(z): pdf of covariates where the species is present.
- F(z): pdf of covariates across the landscape L.
MaxEnt estimates the ratio F1(z)/F(z) (MaxEnt raw output). In logistic output: η(z).

Why make SDM?
List of SDM uses, the most important ones.

MDBfutures Collaborative Research Network. Theme 2: Environmental watering and allocation. Project 3: Biodiversity Conservation

Example: Acacia aneura
- Response variable: A. aneura presence (prevalence)
- Covariates: average annual rainfall, max temperature
- Output: probability of presence

From SDM to conservation mapping: continental and regional approaches
- Step 1. Testing modelling performance for presence-only data. Models: MaxEnt vs MaxLike. Species: 50 Acacia species.
- Step 2. Testing consistency of this performance across taxon groups. Taxon groups so far: plants (Acacia and eucalypts), genera of plants, frogs and fish.
- Step 3. Mapping and integrating SDM results to identify priority areas for conservation.

Testing methods, Part I: Comparing MaxEnt versus MaxLike
Acacia species:
- A. deanei (n = 809)
- A. flexifolia (n = 203)
- A. semilunata (n = 99)
Models:
- MaxEnt, linear features (Maxent_LF)
- MaxEnt, all features (Maxent_allF)
- MaxEnt, linear features, bias-corrected (Maxent_LF_BC)
- MaxEnt, all features, bias-corrected (Maxent_allF_BC)
- MaxLike

[Figure: A. semilunata, A. flexifolia and A. deanei under Maxlike, Maxent_allF, Maxent_allF_BC, Maxent_LF and Maxent_LF_BC]

Preliminary results: SDM
[Figure: A. semilunata, A. flexifolia and A. deanei under Maxlike, Maxent_allF, Maxent_LF, Maxent_LF_BC and Maxent_allF_BC]