# Data Analysis - Rowan

Statistics: Data Presentation & Analysis

Fr Clinic I

## Overview

- Tables & Graphs
- Populations & Samples
- Mean, Median, & Variance
- Error Bars
- Standard Deviation, Standard Error & 95% Confidence Interval (CI)
- Comparing Means of Two Populations
- Linear Regression (LR)

## Warning

Statistics is a huge field; I've simplified considerably here. For example:

- Mean, Median, and Standard Deviation
  - There are alternative formulas
- 95% Confidence Interval
  - There are other ways to calculate CIs (e.g., z statistic instead of t; difference between two means, rather than single mean)
- Error Bars
  - Don't go beyond the interpretations I give here!
- Comparing Means of Two Data Sets
  - We just cover the t test for two means when the variances are unknown but equal; there are other tests
- Linear Regression
  - We only look at simple LR and only calculate the intercept, slope, and $R^2$. There is much more to LR!

## Tables

Table 1: Average Turbidity and Color of Water Treated by Portable Water Filters

| (1) Water | (2) Turbidity (NTU) | (3) True Color (Pt-Co) | (4) Apparent Color (Pt-Co) |
|---|---|---|---|
| Pond Water | 10 | 13 | 30 |
| Sweetwater | 4 | 5 | 12 |
| Hiker | 3 | 8 | 11 |

- Consistent format, title, units, big fonts
- Differentiate headings, number columns

## Figures

[Figure 1: Turbidity of Pond Water, Treated and Untreated. Y axis: Turbidity (NTU). X axis: Filter (Pond Water, Sweetwater, Miniworks, Hiker, Pioneer, Voyager).]

- Consistent format, title, units
- Good axis titles, big fonts

## Populations and Samples

- Population
  - All possible outcomes of experiment or observation
    - US population
    - Particular type of steel beam
- Sample
  - Finite number of outcomes measured or observations made
    - 1000 US citizens
    - 5 beams

- Use samples to estimate population properties
  - Mean, variance
  - E.g., heights of 1000 US citizens used to estimate the mean height of the US population

## Central Tendency: Mean and Median

Data: 1, 3, 3, 6, 8, 10

- Mean = $\bar{x}$ = sum of values divided by sample size = (1 + 3 + 3 + 6 + 8 + 10)/6 = 5.2 NTU
- Median = $m$ = middle number

| Rank | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| Number | 1 | 3 | 3 | 6 | 8 | 10 |

For an even number of sample points, average the middle two: (3 + 6)/2 = 4.5
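As a quick check of the arithmetic above, here is a minimal Python sketch using the standard library's `statistics` module. The variable name `turbidity` is only illustrative; the six values are the sample from the example.

```python
import statistics

# Turbidity sample from the example above, in NTU
turbidity = [1.0, 3.0, 3.0, 6.0, 8.0, 10.0]

mean = statistics.mean(turbidity)      # sum of values / sample size
median = statistics.median(turbidity)  # averages the middle two when n is even

print(f"mean   = {mean:.1f} NTU")
print(f"median = {median:.1f} NTU")
```

The median sits below the mean here because the largest values (8 and 10) pull the mean upward more than they move the middle of the ranked list.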

Excel: Mean - AVERAGE; Median - MEDIAN

## Variability

Variance, $s^2$: sum of the squares of the deviations about the mean, divided by the degrees of freedom

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

where $x_i$ = a data point and $n$ = number of data points

Example (cont.)

$$s^2 = \frac{(1-5.2)^2 + (3-5.2)^2 + (3-5.2)^2 + (6-5.2)^2 + (8-5.2)^2 + (10-5.2)^2}{6-1} = 11.8\ \text{NTU}^2$$

Excel: Variance - VAR

## Error Bars

- Show data variability on plot of mean values
- Types of error bars include: max/min, standard deviation, standard error, 95% CI

[Figure: Turbidity (NTU) versus Filter Type (Filter 1, Filter 2, Filter 3)]

## Standard Deviation, $s$

- Square root of variance: $s = \sqrt{s^2}$
- If a phenomenon follows the Normal Distribution (bell curve), 95% of the population lies within 1.96 standard deviations of the mean
- Error bar is $s$ above & below mean
- Excel: standard deviation - STDEV

[Figure: Normal Distribution. X axis: Standard Deviations from Mean; the central 95% lies between -1.96 and 1.96.]

## Standard Error of Mean

- Symbol $s_{\bar{x}}$; also called St-Err
- For a sample of size $n$ taken from a population with standard deviation estimated as $s$:

$$s_{\bar{x}} = \frac{s}{\sqrt{n}}$$

- As $n$ increases, $s_{\bar{x}}$ decreases, i.e., the estimate of the population mean improves
- Error bar is St-Err above & below mean

## 95% Confidence Interval (CI) for Mean

A 95% Confidence Interval is expected to contain the population mean 95% of the time (i.e., of 95% CIs from 100 samples, about 95 will contain the population mean).

$$\bar{x} \pm t_{95\%,\,n-1}\, s_{\bar{x}}$$

$t_{95\%,n-1}$ is a statistic for the 95% CI from a sample of size $n$. In Excel, $t_{95\%,n-1}$ = TINV(0.05, n-1).
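The chain from variance to confidence interval can be sketched in Python for the six-point turbidity sample (1, 3, 3, 6, 8, 10 NTU). This is an illustration, not a replacement for the Excel workflow. One assumption to note: the Python standard library has no equivalent of TINV, so the t value for 5 degrees of freedom is entered as a constant taken from a standard t table.

```python
import math
import statistics

# Turbidity sample from the mean and variance examples, in NTU
turbidity = [1.0, 3.0, 3.0, 6.0, 8.0, 10.0]
n = len(turbidity)

mean = statistics.mean(turbidity)
variance = statistics.variance(turbidity)  # sample variance, divides by n - 1
std_dev = math.sqrt(variance)              # s
std_err = std_dev / math.sqrt(n)           # s / sqrt(n)

# Assumption: two-tailed 95% t value for n - 1 = 5 degrees of freedom,
# copied from a standard t table (the value TINV(0.05, 5) gives in Excel).
T_95_DF5 = 2.571

half_width = T_95_DF5 * std_err            # length of the 95% CI error bar

print(f"variance  = {variance:.1f} NTU^2")
print(f"std dev   = {std_dev:.2f} NTU")
print(f"std error = {std_err:.2f} NTU")
print(f"95% CI    = {mean:.1f} +/- {half_width:.1f} NTU")
```

With only six points the interval is wide (roughly 3.6 NTU either side of the mean), which is the practical consequence of dividing by a small $\sqrt{n}$ and multiplying by a t value well above 1.96.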

- If $n \geq 30$, $t_{95\%,n-1} \approx 1.96$ (Normal Distribution)
- Error bar is $t_{95\%,n-1}\, s_{\bar{x}}$ above & below mean

## Using Error Bars to Compare Data

- Standard Deviation
  - Demonstrates data variability, but no comparison possible
- Standard Error
  - If bars overlap, any difference in means is not statistically significant
  - If bars do not overlap, indicates nothing!
- 95% Confidence Interval
  - If bars overlap, indicates nothing!
  - If bars do not overlap, difference is statistically significant

## We'll use 95% CI in this class

- Any time you have 3 or more data points, determine the mean, standard deviation, standard error, and $t_{95\%,n-1}$, then plot the mean with error bars showing the 95% confidence interval

## Adding Error Bars to an Excel Graph

- Create graph

  - Column, scatter, ...
- Select data series
- In the Layout tab, Analysis group, select Error Bars
- Select More Error Bar Options
- Select Custom and Specify Values, then select the cells containing the values $t_{95\%,n-1}\, s_{\bar{x}}$

## Example 1: 95% CI

Turbidity data

| Filter | 1 (NTU) | 2 (NTU) | 3 (NTU) | Mean (NTU) | St Dev (NTU) | n | St-Err (NTU) | $t_{95\%,2}$ | ± 95% CI = $t_{95\%,2}$ × St-Err (NTU) |
|---|---|---|---|---|---|---|---|---|---|
| Filter 1 | 2.1 | 2.1 | 2.2 | 2.1 | 0.06 | 3 | 0.03 | 4.30 | 0.14 |
| Filter 2 | 3.2 | 4.4 | 5 | 4.2 | 0.92 | 3 | 0.53 | 4.30 | 2.28 |
| Filter 3 | 4.3 | 4.2 | 4.5 | 4.3 | 0.15 | 3 | 0.09 | 4.30 | 0.38 |

[Figure: mean Turbidity (NTU) by Portable Water Filter (Filter 1: 2.1, Filter 2: 4.2, Filter 3: 4.3)]

## What can we do?

- Lift the weight multiple times using different solar panel combinations (or hydroturbines, or gear boxes) and plot the mean with 95% confidence interval error bars.
- If error bars overlap between two different test conditions, indicates nothing!
- If error bars do not overlap, difference is statistically significant

## T Test

- A more sophisticated way to compare means
- Use the t test to determine if the means of two populations are different
  - E.g., lift times with different solar panel combinations or turbines or ...

## Comparing Two Data Sets Using the t Test

- Example: You lift a weight with two panels in series and two in parallel.
  - Series: mean = 2 min, s = 0.5 min, n = 20
  - Parallel: mean = 3 min, s = 0.6 min, n = 20
- You ask the question: Do the different panel combinations result in different lift times?
  - Different in a statistically significant way

## Are the Lift Times Different?

Use TTEST (Excel).
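To make the comparison concrete, here is a sketch of the equal-variance (pooled) two-sample t test applied to the series and parallel summary statistics given above. It uses only the Python standard library. The helper names `pooled_t`, `t_pdf`, and `two_tailed_p` are illustrative, and the p-value is obtained by numerically integrating the t density with Simpson's rule, which is an assumption of this sketch and not how Excel computes TTEST.

```python
import math

def pooled_t(mean1, s1, n1, mean2, s2, n2):
    """Return (t, df) for the two-sample t test assuming equal variances."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df
    std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / std_err, df

def t_pdf(x, df):
    """Probability density of Student's t distribution."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, steps=2000):
    """Two-tailed p-value from Simpson's rule on [0, |t|] (steps must be even)."""
    b = abs(t)
    if b == 0:
        return 1.0
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central = total * h / 3           # P(0 < T < |t|)
    return max(0.0, 1 - 2 * central)  # probability in both tails

# Series vs. parallel lift times from the example (minutes)
t, df = pooled_t(2.0, 0.5, 20, 3.0, 0.6, 20)
p = two_tailed_p(t, df)
print(f"t = {t:.2f}, df = {df}, p = {p:.2g}")
print("significantly different" if p <= 0.05 else "not significantly different")
```

For these numbers t is about -5.7 with 38 degrees of freedom, so the probability falls far below 0.05 and the two panel arrangements would be reported as significantly different.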

TTEST returns the fractional probability of being wrong if you claim the two populations are different. We'll say they are significantly different if the probability is ≤ 0.05.

Lift times (min):

| Series | Parallel |
|---|---|
| 1.5 | 3 |
| 2 | 2.4 |
| 2.2 | 2.2 |
| 1.8 | 2.6 |
| 3 | 3.4 |
| 1.6 | 3.6 |
| 1.2 | 3.8 |
| 2.1 | 3.5 |
| 1.9 | 2.7 |
| 2.2 | 2.4 |
| 2.6 | 3.5 |
| 1.7 | 3.8 |
| 1.8 | 2.1 |
| 1.5 | 2.5 |
| 2.4 | 3.4 |
| 2.5 | 3.3 |
| 2.7 | 2.4 |
| 1.4 | 3.6 |
| 1.5 | 2.3 |
| 2.6 | 3.7 |

## Marbles

## Linear Regression

- Fit the best straight line to a data set

[Figure: Grade Point Average versus Height (m), with trendline $y = 1.897x + 0.8667$, $R^2 = 0.9762$]

- Right-click on a data point and select trendline. Select options to show the equation and $R^2$.

## $R^2$ - Coefficient of Multiple Determination

$$R^2 = \frac{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

- $\hat{y}_i$ = predicted y values, from regression equation
- $y_i$ = observed y values
- $\bar{y}$ = mean of y
- $R^2$ = fraction of variance explained by regression
- $R^2$ = 1 if data lie along a straight line
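The intercept, slope, and $R^2$ that the Excel trendline reports can be reproduced with a few sums. The sketch below uses the usual least-squares formulas and the $R^2$ ratio defined above. The function name `linear_regression` and both data sets are hypothetical; they are not the data behind the figure.

```python
def linear_regression(x, y):
    """Least-squares intercept, slope, and R^2 for y = intercept + slope * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Predicted values from the regression equation
    y_hat = [intercept + slope * xi for xi in x]
    # R^2 = sum of (predicted - mean)^2 / sum of (observed - mean)^2
    ss_regression = sum((yh - y_bar) ** 2 for yh in y_hat)
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    return intercept, slope, ss_regression / ss_total

# Hypothetical data, chosen only to illustrate the two cases
x = [0, 2, 4, 6, 8, 10]
y_line = [1 + 2 * xi for xi in x]            # lies exactly on a straight line
y_noisy = [1.2, 4.6, 9.3, 12.8, 17.1, 20.9]  # same trend with scatter

for label, y in (("exact line", y_line), ("with scatter", y_noisy)):
    intercept, slope, r2 = linear_regression(x, y)
    print(f"{label}: y = {slope:.3f}x + {intercept:.3f}, R^2 = {r2:.4f}")
```

The first data set returns $R^2$ = 1 because every point lies on the line; the second returns a value slightly below 1, reflecting the small fraction of variance the line does not explain.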