---
title: "Distribution Overview"
author: "Paul E. Johnson"
date: '2015-02-04'
output: slidy_presentation
---
## Visualize (Whirled Peas)
Quick PDF sketch of each (a code sketch follows the list)
- Normal (too easy)
- Gamma
- Beta
- Logistic
- Poisson
- Binomial
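
Base R's `curve` plus the `d*` density functions can draw these quickly; here is a minimal sketch for two of them (the parameter values are illustrative choices, not from the slides):

```{r}
## Quick density sketches with base R
curve(dnorm(x), from = -4, to = 4, ylab = "density", main = "Normal(0, 1)")
curve(dgamma(x, shape = 2, rate = 1), from = 0, to = 8,
      ylab = "density", main = "Gamma(shape = 2, rate = 1)")
```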
# Guess that Distribution
```{r drawHist, include=FALSE}
drawHist <- function(x, ...){
    dots <- list(...)
    ## These are legal argument names we need to take from ...
    ## if the user supplied them.
    histNames <- c("breaks", "nclass", "include.lowest", "right")
    argsForHist1 <- dots[names(dots)[names(dots) %in% histNames]]
    argsForHist1$x <- x
    argsForHist1$plot <- FALSE
    if (is.null(dots$breaks)) {
        dots$breaks <- 100
    }
    if (length(dots$breaks) == 1 && is.numeric(dots$breaks)) {
        argsForHist1$breaks <- min(dots$breaks, length(unique(x)) + 1)
    }
    ## First pass: compute the histogram without plotting it
    h1 <- do.call("hist", argsForHist1)
    cr <- range(x)
    ## Unless the ... or "h1" arguments say otherwise, use these:
    defaults <- list(prob = TRUE,
                     main = "", border = gray(.80),
                     xlim = extendrange(cr, f = 0.10),
                     ylim = rockchalk::magRange(h1$density, c(1, 1.2)),
                     ylab = "density")
    defaults$xlab <- gsub(".*\\$", "", deparse(substitute(x)))
    ## Pass the variable by name so the plotted labels show the caller's name
    defaults[["x"]] <- as.name(deparse(substitute(x)))
    ## Battle through the ... arguments. Find if any are legal hist
    ## arguments and insert them.
    namesOverlap <- names(dots) %in% names(formals(hist.default))
    dotsForHist <- dots[names(dots)[namesOverlap]]
    defaults <- modifyList(defaults, dotsForHist)
    ## Problem: must take notice of breaks calculated by h1.
    defaults <- modifyList(defaults, list("breaks" = h1$breaks,
                                          "include.lowest" = h1$include.lowest,
                                          "right" = h1$right,
                                          "nclass" = h1$nclass))
    callhist <- do.call("hist", defaults)
    ## Overlay a kernel density estimate and a matching normal curve
    lines(density(x), col = "red", lty = 2, lwd = 2)
    ind <- seq(cr[1], cr[2], length.out = 100)
    cm1 <- round(mean(x), 3)
    cs1 <- round(sd(x), 3)
    nprob <- dnorm(ind, mean = cm1, sd = cs1)
    lines(ind, nprob, lty = 1, col = "black")
    nlab <- bquote(paste("Normal(", .(round(cm1, 3)), ",", .(round(cs1, 3))^2, ")"))
    legend("topleft", legend = c("Kernel Density", as.expression(nlab)),
           lty = c(2, 1), col = c("red", "black"), lwd = c(1.5, 1))
    legend("left", legend = c(paste("Obs. mean=", cm1), paste("Obs. s.d.=", cs1)))
    invisible(callhist)
}
library(rockchalk)
```
## I'm Not that {.allowframebreaks}
```{r include = FALSE}
x1 <- rnorm(2000, 88, 200)
```
```{r}
length(x1)
x1[1:10]
rockchalk::summarize(x1)
```
I am not: a) Poisson b) Normal c) Uniform d) Beta e) Binomial
## I'm Not that {.allowframebreaks}
```{r echo=FALSE}
hist(x1, prob = TRUE, xlab = "Guess What I'm Not!")
```
I am not a) Poisson b) Normal c) Gamma d) Beta e) Binomial
## Answer:
```
x1 <- rnorm(2000, 88, 200)
```
## I'm Not that 2 {.allowframebreaks}
```{r include = FALSE}
x2 <- rgamma(2000, 2, 1)
```
```{r}
x2[1:10]
```
```{r}
summarize(x2)
```
I am not a) Poisson b) Normal c) Uniform d) Beta
## I'm Not that 2 {.allowframebreaks}
```{r echo=FALSE}
hist(x2, prob = TRUE, xlab = "Guess What I'm Not!")
```
I am not a) Poisson b) Logistic c) Uniform d) Beta
## Answer
```
x2 <- rgamma(2000, 2, 1)
```
## I'm Not that 3 {.allowframebreaks}
```{r include = FALSE}
x3 <- rbinom(2000, 10, prob = 0.3)
```
```{r}
x3[1:10]
```
```{r}
rockchalk::summarize(x3)
```
I am not: a) Poisson b) Normal c) Uniform d) Beta e) Binomial
## I'm Not that 3 {.allowframebreaks}
```{r echo=FALSE}
hist(x3, prob = TRUE, xlab = "Guess What I'm Not!")
```
I am not a) Poisson b) Normal c) Gamma d) Beta
## Answer:
Actually, I was
```
x3 <- rbinom(2000, 10, prob = 0.3)
```
# Harvesting from R/WorkingExamples
## plot-histogramWithLinesAndLegend.R
```{r}
drawHist(x1)
```
## ex 2
```{r}
drawHist(x2)
```
## ex 3
```{r}
drawHist(x3)
```
# Expected Values
## EV: Simple idea with complicated jargon
- Expected Value: the probability-weighted sum of outcomes
- Outcomes are 1, 2, 3
- Probabilities are 0.5, 0.4, and 0.1
- Calculate the Expected Value:
$$0.5 \times 1 + 0.4 \times 2 + 0.1 \times 3 = 1.6$$
Easy!
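
The same arithmetic in R, as a one-line check (the vectors just encode the outcomes and probabilities above):

```{r}
outcomes <- c(1, 2, 3)
probs <- c(0.5, 0.4, 0.1)
sum(probs * outcomes)  # expected value: 1.6
```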
## Do you think terminology is the major problem?
- Common, everyday meaning of "expected value"
- Is that ever what a lay person would "expect"?
- For which distributions is this "informative"? (unimodal, symmetric)
- When is it not subjectively informative? (others)
## Even if EV is not subjectively informative...
- It's still a fixed characteristic of the probability process
- Estimate it!
- Sometimes we engage in this game
- Stipulate a theoretical distribution
- Calculate the average from a sample
- Note the sample is "so far" from what we theorized about the EV that the theory must be wrong!
## Got R? {.allowframebreaks}
```
x1 <- rpois(5000, lambda = 0.2)
hist(x1, main = "My EV is 0.2")
```
```
x2 <- rpois(5000, lambda = 10)
hist(x2, main = "My EV is 10")
```
- Please make note of these numbers
```
mean(x1)
mean(x2)
sd(x1)
sd(x2)
```
## I got {.allowframebreaks}
```{r}
x1 <- rpois(5000, lambda = 0.2)
hist(x1, main = "My EV is 0.2")
```
- Sometimes I notice that R's `hist` doesn't work so well with discrete variables. I have ideas about that, but I don't want to bother you with them now.
```{r}
x2 <- rpois(5000, lambda = 10)
hist(x2, main = "My EV is 10")
```
- Wedge those into the same figure with `par(mfrow = c(2,1))`
```{r echo=FALSE}
op <- par(no.readonly=TRUE)
par(mfrow = c(2,1))
hist(x1, main = "My EV is 0.2")
hist(x2, main = "My EV is 10")
par(op)
```
## Sample average is one way to estimate EV
- The mean
\[
\bar{x}=\frac{\sum_{i=1}^{N} x_i}{N}
\]
- Need criteria to say if $\bar{x}$ is a "good" estimate of $E[x]$
- Later in the term
- unbiased
> $\bar{x}$ flutters around $E[x]$
- consistent
> As your sample grows, the gap between $\bar{x}$ and $E[x]$ will probably shrink (see the sketch below)
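
A minimal consistency sketch in R (the Poisson rate of 10 and the seed are illustrative choices): the running mean drifts toward $E[x]$ as $N$ grows.

```{r}
set.seed(1234)  # illustrative seed, for reproducibility only
x <- rpois(10000, lambda = 10)  # theoretical E[x] = 10
runningMean <- cumsum(x) / seq_along(x)
plot(runningMean, type = "l", xlab = "N", ylab = "running mean")
abline(h = 10, col = "red")  # the expected value we are estimating
```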
## Probabilities. Who Likes Calculus?
- This is a discrete variable, so we'd say it has "probability mass", technically different from the "probability density" of a continuous variable
- Easiest math
- Probability of an outcome simply given to us
- Easy to see addition and multiplication rules.
## Continuous variables
- If you can put up with integrals, you can describe these things more clearly.
- PDF, $f(x)$: the probability density at the score $x$
- CDF, $F(k)$: the cumulative distribution function, the probability of an observed score less than or equal to $k$ (see the sketch below)
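
In R, the `d*` and `p*` functions give the PDF and CDF; a quick sketch with the standard normal (the cutoff 1.96 is just an illustrative value):

```{r}
dnorm(0)      # f(0): density at the score 0
pnorm(1.96)   # F(1.96): probability of a score <= 1.96, about 0.975
## Integrating the PDF recovers the CDF
integrate(dnorm, -Inf, 1.96)$value
```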
## Extreme scores
- Sketch some PDF (plays per fumble)
- Mark any point k.
- We are almost never interested in the probability of a score exactly equal to k
- We are interested in outcomes more extreme than k (see the tail sketch below)
- Imagine
    - the distribution of what would happen
    - the data gives you some estimate k
    - we want to know if k is extremely different from the model
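
Tail probabilities in R: `lower.tail = FALSE` gives the chance of a score more extreme than $k$ (the standard normal and $k = 2$ are illustrative choices):

```{r}
k <- 2
pnorm(k, lower.tail = FALSE)  # P(X > k) for a standard normal
1 - pnorm(k)                  # the same thing, by subtraction
```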
# I hate the word population
## Data Generating Process
- I am almost never interested in describing "the people out there from which we are sampling"
- I am interested in estimating the parameters of the process that generates them.
## I want the distribution of an estimate
- What did you get for `mean(x1)` before?
- The distribution that estimate would have across repeated samples is its sampling distribution (sketch below)
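
A minimal simulation sketch of a sampling distribution (the Poisson setup mirrors the earlier slides; 1000 repetitions is an arbitrary choice):

```{r}
## Repeat the experiment many times; study the distribution of the estimate
xbars <- replicate(1000, mean(rpois(5000, lambda = 0.2)))
hist(xbars, main = "Sampling distribution of the sample mean")
```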
# Theoretical Variance
## Variance
- Expected value of $(x - E[x])^2$
- Can we calculate the variance of {1, 2, 3} with probabilities {0.5, 0.4, 0.1}? (see the sketch below)
- Var[x] is a characteristic of the probability process
- Sample variance is an estimate of it.
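
A worked check in R, reusing the outcomes and probabilities from the Expected Values slide:

```{r}
outcomes <- c(1, 2, 3)
probs <- c(0.5, 0.4, 0.1)
ev <- sum(probs * outcomes)       # E[x] = 1.6
sum(probs * (outcomes - ev)^2)    # Var[x] = 0.44
```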
## Terminology
- Differentiate the "theoretical property" from the "sample estimate"
- Expected Value <=> sample mean
- There's no special term for the theoretical variance analogous to "expected value", which creates confusion when discussing variance
## How to calculate sample variance estimates
- Here, the N and N-1 problem slaps us in the face.
- Intuitively, a good formula appears to be
$$s_1^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N}$$
- That is the maximum likelihood estimate
## But it is not unbiased.
- $E[s_1^2] \neq Var[x]$
- But this is unbiased
$$s_2^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N-1}$$
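
R's `var` uses the $N-1$ denominator; a quick sketch comparing the two estimates (the normal sample is illustrative):

```{r}
x <- rnorm(1000)
var(x)                                   # R's var() divides by N - 1
sum((x - mean(x))^2) / (length(x) - 1)   # the same thing, by hand
sum((x - mean(x))^2) / length(x)         # the ML (biased) version
```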