## Summer Camp Blog 2013
## Day 1
I forgot to mention:
The new copy of the "rockchalk" package is on our testing repository
install.packages("rockchalk", repos = "http://rweb.quant.ku.edu/kran", type = "source")
Version 1.7.90 is in "feature freeze" now, bug fixes only. No new functions, no new data. To see the beauty of it, run the install, then
library(rockchalk)
example(plotSlopes)
example(testSlopes)
Your eyes will dance with joy, I assure you!
Now, here are the points of interest that I collected on our first day of Stats Camp, 2013.
## Paul Johnson pauljohn at ku.edu
## Pascal Deboeck pascal at ku.edu
## 2013-06-03
## R functions mentioned in lectures and
## other pithy observations
## 1. Assignment symbol <- (don't use =)
x <- 1 + 2

## 2. Print contents of x
x  ## same as print(x)

## 2. List objects in a session
ls()
## ls is unix "list files", so natural to write same in R.
## Apparently ls() is identical to objects().

## 3. Get out of R
q()
## Usually say NO on saving workspace when it asks,
## unless you want to inherit same objects next time.
## Thus can short-circuit
q(save = "no")

## 4. c means "concatenate" or "column vector"
x <- c(1, 2, 3, 4, 5)
## Group together things of different types, and the lowest
## common denominator is adopted. Sometimes convenient, sometimes bad.
y <- c(1, 2, 3, 4, 5, "a")
y
## Every element is now a character string.
## You can see why that would be disastrous. Right?

## 5. Math
sin(x)
log(x)
x^2
sqrt(x)

## 6. Stat. Note return can be singular or vector
sum(x)
mean(x)
sd(x)
var(x)
min(x)
max(x)
range(x)

## 7. How many elements are in vector x
length(x)

## 8. Remove things. Erase x
rm(x)
## erase everything found by ls()
rm(list = ls())

## 9. sort.
x <- c(1, 2, 3, 4, 5)  ## recreate x, which step 8 erased
sort(x)
sort(x, decreasing = TRUE)

## 10. Read help pages
help("sort")  ## same as ?sort
## help.search("sort")

####
#### Section 2 begins

## 11. t.test
y <- c(2, 4, 4, 6, 8)  ## a second numeric vector (illustrative values)
t.test(x)
t.test(x, y)
t.test(x, y, var.equal = TRUE)

## 12. sample: choose from a collection with equal likelihood
sample(1:15, 1)
x <- c("bill", "willie", "gil")
sample(x, 1)
sample(x, 2, replace = TRUE)

## 13. Comparisons
## ==  <  >  <=  >=
## identical()  all.equal()
## & and && for "and" (puzzle, why two?)
## | and || for "or" (puzzle, why two?)
## NEGATION with "!"

## 14. Membership:
## selection by subscript. Use [ for subsetting
## with logical argument
undergrad <- c(TRUE, FALSE, TRUE, FALSE)
grade <- c(88, 44, 99, 11)
grade[undergrad]
## Binary: %in%
x <- c("a", "b", "a", "b", "c")
y <- c("z", "y", "z", "x", "a")
x %in% y
x[x %in% y]

## 15. Dates
## Just characters
x <- c("25-01-2009", "30-12-2008", "01-01-2003")
y <- as.Date(x, format = "%d-%m-%Y")
y
## y is now a date object, which some regression programs
## handle in a special way. It is not just characters anymore.
## Date objects can be subtracted to get time between values.
sort(x)  ## alphabetical
sort(y)  ## date progression
format(y, format = "%m-%Y")
weekdays(y)

## 16. Attributes, quick description of a thing's structure
attributes(x)
attributes(y)
str(x)
str(y)

## 17. Special symbols
## Reserved values: TRUE, FALSE, NA, NaN (plus the shorthands T and F)
## NEVER USE THOSE as variable names
## DO use those as variable values.
## T and F are ordinary variables that can be overwritten,
## so spell out TRUE and FALSE in your own code.
## All can be used as special symbolic values in variables.
## DO NOT put those in quotes, they are special symbols.
x <- c(1, 2, 3, NA, 5, NA)
## Symbols generally we would not need
## Inf for infinity
## Special pre-defined variables
pi

## 18. Packages:
## See what's installed
library()
## Install
install.packages(c("psych"), dep = TRUE)
install.packages("psych", dep = TRUE, repos = "http://rweb.quant.ku.edu/cran")
## check
available.packages()
## load package
library(MASS)
library(help = "MASS")
## example(function-name-here)
example(lm)
example(plot)

## 19. Where am I?
getwd()
## setwd("path/to/folder")  ## placeholder path; use forward slashes
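One way to look at the "why two?" puzzle from item 13: the single forms `&` and `|` work element by element on whole vectors, while the double forms `&&` and `||` are meant for a single TRUE/FALSE decision (as inside `if()`) and stop evaluating as soon as the answer is known. The sketch below uses made-up vectors `a` and `b`; it is an illustration, not part of the lecture script.

```r
## Illustrative logical vectors (not from the lecture)
a <- c(TRUE, FALSE, TRUE)
b <- c(TRUE, TRUE, FALSE)

## Single & and |: compared position by position, result is a vector
elementwiseAnd <- a & b
elementwiseOr <- a | b

## Double && and ||: one TRUE/FALSE in, one TRUE/FALSE out
scalarAnd <- TRUE && FALSE

## Short circuit: the left side already decides the answer,
## so the right side (which would raise an error) is never run
shortCircuit <- FALSE && stop("this is never evaluated")

elementwiseAnd
elementwiseOr
scalarAnd
shortCircuit
```

So use `&` and `|` when subsetting with `[`, and `&&` and `||` when writing a condition for `if()`.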
After Lunch
###################AFTERNOON#############
## Matrices

## 20. cbind, matrix. A vector is a COLUMN vector. In math, a
## "vector" is assumed to be column vector. This appears as a row
scale1 <- c(1, 5, 6, 5, 2, 6)
scale1
scale2 <- c(2, 5, 6, 8, 4, 8)
scale3 <- c(4, 1, 7, 1, 2, 1)
## but it is really a column, because it can be "column" bound with cbind
cbind(scale1, scale2, scale3)
## The matrix
matrix(c(scale1, scale2, scale3), nrow = 6)
## Or use matrix function. Various ways to do it
scaleCat <- c(scale1, scale2, scale3)
myMat <- matrix(scaleCat, nrow = 6)  ## assumes byrow = FALSE

## 21. Interrogate a matrix: find rows, columns
dim(myMat)
## dim gives size, but it has a super secret power: it can
## reshape vectors into matrices, or so forth.
dimnames(myMat)
colnames(myMat)  ## gets column names
colnames(myMat) <- c("scaleA", "scaleB", "scaleC")  ## assigns
## could add rownames with rownames()
## dimnames() reveals rownames and colnames.
## We could also set names with dimnames(), that requires a
## list structure as input
## Use bracket notation
myMat[1, ]  ## first row
myMat[ , 1]  ## first column
myMat[3:5, 2]  ## rows 3-5 in column 2
myMat[c(1, 3, 5), ]  ## take rows 1 3 5
## Can create temporary vector, sum that
x <- myMat[ , 1]
sum(x)
## or go direct (faster, more difficult to debug)
sum(myMat[ , 1])
## if matrix has colnames, can access by name
colnames(myMat) <- c("hello", "goodbye", "hola")
myMat[ , "hello"]
myMat[ , c("hello", "goodbye")]

## 21 load() retrieves a saved R "thing"
## whatever is saved in there will "plop" into the workspace
## save(my-R-object, file = "myRthing.RData")
## Custom in R: the saved thing should be suffixed ".rda" or ".RData".
## No other suffix is accepted on CRAN for saved data objects in packages.
## Note: saveRDS may be more to your liking. It saves only one
## particular thing, but the retrieval allows us to name the thing
## we get back.
load("myRthing.RData")

## 22. Data Frame
## Make a data frame
id <- 1001:1040
rt1 <- runif(40, 250, 500)
rt2 <- rt1 * 0.2 + runif(40, 200, 450)
gender <- sample(c("M", "F"), 40, replace = TRUE)
dat <- data.frame(id, rt1, rt2, gender)
head(dat)
colnames(dat)
colnames(dat) <- c("id", "thing1", "thing2", "sex")
head(dat)
## colnames is an example of the same name being both an
## accessor and a setter. They are actually 2 different functions,
## but the R programmers have created a clever way so that we
## only have to remember one name for those 2 purposes.

## 23. Lists
## R functions often return lists, and we need to navigate them.
## Here's the way I'd do a t.test to find out if men and women
## have different means for rt1 (renamed thing1 above; gender is now sex):
myttest <- t.test(thing1 ~ sex, data = dat)
attributes(myttest)
## You see:
## $names
## [1] "statistic" "parameter" "p.value" "conf.int" "estimate"
## [6] "null.value" "alternative" "method" "data.name"
## $class
## [1] "htest"
myttest$p.value
## same as
myttest[["p.value"]]
## Why 2 brackets? This gives a list with p.value as one element in it
myttest["p.value"]
## whereas the former rips vector "p.value" out of the list altogether.
is.list(myttest["p.value"])
is.list(myttest[["p.value"]])

##########Final afternoon session

## 24. Descriptive information
## min()
## max()
## mean()
## sd()
## var()
## cov()
## Note most allow na.rm = TRUE, but cov and cor do not
## because they are matrix oriented. Either take explicit steps
## to delete rows with NAs, or pass use = "complete.obs".
## Everybody needs central tendency values: mean, median, mode.
mean(x)
mean(x, na.rm = TRUE)
median(x)
## The mode() function returns the data type, not the statistical mode.
## For a factor variable x, this creates a frequency table
## and then reports back on which is most common
which.max(table(x))
## see that on the iris data
data(iris)
head(iris)
summary(iris)
## We dance around in a circle to get the most frequently
## observed value. Another example below
which.max(table(iris$Species))
theModeOfSpecies <- names(which.max(table(iris$Species)))
sd(iris$Sepal.Length)
## Caution: want the mean for each column? Necessary to do this:
## choose first 4 columns, then use apply to get mean of each
apply(iris[ , 1:4], 2, mean, na.rm = TRUE)
## or just run
rockchalk::summarize(iris)
x <- sample(c("A", "B"), 150, replace = TRUE)
is.factor(x)
x <- factor(x)
table(x)
x.mode <- names(which.max(table(x)))
library(psych)
describe(iris)
## competes with rockchalk::summarize() and many other functions
## that enhance summary.data.frame()

## 25 Inferential
## t.test()
## X is a matrix
X <- iris[ , 1:4]
X <- as.matrix(X)
cor(X)
cor.test(X[ , 1], X[ , 2])  ## cor.test needs two variables
## Note problem: cor.test won't apply to whole matrix
## like some other programs might.
## What to do?
## Package Hmisc has rcorr for testing a matrix.
## Am testing alternatives in base R.
## Hmm. This works, but still just one by one.
cor.test( ~ Sepal.Length + Sepal.Width, data = X)
## fails:
## apply(X, 2, cor.test)
## Becoming angry, will have to think. I'm pretty sure
## this will devolve into some tedious computing on the language
## where I create a "mix and match" of variables. Like this
Xnames <- colnames(X)
expand.grid(Xnames, Xnames)
## stringsAsFactors = FALSE keeps the names as characters for indexing
MM <- unique(expand.grid(Xnames, Xnames, stringsAsFactors = FALSE))
## Now iterate through rows. Note redundancy and silly correlations
## calculated
mapply(function(n1, n2) {cor.test(X[ , n1], X[ , n2])}, MM[ , 1], MM[ , 2])
## Well those are numbers, but not clean.
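A possible way to tidy up the pairwise testing above, offered as a sketch and not as the lecture's solution: `combn()` lists each pair of columns once, which avoids both the redundant pairs and the silly correlation of a variable with itself. The names `varPairs` and `corTable` are made up for this illustration.

```r
## Each pair of the four numeric iris columns, listed once
X <- as.matrix(iris[ , 1:4])
varPairs <- t(combn(colnames(X), 2))  ## 6 rows, 2 columns of names

## One row per pair; fill in the correlation and its p-value
corTable <- data.frame(var1 = varPairs[ , 1],
                       var2 = varPairs[ , 2],
                       r = NA_real_,
                       p.value = NA_real_,
                       stringsAsFactors = FALSE)

for (i in seq_len(nrow(corTable))) {
    ct <- cor.test(X[ , corTable$var1[i]], X[ , corTable$var2[i]])
    corTable$r[i] <- unname(ct$estimate)
    corTable$p.value[i] <- ct$p.value
}

corTable
```

The p-values here are not adjusted for multiple testing; `p.adjust(corTable$p.value)` is one base R option if that matters.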
## chisq.test()
## Fisher's exact test, a hypo test for small N, to help
## when chisquare has small cell sizes
## fisher.test(data)  ## "data" stands for your table or matrix of counts

## Analysis of variance
## f: effect size for ANOVA, f = sdmeans / sderror
## Cohen suggests 0.10 = small, 0.25 = medium
## for small effect size:
data(ToothGrowth)
power.anova.test(groups = 6, n = 10, between.var = 1, within.var = 100)
power.anova.test(groups = 3, n = 20, between.var = 1, within.var = 100)

## Homogeneity of variance
library(car)
## Levene's test
## Massage the data to get omnibus groups variable
## See the notes, I am not sure I'm tracking along with this
grp <- interaction(ToothGrowth$supp, ToothGrowth$dose)  ## one factor, 6 cells
leveneTest(ToothGrowth$len, grp)
mpj1 <- aov(len ~ as.factor(supp), data = ToothGrowth)
A1 <- aov(len ~ as.factor(supp) * as.factor(dose), data = ToothGrowth)
anova(A1)
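To see what the "omnibus groups variable" is for, here is a base R sketch of the idea behind Levene's test: compute each observation's absolute deviation from its group's center, then run a one-way ANOVA on those deviations. This assumes the median-centered (Brown-Forsythe) variant, which matches the documented default `center = median` of `car::leveneTest`. The names `tg`, `group`, `absDev`, and `lev` are made up for this illustration.

```r
## Work on a copy so the built-in data set is left alone
tg <- ToothGrowth

## Omnibus groups variable: one factor with a level per supp-by-dose cell
tg$group <- interaction(tg$supp, tg$dose)

## Absolute deviation of each observation from its own group's median
groupMedian <- ave(tg$len, tg$group, FUN = median)
tg$absDev <- abs(tg$len - groupMedian)

## One-way ANOVA on the deviations; a small p-value suggests
## the group variances differ
lev <- anova(lm(absDev ~ group, data = tg))
lev
```

With 6 groups and 60 observations the F test has 5 and 54 degrees of freedom.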