R Summer Camp 2013 Journal

## Summer Camp Blog 2013
## Day 1

I forgot to mention:

The new copy of the "rockchalk" package is on our testing repository:

install.packages("rockchalk", repos = "http://rweb.quant.ku.edu/kran", type = "source")

Version 1.7.90 is in "feature freeze" now, bug fixes only. No new functions, no new data. To see the beauty of it, run the install, then

library(rockchalk)
example(plotSlopes)
example(testSlopes)

Your eyes will dance with joy, I assure you!

Now, here are the points of interest that I collected on our first day of Stats Camp 2013.

## Paul Johnson pauljohn at ku.edu
## Pascal Deboeck pascal at ku.edu
## 2013-06-03

## R functions mentioned in lectures and
## other pithy observations

## 1. Assignment symbol <- (don't use =)

x <- 1 + 2

## 2. Print contents of x

x

## same as

print(x)

## 2. list objects in a session
ls()
## ls is the Unix command to "list files", so it is natural to use the same name in R.
## ls() is identical to
objects()

## 3. get out of R

q()
## usually say NO on saving workspace when it asks,
## unless you want to inherit same objects next time.
## Thus can short-circuit
q(save = "no")


## 4. c means "concatenate" or "column vector"

x <- c(1, 2, 3, 4, 5)

## Group together things of different types, and the lowest
## common denominator is adopted. Sometimes convenient, sometimes bad.

y <- c(1, 2, 3, 4, 5, "a")
y

## You can see why that would be disastrous. Right?
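
## A minimal sketch of why that hurts, assuming we later want to do
## arithmetic on y. The name yNum is made up for this illustration.

```r
## One character value turns the whole vector into character
y <- c(1, 2, 3, 4, 5, "a")
class(y)                                  ## "character"

## Converting back is possible, but "a" cannot be a number and becomes NA
## (yNum is a made-up name for this illustration)
yNum <- suppressWarnings(as.numeric(y))
yNum
mean(yNum, na.rm = TRUE)                  ## 3
```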

## 5. Math

sin(x)
log(x)
x^2
sqrt(x)

## 6. Stat. Note the return value can be a single number or a vector

sum(x)
mean(x)
sd(x)
var(x)
min(x)
max(x)
range(x)

## 7. How many elements are in vector x

length(x)


## 8. Remove things. Erase x

rm(x)

## erase everything found by ls()

rm(list = ls())

## 9. sort. (x was erased in step 8, so create it again first)

x <- c(3, 1, 5, 2, 4)
sort(x)
sort(x, decreasing = TRUE)

## 10. Read help pages

help("sort")
## same as
?sort
## search all help pages for a keyword, same as ??sort
help.search("sort")


####


#### Section 2 begins

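## 11. t.test (needs numeric vectors, so create x and y again with made-up values)

x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 6, 8, 10)
t.test(x)      ## one sample: is the mean of x different from 0?
t.test(x, y)   ## two samples, Welch test (variances not assumed equal)

t.test(x, y, var.equal = TRUE)   ## classic pooled-variance test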

## 12. sample: choose from a collection with equal likelihood

sample(1:15, 1)

x <- c("bill", "willie", "gil")
sample(x, 1)
sample(x, 2, replace = TRUE)


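## 13. Comparisons

## ==   equal to
## <    less than
## >    greater than
## <=   less than or equal to
## >=   greater than or equal to

## identical()  exact equality of two whole objects
## all.equal()  equality within a small numerical tolerance

## &    and, elementwise on vectors
## &&   and, for a single TRUE/FALSE (puzzle, why two? && is meant for if() conditions)
## |    or, elementwise on vectors
## ||   or, for a single TRUE/FALSE (same puzzle, same answer)

## NEGATION with "!"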
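
## A small sketch of the single versus double operators, and of
## identical() versus all.equal(). The vectors a, b and the names z,
## isBig are made up for this illustration.

```r
a <- c(TRUE, TRUE, FALSE)   ## made-up logical vectors
b <- c(TRUE, FALSE, FALSE)
a & b      ## elementwise: TRUE FALSE FALSE
a | b      ## elementwise: TRUE TRUE FALSE
!a         ## FALSE FALSE TRUE

## && and || expect one TRUE/FALSE on each side and stop as soon as
## the answer is known, so "z > 10" is never evaluated here
z <- NULL
isBig <- !is.null(z) && z > 10
isBig      ## FALSE

## floating point: exact comparison fails, tolerance comparison succeeds
identical(0.1 + 0.2, 0.3)          ## FALSE
isTRUE(all.equal(0.1 + 0.2, 0.3))  ## TRUE
```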

## 14 Membership:

## selection by subscript. Use [ for subsetting
## with logical argument

undergrad <- c(TRUE, FALSE, TRUE, FALSE)
grade <- c(88, 44, 99, 11)
grade[undergrad]


## Binary: %in%

x <- c("a","b","a","b", "c")
y <- c("z", "y","z", "x","a")
x %in% y

x[x %in% y]


## 15. Dates

## Just characters
x <- c("25-01-2009", "30-12-2008", "01-01-2003")

y <- as.Date(x, format = "%d-%m-%Y")
y

## y is now a date object, which some regression programs
## handle in a special way. It is not just characters anymore.
## date objects can be subtracted to get time between values.

sort(x) ## alphabetical

sort(y) ## date progression


format(y, format = "%m-%Y")
weekdays(y)
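
## The subtraction mentioned above can be sketched like this. The
## same three dates are recreated so the block runs on its own; the
## name gap is made up for this illustration.

```r
y <- as.Date(c("25-01-2009", "30-12-2008", "01-01-2003"), format = "%d-%m-%Y")

y[1] - y[2]                 ## Time difference of 26 days

## difftime() lets us choose the units (gap is a made-up name)
gap <- difftime(y[1], y[3], units = "weeks")
gap

as.numeric(y[1] - y[2])     ## 26, as a plain number for later arithmetic
```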


## 16. Attributes, quick description of a thing's structure

attributes(x)

attributes(y)

str(x)

str(y)


## 17. Special symbols

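## Reserved values : TRUE, FALSE, NA, NaN
## T and F are not reserved. They are ordinary variables that start
## out equal to TRUE and FALSE, so spell out TRUE and FALSE to be safe.
## NEVER USE ANY OF THOSE as variable names

## DO use those as variable values.
## All can be used as special symbolic values in variables.
## DO NOT put those in quotes, they are special symbols.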

x <- c(1, 2, 3, NA, 5, NA)

## A symbol we generally would not need:
## Inf for infinity

## Special pre-defined variables

pi


## 18. Packages:

## See what's installed

library()

## Install

install.packages(c("psych"), dependencies = TRUE)

install.packages("psych", dependencies = TRUE, repos = "http://rweb.quant.ku.edu/cran")

## check what is available on the repository

available.packages()

## load package

library(MASS)

library(help = "MASS")


## example(function-name-here)

example(lm)

example(plot)


## 19. Where am I?

getwd()

setwd("path/to/my/project")  ## made-up path, substitute your own folder

## use forward slashes, even on Windows

After Lunch


###################AFTERNOON#############

## Matrices


## 20. cbind, matrix. A vector is a COLUMN vector.  In math, a
## "vector" is assumed to be a column vector. When printed, though,
## it appears as a row

scale1 <- c(1, 5, 6, 5, 2, 6)
scale1

scale2 <- c(2, 5, 6, 8, 4, 8)
scale3 <- c(4, 1, 7, 1, 2, 1)

## but it is really a column, because it can be "column" bound with cbind


cbind(scale1, scale2, scale3)

## The matrix

matrix( c(scale1, scale2, scale3), nrow = 6)

## Or use matrix function. Various ways to do it

scaleCat <- c(scale1, scale2, scale3)

myMat <- matrix(scaleCat, nrow = 6) ## assumes byrow = FALSE



## 21. Interrogate a matrix: find rows, columns

dim(myMat)

## dim gives size, but it has a super secret power: it can
## reshape vectors into matrices, and so forth.

dimnames(myMat)

colnames(myMat) ## gets column names

colnames(myMat) <- c("scaleA", "scaleB", "scaleC") ## assigns

## could add rownames with rownames()

## dimnames() reveals rownames and colnames.

## We could also set names with dimnames(), that requires a 
## list structure as input
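
## A minimal sketch of both tricks: dim() reshaping a vector, and
## dimnames() taking a list (row names first, column names second).
## The vector v and the labels are made up for this illustration.

```r
v <- 1:6               ## made-up vector
dim(v) <- c(3, 2)      ## v is now a 3 x 2 matrix, filled column by column
v

## dimnames wants a list: first the row names, then the column names
dimnames(v) <- list(c("r1", "r2", "r3"), c("c1", "c2"))
v
v["r2", "c2"]          ## 5
```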


## Use bracket notation

myMat[ 1, ] ##first row

myMat[ , 1] ## first column

myMat[3:5, 2]  ## rows 3-5 in column 2

myMat[c(1,3,5), ]  ## take rows 1 3 5


## Can create temporary vector, sum that

x <- myMat[ , 1]
sum(x)
## or go direct (faster, more difficult to debug)

sum(myMat[ , 1])

## if matrix has colnames, can access by name

colnames(myMat) <- c("hello", "goodbye", "hola")


myMat[ , "hello"]

myMat[ , c("hello", "goodbye")]



## 21 load() retrieves a saved R "thing"
## whatever is saved in there will "plop" into the workspace

## save(my-R-object, file = "myRthing.RData")

## Custom in R: the saved thing should be suffixed ".rda" or ".RData".
## No other suffix is accepted on CRAN for saved data objects in packages.

## Note: saveRDS may be more to your liking. It saves only one
## particular thing, but the retrieval (with readRDS) allows us to
## name the thing we get back.

load("myRthing.RData")
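
## A round trip through both approaches, sketched with temporary
## files so nothing is left behind. The object scores and the names
## f1, f2, myCopy are made up for this illustration.

```r
scores <- c(88, 44, 99, 11)          ## made-up object to save

## save/load: the object comes back under its original name
f1 <- tempfile(fileext = ".RData")
save(scores, file = f1)
rm(scores)
load(f1)                             ## "scores" plops back into the workspace
scores

## saveRDS/readRDS: one object, and we pick the name on the way back
f2 <- tempfile(fileext = ".rds")
saveRDS(scores, file = f2)
myCopy <- readRDS(f2)
identical(myCopy, scores)            ## TRUE
```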


## 22. Data Frame

## Make a data frame

id <- 1001:1040
rt1 <- runif(40, 250, 500)
rt2 <- rt1 * 0.2 + runif(40, 200, 450)
gender <- sample(c("M","F"), 40, replace = TRUE)

dat <- data.frame(id, rt1, rt2, gender)

head(dat)

colnames(dat)

colnames(dat) <- c("id", "thing1", "thing2", "sex")

head(dat)

## colnames is an example of the same name being both an
## accessor and a setter.  They are actually 2 different functions,
## but the R programmers have created a clever way so that we
## only have to remember one name for those 2 purposes.
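
## To see the two functions, here is a small sketch. The setter is a
## function literally named `colnames<-`, and R calls it for us when
## colnames(...) appears on the left of an assignment. The data frame
## dat2 and its column names are made up for this illustration.

```r
dat2 <- data.frame(a = 1:3, b = 4:6)     ## made-up data frame
colnames(dat2)                           ## accessor: "a" "b"

colnames(dat2) <- c("first", "second")   ## setter, the usual way
colnames(dat2)

## the same thing spelled out as an ordinary function call
dat2 <- `colnames<-`(dat2, value = c("uno", "dos"))
colnames(dat2)                           ## "uno" "dos"
```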




## 23. Lists

## R functions often return lists, and we need to navigate them.

## Here's the way I'd do a t.test to find out if men and women
## have different means for rt1 (renamed "thing1" above; gender is now "sex"):

myttest <- t.test(thing1 ~ sex, data = dat)

attributes(myttest)

## You see:
## $names
## [1] "statistic"   "parameter"   "p.value"     "conf.int"    "estimate"   
## [6] "null.value"  "alternative" "method"      "data.name"  

## $class
## [1] "htest"

myttest$p.value
## same as

myttest[["p.value"]]

## Why 2 brackets?  One bracket gives a list with p.value as the only element in it

myttest["p.value"]

## whereas the double bracket rips the vector "p.value" out of the list altogether.

is.list(myttest["p.value"])     ## TRUE

is.list(myttest[["p.value"]])   ## FALSE



##########Final afternoon session


## 24. Descriptive information
## min()
## max()
## mean()
## sd()
## var()
## cov()



## note most allow na.rm = TRUE, but cov and cor do not
## because they are matrix oriented. They have a "use" argument
## instead, or we take explicit steps to delete the rows with NAs
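
## A sketch of the options for cor when rows have NAs. Two missing
## values are planted in a copy of the iris measurements purely for
## illustration; the names Xna and Xclean are made up.

```r
Xna <- as.matrix(iris[ , 1:4])
Xna[c(2, 7), "Sepal.Width"] <- NA        ## plant two NAs (illustration only)

cor(Xna)[1, 2]                           ## NA, and there is no na.rm to help

cor(Xna, use = "complete.obs")           ## drops every row that has an NA
cor(Xna, use = "pairwise.complete.obs")  ## drops rows pair by pair

Xclean <- na.omit(Xna)                   ## or delete the rows ourselves
nrow(Xclean)                             ## 148
cor(Xclean)
```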

## Everybody needs central tendency values: mean, median, mode.

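x <- c(1, 2, 3, NA, 5, NA)   ## same values as in step 17
mean(x)                      ## NA, because x has missing values

mean(x, na.rm = TRUE)
median(x, na.rm = TRUE)

## R's mode() function returns the data type, not the statistical mode.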

## for a factor variable x, this creates a frequency table
## and then reports back on which is most common
which.max(table(x))

## see that on the iris data

data(iris)

head(iris)


summary(iris)

## We dance around in a circle to get the most frequently
## observed value. Another example below
which.max(table(iris$Species))

theModeOfSpecies <- names(which.max(table(iris$Species)))


sd(iris$Sepal.Length)


## Caution: want the mean for each column? It is necessary to
## choose the first 4 columns (Species is not numeric), then use
## apply to get the mean of each (colMeans() would also do it)

apply(iris[ , 1:4], 2, mean, na.rm = TRUE)


## or just run

rockchalk::summarize(iris)



x <- sample(c("A", "B"), 150, replace = TRUE)
is.factor(x)
x <- factor(x)

table(x)
x.mode <- names(which.max(table(x)))


library(psych)
describe(iris)
## competes with rockchalk::summarize() and many other functions
## that enhance summary.data.frame()



## 25 Inferential

## t.test(), as in step 11

## X is a matrix

X <- iris[ , 1:4]
X <- as.matrix(X)

cor(X)

cor.test(X[ , 1], X[ , 2])

## Note problem: cor.test won't apply to a whole matrix
## like some other programs might.

## What to do ?

## package Hmisc has rcorr for testing a whole matrix.

## Am testing alternatives in base R.

## Hmm. This works, but still just one by one.
cor.test( ~ Sepal.Length + Sepal.Width, data = X)

## fails, because cor.test needs two variables at a time:
##
## apply(X, 2, cor.test)

## Becoming angry, will have to think. I'm pretty sure
## this will devolve into some tedious computing on the language
## where I create a "mix and match" of variables. Like this

Xnames <- colnames(X)

expand.grid(Xnames, Xnames, stringsAsFactors = FALSE)

MM <- unique(expand.grid(Xnames, Xnames, stringsAsFactors = FALSE))

## Now iterate through rows. Note redundancy and silly correlations
## calculated (each variable with itself, each pair twice)

mapply(function(n1, n2) {cor.test(X[ , n1], X[ , n2])}, MM[ , 1], MM[ , 2])

## Well those are numbers, but not clean.
##
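
## One possible way to get cleaner output, sketched in base R: combn()
## lists each distinct pair once, and we keep only the correlation and
## p-value from each test. This is an alternative I am trying, not a
## standard recipe; varPairs and res are made-up names.

```r
X <- as.matrix(iris[ , 1:4])

## 2 x 6 character matrix, one column per distinct pair of variables
varPairs <- combn(colnames(X), 2)

## run cor.test on each pair, keep just the estimate and the p-value
res <- apply(varPairs, 2, function(p) {
    ct <- cor.test(X[ , p[1]], X[ , p[2]])
    c(r = unname(ct$estimate), p.value = ct$p.value)
})
colnames(res) <- paste(varPairs[1, ], varPairs[2, ], sep = ":")

round(t(res), 3)   ## one row per pair
```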


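## chisq.test() tests independence in a table of counts
## chisq.test(myTable)   ## myTable is a placeholder for your own table


## fisher's exact test, a hypo test for small N, to help
## when chisquare has small cell sizes
## fisher.test(myTable)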
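
## A sketch of both tests on one small table. The counts, labels and
## the names myTable, chi, fish are all made up for this illustration.

```r
## made-up 2 x 2 table of counts
myTable <- matrix(c(12, 5, 3, 10), nrow = 2,
                  dimnames = list(group = c("treated", "control"),
                                  outcome = c("improved", "not")))
myTable

chi <- chisq.test(myTable)   ## continuity correction is the default for 2 x 2
chi
chi$expected                 ## look here for small expected cell counts

fish <- fisher.test(myTable) ## exact test, no large-sample approximation
fish$p.value
```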


## Analysis of variance

## f: effect size for ANOVA, f = sdmeans / sderror
## Cohen suggests 0.10 = small, 0.25 = medium, 0.40 = large
## for small effect size (roughly sqrt(1/100) = 0.10):
data(ToothGrowth)
power.anova.test(groups = 6,  n = 10, between.var = 1, within.var = 100)

power.anova.test(groups = 3, n = 20, between.var = 1, within.var = 100)
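
## The same function can be turned around: leave n out, supply the
## power we want, and it solves for the sample size per group. A
## sketch, assuming a target power of 0.80 (my choice for
## illustration); pw is a made-up name.

```r
## n is omitted, so power.anova.test solves for it
pw <- power.anova.test(groups = 3, between.var = 1, within.var = 100,
                       power = 0.80)
pw
ceiling(pw$n)    ## observations needed in each group, rounded up
```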


## Homogeneity of variance

library(car) ## levene's test

## Massage the data to get omnibus groups variable

## See the notes, I am not sure I'm tracking along with this
leveneTest(len ~ as.factor(supp) * as.factor(dose), data = ToothGrowth)

mpj1 <- aov(len ~ as.factor(supp), data = ToothGrowth)

A1 <- aov(len ~ as.factor(supp) * as.factor(dose), data = ToothGrowth)

anova(A1)
