Rtips

(I thought it was cute to call this "StatsRus" but the Toystore's laywer called and, well, you know...)

If you need a tip sheet for R, here it is.

This is not a substitute for R documentation, just a list of things I'm using to remember what I read in the email list.

Heed the words of Brian D. Ripley, "One enquiry came to me yesterday which suggested that some users of binary distributions do not know that R comes with two Guides in the doc/manual directory plus an FAQ and the help pages in book form. I hope those are distributed with all the binary distributions (they are not made nor installed by default). Windows-specific versions are available in PDF in rw1001d1.zip and rw1001d2.zip"

Credit for the FaqManager software goes to Stas Bekman, its author.

Table of contents

1 Data Input/Output
2 Working with data frames: Recoding, selecting, aggregating
3 Matrices and vector operations
4 Applying functions, tapply, etc
5 Graphing
6 Common Statistical Chores
7 Model Fitting (Regression-type things)
8 Packages: Where do I find....
9 Misc. web resources not listed in main R materials (?)
10 R workspace
11 Interface with the operating system
12 Stupid R tricks: basics you can't live without
13 Misc R usages I find interesting


1 Data Input/Output


1.1 Bring raw numbers into R   (31/12/2001 Paul Johnson <pauljohn@ku.edu>)

This is truly easy. Suppose you've got numbers in a space-separated file "myData", with column names in the first row. R's

myDataFrame<-read.table("myData",header=TRUE)

command works great.

If you type "?read.table" it tells about importing files with other delimiters.

Suppose you have tab delimited data with blank spaces to indicate "missing" values. Do this:

> myDataFrame<-read.table("myData",sep="\t",na.strings=" ",header=TRUE)

(new R1.3 feature) Suppose your data is in a gzipped file, myData.gz. Do this:

myDataFrame <- read.table(gzfile("myData.gz"), header=T)

If you read columns in from separate files, combine into a data frame as:

variable1 <- scan("file1")
variable2 <- scan("file2")
mydata <- cbind(variable1, variable2)
#or use the equivalent:
#mydata <- data.frame(variable1, variable2)
#Optionally save dataframe in R object file with:
write.table(mydata, file="filename3")

[top]

1.2 Basic notation on data access   (06/04/2001 Paul Johnson <pauljohn@ku.edu>)

To access the columns of a data frame "x" with the column number, say x[,1], to get the first column. If you know the column name, say "pjVar1", it is the same as x$pjVar1.

If you attach that data frame with attach(x) then you can just refer to the variable pjVar1 and R will find it in the data frame. I have had some trouble with this.

In a regression, include the option data=x and then you can use pjVar1 as a variable name. That makes the cleanest code.

Grab an element in a list as x[[1]]

For instance, if a data frame is "nebdata" then

   nebdata[1, 2]
or
   nebdata[nebdata$Volt=="ABal", 2]
or by attaching the data frame
   attach(nebdata)
   nebdata[Volt=="ABal", 2]
   detach(nebdata)
(from Diego Kuonen)

[top]

1.3 Checkout the new Data Import/Export manual   (13/08/2001 Paul Johnson <pauljohn@ku.edu>)

With R-1.2, the R team released a manual about how to get data into R and out of R. That's the first place to look if you need help. Right now it is at http://lib.stat.cmu.edu/R/CRAN/doc/manuals/R-data.pdf

[top]

1.4 Exchange data between R and other programs (Excel, etc)   (09/29/2005 Paul Johnson <pauljohn@ku.edu>)

Patience is the key. If you understand the format in which your data is currently held, chances are good you can translate it into R with more or less ease.

Most commonly, people seem to want to import Microsoft Excel spreadsheets. I believe there is an ODBC approach to this, but I think it is only for MS Windows. For all users, the better approach is to proceed as follows.

Step 1. Manually massage your Excel sheet so that it has a row of names at the top, and it has numbers or NA values filled in for all cells.

Step 2. First, try the easy route. In the gregmisc package collection, there's a subsection called "gdata" and it has a "read.xls" method. As long as you tell it which sheet you want to read out of your data set, and you add header=T and whatever other options you'd like in an ordinary read.table usage, then it has worked well for us.

Step 3. Suppose step 2 did not work. Then you have something irregular in your Excel sheet and you should proceed as follows. Open the Excel sheet in a spreadsheet (we like gnumeric best, but all are probably OK) and check it over, and then choose File/Save As and look for an option like "text configurable" or such. Choose the delimiter as the tab key, and if you have options for numerical output, try to turn off the usage of commas or other unnecessary visual delimiters in numbers.

After you save that file in text mode, open it in Emacs and look through to make sure that 1) your variable names in the first row do not have spaces or quotation marks or underscores or special characters, and 2) look to the bottom to make sure your spreadsheet program did not insert any crud. If so, delete it. If you see commas in numeric variables, just use the Edit/Search/replace option to delete them.

Then read in your data with read.table ("filename",header=T,sep="\t")

I have tested the "foreign" package by importing an SPSS file and it worked great. I've had great results importing Stata Data sets too.

Here's a big caution for you, however. If your SPSS or Stata numeric variables have some "value lables", say 98=No Answer and 99=Missing, then R will think that the variable is a factor, and it will convert it into a factor with 99 possible values. The foreign library commands for reading spss and dta files have options to stop R from trying to help with factors and I'd urge you to read the manual and use them.

If you use read.spss, for example, setting use.value.labels=F will stop R from creating factors at all. If you don't want to go that far, there's an option max.value.labels that you can set to 5 or 10, and stop it from seeing 98=Missing and then creating a factor with 98 values. It will only convert variables that have fewer than 5 or 10 values. If you use read.dta (for Stata), you can use the option convert.factors=F.

Also, if you are using read.table, you may have trouble if your numeric variables have any "illegal" values, such as letters. Then R will assume you really intend them to be factors and it will sometimes be tough to fix. If you add the option as.is=T, it will stop that cleanup effort by R.

At one time, the SPSS import support in foreign did not work for me, and so I worked out a routine of copying the SPSS data into a text file, just as described for Excel.

SPSS: Use |File, Export| menu in the data editor and export as a tab-separated file (for simplicity, say |mydata.txt| in the top directory on the |C:| drive). That creates tab-separated columns. Make sure that it contains a header line with the variable names and use
read.csv2("C:/mydata.txt",sep="\t")
to read it in. This refers to locales where the comma is used as decimal separator, otherwise use read.csv().

I have a notoriously difficult time with SAS XPORT files and don't even try anymore. I've seen several email posts by Frank Harrel in r-help and he has some pretty strong words about it. I do have one working example of importing the Annenberg National Election Study into R from SAS and you can review that at http://lark.cc.ku.edu/~pauljohn/ANES/2002. I wrote a long boring explanation. Honestly, I think the best thing to do is to find a bridge between SAS and R, say use some program to convert the SAS into Excel, and go from there. Or just write the SAS data set to a file and then read into R with read.table() or such.

[top]

1.5 Merge data frames   (04/23/2004 Paul Johnson <pauljohn@ku.edu>)

update:Merge is confusing! But if you study this, you will see everything in perfect clarity:

x1 <- rnorm(100)
x2 <- rnorm(100)
x3 <- rnorm(100)
x4 <- rnorm(100)
ds1 <- data.frame(city=rep(1,100),x1=x1,x2=x2)
ds2 <- data.frame(city=rep(2,100),x1=x1,x3=x3,x4=x4)
merge(ds1,ds2,by.x=c("city","x1"),by.y=c("city","x1"),all=T)

The trick is to make sure R understands which are the "common variables" in the two datasets so it lines them up, and then all=T is needed to say that you don't want to throw away the variables that are only in one set or the other. Read the help page over and over, you eventually get it.

More examples:

experiment<-data.frame(times=c(0,0,10,10,20,20,30,30),expval=c(1,1,2,2,3,3,4,4))

simul<-data.frame(times=c(0,10,20,30),simul=c(3,4,5,6))

I want a merged datatset like:
> times expval simul
> 1 0 1 3
> 2 0 1 3
> 3 10 2 4
> 4 10 2 4
> 5 20 3 5
> 6 20 3 5
> 7 30 4 6
> 8 30 4 6

Suggestions

merge(experiment, simul)

does all the work for you. (from Brian D. Ripley)

Or consider:

exp.sim<-data.frame(experiment,simul=simul$simul[match(experiment$times,simul$times)])(from Jim Lemon)

I have dataframes like this:
state count1 percent1
CA 19 0.34
TX 22 0.35
FL 11 0.24
OR 34 0.42
GA 52 0.62
MN 12 0.17
NC 19 0.34

state count2 percent2
FL 22 0.35
MN 22 0.35
CA 11 0.24
TX 52 0.62

And I want state count1 percent1 count2 percent2
CA 19 0.34 11 0.24
TX 22 0.35 52 0.62
FL 11 0.24 22 0.35
OR 34 0.42 0 0
GA 52 0.62 0 0
MN 12 0.17 22 0.35
NC 19 0.34 0 0
( from Yu-Ling Wu )

In response, Ben Bolker said

state1 <- c("CA","TX","FL","OR","GA","MN","NC")
count1 <- c(19,22,11,34,52,12,19)
percent1 <- c(0.34,0.35,0.24,0.42,0.62,0.17,0.34)

state2 <- c("FL","MN","CA","TX")
count2 <- c(22,22,11,52)
percent2 <- c(0.35,0.35,0.24,0.62)

data1 <- data.frame(state1,count1,percent1)
data2 <- data.frame(state2,count2,percent2)

datac <- data1m <- match(data1$state1,data2$state2,0)
datac$count2 <- ifelse(m==0,0,data2$count2[m])
datac$percent2 <- ifelse(m==0,0,data2$percent2[m])

If you didn't want to keep all the rows in both data sets (but just theshared rows) you could use

merge(data1,data2,by=1)

[top]

1.6 Add one row at a time   (14/08/2000 Paul Johnson <pauljohn@ku.edu>)

Question: I would like to create an (empty) data frame with "headings" for every column (column titles) and then put data row-by-row into this data frame (one row for every computation I will be doing), i.e.

 no. time temp pressure <---the headings
 1   0     100   80     <---first result
 2   10    110   87     <---2nd result .....
 

Answer: Depends if the cols are all numeric: if they are a matrix would be better.

If you know the number of results in advance, say, N, do this

df <- data.frame(time=numeric(N), temp=numeric(N), pressure=numeric(N))
df[1, ] <- c(0, 100, 80)
df[2, ] <- c(10, 110, 87)
...
or
m <- matrix(nrow=N, ncol=3)
colnames(m) <- c("time", "temp", "pressure")
m[1, ]  <- c(0, 100, 80)
m[2, ] <- c(10, 110, 87)
The matrix form is better size it only needs to access one vector (a matrix is a vector with attributes), not three.

If you don't know the final size you can use rbind to add a row at a time, but that is substantially less efficient as lots of re-allocation is needed. It's better to guess the size, fill in and then rbind on a lot more rows if the guess was too small.(from Brian Ripley)

[top]

1.7 Need yet another different kind of merge for data frames   (11/08/2000 Paul Johnson <pauljohn@ku.edu>)

(from Stephen Arthur)

Example Data:

File1 C A T File2 1 2 34 56 2 3 45 67 3 4 56 78

To Yield:

 
> C A T 1 2 34 56
> C A T 2 3 45 67
> C A T 3 4 56 78

This works:

 
repcbind <- function(x,y) {
  nx <- nrow(x)
  ny <- nrow(y)
  if (nx<ny)
    x <- apply(x,2,rep,length=ny)
  else if (ny<nx)
    y <- apply(y,2,rep,length=nx)
  cbind(x,y)
}
(from Ben Bolker)

[top]

1.8 Check if an object is NULL   (06/04/2001 Paul Johnson <pauljohn@ku.edu>)

If you load things, R does not warn you when they are not found, it records them as NULL. You have the responsibility of checking them. Use

is.null(list$component)

to check a thing named component in a thing named list.

Accessing non-existent dataframe columns with "[" does give an error, so you could do that instead.

>data(trees)
>trees$aardvark
NULL
>trees[,"aardvark"]
Error in [.data.frame(trees, , "aardvark") : subscript out of bounds (from Thomas Lumley)

[top]

1.9 Generate random numbers   (22/08/2000 Paul Johnson <pauljohn@ku.edu>)

runif() --uniform random on [0,1) rnorm() --normal random

multivariate normal: Given covar matrix:

0.25,    0.20
0.20,    0.25

Brian Ripley said: "mvrnorm in package MASS (part of the VR bundle).

mvrnorm(2, c(0,0), matrix(c(0.25, 0.20, 0.20, 0.25), 2,2))

and Peter Dalgaard observed "a less general solution for this particular case would be

rnorm(1,sd=sqrt(0.20)) + rnorm(2,sd=sqrt(0.05))

Wait. You want randomly drawn integers? Here:

# If you mean sampling without replacement:
>sample(1:10,3,replace=FALSE)
#If you mean with replacement:
>sample(1:10,3,replace=TRUE)
(from Bill Simpson)

[top]

1.10 Generate random numbers with a fixed mean/variance   (06/09/2000 Paul Johnson <pauljohn@ku.edu>)

If you generate random numbers with a given tool, you don't get a sample with the exact mean you specify. A generator with a mean of 0 will create samples with varying means, right?

So, how to force the sample to have a mean of 0? Take a 2 step approach:

R> x <- rnorm(100, mean = 5, sd = 2)
R> x <- (x - mean(x))/sqrt(var(x))
R> mean(x)
[1] 1.385177e-16
R> var(x)
[1] 1
and now create your sample with mean 5 and sd 2:
R> x <- x*2 + 5
R> mean(x)
[1] 5
R> var(x)
[1] 4
(from Torsten.Hothorn)

[top]

1.11 Weighted replicates   (30/01/2001 Paul Johnson <pauljohn@ku.edu>)

x<-c(10,40,50,100)  # income vector for instance
w<-c(2,1,3,2)       # the weight for each observation in x with the same
rep(x,w)
[1]  10  10  40  50  50  50 100 100

(from P. Malewski)

[top]

1.12 Expand dataset to represent weighted data   (06/04/2001 Paul Johnson <pauljohn@ku.edu>)

Also, most of the functions that have weights have frequency weights rather than probability weights: that is, setting a weight equal to 2 has exactly the same effect as replicating the observation.

If you have frequency weights you may have to expand your dataset.

  expanded.data<-as.data.frame(lapply(compressed.data,
                   function(x) rep(x,compressed.data$weights)))
should do it.

(from Thomas Lumley)

[top]

1.13 Convert contingency table to data frame   (06/09/2000 Paul Johnson <pauljohn@ku.edu>)

Given a 8 dimensional crosstab, you want a data frame with 8 factors and 1 column for frequencies of the cells in the table.

R1.2 introduces a function as.data.frame.table() to handle this.

This can also be done manually. Here's a function(it's a simple wrapper around expand.grid):

   dfify <- function(arr, value.name = "value", dn.names =
names(dimnames(arr))) {

Version <- "$Id: dfify.sfun,v 1.1 1995/10/09 16:06:12 d3a061 Exp $" dn <- dimnames(arr <- as.array(arr)) if(is.null(dn)) stop("Can't data-frame-ify an array without dimnames") names(dn) <- dn.names ans <- cbind(expand.grid(dn), as.vector(arr)) names(ans)[ncol(ans)] <- value.name ans }

The name is short for "data-frame-i-fy".

For your example, assuming your multi-way array has proper dimnames, you'd just do:

my.data.frame <- dfify(my.array, value.name="frequency")

(from Todd Taylor)

[top]

1.14 Write: data in ascii form   (31/12/2001 Paul Johnson <pauljohn@ku.edu>)

Say I have a command that produced a 28 x 28 data matrix. How can I output the matrix into a txt file (rather than copy/paste the matrix)?

write.table(mat,file="filename")

Pesl Thomas said, "I want to output a file to my disc with the contents of a data frame I called "tab"

> tab

or 1 2 3 4 5 6 1 57 80 25 23 46 23 2 106 35 59 8 51 4

You can use

write.table(unclass(tab))

You could also do

  write(t(tab),ncol=NCOL(tab))
since write() transposes rows and columns. (from Thomas Lumley)

Note MASS library has a function write.matrix which is faster if you need to write a numerical matix, not a data frame. Good for big jobs.

[top]

2 Working with data frames: Recoding, selecting, aggregating


2.1 Add variables to a data frame (or list)   (02/06/2003 Paul Johnson <pauljohn@ku.edu>)

In r-help, I asked " I keep finding myself in a situation where I want to calculate a variable name and then use it on the left hand side of an assignment. For example

> iteration <- 1
> varName <- paste("run",iteration,sep="")
> myList$parse(text=varName) <- aColumn

I got many useful answers that opened my eyes!

Brian Ripley said:

For a data frame you could use
mydf[paste("run",iteration,sep="")] <- aColumn
and for a list or a data frame
Robject[[paste("run",iteration,sep="")]] <- aColumn

And Thomas Lumley added: " If you wanted to do something of this sort for which the above didn't workyou could also learn about substitute():

  eval(substitute(myList$newColumn<-aColumn),
     list(newColumn=as.name(varName)))
as an alternative to parse().

-(Thomas Lumley)

[top]

2.2 Create variable names "on the fly"   (10/04/2001 Paul Johnson <pauljohn@ku.edu>)

Paste together a variable name, set it to a value. Use assign. As in

> assign(paste("file", 1, "max", sep=""), 1)
> ls()
[1] "file1max"
(Brian Ripley, June 18, 2001)

[top]

2.3 Recode one column, output values into another column   (12/05/2003 Paul Johnson <pauljohn@ku.edu>)

Please read the documentation for transform() and replace() and also learn how to use the magic of R vectors.

The transform() function works only for data frames. Suppose a data frame is called "mdf" and you want to add a new variable "newV" that is a function of var1 and var2:

mdf <- transform(mdf, newV=log(var1) + var2))

I'm inclined to take the easy road when I can. Proper use of indexes in R will help a lot, especially for recoding discrete valued variables. Some cases are particularly simple because of the way arrays are processed.

Suppose you create a variable, and then want to reset some values to missing. Go like this:

 x <- rnorm(10000)
x[x > 1.5] <- NA
And if you don't want to replace the original variable, create a new one first ( xnew <- x ) and then do that same thing to xnew.

You can put other variables inside the brackets, so if you want x to equal 234 if y equals 1, then

x[y == 1] <- 234

Suppose you have v1, and you want to add another variable v2 so that there is a translation. If v1=1, you want v2=4. If v1=2, you want v2=4. If v1=3, you want v2=5. This reminds me of the old days using SPSS and SAS. I think it is clearest to do:

> v1 <- c(1,2,3)
# now initialize v2
> v2 <- rep( -9, length(v1))
# now recode v2
> v2[v1==1] <- 4
> v2[v1==2]<-4
> v2[v1==3]<-5
> v2[1] 4 4 5

Note that R's "ifelse" command can work too:

x<-ifelse(x>1.5,NA,x)

One user recently asked how to take data like a vector of names and convert it to numbers, and 2 good solutions appeared:

y  <- c( "OLDa", "ALL",  "OLDc", "OLDa", "OLDb", "NEW", "OLDb", "OLDa", "ALL","...")
> el <- c("OLDa", "OLDb", "OLDc", "NEW", "ALL")
> match(y,el)[1] 1 5 3 1 2 4 2 1 5 NA
or

> f <- factor(x,levels=c("OLDa", "OLDb", "OLDc", "NEW", "ALL") )
> as.integer(f)
> [1] 1 5 3 1 2 4 2 1 5

I asked Mark Myatt (author of the ReX guide mentioned below) for more examples:

For example, suppose I get a bunch of variables coded on a scale

   1 = no
6 = yes
8 = tied
9 = missing
10 = not applicable.
Recode that into a new variable name with 0=no, 1=yes, and all else NA.

It seems like the replace() function would do it for single values but you end up with empty levels in factors but that can be fixed by re-factoring the variable. Here is a basic recode() function:

recode <- function(var, old, new){
x <- replace(var, var == old, new)
if(is.factor(x)) factor(x) else x}
For the above example:
 test <- c(1,1,2,1,1,8,1,2,1,10,1,8,2,1,9,1,2,9,10,1)
testtest <- recode(test, 1, 0)
test <- recode(test, 2, 1)
test <- recode(test, 8, NA)
test <- recode(test, 9, NA)
test <- recode(test, 10, NA)
test
Although it is probably easier to use replace():
test <- c(1,1,2,1,1,8,1,2,1,10,1,8,2,1,9,1,2,9,10,1)
testtest <- replace(test, test == 8 | test == 9 | test == 10, NA)
test <- replace(test, test == 1, 0)
test <- replace(test, test == 2, 1)
test

I suppose a better function would take from and to lists as arguments:


recode <- function(var, from, to){
x <- as.vector(var)
for (i in 1:length(from)){
x <- replace(x, x == from[i], to[i])}
if(is.factor(var)) factor(x) else x}
For the example:
test <- c(1,1,2,1,1,8,1,2,1,10,1,8,2,1,9,1,2,9,10,1)
testtest <- recode(test, c(1,2,8:10), c(0,1))
test
and it still works with single values.

Suppose somebody gives me ascale from 1 to 100, and I want to collapse it into 10 groups, how do I go about it?

Mark says: Use cut() for this. This cuts into 10 groups:


test <- trunc(runif(1000,1,100))
groups <-cut(test,seq(0,100,10))
table(test, groups)
To get ten groups without knowing the minimum and maximum value you canuse pretty():

groups <- cut(test,pretty(test,10))
table(test, groups)
You can specify the cut-points:

groups <- cut(test,c(0,20,40,60,80,100))
table(test, groups)
And they don't need to be even groups:

groups <- cut(test,c(0,30,50,75,100))
table(test, groups)
Mark added, "I think I will add this sort of thing to the REX pack."

2003-12-01, someone asked how to convert a vector of numbers to characters, such as

>if x[i] < 250 then col[i] = "red"
>else if x[i] < 500 then col[i] = "blue"
and so forth. Many interesing answers appeared in R-help. A big long nested ifelse would work, as in:
x.col <- ifelse(x < 250, "red",
   ifelse(x<500, "blue", ifelse(x<750, "green", "black"))) 

There were some nice suggestions to use cut, such as Gabor Grothendeick's advice:
The following results in a character vector:

>colours <- c("red", "blue", "green","back")
>colours[cut(x, c(-Inf,250,500,700,Inf),right=F,lab=F)]
While this generates a factor variable:
colours <- c("red", "blue", "green","black")
cut(x, c(-Inf,250,500,700,Inf),right=F,lab=colours)

[top]

2.4 Create indicator (dummy) variables   (20/06/2001 Paul Johnson <pauljohn@ku.edu>)

2 examples:

c is a column, you want dummy variable, one for each valid value. First, make it a factor, then use model.matrix():

> x <- c(2,2,5,3,6,5,NA)
> xf <- factor(x,levels=2:6)
> model.matrix(~xf-1)
  xf2 xf3 xf4 xf5 xf6
1   1   0   0   0   0
2   1   0   0   0   0
3   0   0   0   1   0
4   0   1   0   0   0
5   0   0   0   0   1
6   0   0   0   1   0
attr(,"assign")
[1] 1 1 1 1 1

(from Peter Dalgaard )

Question: I have a variable with 5 categories and I want to create dummy variables for each category.

Answer: Use row indexing or model.matrix.

ff <- factor(sample(letters[1:5], 25, replace=TRUE))

diag(nlevels(ff))[ff,]

or

model.matrix(~ ff - 1)

(from Brian D. Ripley)

[top]

2.5 Create lagged values of variables for time series regression   (02/06/2003 Paul Johnson <pauljohn@ku.edu>)

Peter Dalgaard explained, "the simple way is to create a new variable which shifts the response, i.e.

yshft <- c(y[-1], NA) # pad with missing
summary(lm(yshft ~ x + y))
Alternatively, lag the regressors:
N <- length(x)
xlag <- c(NA, x[1:(N-1)])
ylag <- c(NA, y[1:(N-1)])
summary(lm(y ~ xlag + ylag))

[top]

2.6 How to drop factor levels for datasets that don't have observations with those values?   (08/01/2002 Paul Johnson <pauljohn@ku.edu>)

The best way to drop levels, BTW, is

problem.factor <- problem.factor[,drop=TRUE] (from Brian D. Ripley)

[top]

2.7 Select/subset observations out of a dataframe   (08/30/2003 Paul Johnson <pauljohn@ku.edu>)

Want to take observations for which variable Y is greater than A and less than B: do

X[Y>=A & Y<=B]

Suppose you want observations with c=1 in df1. This makes a new data frame you want.

df2 <- df1[df1$c == 1,]

and note that indexing is pretty central to using S (the language), so it is worth learning all the ways to use it. (from Brian Ripley)

Or use "match" select values from the column "d" by taking the ones that match the values of another column, as in

> d <- t(array(1:20,dim=c(2,10)))
> i <- c(13,5,19)
> d[match(i,d[,1]), 2][1] 14 6 20
(from Peter Wolf)

Till Baumgaertel wanted to select observations for men over age 40, and sex was coded either m or M. Here are two working commands:

1.) maleOver40 <- subset(d, sex %in% c("m","M") & age > 40)
2.) maleOver40 <- d[(d$sex == "m" | d$sex == "M") & d$age >40,]
To decipher that, do
?"%in" to find out about match() and the %in% operator. If you want to grab the rows for which the variable "subject" is 15 or 19, try:
nameofdataframe$subject %in% c(19,15)
to get a True/False indication for each row in the data frame, and you can then use that output to pick the rows you want:
indicator <- nameofdataframe$subject %in% c(19,15)
nameofdataframe[indicator,]

How to deal with values that are already marked as missing? John Fox says:

x2 <- na.omit(x)
produces a copy of the data frame x with all rows that contain missing data removed. The function na.exclude could be used also. For more information on missings, check help : ?na.exclude.

For exclusion of missing, Peter Dalgaard likes

subset(x,complete.cases(x)) or x[complete.cases(x),]
adding "is.na(x) is preferable to x != "NA"

[top]

2.8 Delete first observation for each   (11/08/2000 Paul Johnson <pauljohn@ku.edu>)

Given data like:

 1  ABK     19910711     11.1867461      0.0000000       108
 2  ABK     19910712     11.5298979     11.1867461      111
 6 CSCO   19910102      0.1553819      0.0000000         106
 7 CSCO   19910103      0.1527778      0.1458333         166
remove the first observation for each value of the "sym" variable (the one coded ABK,CSCO, etc). . If you just need to remove rows 1,5, and 13, do:
newhilodata <- hilodata[-c(1,6,13),]
To solve the more general problem of omitting the first in each group, assuming "sym" is a factor, try something like
newhilodata <- subset(hilodata, diff(c(0,as.integer(sym))) != 0)
(actually, the as.integer is unnecessary because the c() will unclass the factor automagically)

(from Peter Dalgaard)

Alternatively, you could use the match function because it returns the first match. Suppose jm is the data set. Then:

> match(unique(jm$sym), jm$sym)
 [1]  1  6 13
 > jm <- jm[ -match(unique(jm$sym), jm$sym), ]
(from Douglas Bates)

As Robert pointed out to me privately: duplicated() does the trick

  subset(hilodata, duplicated(sym))
has got to be the simplest variant.

(from Peter Dalgaard)

[top]

2.9 Select a random sample of data   (11/08/2000 Paul Johnson <pauljohn@ku.edu>)

sample(N, n, replace=F)

and

seq(N)[rank(runif(N)) <= n]

is another general solution.

(from Brian D. Ripley)

[top]

2.10 Selecting Variables for Models: Don't forget the subset function   (15/08/2000 Paul Johnson <pauljohn@ku.edu>)

You can manage data directly by deleting lines or so forth, but subset() can be used to achieve the same effect without editing the data at all. Do ?select to find out more. Subset is also an option in many statistical functions like lm.

Peter Dalgaard gave this example, using the "builtin" dataset airquality.

data(airquality)
names(airquality)
lm(Ozone~.,data=subset(airquality,select=Ozone:Month))
lm(Ozone~.,data=subset(airquality,select=c(Ozone:Wind,Month)))
lm(Ozone~.-Temp,data=subset(airquality,select=Ozone:Month))

The "." on the RHS of lm means "all variables" and the subset command on the rhs picks out different variables from the dataset. "x1:x2" means variables between x1 and x2, inclusive.

[top]

2.11 Process all numeric variables, ignore character variables?   (11/08/2000 Paul Johnson <pauljohn@ku.edu>)

data<-read.table("lastline.txt",header=T,as.is = TRUE) indices<-1:dim(data)[2] indices<-na.omit(ifelse(indices*sapply(data,is.numeric),indices,NA)) mean<-sapply(data[,indices],mean) sd<-sapply(data[,indices],sd)

[top]

2.12 Sorting by more than one variable   (06/09/2000 Paul Johnson <pauljohn@ku.edu>)

Can someone tell me how can I sort a list that contains duplicates (name) but keeping the duplicates together when sorting the values.

name  M
1234  8
1234  8.3
4321  9
4321  8.1

I also would like to set a cut-off, so that anything below a certain values will not be sorted.(from Kong, Chuang Fong)

I take it that the cutoff is on the value of M. OK, suppose it is the value of `coff'.

sort.ind <- order(name, pmax(coff, M))  #  sorting index
name <- name[sort.ind]
M <- M[sort.ind]
Notice how using pmax() for "parallel maximum" you can implement the cutoff by raising all values below the mark up to the mark thus putting them all into the same bin as far as sorting is concerned.

If your two variables are in a data frame you can combine the last two steps into one, of course.

sort.ind <- order(dat$name, pmax(coff, dat$M))
dat <- dat[sort.ind, ]
In fact it's not long before you are doing it all in one step:

dat <- dat[order(dat$name, pmax(coff, dat$M)), ]

(from Bill Venables)

I want the ability to sort a data frame lexicographically according to several variables

Here's how:

spsheet[order(name,age,zip),] (from Peter Dalgaard)

[top]

2.13 Rank within subgroups defined by a factor   (06/09/2000 Paul Johnson <pauljohn@ku.edu>)

Read the help for by()

> by(x[2], x$group, rank)
x$group: A
[1] 4.0 1.5 1.5 3.0
------------------------------------------------------------ 
x$group: B
[1] 3 2 1

> c(by(x[2], x$group, rank), recursive=T) A1 A2 A3 A4 B1 B2 B3 4.0 1.5 1.5 3.0 3.0 2.0 1.0

(from Brian D. Ripley)

[top]

2.14 Work with missing values (na.omit, is.na,...   (01/05/2005 Paul Johnson <pauljohn@ku.edu>)

To make sure missings are omitted from a dataset called "insure":

 
thisdf <- na.omit(insure[,c(1, 19:39)]) 
body.m <- lm(BI.PPrem ~ ., data=thisdf, na.action=na.fail) 

is.na() returns a T for missings, and that can be used to spot and change them.

To convert missing values to a numerical value, Matt Wiener (2005-01-05) says " The commands below construct a matrix, insert some NA's, and then convert them all to 0.

> temp1 <- matrix(runif(25), 5, 5)
> temp1[temp1 < 0.1] <- NA
> temp1[is.na(temp1)] <- 0

[top]

2.15 Aggregate values, one for each line   (16/08/2000 Paul Johnson <pauljohn@ku.edu>)

Question: I want to read values from a text file - 200 lines, 32 floats per line - and calculate a mean for each of the 32 values, so I would end up with an "x" vector of 1-200 and a "y" vector of the 200 means.

Peter Dalgaard says do thi:

y <- apply(as.matrix(read.table("myfile")), 1, mean) x <- seq(along=y)

(possibly adding a couple of options to read.table, depending on the file format)

[top]

2.16 Create new data frame to hold aggregate values for each factor   (11/08/2000 Paul Johnson <pauljohn@ku.edu>)

How about this: Z<-aggregate(X, f, sum)

(assumes all X variables can be summed)

Or: [If] X contains also factors. I have to select variables for which summary statistics have to be computed. So I used:

Z <- data.frame(f=levels(f),x1=as.vector(tapply(x1,f,sum)))

(from Wolfgang Koller)

[top]

2.17 Selectively aggregate a data frame   (11/08/2000 Paul Johnson <pauljohn@ku.edu>)

Given

10 20  23 44 33
10 20  33 23 67
and you want
10 20   56 67 100
try this:
dat<-read.table("data.dat",header=TRUE)
aggregate(dat[,-(1:2)], by=list(std=dat$std, cf=dat$cf), sum)
note the first two columns are excluded by [,(1:2)] and the by optoin preserves those values in the output.

[top]

2.18 Rip digits out of real numbers one at a time   (11/08/2000 Paul Johnson <pauljohn@ku.edu>)

I want to "take out" the first decimal place of each output, plot them based on their appearence frequencies. Then take the second decimal place, do the same thing.

a<- log(1:1000)
d1<-floor(10*(a-floor(a)))  # first decimal
  par(mfrow=c(2,2))
  hist(d1,breaks=c(-1:9))
  table(d1)
d2<-floor(10*(10*a-floor(10*a))) # second decimal
  hist(d2,breaks=c(-1:9))
  table(d2)
(from Yudi Pawitan)

x <- 1:1000

ndig <- 6

(ii <- as.integer(10^(ndig-1) * log(x)))[1:7] (ci <- formatC(ii, flag="0", wid= ndig))[1:7] cm <- t(sapply(ci, function(cc) strsplit(cc,NULL)[[1]])) cm [1:7,]

apply(cm, 2, table) #--> Nice tables

# The plots : par(mfrow= c(3,2), lab = c(10,10,7)) for(i in 1:ndig) hist(as.integer(cm[,i]), breaks = -.5 + 0:10, main = paste("Distribution of ", i,"-th digit"))

(from Martin Maechler)

[top]

2.19 Grab an item from each of several matrices in a List   (14/08/2000 Paul Johnson <pauljohn@ku.edu>)

Let Z denote the list of matrices. All matrices have the same order. Suppose you need to take element [1,2] from each.

lapply(Z, function(x) x[1,2])

should do this, giving a list. Use sapply if you want a vector. (Brian Ripley)

[top]

2.20 Get vector showing values in a dataset   (10/04/2001 Paul Johnson <pauljohn@ku.edu>)

xlevels<-sort(unique(x))

Which you can use in a contour plot, as in

contour(xlevels,ylevels,zvals)

[top]

2.21 Calculate the value of a string representing an R command   (13/08/2000 Paul Johnson <pauljohn@ku.edu>)

Use the eval(parse()) pairing:

        String2Eval <- "A valid R statement"
        eval(parse(text = String2Eval)) 
(from Mark Myatt)

Also check out substitute(), as.name() et al. for other methods of manipulating expressions and function calls (from Peter Dalgaard)

Or

eval(parse(text="ls()"))

Or

eval(parse(text = "x[3] <- 5"))

[top]

2.22 "Which" can grab the index values of cases satisfying a test   (06/04/2001 Paul Johnson <pauljohn@ku.edu>)

To analyse large vectors of data using boxplot to find outliers, try:

which(x==boxplot(x,range=1)$out)

[top]

2.23 Find unique lines in a matrix/data frame   (31/12/2001 Paul Johnson <pauljohn@ku.edu>)

Jason Liao writes:

"I have 10000 integer triplets stored in A[1:10000, 1:3]. I would like to find the unique triplets among the 10000 ones with possible duplications."

Peter Dalgaard answers, "As of 1.4.0 (!):

unique(as.data.frame(A))"

[top]

2.24 Select only numerical variables   (12/11/2002 Paul Johnson <pauljohn@ku.edu>)

sapply(dataframe, is.numeric)
or
which(sapply(data.frame, is.numeric))
Thomas Lumley 11/11/2002

[top]

3 Matrices and vector operations


3.1 Create a vector, append values   (01/05/2005 Paul Johnson <pauljohn@ku.edu>)

Some things in R are so startlingly simple and different from other programs that the experts have a difficult time understanding our misunderstanding.

Mathematically, I believe a vector is a column in which all elements are of the same type (e.g., real numbers). (1,2,3) is a vector of integers as well as a column in the data frame sense. If you put dissimilar-typed items into an R column, it really creates a "list" object. (In R, a list is a collection of dissimilar types).

To create a mathematical vector, you can just initialize a column of numbers with c(). As long as all of your input is of the same type, you don't hit any snags. Note the is.vector() function will say this is a vector:

 
> v <- c(1, 2, 3) 
> is.vector(v) 

Caution: If your input is not all of the same type, then R converts it to characters, which you may not intend.

 
> gg <- c("1",3,4) 
> is.vector(gg) 
[1] TRUE 
> gg 
[1] "1" "3" "4" 
> is.character(gg) 
[1] TRUE 

The worst case scenario is that it manufactures a list. The double brackets [[]] are the big signal that you really have a list:

 
> bb <- c(1,c,4) 
> is.vector(bb) 
[1] TRUE 
> bb   
[[1]] 
[1] 1 

[[2]] .Primitive("c")

[[3]] [1] 4

> is.character(bb) [1] FALSE > is.integer(bb) [1] FALSE > is.list(bb) [1] TRUE

To make sure you get what you want, you can use the vector() function, and then tell it what type of elements you want in your vector. Please note it puts a "false" value in each position, not a missing.

 
> vector(mode="integer",10) 
 [1] 0 0 0 0 0 0 0 0 0 0 
> vector(mode="double",10) 
 [1] 0 0 0 0 0 0 0 0 0 0 
> vector(mode="logical",10) 
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE 
There are also methods to directly create the particular kind of vector you want (factor, character, integer, etc). For example, if you want a real-valued vector, there is a shortcut method numeric()
> x <- numeric(10)
> x
 [1] 0 0 0 0 0 0 0 0 0 0

If you want to add values onto the end of a vector, you have options!

If you want to add a single value onto the vector, try:

 
> r <- rnorm(1) 
> v <- append(v, r) 

The length() command can not only tell you how long a vector is, but it can be used to RESIZE it. This command says if the length of v has reached the value vlen, then expand the length by doubling it.

 
if (vlen == length(v)) length(v) <- 2*length(v) 

These were mentioned in an email from Gabor Grothendieck to the list on 2005-01-04.

[top]

3.2 How to create an identity matrix?   (16/08/2000 Paul Johnson <pauljohn@ku.edu>)

diag(n)

Or, go the long way:

n<-c(5)
 I <- matrix(0,nrow=n,ncol=n)
I[row(I)==col(I)] <- 1
E.D.Isaia

[top]

3.3 Convert matrix "m" to one long vector   (11/08/2000 Paul Johnson <pauljohn@ku.edu>)

dim(m)<-NULL, or c(m) (from Peter Dalgaard )

[top]

3.4 Creating a peculiar sequence (1 2 3 4 1 2 3 1 2 1)   (11/08/2000 Paul Johnson <pauljohn@ku.edu>)

I don't know why this is useful, but it shows some row and matrix things:

I ended up using Brian Ripley's method, as I got it first and it worked, ie.

A <- matrix(1, n-1, n-1) 
rA <- row(A) 
rA[rA + col(A) <= n] 
However, thanks to Andy Royle I have since discovered that there is a much more simple and sublime solution:
sequence(n-1:1) 
[1] 1 2 3 4 1 2 3 1 2 1 
Karen Kotschy

[top]

3.5 Select every n'th item   (14/08/2000 Paul E Johnson <pauljohn@ku.edu>)

extract every nth element from a very long vector, vec?

vec[seq(n, length(vec), n)](from Brian Ripley)

seq(1,11628,length=1000) will give 1000 evenly spaced numbers from 1:11628 that you can then index with. (from Thomas Lumley)

My example:

vec <- rnorm(1999)
newvec <- vec[1, length(vec), 200]
> newvec
 [1]  0.2685562  1.8336569  0.1371113  0.2204333 -1.2798172  0.3337282
 [7] -0.2366385  0.5060078  0.9680530  1.2725744
This shows the items in vec at indexes (1, 201, 401, ..., 1801)

[top]

3.6 Find index of a value nearest to 1.5 in a vector   (11/08/2000 Paul Johnson <pauljohn@ku.edu>)

n <- 1000
x <- sort(rnorm(n))
x0 <- 1.5
dx <- abs(x-x0)
which(dx==min(dx))
(from Jan Schelling)

 which(abs(x - 1.5) == min(abs(x - 1.5)))
(from Uwe Ligges)

[top]

3.7 Find index of nonnegative items in vector   (18/06/2001 Paul Johnson <pauljohn@ku.edu>)

which(x!=0) (from Uwe Ligges)

rfind <- function(x) seq(along=x)[x != 0] (from Brian D. Ripley)

Concerning speed and memory efficiency I find

as.logical(x)

is better than

x!=0

and

seq(along=x)[as.logical(x)]

is better than

which(as.logical(x))

thus

which(x!=0)

is shortest and

rfind <- function(x)seq(along=x)[as.logical(x)]

seems to be computationally most efficient

(from Jens Oehlschlägel-Akiyoshi)

[top]

3.8 Find index of missing values   (15/08/2000 Paul Johnson <pauljohn@ku.edu>)

Suppose the vector "Pes" has 600 observations. Don't do this:

(1:600)[is.na(Pes)]

The `approved' method is

seq(along=Pes)[is.na(Pes)]

In this case it does not matter as the subscript is of length 0, but it has floored enough library/package writers to be worth thinking about.

(from Brian Ripley)

However, the solution I gave

which(is.na(Pes))

is the one I still really recommend; it does deal with 0-length objects, and it keeps names when there are some, and it has an `arr.ind = FALSE' argument to return array indices instead of vector indices when so desired. (from Martin Maechler)

[top]

3.9 Find index of largest item in vector   (16/08/2000 Paul Johnson <pauljohn@ku.edu>)

T[which(A==max(A,na.rm=TRUE))]

(pj: note na.rm deprecated? I think it is now na.omit)

[top]

3.10 Replace values in a matrix   (22/11/2000 Paul Johnson <pauljohn@ku.edu>)

> tmat <- matrix(rep(0,3*3),ncol=3) > tmat [,1] [,2] [,3] [1,] 0 0 0 [2,] 0 0 0 [3,] 0 0 0 > tmat[tmat==0]<-1 > tmat [,1] [,2] [,3] [1,] 1 1 1 [2,] 1 1 1 [3,] 1 1 1 (from Jan Goebel)

If la is a data frame, you have to coerce the data into matrix form, with:

 la <- as.matrix(la)
 la[la==0] <- 1

Try this:

la <- ifelse(la == 0, 1, la)
(from Martyn Plummer)

[top]

3.11 Delete particular rows from matrix   (06/04/2001 Paul Johnson <pauljohn@ku.edu>)

> x <- matrix(1:10,,2)
> x[x[,1]%in%c(2,3),]
> x[!x[,1]%in%c(2,3),]
(from Peter Malewski)

mat[!(mat$first %in% 713:715),]
(from Peter Dalgaard )

[top]

3.12 Count number of items meeting a criterion   (01/05/2005 Paul Johnson <pauljohn@ku.edu>)

Apply "length()" to results of which() described in previous question, as in

  
length(which(v<7))  
or  
sum( v<7,na.rm=TRUE)  

If you apply sum() to a matrix, it will scan over all columns. To focus on a particular column, use subscripts like sum(v[,1]>0) or such. If you want separate counts for columns, there's a method colSums() that will count separately for each column.

[top]

3.13 Compute partial correlation coefficients from correlation matrix   (08/12/2000 Paul Johnson <pauljohn@ku.edu>)

I need to compute partial correlation coefficients between multiple variables (correlation between two paired samples with the "effects of all other variables partialled out")? (from Kaspar Pflugshaupt)

Actually, this is quite straightforward. Suppose that R is the correlation matrix among the variables. Then,

        Rinv<-solve(R)
        D<-diag(1/sqrt(diag(Rinv)))
        P<- -D %*% Rinv %*% D
The off-diagonal elements of P are then the partial correlations between each pair of variables "partialed" for the others. (Why one would want to do this is another question.)

(from John Fox )

In general you invert the variance-covariance matrix and then rescale it so the diagonal is one. The off-diagonal elements are the negative partial correlation coefficients given all other variables.

pcor2 <- function(x){
        conc <- solve(var(x))
        resid.sd <- 1/sqrt(diag(conc))
        pcc <- - sweep(sweep(conc, 1, resid.sd, "*"), 2, resid.sd, "*")
        return(pcc)
}

pcor2(cbind(x1,x2,x3))

see J. Whittaker's book "Graphical models in applied multivariate statistics" from Martyn Plummer)

This is the version I'm using now, together with a test for significance of each coefficient (H0: coeff=0):

f.parcor <-
function (x, test = F, p = 0.05)
{
    nvar <- ncol(x)
    ndata <- nrow(x)
    conc <- solve(cor(x))
    resid.sd <- 1/sqrt(diag(conc))
    pcc <- -sweep(sweep(conc, 1, resid.sd, "*"), 2, resid.sd,
        "*")
    colnames(pcc) <- rownames(pcc) <- colnames(x)
    if (test) {
        t.df <- ndata - nvar
        t <- pcc/sqrt((1 - pcc^2)/t.df)
        pcc <- list(coefs = pcc, significant = t > qt(1 - (p/2),
            df = t.df))
    }
    return(pcc)
}
(from Kaspar Pflugshaupt)

[top]

3.14 Compute Canonical variates   (03/22/2008 Paul Johnson <pauljohn@ku.edu>)

Suppose you have data matrices A and B on the same observations. Then

 
cancor <- function(A, B)  
{ 
   Ap <- prcomp(scale(A, T, F), retx=T) 
   Apc <- Ap$x %*% diag(1/Ap$sdev) 
   Bp <- prcomp(scale(B, T, F), retx=T) 
   Bpc <- Bp$x %*% diag(1/Bp$sdev) 
   Sigma <- crossprod(Apc, Bpc)/(nrow(A) - 1) 
   s <- svd(Sigma, ncol(A), ncol(B)) 
   return(list(cor=s$d, canvar.x=Apc %*% s$u, canvar.y=Bpc %*% s$v)) 
} 
should do the trick. The canonical variates are zero-mean, unit-variance (unlike S-PLUS). (from Brian D. Ripley)

[top]

3.15 Create a multidimensional matrix (R array)   (20/06/2001 ? <?>)

Brian Ripley said:

my.array<-array(0,dim=c(10,5,6,8))

will give you a 4-dimensional 10 x 5 x 6 x 8 array.

Or

array.test <- array(1:64,c(4,4,4))
array.test[1,1,1]
1
array.test[4,4,4]
64

[top]

3.16 Combine a lot of matrices   (20/06/2001 ? <?>)

Given a list with 100s of (nx2) matrices, how do you combine them into a giant (Nx2) matrix?

list is the list of matrices.

nr <- sapply(list, nrow)
cs <- cumsum(nr)
st <- c(0, cs[-length(cs)]) + 1
res <- matrix(, sum(nr), 2)
for(i in seq(along=nr)) res[st[i]:cs[i],] <- list[[i]]
(from Brian Ripley)

[top]

3.17 Create "neighbor" matrices according to specific logics   (20/06/2001 ? <?>)

Want a matrix of 0s and 1s indicating whether a cell has a neighbor at a location:

N <- 3
x <- matrix(1:(N^2),nrow=N,ncol=N)
rowdiff <- function(y,z,mat)abs(row(mat)[y]-row(mat)[z])
coldiff <- function(y,z,mat)abs(col(mat)[y]-col(mat)[z])
rook.case <- function(y,z,mat){coldiff(y,z,mat)+rowdiff(y,z,mat)==1}
bishop.case <- function(y,z,mat){coldiff(y,z,mat)==1 & rowdiff(y,z,mat)==1}
queen.case <- function(y,z,mat){rook.case(y,z,mat) | bishop.case(y,z,mat)}
matrix(as.numeric(sapply(x,function(y)sapply(x,rook.case,y,mat=x))),ncol=N^2,nrow=N^2)
matrix(as.numeric(sapply(x,function(y)sapply(x,bishop.case,y,mat=x))),ncol=N^2,nrow=N^2)
matrix(as.numeric(sapply(x,function(y)sapply(x,queen.case,y,mat=x))),ncol=N^2,nrow=N^2)
(from Ben Bolker)

[top]

3.18 "Matching" two columns of numbers by a "key"   (20/06/2001 ? <?>)

The question was:

I have a matrix of predictions from an proportional odds model (using the polr function in MASS), so the columns are the probabilities of the responses, and the rows are the data points. I have another column with the observed responses, and I want to extract the probabilities for the observed responses.

As a toy example, if I have

> x <- matrix(c(1,2,3,4,5,6),2,3)
> y <- c(1,3)
and I want to extract the numbers in x[1,1] and x[2,3] (the columns being indexed from y), what do I do?

Is

x[cbind(seq(along=y), y)]

what you had in mind? The key is definitely matrix indexing. (from Brian Ripley)

[top]

3.19 Create Upper or Lower Triangular matrix   (20/06/2001 ? <?>)

There are many ways to do this. Whenever it comes up in r-help, there is a different answer.

Suppose you want a matrix like

a^0  0    0
a^1  a^0  0
a^2  a^1  a^0

I don't know why you'd want that, but look at the differences among answers this elicited.

  ma <- matrix(rep(c(a^(0:n), 0), n+1), nrow=n+1, ncol=n+1)
  ma[upper.tri(ma)] <- 0
(from Uwe Ligges)

> n <- 3
> a <- 2
> out <- diag(n+1)
> out[lower.tri(out)] <-
a^apply(matrix(1:n,ncol=1),1,function(x)c(rep(0,x),1:(n-x+1)))[lower.tri(out)]
> out
(from Woodrow Setzer)

tmpmat <- function(a,n) {
  x <- matrix(a,nrow=n,ncol=n)
  x <- x^pmax(0,row(x)-col(x))
  x[row(x)
(from Ben Bolker)

If you want an upper triangular matrix, this use of "ifelse" looks great to me:

fun <- function(x,y) { ifelse(y>x, x+y, 0) }
(then use this with outer()) or
m <- outer(x,y,FUN="+")
m[lower.tri(m, diag=T)] <- 0
(from Kaspar Pflugshaupt)

[top]

3.20 Calculate inverse of X   (20/06/2001 ? <?>)

solve(A) yields the inverse of A.

[top]

3.21 Interesting use of Matrix Indices   (20/06/2001 ? <?>)

Here's a similar problem.

Mehdi Ghafariyan said "I have two vectors A=c(5,2,2,3,3,2) and B=c(2,3,4,5,6,1,3,2,4,3,1,5,1,4,6,1,4) and I want to make the following matrix using the information I have from the above vectors.

 
      0 1 1 1 1 1 
      1 0 1 0 0 0 
      0 1 0 1 0 0 
      1 0 1 0 1 0
      1 0 0 1 0 1 
      1 0 0 1 0 0 
so the first vector says that I have 6 elements therefore I have to make a 6 by 6 matrix and then I have to read 5 elements from the second vector , and put 1 in [1,j] where j=2,3,4,5,6 and put zero elsewhere( i.e. in [1,1]) and so on. Any idea how this can be done in R ?

Use matrix indices:

 a<-c(5,2,2,3,3,2) 
 b<-c(2,3,4,5,6,1,3,2,4,3,1,5,1,4,6,1,4)

n<-length(a) M<-matrix(0,n,n) M[cbind(rep(1:n,a),b)]<-1

(from Peter Dalgaard )

[top]

3.22 Eigenvalues example   (20/06/2001 ? <?>)

Tapio Nummi asked about double-checking results of spectral decomposition

In what follows, m is this matrix:

  0.4015427  0.08903581 -0.2304132
  0.08903581 1.60844812  0.9061157
-0.23041322 0.9061157   2.9692562

Brian Ripley posted:

> sm <- eigen(m, sym=TRUE)
> sm
$values
[1] 3.4311626 1.1970027 0.3510817

$vectors [,1] [,2] [,3] [1,] -0.05508142 -0.2204659 0.9738382 [2,] 0.44231784 -0.8797867 -0.1741557 [3,] 0.89516533 0.4211533 0.1459759

> V <- sm$vectors > t(V) %*% V [,1] [,2] [,3] [1,] 1.000000e+00 -1.665335e-16 -5.551115e-17 [2,] -1.665335e-16 1.000000e+00 2.428613e-16 [3,] -5.551115e-17 2.428613e-16 1.000000e+00

> V %*% diag(sm$values) %*% t(V) [,1] [,2] [,3] [1,] 0.40154270 0.08903581 -0.2304132 [2,] 0.08903581 1.60844812 0.9061157 [3,] -0.23041320 0.90611570 2.9692562

[top]

4 Applying functions, tapply, etc


4.1 Return multiple values from a function   (30/01/2001 Paul Johnson <pauljohn@ku.edu>)

Below is a summary on returning variable from subfunction, Thanks to Brian Ripley and Douglas Bates:

To access the variables within a function, return a list or data structure assigned to a variable in the calling function. For more than one varaible, assign them to a list variable and separate the components in the calling function using listname$variablename.

"test" <-
    function()
  {
    "hello" <-
      function()    
        {
          x <- 1:3
          y <- 23:34
          z <- cbind(x,y)
          return(x,y,z)              
        }
    c <- hello()
    return(c$z,c$x, c$y)
  }
Sam McClatchie

[top]

4.2 Grab "p" values out of a list of significance tests   (22/08/2000 Paul Johnson <pauljohn@ku.edu>)

Suppose chisq1M11 is a list of htest objects, and you want a vector of p values. Kjetil Kjernsmo observed this works:

> apply(cbind(1:1000), 1, function(i) chisq1M11[[i]]$p.value)

And Peter Dalgaard showed a more elegant approach:

 
sapply(chisq1M11,function(x)x$p.value)

In each of these, a simple R function called "function" is created and applied to each item

[top]

4.3 ifelse usage   (06/04/2001 Paul Johnson <pauljohn@ku.edu>)

If you want to calculate something, but only if y is *between* 0 and 1 then

ifelse(y==0,0,y*log(y))

(from Ben Bolker)

[top]

4.4 Apply to create matrix of probabilities, one for each "cell"   (14/08/2000 Paul Johnson <pauljohn@ku.edu>)

Suppose you want to calculate the gamma density for various values of "scale" and "shape". So you create vectors "scales" and "shapes", then create a grid if them with expand.grid, then write a function, then apply it (this example courtesy of Jim Robison-Cox) :

gridS <- expand.grid(scales, shapes)
survLilel <- function(ss) sum(dgamma(Survival,ss[1],ss[2]))
Likel <- apply(gridS,1,survLilel)

Actually, I would use

sc <- 8:12; sh <- 7:12
args <- expand.grid(scale=sc, shape=sh)

matrix(apply(args, 1, function(x) sum(dgamma(Survival, scale=x[1], shape=x[2], log=T))), length(sc), dimnames=list(scale=sc, shape=sh))

(from Brian Ripley)

[top]

4.5 Outer.   (15/08/2000 Paul Johnson <pauljohn@ku.edu>)

outer(x,y,f) does just one call to f with arguments created by stacking x and y together in the right way, so f has to be vectorised. (from Thomas Lumley)

[top]

4.6 Check if something is a formula/function   (11/08/2000 Paul Johnson <pauljohn@ku.edu>)

Formulae have class "formula", so inherits(obj, "formula") looks best. (from Prof Brian D Ripley)

And if you want to ensure that it is ~X+Z+W rather than Y~X+Z+W you can use

inherits(obj,"formula") && (length(obj)==2)

(from Thomas Lumley)

[top]

4.7 Optimize with a vector of variables   (11/08/2000 Paul Johnson <pauljohn@ku.edu>)

The function being optimized has to be a function of a single argument. If alf and bet are both scalars you can combine them into a vector and use

opt.func <- function(arg) 
  t(Y-(X[,1] * arg[1] + X[,2] * arg[2])^delta) %*% covariance.matrix.inverse %*%
            (Y-(X[,1] * arg[1] + X[,2] * arg[2])^delta)
(from Douglas Bates)

[top]

4.8 slice.index, like in S+   (14/08/2000 Paul E Johnson <pauljohn@ku.edu>)

slice.index<-function(x,i){k<-dim(x)[i];sweep(x,i,1:k,function(x,y)y)}

(from Peter Dalgaard)

[top]

5 Graphing


5.1 Adjust features with par before graphing   (18/06/2001 Paul Johnson <pauljohn@ku.edu>)

par() is a multipurpose command to affect qualities of graphs. Examples include:

1. put the next plots into a single picture:

par(mfrow=c(3,2))
creates a 3x2 matrix of plots.

2. Adjust margins, as in

par(mfrow=c(2,1))
par(mar=c(10, 5, 4, 2))
x <- 1:10
plot(x)
box("figure", lty="dashed")
par(mar=c(5, 5, 10, 2))
plot(x)
box("figure", lty="dashed")

[top]

5.2 Save graph output   (09/29/2005 Paul Johnson <pauljohn@ku.edu>)

There are two ways to save and/or print graphs. Before you create your graphs, you can decide on a format for output. A graph on the screen is using the default device, either x11 on Unix or windows on MS Windows.

Saving output is a complicated topic, and let me start by just telling you some examples that work "pretty well."

If you want to save a graph on the screen as an encapsulated post script file for inclusion in a document, do this

dev.copy(postscript, file="myfile.eps", height=6, width=6, horizontal=F, onefile=F)
dev.off()

Height and width are in inches, and the onefile=F option is absolutely vital if you want the result to be truly useful in your output document. The onefile=F makes sure the EPS bounding box is set, so programs that use the figure can find the boundaries in it.

If you want to save the same as a picture image, say a png file with white background, do

dev.copy(png,filename="myfile.png",height=600, width=800,bg="white")
dev.off()

png format required height and width in pixels, not inches like postscript.

Note you can also set the pointsize for text. Peter Dalgaard told me that on postscript devices, a point is 1/72nd of an inch. If you expect to shrink an image for final presentation, set a big pointsize, like 20 or so, in order to avoid having really small type.

The dev.copy approach works well ALMOST ALL OF THE TIME. Sometimes, however, you get in trouble because the size on the computer screen does not translate into the size you specify in your output. So you can either re-do your screen output or revise your copy command.

The R experts recommend that instead of copying a result from screen into a device file, we should instead create the device and then run the plot commands. Then R knows it is supposed to create the graph for the indicated device and everything is supposed to be internally consistent.

R uses the term "device" to refer to the various graphical output formats. The command dev.off() is vital because it lets the device know you are finished drawing in there.

There are many devices representing kinds of file output. You can choose among device types, postscript, png, jpeg, bitmap, etc. (windows users can choose windows metafile). Type

 
>?device 
to see a list of devices R finds on your system.

For each device, you can find out more about it.

 
?x11 
or 
?png 
or, in MS Windows 
?windows 
(and so forth) 

You start a device with a configuration command, like

 
png(filename="myfile.png",width=800, height=600,bg="blue") 
for the default png (a gif-like) format.

Suppose we want png output. We create an instance of a png device and adjust features for that device, as in:

 
> png(filename="mypicture.png", width=480, height=640, pointsize=12) 

Note, if no filename is here, it will print to the default printer.

You can then run your graph, and when it is finished you have to close the device:
 
> dev.off() 
Some print devices can accept multiple plots, and dev.off() is needed to tell them when to stop recording. On a single-plot device, if you try to run another graph before closing this one, you'll get an error.

The postscript device has options to output multiple graphs, one after the other. If you want to set options to the postscript device, but don't actually want to open it at the moment, use ps.options.

Here's an example of how to make a jpeg:

 
jpeg(filename="plot.jpg") 
plot(sin, 2*pi) 
dev.off() 

After you create a device, check what you have on:

 
> dev.list() 
And, if you have more than one device active, you tell which one to use with dev.set(n), for the n device. Suppose we want the third device:
 
> dev.set(3) 

For me, the problem with this approach is that I don't usually know what I want to print until I see it. If I am using the jpeg device, there is no screen output, no picture to look at. So I have to make the plot, see what I want, turn on an output device and run the plot all over in order to save a file. It seems complicated, anyway.

So what is the alternative?

If you already have a graph on the screen, and did not prepare a device, use dev.copy() or dev.print(). This may not work great if the size of your display does not match the target device.

dev.copy() opens the device you specify. It is as if (but not exactly like) you ran the png() command before you made the graph:

 
dev.copy(device=png, file="foo", width=500, height=300) 
dev.off() 

The dev.copy() function takes a device type as the first argument, and then it can take any arguments that would ordinarily be intended for that device, and then it can take other options as well. Be careful, some devices want measurements in inches, while others want them in pixels. Note that, if you do not specify a device, dev.copy() expects a which= option to be specified telling it which pre-existing device to use.

If you forget the last line, it won't work. It's like a write command.

If you use dev.copy or dev.print, you may run into the problem that your graph elements have to be resized and they don't fit together the way you expect. The default x11 size is 7in x 7in, but postscript size is 1/2 inch smaller than the usable papersize. That mismatch means that either you should set the size of the graphics window on the monitor display device to match the eventual output device, or you fiddle the dev.copy() or dev.print() command to make sure the sizes are correct. Recently I was puzzled over this and Peter Dalgaard said you can force the sizes to match, as in

  1. Force the pdf output to match the size of the monitor:
     
    dev.copy(pdf, file="foo", height=7, width=7) 
    
    or
  2. change the display size before creating the graphs so that its size matches the intended device you might use for output:
     
    x11(height=6, width=6) 
    

There are some specialized functions that can do this in a single step, as in dev.copy2eps() or dev2bitmap(). dev.copy2eps creates an eps file, and I'm not sure how it is different from the eps-compatible output created by the postscript() device. dev.copy2eps *is* an indirect way of calling dev.copy(device=postscript,...). It takes the dimensions from the current plot, and sets a couple of options. The same is true of dev.print().

The dev.print() function, as far as I can see, is basically the same as dev.copy(), except it has two appealing features. First, the graph is rescaled according to paper dimensions, and the fonts are rescaled accordingly. Due to the problem mentioned above, not everything gets rescaled perfectly, however, so take care. Second, it automatically turns off the device after it has printed/saved its result.

dev.print(device=postscript,file="yourFileName.ps")

If you just need to test your computer setup, Bill Simpson offered this. Here is a plot printing example

 
x<-1:10 
y<-x 
plot(x,y) 
dev.print(width=5,height=5, horizontal=FALSE) 

A default jpeg is 480x480, but you can change that:

 
jpeg(filename="plot.jpg" width = 460, height = 480, pointsize = 12, quality = 85) 
The format of png use is the same.

As of R1.1, the dev.print() and dev.copy2eps() will work when called from a function, for example:

 
> ps <- function(file="Rplot.eps", width=7, height=7, ...) { 
        dev.copy2eps(file=file, width=width, height=height, ...) 
   } 
> data(cars) 
> plot(cars) 
> ps() 

The mismatch of "size" between devices even comes up when you want to print out an plot. This command will print to a printer:

 
> dev.print(height=6, width=6, horizontal=FALSE) 
You might want to include pointsize=20 or whatever so the text is in proper proportion to the rest of your plot.

One user observed, "Unfortunately this will also make the hash marks too big and put a big gap between the axis labels and the axis title...", and in response Brian Ripley observed: "The problem is your use of dev.print here: the ticks change but not the text size. dev.copy does not use the new pointsize: try

 
x11(width=3, height=3, pointsize=8) 
x11(width=6, height=6, pointsize=16) 
dev.set(2) 
plot(1:10) 
dev.copy() 
Re-scaling works as expected for new plots but not re-played plots. Plotting directly on a bigger device is that answer: plots then scale exactly, except for perhaps default line widths and other things where rasterization effects come into play. In short, if you want postscript, use postscript() directly.

The rescaling of a windows() device works differently, and does rescale the fonts. (I don't have anything more to say on that now)

[top]

5.3 How to save separate files of plot output   (10/04/2001 Paul Johnson <pauljohn@ku.edu>)

postscript(file="test3.%d.ps",onefile=FALSE) gives files called test3.1.ps, test3.2.ps and so on. (from Thomas Lumley)

[top]

5.4 Control papersize   (15/08/2000 Paul Johnson <pauljohn@ku.edu>)

options(papersize="letter")

whereas ps.options is only for the postscript device

[top]

5.5 Integrating R graphs into documents: Tex and EPS   (20/06/2001 Paul Johnson <pauljohn@ku.edu>)

Advice seems to be to output R graphs in a scalable vector format, eps on Linux or eps or metafile on windows. If you are using Tex for document prep, on linux use "tetex" and on windows it has an implementation called http://www.tile.net/listserv/fptex.html and another called Miktex www.miktex.org I've installed Miktex and its pretty g