Vector variables in R

The “atomic” variable modes are

  1. integer
  2. double (floating point numeric)
  3. character
  4. logical (TRUE or FALSE represent 1 and 0) AKA Boolean
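
A quick way to see the four modes is to ask R for the type of one value of each kind. This is a small sketch that uses the base function typeof(); the values and the name modes are arbitrary choices for illustration.

```r
## One example value for each of the four modes discussed here
modes <- sapply(list(7L, 7.5, "seven", TRUE), typeof)
modes            # "integer" "double" "character" "logical"

## Logical values behave like 1 and 0 in arithmetic
TRUE + TRUE      # 2
FALSE * 10       # 0
```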

We’ll not discuss

  1. complex
  2. raw

Not all variables are vectors

Atomic: a single “column” of information

Factors are not vectors in R; they are more complicated variables. For the same reason, ordered factors are not vectors.

I guessed that Date objects are vectors, but they are not, according to the help page ?vector.
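
The claim can be checked with is.vector(), which returns TRUE only when an object has no attributes other than names. The following is a minimal sketch; the object names f and d are made up for the illustration.

```r
f <- factor(c("low", "high", "low"))
d <- as.Date("2017-06-30")

is.vector(f)                  # FALSE: a factor carries "levels" and "class" attributes
is.vector(d)                  # FALSE: a Date carries a "class" attribute
is.vector(c(a = 1, b = 2))    # TRUE: names are the one attribute that is allowed
```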

These are “multi-variable” structures

  1. matrix
  2. list
  3. data.frame

A matrix is still “atomic” because we can conceptualize it as a vector that is broken into columns. Not true of lists or data frames.
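
One way to see this is to start with a plain vector and give it a dim attribute. This is only a sketch with made-up names (v, m); it relies on the base functions dim(), is.atomic(), and as.vector().

```r
v <- 1:6
m <- v
dim(m) <- c(3, 2)             # same six values, now laid out as 3 rows and 2 columns
m

is.atomic(m)                  # TRUE: still one homogeneous block of values
is.atomic(list(1, "a"))       # FALSE: a list can mix types
identical(as.vector(m), v)    # TRUE: dropping the dim attribute recovers the vector
```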

c and explicit typing

Easy to create vectors

Authors often introduce vectors by the c() function. Here’s a vector, for example

x <- c(13, 2, 33, 4, 35)

x is a column vector, mathematically speaking, even though it prints out horizontally to “save space”.

x
[1] 13  2 33  4 35

R has no true “scalar” values. Even if you declare a single element, the result is a vector of length one.

y <- 5
is.vector(y)
[1] TRUE

It is not necessary to type y[1] to obtain the value, although it is allowed:

identical(y[1], y)
[1] TRUE

Access elements

By integer subscripts in brackets.

Retrieve elements one at a time.

x[1]
[1] 13
x[5]
[1] 35

Use an index vector

x[c(3, 4, 5)]
[1] 33  4 35

Can separate calculation of the index (make 2 steps)

indx <- c(2, 4)
x[indx]
[1] 2 4

Omit by negative subscripts

x[-4]
[1] 13  2 33 35
x[c(-3,-4)]
[1] 13  2 35

A TRUE-FALSE vector can be an index.

Pull out items 1 and 4 by setting them as true

indx <- c(TRUE, FALSE, FALSE, TRUE, FALSE)
x[indx]
[1] 13  4

More examples: using logical comparisons to select (filter) values.

A comparison such as x > 0 returns a TRUE/FALSE vector. I’ll use those “TRUE” values to filter the values from x which are greater than 0.

xgt0 <- x > 0
x2 <- x[xgt0]
x2
[1] 13  2 33  4 35

Often, we’d do that selection in one step, but it is easier to understand what’s happening if you do the 2 separate steps (good both for novices and bug-checkers).

x2 <- x[x>0]
x2
[1] 13  2 33  4 35

We could use which() to achieve the same result

xwh <- which(x > 0)
x[xwh]
[1] 13  2 33  4 35
## or
x[which(x > 0)]
[1] 13  2 33  4 35
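
The two approaches give the same answer here, but they can differ when the vector has missing values. The sketch below uses a made-up vector named scores to show the difference; it is an illustration, not part of the original example.

```r
scores <- c(3, NA, -1, 8)

## A logical index passes the NA through as an NA element
by_logical <- scores[scores > 0]
by_logical                     # 3 NA 8

## which() returns only the positions where the comparison is TRUE
by_which <- scores[which(scores > 0)]
by_which                       # 3 8
```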

c() is a friend and an enemy.

Convenient

Reasons why newcomers like c()

  1. convenient
  2. hyper-flexible: can throw together anything
  3. often does what we want
  4. can create named vector easily

Convenient

c() is brief, easy to remember

c() might stand for “combine”, “collect”, “concatenate”

Often works as expected, saves work that might be boring/repetitive.

When I said c() is flexible, I had in mind that

  1. It asks for additional memory and combines vectors gracefully.

x1 <- c(33, 22)
x2 <- c(55.1, 55, 58, 11, 12)
x3 <- c(x1, x2)
x3
[1] 33.0 22.0 55.1 55.0 58.0 11.0 12.0

Behind the scenes, here’s what has to happen to create x3.

  1. The number of elements in x1 and x2 must be counted.
  2. Memory must be requested for a vector of that combined length.
  3. The individual elements must be copied into the newly allocated memory.

  2. c() is very helpful because it can, literally, combine completely different kinds of things and give a sensible result. (That’s pleasant and dangerous.)
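
To make “completely different kinds of things” concrete: when the inputs differ in type, c() converts everything to the most general type present. The sketch below uses arbitrary values and a made-up name, mixed, to show the order logical, integer, double, character.

```r
typeof(c(TRUE, 2L))       # "integer": logical is promoted to integer
typeof(c(2L, 3.5))        # "double": integer is promoted to double
typeof(c(3.5, "four"))    # "character": everything becomes text

mixed <- c(TRUE, 2L, 3.5, "four")
mixed                     # "TRUE" "2" "3.5" "four"
```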

Named vector

z <- c("beta0" = 0.1, "beta1" = 1.1, "beta2" = 0.04)

Note that the quotation marks are not necessary on the names; I am just accustomed to typing them. The previous command is equivalent to creating the vector first and then using the assignment form of names() to attach the names.

z2 <- c(0.1, 1.1, 0.04)
names(z2) <- c("beta0", "beta1", "beta2")
z2
beta0 beta1 beta2 
 0.10  1.10  0.04 

In real life, I’d avoid so much typing by pasting the names together with a statement like

z2 <- c(0.1, 1.1, 0.04)
names(z2) <- paste0("beta", 0:2)
z2
beta0 beta1 beta2 
 0.10  1.10  0.04 

Named vectors cause some calculations to go slower in R, so we would not make a huge structure with named elements. However, for small-to-medium vectors, named vectors are often very convenient. Naming the elements reduces the danger of accessing the wrong value by a numeric index. We also benefit by keeping a cleaner workspace: we avoid creating separate variables for \(\beta_0\), \(\beta_1\) and so forth, and just retrieve them by name if we need them:

z["beta0"]
beta0 
  0.1 

If the names get in your way, use the unname function

unname(z)
[1] 0.10 1.10 0.04

The c() function also has a superpower feature, the recursive argument. If recursive is true, then c() will dig through lists (not discussed here) and pull out their individual elements.
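
A short sketch of the recursive argument follows. The list named nested is invented for the illustration; the only feature used is c(..., recursive = TRUE).

```r
nested <- list(1, list(2, 3), 4)

kept <- c(nested)                      # without recursive, the result is still a list
is.list(kept)                          # TRUE
length(kept)                           # 3

flat <- c(nested, recursive = TRUE)    # dig into the inner list
flat                                   # 1 2 3 4
is.list(flat)                          # FALSE: an ordinary double vector
```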

What’s bad about c()?

c() “guesses” at the type of data we want to store.

There’s a difference between an integer and a floating point number, right? The difference is much bigger in computer math than in pencil and paper math.

Why the difference? Computers use 0’s and 1’s to record numbers. R stores an integer in 32 bits. The integer \(1\) is \(31\) \(0\)’s followed by a \(1\). The integer \(3\) is \(30\) \(0\)’s followed by \(11\). Integers are exact!

Floating point numbers are approximations built on, say, 64 bit values. A number which appears as 3 on the screen might in fact be 2.999999999234 because of rounding error.
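
A standard demonstration of this rounding error uses 0.1 + 0.2. The sketch below is an illustration with a made-up name, a; it shows that the printed value hides the discrepancy until more digits are requested.

```r
a <- 0.1 + 0.2
a                        # prints 0.3
a == 0.3                 # FALSE
print(a, digits = 17)    # 0.30000000000000004

sqrt(2)^2 == 2           # FALSE, for the same reason
```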

  1. Integer comparisons are OK, can use “==” and “!=” for equal and not equal.
x <- c(5L, 10L, 15L, 20L, 25L, 30L)
y <- seq(5L, 30L, 5L)
x == y
[1] TRUE TRUE TRUE TRUE TRUE TRUE

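The “L” suffix tells R to store the number as an integer; the letter is borrowed from C’s “long” integer type. In R, all integers are stored in 32 bits, so the largest integer is 2147483647.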
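
R integers occupy 32 bits, so their range is limited. The sketch below checks the limit with .Machine$integer.max and shows what happens just past it; the names big and overflow are made up for the illustration.

```r
big <- .Machine$integer.max
big                                      # 2147483647, which is 2^31 - 1

## Adding 1L overflows; R returns NA (with a warning) rather than wrapping around
overflow <- suppressWarnings(big + 1L)
overflow                                 # NA

## Adding a double 1 promotes the result to double, so nothing is lost
big + 1                                  # 2147483648
```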

The identical() function can be used to compare entire vectors.

identical(x, y)
[1] TRUE

Fixes

Other ways to let R know you want an integer vector

  1. declare x as an integer before assigning values.
x <- integer(5)
## same as
## x <- vector(mode = "integer", length = 5)

Then we have a somewhat stupid chore of putting values into x

x[1] <- 13L
x[2] <- 2L
x[3] <- 33L
x[4] <- 4L
x[5] <- 35L
is.integer(x)
[1] TRUE

That is tedious.

Are the “L”’s needed? Yes. Assigning a double such as 13 into an integer vector converts the whole vector to double. Observe:

x <- integer(5)
x[1] <- 13
x[2] <- 2
x[3] <- 33
x[4] <- 4
x[5] <- 35
is.integer(x)
[1] FALSE

  2. In the usual situation, people might use “coercion” after creating x.

x <- c(13, 2, 33, 4, 35)
x <- as.integer(x)

In this case, the coercion appears to be harmless.

Sometimes, the coercion is not so harmless. as.integer() discards the fractional part, so in effect it “rounds down” positive values (negative values move toward zero).

x <- c(13, 2, 33, 4, 35.0001)
x <- as.integer(x)
x
[1] 13  2 33  4 35

R has functions floor() and round() if you really do intend that to happen.
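
The following sketch compares as.integer() with floor(), round(), and the related base function trunc() on a few made-up values (the name vals is arbitrary). The negative value shows where the functions disagree.

```r
vals <- c(-2.7, 2.7, 35.0001)

as.integer(vals)   # -2  2 35  (fraction dropped, toward zero; result is integer)
trunc(vals)        # -2  2 35  (same values, but still stored as double)
floor(vals)        # -3  2 35  (always down)
round(vals)        # -3  3 35  (to the nearest whole number)
```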

The computer treats math with integers in a different way than with floating point values. If values truly are integers, OK! If one is a float, watch out!

  1. Floating point number problems

We can’t feel too terrifically confident that a number which appears as 1.0 (a floating point) is equal to 1L (an integer).

This example seems not too worrisome

x <- 5
y <- c(4L, 5L, 6L)
x == y
[1] FALSE  TRUE FALSE
z <- c(4, 5, 6)
y == z
[1] TRUE TRUE TRUE

z is seen as equal to y because == converts the integer values in y to double before comparing. The two vectors are nevertheless stored differently, as we deduce from

identical(y, z)
[1] FALSE
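
The difference that identical() reports can be traced to the storage type. Here is a small sketch with renamed copies of the two vectors (yi and zd are made-up names), assuming nothing beyond base R.

```r
yi <- c(4L, 5L, 6L)
zd <- c(4, 5, 6)

typeof(yi)                      # "integer"
typeof(zd)                      # "double"
all(yi == zd)                   # TRUE: the values agree after coercion
identical(yi, zd)               # FALSE: the types differ
identical(as.double(yi), zd)    # TRUE once the types match
```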

But look at this horrifying example from the help page ?all.equal

x <- pi*(1/4 + 1:10)
xtan <- tan(x)
## Looks like integers
xtan
 [1] 1 1 1 1 1 1 1 1 1 1
is.integer(xtan)
[1] FALSE
xtan == 1L
 [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE

As a result, we conclude that exact equality comparisons between floating point numbers are strongly discouraged. R’s all.equal() and zapsmall() functions are intended to help with comparison of floating point values.

zapsmall(xtan) == 1L
 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
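
For completeness, here is a sketch of the all.equal() approach on the same calculation, using new names (angles, tans, close_to_1) so nothing above is overwritten. all.equal() returns either TRUE or a description of the difference, which is why it is usually wrapped in isTRUE(). The explicit tolerance of 1e-8 is an arbitrary choice for the illustration.

```r
angles <- pi * (1/4 + 1:10)
tans <- tan(angles)

## all.equal() compares within a small numerical tolerance
isTRUE(all.equal(tans, rep(1, 10)))    # TRUE

## An explicit tolerance check, element by element
close_to_1 <- abs(tans - 1) < 1e-8
all(close_to_1)                        # TRUE
```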

Danger!

Accidental data corruption

If we use it unthinkingly, c() will destroy data (or, well, alter it unexpectedly).

Suppose we have some values and there is a missing score, which we accidentally represent as “NA”.

x3 <- c(1, 2, 3, "NA", 5)

What is x3 now?

is.integer(x3)
[1] FALSE
is.double(x3)
[1] FALSE
is.character(x3)
[1] TRUE
x3
[1] "1"  "2"  "3"  "NA" "5" 

What did I mean to do? Use the R symbol NA, without quotes, to indicate that the fourth score was missing.

x4 <- c(1, 2, 3, NA, 5)
is.character(x4)
[1] FALSE
x4
[1]  1  2  3 NA  5
is.na(x4)
[1] FALSE FALSE FALSE  TRUE FALSE

The return value from is.na() is an example of a logical vector; the values are either TRUE or FALSE. Those are symbolic equivalents of 1 and 0. See?

x4missing <- is.na(x4)
x4missing == 1
[1] FALSE FALSE FALSE  TRUE FALSE
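
Because TRUE counts as 1, a logical vector can be summed to count missing values. The sketch below also shows one way to repair the accidental character vector; the names scores_chr and scores_num are invented for the illustration.

```r
scores_chr <- c(1, 2, 3, "NA", 5)         # character, as in the accident above
scores_num <- as.numeric(scores_chr)      # back to numbers; the text "NA" becomes a real NA
scores_num                                # 1 2 3 NA 5

sum(is.na(scores_num))                    # 1 missing value
mean(scores_num)                          # NA: missing values propagate
mean(scores_num, na.rm = TRUE)            # 2.75
```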

Vectorized calculations in R

“Vectorized” means fewer for loops

Many (not all) functions in R are vectorized. It is not necessary to apply a function individually to the elements (say, in a “for loop”). Instead, we handle a whole vector in one step.

x1 <- 1:10
3 * x1
 [1]  3  6  9 12 15 18 21 24 27 30
log(x1)
 [1] 0.0000000 0.6931472 1.0986123 1.3862944 1.6094379 1.7917595
 [7] 1.9459101 2.0794415 2.1972246 2.3025851
sqrt(x1)
 [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751
 [8] 2.828427 3.000000 3.162278
exp(x1)
 [1]     2.718282     7.389056    20.085537    54.598150   148.413159
 [6]   403.428793  1096.633158  2980.957987  8103.083928 22026.465795
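
For comparison, here is a sketch of what the loop version of 3 * x1 would look like. The names v, tripled_loop, and tripled_vec are made up; the point is only that both routes give the same result.

```r
v <- 1:10

## The loop version: allocate the result, then fill it one element at a time
tripled_loop <- numeric(length(v))
for (i in seq_along(v)) {
    tripled_loop[i] <- 3 * v[i]
}

## The vectorized version: one expression handles every element
tripled_vec <- 3 * v

identical(tripled_loop, tripled_vec)   # TRUE
```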

Similarly, addition, subtraction, and multiplication are vectorized

x2 <- 55:64
x1 + x2
 [1] 56 58 60 62 64 66 68 70 72 74
x2 - x1
 [1] 54 54 54 54 54 54 54 54 54 54
0.1 * x2 - x1
 [1]  4.5  3.6  2.7  1.8  0.9  0.0 -0.9 -1.8 -2.7 -3.6

The symbol “*” indicates ‘term wise’ multiplication. It is not an “inner product” or “dot product” as in linear algebra.
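
To contrast the two, here is a sketch with two short made-up vectors, a and b. The inner product can be computed by summing the term-wise products or with the matrix multiplication operator %*%.

```r
a <- c(1, 2, 3)
b <- c(4, 5, 6)

a * b             # 4 10 18: element by element
sum(a * b)        # 32: the inner product as a single number
a %*% b           # the same 32, returned as a 1 x 1 matrix
drop(a %*% b)     # 32 again, with the matrix structure dropped
```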

Many R functions produce vectors

  1. Random number generators
set.seed(234234)
x <- rnorm(10)
head(x)
[1] -0.1308295 -0.6777994  0.1435791 -0.4879708 -0.1845969  0.5976032
is.vector(x)
[1] TRUE
is.double(x)
[1] TRUE

head(x) is a shortcut for x[1:6]; see ?head

  2. Sequence seq()

  3. Replicate rep()

  4. Logical comparisons create logical vectors.

xgt0 <- x > 0
head(xgt0)
[1] FALSE FALSE  TRUE FALSE FALSE  TRUE
is.logical(xgt0)
[1] TRUE
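
The seq() and rep() functions listed above were not demonstrated, so here is a brief sketch. The argument values and the names s1, s2, r1, r2 are arbitrary; by, length.out, times, and each are standard arguments of these base functions.

```r
s1 <- seq(0, 1, by = 0.25)
s1                                   # 0.00 0.25 0.50 0.75 1.00
s2 <- seq(10, 100, length.out = 4)
s2                                   # 10 40 70 100
seq_len(5)                           # 1 2 3 4 5 (integers)

r1 <- rep(c("a", "b"), times = 2)
r1                                   # "a" "b" "a" "b"
r2 <- rep(c("a", "b"), each = 2)
r2                                   # "a" "a" "b" "b"
```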

cbind and rbind

The cbind and rbind functions are the vector-wise equivalents of c(). These are both 1) handy and 2) dangerous.

cbind: combine columns side by side

A vector is, by definition, a column structure. Let’s make 2 columns and bind them together.

x1 <- 1:5
x2 <- seq(100, 180, by = 20)
X <- cbind(x1, x2)
X
     x1  x2
[1,]  1 100
[2,]  2 120
[3,]  3 140
[4,]  4 160
[5,]  5 180

The object X is a matrix, which we will discuss in a separate set of notes.

class(X)
[1] "matrix"

We don’t want to get bogged down here in what a matrix is, or what a “class” is in R, or how a matrix is different from a vector. We will get bogged down in that later.

cbind: what’s dangerous about that?

  1. The unintended “demotion” or “promotion” of variable types occurs, as in c(). All of the columns may be altered by a single character in one of them.
x1 <- c(1, 2, 3, "NA", 5)
x2 <- seq(100, 180, by = 20)
X <- cbind(x1, x2)
X
     x1   x2   
[1,] "1"  "100"
[2,] "2"  "120"
[3,] "3"  "140"
[4,] "NA" "160"
[5,] "5"  "180"
mode(X)
[1] "character"

  2. “Recycling” will re-use values in a sometimes unexpected way:

x1 <- c(1, 2, NA)
x2 <- seq(100, 180, by = 20)
X <- cbind(x1, x2)
Warning in cbind(x1, x2): number of rows of result is not a multiple
of vector length (arg 1)
X
     x1  x2
[1,]  1 100
[2,]  2 120
[3,] NA 140
[4,]  1 160
[5,]  2 180

We do see the warning there, but this is very dangerous behavior. It is an example of why it is not recommended to turn off warnings (or develop the habit of ignoring them).
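
Recycling is even quieter when the longer length is an exact multiple of the shorter one, because then no warning is issued at all. The sketch below shows that case and one possible defensive habit, comparing lengths before binding. All the names (long, short, col1, col2, same_length) are made up for the illustration.

```r
long <- 1:6
short <- c(10, 20)
long + short          # 11 22 13 24 15 26: short is re-used three times, silently

## A defensive check before binding columns (one possible habit, not the only one)
col1 <- c(1, 2, NA)
col2 <- seq(100, 180, by = 20)
same_length <- length(col1) == length(col2)
same_length           # FALSE, so cbind(col1, col2) would recycle col1
```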

rbind: not entirely expected result

rbind stands for “row” bind.

When I first applied rbind to two (column) vectors,

x <- c(1, 2, 3)
y <- c(4, 5, 6)

I expected the result would be a column (1, 2, 3, 4, 5, 6). I was (mistakenly) expecting that, since both x and y are (column) vectors, R would treat them that way.

However, the behavior of rbind is different, entirely!

rbind(x, y)
  [,1] [,2] [,3]
x    1    2    3
y    4    5    6

That was a surprise to me. What happened? When we gave the two vectors to rbind(), R was thinking to itself “Ah, they must want me to treat those two vectors as rows!”.

And why would R have a right to think so? If I want to “stack together” two column vectors, I ought to use the c() function. That’s what c() is actually intended for, after all!

c(x, y)
[1] 1 2 3 4 5 6

The other lesson in this is that although vectors in R are generally thought of as column vectors, you can’t take that to the bank. Simply put, always do your best to double-check calculations to make sure you are getting what you expect.
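
One simple way to double-check is to look at the dimensions of the result. In this sketch the vectors are renamed p and q (made-up names) so the earlier x and y are left alone.

```r
p <- c(1, 2, 3)
q <- c(4, 5, 6)

dim(rbind(p, q))    # 2 3: each vector became a row
dim(cbind(p, q))    # 3 2: each vector became a column
length(c(p, q))     # 6: one longer vector
dim(c(p, q))        # NULL: a plain vector has no dimensions
```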

Afterthought 1: Transpose

Vectors are columns. In R, they are a separate type of storage. Remember they are columns.

Question What is the transpose of a column?

Answer A row.

But in R there is no such thing as a “row vector”. So what do we receive if we use the “transpose” operator on a column vector?

x <- c(10, 11, 12, 13, 14, 15, 16)
x
[1] 10 11 12 13 14 15 16
xt <- t(x)
xt
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,]   10   11   12   13   14   15   16
class(xt)
[1] "matrix"

In R, the only way to represent a “row” is as a matrix with just one row. That’s an important technical difference. R treats vectors as columns but, as we shall see, it also has matrices with only one column, and those one-column matrices are not equivalent to an R vector in many ways.
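
A sketch of the difference between a vector and a one-column matrix follows. The names vec and col_mat are made up; the functions used are all in base R.

```r
vec <- c(10, 11, 12)
col_mat <- matrix(vec, ncol = 1)

dim(vec)                              # NULL: a plain vector has no dimensions
dim(col_mat)                          # 3 1
is.vector(col_mat)                    # FALSE
identical(vec, col_mat)               # FALSE
identical(vec, as.vector(col_mat))    # TRUE: as.vector() drops the dim attribute

## Transposing twice does not return the vector; it gives a one-column matrix
dim(t(t(vec)))                        # 3 1
```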

Afterthought 2: Super confusing problem of storage mode versus R class

The actual storage work is handled in C, where the term “type” is used for variables. The types are “int” (integer), “long” (integer that can hold more values), “float” (floating point real number), “double” (a double-precision floating point number), and so forth. In the R documentation, these are referred to as “types” (or the very closely related “storage modes”).

The reason for inserting this section is the ambiguity between the terms “numeric” and “double” (or double width floating point value) in various contexts.
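
The ambiguity can be seen by asking R four different questions about the same value. The helper describe() below is hypothetical, written only for this illustration; typeof(), mode(), storage.mode(), and class() are the base functions it calls.

```r
## Hypothetical helper: collect four descriptions of one object
describe <- function(v) {
    c(typeof = typeof(v), mode = mode(v),
      storage.mode = storage.mode(v), class = class(v))
}

describe(1L)     # typeof "integer", mode "numeric", storage.mode "integer", class "integer"
describe(1.0)    # typeof "double",  mode "numeric", storage.mode "double",  class "numeric"
describe("a")    # "character" in all four
describe(TRUE)   # "logical" in all four
```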

Class

Many R users will never concern themselves with type or storage mode, but they will be interested in the “class” of an R object. The idea of object “class” frames almost all of the R user’s day-to-day interaction with R.

R marks each object with an attribute called “class” and that attribute is used by the R runtime system to make good guesses about what users need when they make requests. The term “class” embraces a much wider range of data structures than the “integer”, “double”, “character” storage mode family. These classes are the structures that have made S and R famous, including factors, Dates, lists, data frames, and matrices. These things, of course, have to exist in memory with a certain structure, but since there are no built-in C equivalents of lists or dates, there is no danger of confusion.
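
A sketch of how class sits on top of storage: a factor and a Date each report their own class, yet both are stored as ordinary numbers underneath. The names grp and day are made up for the illustration.

```r
grp <- factor(c("b", "a", "b"))
class(grp)         # "factor"
typeof(grp)        # "integer": the level codes are stored as integers
as.integer(grp)    # 2 1 2: positions in the sorted levels "a", "b"

day <- as.Date("2017-06-30")
class(day)         # "Date"
typeof(day)        # "double": a count of days since 1970-01-01
```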

Where Class and Storage terminology do not differ (integer, logical, character)

There is no confusion about the meaning of storage mode or class in the cases of “character” and “logical” variables. The R classes “character” and “logical” are exactly what you expect. They are vectors for which the storage mode is “character” or “logical”. There’s no trouble.

Consider a logical vector. I believe the output from coercion into other types is mostly understandable.

x <- c(TRUE, FALSE, FALSE, TRUE)
is.logical(x)
[1] TRUE
as.character(x)
[1] "TRUE"  "FALSE" "FALSE" "TRUE" 
as.integer(x)
[1] 1 0 0 1

Numeric: Where Class and Storage terminology differ

See the “Note on names” in the help page “?numeric”. The confusion is that the name “numeric” sometimes means “floating point double precision numbers” while sometimes it includes both integers and floating-point numbers. The treatment is different in the older family of S3 functions. In S4 family, numeric means double-precision floating point values.

We will demonstrate the difference by starting with that logical vector.

x <- c(TRUE, FALSE, FALSE, TRUE)
z <- as.numeric(x)
z
[1] 1 0 0 1
is.integer(z)
[1] FALSE
is.double(z)
[1] TRUE

The 0’s and 1’s in z represent floating point values, not integer 0L and 1L. The as.numeric function always generates a floating point value, even though we might wish we could have integer 0L and 1L.

Now let’s try the same exercise from another direction. The ambiguity of “numeric” will reveal itself.

x <- c(TRUE, FALSE, FALSE, TRUE)
z <- as.integer(x)
z
[1] 1 0 0 1
is.numeric(z)
[1] TRUE
is.double(z)
[1] FALSE

The difference between “is.numeric” and “as.numeric” flows from the fact that as.numeric always creates floating point numbers, while “is.numeric” returns TRUE if the storage mode of the vector is integer or floating point. Those are all “numbers”, especially when we need to differentiate them from character or logical variables.

Function Collection

  1. c(): general purpose concatenator, often used to create vectors

  2. vector(): allocates space for a vector of a given type. Same as the functions double(), integer(), and so forth.

  3. is.___ functions are for checking a thing’s type, for example is.integer(), is.double(), is.character(), is.logical().

  4. as.___ family is for coercing a variable of one type into another: as.integer(), as.double(), as.logical(). A general purpose “as()” function can be used instead, with the target class as an argument.

  5. 1:10 is shorthand for seq(1L, 10L, 1L)

x1 <- 1:10
is.integer(x1)
[1] TRUE
x2 <- seq(1L, 10L, 1L)
identical(x1, x2)
[1] TRUE
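
Two of the items above, vector() and the general purpose as(), can be sketched briefly. The names blank and whole are made up; as() lives in the methods package, so it is called with an explicit methods:: prefix here as a precaution.

```r
## vector() and double() allocate the same thing: zeros of the requested type
blank <- vector(mode = "double", length = 3)
blank                                 # 0 0 0
identical(blank, double(3))           # TRUE

## as() with a target class does the same job as the matching as.___ function
whole <- methods::as(3.7, "integer")
whole                                 # 3
identical(whole, as.integer(3.7))     # TRUE
```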

Session Info

R version 3.4.1 (2017-06-30)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 17.04

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.7.0
LAPACK: /usr/lib/lapack/liblapack.so.3.7.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  base     

other attached packages:
[1] crmda_0.44

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.11    digest_0.6.12   rprojroot_1.2   plyr_1.8.4     
 [5] xtable_1.8-2    backports_1.1.0 magrittr_1.5    evaluate_0.10  
 [9] stringi_1.1.5   openxlsx_4.0.17 rmarkdown_1.6   tools_3.4.1    
[13] stringr_1.2.0   kutils_1.19     yaml_2.1.14     compiler_3.4.1 
[17] htmltools_0.3.6 knitr_1.16      methods_3.4.1  

Available under Creative Commons license 3.0 CC BY