Introduction

R includes functions that can convert information about dates and times into useful variables, so we can find, for example, the number of hours that elapsed between the time of my admission into prison and my eventual release. It is possible to add dates together, or subtract them, and the calculation is meaningful (that is, consistent with the calendar).

As I understand it, the R programmers have done all the work of taking the leap year into account, so we could ask for the number of days between the inauguration of Richard Nixon and Barack Obama and get a correct answer.

There are different mechanisms in place for management of time information. The major difference between these approaches is that we might have dates by themselves, or we might also have times (hour, minutes, seconds) with the dates. A date with time involves a much more elaborate data structure, of course, but with a much richer end result. We could, for example, calculate the number of minutes that passed between the glorious Denver Bronco victory in Superbowl 33 and the glorious Denver Bronco victory in Superbowl 50.

I’ve been aware of the possibilities for a long time, but was not interested in dates and times. I suppose becoming old makes one more aware of that. I was never required to calculate the difference in days or minutes between two events. That changed when we had a lot of date-time data about the activities of elderly people who were wearing accelerometer devices. And then we had data about times of admission into foster care and release from it for children in one state.

Working Example Excel File

I was prompted to prepare this note because a student at the summeR statistical institute asked for help with an Excel file. I made a tiny imitation for the example.

The example file is named date-20160524.xlsx. Hopefully, a copy of that is available in the same folder as this file, but if that fails, there is a copy at

http://pj.freefaculty.org/scraps/date-20160524.xlsx

That’s saved from a spread sheet program. Over the years, we have had more-or-less success in importing these directly into R. Sometimes less. Lately, we have had good luck with the openxlsx package for R, which was created by Alexander Walker. For us, that package appears to be something of a breakthrough: a truly open source, free-to-use and redistribute routine for opening and saving XLSX files. Without installing new software (such as Perl) or relying on a commercial product, we can import the most XLSX files that people email to us.
The package is on the CRAN server, here’s the BibTeX for the citation, just for fun.

@Manual{,
title = {openxlsx: Read, Write and Edit XLSX Files},
author = {Alexander Walker},
year = {2015},
note = {R package version 3.0.0},
url = {https://CRAN.R-project.org/package=openxlsx},
}

Dates and Times

If data is only dates, but not times are associated, then an R “Date” object is called for. This is the simplest type of data and it is the safest to use. (Date information does not depend on the time zone. That’s why it is closer to ‘worry free’.)

Concentrate on the Date problem first.

R’s Sys.Date function will return today’s date, so we have something to work with

z <- Sys.Date()
z
[1] "2016-08-30"

The thing z prints as a formatted structure because the object has a class attribute, which causes the print function to render it in that way. It is of class “Date”, as we see here

class(z)
[1] "Date"

What will the date be 6 days from now?

z + 6
[1] "2016-09-05"

What will date be 112 days from now?

z + 112
[1] "2016-12-20"

Create a new variable holding the date 200 days in the future

z2 <- z + 200

The difference between z2 and z should be 200 days. Lets cross our fingers and check.

z2 - z
Time difference of 200 days

If the data includes only dates, no times, then it is not actually meaningful to look for difference in minutes. But you can if you want:

difftime(z2, z, units = "mins")
Time difference of 288000 mins
difftime(z2, z, units = "secs")
Time difference of 17280000 secs

See ?difftime

About importing Date objects from Excel.

I made a test example Excel sheet that has a couple of date columns, and then one string column that looks like a date, then 2 that have the Excel time in them.

The openxlsx import can try to convert dates for us, and I have that first as example here, but I did not truly understand what was going on until I ran the second example, where I did not let openxlsx convert date/time columns automatically.

As you test this, remember to run str() frequently to see what the data has.

Now, a word about data input in Excel. The terminology is not exactly the same as within R. When setting Format Cell in Excel, I notice that there is one option for “date” and one for “time”. Appears to me the time format is just for for hours:min:sec and Excel has date formats that might be only days (04/01/2016) but can also do dates with times added in (04/01/2016 2:00 PM) or (04/01/2016 14:00). As a result, I’m not entirely sure what will happen when that’s imported into R. I show a few variations below to give the idea of the differences.

After testing, I believe that if the read.xlsx import process only spots date columns and it is not, at the moment, able to differentiate date columns from date-with-time columns coming from Excel. Perhaps I’m not declaring them properly in Excel.

However, if Excel has some non-empty times associated with the cells, in R we could use a more-complicated-than-Date storage class.
The two options are R classes named POSIXct and POSIXlt. These both store time in seconds and both can be used to calculate time between two events. The only difference between them is the internal storage format and that POSIXct is not human-readable because of how it is stored (seconds), while a POSIXlt object has the same instant in time saved with a more common looking combination of year, month, day, etc. If we have POSIXct data, we can extract the same year, month, day information, so there is no real practical difference. The only challenge we face is storage and conversion between the two, and an incredibly difficult proof-reading challenge of checking if our calculations are correct. If a column is saved with the wrong time zone, and then we subtract from a column that has the right time zone, well, the answer will be wrong.

Here’s some R help output on those two types of storage.

## POSIXct: seconds since 1970 (UTC time zone) as number.
## POSIXlt: a named list of vectors,
## 'sec"
## 'min"
## 'hour'
## "mday"  (day of the month 1-31)
## 'year'
## 'wday" 0-6, day sunday - sat
## 'yday' day of the year
## 'isdst' daylight time saving flag
## 
## ‘"POSIXct"’ is more convenient for including in data frames, and
##     ‘"POSIXlt"’ is closer to human-readable forms.

If we get wrestle the data correctly into a POSIXct column, we are safer because it uses the universal time code (UTC) and we don’t have to worry about subtracting 3PM Central Time from 8PM Pacific Time. This clarity should lead us to wish we had POSIXct storage. We’ll never get the difference in minutes or seconds wrong. On the other hand, after we do calculations in UTC, we need to convert to some local time zone in order to understand what we have.

One final detail. In the Excel sheet, it is not truly necessary to use a specialized time format. Many Excel users will simply use a text column. A date might saved as a character string, using some understandable format like “2016-05-22” or “05/22/2016”. There is a way to convert that to an R Date object. We will explore “strptime” below.

First, inspect the XLSX file in Excel.

See what you have! open “date-20160524.xlsx”. There are columns named x1 x2 x3 x4 and x5. Right click on the cells to see what the formats are.
The only simple one is x3, pure and simple text. x1 and 2 are date objects, but I chose not to ask Excel for times as well. On the other hand, x4 and x5 columns have both date and time information stored in them.

I’m guessing that columns 1 and 2 will be imported as R Date class objects, while 4 and 5 will either lose the time element (my fear!) or they will become POSIX date-time variables.

Open the XLSX file with read.xlsx

To load an XLSX file, first we use the library function to attach the openxlsx package.

library(openxlsx)

The function read.xlsx does the work. Note, the ``detectDates argument is set as TRUE, meaning we are hoping that read.xlsx will notice which columns are marked as date information in Excel and it will do the right thing.

dat1 <- read.xlsx("date-20160524.xlsx", detectDates = TRUE)

It looks about right for columns 1 through 3.

dat1
          x1         x2       x3         x4         x5
1 1999-01-01 2012-01-03 1999/1/3 2000-01-01 2002-04-01
2 2000-02-01 1999-04-05 1998/2/1 2000-02-02 2000-02-02
3 2009-03-03 2000-02-02 1996/9/1 2012-08-01 2012-08-02
str(dat1)
'data.frame':   3 obs. of  5 variables:
 $ x1: Date, format: "1999-01-01" ...
 $ x2: Date, format: "2012-01-03" ...
 $ x3: chr  "1999/1/3" "1998/2/1" "1996/9/1"
 $ x4: Date, format: "2000-01-01" ...
 $ x5: Date, format: "2002-04-01" ...
summary(dat1)
       x1                   x2                  x3           
 Min.   :1999-01-01   Min.   :1999-04-05   Length:3          
 1st Qu.:1999-07-18   1st Qu.:1999-09-03   Class :character  
 Median :2000-02-01   Median :2000-02-02   Mode  :character  
 Mean   :2002-10-02   Mean   :2003-10-14                     
 3rd Qu.:2004-08-17   3rd Qu.:2006-01-17                     
 Max.   :2009-03-03   Max.   :2012-01-03                     
       x4                   x5            
 Min.   :2000-01-01   Min.   :2000-02-02  
 1st Qu.:2000-01-17   1st Qu.:2001-03-02  
 Median :2000-02-02   Median :2002-04-01  
 Mean   :2004-03-22   Mean   :2004-12-21  
 3rd Qu.:2006-05-03   3rd Qu.:2007-06-02  
 Max.   :2012-08-01   Max.   :2012-08-02  

The Excel sheet has columns 1 and 2 saved as date information. We asked read.xlsx to notice date variables and try to convert them into R dates automatically.

I’m a little disappointed that columns x4 and x5 don’t have time elements.

Subtract dates

Lets see the benefit by calculating the difference in days between x1 and x2. We can treat the date columns as if they were numeric columns by subtracting them.

dat1$x1 - dat1$x2
Time differences in days
[1] -4750   302  3317

The difftime function can be used to give us the same result in different units:

difftime(dat1$x1, dat1$x2, units = "mins")
Time differences in mins
[1] -6840000   434880  4776480
difftime(dat1$x1, dat1$x2, units = "days")
Time differences in days
[1] -4750   302  3317
difftime(dat1$x1, dat1$x2, units = "hours")
Time differences in hours
[1] -114000    7248   79608

In the last example, you see where R has “overloaded” the minus sign, so that when we write dat1$x1 - dat1$x2, the R runtime system notices that x1 is a date variable and it sends the work of calculating that difference to the datetime function. My tendency would be to just use the datetime function when I need the difference in dates, but the minus sign seems to be a crowd pleaser.

The R runtime system tries to conceal some details. In case of curiosity, I suggest the user compare the output of

print(dat1$x1)
[1] "1999-01-01" "2000-02-01" "2009-03-03"

which is a beautified date representation of the information, with the internal value of that variable, which is not human readable:

print(unclass(dat1$x1))
[1] 10592 10988 14306

At several points, I have been fooled by R’s print function because it shows what it thinks I want, rather than all of the information. The way to make it show everything is to apply unclass to the object first, before printing. After unclass is applied, the print function shows everything.

Convert text information into year, month, date format.

Note x3 is just a text variable. The function strptime is used to convert the text string into a structure of type POSIXlt. That’s not just a Date variable, it is a much more elaborate time storage type.

dat1$x3.2 <- strptime(dat1$x3, format = "%Y/%m/%d")

Because the time information is missing when we import the data, the variable x3.2 appears to have no time components. Neither the print nor the str functions reveal the inner structure to us, when the inner variables (such as time) have no information.

dat1$x3.2
[1] "1999-01-03 CST" "1998-02-01 CST" "1996-09-01 CDT"
str(dat1$x3.2)
 POSIXlt[1:3], format: "1999-01-03" "1998-02-01" "1996-09-01"

But those additional elements are there if we remove the class designator, which lays bare the internal storage:

unclass(dat1$x3.2)
$sec
[1] 0 0 0

$min
[1] 0 0 0

$hour
[1] 0 0 0

$mday
[1] 3 1 1

$mon
[1] 0 1 8

$year
[1] 99 98 96

$wday
[1] 0 0 0

$yday
[1]   2  31 244

$isdst
[1] 0 0 1

$zone
[1] "CST" "CST" "CDT"

$gmtoff
[1] NA NA NA

Since that column does not make use of the time information, we can convert it to a Date variable. The function as.Date coerces the POSIXlt dat1$x3.2 into a date object. Note because there were no minutes or hours, we don’t lose anything here.

justWondered <- as.Date(dat1$x3.2)
unclass(justWondered)
[1] 10594 10258  9740

The variable x3.2 is a more powerful structure because it includes time as possible information within its structure. That is POSIXlt format, which appears to us as a Date object with enrichment that allows it to store hours:minutes:seconds as well.

To see the difference between a Date variable, such as x1, try to ask the Date variable what hour value it has.

dat1$x1$hour

When I do that, I see

Error in dat1$x1$hour : $ operator is invalid for atomic vectors. 

That happens because x1 has no hour information.

On the other hand, x3.2 has more flavor. Because it is POSIXlt, it has “hour” “min” “sec” and so forth. Unfortunately in this example, they are all 0 because the input time was only dates,

dat1$x3.2$hour
[1] 0 0 0
dat1$x3.2$min
[1] 0 0 0

Import the Excel file again, but don’t detectDates

Lets make sure we understand what’s going on by turning off the automatic date converter. Set detectDates as FALSE.

dat2 <- read.xlsx("date-20160524.xlsx", detectDates = FALSE)
dat2
     x1    x2       x3       x4       x5
1 36161 40911 1999/1/3 36526.58 37347.04
2 36557 36255 1998/2/1 36558.08 36558.12
3 39875 36558 1996/9/1 41122.79 41123.88

Somehow, I was not expecting that. Those are the integers that Excel uses internally to remember dates. On screen, inside the spread sheet program, they look like dates. But down under, they are not.

The columns that were dates in Excel are now just numbers.

str(dat2)
'data.frame':   3 obs. of  5 variables:
 $ x1: num  36161 36557 39875
 $ x2: num  40911 36255 36558
 $ x3: chr  "1999/1/3" "1998/2/1" "1996/9/1"
 $ x4: num  36527 36558 41123
 $ x5: num  37347 36558 41124
summary(dat2)
       x1              x2             x3                  x4       
 Min.   :36161   Min.   :36255   Length:3           Min.   :36527  
 1st Qu.:36359   1st Qu.:36406   Class :character   1st Qu.:36542  
 Median :36557   Median :36558   Mode  :character   Median :36558  
 Mean   :37531   Mean   :37908                      Mean   :38069  
 3rd Qu.:38216   3rd Qu.:38734                      3rd Qu.:38840  
 Max.   :39875   Max.   :40911                      Max.   :41123  
       x5       
 Min.   :36558  
 1st Qu.:36953  
 Median :37347  
 Mean   :38343  
 3rd Qu.:39235  
 Max.   :41124  

We can recover the dates like this:

dat2$x1time <- convertToDate(dat2$x1)
dat2$x2time <- convertToDate(dat2$x2)

That looks like a victory, we can subtract dates, just as we did before

dat2$x1time - dat2$x2time
Time differences in days
[1] -4750   302  3317

I suppose we better ask R what it has

str(dat2)
'data.frame':   3 obs. of  7 variables:
 $ x1    : num  36161 36557 39875
 $ x2    : num  40911 36255 36558
 $ x3    : chr  "1999/1/3" "1998/2/1" "1996/9/1"
 $ x4    : num  36527 36558 41123
 $ x5    : num  37347 36558 41124
 $ x1time: Date, format: "1999-01-01" ...
 $ x2time: Date, format: "2012-01-03" ...

Now lets see what happens with the last two columns. There are 2 functions convertToDate and convertToDateTime. We need to test both.

dat2$x4date <- convertToDate(dat2$x4)
dat2$x4datetime <- convertToDateTime(dat2$x4)

dat2$x5date <- convertToDate(dat2$x5)
dat2$x5datetime <- convertToDateTime(dat2$x5)

Ask R what it has

str(dat2)
'data.frame':   3 obs. of  11 variables:
 $ x1        : num  36161 36557 39875
 $ x2        : num  40911 36255 36558
 $ x3        : chr  "1999/1/3" "1998/2/1" "1996/9/1"
 $ x4        : num  36527 36558 41123
 $ x5        : num  37347 36558 41124
 $ x1time    : Date, format: "1999-01-01" ...
 $ x2time    : Date, format: "2012-01-03" ...
 $ x4date    : Date, format: "2000-01-01" ...
 $ x4datetime: POSIXct, format: "2000-01-01 14:00:02" ...
 $ x5date    : Date, format: "2002-04-01" ...
 $ x5datetime: POSIXct, format: "2002-04-01 00:59:59" ...

The variable x5datetime2 can be converted into a POSIXlt object where we dictate the format as it will appear to us on the screen (see the format argument):

dat2$x5datetime2 <- strptime(dat2$x5datetime, format = "%Y-%m-%d %H:%M:%OS")
dat2$x5datetime2$hour
[1]  0  3 21
dat2$x5datetime2$min
[1] 59  0  0
dat2$x5datetime2$sec
[1] 59  0  0
str(dat2)
'data.frame':   3 obs. of  12 variables:
 $ x1         : num  36161 36557 39875
 $ x2         : num  40911 36255 36558
 $ x3         : chr  "1999/1/3" "1998/2/1" "1996/9/1"
 $ x4         : num  36527 36558 41123
 $ x5         : num  37347 36558 41124
 $ x1time     : Date, format: "1999-01-01" ...
 $ x2time     : Date, format: "2012-01-03" ...
 $ x4date     : Date, format: "2000-01-01" ...
 $ x4datetime : POSIXct, format: "2000-01-01 14:00:02" ...
 $ x5date     : Date, format: "2002-04-01" ...
 $ x5datetime : POSIXct, format: "2002-04-01 00:59:59" ...
 $ x5datetime2: POSIXlt, format: "2002-04-01 00:59:59" ...

A column with the time storage format POSIXlt can be subtracted from a column with storage type POSIXct. I hope. Lets subtract some times, see what happens

dat2$x5datetime2 - dat2$x4datetime
Time differences in mins
[1] 1.181460e+06 5.996667e+01 1.560017e+03
difftime(dat2$x5datetime2, dat2$x4datetime, "seconds")
Time differences in hours
[1] 19684.999167    -5.000556    21.000278

So far, this appears simple to me, but there is a very stern warning in the end of the help page for POSIXlt:

 Great care is needed when comparing objects of class ‘"POSIXlt"’.
 Not only are components and attributes optional; several
 components may have values meaning ‘not yet determined’ and the
 same time represented in different time zones will look quite
 different.
 

I suppose I would not bet my fortune on differences in date-time objects without double-checking some calculations. We have seen problems when the time zone in a computer is set incorrectly, and so date-time objects created there cause weird differences when used on another comuter. One of the assistants in our lab declared in 2015 that we would always use Pacific Standard Time, which sounded great. Except for the fact we are in Kansas.

I hope this cleared up some mysteries and did not create (m)any new ones.

If you are not confused enough, consider more reading

## ?DateAndTimeClasses
## ?POSIXlt
## ?Date
## ?Ops.Date
## ?strptime

There are nice test examples in strptime

R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.1 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods  
[7] base     

other attached packages:
[1] openxlsx_3.0.0 rmarkdown_1.0 

loaded via a namespace (and not attached):
 [1] magrittr_1.5    formatR_1.4     tools_3.3.1     htmltools_0.3.5
 [5] yaml_2.1.13     Rcpp_0.12.6     stringi_1.1.1   knitr_1.13     
 [9] stringr_1.0.0   digest_0.6.9    evaluate_0.9   

Available under Created Commons license 3.0 CC BY