Abstract
Most users think of dates as character strings, for example “Feb. 20, 1999”. R is happy enough to treat them as character strings, but we might want more powerful calculations. If we want, for example, the difference (in minutes, hours, and seconds) between two time points, we can get it if we let R know that’s what we want. A column of character string information can be converted into various kinds of date and time variables, which allow subtraction, addition, and so forth. This note explores some examples in conjunction with data stored in an MS Excel file, which is also available in this directory. R includes functions that can convert information about dates and times into useful variables, so we can find, for example, the number of hours that elapsed between the time of my admission into prison and my eventual release. It is possible to add dates together, or subtract them, and the calculation is meaningful (that is, consistent with the calendar).
As I understand it, the R programmers have done all the work of taking leap years into account, so we could ask for the number of days between the inaugurations of Richard Nixon and Barack Obama and get a correct answer.
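The inauguration dates are well known (January 20, 1969 and January 20, 2009), so we can sketch that calculation directly; the Date class takes care of the ten intervening leap days:

```r
# Days between Nixon's first inauguration and Obama's first inauguration.
# The Date class accounts for leap years automatically.
nixon <- as.Date("1969-01-20")
obama <- as.Date("2009-01-20")
obama - nixon
## Time difference of 14610 days
```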
There are different mechanisms in place for managing time information. The major difference between these approaches is that we might have dates by themselves, or we might also have times (hours, minutes, seconds) with the dates. A date with time involves a much more elaborate data structure, of course, but with a much richer end result. We could, for example, calculate the number of minutes that passed between the glorious Denver Bronco victory in Super Bowl XXXIII and the glorious Denver Bronco victory in Super Bowl 50.
I’ve been aware of the possibilities for a long time, but was not interested in dates and times. I suppose becoming old makes one more aware of such things. I was never required to calculate the difference in days or minutes between two events. That changed when we had a lot of date-time data about the activities of elderly people who were wearing accelerometer devices, and then data about the times of admission into foster care and release from it for children in one state.
I was prompted to prepare this note because a student at the summeR statistical institute asked for help with an Excel file. I made a tiny imitation for the example.
The example file is named date-20160524.xlsx. Hopefully, a copy of that is available in the same folder as this file, but if that fails, there is a copy at
http://pj.freefaculty.org/scraps/date-20160524.xlsx
That’s saved from a spreadsheet program. Over the years, we have had more-or-less success in importing these directly into R. Sometimes less. Lately, we have had good luck with the openxlsx package for R, which was created by Alexander Walker. For us, that package appears to be something of a breakthrough: a truly open source, free-to-use-and-redistribute routine for opening and saving XLSX files. Without installing new software (such as Perl) or relying on a commercial product, we can import most of the XLSX files that people email to us.
The package is on the CRAN server; here’s the BibTeX for the citation, just for fun.
@Manual{,
title = {openxlsx: Read, Write and Edit XLSX Files},
author = {Alexander Walker},
year = {2015},
note = {R package version 3.0.0},
url = {https://CRAN.R-project.org/package=openxlsx},
}
If the data includes only dates, with no times attached, then an R “Date” object is called for. This is the simplest type of data and it is the safest to use. (Date information does not depend on the time zone. That’s why it is closer to ‘worry free’.)
R’s Sys.Date function will return today’s date, so we have something to work with:
z <- Sys.Date()
z
[1] "2016-08-30"
The thing z prints as a formatted structure because the object has a class attribute, which causes the print function to render it in that way. It is of class “Date”, as we see here:
class(z)
[1] "Date"
What will the date be 6 days from now?
z + 6
[1] "2016-09-05"
What will the date be 112 days from now?
z + 112
[1] "2016-12-20"
Create a new variable holding the date 200 days in the future
z2 <- z + 200
The difference between z2 and z should be 200 days. Let’s cross our fingers and check.
z2 - z
Time difference of 200 days
If the data includes only dates, no times, then it is not actually meaningful to look for a difference in minutes. But you can if you want:
difftime(z2, z, units = "mins")
Time difference of 288000 mins
difftime(z2, z, units = "secs")
Time difference of 17280000 secs
See ?difftime
I made a test example Excel sheet that has a couple of date columns, then one string column that looks like a date, then two that have the Excel date-time in them.
The openxlsx import can try to convert dates for us, and I show that as the first example here, but I did not truly understand what was going on until I ran the second example, where I did not let openxlsx convert date/time columns automatically.
As you test this, remember to run str() frequently to see what the data has.
Now, a word about data input in Excel. The terminology is not exactly the same as within R. When setting Format Cells in Excel, I notice that there is one option for “date” and one for “time”. It appears to me that the time format is just for hours:min:sec, while Excel’s date formats might be only days (04/01/2016) but can also be dates with times added in (04/01/2016 2:00 PM) or (04/01/2016 14:00). As a result, I’m not entirely sure what will happen when that’s imported into R. I show a few variations below to give the idea of the differences.
After testing, I believe that the read.xlsx import process only spots date columns; it is not, at the moment, able to differentiate date columns from date-with-time columns coming from Excel. Perhaps I’m not declaring them properly in Excel.
However, if Excel has some non-empty times associated with the cells, in R we could use a more-complicated-than-Date storage class.
The two options are R classes named POSIXct and POSIXlt. Both store an instant in time and both can be used to calculate the time between two events. The main difference between them is the internal storage format: POSIXct is not human-readable because of how it is stored (a count of seconds), while a POSIXlt object has the same instant in time saved as a more common-looking combination of year, month, day, and so forth. If we have POSIXct data, we can extract the same year, month, and day information, so there is no real practical difference. The only challenges we face are storage and conversion between the two, and an incredibly difficult proof-reading challenge of checking whether our calculations are correct. If a column is saved with the wrong time zone, and then we subtract from a column that has the right time zone, well, the answer will be wrong.
Here’s some R help output on those two types of storage.
## POSIXct: seconds since 1970 (UTC time zone) as number.
## POSIXlt: a named list of vectors,
## 'sec'
## 'min'
## 'hour'
## 'mday' (day of the month 1-31)
## 'mon' (month of the year 0-11)
## 'year'
## 'wday' 0-6, day Sunday - Saturday
## 'yday' day of the year
## 'isdst' daylight saving time flag
##
## ‘"POSIXct"’ is more convenient for including in data frames, and
## ‘"POSIXlt"’ is closer to human-readable forms.
If we wrestle the data correctly into a POSIXct column, we are safer because it uses Coordinated Universal Time (UTC) and we don’t have to worry about subtracting 3 PM Central Time from 8 PM Pacific Time. This clarity should lead us to wish we had POSIXct storage. We’ll never get the difference in minutes or seconds wrong. On the other hand, after we do calculations in UTC, we need to convert to some local time zone in order to understand what we have.
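A small sketch of that safety property, using invented clock times (and assuming the US/Pacific and US/Central zone names are available on the system): because both values are stored internally as seconds in UTC, the subtraction comes out right even though the two columns were entered in different time zones.

```r
# 8 PM Pacific minus 3 PM Central on the same (invented) day.
# Internally both are stored as seconds in UTC, so the difference is correct.
pacific <- as.POSIXct("2016-05-24 20:00:00", tz = "US/Pacific")
central <- as.POSIXct("2016-05-24 15:00:00", tz = "US/Central")
difftime(pacific, central, units = "hours")
## Time difference of 7 hours
```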
One final detail. In the Excel sheet, it is not truly necessary to use a specialized time format. Many Excel users will simply use a text column. A date might be saved as a character string, using some understandable format like “2016-05-22” or “05/22/2016”. There is a way to convert that to an R Date object. We will explore strptime below.
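For instance, as.Date can also do this conversion directly when we supply a format string matching the text layout:

```r
# Converting character strings to Date with explicit format codes
d1 <- as.Date("2016-05-22", format = "%Y-%m-%d")
d2 <- as.Date("05/22/2016", format = "%m/%d/%Y")
identical(d1, d2)
## [1] TRUE
```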
See what you have! Open “date-20160524.xlsx”. There are columns named x1, x2, x3, x4, and x5. Right-click on the cells to see what the formats are.
The only simple one is x3, pure and simple text. x1 and x2 are date objects, but I chose not to ask Excel for times as well. On the other hand, the x4 and x5 columns have both date and time information stored in them.
I’m guessing that columns 1 and 2 will be imported as R Date class objects, while 4 and 5 will either lose the time element (my fear!) or they will become POSIX date-time variables.
To load an XLSX file, first we use the library function to attach the openxlsx package.
library(openxlsx)
The function read.xlsx does the work. Note the detectDates argument is set as TRUE, meaning we are hoping that read.xlsx will notice which columns are marked as date information in Excel and do the right thing.
dat1 <- read.xlsx("date-20160524.xlsx", detectDates = TRUE)
It looks about right for columns 1 through 3.
dat1
x1 x2 x3 x4 x5
1 1999-01-01 2012-01-03 1999/1/3 2000-01-01 2002-04-01
2 2000-02-01 1999-04-05 1998/2/1 2000-02-02 2000-02-02
3 2009-03-03 2000-02-02 1996/9/1 2012-08-01 2012-08-02
str(dat1)
'data.frame': 3 obs. of 5 variables:
$ x1: Date, format: "1999-01-01" ...
$ x2: Date, format: "2012-01-03" ...
$ x3: chr "1999/1/3" "1998/2/1" "1996/9/1"
$ x4: Date, format: "2000-01-01" ...
$ x5: Date, format: "2002-04-01" ...
summary(dat1)
x1 x2 x3
Min. :1999-01-01 Min. :1999-04-05 Length:3
1st Qu.:1999-07-18 1st Qu.:1999-09-03 Class :character
Median :2000-02-01 Median :2000-02-02 Mode :character
Mean :2002-10-02 Mean :2003-10-14
3rd Qu.:2004-08-17 3rd Qu.:2006-01-17
Max. :2009-03-03 Max. :2012-01-03
x4 x5
Min. :2000-01-01 Min. :2000-02-02
1st Qu.:2000-01-17 1st Qu.:2001-03-02
Median :2000-02-02 Median :2002-04-01
Mean :2004-03-22 Mean :2004-12-21
3rd Qu.:2006-05-03 3rd Qu.:2007-06-02
Max. :2012-08-01 Max. :2012-08-02
The Excel sheet has columns 1 and 2 saved as date information. We asked read.xlsx to notice date variables and try to convert them into R dates automatically.
I’m a little disappointed that columns x4 and x5 don’t have time elements.
Let’s see the benefit by calculating the difference in days between x1 and x2. We can treat the date columns as if they were numeric columns by subtracting them.
dat1$x1 - dat1$x2
Time differences in days
[1] -4750 302 3317
The difftime function can be used to give us the same result in different units:
difftime(dat1$x1, dat1$x2, units = "mins")
Time differences in mins
[1] -6840000 434880 4776480
difftime(dat1$x1, dat1$x2, units = "days")
Time differences in days
[1] -4750 302 3317
difftime(dat1$x1, dat1$x2, units = "hours")
Time differences in hours
[1] -114000 7248 79608
In the last example, you see that R has “overloaded” the minus sign: when we write dat1$x1 - dat1$x2, the R runtime system notices that x1 is a date variable and sends the work of calculating that difference to the difftime function. My tendency would be to just use the difftime function when I need the difference in dates, but the minus sign seems to be a crowd pleaser.
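A minimal sketch of that dispatch, with invented dates, showing that the minus sign and an explicit difftime call agree:

```r
# The overloaded minus sign and difftime give the same answer
d1 <- as.Date("2000-01-01")
d2 <- as.Date("1999-12-25")
d1 - d2                            # Time difference of 7 days
difftime(d1, d2, units = "days")   # the same result, requested explicitly
```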
The R runtime system tries to conceal some details. For the curious, I suggest comparing the output of
print(dat1$x1)
[1] "1999-01-01" "2000-02-01" "2009-03-03"
which is a beautified date representation of the information, with the internal value of that variable, which is not human readable:
print(unclass(dat1$x1))
[1] 10592 10988 14306
At several points, I have been fooled by R’s print function because it shows what it thinks I want, rather than all of the information. The way to make it show everything is to apply unclass to the object first; after unclass is applied, the print function shows everything.
Note that x3 is just a text variable. The function strptime is used to convert the text string into a structure of type POSIXlt. That’s not just a Date variable; it is a much more elaborate time storage type.
dat1$x3.2 <- strptime(dat1$x3, format = "%Y/%m/%d")
Because the time information is missing when we import the data, the variable x3.2 appears to have no time components. Neither the print nor the str function reveals the inner structure to us when the inner variables (such as time) hold no information.
dat1$x3.2
[1] "1999-01-03 CST" "1998-02-01 CST" "1996-09-01 CDT"
str(dat1$x3.2)
POSIXlt[1:3], format: "1999-01-03" "1998-02-01" "1996-09-01"
But those additional elements are there if we remove the class designator, which lays bare the internal storage:
unclass(dat1$x3.2)
$sec
[1] 0 0 0
$min
[1] 0 0 0
$hour
[1] 0 0 0
$mday
[1] 3 1 1
$mon
[1] 0 1 8
$year
[1] 99 98 96
$wday
[1] 0 0 0
$yday
[1] 2 31 244
$isdst
[1] 0 0 1
$zone
[1] "CST" "CST" "CDT"
$gmtoff
[1] NA NA NA
Since that column does not make use of the time information, we can convert it to a Date variable. The function as.Date coerces the POSIXlt dat1$x3.2 into a Date object. Because there were no minutes or hours, we don’t lose anything here.
justWondered <- as.Date(dat1$x3.2)
unclass(justWondered)
[1] 10594 10258 9740
The variable x3.2 is a more powerful structure because it includes time as possible information within its structure. That is the POSIXlt format, which appears to us as a Date object with enrichment that allows it to store hours:minutes:seconds as well.
To see the difference from a Date variable, such as x1, try asking the Date variable what hour value it has.
dat1$x1$hour
When I do that, I see
Error in dat1$x1$hour : $ operator is invalid for atomic vectors.
That happens because x1
has no hour information.
On the other hand, x3.2 has more flavor. Because it is POSIXlt, it has “hour”, “min”, “sec”, and so forth. Unfortunately in this example, they are all 0 because the input was only dates.
dat1$x3.2$hour
[1] 0 0 0
dat1$x3.2$min
[1] 0 0 0
Let’s make sure we understand what’s going on by turning off the automatic date converter. Set detectDates to FALSE.
dat2 <- read.xlsx("date-20160524.xlsx", detectDates = FALSE)
dat2
x1 x2 x3 x4 x5
1 36161 40911 1999/1/3 36526.58 37347.04
2 36557 36255 1998/2/1 36558.08 36558.12
3 39875 36558 1996/9/1 41122.79 41123.88
Somehow, I was not expecting that. Those are the integers that Excel uses internally to remember dates. On screen, inside the spreadsheet program, they look like dates. But down under, they are not.
The columns that were dates in Excel are now just numbers.
str(dat2)
'data.frame': 3 obs. of 5 variables:
$ x1: num 36161 36557 39875
$ x2: num 40911 36255 36558
$ x3: chr "1999/1/3" "1998/2/1" "1996/9/1"
$ x4: num 36527 36558 41123
$ x5: num 37347 36558 41124
summary(dat2)
x1 x2 x3 x4
Min. :36161 Min. :36255 Length:3 Min. :36527
1st Qu.:36359 1st Qu.:36406 Class :character 1st Qu.:36542
Median :36557 Median :36558 Mode :character Median :36558
Mean :37531 Mean :37908 Mean :38069
3rd Qu.:38216 3rd Qu.:38734 3rd Qu.:38840
Max. :39875 Max. :40911 Max. :41123
x5
Min. :36558
1st Qu.:36953
Median :37347
Mean :38343
3rd Qu.:39235
Max. :41124
We can recover the dates like this, using the convertToDate function from openxlsx:
dat2$x1time <- convertToDate(dat2$x1)
dat2$x2time <- convertToDate(dat2$x2)
That looks like a victory: we can subtract dates, just as we did before.
dat2$x1time - dat2$x2time
Time differences in days
[1] -4750 302 3317
I suppose we had better ask R what it has.
str(dat2)
'data.frame': 3 obs. of 7 variables:
$ x1 : num 36161 36557 39875
$ x2 : num 40911 36255 36558
$ x3 : chr "1999/1/3" "1998/2/1" "1996/9/1"
$ x4 : num 36527 36558 41123
$ x5 : num 37347 36558 41124
$ x1time: Date, format: "1999-01-01" ...
$ x2time: Date, format: "2012-01-03" ...
Now let’s see what happens with the last two columns. There are two functions, convertToDate and convertToDateTime. We need to test both.
dat2$x4date <- convertToDate(dat2$x4)
dat2$x4datetime <- convertToDateTime(dat2$x4)
dat2$x5date <- convertToDate(dat2$x5)
dat2$x5datetime <- convertToDateTime(dat2$x5)
Ask R what it has
str(dat2)
'data.frame': 3 obs. of 11 variables:
$ x1 : num 36161 36557 39875
$ x2 : num 40911 36255 36558
$ x3 : chr "1999/1/3" "1998/2/1" "1996/9/1"
$ x4 : num 36527 36558 41123
$ x5 : num 37347 36558 41124
$ x1time : Date, format: "1999-01-01" ...
$ x2time : Date, format: "2012-01-03" ...
$ x4date : Date, format: "2000-01-01" ...
$ x4datetime: POSIXct, format: "2000-01-01 14:00:02" ...
$ x5date : Date, format: "2002-04-01" ...
$ x5datetime: POSIXct, format: "2002-04-01 00:59:59" ...
The variable x5datetime can be converted into a POSIXlt object, saved here as x5datetime2, where we dictate the format as it will appear to us on the screen (see the format argument):
dat2$x5datetime2 <- strptime(dat2$x5datetime, format = "%Y-%m-%d %H:%M:%OS")
dat2$x5datetime2$hour
[1] 0 3 21
dat2$x5datetime2$min
[1] 59 0 0
dat2$x5datetime2$sec
[1] 59 0 0
str(dat2)
'data.frame': 3 obs. of 12 variables:
$ x1 : num 36161 36557 39875
$ x2 : num 40911 36255 36558
$ x3 : chr "1999/1/3" "1998/2/1" "1996/9/1"
$ x4 : num 36527 36558 41123
$ x5 : num 37347 36558 41124
$ x1time : Date, format: "1999-01-01" ...
$ x2time : Date, format: "2012-01-03" ...
$ x4date : Date, format: "2000-01-01" ...
$ x4datetime : POSIXct, format: "2000-01-01 14:00:02" ...
$ x5date : Date, format: "2002-04-01" ...
$ x5datetime : POSIXct, format: "2002-04-01 00:59:59" ...
$ x5datetime2: POSIXlt, format: "2002-04-01 00:59:59" ...
A column with the time storage format POSIXlt can be subtracted from a column with storage type POSIXct. I hope. Let’s subtract some times and see what happens.
dat2$x5datetime2 - dat2$x4datetime
Time differences in mins
[1] 1.181460e+06 5.996667e+01 1.560017e+03
difftime(dat2$x5datetime2, dat2$x4datetime, "seconds")
Time differences in hours
[1] 19684.999167 -5.000556 21.000278
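Notice that the second answer came back in hours even though we asked for “seconds”. That happened because difftime’s third positional argument is tz, not units, so the string “seconds” was quietly swallowed and the units were chosen automatically. A small sketch with invented time stamps shows the named-argument form (and note the unit is spelled “secs”):

```r
# difftime(time1, time2, tz, units): units must be passed by name
t1 <- as.POSIXct("2016-05-24 12:00:05", tz = "UTC")
t2 <- as.POSIXct("2016-05-24 12:00:00", tz = "UTC")
difftime(t1, t2, units = "secs")
## Time difference of 5 secs
```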
So far, this appears simple to me, but there is a very stern warning at the end of the help page for POSIXlt:
Great care is needed when comparing objects of class ‘"POSIXlt"’.
Not only are components and attributes optional; several
components may have values meaning ‘not yet determined’ and the
same time represented in different time zones will look quite
different.
I suppose I would not bet my fortune on differences in date-time objects without double-checking some calculations. We have seen problems when the time zone on a computer is set incorrectly, so that date-time objects created there cause weird differences when used on another computer. One of the assistants in our lab declared in 2015 that we would always use Pacific Standard Time, which sounded great. Except for the fact that we are in Kansas.
I hope this cleared up some mysteries and did not create (m)any new ones.
If you are not confused enough, consider more reading
## ?DateAndTimeClasses
## ?POSIXlt
## ?Date
## ?Ops.Date
## ?strptime
There are nice test examples in ?strptime.
R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.1 LTS
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
other attached packages:
[1] openxlsx_3.0.0 rmarkdown_1.0
loaded via a namespace (and not attached):
[1] magrittr_1.5 formatR_1.4 tools_3.3.1 htmltools_0.3.5
[5] yaml_2.1.13 Rcpp_0.12.6 stringi_1.1.1 knitr_1.13
[9] stringr_1.0.0 digest_0.6.9 evaluate_0.9
Available under Creative Commons license 3.0