Here's a tweak to my previously posted question. Here's my data:

set.seed(3737)
DF2 = data.frame(user_id = c(rep(27, 7), rep(11, 7)),
                 date = as.Date(rep(c('2016-01-01', '2016-01-03', '2016-01-05', '2016-01-07', '2016-01-10', '2016-01-14', '2016-01-16'), 2)),
                 value = round(rnorm(14, 15, 5), 1))

user_id       date value
     27 2016-01-01  15.0
     27 2016-01-03  22.4
     27 2016-01-05  13.3
     27 2016-01-07  21.9
     27 2016-01-10  20.6
     27 2016-01-14  18.6
     27 2016-01-16  16.4
     11 2016-01-01   6.8
     11 2016-01-03  21.3
     11 2016-01-05  19.8
     11 2016-01-07  22.0
     11 2016-01-10  19.4
     11 2016-01-14  17.5
     11 2016-01-16  19.3

This time, I'd like to calculate the cumulative sum of value for each user_id over a specified trailing time window, e.g. the last 7 or 14 days. The desired output would look like this:

user_id       date value v_minus7 v_minus14
     27 2016-01-01  15.0     15.0      15.0
     27 2016-01-03  22.4     37.4      37.4
     27 2016-01-05  13.3     50.7      50.7
     27 2016-01-07  21.9     72.6      72.6
     27 2016-01-10  20.6     78.2      93.2
     27 2016-01-14  18.6     61.1     111.8
     27 2016-01-16  16.4     55.6     113.2
     11 2016-01-01   6.8      6.8       6.8
     11 2016-01-03  21.3     28.1      28.1
     11 2016-01-05  19.8     47.9      47.9
     11 2016-01-07  22.0     69.9      69.9
     11 2016-01-10  19.4     82.5      89.3
     11 2016-01-14  17.5     58.9     106.8
     11 2016-01-16  19.3     56.2     119.3

Ideally, I'd like to use dplyr for this, but other packages would be fine. Thanks a lot for your help!


Logic: first group by user_id, then by date. For each subset of the data, we check which dates lie between the current date and 7/14 days back, using between(), which returns a logical vector.

Based on this logical vector, I sum the value column:

library(data.table)
# for every (user_id, date) group, sum the user's values whose dates fall
# within the trailing 7- and 14-day windows (both bounds inclusive)
setDT(DF2)[, `:=`(v_minus7  = sum(DF2$value[DF2$user_id == user_id][between(DF2$date[DF2$user_id == user_id], date - 7,  date, incbounds = TRUE)]),
                  v_minus14 = sum(DF2$value[DF2$user_id == user_id][between(DF2$date[DF2$user_id == user_id], date - 14, date, incbounds = TRUE)])),
           by = c("user_id", "date")][]
 #   user_id       date value v_minus7 v_minus14
 #1:      27 2016-01-01  15.0     15.0      15.0
 #2:      27 2016-01-03  22.4     37.4      37.4
 #3:      27 2016-01-05  13.3     50.7      50.7
 #4:      27 2016-01-07  21.9     72.6      72.6
 #5:      27 2016-01-10  20.6     78.2      93.2
 #6:      27 2016-01-14  18.6     61.1     111.8
 #7:      27 2016-01-16  16.4     55.6     113.2
 #8:      11 2016-01-01   6.8      6.8       6.8
 #9:      11 2016-01-03  21.3     28.1      28.1
#10:      11 2016-01-05  19.8     47.9      47.9
#11:      11 2016-01-07  22.0     69.9      69.9
#12:      11 2016-01-10  19.4     82.5      89.3
#13:      11 2016-01-14  17.5     58.9     106.8
#14:      11 2016-01-16  19.3     56.2     119.3
thank you, @joel.wilson, this is definitely the most crisp and straightforward solution, so I accept it, cheers! – Kasia Kulma

glad to help @KasiaKulma – joel.wilson

I'll need your input, @joel.wilson: I tried to run your code on my original data, which consists of over 600k rows. It's taking forever (> 30 mins), so I always interrupt it before it finishes. Any ideas on how I could speed things up? – Kasia Kulma

@KasiaKulma hey! I'm not sure how to increase the speed; I will look into this code in some time. Right now I don't have access to RStudio. – joel.wilson
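
A note on the speed issue raised in the comments: the j expression above re-scans the full table (DF2$value[DF2$user_id == user_id]) for every (user_id, date) group, which makes the run time roughly quadratic in the number of rows. One possible remedy, offered here as a sketch of my own rather than part of the thread, and untested at the 600k scale, is a non-equi self-join (data.table 1.9.8+), which lets data.table handle the window lookups:

library(data.table)
setDT(DF2)

# one lookup window per row; d_hi is the row's own date
win7  <- DF2[, .(user_id, d_lo = date - 7,  d_hi = date)]
win14 <- DF2[, .(user_id, d_lo = date - 14, d_hi = date)]

# non-equi self-join: for each window, sum that user's values inside it;
# by = .EACHI returns one sum per window, in the order of the windows
DF2[, v_minus7  := DF2[win7,  on = .(user_id, date >= d_lo, date <= d_hi),
                       sum(value), by = .EACHI]$V1]
DF2[, v_minus14 := DF2[win14, on = .(user_id, date >= d_lo, date <= d_hi),
                       sum(value), by = .EACHI]$V1]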

You can use rollapply from zoo once you fill in the missing dates. On the completed daily grid, the last 7 days correspond to a window of 8 rows (the current day plus the 7 preceding days), and the last 14 days to 15 rows:

library(dplyr)
library(zoo)

set.seed(3737)
DF2 = data.frame(user_id = c(rep(27, 7), rep(11, 7)),
                 date = as.Date(rep(c('2016-01-01', '2016-01-03', '2016-01-05', '2016-01-07', '2016-01-10', '2016-01-14', '2016-01-16'), 2)),
                 value = round(rnorm(14, 15, 5), 1))

# complete the daily grid so that row offsets equal day offsets
all_combinations <- expand.grid(user_id = unique(DF2$user_id),
                                date = seq(min(DF2$date), max(DF2$date), by = "day"))

res <- DF2 %>%
    merge(all_combinations, by = c('user_id', 'date'), all = TRUE) %>%
    group_by(user_id) %>%
    arrange(date) %>%
    mutate(v_minus7  = rollapply(value, width = 8,  FUN = function(x) sum(x, na.rm = TRUE), partial = TRUE, align = 'right'),
           v_minus14 = rollapply(value, width = 15, FUN = function(x) sum(x, na.rm = TRUE), partial = TRUE, align = 'right')) %>%
    filter(!is.na(value))   # drop the filler rows again

Here is another idea, using findInterval to minimize comparisons and operations. First, define a function that handles the basic computation, ignoring the grouping. The following function computes the cumulative sum and subtracts from the cumulative sum at each position the cumulative sum at its respective past date:

ff = function(date, value, minus)
{
    cs = cumsum(value)
    # index of the last date strictly before (date - minus), 0 if none
    i = findInterval(date - minus, date, left.open = TRUE)
    w = which(as.logical(i))
    # replace each non-zero index by the cumulative sum at that position
    i[w] = cs[i[w]]
    cs - i
}
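
To see what findInterval contributes, here is a small standalone check of my own, using user 27's dates from the example. With left.open = TRUE it returns, for each query point date - 7, the index of the last observation strictly before the window start (or 0 if there is none), so subtracting the cumulative sum at that index leaves exactly the inclusive window [date - 7, date]:

d <- as.Date(c('2016-01-01', '2016-01-03', '2016-01-05', '2016-01-07', '2016-01-10'))
findInterval(d - 7, d, left.open = TRUE)
# [1] 0 0 0 0 1
# only 2016-01-10 has an observation strictly before its window start of
# 2016-01-03, namely 2016-01-01 at index 1, so its trailing 7-day sum is
# cs[5] - cs[1], i.e. the values from 2016-01-03 through 2016-01-10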

And apply it by group:

do.call(rbind, 
        lapply(split(DF2, DF2$user_id), 
               function(x) data.frame(x, 
                         minus7 = ff(x$date, x$value, 7), 
                         minus14 = ff(x$date, x$value, 14))))
#      user_id       date value minus7 minus14
#11.8       11 2016-01-01   6.8    6.8     6.8
#11.9       11 2016-01-03  21.3   28.1    28.1
#11.10      11 2016-01-05  19.8   47.9    47.9
#11.11      11 2016-01-07  22.0   69.9    69.9
#11.12      11 2016-01-10  19.4   82.5    89.3
#11.13      11 2016-01-14  17.5   58.9   106.8
#11.14      11 2016-01-16  19.3   56.2   119.3
#27.1       27 2016-01-01  15.0   15.0    15.0
#27.2       27 2016-01-03  22.4   37.4    37.4
#27.3       27 2016-01-05  13.3   50.7    50.7
#27.4       27 2016-01-07  21.9   72.6    72.6
#27.5       27 2016-01-10  20.6   78.2    93.2
#27.6       27 2016-01-14  18.6   61.1   111.8
#27.7       27 2016-01-16  16.4   55.6   113.2

The above apply-by-group operation can, of course, be replaced by any preferable method.
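
For example, since the question asked for dplyr, the same ff() can be applied per group there. This is a sketch of mine rather than part of the original answer; it assumes dates are sorted ascending within each user, which the arrange() call ensures:

library(dplyr)

DF2 %>%
    arrange(user_id, date) %>%
    group_by(user_id) %>%
    mutate(v_minus7  = ff(date, value, 7),
           v_minus14 = ff(date, value, 14)) %>%
    ungroup()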

I had this written first with some other changes, but wasn't able to complete it. Great!! +1 – joel.wilson

Here is a zoo solution.

Define a function sum_last that, given a zoo object, computes the sum of the values whose times fall within k days of the last day in the series, and define a roll function that applies sum_last over an entire series. Then use ave to apply roll to each user_id, once with k = 7 and once with k = 14.

Note that this makes use of the coredata argument of rollapply, which was introduced in the most recent version of zoo, so be sure you don't have an earlier version.

library(zoo)

# compute sum of values within k time units of last time point
sum_last <- function(z, k) {
  tt <- time(z)
  sum(z[tt > tail(tt, 1) - k])
}

# given row indexes ix, run rollapplyr on read.zoo(DF2[ix, -1]);
# the trailing k = k is passed through ... to sum_last
roll <- function(ix, k) {
  rollapplyr(read.zoo(DF2[ix, -1]), k, sum_last, coredata = FALSE, partial = TRUE, k = k)
}

nr <- nrow(DF2)
transform(DF2, 
  v_minus7 = ave(1:nr, user_id, FUN = function(x) roll(x, 7)),
  v_minus14 = ave(1:nr, user_id, FUN = function(x) roll(x, 14)))
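
Since the coredata argument is the recently added part, it may help to check the installed zoo version before running; packageVersion() is base R, and the exact minimum zoo version is not stated in the answer:

packageVersion("zoo")  # rollapply(..., coredata =) requires a recent zoo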
