
I would like to loop through the following data.frame and group rows by sequential entries, as determined by the value in X2. In the data below there are four groups: 1-3, 5-6, 9-13, and 16. There could be any combination of group sizes and any number of groups.

                                            X1 X2               X3                       X4
1   1_21/08/2014 22:56CONTENT_ACCESS.preparing  1 21/08/2014 22:56 CONTENT_ACCESS.preparing
2   2_21/08/2014 22:57CONTENT_ACCESS.preparing  2 21/08/2014 22:57 CONTENT_ACCESS.preparing
3   3_21/08/2014 22:58CONTENT_ACCESS.preparing  3 21/08/2014 22:58 CONTENT_ACCESS.preparing
4   5_21/08/2014 23:07CONTENT_ACCESS.preparing  5 21/08/2014 23:07 CONTENT_ACCESS.preparing
5   6_21/08/2014 23:08CONTENT_ACCESS.preparing  6 21/08/2014 23:08 CONTENT_ACCESS.preparing
6   9_21/08/2014 23:29CONTENT_ACCESS.preparing  9 21/08/2014 23:29 CONTENT_ACCESS.preparing
7  10_21/08/2014 23:30CONTENT_ACCESS.preparing 10 21/08/2014 23:30 CONTENT_ACCESS.preparing
8  11_21/08/2014 23:31CONTENT_ACCESS.preparing 11 21/08/2014 23:31 CONTENT_ACCESS.preparing
9  12_21/08/2014 23:33CONTENT_ACCESS.preparing 12 21/08/2014 23:33 CONTENT_ACCESS.preparing
10 13_21/08/2014 23:34CONTENT_ACCESS.preparing 13 21/08/2014 23:34 CONTENT_ACCESS.preparing
11 16_21/08/2014 23:40CONTENT_ACCESS.preparing 16 21/08/2014 23:40 CONTENT_ACCESS.preparing

I would like to capture the timestamps in X3 so that they describe each group's time range (i.e. the first and last timestamp of each group) and produce the output below, where start_ts is the first timestamp and stop_ts is the last in each group:

  student_id session_id         start_ts          stop_ts week micro_process
1          4         16 21/08/2014 22:56 21/08/2014 22:58    4          TASK
2          4         16 21/08/2014 23:07 21/08/2014 23:08    4          TASK
3          4         16 21/08/2014 23:29 21/08/2014 23:34    4          TASK
4          4         16 21/08/2014 23:40 21/08/2014 23:40    4          TASK

I haven't yet attempted the loop and would like to see how to do this without traditional looping. My current code only captures the range across the whole data frame:

  student_id session_id         start_ts          stop_ts week micro_process
1          4         16 21/08/2014 22:58 21/08/2014 23:30    4          TASK

The other variables (student ID etc.) have been dummified in my example and are not strictly relevant but I would like to leave them in for completeness.

Code (which can be run directly):

library(stringr)
options(stringsAsFactors = FALSE) 

eventised_session <- data.frame(student_id=integer(),
                                session_id=integer(), 
                                start_ts=character(),
                                stop_ts=character(),
                                week=integer(),
                                micro_process=character())

string_match.df <- structure(list(X1 = c("1_21/08/2014 22:56CONTENT_ACCESS.preparing", 
                                         "2_21/08/2014 22:57CONTENT_ACCESS.preparing", "3_21/08/2014 22:58CONTENT_ACCESS.preparing", 
                                         "5_21/08/2014 23:07CONTENT_ACCESS.preparing", "6_21/08/2014 23:08CONTENT_ACCESS.preparing", 
                                         "9_21/08/2014 23:29CONTENT_ACCESS.preparing", "10_21/08/2014 23:30CONTENT_ACCESS.preparing", 
                                         "11_21/08/2014 23:31CONTENT_ACCESS.preparing", "12_21/08/2014 23:33CONTENT_ACCESS.preparing", 
                                         "13_21/08/2014 23:34CONTENT_ACCESS.preparing", "16_21/08/2014 23:40CONTENT_ACCESS.preparing"
), X2 = c("1", "2", "3", "5", "6", "9", "10", "11", "12", "13", 
          "16"), X3 = c("21/08/2014 22:56", "21/08/2014 22:57", "21/08/2014 22:58", 
                        "21/08/2014 23:07", "21/08/2014 23:08", "21/08/2014 23:29", "21/08/2014 23:30", 
                        "21/08/2014 23:31", "21/08/2014 23:33", "21/08/2014 23:34", "21/08/2014 23:40"
          ), X4 = c("CONTENT_ACCESS.preparing", "CONTENT_ACCESS.preparing", 
                    "CONTENT_ACCESS.preparing", "CONTENT_ACCESS.preparing", "CONTENT_ACCESS.preparing", 
                    "CONTENT_ACCESS.preparing", "CONTENT_ACCESS.preparing", "CONTENT_ACCESS.preparing", 
                    "CONTENT_ACCESS.preparing", "CONTENT_ACCESS.preparing", "CONTENT_ACCESS.preparing"
          )), .Names = c("X1", "X2", "X3", "X4"), row.names = c(NA, -11L
          ), class = "data.frame")

r_student_id <- 4
r_session_id <- 16
r_week <- 4
r_mic_proc <- "TASK"

string_match.df

# Get the first and last timestamp in matched sequence
r_start_ts <- string_match.df[1, ncol(string_match.df)-1]
r_stop_ts <- string_match.df[nrow(string_match.df), ncol(string_match.df)-1]

eventised_session[nrow(eventised_session)+1,] <- c(r_student_id, r_session_id, r_start_ts, r_stop_ts, r_week, r_mic_proc)

eventised_session

I would appreciate your expertise on this one. I have only ever used traditional loops.

I've posted a solution, but in the future you should more clearly spell out what operation you want to perform, with explicit steps and rules. It makes it easier to answer if we don't have to study the input and output to figure out what happened.

3 Answers


We convert X2 to numeric and subtract off a sequence, so that consecutive values are converted to the same group number. Since you don't provide the desired output and you reference column names that differ from those in your example data, I'm guessing at the end result (based on the other answer):

string_match.df$X2 = as.numeric(string_match.df$X2)
string_match.df$grp = string_match.df$X2 - 1:nrow(string_match.df)
string_match.df
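
To see what the subtraction does, here are the intermediate values for this data:

string_match.df$X2 - seq_len(nrow(string_match.df))
#  [1] 0 0 0 1 1 3 3 3 3 3 5

Rows whose X2 values are consecutive end up with the same grp value, which is what the group_by(grp) below relies on.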

library(dplyr)
string_match.df %>%
  group_by(grp) %>% 
  summarize(start = first(X3), stop = last(X3))
#     grp start            stop            
#   <dbl> <chr>            <chr>           
# 1     0 21/08/2014 22:56 21/08/2014 22:58
# 2     1 21/08/2014 23:07 21/08/2014 23:08
# 3     3 21/08/2014 23:29 21/08/2014 23:34
# 4     5 21/08/2014 23:40 21/08/2014 23:40

As a side note, be careful with the term "matrix". You used the matrix tag and used the word matrix several times in your question, but you don't have a matrix, nor should you be using one. You have a data.frame. In a matrix, all data must be the same type. In a data frame, the columns can have different types. Here you have a numeric column, two string columns, and one datetime column, so a matrix would be a poor choice. A data frame, where each of those columns can be of the appropriate class, is much better.
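
As a quick toy illustration (made-up objects m and d, not the question's data): in a matrix everything is coerced to a single type, while a data frame keeps per-column types.

m <- cbind(id = c(1, 2), ts = c("22:56", "22:57"))
typeof(m)         # "character" -- the numeric ids get coerced to strings
d <- data.frame(id = c(1, 2), ts = c("22:56", "22:57"), stringsAsFactors = FALSE)
sapply(d, class)  # id "numeric", ts "character" -- each column keeps its own type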


1 Comment

This looks good. I will look at it properly in the morning (I'm in the UK). My original question did use a matrix (a matrix inside a list), but I changed it to a data.frame on divibisan's advice. I did remove the references to matrix in the body of my question, but not in the tags or title. Good spot.

I'm using a shorter name for the data, and converting df$X2 to numeric:

df <- string_match.df  # as defined in OP
df$X2 <- as.numeric(df$X2)

You can split your data frame using a combination of cumsum and diff:

cumsum(diff(c(0, df$X2)) > 1)
#  [1] 0 0 0 1 1 2 2 2 2 2 3
# presumes that df$X2[1] is 1, but you can easily make up a general case:
#  cumsum(diff(c(df$X2[1] - 1, df$X2)) > 1)

And now just use split and lapply:

do.call(rbind, lapply(split(df, cumsum(diff(c(0, df$X2)) > 1)),
        function(x) {foo <- x$X3; data.frame(start_ts = foo[1], stop_ts = tail(foo, 1))}))
# output:
          start_ts          stop_ts
0 21/08/2014 22:56 21/08/2014 22:58
1 21/08/2014 23:07 21/08/2014 23:08
2 21/08/2014 23:29 21/08/2014 23:34
3 21/08/2014 23:40 21/08/2014 23:40

The rest is a question of formatting the output as you wish.
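
For example, assuming the result of the do.call() line above has been saved as res (a name introduced here for illustration), the constant columns from the question can be bolted on using the OP's dummy variables:

res <- cbind(student_id = r_student_id, session_id = r_session_id,
             res, week = r_week, micro_process = r_mic_proc)
rownames(res) <- NULL   # drop the group-number row names
res

That yields the six columns in the order shown in the question.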



Your new question can be done pretty easily in the tidyverse. The main thing you have to do is divide your observations into groups based on the timestamp variable. I assumed that the rule would be to start a new group whenever more than 2 minutes have passed since the last observation; you can change that easily if you need to.

Once the observations are grouped, you can simply use summarize to return the results of calculations by group (in this case, the first and last timestamps):

library(dplyr)
library(lubridate)

string_match.df %>%
    select('id' = X2,                              # Select and rename variables
           'timestamp' = X3) %>%
    mutate(timestamp = dmy_hm(timestamp),          # Parse timestamp as date
           time_diff = timestamp - lag(timestamp), # Calculate time from last obs
           new_obs = time_diff > 2 |               # New obs. if >2 min from last one
                     is.na(time_diff),             #   or, if it's the 1st obs.
           group_id = cumsum(new_obs)) %>%         # Count new groups for group ID
    group_by(group_id) %>%                         # Group by 'group_id'
    summarize(start_ts = min(timestamp),           # Then return the first and last
              stop_ts = max(timestamp))            #  timestamps for each group

# A tibble: 4 x 3
  group_id start_ts            stop_ts            
     <int> <dttm>              <dttm>             
1        1 2014-08-21 22:56:00 2014-08-21 22:58:00
2        2 2014-08-21 23:07:00 2014-08-21 23:08:00
3        3 2014-08-21 23:29:00 2014-08-21 23:34:00
4        4 2014-08-21 23:40:00 2014-08-21 23:40:00

Since there was no discussion in your question about how student_id, session_id, week, and micro_process are determined, I left them out from my example. You can easily add them onto the table after, or add new rules to the summarize call if they are determined by parsing data for the group.
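
For instance, if those fields really are constants like the dummy variables in the question, one option is a mutate() after the summarize(), assuming the summarized tibble above has been saved as grouped_ts (a name introduced here for illustration):

grouped_ts %>%
  mutate(student_id = 4, session_id = 16, week = 4, micro_process = "TASK") %>%
  select(student_id, session_id, start_ts, stop_ts, week, micro_process)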

8 Comments

This looks great. I have to leave the house now, sadly, but I will get back on this tomorrow. Many thanks!
OP said they want to "group by sequential entries, as determined by the value in X2", so I think the odd/even grouping is a coincidence and OP wants those groups because rows 2 and 3 are 1 second apart and rows 4 and 5 are 1 second apart. I was working on an answer based on that, but because of OP's comment here I'll wait to see if I'm correct before investing more time.
Hello yes. I just skimmed this yesterday on my phone and didn't have a chance to look at it properly. Sorry if I was not clear, but @Gregor is closer to the mark here. The groupings could be any size, and there can be any number of them. As regards the difference in seconds, that is not strictly relevant, I just want the first and last timestamps of the groups. The groups themselves are indicated by the sequential numbers. The data supplied is mocked up. We could have groupings of 1-10, 12, 15, 16-20, 29-31, 40 for example.
Hello both. I have edited my original question. I have set up a different version of the input data.frame with more groups and divibisan's solution doesn't quite work as planned with this data. Probably due to the lack of clarity in my original request. I feel you guys are close though. Any ideas? Many thanks in advance...
I can answer when I'm done with work this evening if no one has answered by then. Basically, the answer will be to convert your times to a POSIX date-time format, then use something like this to detect the sequences and create a grouping variable.
