Subset specific row and last row from data frame

Question

I have a data frame that records scores across different events. One game can have several scoring events. I would like to subset the rows where the score goes above 5 or below -5, and also keep the last row for each ID. So each ID would have one or more rows, depending on whether its score ever goes above 5 or below -5. My actual data set contains many other columns, but once I learn how to do this I'll be able to apply it to anything else I may want to do.

Here is a data set:

ID Score Time
1    0    0
1    3    5
1    -2   9
1    -4   17
1    -7   31
1    -1   43
2    0    0
2    -3   15
2    0    19
2    4    25
2    6    29
2    9    33
2    3    37
3    0    0
3    5    3
3    2    11

So for this data set, I would hopefully get this output:

ID Score Time
1   -7    31    
1   -1    43
2    6    29 
2    9    33
2    3    37
3    2    11

So, at the very least, each ID will have one line printed with its last score, regardless of whether the score goes above 5 or below -5 during the event (this is the case for ID 3).

My attempt can subset when the value goes above 5 or below -5, I just don't know how to write code to get the last line for each ID:

Data[Data$Score > 5 | Data$Score < -5, ]

Let me know if you need any more information.

What do you want to happen to a row that satisfies both conditions? Should it appear once or twice?
– blakeoft, Commented Jan 26, 2017 at 21:12
Preferably just once. If it appears twice it isn't an issue; I'm sure there is a way to delete duplicate rows.
– useR, Commented Jan 26, 2017 at 21:16

blakeoft · Accepted Answer · 2017-01-27 15:31:10Z

3

You can use rle to grab the last row for each ID. Check out ?rle for more information about this useful function.

Data2 <- Data[cumsum(rle(Data$ID)$lengths), ]
Data2
#   ID Score Time
#6   1    -1   43
#13  2     3   37
#16  3     2   11

To combine the two conditions, use rbind.

Data2 <- rbind(Data[Data$Score > 5 | Data$Score < -5, ], Data[cumsum(rle(Data$ID)$lengths), ])

To get rid of rows that satisfy both conditions, you can use duplicated and rownames.

Data2 <- Data2[!duplicated(rownames(Data2)), ]

You can also sort if desired, of course.
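The steps above can also be sketched as a single index-based subset, along the lines suggested in the comments: collect the row positions first, drop repeats, then subset once. This is only a sketch. The names `extreme_idx`, `last_idx`, and `keep_idx` are made up for illustration, and the `rle` step assumes that all rows for one ID sit next to each other.

```r
# Example data from the question.
Data <- data.frame(
  ID    = rep(1:3, times = c(6, 7, 3)),
  Score = c(0, 3, -2, -4, -7, -1, 0, -3, 0, 4, 6, 9, 3, 0, 5, 2),
  Time  = c(0, 5, 9, 17, 31, 43, 0, 15, 19, 25, 29, 33, 37, 0, 3, 11)
)

# Row positions where the score is above 5 or below -5.
extreme_idx <- which(Data$Score > 5 | Data$Score < -5)

# Row position of the last row in each run of identical IDs.
# Assumption: rows belonging to one ID are contiguous.
last_idx <- cumsum(rle(Data$ID)$lengths)

# Merge both sets, drop repeated positions, and restore the original order.
keep_idx <- sort(unique(c(extreme_idx, last_idx)))
result <- Data[keep_idx, ]
result
```

Because repeats are removed from the positions before subsetting, a row that meets both conditions is selected only once, and sorting the positions keeps the rows in their original order.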

edited Jan 27, 2017 at 15:31

answered Jan 26, 2017 at 21:22

blakeoft


5 Comments

Rich Scriven Over a year ago

In your rbind code, you could simplify the whole line down to df[with(df, c(which(Score > 5 | Score < -5), cumsum(rle(ID)$lengths))), ]

useR Over a year ago

Whenever I run this code, I get the error

1: In Ops.factor(Score, 5) : ‘>’ not meaningful for factors
2: In Ops.factor(Score, -5) : ‘<’ not meaningful for factors

even though I previously transformed Score to numeric:

DF <- transform(Table, Score = as.numeric(as.character(Score)))
class(DF$Score)
[1] "numeric"

blakeoft Over a year ago

@useR How did you read in your data? Make sure that you read it in so that the Score field is not a factor as you need to do numeric comparisons with it.

useR Over a year ago

I left my computer for the weekend, came back, and it's now working! Thanks very much. The only issue I faced was that the output wasn't grouped together: it printed all of the rows where the score was above 5 or below -5, and then the final row for each ID. I solved this by sorting the data frame by ID:

Data3 <- Data2[order(Data2$ID), ]

blakeoft Over a year ago

@useR If you want the rows in the same exact order as the original dataframe, then you can sort by the rownames, I believe.
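The factor message discussed in the comments above comes from comparing a factor column with a number. Here is a small sketch of the conversion involved; `score_factor` is a hypothetical stand-in for a Score column that was read in as a factor, not the asker's actual data.

```r
# Hypothetical stand-in for a Score column that was read in as a factor.
score_factor <- factor(c("0", "3", "-2", "-7"))

# as.numeric() on a factor returns the internal level codes, not the values.
codes <- as.numeric(score_factor)

# Going through character first recovers the numbers that were read in.
values <- as.numeric(as.character(score_factor))

# The comparison is now meaningful.
outside <- values > 5 | values < -5
outside
```

It is worth checking with `str()` that the object passed to the subsetting code is the converted one. When reading files with `read.csv` or `read.table`, passing `stringsAsFactors = FALSE` avoids creating factors in the first place.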

Rich Scriven · Accepted Answer · 2017-01-27 17:58:41Z

3

Here's a go at it in data.table, where df is your original data frame.

library(data.table)
setDT(df)

df[df[, c(.I[!between(Score, -5, 5)], .I[.N]), by = ID]$V1]
#    ID Score Time
# 1:  1    -7   31
# 2:  1    -1   43
# 3:  2     6   29
# 4:  2     9   33
# 5:  2     3   37
# 6:  3     2   11

We are grouping by ID. The between function finds the values between -5 and 5, and we negate that to get our desired values outside that range. We then use a .I subset to get the indices per group for those. Then .I[.N] gives us the row number of the last entry, per group. We use the V1 column of that result as our row subset for the entire table. You can take unique values if unique rows are desired.

Note: .I[c(which(!between(Score, -5, 5)), .N)] could also be used in the j entry of the first operation. Not sure if it's more or less efficient.
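To make the intermediate step visible, here is a sketch that stores the per-group indices before using them. It assumes the data.table package is installed; `DT`, `idx_by_group`, and `idx` are illustrative names, and `unique()` is applied in case a last row is also outside the range.

```r
library(data.table)

# Example data from the question, built directly as a data.table.
DT <- data.table(
  ID    = rep(1:3, times = c(6, 7, 3)),
  Score = c(0, 3, -2, -4, -7, -1, 0, -3, 0, 4, 6, 9, 3, 0, 5, 2),
  Time  = c(0, 5, 9, 17, 31, 43, 0, 15, 19, 25, 29, 33, 37, 0, 3, 11)
)

# One row per selected index; the unnamed j result lands in column V1.
idx_by_group <- DT[, c(.I[!between(Score, -5, 5)], .I[.N]), by = ID]

# Drop repeated indices, which occur when a last row is also out of range.
idx <- unique(idx_by_group$V1)

DT[idx]
```

Printing `idx_by_group` shows which row numbers each ID contributes, which can help when checking the result on a larger table.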

Addition: Another method, one that uses only logical values and will never produce duplicate rows in the output, is

df[df[, .I == .I[.N] | !between(Score, -5, 5), by = ID]$V1]
#    ID Score Time
# 1:  1    -7   31
# 2:  1    -1   43
# 3:  2     6   29
# 4:  2     9   33
# 5:  2     3   37
# 6:  3     2   11

edited Jan 27, 2017 at 17:58

answered Jan 26, 2017 at 21:19

Rich Scriven


2 Comments

useR Over a year ago

Whenever I try to run this code I get the error

Error in `[.data.frame`(DF, , .I == .I[.N] | !between(Score, -5, 5), by = ID) : unused argument (by = ID)

even though ID is definitely a column name:

> colnames(DF)
[1] "ID" "Score" "Time"

Rich Scriven Over a year ago

@useR Did you run library(data.table)? The package needs to be installed and loaded.

lmo · Accepted Answer · 2017-01-26 21:42:44Z

2

Here is another base R solution.

df[as.logical(ave(df$Score, df$ID,
                  FUN=function(i) abs(i) > 5 | seq_along(i) == length(i))), ]

   ID Score Time
5   1    -7   31
6   1    -1   43
11  2     6   29
12  2     9   33
13  2     3   37
16  3     2   11

abs(i) > 5 | seq_along(i) == length(i) constructs a logical vector that returns TRUE for each element that fits your criteria. ave applies this function to each ID. The resulting logical vector is used to select the rows of the data.frame.
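A short sketch of why `as.logical` is wrapped around `ave` here: the flags are written back into a vector of the same type as `Score`, so in this example they come back as 1/0 rather than TRUE/FALSE. The names `keep` and `keep_rows` are illustrative only.

```r
# Example data from the question.
df <- data.frame(
  ID    = rep(1:3, times = c(6, 7, 3)),
  Score = c(0, 3, -2, -4, -7, -1, 0, -3, 0, 4, 6, 9, 3, 0, 5, 2),
  Time  = c(0, 5, 9, 17, 31, 43, 0, 15, 19, 25, 29, 33, 37, 0, 3, 11)
)

# Per-ID flags: out-of-range score, or last position within the ID.
keep <- ave(df$Score, df$ID,
            FUN = function(i) abs(i) > 5 | seq_along(i) == length(i))

# In this example the flags are numeric 1/0, so convert before subsetting.
keep_rows <- which(as.logical(keep))
df[keep_rows, ]
```

Using the numeric flags directly as a row index would select by position rather than by TRUE/FALSE, which is why the conversion matters.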

answered Jan 26, 2017 at 21:42

lmo


Joe · Accepted Answer · 2018-12-04 18:23:54Z

0

Here's a tidyverse solution. Not as concise as some of the above, but easier to follow.

library(tidyverse)
lastrows  <- Data %>% group_by(ID) %>% top_n(1, Time)
scorerows <- Data %>% group_by(ID) %>% filter(!between(Score, -5, 5))
bind_rows(scorerows, lastrows) %>% arrange(ID, Time) %>% unique()

# A tibble: 6 x 3
# Groups:   ID [3]
#      ID Score  Time
#   <int> <int> <int>
# 1     1    -7    31
# 2     1    -1    43
# 3     2     6    29
# 4     2     9    33
# 5     2     3    37
# 6     3     2    11
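The two conditions can also be combined into a single grouped `filter`, which avoids binding and de-duplicating afterwards. This is a sketch of a variant, not the code from the answer above. It assumes dplyr is installed, and it treats the last row of each ID as the last one by position rather than the one with the largest `Time`.

```r
library(dplyr)

# Example data from the question.
Data <- data.frame(
  ID    = rep(1:3, times = c(6, 7, 3)),
  Score = c(0, 3, -2, -4, -7, -1, 0, -3, 0, 4, 6, 9, 3, 0, 5, 2),
  Time  = c(0, 5, 9, 17, 31, 43, 0, 15, 19, 25, 29, 33, 37, 0, 3, 11)
)

result <- Data %>%
  group_by(ID) %>%
  # Keep out-of-range scores, plus the final row within each ID.
  filter(!between(Score, -5, 5) | row_number() == n()) %>%
  ungroup()

result
```

If the rows are not already in time order within each ID, adding `arrange(ID, Time)` before `group_by` would make the positional last row match the latest `Time`.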

answered Dec 4, 2018 at 18:23

Joe
