How to aggregate a data frame based on multiple columns in R?

Question

I have the following df (dput() at the end of question):

          link person   type      ID1   bin
1311752 793482 514677   departure 347 36900
1020897  66154 514677   arrival   379 36900
1376554 858880 514679   departure 374 42300
1374956 858880 514679   arrival   374 46800
1567933 992830 514680   arrival   140 31500
1568804 992830 514680   departure 140 32400
726546   87723 514681   departure  42 25200
726547  481929 514681   arrival   233 27900
723875  481929 514681   departure 233 45000
1396595  87723 514681   arrival    42 45900

What I want is to create a new df, where I see which person goes from (type=="departure") which zone (ID1) to (type=="arrival") which zone (ID1) and at which departure bin.

As can be seen in the last 4 rows of the example above, it is possible that the same person has multiple arrival and departure. It is important therefore that the two, which are closest to one another are combined (therefore smallest delta bin).

Therefore the new df should look something like this:

person  departure   arrival bin
514677  347         379     36900
514679  374         374     42300
514681  42          233     25200
514681  233         42      45000
...

As you can see, the person 514681 travels back and forth between 42 and 233.

I guess you can say that I want to aggregate the data based on the person. I am not quite sure how this can be accomplished and would be very happy for some help!

structure(list(link = c("793482", "66154", "858880", "858880", 
"992830", "992830", "858880", "858880", "514143", "87723", "481929", 
"481929", "87723", "87723", "964586", "87723", "1013453", "pt_8591034", 
"973606", "pt_8591034", "481929", "481929", "87723", "87723", 
"399738", "399738", "87723", "738635", "481929", "738635", "481929", 
"738635", "301867", "89053", "92917", "92917", "981899", "981899", 
"89053", "92917", "89053", "89053", "699715", "699715", "89053", 
"452176", "738635", "452176", "738635", "641738", "712438", "699715", 
"215822", "621953", "634264", "362138", "634264", "634264", "763815", 
"763815", "631191", "631191", "763815", "763815", "634264", "634264", 
"10826", "10826", "634264", "999977", "1046826", "1046826", "634264", 
"999977", "848815", "848815", "999977", "999977", "848815", "848815", 
"999977", "261239", "993653", "287772", "763815", "763815", "287772", 
"494898", "824071", "594222", "494898", "494898", "452642", "452642", 
"494898", "494898", "452642", "494898", "787840", "442285"), 
    person = c(514677L, 514677L, 514679L, 514679L, 514680L, 514680L, 
    514680L, 514680L, 514680L, 514681L, 514681L, 514681L, 514681L, 
    514682L, 514682L, 514683L, 514683L, 514684L, 514684L, 514684L, 
    514684L, 514684L, 514684L, 514685L, 514685L, 514685L, 514685L, 
    514688L, 514688L, 514688L, 514688L, 514690L, 514690L, 514707L, 
    514707L, 514707L, 514707L, 514707L, 514707L, 514708L, 514708L, 
    514709L, 514709L, 514709L, 514709L, 514710L, 514710L, 514710L, 
    514710L, 514711L, 514711L, 514733L, 514733L, 514757L, 514757L, 
    514757L, 514757L, 514759L, 514759L, 514759L, 514759L, 514759L, 
    514759L, 514759L, 514759L, 514760L, 514760L, 514760L, 514760L, 
    514774L, 514774L, 514774L, 514774L, 514774L, 514774L, 514774L, 
    514774L, 514774L, 514774L, 514774L, 514774L, 514782L, 514782L, 
    514786L, 514786L, 514786L, 514786L, 514802L, 514802L, 514802L, 
    514802L, 514803L, 514803L, 514803L, 514803L, 514803L, 514803L, 
    514804L, 514804L, 514805L), type = c("departure", "arrival", 
    "departure", "arrival", "arrival", "departure", "arrival", 
    "departure", "arrival", "departure", "arrival", "departure", 
    "arrival", "departure", "arrival", "departure", "arrival", 
    "arrival", "departure", "departure", "arrival", "departure", 
    "arrival", "departure", "arrival", "departure", "arrival", 
    "departure", "arrival", "departure", "arrival", "departure", 
    "arrival", "departure", "arrival", "departure", "arrival", 
    "departure", "arrival", "departure", "arrival", "departure", 
    "arrival", "departure", "arrival", "departure", "arrival", 
    "arrival", "departure", "departure", "arrival", "departure", 
    "arrival", "departure", "arrival", "departure", "arrival", 
    "departure", "arrival", "departure", "arrival", "departure", 
    "arrival", "departure", "arrival", "departure", "arrival", 
    "departure", "arrival", "departure", "arrival", "departure", 
    "arrival", "departure", "arrival", "departure", "arrival", 
    "departure", "arrival", "departure", "arrival", "departure", 
    "arrival", "departure", "arrival", "departure", "arrival", 
    "departure", "arrival", "departure", "arrival", "departure", 
    "arrival", "departure", "arrival", "departure", "arrival", 
    "departure", "arrival", "departure"), ID1 = c(347L, 379L, 
    374L, 374L, 140L, 140L, 374L, 374L, 401L, 42L, 233L, 233L, 
    42L, 42L, 496L, 42L, 235L, 172L, 231L, 172L, 233L, 233L, 
    42L, 42L, 267L, 267L, 42L, 58L, 233L, 58L, 233L, 58L, 292L, 
    96L, 264L, 264L, 26L, 26L, 96L, 264L, 96L, 96L, 266L, 266L, 
    96L, 57L, 58L, 57L, 58L, 123L, 55L, 266L, 77L, 412L, 407L, 
    410L, 407L, 407L, 297L, 297L, 169L, 169L, 297L, 297L, 407L, 
    407L, 566L, 566L, 407L, 403L, 380L, 380L, 407L, 403L, 436L, 
    436L, 403L, 403L, 436L, 436L, 403L, 545L, 324L, 448L, 297L, 
    297L, 448L, 209L, 152L, 205L, 209L, 209L, 204L, 204L, 209L, 
    209L, 204L, 209L, 383L, 181L), bin = c(36900, 36900, 42300, 
    46800, 31500, 32400, 34200, 61200, 72000, 25200, 27900, 45000, 
    45900, 27900, 31500, 30600, 33300, 56700, 56700, 60300, 61200, 
    63900, 65700, 28800, 32400, 75600, 77400, 24300, 26100, 45900, 
    47700, 51300, 53100, 27900, 31500, 36900, 38700, 41400, 42300, 
    62100, 63900, 15300, 16200, 45900, 47700, 35100, 35100, 67500, 
    67500, 63900, 65700, 45900, 46800, 63000, 64800, 90000, 90900, 
    27000, 28800, 45000, 46800, 63000, 67500, 68400, 72000, 48600, 
    50400, 53100, 55800, 37800, 38700, 40500, 41400, 47700, 48600, 
    51300, 52200, 62100, 63000, 63900, 64800, 22500, 24300, 19800, 
    20700, 54000, 57600, 21600, 22500, 55800, 56700, 36900, 38700, 
    51300, 52200, 62100, 63900, 30600, 34200, 33300)), row.names = c(1311752L, 
1020897L, 1376554L, 1374956L, 1567933L, 1568804L, 1378922L, 1378604L, 
796332L, 1396610L, 726547L, 723875L, 1396595L, 1398179L, 1513826L, 
1396821L, 27568L, 1607272L, 1542798L, 1606758L, 721147L, 720198L, 
1397549L, 1398033L, 621908L, 626763L, 1397225L, 1184210L, 729418L, 
1185149L, 728038L, 1183207L, 465692L, 1426554L, 1460666L, 1457289L, 
1551709L, 1550925L, 1429217L, 1461183L, 1427235L, 1428930L, 1117203L, 
1119814L, 1431191L, 683362L, 1183236L, 684338L, 1184602L, 990483L, 
1142535L, 1125893L, 288323L, 947577L, 966966L, 544664L, 969372L, 
964351L, 1230180L, 1231933L, 960341L, 960438L, 1238195L, 1241438L, 
965853L, 964667L, 105034L, 106179L, 966724L, 1586724L, 72120L, 
76232L, 969048L, 1587092L, 1360314L, 1361152L, 1585499L, 1586919L, 
1361549L, 1359631L, 1587109L, 378600L, 1573640L, 428095L, 1236840L, 
1235995L, 430374L, 757402L, 1331227L, 916001L, 760182L, 757620L, 
686066L, 685765L, 759010L, 760140L, 687455L, 758617L, 1287055L, 
676520L), class = "data.frame")

The arrival and departure bin of person 514679 aren't the same. So you just want the departure bin to be shown? — Martin Gal
– Martin Gal, Commented Jun 11, 2020 at 16:33

Michael M · Accepted Answer · 2020-06-11 19:10:53Z

1

With dplyr, this could be something like this:

library(dplyr)
data %>% 
  group_by(person) %>% 
  summarize(departure = ID1[type == "departure"][1],
            arrival = ID1[type == "arrival"][1],
            bin = bin[type == "departure"][1])

Here a refined version that tries to satisfy the additional logic. Since the problem is not too clear, I am quite certain that it is not yet 100% fine.

data %>%
  arrange(person, bin) %>%
  group_by(person) %>%
  mutate(pair = rep(row_number(), each = 2, length.out = n())) %>%
  group_by(person, pair) %>%
  summarize(departure = ID1[type == "departure"][1],
            arrival = ID1[type == "arrival"][1],
            bin = bin[type == "departure"][1])

edited Jun 11, 2020 at 19:10

answered Jun 11, 2020 at 16:36

Michael M

1,66313 silver badges14 bronze badges

Sign up to request clarification or add additional context in comments.

6 Comments

Yves Over a year ago

thank you very much, this seems to partly do the trick. I'm afraid my description was to vague (ill edit the question). It is possible that one person has multiple arrival and departure. Therefore, the closest ones have to be considered and combined in the way you present it. Is it possible to adjust the code to do this?

Michael M Over a year ago

Could you add such a case with the desired outcome? It it possible that the same person appears multiple time in the output?

Yves Over a year ago

I eddited the desired output:) thank you for having a look at it

Michael M Over a year ago

Hmm, still not so well defined query. E.g. person 514680 has five lines. How would his output look like? Is it possible that there are multiple departures within the same bin? (Maybe "link" helps).

Yves Over a year ago

In case of uneven row numbers id expect simply NA. It can be that there are multiple departures within the same bin (each bin contains 900 seconds, the links are the ID1). If it happens that there are more than one departures per bin and person i would have removed them in another step as it would be hard to distinguish the origin and destination.

|

Daniel O · Accepted Answer · 2020-06-11 16:41:08Z

0

In base R we can split the data and only keep the person/bin combinations that have more than 1 row (meaning there is an arrival and a departure) then extract the data.

splitdata <- split(df,with(df,paste0(person,bin)))
splitdata <- splitdata[sapply(splitdata,nrow)>1]

do.call(rbind,
    lapply(splitdata, function(x) 
        data.frame(
            person = x[1,2], 
            departure = x$ID1[x$type == "departure"],
            arrival = x$ID1[x$type == "arrival"],
            bin = x[1,5])))

output:

            person departure arrival   bin
51467736900 514677       347     379 36900
51468456700 514684       231     172 56700
51471035100 514710        57      58 35100
51471067500 514710        58      57 67500

answered Jun 11, 2020 at 16:41

Daniel O

4,3789 silver badges21 bronze badges

1 Comment

Yves Over a year ago

thank you very much! Unfortunately I receive an error: ``` Error in data.frame(person = x[1, 2], departure = x$ID1[x$type == "departure"], Arguments imply differing number of rows: 1,0 ``` Do you know what might be the issue here?

Collectives™ on Stack Overflow

How to aggregate a data frame based on multiple columns in R?

2 Answers 2

6 Comments

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

6 Comments

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related