0

I have two data tables as follows:

set.seed(1)
landing_fees <- data.table(`date`= c(20160101,20160101,20160101,
            20160102,20160102,20160102,20160103,20160103,20160103), 
            airport=c("LAX","PHX","HNL"), period_a_fee=sample(100:200,9),
            period_b_fee=sample(100:200,9), period_c_fee=sample(100:200,9))
routes <- data.table(origin=c(rep("LAX",4),rep("PHX",4)), 
            destination=c("PHX","HNL","PHX","HNL","LAX","HNL","LAX","HNL"),
            period=rep(c("period_a","period_a","period_b","period_c"),2))

Which output as follows:

> landing_fees
       date airport period_a_fee period_b_fee period_c_fee
1: 20160101     LAX          126          106          138
2: 20160101     PHX          137          120          177
3: 20160101     HNL          156          117          192
4: 20160102     LAX          189          167          120
5: 20160102     PHX          119          137          163
6: 20160102     HNL          186          173          112
7: 20160103     LAX          197          147          125
8: 20160103     PHX          162          197          136
9: 20160103     HNL          158          192          101
> routes
   origin destination   period
1:    LAX         PHX period_a
2:    LAX         HNL period_a
3:    LAX         PHX period_b
4:    LAX         HNL period_c
5:    PHX         LAX period_a
6:    PHX         HNL period_a
7:    PHX         LAX period_b
8:    PHX         HNL period_c

The real datasets are huge with millions of origin/destination combinations and years worth of dates. I would like to add a new column to the routes table for each unique date in landing_fees$date and programatically name the column with the date and _cost (ie 20160101_cost). Then in each new column, sum the fee for origin and the fee for the destination (using matching time period) as listed in the landing_fees table. For instance, in the routes table, a new column routes$20160101_cost would be created and the value of row 1 would be: LAX period_a fee of 126 (from landing_fees table) plus PHX period_a fee of 137 (from landing_fees table) equals 263. Same calculation would be made for each origin/destination/period combination for each date.

Final output would look like this:

    origin  destination period  20160101_cost 20160102_cost 20160103_cost
1:  LAX          PHX    period_a    263          308            359
2:  LAX          HNL    period_a    282          375            355
3:  LAX          PHX    period_b    226          304            344
4:  LAX          HNL    period_c    330          232            226
5:  PHX          LAX    period_a    263          308            359
6:  PHX          HNL    period_a    293          305            320
7:  PHX          LAX    period_b    226          304            344
8:  PHX          HNL    period_c    369          275            237

Since the datasets can be fairly large, I am hoping to use efficient data.table script. I am using data.table version 1.9.6 on Windows 7 machine.

1 Answer 1

3

You can do this using melt and cast from data.table.

This flattens your cost matrix for merging:

lf_flat <- melt(landing_fees, id.vars = c("date", "airport"), variable.name = "period")[, period := gsub("_fee", "", period)]

Then merge to the routing table, and collapse cost

DT <- merge(routes, lf_flat,by.x = c("origin","period"), by.y = c("airport","period"))
DT <- merge(DT, lf_flat, by.x = c("destination","period","date"), by.y = c("airport","period","date"))
DT[, cost := value.x + value.y][,(c("value.x","value.y")) := NULL]

Which gives you the consolidated table:

DT
    destination   period     date origin cost
 1:         HNL period_a 20160101    LAX  282
 2:         HNL period_a 20160101    PHX  293
 3:         HNL period_a 20160102    LAX  375
 4:         HNL period_a 20160102    PHX  305
 5:         HNL period_a 20160103    LAX  355
 6:         HNL period_a 20160103    PHX  320
 7:         HNL period_c 20160101    LAX  330
 8:         HNL period_c 20160101    PHX  369
 9:         HNL period_c 20160102    LAX  232
10:         HNL period_c 20160102    PHX  275
11:         HNL period_c 20160103    LAX  226
12:         HNL period_c 20160103    PHX  237
13:         LAX period_a 20160101    PHX  263
14:         LAX period_a 20160102    PHX  308
15:         LAX period_a 20160103    PHX  359
16:         LAX period_b 20160101    PHX  226
17:         LAX period_b 20160102    PHX  304
18:         LAX period_b 20160103    PHX  344
19:         PHX period_a 20160101    LAX  263
20:         PHX period_a 20160102    LAX  308
21:         PHX period_a 20160103    LAX  359
22:         PHX period_b 20160101    LAX  226
23:         PHX period_b 20160102    LAX  304
24:         PHX period_b 20160103    LAX  344
    destination   period     date origin cost

I would personally leave it like this, as it gives you the same data as your desired matrix, but with more flexibility. To get your desired state, use:

dcast(DT, origin + destination + period ~ date, value.var = "cost")
   origin destination   period 20160101 20160102 20160103
1:    LAX         HNL period_a      282      375      355
2:    LAX         HNL period_c      330      232      226
3:    LAX         PHX period_a      263      308      359
4:    LAX         PHX period_b      226      304      344
5:    PHX         HNL period_a      293      305      320
6:    PHX         HNL period_c      369      275      237
7:    PHX         LAX period_a      263      308      359
8:    PHX         LAX period_b      226      304      344
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.