R loop using data.table

Question

I have a dataset of about 3 million rows. I created a small example shown below:

ex <- data.table(eoc = c(1,1,1,1,1,2,2,2,3,3), proc1 = c(63035,63020,92344,63035,27567,63020,1234,55678,61112,1236), trigger_cpt = c(63020,63020,63020,63020,63020,63020,63020,63020,61112,61112))

I have another dataset of 42 rows, but have produced a smaller example:

add_on <- data.table(primary = c(63020,61112), secondary=c(63035,63445))

I need to relabel certain rows on the "trigger_cpt" column (grouped by eoc) if the trigger_cpt value happens to be one of the values in the primary column of the dataset and if there is a proc1 value that is the secondary value in the add_on dataset. If it meets the criteria, then trigger_cpt should be relabeled to the secondary code.

I was first typing everything manually, ex[,trigger_new := if(any(trigger_cpt == '63020' & proc1 == '63035')) 63035 else trigger_cpt, eoc]

and then decided to do a for loop

for(i in 1:nrow(add_on)){
  ex[,trigger_new2 := if(any(trigger_cpt == add_on[i,1] & proc1 == add_on[i,2])) add_on[i,2] else trigger_cpt, eoc]
}

however, now that I'm trying this code on my 3 million row dataset, it's been taking a long time to run. I'm not sure if there's a better approach or if there's any modifications I could make to my current code?

any help would be greatly appreciated!

expected output:

ex_final <- data.table(eoc = c(1,1,1,1,1,2,2,2,3,3), proc1 = c(63035,63020,92344,63035,27567,63020,1234,55678,61112,1236), trigger_cpt = c(63035,63035,63035,63035,63035,63020,63020,63020,61112,61112))

If you have different values for 'eoc' in add_on. then you may need to create a column 'eoc' in 'add_on' with those values and then do the join by 'eoc' along with other columns — akrun
– akrun, Commented Mar 29, 2021 at 22:48

Sirius · Accepted Answer · 2021-03-29 23:25:50Z

Here is one way which produces a data.table that sets all of trigger_cpt to the secondary value if a match is found in the grouped set:


ex2 <- add_on[ex, , on=.(primary=trigger_cpt)][ , trigger_new := fifelse( secondary %in% proc1, secondary, NA_real_ ), by=eoc ]
ex.final  <- ex2[ , trigger_cpt := fcoalesce( trigger_new, primary ) ][, .(eoc,proc1,trigger_cpt) ]

Output:


> ex.final
    eoc proc1 trigger_cpt
 1:   1 63035       63035
 2:   1 63020       63035
 3:   1 92344       63035
 4:   1 63035       63035
 5:   1 27567       63035
 6:   2 63020       63020
 7:   2  1234       63020
 8:   2 55678       63020
 9:   3 61112       61112
10:   3  1236       61112

Furthermore, if feasable (it comes at a cost), I consider using setkey unless it causes more harm than good. (initial processing might make it not worth it). It speeds up downstream operations and it might make join code a lot cleaner . data.table code can be hard enough as it is. Thus:


setkey(ex, trigger_cpt )
setkey(add_on, primary )

## can now do this:
add_on[ex]

## instead of this:
add_on[ex, , on=.(primary=trigger_cpt)]

## .. in the code above.

... Furthermore ...

If you are reworking the steps of above, you will notice that add_on[ex] (which is the somewhat backwards way to do left joins in data.table), leaves you with the key column names of add_on, and not of ex. This is not important, as long as you are aware and rename columns apropriately in the end, but an alternative way of joining the data might be this:


ex2 <- merge( ex, add_on, by.x="trigger_cpt", by.y="primary" )
## and then work your way till the end with what this gives you

akrun · Accepted Answer · 2021-03-30 00:19:40Z

0

Based on the expected output

ex[, trigger_new := first(proc1), eoc]




ex
    eoc proc1 trigger_cpt trigger_new
 1:   1 63035       63020       63035
 2:   1 63020       63020       63035
 3:   1 92344       63020       63035
 4:   1 63035       63020       63035
 5:   1 27567       63020       63035
 6:   2 63020       63020       63020
 7:   2  1234       63020       63020
 8:   2 55678       63020       63020
 9:   3 61112       61112       61112
10:   3  1236       61112       61112

edited Mar 30, 2021 at 0:19

answered Mar 29, 2021 at 22:28

akrun

891k38 gold badges590 silver badges700 bronze badges

3 Comments

vizidea Over a year ago

Is there any way I could group by column eoc? That way for eoc 1, every trigger_new is 63035

akrun Over a year ago

@vizidea Can you show the expected output. Based on your code, the grouping by eoc seems to be not needed 1. You are looping by each row, then the grouping column is irrelevant

vizidea Over a year ago

I added the desired output

Collectives™ on Stack Overflow

R loop using data.table

2 Answers 2

Comments

3 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

3 Comments

Your Answer

Sign up or log in

Post as a guest

Related