0

I have a dataset of about 3 million rows. I created a small example shown below:

ex <- data.table(eoc = c(1,1,1,1,1,2,2,2,3,3), proc1 = c(63035,63020,92344,63035,27567,63020,1234,55678,61112,1236), trigger_cpt = c(63020,63020,63020,63020,63020,63020,63020,63020,61112,61112))

I have another dataset of 42 rows, but have produced a smaller example:

add_on <- data.table(primary = c(63020,61112), secondary=c(63035,63445))

I need to relabel certain rows on the "trigger_cpt" column (grouped by eoc) if the trigger_cpt value happens to be one of the values in the primary column of the dataset and if there is a proc1 value that is the secondary value in the add_on dataset. If it meets the criteria, then trigger_cpt should be relabeled to the secondary code.

I was first typing everything manually, ex[,trigger_new := if(any(trigger_cpt == '63020' & proc1 == '63035')) 63035 else trigger_cpt, eoc]

and then decided to do a for loop

for(i in 1:nrow(add_on)){
  ex[,trigger_new2 := if(any(trigger_cpt == add_on[i,1] & proc1 == add_on[i,2])) add_on[i,2] else trigger_cpt, eoc]
}

however, now that I'm trying this code on my 3 million row dataset, it's been taking a long time to run. I'm not sure if there's a better approach or if there's any modifications I could make to my current code?

any help would be greatly appreciated!

expected output:

ex_final <- data.table(eoc = c(1,1,1,1,1,2,2,2,3,3), proc1 = c(63035,63020,92344,63035,27567,63020,1234,55678,61112,1236), trigger_cpt = c(63035,63035,63035,63035,63035,63020,63020,63020,61112,61112))
1
  • If you have different values for 'eoc' in add_on. then you may need to create a column 'eoc' in 'add_on' with those values and then do the join by 'eoc' along with other columns Commented Mar 29, 2021 at 22:48

2 Answers 2

2

Here is one way which produces a data.table that sets all of trigger_cpt to the secondary value if a match is found in the grouped set:


ex2 <- add_on[ex, , on=.(primary=trigger_cpt)][ , trigger_new := fifelse( secondary %in% proc1, secondary, NA_real_ ), by=eoc ]
ex.final  <- ex2[ , trigger_cpt := fcoalesce( trigger_new, primary ) ][, .(eoc,proc1,trigger_cpt) ]

Output:


> ex.final
    eoc proc1 trigger_cpt
 1:   1 63035       63035
 2:   1 63020       63035
 3:   1 92344       63035
 4:   1 63035       63035
 5:   1 27567       63035
 6:   2 63020       63020
 7:   2  1234       63020
 8:   2 55678       63020
 9:   3 61112       61112
10:   3  1236       61112

Furthermore, if feasable (it comes at a cost), I consider using setkey unless it causes more harm than good. (initial processing might make it not worth it). It speeds up downstream operations and it might make join code a lot cleaner . data.table code can be hard enough as it is. Thus:


setkey(ex, trigger_cpt )
setkey(add_on, primary )

## can now do this:
add_on[ex]

## instead of this:
add_on[ex, , on=.(primary=trigger_cpt)]

## .. in the code above.

... Furthermore ...

If you are reworking the steps of above, you will notice that add_on[ex] (which is the somewhat backwards way to do left joins in data.table), leaves you with the key column names of add_on, and not of ex. This is not important, as long as you are aware and rename columns apropriately in the end, but an alternative way of joining the data might be this:


ex2 <- merge( ex, add_on, by.x="trigger_cpt", by.y="primary" )
## and then work your way till the end with what this gives you

Sign up to request clarification or add additional context in comments.

Comments

0

Based on the expected output

ex[, trigger_new := first(proc1), eoc]




ex
    eoc proc1 trigger_cpt trigger_new
 1:   1 63035       63020       63035
 2:   1 63020       63020       63035
 3:   1 92344       63020       63035
 4:   1 63035       63020       63035
 5:   1 27567       63020       63035
 6:   2 63020       63020       63020
 7:   2  1234       63020       63020
 8:   2 55678       63020       63020
 9:   3 61112       61112       61112
10:   3  1236       61112       61112

3 Comments

Is there any way I could group by column eoc? That way for eoc 1, every trigger_new is 63035
@vizidea Can you show the expected output. Based on your code, the grouping by eoc seems to be not needed 1. You are looping by each row, then the grouping column is irrelevant
I added the desired output

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.