How to speed up row level operation with dplyr

Question

I have a large dataset that I try to manipulate using dplyr. My data wrangling task requires row level string manipulation.

I am using the default rowwise() function, and the code is working. However, the operation is taking a lot of time to complete.

VR_vehicle_GPSLocation = c("12.36556|0.74518153|xxxxxxxxxx", 
      "-51.75810|165.55526|xxxxxxxxxx", 
      "GPS nicht verfügbar",
      "48.77410|171.08364|xxxxxxxxxx", 
      "GPS Not Available",
      "0|0|N/R",
      "32.18661| 170.56615|xxxxxxxxxx")
df = data.frame(VR_vehicle_GPSLocation)

jobs_location <- df %>%
     rowwise() %>% 
     mutate(latitude  = as.numeric(unlist(strsplit(as.character(VR_vehicle_GPSLocation), split='\\|'))[1]),
            longitude = as.numeric(unlist(strsplit(as.character(VR_vehicle_GPSLocation), split='\\|'))[2])) %>%
     select(latitude, longitude)

In order to speed up the process, I explored the multidyplyr library without success, I am getting an error message saying that my dataset is not a data frame.

jobs_location <- jobs %>%
  partition() %>%
  rowwise() %>% 
  mutate(latitude  = as.numeric(unlist(strsplit(as.character(VR_vehicle_GPSLocation), split='\\|'))[1]),
         longitude = as.numeric(unlist(strsplit(as.character(VR_vehicle_GPSLocation), split='\\|'))[2])) %>%
  collect()

First of all we don't have a MWE, so we can't really help you. Second of all, by looking at your code, I doubt you need to run strsplit by row. You probably could easily do the whole thing at once using data.table::tstrsplit. Third of all, if you want fast splits, don't use regex and don't run as.charcter per row (twice each time!). i.e., VR_vehicle_GPSLocation should be already a character before you start doing stuff and instead of '\\|' use | combined with fixed = TRUE. But there again, we need a MWE. — David Arenburg
– David Arenburg, Commented Feb 20, 2017 at 12:11
Thanks @DavidArenburg I am busy creating an example so it will make more sense. — Michael
– Michael, Commented Feb 20, 2017 at 12:27
library(data.table) ; setDT(df)[grep("|", VR_vehicle_GPSLocation, fixed = TRUE), c("latitude", "longitude") := tstrsplit(VR_vehicle_GPSLocation, "|", fixed = TRUE, keep = 1:2, type.convert = TRUE)] and you good to go. — David Arenburg
– David Arenburg, Commented Feb 20, 2017 at 12:42
Thanks, it's clearly better. I also read your list of rules of thumbs and it makes perfect sense. I will, next time, think about alternative instead of using dplyr. — Michael
– Michael, Commented Feb 20, 2017 at 12:47
See my updated comment above. You need to make sure to avoid the gibrish before splitting, otherwise the resulting columns won't be numeric. I've used | as the initial filter (the grep command) but I'm not sure if this always true in your real data. Either-way, they idea here is not about dplyr or not, rather not to do rowwise operations when stuff can be easily vectorized. Also, will be interesting if you'll update if there was a performance improvement. See also this as this looks similar. — David Arenburg
– David Arenburg, Commented Feb 20, 2017 at 12:51

Michael · Accepted Answer · 2017-02-21 11:10:01Z

All credit to @DavidArenburg

I approached the problem from a non-efficient angle. Using a vectorized approach clearly improve the performance.

For the sake of completeness, I run the code on a random tiny subset of the whole dataset to evaluate the performance and clearly vectorization is the way to go for my problem.

Finally it is important to mention that a pre-cleaning task is needed to ensure the resulting transformation is numeric (refer to David's comment for more details)

library(dplyr)
library(data.table)
library(microbenchmark)
library(ggplot2)

mbm = microbenchmark(
  a = sample_n(jobs, 100) %>%
    rowwise() %>% 
    mutate(latitude  = as.numeric(unlist(strsplit(as.character(VR_vehicle_GPSLocation), split='\\|'))[1]),
           longitude = as.numeric(unlist(strsplit(as.character(VR_vehicle_GPSLocation), split='\\|'))[2])),

  b = setDT(sample_n(jobs, 100))[grep("|", VR_vehicle_GPSLocation, fixed = TRUE), 
                   c("latitude", "longitude") := tstrsplit(VR_vehicle_GPSLocation, "|", fixed = TRUE, keep = 1:2, type.convert = TRUE)]
)

autoplot(mbm)

A picture is worth a thousand words

Another suggestion from David is to convert the data to numeric after tstrsplit. I added two functions one that is doing the conversion on the overall columns and one that apply the type conversion after the split.

mbm = microbenchmark(
  a = sample_n(jobs, 100) %>%
    rowwise() %>% 
    mutate(latitude  = as.numeric(unlist(strsplit(as.character(VR_vehicle_GPSLocation), split='\\|'))[1]),
           longitude = as.numeric(unlist(strsplit(as.character(VR_vehicle_GPSLocation), split='\\|'))[2])),

  b = setDT(sample_n(jobs, 100))[grep("|", VR_vehicle_GPSLocation, fixed = TRUE), 
                                 c("latitude", "longitude") := tstrsplit(VR_vehicle_GPSLocation, "|", fixed = TRUE, keep = 1:2, type.convert = TRUE)],

  c = sapply(setDT(sample_n(jobs, 100))[grep("|", VR_vehicle_GPSLocation, fixed = TRUE), 
                                 c("latitude", "longitude") := tstrsplit(VR_vehicle_GPSLocation, "|", fixed = TRUE, keep = 1:2)], as.numeric),

  d = setDT(sample_n(jobs, 100))[grep("|", VR_vehicle_GPSLocation, fixed = TRUE), 
                                        c("latitude", "longitude") := lapply(tstrsplit(VR_vehicle_GPSLocation, "|", fixed = TRUE, keep = 1:2), as.numeric)]
)
autoplot(mbm)

The last variant (d) is clearly the winner.

I think that explicitly converting latitude and longitude to numeric after splitting instead of using type.convert = TRUE within tstrsplit will probably double the speed.
Thanks, @DavidArenburg - I elaborated the example a bit more based on your suggestion.

Collectives™ on Stack Overflow

How to speed up row level operation with dplyr

1 Answer 1

2 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

2 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related