3

I have a large dataset that I try to manipulate using dplyr. My data wrangling task requires row level string manipulation.

I am using the default rowwise() function, and the code is working. However, the operation is taking a lot of time to complete.

VR_vehicle_GPSLocation = c("12.36556|0.74518153|xxxxxxxxxx", 
      "-51.75810|165.55526|xxxxxxxxxx", 
      "GPS nicht verfügbar",
      "48.77410|171.08364|xxxxxxxxxx", 
      "GPS Not Available",
      "0|0|N/R",
      "32.18661| 170.56615|xxxxxxxxxx")
df = data.frame(VR_vehicle_GPSLocation)

jobs_location <- df %>%
     rowwise() %>% 
     mutate(latitude  = as.numeric(unlist(strsplit(as.character(VR_vehicle_GPSLocation), split='\\|'))[1]),
            longitude = as.numeric(unlist(strsplit(as.character(VR_vehicle_GPSLocation), split='\\|'))[2])) %>%
     select(latitude, longitude)

In order to speed up the process, I explored the multidyplyr library without success, I am getting an error message saying that my dataset is not a data frame.

jobs_location <- jobs %>%
  partition() %>%
  rowwise() %>% 
  mutate(latitude  = as.numeric(unlist(strsplit(as.character(VR_vehicle_GPSLocation), split='\\|'))[1]),
         longitude = as.numeric(unlist(strsplit(as.character(VR_vehicle_GPSLocation), split='\\|'))[2])) %>%
  collect()
5
  • 1
    First of all we don't have a MWE, so we can't really help you. Second of all, by looking at your code, I doubt you need to run strsplit by row. You probably could easily do the whole thing at once using data.table::tstrsplit. Third of all, if you want fast splits, don't use regex and don't run as.charcter per row (twice each time!). i.e., VR_vehicle_GPSLocation should be already a character before you start doing stuff and instead of '\\|' use | combined with fixed = TRUE. But there again, we need a MWE. Commented Feb 20, 2017 at 12:11
  • Thanks @DavidArenburg I am busy creating an example so it will make more sense. Commented Feb 20, 2017 at 12:27
  • 1
    library(data.table) ; setDT(df)[grep("|", VR_vehicle_GPSLocation, fixed = TRUE), c("latitude", "longitude") := tstrsplit(VR_vehicle_GPSLocation, "|", fixed = TRUE, keep = 1:2, type.convert = TRUE)] and you good to go. Commented Feb 20, 2017 at 12:42
  • Thanks, it's clearly better. I also read your list of rules of thumbs and it makes perfect sense. I will, next time, think about alternative instead of using dplyr. Commented Feb 20, 2017 at 12:47
  • 2
    See my updated comment above. You need to make sure to avoid the gibrish before splitting, otherwise the resulting columns won't be numeric. I've used | as the initial filter (the grep command) but I'm not sure if this always true in your real data. Either-way, they idea here is not about dplyr or not, rather not to do rowwise operations when stuff can be easily vectorized. Also, will be interesting if you'll update if there was a performance improvement. See also this as this looks similar. Commented Feb 20, 2017 at 12:51

1 Answer 1

2

All credit to @DavidArenburg

I approached the problem from a non-efficient angle. Using a vectorized approach clearly improve the performance.

For the sake of completeness, I run the code on a random tiny subset of the whole dataset to evaluate the performance and clearly vectorization is the way to go for my problem.

Finally it is important to mention that a pre-cleaning task is needed to ensure the resulting transformation is numeric (refer to David's comment for more details)

library(dplyr)
library(data.table)
library(microbenchmark)
library(ggplot2)

mbm = microbenchmark(
  a = sample_n(jobs, 100) %>%
    rowwise() %>% 
    mutate(latitude  = as.numeric(unlist(strsplit(as.character(VR_vehicle_GPSLocation), split='\\|'))[1]),
           longitude = as.numeric(unlist(strsplit(as.character(VR_vehicle_GPSLocation), split='\\|'))[2])),

  b = setDT(sample_n(jobs, 100))[grep("|", VR_vehicle_GPSLocation, fixed = TRUE), 
                   c("latitude", "longitude") := tstrsplit(VR_vehicle_GPSLocation, "|", fixed = TRUE, keep = 1:2, type.convert = TRUE)]
)

autoplot(mbm)

A picture is worth a thousand words

enter image description here


Another suggestion from David is to convert the data to numeric after tstrsplit. I added two functions one that is doing the conversion on the overall columns and one that apply the type conversion after the split.

mbm = microbenchmark(
  a = sample_n(jobs, 100) %>%
    rowwise() %>% 
    mutate(latitude  = as.numeric(unlist(strsplit(as.character(VR_vehicle_GPSLocation), split='\\|'))[1]),
           longitude = as.numeric(unlist(strsplit(as.character(VR_vehicle_GPSLocation), split='\\|'))[2])),

  b = setDT(sample_n(jobs, 100))[grep("|", VR_vehicle_GPSLocation, fixed = TRUE), 
                                 c("latitude", "longitude") := tstrsplit(VR_vehicle_GPSLocation, "|", fixed = TRUE, keep = 1:2, type.convert = TRUE)],

  c = sapply(setDT(sample_n(jobs, 100))[grep("|", VR_vehicle_GPSLocation, fixed = TRUE), 
                                 c("latitude", "longitude") := tstrsplit(VR_vehicle_GPSLocation, "|", fixed = TRUE, keep = 1:2)], as.numeric),

  d = setDT(sample_n(jobs, 100))[grep("|", VR_vehicle_GPSLocation, fixed = TRUE), 
                                        c("latitude", "longitude") := lapply(tstrsplit(VR_vehicle_GPSLocation, "|", fixed = TRUE, keep = 1:2), as.numeric)]
)
autoplot(mbm)

The last variant (d) is clearly the winner.

enter image description here

Sign up to request clarification or add additional context in comments.

2 Comments

I think that explicitly converting latitude and longitude to numeric after splitting instead of using type.convert = TRUE within tstrsplit will probably double the speed.
Thanks, @DavidArenburg - I elaborated the example a bit more based on your suggestion.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.