I have a large dataset that I am trying to manipulate with dplyr. The data wrangling task requires row-level string manipulation.
I am using the rowwise() function, and the code works. However, the operation takes a long time to complete.
library(dplyr)
VR_vehicle_GPSLocation = c("12.36556|0.74518153|xxxxxxxxxx",
                           "-51.75810|165.55526|xxxxxxxxxx",
                           "GPS nicht verfügbar",
                           "48.77410|171.08364|xxxxxxxxxx",
                           "GPS Not Available",
                           "0|0|N/R",
                           "32.18661| 170.56615|xxxxxxxxxx")
df = data.frame(VR_vehicle_GPSLocation)
jobs_location <- df %>%
  rowwise() %>%
  mutate(latitude = as.numeric(unlist(strsplit(as.character(VR_vehicle_GPSLocation), split='\\|'))[1]),
         longitude = as.numeric(unlist(strsplit(as.character(VR_vehicle_GPSLocation), split='\\|'))[2])) %>%
  select(latitude, longitude)
To speed up the process, I tried the multidplyr library, without success: I get an error message saying that my dataset is not a data frame.
jobs_location <- jobs %>%
  partition() %>%
  rowwise() %>%
  mutate(latitude = as.numeric(unlist(strsplit(as.character(VR_vehicle_GPSLocation), split='\\|'))[1]),
         longitude = as.numeric(unlist(strsplit(as.character(VR_vehicle_GPSLocation), split='\\|'))[2])) %>%
  collect()


strsplit by row. You could probably do the whole thing at once using data.table::tstrsplit. Third of all, if you want fast splits, don't use regex and don't run as.character per row (twice each time!). I.e., VR_vehicle_GPSLocation should already be a character before you start doing stuff, and instead of '\\|' use "|" combined with fixed = TRUE. But there again, we need a MWE.

library(data.table) ; setDT(df)[grep("|", VR_vehicle_GPSLocation, fixed = TRUE), c("latitude", "longitude") := tstrsplit(VR_vehicle_GPSLocation, "|", fixed = TRUE, keep = 1:2, type.convert = TRUE)]
and you are good to go.

dplyr.

"|" as the initial filter (the grep command), but I'm not sure if this is always true in your real data. Either way, the idea here is not about dplyr or not, but rather not to do rowwise operations when stuff can be easily vectorized. Also, it will be interesting if you update on whether there was a performance improvement. See also this, as this looks similar.
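
For readers without data.table, here is a minimal base-R sketch of the same idea from the comments: split the whole column in one vectorized call with fixed = TRUE instead of going row by row. It swaps tstrsplit for plain strsplit plus vapply, so it is an alternative to the one-liner above, not a copy of it. The names gps, has_sep, and parts are made up for illustration, and the sketch assumes (as the comment does) that every valid entry contains "|".

```r
# Sample data from the question, kept as a plain character vector
gps <- c("12.36556|0.74518153|xxxxxxxxxx",
         "-51.75810|165.55526|xxxxxxxxxx",
         "GPS nicht verfügbar",
         "48.77410|171.08364|xxxxxxxxxx",
         "GPS Not Available",
         "0|0|N/R",
         "32.18661| 170.56615|xxxxxxxxxx")

# Assumption: only entries containing "|" hold coordinates
has_sep <- grepl("|", gps, fixed = TRUE)

# One vectorized split over all valid entries; fixed = TRUE avoids regex
parts <- strsplit(gps[has_sep], "|", fixed = TRUE)

# Pre-fill with NA so entries without coordinates stay missing
latitude  <- rep(NA_real_, length(gps))
longitude <- rep(NA_real_, length(gps))

# Take the first and second piece of every split
latitude[has_sep]  <- as.numeric(vapply(parts, `[`, character(1), 1))
longitude[has_sep] <- as.numeric(vapply(parts, `[`, character(1), 2))

jobs_location <- data.frame(latitude, longitude)
```

Entries such as "GPS Not Available" end up as NA in both columns, and no warnings are raised because they are never passed to as.numeric. Whether this is faster than the rowwise() version on the real data would need to be measured.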