
I'm currently merging 12 different data frames, each 480,000 obs, by an id and adding the columns, so the result is a 480k obs x 14 variable data frame. However, this is taking too long to process and I'm looking for a faster way to do it.

Example

dput:

# January data
jan <- structure(list(gridNumber = c("17578", "18982", "18983", "18984", 
"18985"), PRISM_ppt_stable_4kmM2_193301_bil = c(35.7099990844727, 
36, 35.4199981689453, 33.7299995422363, 33.2799987792969)), .Names = c("gridNumber", 
"PRISM_ppt_stable_4kmM2_193301_bil"), row.names = c("17578", 
"18982", "18983", "18984", "18985"), class = "data.frame")

# February data 
feb <- structure(list(gridNumber = c("17578", "18982", "18983", "18984", 
"18985"), PRISM_ppt_stable_4kmM2_193302_bil = c(14.6199998855591, 
14.5600004196167, 14.9899997711182, 15.4700002670288, 15.5799999237061
)), .Names = c("gridNumber", "PRISM_ppt_stable_4kmM2_193302_bil"
), row.names = c("17578", "18982", "18983", "18984", "18985"), class = "data.frame")

# March Data 
mar <- structure(list(gridNumber = c("17578", "18982", "18983", "18984", 
"18985"), PRISM_ppt_stable_4kmM2_193303_bil = c(23.8400001525879, 
23.9200000762939, 24.3400001525879, 25.7900009155273, 26.5900001525879
)), .Names = c("gridNumber", "PRISM_ppt_stable_4kmM2_193303_bil"
), row.names = c("17578", "18982", "18983", "18984", "18985"), class = "data.frame")

dplyr Code:

  library(dplyr)
  datalist <- list(jan, feb, mar)
  full <- Reduce(function(x, y) full_join(x, y, by = "gridNumber"), datalist)

This example obviously runs quickly because of the small number of observations, but is there a faster way to do this at full scale?

2 Answers


Here is an approach using data.table and reshape2:

library(data.table)
library(reshape2)
# create a list of data frames, and coerce to data.tables
month_list <- lapply(list(jan,feb,mar),setDT)


# add an ID column holding the old variable name, then rename the value column
for (i in seq_along(month_list)) {
  set(month_list[[i]], j = "ID", value = names(month_list[[i]])[2])
  setnames(month_list[[i]], names(month_list[[i]])[2], "value")
}
# put in long form
long_data <- rbindlist(month_list)

# then use `dcast.data.table` to make wide

wide <- dcast.data.table(long_data, gridNumber ~ ID, value.var = "value")

5 Comments

This is really fast; although, I don't understand the middle for loop. Why is this a necessary step?
@Amstell, the for loop adds a column identifying the source data set (by the name of its 2nd column), then renames that second column so the data can be stored in 3 columns in long form
is loading reshape2 still necessary in 1.9.6?
@jangorecki perhaps not.
Maybe worth noting that stacking in long form depends on the columns' having the same class (as is the case for the OP)... In that case, I'd argue for sticking to long form.
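Following up on that last comment, here is a minimal self-contained sketch of staying in long form rather than casting wide. The toy data and the shortened column names (`ppt_193301`, `ppt_193302`) are hypothetical stand-ins for the OP's PRISM columns; the loop is the same trick as in the answer above:

```r
library(data.table)

# hypothetical minimal data mirroring the OP's structure
jan <- data.table(gridNumber = c("17578", "18982"), ppt_193301 = c(35.71, 36.00))
feb <- data.table(gridNumber = c("17578", "18982"), ppt_193302 = c(14.62, 14.56))

month_list <- list(jan, feb)
for (i in seq_along(month_list)) {
  set(month_list[[i]], j = "ID", value = names(month_list[[i]])[2])
  setnames(month_list[[i]], names(month_list[[i]])[2], "value")
}

# stacked long table: one row per gridNumber per month
long_data <- rbindlist(month_list)

# many analyses work directly on this form, e.g. mean precipitation per month
monthly_means <- long_data[, .(mean_ppt = mean(value)), by = ID]
```

If the wide table is only an intermediate step toward per-month summaries, skipping the `dcast` entirely avoids that reshape cost.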

Dunno if this will be faster, but:

library(dplyr)
library(tidyr)  # for spread()

list(jan = jan %>% rename(PRISM = PRISM_ppt_stable_4kmM2_193301_bil), 
     feb = feb %>% rename(PRISM = PRISM_ppt_stable_4kmM2_193302_bil), 
     mar = mar %>% rename(PRISM = PRISM_ppt_stable_4kmM2_193303_bil)) %>%
  bind_rows(.id = "month") %>%
  spread(month, PRISM)

2 Comments

I take care of the rename after the data has been merged with colnames()
In this case, it is necessary for all of the PRISM columns to have the same name for bind_rows to work. The names of the list (jan = ) will end up as the new names of the PRISM columns after the reshape
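To make that mechanism concrete, here is a small self-contained sketch on hypothetical toy data (the short column names `ppt_jan` and `ppt_feb` are stand-ins for the long PRISM names): the list names supplied to `bind_rows(.id = "month")` become the values of the `month` column, and `spread` then turns those values into the new column names.

```r
library(dplyr)
library(tidyr)

# toy data with hypothetical column names
jan <- data.frame(gridNumber = c("17578", "18982"), ppt_jan = c(35.71, 36.00))
feb <- data.frame(gridNumber = c("17578", "18982"), ppt_feb = c(14.62, 14.56))

wide <- list(jan = jan %>% rename(PRISM = ppt_jan),
             feb = feb %>% rename(PRISM = ppt_feb)) %>%
  bind_rows(.id = "month") %>%  # list names ("jan", "feb") fill the month column
  spread(month, PRISM)          # month values become the new column names
```

Note that `spread` has since been superseded by `tidyr::pivot_wider`, which accepts the same long data via `pivot_wider(names_from = month, values_from = PRISM)`.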
