
I am looking for an efficient method (both in terms of computing resources and in terms of learning/implementation effort) to merge two large data frames (size > 1 million rows / 300 KB RData file).

"merge" in base R and "join" in plyr appear to use up all my memory, effectively crashing my system.

Example:

Load the test data frame and try

    test.merged <- merge(test, test)

or

    test.merged <- join(test, test, type = "all")

The following post provides a list of merge and alternatives:
How to join (merge) data frames (inner, outer, left, right)?

The following allows object size inspection:
https://heuristically.wordpress.com/2010/01/04/r-memory-usage-statistics-variable/
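The linked approach boils down to `object.size()` from the utils package (attached by default); a minimal sketch of checking a data frame's memory footprint:

```r
# object.size() reports an object's approximate memory use
df <- data.frame(x = rnorm(1e6))

# 1e6 doubles at 8 bytes each, so roughly 8 MB plus a little overhead
print(object.size(df), units = "MB")
```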


2 Comments
  • sqldf or data.table? Commented Jun 21, 2012 at 21:34
  • After gutting the nice responses below, I was able to find: stackoverflow.com/questions/4322219/… (though the question was not about large df but about saving milliseconds, it did get similar answers as below). Commented Jun 22, 2012 at 2:51

3 Answers


Here are some timings for the data.table vs. data.frame methods. The data.table merge is much faster. Regarding memory, I can informally report that the two methods are very similar (within 20%) in RAM use.

library(data.table)

set.seed(1234)
n = 1e6

data_frame_1 = data.frame(id=paste("id_", 1:n, sep=""),
                          factor1=sample(c("A", "B", "C"), n, replace=TRUE))
data_frame_2 = data.frame(id=sample(data_frame_1$id),
                          value1=rnorm(n))

data_table_1 = data.table(data_frame_1, key="id")
data_table_2 = data.table(data_frame_2, key="id")

system.time(df.merged <- merge(data_frame_1, data_frame_2))
#   user  system elapsed 
# 17.983   0.189  18.063 


system.time(dt.merged <- merge(data_table_1, data_table_2))
#   user  system elapsed 
#  0.729   0.099   0.821 
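For completeness, the same keyed join can also be written with data.table's bracket syntax, `X[Y]`, which skips the `merge()` wrapper and uses the binary-search join directly; a small self-contained sketch (smaller n than the timings above so it runs quickly):

```r
library(data.table)

set.seed(1234)
n <- 1e5

# Two tables keyed on the shared "id" column, as in the timings above
dt1 <- data.table(id = paste0("id_", 1:n),
                  factor1 = sample(c("A", "B", "C"), n, replace = TRUE),
                  key = "id")
dt2 <- data.table(id = sample(dt1$id),
                  value1 = rnorm(n),
                  key = "id")

# X[Y] looks up each row of dt2 in dt1 by key; because the keys here are
# unique and fully overlapping, the result matches merge(dt1, dt2)
joined <- dt1[dt2]
```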



Here's the obligatory data.table example:

library(data.table)

## Fix up your example data.frame so that the columns aren't all factors
## (not necessary, but shows that data.table can now use numeric columns as keys)
cols <- c(1:5, 7:10)
test[cols] <- lapply(cols, FUN=function(X) as.numeric(as.character(test[[X]])))
test[11] <- as.logical(test[[11]])

## Create two data.tables with which to demonstrate a data.table merge
dt <- data.table(test, key=names(test))
dt2 <- copy(dt)
## Add to each one a unique non-keyed column
dt$X <- seq_len(nrow(dt))
dt2$Y <- rev(seq_len(nrow(dt)))

## Merge them based on the keyed columns (in both cases, all but the last) to ...
## (1) create a new data.table
dt3 <- dt[dt2]
## (2) or (poss. minimizing memory usage), just add column Y from dt2 to dt
dt[dt2,Y:=Y]

3 Comments

Thanks for the great answer. I guess if you want the original order maintained you add a 1:nrow(df) column and use it as the first element of the key?
@EtienneLow-Décarie -- That's a good question. I think you do want to add such a column, but don't make it an element of the key. That way you can use it to re-order the data at any point. (It should not be part of the key since it's just an order marker, not a variable/group identifier having the same meaning in different data sets).
Does data.table spell the end of the need for apply and plyr!? Pretty impressive!
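A sketch of the order-preserving idea discussed in the comments: add a row-index column before keying, keep it out of the key, and sort on it afterwards (toy data and column names for illustration, not from the original question):

```r
library(data.table)

dt <- data.table(id = c("b", "a", "c"), val = 1:3)
dt[, orig_order := .I]   # .I is the current row number; deliberately not keyed
setkey(dt, id)           # keying sorts by id, changing the physical row order

lookup <- data.table(id = c("a", "b", "c"), extra = c(10, 20, 30), key = "id")
dt[lookup, extra := extra]   # merge in the new column by key, in place

setorder(dt, orig_order)     # restore the original row order
```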

Do you have to do the merge in R? If not, merge the underlying data files using a simple file concatenation and then load them into R. (I realize this may not apply to your situation -- but if it does, it could save you a lot of headache.)

1 Comment

It has to be done live in R, as it is a step in an optimization routine; writing to disk would probably be a bottleneck. Thanks though.
