1

I am trying to mutate a new column into a very large dataframe by using combinations of two variables to draw a value from another dataframe. I know I can do this without using a for loop, but the original dataframe is ~2.7 million rows and so a for loop takes prohibitively long.

Essentially, I have two dataframes like such:

df1 = data.frame(cbind(years = c(2001:2005), sites = c(1:5), var1 = rnorm(n=5)))
df1
   years sites        var1
1   2001     1 -0.01419947
2   2002     2  0.53729311
3   2003     3  0.89222231
4   2004     4  0.20600240
5   2005     5  0.24541548

df2 = data.frame(cbind(`2001` = rnorm(n = 5, mean = 2, sd = 1),
                       `2002` = rnorm(n = 5, mean = 2, sd = 1),
                       `2003` = rnorm(n = 5, mean = 2, sd = 1),
                       `2004` = rnorm(n = 5, mean = 2, sd = 1),
                       `2005` = rnorm(n = 5, mean = 2, sd = 1)))
colnames(df2) = c(2001:2005); rownames(df2) = c(1:5)
df2
      2001      2002      2003      2004      2005
1 -0.1217767 0.6703649 3.6074038 2.3048512 3.0022530
2  2.6167986 1.7673236 0.9467254 0.9903685 1.8533297
3  0.3192424 2.2183726 0.8783466 2.7741821 0.1847018
4  2.3599459 0.5653315 3.8843616 3.3171480 2.9135520
5  1.5399871 2.8127713 1.2168152 2.1788167 2.1153329

where for df2 the column names are years, the rownames are sites, and in the actual version of df1, each year/site combination is present many times.

I would like to end up with:

   years sites        var1    NewVar
1   2001     1 -0.01419947  1.322451
2   2002     2  0.53729311  3.083238
3   2003     3  0.89222231  1.106300
4   2004     4  0.20600240  2.723593
5   2005     5  0.24541548  2.4919137

Such that the variable NewVar is made based on the combination of the values in years and sites which indicate the appropriate value to draw from df2.

What I want to do, is mutate a column in df1 that, for each row, uses the combination of the site and year variables to find the appropriate value in df2. As I mentioned before, I know this is possible to do with a for loop but on the computer I'm running on, a similar foreach loop took 6 hours running on 3 cores, so I'm really hoping for a quicker mutate version.

The only other thought for a solution I've had so far is to use a combination of indexing and a ton of ifelse statements, but the real versions of the dataframes I'm working with have 702 unique combinations of site and years.

Any help would be greatly appreciated!

3 Answers 3

3

I think data.table is probably better option here, however just to illustrate the same logic in tidyverse:

library(tidyverse)

df2 %>%                                            # pipe in df2 
  rowid_to_column('sites') %>%                     # assign rownames to 'sites'
  gather(key = years, value = newVar, -sites) %>%  # transworm df2 to long form
  mutate(years = as.numeric(years)) %>%            # convert 'years' into numeric    
  right_join(df1, by = c('years', 'sites')) %>%    # join df1 and df2 
  select(years, sites, var1, newVar)               # rearrange columns

#   years sites       var1   newVar
# 1  2001     1 -0.2324031 3.652280
# 2  2002     2 -1.6015391 4.144123
# 3  2003     3 -1.9813792 3.514144
# 4  2004     4 -0.6039213 2.334821
# 5  2005     5  0.3302109 3.416026
Sign up to request clarification or add additional context in comments.

Comments

2

You could try to solve this using the data.table package. It is a very fast package for larger amounts of data.

The idea is to melt df2 to long format, so each year X site value is stored in a separat line in the dataframe. After doing so df1 and df2_long can be joined together by the key-elements years and sites.

edit: you can leaf out all the data.table:: if you load the data.table package. I just used them to indicate the data.table functions.

set.seed(123)

df1 <- data.frame(cbind(years = c(2001:2005), sites = c(1:5), var1 = rnorm(n = 5)))
    
df2 <- data.frame(cbind(`2001` = rnorm(n = 5, mean = 2, sd = 1),
                        `2002` = rnorm(n = 5, mean = 2, sd = 1),
                        `2003` = rnorm(n = 5, mean = 2, sd = 1),
                        `2004` = rnorm(n = 5, mean = 2, sd = 1),
                        `2005` = rnorm(n = 5, mean = 2, sd = 1)))
colnames(df2) = c(2001:2005); rownames(df2) = c(1:5)
    
# helpercolum to melt the data
df2$site = rownames(df2)

# melt data and change varnames
df2_long = data.table::melt(df2, id.vars = "site")
names(df2_long) = c("sites", "years", "NewVar")

# set df1 as data.table
data.table::setDT(df1)

# set df2 as data.table and convert the factors to numerics, as @Gregor suggested 
# in his post (this way you dont have to deal with common factor-struggles)
data.table::setDT(df2_long)
df2_long$sites = as.numeric(as.character(df2_long$sites))
df2_long$years = as.numeric(as.character(df2_long$years))

# set key-columns on which the join should be made
data.table::setkey(df1, years, sites)
data.table::setkey(df2_long, years, sites)

# leftjoin the data
df2_long[df1]

Thanks for your input @Gregor on rather changing factors to numeric then other way round.

Comments

0

An one-liner without reshaping and using the efficient row/column indexing from base R would be

df1$newvar <- df2[cbind(df1$sites, match(df1$years,names(df2)))]  
df1
#  years sites        var1     newvar
#1  2001     1 -0.56047565 3.71506499
#2  2002     2 -0.23017749 2.35981383
#3  2003     3  1.55870831 0.03338284
#4  2004     4  0.07050839 1.27110877
#5  2005     5  0.12928774 3.25381492

data

set.seed(123)
df1 <- data.frame(cbind(years = c(2001:2005), sites = c(1:5), var1 = rnorm(n=5)))

df2 <- data.frame(cbind(`2001` = rnorm(n = 5, mean = 2, sd = 1),
                       `2002` = rnorm(n = 5, mean = 2, sd = 1),
                       `2003` = rnorm(n = 5, mean = 2, sd = 1),
                       `2004` = rnorm(n = 5, mean = 2, sd = 1),
                       `2005` = rnorm(n = 5, mean = 2, sd = 1)))
colnames(df2) <- 2001:2005
rownames(df2) <- 1:5

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.