
I am trying to create a new variable using ifelse by combining data from two data.frames (similar to this question but without factors).

My problem is that df1 features yearly data, whereas vars in df2 are temporally aggregated: e.g. df1 has multiple obs (1997,1998,...,2005) and df2 only has a range (1900-2001).

For illustration, a 2x2 example would look like

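# Build the data frames first (assigning to df1$id fails if df1 does not exist yet);
# years are numeric so the comparisons below are numeric, not string-based
df1 <- data.frame(id = c(2, 20), year = c(1960, 1870))

df2 <- data.frame(id = df1$id,
                  styear = c(1800, 1900),
                  endyear = c(2001, 1950))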

I want to combine both in such a way that the id (same variable exists in both) is matched, and further, the year in df1 is within the range of df2. I tried the following

df1$new.var <- ifelse(df1$id==df2$id & df1$year>=df2$styear & 
df1$year<df2$endyear,1,0)

Which ideally should return 1 and 0, respectively.

But instead I get warning messages:

1: In df1$id == df2$id : longer object length is not a multiple of shorter object length

2: In df1$year >= df2$styear : longer object length is not a multiple of shorter object length

3: In df1$year < df2$endyear : longer object length is not a multiple of shorter object length

For the record, the 'real' df1 has 500 obs and df2 has 14. How can I make this work?
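As a side note on the warnings: `==`, `>=` and `<` compare two vectors position by position and recycle the shorter one, so they never look up "the row of df2 with the same id". A minimal sketch of that behaviour, using made-up toy vectors `a` and `b` rather than the real data:

```r
# Hypothetical toy vectors: 5 elements vs 2, so the lengths do not divide evenly
a <- c(2, 20, 20, 2, 2)
b <- c(2, 20)

# b is recycled to 2, 20, 2, 20, 2 and compared position by position
# (R warns here; the warning is suppressed only to keep the example quiet)
positional <- suppressWarnings(a == b)
positional

# A membership test, by contrast, asks whether each value of a occurs anywhere in b
membership <- a %in% b
membership
```

With 500 rows against 14, the same recycling happens silently wrong and then warns because 500 is not a multiple of 14.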

Edit: I realised some obs in df2 are repeated, with multiple periods e.g.

id    styear    endyear
1      1800      1915
1      1950      2002
2      1912      1988
3      1817      2000

So, I believe what I need is something like a double-ifelse:

df1$new.var <- ifelse(df1$id==df2$id & df1$year>=df2$styear & 
df1$year<df2$endyear | df1$year>=df2$styear & 
df1$year<df2$endyear,1,0)

Obviously, this wouldn't work, but it is a way to get out of the duplicates-problem.

For example, if a row in df1 has id=1 and year=1801, it passes the first year-range test (1801 is between 1800 and 1915) but fails the second one (1801 is not between 1950 and 2002). It should therefore be coded 1 exactly once, with no extra rows added (currently the duplicates add extra rows).
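The "any matching period" logic described above can be sketched in base R by checking each df1 row against all periods of the same id. This is only an illustration: the df1 rows and the helper name `in_any_period` are made up, and df2 is the small table from the edit.

```r
# Toy data: df2 is the table from the edit, the df1 rows are hypothetical
df1 <- data.frame(id = c(1, 1, 2, 4), year = c(1801, 1930, 1987, 1990))
df2 <- data.frame(id      = c(1, 1, 2, 3),
                  styear  = c(1800, 1950, 1912, 1817),
                  endyear = c(1915, 2002, 1988, 2000))

# Hypothetical helper: 1 if any period of this id contains the year, else 0
in_any_period <- function(id, year) {
  as.integer(any(df2$id == id & year >= df2$styear & year < df2$endyear))
}

# One call per row of df1, so df1 keeps exactly its original rows
df1$new.var <- mapply(in_any_period, df1$id, df1$year)
df1
```

An id that is absent from df2 (id 4 here) simply gets 0, because `any()` over all-FALSE comparisons is FALSE.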

  • see: rdocumentation.org/packages/data.table/versions/1.9.6/topics/… Commented Oct 5, 2016 at 21:07
  • @Bulat Hello, foverlaps was recommended by others too, I can't seem to get it to work - says "Duplicate columns are not allowed in overlap joins. This may change in the future." Commented Oct 6, 2016 at 11:03
  • can you provide a reproducible example please. Commented Oct 6, 2016 at 15:06

4 Answers

df1 <- data.frame(id = c(2, 20), year = c(1960, 1870))

df2 <- data.frame(id = df1$id,
                  styear = c(1800, 1900),
                  endyear = c(2001, 1950))

library(dplyr)
# Join on id, then keep only the rows whose year falls inside the period
df3 <- left_join(df1, df2, by = "id") %>% filter(year <= endyear, year >= styear)

I highly recommend the dplyr package for data manipulation.


4 Comments

dplyr is great indeed, wish I was better at it. Re: your answer, this does work in the sense that the identification is executed correctly when it is "1", but the resulting dataset is only a subset of the original - the "0" cases are listwise deleted. Is there a way to keep df1 intact?
Strangely enough, going from inner_join to left_join did not have an effect - still losing all the obs with 0.
try flipping around the order of the data frames in the argument
Changing the order did not have any effect either, still losing around half. However, it led me to reading about dplyr more and now I found a solution of sorts.
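Regarding the rows that go missing in the comments above: `filter()` drops every row that fails the condition, whichever join is used. A possible sketch that keeps df1 intact is to compute the flag with `mutate()` instead. The toy data below (including the unmatched id 22) is an assumption for illustration, and the strict `year < endyear` follows the question rather than this answer.

```r
library(dplyr)

# Hypothetical toy data; id 22 has no counterpart in df2
df1 <- data.frame(id = c(2, 20, 22), year = c(1960, 1870, 2016))
df2 <- data.frame(id = c(2, 20), styear = c(1800, 1900), endyear = c(2001, 1950))

df3 <- left_join(df1, df2, by = "id") %>%
  # flag instead of filter: rows outside the range (or without a match) become 0
  mutate(new.var = as.integer(!is.na(styear) & year >= styear & year < endyear)) %>%
  select(id, year, new.var)
df3
```

This assumes each id appears at most once in df2; with several periods per id the join would still add rows.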

With base R:

df1 <- data.frame(id=c(2,20,22), year=c(1960,1870, 2016))
df2 <- data.frame(id=c(2,20,21), styear=c(1800,1900,2000), endyear=c(2001,1950,2016))

df1
  id year
1  2 1960
2 20 1870
3 22 2016

df2
  id styear endyear
1  2   1800    2001
2 20   1900    1950
3 21   2000    2016

df1 <- merge(df1, df2, by='id', all.x = TRUE)
df1$new.var <- !is.na(df1$styear) & df1$year>=df1$styear & df1$year< df1$endyear
df1 <- df1[c('id', 'year', 'new.var')]

df1
  id year new.var
1  2 1960    TRUE
2 20 1870   FALSE
3 22 2016   FALSE

3 Comments

This is more in line with what I have in mind. One thing though - I have repeating id-year combinations in df1 - the code does well when there are no duplicates, but results in a rotating TRUE/FALSE in some cases. Thoughts?
Could we first select the unique id-year pairs from df1 and then create new.var? Once it is created, the duplicate id-year rows from df1 that were left out will have exactly the same new.var value, so if you have the count of the duplicate id-year pairs, it is just a matter of replicating / adding them back (e.g., with rbind).
I created id-years, but still can't get around the problem of matching. Have you seen the edit at the end of my question? I think that is the cause.
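The rotating TRUE/FALSE mentioned in these comments is what one would expect when an id has several periods in df2: the merge then returns one row per period. A base R sketch of one way around it is to tag each df1 row, merge, and collapse back with `any()`. The toy df1 rows and the temporary `row` column are assumptions for illustration.

```r
# Toy data: df1 has a repeated id-year pair, df2 has two periods for id 1
df1 <- data.frame(id = c(1, 1, 1, 2), year = c(1801, 1801, 1930, 1990))
df2 <- data.frame(id      = c(1, 1, 2, 3),
                  styear  = c(1800, 1950, 1912, 1817),
                  endyear = c(1915, 2002, 1988, 2000))

df1$row <- seq_len(nrow(df1))                   # temporary key for each original row
m <- merge(df1, df2, by = "id", all.x = TRUE)   # one row per (df1 row, period)
m$hit <- !is.na(m$styear) & m$year >= m$styear & m$year < m$endyear

# Collapse to one value per original row: TRUE if any period matched
hit_by_row <- tapply(m$hit, m$row, any)
df1$new.var <- as.vector(hit_by_row[as.character(df1$row)])
df1$row <- NULL
df1
```

Because the result is indexed by the original row key, df1 keeps its row count and order, duplicates included.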

Alright, I made it work for myself. Beware, it is quite convoluted and probably contains some redundancies. After a brief look at the data wrangling cheatsheet, assuming df1 and df2 share an id variable and df2 contains new.var, one can do the following:

library(dplyr)
# Join everything, all values and rows
df3 <- full_join(df1, df2, by = "id")
# Drop obs whose year is greater than endyear
df3 <- filter(df3, year <= endyear)
# Same, the other way around: drop obs whose year is before styear
df3 <- filter(df3, year >= styear)
df3 <- distinct(df3) # remove duplicate rows (at least I had some)

As far as I can tell by looking at the end result, this method only extracts information from the correct time period while dropping all other time periods in df2. Then, it is a matter of merging with the original data.frame (df1) and filling in the NAs:

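# by must be a character vector; matching on year as well keeps each id-year pair with its own period
df1 <- merge(df1, df3, by = c("id", "year"), all.x = TRUE)
df1 <- distinct(df1) # just to make sure, I still had three
df1$new.var <- ifelse(is.na(df1$new.var), 0, df1$new.var)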

which is what I wanted.



This can be solved easily and efficiently using non-equi joins in data.table devel version (1.9.7+):

library(data.table)
setDT(df1); setDT(df2) # converting to data.table in place

df1[, new.var := df2[df1, on = .(id, styear <= year, endyear >= year),
                     .N > 0, by = .EACHI]$V1]
df1
#   id year new.var
#1:  2 1960    TRUE
#2: 20 1870   FALSE

The above join looks for matches in df2 for each row of df1 (by = .EACHI), and checks the number of matching rows (.N).
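For the repeated-period case from the question's edit, the same join can be tried on a small table with two periods for one id. This is a sketch on made-up df1 rows, assuming a data.table version that supports non-equi joins:

```r
library(data.table)

# df2 is the table from the question's edit; the df1 rows are hypothetical
df1 <- data.table(id = c(1, 1, 2), year = c(1801, 1930, 1990))
df2 <- data.table(id      = c(1, 1, 2, 3),
                  styear  = c(1800, 1950, 1912, 1817),
                  endyear = c(1915, 2002, 1988, 2000))

# For each row of df1, count the df2 periods that contain its year
df1[, new.var := df2[df1, on = .(id, styear <= year, endyear >= year),
                     .N > 0, by = .EACHI]$V1]
df1
```

Since the result has one value per row of df1, several periods for the same id do not add rows.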

4 Comments

This sounds brilliant, however it currently does not give me the same results you got - I tried it both on my actual data and on @sandipan's mwe, and both give the same error: Error in [.data.table(df2, df1, on = list(id, styear <= year, endyear >= : object 'styear' not found
@rfsrc start over in a new R session (and make sure you have installed the devel version).
Okay, now a bit better after removing the package and installing the devel. So, now it definitely works on the mwe, but it still gives an error with the actual data: Error in eval(expr, envir, enclos) : object 'styear' not found. I'm quite sure there's a styear in my df2, tried changing class but still no difference.
@rfsrc I'm not sure how to help you unfortunately without a reproducible example
