R: ifelse statement test involving multiple dataframes

Question

I am trying to create a new variable using ifelse by combining data from two data.frames (similar to this question but without factors).

My problem is that df1 features yearly data, whereas vars in df2 are temporally aggregated: e.g. df1 has multiple obs (1997,1998,...,2005) and df2 only has a range (1900-2001).

For illustration, a 2x2 example would look like

df1 <- data.frame(id = c("2", "20"), year = c(1960, 1870),
                  stringsAsFactors = FALSE)

df2 <- data.frame(id = df1$id, styear = c(1800, 1900), endyear = c(2001, 1950),
                  stringsAsFactors = FALSE)

I want to combine both in such a way that the id (same variable exists in both) is matched, and further, the year in df1 is within the range of df2. I tried the following

df1$new.var <- ifelse(df1$id==df2$id & df1$year>=df2$styear & 
df1$year<df2$endyear,1,0)

Which ideally should return 1 and 0, respectively.

But instead I get warning messages:

1: In df1$id == df2$id : longer object length is not a multiple of shorter object length

2: In df1$year >= df2$styear : longer object length is not a multiple of shorter object length

3: In df1$year < df2$endyear : longer object length is not a multiple of shorter object length

For the record, the 'real' df1 has 500 obs and df2 has 14. How can I make this work?
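The warnings point at R's vector recycling: `==`, `>=` and `<` compare two vectors position by position and recycle the shorter one, so a 500-row column is never looked up against a 14-row column, only lined up with it. A minimal sketch of the effect, using made-up vectors (`long_ids` and `short_ids` are hypothetical stand-ins for `df1$id` and `df2$id`):

```r
# Made-up ids standing in for df1$id (5 values) and df2$id (2 values)
long_ids  <- c(1, 2, 3, 1, 2)
short_ids <- c(1, 2)

# `==` compares position by position, recycling short_ids as 1, 2, 1, 2, 1.
# Because 5 is not a multiple of 2, R would also raise the length warning;
# it is suppressed here only to keep the example quiet.
cmp <- suppressWarnings(long_ids == short_ids)
cmp

# A lookup asks a different question: does each id occur anywhere in short_ids?
found <- long_ids %in% short_ids
found
```

The two results differ from the fourth element on, which is why an elementwise `ifelse` across two data frames of different lengths cannot express "find the matching row"; a join or a per-row lookup is needed instead.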

Edit: I realised some obs in df2 are repeated, with multiple periods e.g.

id    styear    endyear
1      1800      1915
1      1950      2002
2      1912      1988
3      1817      2000

So, I believe what I need is something like a double-ifelse:

df1$new.var <- ifelse(df1$id==df2$id & df1$year>=df2$styear & 
df1$year<df2$endyear | df1$year>=df2$styear & 
df1$year<df2$endyear,1,0)

Obviously, this wouldn't work, but it is a way to get out of the duplicates-problem.

For example, if id=1 in df1$year=1801, it will pass the first year-range test (1801 is between 1800-1915), but fail the second one (1801 is not between 1950-2002), so it is only coded once and no extra rows are added (currently the duplicates add extra rows).

see: rdocumentation.org/packages/data.table/versions/1.9.6/topics/…
– Bulat, Commented Oct 5, 2016 at 21:07
@Bulat Hello, foverlaps was recommended by others too, I can't seem to get it to work - says "Duplicate columns are not allowed in overlap joins. This may change in the future."
– user6550364, Commented Oct 6, 2016 at 11:03

Stephen · Accepted Answer · 2016-10-05 22:45:26Z

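library(dplyr)

df1 <- data.frame(id = c("2", "20"), year = c(1960, 1870),
                  stringsAsFactors = FALSE)
df2 <- data.frame(id = df1$id, styear = c(1800, 1900), endyear = c(2001, 1950),
                  stringsAsFactors = FALSE)

# Join on id, then keep only the rows whose year lies within [styear, endyear]
df3 <- left_join(df1, df2, by = "id") %>% filter(year <= endyear, year >= styear)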

I highly recommend the dplyr package for data manipulation.


4 Comments

user6550364 Over a year ago

dplyr is great indeed, wish I was better at it. Re: your answer, this does work in the sense that the identification is executed correctly when it is "1", but the resulting dataset is only a subset of the original - the "0" cases are listwise deleted. Is there a way to keep df1 intact?

user6550364 Over a year ago

Strangely enough, going from inner_join to left_join did not have an effect - still losing all the obs with 0.

Stephen Over a year ago

try flipping around the order of the data frames in the argument

user6550364 Over a year ago

Changing the order did not have any effect either, still losing around half. However, it led me to reading about dplyr more and now I found a solution of sorts.
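The rows are lost because `filter()` runs after the join and drops every joined row that fails the range test, whichever join type is used. One possible way to keep df1 intact, sketched here under the assumption that dplyr 1.0 or later is installed, is to compute the range test as a column and then collapse to one row per id-year pair. The data below are illustrative (periods taken from the question's edit), `in_range` and `flagged` are hypothetical names, and the bounds follow the question (`>= styear`, `< endyear`):

```r
library(dplyr)

# Illustrative data; id 1 has two periods, id 4 has none
df1 <- data.frame(id = c(1, 2, 4), year = c(1801, 1990, 1990))
df2 <- data.frame(id      = c(1, 1, 2, 3),
                  styear  = c(1800, 1950, 1912, 1817),
                  endyear = c(1915, 2002, 1988, 2000))

flagged <- df1 %>%
  left_join(df2, by = "id") %>%                       # keeps unmatched df1 rows as NA
  mutate(in_range = !is.na(styear) &
           year >= styear & year < endyear) %>%       # test each candidate period
  group_by(id, year) %>%
  summarise(new.var = as.integer(any(in_range)),      # 1 if any period matched
            .groups = "drop")

flagged
```

Note that `summarise()` returns one row per distinct id-year pair, so repeated id-year rows in df1 would be collapsed; and when ids repeat in both data frames, recent dplyr releases may warn about a many-to-many relationship, which is expected for this kind of join.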

Sandipan Dey · Accepted Answer · 2016-10-05 21:10:11Z

With base R:

df1 <- data.frame(id=c(2,20,22), year=c(1960,1870, 2016))
df2 <- data.frame(id=c(2,20,21), styear=c(1800,1900,2000), endyear=c(2001,1950,2016))

df1
  id year
1  2 1960
2 20 1870
3 22 2016

df2
  id styear endyear
1  2   1800    2001
2 20   1900    1950
3 21   2000    2016

df1 <- merge(df1, df2, by='id', all.x = TRUE)
df1$new.var <- !is.na(df1$styear) & df1$year>=df1$styear & df1$year< df1$endyear
df1 <- df1[c('id', 'year', 'new.var')]

df1
  id year new.var
1  2 1960    TRUE
2 20 1870   FALSE
3 22 2016   FALSE


3 Comments

user6550364 Over a year ago

This is more in line with what I have in mind. One thing though - I have repeating id-year combinations in df1 - the code does well when there is no duplicates, but results in a rotating TRUE/FALSE in some cases. Thoughts?

Sandipan Dey Over a year ago

Could we first select the unique id-year pairs from df1 and then create new.var? Once it is created, the duplicate id-year rows that were left out will have exactly the same new.var value, so if you have the count of each duplicate id-year pair, it is just a matter of replicating or adding them back (e.g., with rbind).

user6550364 Over a year ago

I created id-years, but still can't get around the problem of matching. Have you seen the edit at the end of my question? I think that is the cause.
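The alternating TRUE/FALSE is likely a consequence of the repeated periods in df2: after the merge, each df1 row appears once per period for its id, and it can be TRUE for one period and FALSE for another. A minimal base R sketch of one way around this, assuming a temporary row index (`row`, a hypothetical helper column) is acceptable, is to reduce the merged rows back to one value per original df1 row with `any()`:

```r
# Illustrative data: df1 has a duplicated id-year pair, df2 has two periods for id 1
df1 <- data.frame(id = c(1, 1, 1, 2, 4), year = c(1801, 1801, 1930, 1950, 1990))
df2 <- data.frame(id      = c(1, 1, 2, 3),
                  styear  = c(1800, 1950, 1912, 1817),
                  endyear = c(1915, 2002, 1988, 2000))

df1$row <- seq_len(nrow(df1))                    # remember the original rows

m <- merge(df1, df2, by = "id", all.x = TRUE)    # one merged row per candidate period
m$hit <- !is.na(m$styear) & m$year >= m$styear & m$year < m$endyear

# TRUE for an original row if any of its candidate periods matched
hit_by_row <- tapply(m$hit, m$row, any)

df1$new.var <- as.integer(hit_by_row[as.character(df1$row)])
df1$row <- NULL
df1
```

Because the result is indexed by the original row number, df1 keeps its row count and order, and duplicated id-year rows simply receive the same value.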

user6550364 · Accepted Answer · 2016-10-06 13:56:48Z

Alright, I made it work for myself. Beware, it is quite convoluted and probably contains some redundancies. After a brief look at the data wrangling cheatsheet, assuming you have df1 and df2 with an identical var and df2 contains new.var, one can do the following:

library(dplyr)
# Join everything, all values and rows
df3 <- full_join(df1, df2, by = "id")
# Drop obs whose year is greater than endyear
df3 <- filter(df3, year <= endyear)
# Same, the other way around: drop obs whose year is before styear
df3 <- filter(df3, year >= styear)
df3 <- distinct(df3) # remove duplicate rows (at least I had some)

As far as I can tell by looking at the end result, this method only extracts information from the correct time period while dropping all other time periods in df2. Then, it is a matter of merging with the original data.frame (df1) and filling in the NAs:

df1 <- merge(df1, df3, by = "id", all.x = TRUE)
df1 <- distinct(df1) #just to make sure, I still had three
df1$new.var <- ifelse(is.na(df1$new.var),0,df1$new.var)

which is what I wanted.

eddi · Accepted Answer · 2016-10-06 16:55:19Z

This can be solved easily and efficiently using non-equi joins in data.table devel version (1.9.7+):

library(data.table)
setDT(df1); setDT(df2) # converting to data.table in place

df1[, new.var := df2[df1, on = .(id, styear <= year, endyear >= year),
                     .N > 0, by = .EACHI]$V1]
df1
#   id year new.var
#1:  2 1960    TRUE
#2: 20 1870   FALSE

The above join looks for matches in df2 for each row of df1 (by = .EACHI), and checks the number of matching rows (.N).
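The same join should also cover the repeated periods from the question's edit, since `.N > 0` is TRUE as soon as any period for that id contains the year. A small sketch on illustrative data, assuming a data.table version with non-equi join support (1.9.7 devel as in the answer, or a later release); `hits` is a hypothetical intermediate name, and the bounds are inclusive as in the answer's code:

```r
library(data.table)

# Illustrative data: id 1 has two separate periods in df2
df1 <- data.table(id = c(1, 1, 2), year = c(1801, 1930, 1990))
df2 <- data.table(id      = c(1, 1, 2, 3),
                  styear  = c(1800, 1950, 1912, 1817),
                  endyear = c(1915, 2002, 1988, 2000))

# For each row of df1, count the df2 periods with the same id that contain year
hits <- df2[df1, on = .(id, styear <= year, endyear >= year),
            .N > 0, by = .EACHI]

df1[, new.var := hits$V1]   # the unnamed j expression is returned as column V1
df1
```

Here only the first row (id 1, year 1801) falls inside one of its periods; 1930 lies between the two periods for id 1, and 1990 is past the end of the period for id 2.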


4 Comments

user6550364 Over a year ago

This sounds brilliant, however currently does not give me the same results you got - I tried it both on my actual data and on @sandipan's mwe, both give the same error: Error in [.data.table(df2, df1, on = list(id, styear <= year, endyear >= : object 'styear' not found

eddi Over a year ago

@rfsrc start over in a new R session (and make sure you have installed the devel version).

user6550364 Over a year ago

Okay, now a bit better after removing the package and installing the devel. So, now it definitely works on the mwe, but it still gives an error with the actual data: Error in eval(expr, envir, enclos) : object 'styear' not found. I'm quite sure there's a styear in my df2, tried changing class but still no difference.

eddi Over a year ago

@rfsrc I'm not sure how to help you unfortunately without a reproducible example
