Looping through two dataframes and adding columns inside of the loop

Question

I have a problem when specifying a loop with a data frame. The general idea I have is the following: I have an area which contains a certain number of raster quadrants. These raster quadrants have been visited irregularily over several years (e.g. from 1950 -2015).

I have two data frames: 1) a data frame containing the IDs of the rasterquadrants (and one column for the year of first visit of this quadrant):

df1<- as.data.frame(cbind(c("12345","12346","12347","12348"),rep(NA,4)))
df1[,1]<- as.character(df1[,1])
df1[,2]<- as.numeric(df1[,2])
names(df1)<-c("Raster_Q","First_visit")

2) a data frame that contains the infos on the visits; this one is ordered with by 1st rasterquadrants and then 2nd years. This dataframe has the info when the rasterquadrant was visited and when.

df2<- as.data.frame(cbind(c(rep("12345",5),rep("12346",7),rep("12347",3),rep(12348,9)),
c(1950,1952,1955,1967,1951,1968,1970,
1998,2001,2014,2015,2017,1965,1986,2000,1952,1955,1957,1965,2003,2014,2015,2016,2017)))
df2[,1]<- as.character(df2[,1])
df2[,2]<- as.numeric(as.character(df2[,2]))
names(df2)<-c("Raster_Q","Year")

I want to know when and how often the full area was 'sampled'.

Scheme of what I want to do; different colors indicate different areas/regions

My rationale: I sorted the complete data in df2 according to Quadrant and Year. I then match the rasterquadrant in df1 with the name of the rasterquadrant in df2 and the first value of year from df2 is added.

For this I wrote a loop (see below)

In order not to replicate a quadrant I created a vector "visited"

visited<-c()

Every entry of df2 that matches df1 will be written into this vector, so that the second entry of e.g. rasterquadrant "12345" in df2 is ignored in the loop.

Here comes the loop:

visited<- c()
for (i in 1:nrow(df2)){
index<- which(df1$"Raster_Q"==df2$"Raster_Q"[i])
if(length(index)==0) {next()} else{
  if(df1$"Raster_Q"[index] %in% visited){next()} else{
    df1$"First_visit"[index]<- df2$"Year"[i]
    visited[index]<- df1$"Raster_Q"[index]
   }
 }
}

This gives me the first full sampling period.

    Raster_Q First_visit
1    12345        1950
2    12346        1968
3    12347        1965
4    12348        1952

However, I want to have all full sampling periods.

So I do:

df1$"Second_visit"<-NA

I reset the visited vector and specify the following loop:

visited <- c()
 for (i in 1:nrow(df2)){
  if(df2$Year[i]<=max(df1$"First_visit")){next()} else{
   index<- which(df1$"Raster_Q"==df2$"Raster_Q"[i])
   if(length(index)==0) {next()} else{
   if(df1$"Raster_Q"[index] %in% visited){next()} else{
   df1$"Second_visit"[index]<- df2$"Year"[i]
   visited[index]<- df1$"Raster_Q"[index]
      }
    }
  }
}

Which is basically the same loop as before, however, only making sure that, if df2$"Year" in a certain raster quadrant has already been included in the first visit, then it is skipped.

That gives me the second full sampling period:

      Raster_Q First_visit Second_visit
 1    12345        1950           NA
 2    12346        1968           1970
 3    12347        1965           1986
 4    12348        1952           2003

Okay, so far so good. I could do that all by hand. But I have loads and loads of rasterquadrants and several areas that can and should be screened in this way. So doing all of this in a single loop for this would be really great! However, I realized that this will create a problem because the loop then gets recursive:

The added column will not be included in the subsequent iteration of the loop, because the df1 itself is not re-read for each loop, and in consequence, the new coulmn for the new sampling period will not be included in the following iterations:

visited<- c()
for (i in 1:nrow(df2)){
m<-ncol(df1)
index<- which(df1$"Raster_Q"==df2$"Raster_Q"[i])
if(length(index)==0) {next()} else{
  if(df1$"Raster_Q"[index] %in% visited){next()} else{
    df1[index,m]<- df2$"Year"[i]
    visited[index]<- df1$"Raster_Q"[index]
     #finish "first_visit"

    df1[,m+1]<-NA
     # add column for "second visit"

    if(df2$Year[i]<=max(df1$"First_visit")){next()} else{
     # make sure that the first visit year are not included
    index<- which(df1$"Raster_Q"==df2$"Raster_Q"[i])
    if(length(index)==0) {next()} else{
    if(df1$"Raster_Q"[index] %in% visited){next()} else{
    df1[index,m+1]<- df2$"Year"[i]
    visited[index]<- df1$"Raster_Q"[index]
   }
 }
}

This won't work. Another issue is that the vector visited() is not emptied during this loop, so that basically every Raster_Q has already been visited in the second sampling period.

I am stuck.... any ideas?

tbradley · Accepted Answer · 2017-10-12 13:51:08Z

3

You can do this without a for loop by using the dplyr and tidyr packages. First, you take your df2 and use dplyr::arrange to order by raster and year. Then you can rank the years visited using the rank function inside of the dplyr::mutate function. Then using tidyr::spread you can put them all in their own columns. Here is the code:

df <- df2 %>%
  arrange(Raster_Q, Year) %>%
  group_by(Raster_Q) %>%
  mutate(visit = rank(Year),
         visit = paste0("visit_", as.character(visit))) %>%
  tidyr::spread(key = visit, value = Year)

Here is the output:

> df
# A tibble: 4 x 10
# Groups:   Raster_Q [4]
  Raster_Q visit_1 visit_2 visit_3 visit_4 visit_5 visit_6 visit_7 visit_8 visit_9
*    <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1    12345    1950    1951    1952    1955    1967      NA      NA      NA      NA
2    12346    1968    1970    1998    2001    2014    2015    2017      NA      NA
3    12347    1965    1986    2000      NA      NA      NA      NA      NA      NA
4    12348    1952    1955    1957    1965    2003    2014    2015    2016    2017

EDIT: So I think I understand your problem a little better now. You are looking to remove all duplicate visits to each quadrant that happened before the maximum Year of each respective "round" of visits. So to accomplish this, I wrote a short function that in essence does what the code above does, but with a slight change. Here is the function:

filter_by_round <- function(data, round) {
  output <- data %>%
    arrange(Raster_Q, Year) %>%
    group_by(Raster_Q) %>%
    mutate(visit = rank(Year, ties.method = "first")) %>%
    ungroup() %>%
    mutate(in_round = ifelse(Year <= max(.$Year[.$visit == round]) & visit > round, 
                             TRUE, FALSE)) %>%
    filter(!in_round) %>%
    select(-c(in_round, visit))

  return(output)
}

What this function does, is look through the data and if a given year is less than the max year for the specified "visit round" then it is removed. To apply this only to the first round, you would do this:

df2 %>%
  filter_by_round(1) %>%
  group_by(Raster_Q) %>%
  mutate(visit = rank(Year, ties.method = "first")) %>%
  ungroup() %>% 
  mutate(visit = paste0("visit_", as.character(visit))) %>%
  tidyr::spread(key = visit, value = Year)

which would give you this:

# A tibble: 4 x 8
  Raster_Q visit_1 visit_2 visit_3 visit_4 visit_5 visit_6 visit_7
*    <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1    12345    1950      NA      NA      NA      NA      NA      NA
2    12346    1968    1970    1998    2001    2014    2015    2017
3    12347    1965    1986    2000      NA      NA      NA      NA
4    12348    1952    2003    2014    2015    2016    2017      NA

However, while it does accomplish what your for loop would have, you now have other occurrences of the same problem. I have come up with a way to do this successfully but it requires you to know how many "visit rounds" you had or some trial and error. To accomplish this, you can use map and assign the change to a global variable.

# I do this so we do not lose the original dataset
df <- df2

# I chose 1:5 after some trial and error showed there are 5 unique 
# "visit rounds" in your toy dataset
# However, if you overshoot your number, it should still work, 
# you will just get warnings about `max` not working correctly
# however, this may casue issues, so figuring out your exact number is 
# recommended
purrr::map(1:5, function(x){
  # this assigns the output of each iteration to the global variable df
  df <<- df %>%
    filter_by_round(x)
})

# now applying the original transformation to get the spread dataset
df %>%
  group_by(Raster_Q) %>%
  mutate(visit = rank(Year, ties.method = "first")) %>%
  ungroup() %>% 
  mutate(visit = paste0("visit_", as.character(visit))) %>%
  tidyr::spread(key = visit, value = Year)

This will give you the following output:

# A tibble: 4 x 6
  Raster_Q visit_1 visit_2 visit_3 visit_4 visit_5
*    <chr>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1    12345    1950      NA      NA      NA      NA
2    12346    1968    1970    2014    2015    2017
3    12347    1965    1986      NA      NA      NA
4    12348    1952    2003    2014    2015    2016

granted, this is probably not the most elegant solution, but it works. Hopefully this solves the problem for you

edited Oct 12, 2017 at 13:51

answered Oct 4, 2017 at 16:41

tbradley

2,31015 silver badges22 bronze badges

Sign up to request clarification or add additional context in comments.

7 Comments

DavidEich Over a year ago

Wow! That was fast! And very clear! I am impressed. Thank you!

tbradley Over a year ago

no problem. The dplyr package offers a lot of data munging tools that help you avoid for loops for the most part, as they can be slow in R. It is definitely worth looking into, if you like this solution

DavidEich Over a year ago

Thanks, Bradley. I just got back to the problem and found that your code very efficient. However, it does not take one condition into account: a full sampling period implies taht the second visit of a Raster Quadrant must come in a year after the first full sampling period has finished. This condition is stated in the seventh block of code, where it sets the condition: if(df2$Year[i]<=max(df1$"First_visit")){next()} So, every entry for Raster Quadrant 1 that is smaller than the last year to complete the full sampling period is "within 1st period". Any ideas here?

tbradley Over a year ago

So let me see if i understand correctly, you have some quadrants that have two visits in the first year and you want the second visit excluded from the output? If so, does this only occur in the visit_1 case? Or do you have quadrants with multiple visits per year for mamy years and you want all duplicates removed?

DavidEich Over a year ago

Basically, yes. The point is that single quadrants are visited several times, but that does not mean that every quadrant in the whole area has been visited in this interval. For the first period, every quadrant that was revisited within this period is a "pseudoreplicate" of this period. Thus, I need to know when the first sampling interval is done, so that I can get rid of the pseudoreplicates and only take into account the visites for the second, third, fourth.. period. I hope that makes the the (admittedly complex) situation more clear.

|

Collectives™ on Stack Overflow

Looping through two dataframes and adding columns inside of the loop

1 Answer 1

7 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

7 Comments

Your Answer

Sign up or log in

Post as a guest

Related