
Following up on my previous question: I work with a large number of data frames in R, each of which has a different number of columns. I want to align these datasets so that all of them have the same columns, with NA values in the newly added columns. I have written a loop, but I am not sure how to update the actual data frames.

first_df   = data.frame(matrix(rnorm(20), nrow=10))
second_df  = data.frame(matrix(rnorm(20), nrow=4))
third_df   = data.frame(matrix(rnorm(20), nrow=5))

library(tidyverse)

min_max <- mget(ls(pattern = "_df")) %>%
  map_dbl(ncol) %>%
  enframe() %>%
  arrange(value) %>%
  slice(1, n())

min_max

# A tibble: 2 x 2
#  name      value
#  <chr>     <dbl>
#1 first_df      2
#2 second_df     5

diff <- setdiff(names(get(min_max$name[2])), names(get(min_max$name[1])))

for (col_name in diff) {

  # all dataframes whose names contain "_df"
  for (df_index in 1:length(ls(pattern = "_df"))) {

    # capturing the dataframe
    data = get(ls(pattern = "_df")[df_index])

    if (!(col_name %in% names(data))) {
      data[, col_name] <- NA
    }

    # I don't know how to update the real datasets
    # get(ls(pattern = "_df")[df_index]) <- data
  }
}

2 Answers

1

I looked it up quickly, and the solution is the assign() function, which binds a value to a variable whose name is given as a string.
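
As a minimal illustration of that idea (the object name `demo_df` is made up for this sketch and is not part of the question), get() fetches an object by its name and assign() writes a value back under a name:

```r
# Hypothetical object name, used only for this illustration
obj_name <- "demo_df"

# Create a data frame under that name in the current environment
assign(obj_name, data.frame(x = 1:3))

# Fetch a copy by name, modify it, and write it back under the same name
tmp <- get(obj_name)
tmp$y <- NA
assign(obj_name, tmp)

names(demo_df)
```

Note that get() returns a copy, so changes to `tmp` only reach `demo_df` once assign() is called.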

Here is your reprex using assign(). I have also read that it is often more convenient to gather the data frames into one named list and modify the list elements by name, instead of overwriting separate variables.

first_df   = data.frame(matrix(rnorm(20), nrow=10))
second_df  = data.frame(matrix(rnorm(20), nrow=4))
third_df   = data.frame(matrix(rnorm(20), nrow=5))

library(tidyverse)

min_max <- mget(ls(pattern = "_df")) %>%
  map_dbl(ncol) %>%
  enframe() %>%
  arrange(value) %>%
  slice(1, n())

min_max

diff <- setdiff(names(get(min_max$name[2])), names(get(min_max$name[1])))

# Names of all data frames to update, looked up once
df_names <- ls(pattern = "_df")

for (col_name in diff) {
  for (df_name in df_names) {

    # Fetch a copy of the data frame by name
    data <- get(df_name)

    if (!(col_name %in% names(data))) {
      # Add the missing column, filled with NA
      data[, col_name] <- NA
      # Write the modified copy back under the original name
      assign(df_name, data)
    }
  }
}
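
The list-based idea mentioned above could look roughly like the following base R sketch. It is not taken from the question: the helper `pad_columns()` is a hypothetical name, and the sketch assumes that columns with the same name mean the same thing in every data frame. Unlike the loop, it uses the union of all column names rather than only the difference between the widest and the narrowest data frame.

```r
# Example data, same shapes as in the question
first_df  <- data.frame(matrix(rnorm(20), nrow = 10))  # 2 columns
second_df <- data.frame(matrix(rnorm(20), nrow = 4))   # 5 columns
third_df  <- data.frame(matrix(rnorm(20), nrow = 5))   # 4 columns

# Collect the data frames in one named list
df_list <- list(first_df = first_df, second_df = second_df, third_df = third_df)

# Union of all column names across the list, in order of first appearance
all_cols <- unique(unlist(lapply(df_list, names)))

# Hypothetical helper: add each missing column as NA, then put columns in a common order
pad_columns <- function(df, cols) {
  for (col in setdiff(cols, names(df))) {
    df[[col]] <- NA
  }
  df[cols]
}

df_list <- lapply(df_list, pad_columns, cols = all_cols)

# Every element now has the same number of columns
sapply(df_list, ncol)
```

If the individual variables really need to be updated as well, list2env(df_list, envir = globalenv()) writes each list element back under its name.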

1

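Here's an alternative that does away with the loop. It uses dplyr::bind_rows(), which stacks the data frames row-wise, matches columns by name, and fills in NA wherever a data frame lacks a column. The stacked result therefore has as many columns as the widest data frame, and splitting it apart again gives every piece the same columns.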

first_df   = data.frame(matrix(rnorm(20), nrow=10))
second_df  = data.frame(matrix(rnorm(20), nrow=4))
third_df   = data.frame(matrix(rnorm(20), nrow=5))

library(tidyverse)

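# Anchor the pattern so only names ending in "_df" match
# (otherwise objects such as new_df_list would be picked up on a re-run)
df_names <- ls(pattern = "_df$")
df_list <- mget(df_names)

new_df_list <-
  df_list %>%
  bind_rows(.id = "id") %>%        # stack rows; columns matched by name, gaps filled with NA
  group_split(id) %>%              # split back into one data frame per id
  set_names(df_names) %>%          # restore names (relies on group order matching df_names)
  map(~ dplyr::select(.x, -id))    # drop the helper id column

# save each df back to the global environment
list2env(new_df_list, globalenv())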

# check
first_df
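
To see the NA filling in isolation, here is a small sketch with two made-up data frames (`a` and `b` are hypothetical and not part of the question). It assumes dplyr is installed.

```r
library(dplyr)

# Two tiny, made-up data frames with partly different columns
a <- data.frame(X1 = 1:2, X2 = c(0.5, 1.5))
b <- data.frame(X1 = 3L, X3 = "z")

# Columns are matched by name; .id records which list element each row came from
stacked <- bind_rows(list(a = a, b = b), .id = "id")

names(stacked)       # the id column plus the union of the input columns
is.na(stacked$X3)    # TRUE for the rows that came from `a`, which has no X3
```

Because every data frame passes through one combined table, columns that share a name also need compatible types. As far as I know, bind_rows() raises an error when, for example, a numeric column and a character column have the same name.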
