2

I have a large list of dataframes that I would like to subset and reorder, so that they can be horizontally joined. Ideally, I would merge them, however this consistently exceeds my computer's capabiltiies, so I need to join them manually.

My large list is a list of 10 dataframes, all with 10 unique variables and one shared variable, with over 1000 rows for each dataframe:

df1:

ID, V2, V3, etc.

df2:

ID, V2_2, V3_2, etc.

and so forth.

All of these dataframes are contained within the same large list. How can I subset all of these dataframes within that large list simultaneously, such that they only include observations where ID > 100? Similarly, how can I order all of these dataframes simultaneously by the ID column, so that I can then bind them horizontally?

This seems to be the only way to have the dataframes merge, because merging the dataframes by ID directly does not appear to work, since the size of the dataframes (even though IDs are all the same for each dataframe, and each ID is unique) causes my R to freeze.

3
  • 2
    10 x 10 x 1000's shouldn't make R crashing. Can you add the code you tried to use and an example of your data? Commented Aug 1, 2022 at 17:49
  • @harre unfortunately, I cannot provide the data, but I have tried various types of merging codes, including merge <- Reduce(function(x, y) merge(x, y, by = "ID"), list) or manually doing it in sequence like: merge <- merge(list[1],list[2], by='ID') %>% merge(., list[3], by='ID') %>% etc. Commented Aug 1, 2022 at 17:53
  • To get good answers then, you might consider: stackoverflow.com/questions/5963269/… Commented Aug 1, 2022 at 18:02

2 Answers 2

4

Try the following. Untested, since there are no test data.

1. Filter and sort

Filter ID > 100 and sort by ID.

df_list <- lapply(df_list, \(x) {
  x <- x[x$ID > 100, ]
  x[order(x$ID), ]
})

2. Merge the data.frames

This is the complicated part. The code below creates a result data.frame df_all and then puts each data.frame in df_list in its place by matching the id's. This is probably faster and no merge and Reduce function calls are needed.

all_cols <- unique(sapply(df_list, names))
all_rows <- unique(sapply(df_list, `[[`, 'ID'))
all_rows <- sort(all_rows)
df_all <- matrix(nrow = length(all_rows), ncol = length(all_cols))
df_all <- as.data.frame(df_all)
names(df_all) <- all_cols
df_all$ID <- all_rows

for(i in seq_along((df_list))) {
  i <- match(df_list[[i]]$ID, all_rows)
  df_all[i, names(df_list[[i]])] <- df_list[[i]]
}
Sign up to request clarification or add additional context in comments.

3 Comments

Thank you, this first solution works. The second part makes sense, but I receive errors. On the first line, I receive the following Error in base::union(x, y, ...) : argument "y" is missing, with no default Any idea of a change to be made? If it could work as well, that would be ideal to avoid the merge/reduce issues
@flâneur Ok, It should be unique, in my tests I had more than one vector. Edited, see now if it works.
Thank you for this solution! I adapted it for a similar problem I had. I just have one question so I understand the syntax, why is "\" necessary before (x) in the line lapply(df_list, (x)?
1

This should return the desired list:

lapply(df_list, \(df) df[df$ID > 100, ])

EDIT

Update after comment that you have a list of lists. Try this:

lapply(df_list, \(df) {
    df  <- data.frame(unlist(df, recursive=FALSE))
    df[df$ID > 100, ]
    }
)

5 Comments

Thank you, I see the rational behind this, however all this returns in the console is the following ``` [[1]] [[2]] [[3]] [[4]] [[5]] [[6]] [[7]] [[8]] [[9]] [[10]] ``` with seemingly no change to any of the dataframes. Any reason why this is happening?
So it is returning 10 empty dataframes? Hard to say why without a reproducible example of your data using dput(). Is your ID column a numeric vector? If you have an ID column and some values are above 100 this should work.
Apologies for not being able to provide the example now. It appears to return the exact same dataframe as before, and just reads those numbers in the console. I am wondering if it is an issue with the lapply function I used to create the large list of dataframes. The large list appears to actually be a large list of large lists, which I guess makes it difficult to operate with. Is there a way to prevent the double large listing?
Oh it's a list of lists? I thought it was a list of data frames. How many data frames does each sublist contain?
Each sublist is one dataframe. It seems I created lists or "tibbles" when in reality each should be a dataframe. Is there a way to convert each sublist into a dataframe within the large list? Then your solution would likely work.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.