How to subset multiple dataframes from a large list R

Question

I have a large list of dataframes that I would like to subset and reorder, so that they can be horizontally joined. Ideally, I would merge them, however this consistently exceeds my computer's capabiltiies, so I need to join them manually.

My large list is a list of 10 dataframes, all with 10 unique variables and one shared variable, with over 1000 rows for each dataframe:

df1:

ID, V2, V3, etc.

df2:

ID, V2_2, V3_2, etc.

and so forth.

All of these dataframes are contained within the same large list. How can I subset all of these dataframes within that large list simultaneously, such that they only include observations where ID > 100? Similarly, how can I order all of these dataframes simultaneously by the ID column, so that I can then bind them horizontally?

This seems to be the only way to have the dataframes merge, because merging the dataframes by ID directly does not appear to work, since the size of the dataframes (even though IDs are all the same for each dataframe, and each ID is unique) causes my R to freeze.

10 x 10 x 1000's shouldn't make R crashing. Can you add the code you tried to use and an example of your data? — harre
– harre, Commented Aug 1, 2022 at 17:49
@harre unfortunately, I cannot provide the data, but I have tried various types of merging codes, including merge <- Reduce(function(x, y) merge(x, y, by = "ID"), list) or manually doing it in sequence like: merge <- merge(list[1],list[2], by='ID') %>% merge(., list[3], by='ID') %>% etc. — flâneur
– flâneur, Commented Aug 1, 2022 at 17:53
To get good answers then, you might consider: stackoverflow.com/questions/5963269/… — harre
– harre, Commented Aug 1, 2022 at 18:02

Rui Barradas · Accepted Answer · 2022-08-01 21:11:57Z

4

Try the following. Untested, since there are no test data.

1. Filter and sort

Filter ID > 100 and sort by ID.

df_list <- lapply(df_list, \(x) {
  x <- x[x$ID > 100, ]
  x[order(x$ID), ]
})

2. Merge the data.frames

This is the complicated part. The code below creates a result data.frame df_all and then puts each data.frame in df_list in its place by matching the id's. This is probably faster and no merge and Reduce function calls are needed.

all_cols <- unique(sapply(df_list, names))
all_rows <- unique(sapply(df_list, `[[`, 'ID'))
all_rows <- sort(all_rows)
df_all <- matrix(nrow = length(all_rows), ncol = length(all_cols))
df_all <- as.data.frame(df_all)
names(df_all) <- all_cols
df_all$ID <- all_rows

for(i in seq_along((df_list))) {
  i <- match(df_list[[i]]$ID, all_rows)
  df_all[i, names(df_list[[i]])] <- df_list[[i]]
}

edited Aug 1, 2022 at 21:11

answered Aug 1, 2022 at 19:16

Rui Barradas

78k8 gold badges41 silver badges75 bronze badges

Sign up to request clarification or add additional context in comments.

3 Comments

flâneur Over a year ago

Thank you, this first solution works. The second part makes sense, but I receive errors. On the first line, I receive the following Error in base::union(x, y, ...) : argument "y" is missing, with no default Any idea of a change to be made? If it could work as well, that would be ideal to avoid the merge/reduce issues

Rui Barradas Over a year ago

@flâneur Ok, It should be unique, in my tests I had more than one vector. Edited, see now if it works.

Victor Shin Over a year ago

Thank you for this solution! I adapted it for a similar problem I had. I just have one question so I understand the syntax, why is "\" necessary before (x) in the line lapply(df_list, (x)?

SamR · Accepted Answer · 2022-08-01 18:24:50Z

1

This should return the desired list:

lapply(df_list, \(df) df[df$ID > 100, ])

EDIT

Update after comment that you have a list of lists. Try this:

lapply(df_list, \(df) {
    df  <- data.frame(unlist(df, recursive=FALSE))
    df[df$ID > 100, ]
    }
)

edited Aug 1, 2022 at 18:24

answered Aug 1, 2022 at 17:53

SamR

23.1k4 gold badges23 silver badges55 bronze badges

5 Comments

flâneur Over a year ago

Thank you, I see the rational behind this, however all this returns in the console is the following ``` [[1]] [[2]] [[3]] [[4]] [[5]] [[6]] [[7]] [[8]] [[9]] [[10]] ``` with seemingly no change to any of the dataframes. Any reason why this is happening?

SamR Over a year ago

So it is returning 10 empty dataframes? Hard to say why without a reproducible example of your data using dput(). Is your ID column a numeric vector? If you have an ID column and some values are above 100 this should work.

flâneur Over a year ago

Apologies for not being able to provide the example now. It appears to return the exact same dataframe as before, and just reads those numbers in the console. I am wondering if it is an issue with the lapply function I used to create the large list of dataframes. The large list appears to actually be a large list of large lists, which I guess makes it difficult to operate with. Is there a way to prevent the double large listing?

SamR Over a year ago

Oh it's a list of lists? I thought it was a list of data frames. How many data frames does each sublist contain?

flâneur Over a year ago

Each sublist is one dataframe. It seems I created lists or "tibbles" when in reality each should be a dataframe. Is there a way to convert each sublist into a dataframe within the large list? Then your solution would likely work.

Collectives™ on Stack Overflow

How to subset multiple dataframes from a large list R

2 Answers 2

1. Filter and sort

2. Merge the data.frames

3 Comments

5 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

1. Filter and sort

2. Merge the data.frames

3 Comments

5 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related