Subsetting dataframes stored in a list

Question

I'm having difficulty figuring out how to subset specific data from dataframes stored in a list. I've read numerous articles on this site, as well as the UCLA and Advanced R material on subsetting, and I'm just not making any progress.

My function takes arguments that identify which data I want to pull out across a range of files: for example, dat1, dat2 or dat3 from files 1:15, out of a directory of files numbered 1:999.

Using lapply and read.csv, I have read all of my files (1:15) into a list of dataframes.

 x <- lapply(directory[id], function(i) {
   read.csv(i, header = TRUE)
 })
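
For anyone who wants to reproduce the setup, here is a minimal sketch of the read step. The temporary directory, the file names and the file contents are all made up for illustration; `directory` and `id` only stand in for the objects in the question. Naming the list elements and checking the column names right after reading makes it easier to spot a file whose columns differ from the rest.

```r
# Hypothetical setup: two small CSV files in a temporary directory stand in
# for the real data files.
tmp <- tempfile("csvdemo")
dir.create(tmp)
write.csv(data.frame(DateObv = c("2003-01-01", "2003-01-02"),
                     dat1 = c(NA, 5.99), dat2 = c(NA, 0.428), ID = 1L),
          file.path(tmp, "001.csv"), row.names = FALSE)
write.csv(data.frame(DateObv = c("2003-01-01", "2003-01-02"),
                     dat1 = c(4.68, NA), dat2 = c(1.04, NA), ID = 2L),
          file.path(tmp, "002.csv"), row.names = FALSE)

# Stand-ins for the question's `directory` and `id`
directory <- list.files(tmp, pattern = "\\.csv$", full.names = TRUE)
id <- 1:2

# Read each file; keep strings as character rather than factor
x <- lapply(directory[id], read.csv, header = TRUE, stringsAsFactors = FALSE)
names(x) <- basename(directory[id])

# Confirm every data frame has the expected columns
sapply(x, names)
```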

Here is what str(x) shows (first element only):

List of 15
 $ :'data.frame':   1461 obs. of  4 variables:
  ..$ DateObv   : Factor w/ 1461 levels "2003-01-01","2003-01-02",..: 1 2 3 4 5 6 7 8 9 10 ...
  ..$ dat1: num [1:1461] NA NA NA NA NA NA NA NA NA NA ...
  ..$ dat2: num [1:1461] NA NA NA NA NA NA NA NA NA NA ...
  ..$ ID     : int [1:1461] 1 1 1 1 1 1 1 1 1 1 ...

So in the argument to my function I want to tell it to give me dat1 from files 1:15, and then I'll take the mean of the results.

I thought maybe I could use another lapply to pull dat1 into a vector, but it keeps returning NULL or "list()", or it errors with messages such as the object cannot be subset or subset is missing an argument. I've tried both subset and bracket notation.

How do you recommend that I take a subset of the list of dataframes so that I get back all dat1's or dat2's into a single vector that I can run a mean against?

Thank you for your time and consideration.

I guess you can make use of something like lapply(x, `[[`, 'dat1'), which will return a list of vectors corresponding to the 'dat1' columns from each data frame.
– Marat Talipov, Commented Feb 11, 2015 at 21:35
What exactly is the code you were trying that gave you the error? I would think unlist(lapply(x, "[[", "dat1")) might work. An actual reproducible example would be more useful here than just a description of the structure.
– MrFlick, Commented Feb 11, 2015 at 21:35
Hello @mrflick here's a sample of observation 1.

 Date        dat1   dat2   ID
 10/11/2003  NA     NA     1
 10/12/2003  5.99   0.428  1
 10/13/2003  NA     NA     1
 10/14/2003  NA     NA     1
 10/15/2003  NA     NA     1
 10/16/2003  NA     NA     1
 10/17/2003  NA     NA     1
 10/18/2003  4.68   1.04   1
 10/19/2003  NA     NA     1
 10/20/2003  NA     NA     1
 10/21/2003  NA     NA     1
 10/22/2003  NA     NA     1
 10/23/2003  NA     NA     1
 10/24/2003  3.47   0.363  1
 10/25/2003  NA     NA     1
 10/26/2003  NA     NA     1
 10/27/2003  NA     NA     1
 10/28/2003  NA     NA     1
 10/29/2003  NA     NA     1
 10/30/2003  2.42   0.507  1
– Zach, Commented Feb 11, 2015 at 22:00
@MrFlick I tried your method, y <- unlist(lapply(x, "[[", "dat1")), and it just returns "NULL", which leads me to believe that my dat1 argument isn't actually being used. I know that a simple print(dat1) inside the function shows my argument.
– Zach, Commented Feb 11, 2015 at 22:20
What you posted above in the comments is not a reproducible example. Try dput()-ing a sample object or build a sample list in your original question. Read the link I provided for other examples. There must be something different going on than what you described.
– MrFlick, Commented Feb 11, 2015 at 22:23
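
A small reproducible sketch of the approach suggested in the comments, using a made-up list shaped like the one in the question (the values are illustrative only). It also shows one possible cause of the NULL result. This is an assumption, since the original function was never posted: when the name passed to `[[` matches no column, every element comes back NULL and unlist() collapses the lot to NULL.

```r
# Made-up sample list in the shape described in the question
x <- list(
  data.frame(dat1 = c(NA, 5.99, NA), dat2 = c(NA, 0.428, NA), ID = 1L),
  data.frame(dat1 = c(4.68, NA, 3.47), dat2 = c(1.04, NA, 0.363), ID = 2L)
)

# Extract one column from every data frame and flatten into a single vector
all_dat1 <- unlist(lapply(x, "[[", "dat1"))
mean(all_dat1, na.rm = TRUE)

# A name that matches no column (note the capital D) yields NULL for each
# data frame, and unlist() of a list of NULLs is NULL
missing_col <- unlist(lapply(x, "[[", "Dat1"))
is.null(missing_col)
```

Checking names(x[[1]]) against the string actually passed in is a quick way to rule this out.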

shirewoman2 · Accepted Answer · 2015-02-11 21:41:24Z

1

I love plyr for this sort of thing. I would do something like this if you want the mean for each data.frame:

 library(plyr)
 ldply(x, summarize, Mean = mean(dat1))

or, if you want a long vector of all the dat1 columns and you want to take the mean of all of them, I'd still use plyr but do this:

 x <- rbind.fill(x)
 mean(x$dat1)
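
Note that mean() returns NA as soon as the column contains an NA, and the data in the question is mostly NA. A base R sketch of the same two computations with na.rm = TRUE, using made-up data; this is an alternative to the plyr calls, not a description of what they do internally, and the names per_frame, combined and pooled are just for illustration:

```r
# Made-up list of data frames with some missing values
x <- list(
  data.frame(dat1 = c(1, NA, 3), dat2 = 10),
  data.frame(dat1 = c(2, 4, NA), dat2 = 10)
)

# Mean of dat1 within each data frame, ignoring NA
per_frame <- sapply(x, function(d) mean(d$dat1, na.rm = TRUE))

# Stack all data frames (they share the same columns) and take one pooled mean
combined <- do.call(rbind, x)
pooled <- mean(combined$dat1, na.rm = TRUE)
```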


Edzer Pebesma · Accepted Answer · 2015-02-11 21:49:34Z

0

Create a similar data set:

> x = list(data.frame(dat1 = 1:3,dat2=10), data.frame(dat1 = 2:4,dat2=10))
> str(x)
List of 2
 $ :'data.frame':   3 obs. of  2 variables:
  ..$ dat1: int [1:3] 1 2 3
  ..$ dat2: num [1:3] 10 10 10
 $ :'data.frame':   3 obs. of  2 variables:
  ..$ dat1: int [1:3] 2 3 4
  ..$ dat2: num [1:3] 10 10 10

Use lapply to select variable dat1:

> lapply(x, function(X) X$dat1)
[[1]]
[1] 1 2 3

[[2]]
[1] 2 3 4

Combine the resulting list into a single vector with c, call mean on it, and pass na.rm=TRUE to drop the NA values:

> mean(do.call(c, lapply(x, function(X) X$dat1)),na.rm=TRUE)
[1] 2.5
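
Since the question describes passing the column of interest as a function argument, here is a sketch of how that might look. The helper name pooled_mean and its parameters are hypothetical. The key point is that `$` takes its name literally, so X$col would look for a column called "col"; `[[` evaluates the variable instead.

```r
# Hypothetical helper: `col` is a column name given as a character string
pooled_mean <- function(dfs, col) {
  # d[[col]] uses the value of `col`; d$col would not
  values <- unlist(lapply(dfs, function(d) d[[col]]))
  mean(values, na.rm = TRUE)
}

x <- list(data.frame(dat1 = 1:3, dat2 = 10), data.frame(dat1 = 2:4, dat2 = 10))
pooled_mean(x, "dat1")
pooled_mean(x, "dat2")
```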


3 Comments

Zach Over a year ago

Hi Edzer, thank you for the feedback. When I do it your way I get the following warning:

 Warning message:
 In mean.default(do.call(c, lapply(x, function(X) X$dat1)), :
   argument is not numeric or logical: returning NA

dat1 is an argument passed into the function that makes sure the function only selects dat1 for vectorization and mean calculation.

Zach Over a year ago

There has to be something going on with my function in general because if I try the sample list as provided I can subset it with no issues. The mean function still doesn't work but at least the subsetting does. So that gives me something to explore more.

Edzer Pebesma Over a year ago

This warning indicates that some dat1 vectors are not numeric but, for instance, factors, unlike the one you show above. I'd suggest building in a check that catches this.
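
A minimal sketch of such a check, on made-up data where the second data frame holds dat1 as a factor (as can happen when a file contains a stray non-numeric entry). The helper to_numeric is hypothetical, and converting via as.character is only one way to handle it; entries that cannot be parsed become NA.

```r
# Made-up list: the second dat1 is a factor because of a non-numeric entry
x <- list(
  data.frame(dat1 = c(1.5, NA, 2.5)),
  data.frame(dat1 = factor(c("3.5", "n/a", "4.5")))
)

# Check which data frames have a non-numeric dat1
is_num <- sapply(x, function(d) is.numeric(d$dat1))
which(!is_num)

# Hypothetical helper: convert via character so factor labels, not the
# underlying integer codes, are parsed; unparseable entries become NA
to_numeric <- function(v) {
  if (is.numeric(v)) v else suppressWarnings(as.numeric(as.character(v)))
}

values <- unlist(lapply(x, function(d) to_numeric(d$dat1)))
mean(values, na.rm = TRUE)
```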
