I'm having difficulty figuring out how to subset some specific data from dataframes stored in a list. I've read numerous articles on this site as well as UCLA and Adv-R and I'm just not making any progress.

Advanced-R for Subsetting UCLA Advanced R for Subsetting

My function takes arguments that identify which data I want to pull out across a range of files: for example, dat1, dat2 or dat3 from files 1:15, which are stored in a directory of files (1:999).

Using lapply and read.csv, I have read all of my files (1:15) into a list of dataframes.

 x <- lapply(directory[id], function(i) {
   read.csv(i, header = TRUE)
 })

An example looks like this via str(x) [of just the first element]:

List of 15
 $ :'data.frame':   1461 obs. of  4 variables:
  ..$ DateObv   : Factor w/ 1461 levels "2003-01-01","2003-01-02",..: 1 2 3 4 5 6 7 8 9 10 ...
  ..$ dat1: num [1:1461] NA NA NA NA NA NA NA NA NA NA ...
  ..$ dat2: num [1:1461] NA NA NA NA NA NA NA NA NA NA ...
  ..$ ID     : int [1:1461] 1 1 1 1 1 1 1 1 1 1 ...

So in the argument to my function I want to tell it to give me dat1 from files 1:15, and then I'll take the mean of the results.

I thought I could use another lapply to pull dat1 into a vector, but it keeps returning NULL or "list()", or it fails with errors saying that the object cannot be subset or that subset is missing an argument. I've tried both subset and bracket notation.

How do you recommend that I take a subset of the list of dataframes so that I get back all dat1's or dat2's into a single vector that I can run a mean against?

Thank you for your time and consideration.

  • I guess you can make use of something like lapply(x, `[[`, 'dat1'), which will return a list of vectors corresponding to the 'dat1' columns from each data frame. Commented Feb 11, 2015 at 21:35
  • What exactly is the code you were trying that gave you the error? I would think unlist(lapply(x, "[[", "dat1")) might work. An actual reproducible example would be more useful here than just a description of the structure. Commented Feb 11, 2015 at 21:35
  • Hello @mrflick here's a sample of observation 1. Date dat1 dat2 ID 10/11/2003 NA NA 1 10/12/2003 5.99 0.428 1 10/13/2003 NA NA 1 10/14/2003 NA NA 1 10/15/2003 NA NA 1 10/16/2003 NA NA 1 10/17/2003 NA NA 1 10/18/2003 4.68 1.04 1 10/19/2003 NA NA 1 10/20/2003 NA NA 1 10/21/2003 NA NA 1 10/22/2003 NA NA 1 10/23/2003 NA NA 1 10/24/2003 3.47 0.363 1 10/25/2003 NA NA 1 10/26/2003 NA NA 1 10/27/2003 NA NA 1 10/28/2003 NA NA 1 10/29/2003 NA NA 1 10/30/2003 2.42 0.507 1 Commented Feb 11, 2015 at 22:00
  • @MrFlick I tried your method and it just returns "NULL", which leads me to believe that my dat1 argument isn't actually being used. I know that if I do a simple print(dat1) that it will give me my argument within the function. y <- unlist(lapply(x, "[[", "dat1")) Commented Feb 11, 2015 at 22:20
  • What you posted above in the comments is not a reproducible example. Try dput()-ing a sample object or build a sample list in your original question. Read the link I provided for other examples. There must be something different going on than what you described. Commented Feb 11, 2015 at 22:23
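
One way to narrow down a NULL result like the one reported above is to look at the column names of each data frame. The following is only a minimal sketch: the list `frames` and its values are made up for illustration and are not the asker's data. It relies on the fact that `[[` returns NULL for a name that is not present, so unlist() of those results is NULL as well.

```r
# Hypothetical stand-in for the list of data frames read from the CSV files.
frames <- list(
  data.frame(dat1 = c(NA, 5.99, NA), dat2 = c(NA, 0.428, NA), ID = 1L),
  data.frame(dat1 = c(4.68, NA, 3.47), dat2 = c(1.04, NA, 0.363), ID = 2L)
)

# Extracting by a name that exists gives one vector per data frame.
found <- unlist(lapply(frames, `[[`, "dat1"))

# Extracting by a name that does not exist gives NULL for every element,
# so the unlisted result is NULL too.
missing_col <- unlist(lapply(frames, `[[`, "no_such_column"))

# Listing the names of each data frame shows what can actually be extracted.
column_names <- lapply(frames, names)
```

If the real column names differ from "dat1" (different case, stray whitespace, a changed header), the second pattern is what would show up.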

2 Answers

1

I love plyr for this sort of thing. I would do something like this if you want the mean for each data.frame:

 library(plyr)
 # na.rm = TRUE skips the NA values in dat1; without it the mean is NA
 ldply(x, summarize, Mean = mean(dat1, na.rm = TRUE))

or, if you want a long vector of all the dat1 columns and you want to take the mean of all of them, I'd still use plyr but do this:

 x <- rbind.fill(x)
 mean(x$dat1, na.rm = TRUE)
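
For readers without plyr, a rough base R counterpart of the two approaches might look like the sketch below. The list `frames` is made up for illustration, and `na.rm = TRUE` is an assumption that missing values should be skipped, which the sample data in the question suggests.

```r
# Hypothetical list of data frames that share the same columns.
frames <- list(
  data.frame(dat1 = c(1, NA, 3), dat2 = 10),
  data.frame(dat1 = c(2, 4, NA), dat2 = 10)
)

# One mean per data frame.
per_frame <- sapply(frames, function(df) mean(df$dat1, na.rm = TRUE))

# Stack all data frames by row, then take one overall mean.
stacked <- do.call(rbind, frames)
overall <- mean(stacked$dat1, na.rm = TRUE)
```

Note that do.call(rbind, ...) expects every data frame to have the same columns, whereas rbind.fill fills missing columns with NA.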

0

Create a similar data set:

> x = list(data.frame(dat1 = 1:3,dat2=10), data.frame(dat1 = 2:4,dat2=10))
> str(x)
List of 2
 $ :'data.frame':   3 obs. of  2 variables:
  ..$ dat1: int [1:3] 1 2 3
  ..$ dat2: num [1:3] 10 10 10
 $ :'data.frame':   3 obs. of  2 variables:
  ..$ dat1: int [1:3] 2 3 4
  ..$ dat2: num [1:3] 10 10 10

Use lapply to select variable dat1:

> lapply(x, function(X) X$dat1)
[[1]]
[1] 1 2 3

[[2]]
[1] 2 3 4

Combine the resulting list into a single vector with c, call mean on that vector, and add na.rm=TRUE to ignore the NA values:

> mean(do.call(c, lapply(x, function(X) X$dat1)),na.rm=TRUE)
[1] 2.5
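
Since the question passes the column name into a function, it may help to note that X$dat1 always looks for a column literally called dat1, whatever a variable named dat1 holds. A sketch of a wrapper that takes the name as a character string follows; `column_mean` and its arguments are hypothetical names, not taken from the question.

```r
# Hypothetical helper: pooled mean of one column across a list of data frames.
column_mean <- function(frames, column) {
  # `[[` accepts the column name as a string held in a variable; `$` does not.
  values <- unlist(lapply(frames, function(df) df[[column]]))
  mean(values, na.rm = TRUE)
}

x <- list(data.frame(dat1 = 1:3, dat2 = 10), data.frame(dat1 = 2:4, dat2 = 10))
result_dat1 <- column_mean(x, "dat1")
result_dat2 <- column_mean(x, "dat2")
```

With the sample list above, result_dat1 matches the 2.5 obtained earlier.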

3 Comments

Hi Edzer, thank you for the feedback. When I do it your way I get the following warning: "Warning message: In mean.default(do.call(c, lapply(x, function(X) X$dat1)), : argument is not numeric or logical: returning NA". dat1 is an argument passed into the function that makes sure the function only selects dat1 for vectorization and mean calculation.
There has to be something going on with my function in general because if I try the sample list as provided I can subset it with no issues. The mean function still doesn't work but at least the subsetting does. So that gives me something to explore more.
This warning indicates that some dat1 vectors are not numeric but, for instance, factors, unlike the one you show above. I'd suggest building in a check that catches this.
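
A sketch of such a check, under the assumption that an offending column is a factor holding numbers stored as text. The list `frames` and the helper `as_numeric_column` are hypothetical and only illustrate the idea.

```r
# Hypothetical data: the second data frame's dat1 was read in as a factor.
frames <- list(
  data.frame(dat1 = c(1.5, NA, 2.5)),
  data.frame(dat1 = factor(c("3.5", "4.5", NA)))
)

# Flag the data frames whose dat1 column is not numeric.
not_numeric <- !sapply(frames, function(df) is.numeric(df$dat1))

# Convert via character first; as.numeric() on a factor alone returns level codes.
as_numeric_column <- function(v) {
  if (is.numeric(v)) v else as.numeric(as.character(v))
}

values <- unlist(lapply(frames, function(df) as_numeric_column(df$dat1)))
pooled_mean <- mean(values, na.rm = TRUE)
```

Inspecting `not_numeric` first shows which files need attention before any conversion is applied.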
