How to tabulate the means for multiple data files

Question

I have multiple data files formatted like so:

Condition    Score  Reqresponse 
   Z          1         b   
   Y          0         a

I want to read in multiple data files, get a mean score for each condition/reqresponse combo, then tabulate those means into a master table. I want the means for each data file to populate a row in the master table (or list, whatever).

Here's what I've attempted

#loop reads data from source example with only 2 files named 1 and 2
for(i in 1:2)
{
n= paste(i,".txt", sep="")
data <- read.table(toString(n), header = TRUE, sep = "\t")

So far so good right? After this I get lost.

Score <- ave(x = data$Score, data$Condition, data$Reqresponse, FUN = mean)
table(Score)
}

This is all I've come up with. I don't know which cells in the table belong to which Condition x Reqresponse combo, or how to create a new row and then feed them into a master table.

By the way, if this is just a silly way to approach what I'm doing feel free to point that out >)

The toString is unnecessary: paste already returns a character value.
– IRTFM, Commented Mar 11, 2013 at 6:19

Jouni Helske · Accepted Answer · 2013-03-11 06:48:53Z

3

This should work, although it could be optimized quite a bit:

all_data<-data.frame() #make empty data.frame (we don't know the size)
for(i in 1:2){ #go through all files    
  #add rows to the data frame
  all_data <- rbind(all_data,read.table(paste(i,".txt", sep=""), 
              header = TRUE, sep = "\t"))
}
#use tapply to compute mean
Score<-tapply(all_data$Score,list(all_data$Condition,all_data$Reqresponse),mean)
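To see which cell belongs to which Condition x Reqresponse combo, here is a small sketch on made-up data (the values and the names `long` and `mean_score` are just for illustration). tapply returns a matrix whose rows and columns are labelled with the factor levels, and it can be reshaped to one row per combo if that is easier to read:

```r
# Toy data standing in for the combined files (values are made up)
all_data <- read.table(text = "
Condition Score Reqresponse
Z 1 b
Y 0 a
Z 0 b
Y 1 a
Y 1 b
", header = TRUE)

# Naming the list elements labels the two dimensions of the result
Score <- tapply(all_data$Score,
                list(Condition = all_data$Condition,
                     Reqresponse = all_data$Reqresponse),
                mean)

# Rows are Condition levels, columns are Reqresponse levels
Score["Z", "b"]  # mean Score for Condition Z, Reqresponse b

# Long format: one row per combination (NA where a combo never occurs)
long <- as.data.frame(as.table(Score), responseName = "mean_score")
long
```

Combos that never occur in the data (here Z with a) show up as NA rather than being dropped.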

EDIT: A better-performing solution avoids building the master data frame at all (although I'm not sure how the efficiency of xtabs compares with tapply):

n <- 2 #number of files (1.txt, ..., n.txt)
#read the first file
data <- read.table(paste(1,".txt", sep=""),header = TRUE, sep = "\t")
#number of 1's per Condition x Reqresponse cell (assumes Score is coded 0/1)
score1<-xtabs(Score~Condition+Reqresponse,data=data)
#number of 0's per cell
score0<-xtabs(!Score~Condition+Reqresponse,data=data)
for(i in seq_len(n)[-1]){ #go through the rest of the files

  data <- read.table(paste(i,".txt", sep=""),header = TRUE, sep = "\t")

  #add the counts from file i.txt to the running totals
  #(every file must contain the same Condition and Reqresponse levels)
  score1<-score1+xtabs(Score~Condition+Reqresponse,data=data)
  score0<-score0+xtabs(!Score~Condition+Reqresponse,data=data)
}
#Compute the means
Score<-score1/(score0+score1)
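As a sanity check on the running-total idea, here is a sketch on two made-up data frames standing in for 1.txt and 2.txt (d1, d2 and the other names are hypothetical). It is a slight variation: it keeps a sum of Score and a count of rows per cell instead of counts of 1's and 0's, so it does not depend on Score being 0/1. It assumes every file contains the same combos, and compares the result with tapply on the pooled data:

```r
# Two small made-up data sets standing in for 1.txt and 2.txt
d1 <- data.frame(Condition   = c("Z", "Y", "Z", "Y"),
                 Score       = c(1, 0, 0, 1),
                 Reqresponse = c("b", "a", "a", "b"))
d2 <- data.frame(Condition   = c("Z", "Y", "Z", "Y"),
                 Score       = c(1, 1, 1, 0),
                 Reqresponse = c("b", "a", "a", "b"))

# Running totals: sum of Score and number of rows per cell
score_sum <- xtabs(Score ~ Condition + Reqresponse, data = d1) +
             xtabs(Score ~ Condition + Reqresponse, data = d2)
n_obs     <- xtabs(~ Condition + Reqresponse, data = d1) +
             xtabs(~ Condition + Reqresponse, data = d2)
running_mean <- score_sum / n_obs

# Reference: means computed on the pooled data
pooled <- rbind(d1, d2)
pooled_mean <- tapply(pooled$Score,
                      list(pooled$Condition, pooled$Reqresponse), mean)

# Both approaches should agree cell by cell
all.equal(as.vector(running_mean), as.vector(pooled_mean))
```

If a combo is missing from one file, the tables have different shapes and the addition fails, so in that case the pooled approach is the safer choice.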

edited Mar 11, 2013 at 6:48

answered Mar 11, 2013 at 6:13

Jouni Helske


3 Comments

Paul Hiemstra Over a year ago

+1, although for reading the data you could use apply-style loops, which do not suffer from the issues of growing an object sequentially. See my answer.

Jouni Helske Over a year ago

I'm not growing anything sequentially in my second version, as I overwrite the previous data. I agree that the first version is quite inefficient, not only because all_data grows sequentially but also because it builds one possibly huge data frame (which my second version avoids).

Paul Hiemstra Over a year ago

Yes, you are right; my answer then only applies to your first solution, which reads all files into memory.

Paul Hiemstra · Accepted Answer · 2013-03-11 07:02:12Z

3

The answer of @Hemmo involves growing an object sequentially. If the number of files is large, this can become really slow. A more R-style approach is not to use the for loop, but to first create a vector of file names and then loop over them using an apply-style loop. I'll use an apply loop from the plyr package, as this makes life a little easier:

library(plyr)
file_list = sprintf("%s.txt", 1:2)
all_data = ldply(file_list, read.table, header = TRUE, sep = "\t")

After that you can use another plyr function to process the data:

ddply(all_data, .(Condition, Reqresponse), summarise, mn = mean(Score))

You could also use base R functions:

all_data = do.call("rbind", lapply(file_list, read.table, header = TRUE, sep = "\t"))
# Here I copy the tapply call of @Hemmo
Score<-tapply(all_data$Score,list(all_data$Condition,all_data$Reqresponse),mean)
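If you want one row per file, as the question asks, one option is to tag each row with its file name while reading and then use the file as an extra grouping variable. This is only a sketch: the two files are made-up data written to a temporary directory so the example runs on its own, and the File column and the names combo and master are my own additions:

```r
# Write two tiny made-up files to a temporary directory
tmp <- tempdir()
file_list <- file.path(tmp, sprintf("%s.txt", 1:2))
write.table(data.frame(Condition   = c("Z", "Y", "Z", "Y"),
                       Score       = c(1, 0, 0, 1),
                       Reqresponse = c("b", "a", "b", "a")),
            file_list[1], sep = "\t", row.names = FALSE, quote = FALSE)
write.table(data.frame(Condition   = c("Z", "Y", "Z", "Y"),
                       Score       = c(1, 1, 1, 0),
                       Reqresponse = c("b", "a", "b", "a")),
            file_list[2], sep = "\t", row.names = FALSE, quote = FALSE)

# Read each file and record which file every row came from
all_data <- do.call("rbind", lapply(file_list, function(f) {
  d <- read.table(f, header = TRUE, sep = "\t")
  d$File <- basename(f)
  d
}))

# One row per file, one column per Condition.Reqresponse combination
combo  <- interaction(all_data$Condition, all_data$Reqresponse, drop = TRUE)
master <- tapply(all_data$Score,
                 list(File = all_data$File, Combo = combo), mean)
master
```

Each row of master holds the means for one file, and the column names (for example Z.b) identify the combo.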

edited Mar 11, 2013 at 7:02

answered Mar 11, 2013 at 6:56

Paul Hiemstra


2 Comments

luke123 Over a year ago

Hi Paul, is there any way I can do stuff like cut certain rows of data, exclude outliers etc before I calculate my means in this R-ish way? I like this much better but not sure how to do that without my loop. Also I want to do ANOVAs on this data (repeated measures, means of different columns) after. What would be the most effective way to do that? Make a df variable as I go?

Paul Hiemstra Over a year ago

I think it is best to create a new question where you refer to this one, and explain your additional questions.
