Summarizing multiple columns by two variables

Question

This is my first time ever using R so please forgive me if this question is not worded properly. I have a .csv file that I imported into R and I am trying to summarize some data. each row of data if for a given year, study site, and area and each column has the number of species present. There are 4 columns for each species as there were 4 surveys where the species could have been seen.

I am trying to get the sum of each species by year and study site. Columns 5:8 are one species, 9:12 another, 13:16 another and so on. Here is the code that I thought would summarize columns 5:8 by year (YYYY) and study area (SAR):

aggregate(test[,5:8],by = list("SAR","YYYY"), FUN = sum, na.rm = TRUE)

This gives me the error message that "argument must have the same length". Can anyone help me through this initial step?

Here is some of the data:

SAR    YYYY GRID_ID WID     col1 col2 col3 col4
BCPALP  2005    1   1189    NA  NA  0   0
BCPALP  2005    1   1190    0   NA  0   0
BCPALP  2005    1   1191    0   0   NA  NA
BCPALP  2005    1   1192    0   NA  NA  NA
BCPALP  2005    1   1194    NA  NA  1   NA
BCPALP  2005    1   1195    NA  NA  1   NA
BCPALP  2005    1   1196    0   NA  0   NA
BCPALP  2005    1   1198    0   NA  0   NA
BCPALP  2005    1   1199    0   NA  0   0

I'm hoping to get an output that is something like this:

SAR    YYYY    total of columns 1:4
BCPALP 2005    2

This is the code I just tried.

aggregate(cbind("col1", "col2", "col3", "col4")~SAR+YYYY, test, FUN=sum, na.rm=TRUE, na.action=NULL)

It gives me an error message that states "variable lengths differ (found for 'SAR')".

I went back and checked the data and all the variable lengths are the same.

BTW, Can you specify YYYY and SARs position in the dataset — akrun
– akrun, Commented Nov 13, 2015 at 17:57
Sorry, I am having trouble figuring out how to add a portion of my data table to the comment. The SAR and YYYY are the headings for columns 1 and 2 respectively. — HVS
– HVS, Commented Nov 13, 2015 at 18:03

akrun · Accepted Answer · 2015-11-13 19:19:04Z

1

We can use either aggregate or data.table or dplyr. If we use the formula method for aggregate, we need to set the na.action=NULL when there are NA values in different columns. By default, the na.action=na.omit, so if there is a single NA in one of the columns, that row will be removed from the calculation.

aggregate(cbind(col1, col2, col3, col4)~SAR+YYYY, test,
                        FUN=sum, na.rm=TRUE, na.action=NULL)
#   SAR YYYY col1 col2 col3 col4
#1 BCPALP 2005    0    0    2    0

Using dplyr, we group by 'SAR', 'YYYY', and use summarise_each to get the sum of each of the 'col'.

library(dplyr)
test %>%
     group_by(SAR, YYYY) %>%
     summarise_each(funs(sum=sum(., na.rm=TRUE)), 5:ncol(test))
#     SAR  YYYY  col1  col2  col3  col4
#   (chr) (int) (int) (int) (int) (int)
#1 BCPALP  2005     0     0     2     0

Or with data.table. We convert the 'data.frame' to 'data.table' (setDT(test)), grouped by 'SAR', 'YYYY', we loop though the Subset of Data.table (.SD) and get the sum. The columns to be looped are specified in the .SDcols.

library(data.table)
setDT(test)[, lapply(.SD, sum, na.rm=TRUE), by = .(SAR, YYYY),
             .SDcols= 5:ncol(test)]  
#      SAR YYYY col1 col2 col3 col4
#1: BCPALP 2005    0    0    2    0

Update

Suppose after aggregating we need to get the row wise sum for columns 'col1:col4', then 'col5:col8' etc.

 DT <- setDT(test1)[, lapply(.SD, sum, na.rm=TRUE),
              by = .(SAR, YYYY), .SDcols= 5:ncol(test1)]
 DT1 <- melt(DT, id.var=c('SAR', 'YYYY'))[, i1 := as.numeric(gl(.N, 4, .N)),
            .(SAR, YYYY)]
 dcast(DT1, SAR+YYYY~i1, value.var='value', sum)

data

 test <- structure(list(SAR = c("BCPALP", "BCPALP",
"BCPALP", "BCPALP", 
"BCPALP", "BCPALP", "BCPALP", "BCPALP", "BCPALP"), YYYY = c(2005L, 
2005L, 2005L, 2005L, 2005L, 2005L, 2005L, 2005L, 2005L),
GRID_ID = c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), WID = c(1189L, 1190L, 1191L, 
1192L, 1194L, 1195L, 1196L, 1198L, 1199L), col1 = c(NA, 0L, 0L, 
0L, NA, NA, 0L, 0L, 0L), col2 = c(NA, NA, 0L, NA, NA, NA, NA, 
NA, NA), col3 = c(0L, 0L, NA, NA, 1L, 1L, 0L, 0L, 0L), col4 = c(0L, 
0L, NA, NA, NA, NA, NA, NA, 0L)), .Names = c("SAR", "YYYY",
"GRID_ID", 
"WID", "col1", "col2", "col3", "col4"), class = "data.frame", 
 row.names = c(NA, -9L))

set.seed(24)
m1 <- matrix(sample(c(NA,0:5), 9*4, replace=TRUE),ncol=4, 
           dimnames=list(NULL, paste0('col', 5:8)))
test1 <- cbind(test, m1)

edited Nov 13, 2015 at 19:19

answered Nov 13, 2015 at 18:12

akrun

891k38 gold badges590 silver badges700 bronze badges

Sign up to request clarification or add additional context in comments.

8 Comments

HVS Over a year ago

Also, how can I tell R to summarize columns 5:8, 9:12,13:16 and so on until the end of my data?

akrun Over a year ago

@HVS As I have showed, you can specify the column position in select with dplyr or .SDcols in data.table. i.e. .SDcols= 5:ncol(test).

HVS Over a year ago

When I copy and paste the data you posted above into R it runs and then I don't get an error message with the aggregate function, The error message must be due to the way the column was created in excel and then converted to a .csv file. I am still a bit unclear on the .SDcols= 5:ncol(test) command. I don't understand how that statement tells R to read the columns I mentioned above.

akrun Over a year ago

@HVS Actually, you got the error with aggregate by quoting the "col1" etc. We are specifying the position or names of the columns in .SDcols and based on that, we loop though those columns (lapply(.SD, ...)

HVS Over a year ago

I have 328 columns in total.

|

Collectives™ on Stack Overflow

Summarizing multiple columns by two variables

1 Answer 1

Update

data

8 Comments

Your Answer

Linked

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Update

data

8 Comments

Your Answer

Sign up or log in

Post as a guest

Linked

Related