1

This is my first time ever using R so please forgive me if this question is not worded properly. I have a .csv file that I imported into R and I am trying to summarize some data. each row of data if for a given year, study site, and area and each column has the number of species present. There are 4 columns for each species as there were 4 surveys where the species could have been seen.

I am trying to get the sum of each species by year and study site. Columns 5:8 are one species, 9:12 another, 13:16 another and so on. Here is the code that I thought would summarize columns 5:8 by year (YYYY) and study area (SAR):

aggregate(test[,5:8],by = list("SAR","YYYY"), FUN = sum, na.rm = TRUE)

This gives me the error message that "argument must have the same length". Can anyone help me through this initial step?

Here is some of the data:

SAR    YYYY GRID_ID WID     col1 col2 col3 col4
BCPALP  2005    1   1189    NA  NA  0   0
BCPALP  2005    1   1190    0   NA  0   0
BCPALP  2005    1   1191    0   0   NA  NA
BCPALP  2005    1   1192    0   NA  NA  NA
BCPALP  2005    1   1194    NA  NA  1   NA
BCPALP  2005    1   1195    NA  NA  1   NA
BCPALP  2005    1   1196    0   NA  0   NA
BCPALP  2005    1   1198    0   NA  0   NA
BCPALP  2005    1   1199    0   NA  0   0

I'm hoping to get an output that is something like this:

SAR    YYYY    total of columns 1:4
BCPALP 2005    2

This is the code I just tried.

aggregate(cbind("col1", "col2", "col3", "col4")~SAR+YYYY, test, FUN=sum, na.rm=TRUE, na.action=NULL)

It gives me an error message that states "variable lengths differ (found for 'SAR')".

I went back and checked the data and all the variable lengths are the same.

9
  • Please show some example data and expected output. Commented Nov 13, 2015 at 17:52
  • Here is some of the data that I have: Commented Nov 13, 2015 at 17:55
  • 1
    Please update it in your post Commented Nov 13, 2015 at 17:55
  • BTW, Can you specify YYYY and SARs position in the dataset Commented Nov 13, 2015 at 17:57
  • Sorry, I am having trouble figuring out how to add a portion of my data table to the comment. The SAR and YYYY are the headings for columns 1 and 2 respectively. Commented Nov 13, 2015 at 18:03

1 Answer 1

1

We can use either aggregate or data.table or dplyr. If we use the formula method for aggregate, we need to set the na.action=NULL when there are NA values in different columns. By default, the na.action=na.omit, so if there is a single NA in one of the columns, that row will be removed from the calculation.

aggregate(cbind(col1, col2, col3, col4)~SAR+YYYY, test,
                        FUN=sum, na.rm=TRUE, na.action=NULL)
#   SAR YYYY col1 col2 col3 col4
#1 BCPALP 2005    0    0    2    0

Using dplyr, we group by 'SAR', 'YYYY', and use summarise_each to get the sum of each of the 'col'.

library(dplyr)
test %>%
     group_by(SAR, YYYY) %>%
     summarise_each(funs(sum=sum(., na.rm=TRUE)), 5:ncol(test))
#     SAR  YYYY  col1  col2  col3  col4
#   (chr) (int) (int) (int) (int) (int)
#1 BCPALP  2005     0     0     2     0

Or with data.table. We convert the 'data.frame' to 'data.table' (setDT(test)), grouped by 'SAR', 'YYYY', we loop though the Subset of Data.table (.SD) and get the sum. The columns to be looped are specified in the .SDcols.

library(data.table)
setDT(test)[, lapply(.SD, sum, na.rm=TRUE), by = .(SAR, YYYY),
             .SDcols= 5:ncol(test)]  
#      SAR YYYY col1 col2 col3 col4
#1: BCPALP 2005    0    0    2    0

Update

Suppose after aggregating we need to get the row wise sum for columns 'col1:col4', then 'col5:col8' etc.

 DT <- setDT(test1)[, lapply(.SD, sum, na.rm=TRUE),
              by = .(SAR, YYYY), .SDcols= 5:ncol(test1)]
 DT1 <- melt(DT, id.var=c('SAR', 'YYYY'))[, i1 := as.numeric(gl(.N, 4, .N)),
            .(SAR, YYYY)]
 dcast(DT1, SAR+YYYY~i1, value.var='value', sum)

data

 test <- structure(list(SAR = c("BCPALP", "BCPALP",
"BCPALP", "BCPALP", 
"BCPALP", "BCPALP", "BCPALP", "BCPALP", "BCPALP"), YYYY = c(2005L, 
2005L, 2005L, 2005L, 2005L, 2005L, 2005L, 2005L, 2005L),
GRID_ID = c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), WID = c(1189L, 1190L, 1191L, 
1192L, 1194L, 1195L, 1196L, 1198L, 1199L), col1 = c(NA, 0L, 0L, 
0L, NA, NA, 0L, 0L, 0L), col2 = c(NA, NA, 0L, NA, NA, NA, NA, 
NA, NA), col3 = c(0L, 0L, NA, NA, 1L, 1L, 0L, 0L, 0L), col4 = c(0L, 
0L, NA, NA, NA, NA, NA, NA, 0L)), .Names = c("SAR", "YYYY",
"GRID_ID", 
"WID", "col1", "col2", "col3", "col4"), class = "data.frame", 
 row.names = c(NA, -9L))

set.seed(24)
m1 <- matrix(sample(c(NA,0:5), 9*4, replace=TRUE),ncol=4, 
           dimnames=list(NULL, paste0('col', 5:8)))
test1 <- cbind(test, m1) 
Sign up to request clarification or add additional context in comments.

8 Comments

Also, how can I tell R to summarize columns 5:8, 9:12,13:16 and so on until the end of my data?
@HVS As I have showed, you can specify the column position in select with dplyr or .SDcols in data.table. i.e. .SDcols= 5:ncol(test).
When I copy and paste the data you posted above into R it runs and then I don't get an error message with the aggregate function, The error message must be due to the way the column was created in excel and then converted to a .csv file. I am still a bit unclear on the .SDcols= 5:ncol(test) command. I don't understand how that statement tells R to read the columns I mentioned above.
@HVS Actually, you got the error with aggregate by quoting the "col1" etc. We are specifying the position or names of the columns in .SDcols and based on that, we loop though those columns (lapply(.SD, ...)
I have 328 columns in total.
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.