R: sum row based on several conditions

Question

I am working on my thesis with little knowledge of R, so the answer to this question may be pretty obvious.

I have a dataset that looks like this:

county<-c('1001','1001','1001','1202','1202','1303','1303')
naics<-c('423620','423630','423720','423620','423720','423550','423720')
employment<-c(5,6,5,5,5,6,5)
data<-data.frame(county,naics,employment)

For every county, I want to sum the employment values of the rows with naics '423620' and '423720'. So there are two conditions: (1) the same county code and (2) one of those two naics codes. The sum should go in the first row ('423620'), and the second row ('423720') should be removed.

The final dataset should look like this:

county2<-c('1001','1001','1202','1303','1303')
naics2<-c('423620','423630','423620','423550','423720')
employment2<-c(10,6,10,6,5)
data2<-data.frame(county2,naics2,employment2)

I have tried to do it myself with aggregate and rowSum, but because of the two conditions, I have failed thus far. Thank you very much.

akrun · Accepted Answer · 2017-06-11 07:50:53Z

1

We can do

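library(dplyr)

# naics must be character so the summarised rows bind cleanly with the rest
data$naics <- as.character(data$naics)

data %>%
    # keep only the two codes to be merged, then total employment per county
    filter(naics %in% c("423620", "423720")) %>%
    group_by(county) %>%
    summarise(naics = "423620", employment = sum(employment)) %>%
    # append the untouched rows that carry any other naics code
    bind_rows(., filter(data, !naics %in% c("423620", "423720")))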
# A tibble: 5 x 3
#   county  naics employment
#  <fctr>  <chr>      <dbl>
#1   1001 423620         10
#2   1202 423620         10
#3   1303 423620          5
#4   1001 423630          6
#5   1303 423550          6
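
To see what each stage of the pipeline contributes, it can be split into named steps. This is only a sketch on the question's sample data: the names `codes`, `merged`, `others` and `result` are made up for illustration, and `stringsAsFactors = FALSE` is assumed so that the codes stay character. Note that county 1303, which has only '423720', ends up labelled '423620' with this approach, so that row differs from the one the question expects.

```r
library(dplyr)

# Sample data from the question; codes are kept as character strings
data <- data.frame(
  county     = c("1001", "1001", "1001", "1202", "1202", "1303", "1303"),
  naics      = c("423620", "423630", "423720", "423620", "423720", "423550", "423720"),
  employment = c(5, 6, 5, 5, 5, 6, 5),
  stringsAsFactors = FALSE
)

codes <- c("423620", "423720")

# Step 1: rows with either code, totalled per county and relabelled "423620"
merged <- data %>%
  filter(naics %in% codes) %>%
  group_by(county) %>%
  summarise(naics = "423620", employment = sum(employment))

# Step 2: every other row, left untouched
others <- filter(data, !naics %in% codes)

# Step 3: stack the two pieces and sort them for readability
result <- bind_rows(merged, others) %>% arrange(county, naics)
print(result)
```

Printing `merged` and `others` separately shows which rows were collapsed and which were passed through unchanged.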

edited Jun 11, 2017 at 7:50

answered Jun 11, 2017 at 7:42


2 Comments

Helen Over a year ago

May I ask why we include "423720" in filter(naics %in% c(423620, 423720)) %>% group_by(county) when, in the next step, we only summarise with naics = "423620"?

akrun Over a year ago

@Erosennin In the summarise step, we are setting 'naics' to "423620" for the combined row. The filter keeps both codes because the OP wanted the 'employment' sum over rows where 'naics' is either of the two. In effect, we are collapsing the two codes into a single one. I hope this helps.

Murray Bozinsky · Accepted Answer · 2017-06-11 09:38:16Z

0

With such a condition, I'd first write a small helper and then pass it on to dplyr mutate:

# replace 423720 by 423620 only if both codes occur in the group
onlyThoseNAICS <- function(v){
  # && is the scalar "and", which is what an if() condition expects
  if( ("423620" %in% v) && ("423720" %in% v) ) v[v == "423720"] <- "423620"
  v
}

data %>% 
  dplyr::group_by(county) %>% 
  dplyr::mutate(naics = onlyThoseNAICS(naics)) %>% 
  dplyr::group_by(county, naics) %>% 
  dplyr::summarise(employment = sum(employment)) %>% 
  dplyr::ungroup()
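
As a quick check, the helper approach can be run on the question's sample data and compared with the result the question asks for. This is a minimal sketch, assuming the codes are stored as character (`stringsAsFactors = FALSE`); the names `result` and `expected` are only illustrative.

```r
library(dplyr)

# Sample data from the question
data <- data.frame(
  county     = c("1001", "1001", "1001", "1202", "1202", "1303", "1303"),
  naics      = c("423620", "423630", "423720", "423620", "423720", "423550", "423720"),
  employment = c(5, 6, 5, 5, 5, 6, 5),
  stringsAsFactors = FALSE
)

# Helper: relabel 423720 as 423620 only when both codes occur in the group
onlyThoseNAICS <- function(v) {
  if (("423620" %in% v) && ("423720" %in% v)) v[v == "423720"] <- "423620"
  v
}

result <- data %>%
  group_by(county) %>%
  mutate(naics = onlyThoseNAICS(naics)) %>%
  group_by(county, naics) %>%
  summarise(employment = sum(employment)) %>%
  ungroup()

# Desired result as written in the question, with matching column names
expected <- data.frame(
  county     = c("1001", "1001", "1202", "1303", "1303"),
  naics      = c("423620", "423630", "423620", "423550", "423720"),
  employment = c(10, 6, 10, 6, 5),
  stringsAsFactors = FALSE
)

# Compare after dropping the tibble class
print(all.equal(as.data.frame(result), expected))
```

Because the relabelling happens only in counties that contain both codes, county 1303 keeps its '423720' row, which is what the question's expected output shows.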

answered Jun 11, 2017 at 9:38
