1

I am working on my thesis with little knowledge of r, so the answer this question may be pretty obvious.

I have the a dataset looking like this:

county<-c('1001','1001','1001','1202','1202','1303','1303')
naics<-c('423620','423630','423720','423620','423720','423550','423720')
employment<-c(5,6,5,5,5,6,5)
data<-data.frame(county,naics,employment)

For every county, I want to sum the value of employment of rows with naics '423620' and '423720'. (So two conditions: 1. same county code 2. those two naics codes) The row in which they are added should be the first one ('423620'), and the second one ('423720') should be removed

The final dataset should look like this:

county2<-c('1001','1001','1202','1303','1303')
naics2<-c('423620','423630','423620','423550','423720')
employment2<-c(10,6,10,6,5)
data2<-data.frame(county2,naics2,employment2)

I have tried to do it myself with aggregate and rowSum, but because of the two conditions, I have failed thus far. Thank you very much.

2 Answers 2

1

We can do

library(dplyr)
data$naics <- as.character(data$naics)

data %>%
    filter(naics %in% c(423620, 423720)) %>% group_by(county) %>% 
    summarise(naics = "423620", employment = sum(employment)) %>%
    bind_rows(., filter(data, !naics  %in% c(423620, 423720)))
# A tibble: 5 x 3
#   county  naics employment
#  <fctr>  <chr>      <dbl>
#1   1001 423620         10
#2   1202 423620         10
#3   1303 423620          5
#4   1001 423630          6
#5   1303 423550          6
Sign up to request clarification or add additional context in comments.

2 Comments

May I ask, why do we use "423720" in: filter(naics %in% c(423620, 423720)) %>% group_by(county) %>%, when we in the next step only summarize over "naics = "423620"?
@Erosennin In the summarise step, we are creating that 'naics' as 423620. I think the reason could be that the OP wanted the 'employment' sum where 'naics' is either of the two. Also, if you notice, here we are changing the two levels to a single one. I hope this helps
0

With such a condition, I'd first write a small helper and then pass it on to dplyr mutate:

# replace 423720 by 423620 only if both exist
onlyThoseNAICS <- function(v){
  if( ("423620" %in% v) & ("423720" %in% v) ) v[v == "423720"] <- "423620"
  v
}

data %>% 
  dplyr::group_by(county) %>% 
  dplyr::mutate(naics = onlyThoseNAICS(naics)) %>% 
  dplyr::group_by(county, naics) %>% 
  dplyr::summarise(employment = sum(employment)) %>% 
  dplyr::ungroup()

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.