1

I have data that has the following structure:

data <- data.frame(
  uniqueid = c(1, 1, 2, 2, 3, 3),
  year = c(2010, 2011, 2010, 2011, 2010, 2011),
  agency = c("SZ", "SZ", "SZ", NA, "SZ", "HE"),
  switch = c(0, 0, 0, NA, 0, 1)
)

As you can see, the data is organized by uniqueids appearing in a given year. Keep in mind that for the agency column there can be 13 different unique strings appearing across different uniqueids. I would like the data to look like the following:

data <- data.frame(
  uniqueid = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3),
  year = c(2010, 2010, 2011, 2011, 2010, 2010, 2011, 2011, 2011, 2010, 2010, 2011, 2011),
  agency = c("SZ", "HE", "SZ", "HE", "SZ", "HE", "SZ", NA, "HE", "SZ", "HE", "SZ", "HE"),
  switch = c(0, 0, 0, 0, 0, 0, NA, NA, NA, 0, 0, 0, 1)
)

In this transformation, the rows take on the different unique values of the agency variable, and the switch variable largely maps what it had previously looked like. I'm not really sure how to accomplish this in R, though I would prefer the solution to be within tidyverse. Thank you!

I've been trying something like the following but I don't seem to be getting exactly what I want:

data1 <- data %>%
      pivot_wider(names_from = agency, values_from = lead, names_prefix = "agency_", values_fill = "0") %>%
      gather(key = agency, value = lead, starts_with("agency_")) %>%
      arrange(uniqueid, year, agency)

3 Answers 3

2

Here's yet another way:

Steps:

  1. complete the data (add a row for every variation of id, year, and just the "SZ" and "HE" agencies)
  2. If none of the agency values in each uniqueid and year combination are NA, then change the new row switch values in that group to be 0. Otherwise, keep them NA
data |>
  complete(uniqueid, year, agency = c("SZ", "HE")) |>
  mutate(switch = ifelse(!any(is.na(agency)) & is.na(switch), 0, switch), .by = c(uniqueid, year))

Output:

# A tibble: 13 × 4
   uniqueid  year agency switch
      <dbl> <dbl> <chr>   <dbl>
 1        1  2010 HE          0
 2        1  2010 SZ          0
 3        1  2011 HE          0
 4        1  2011 SZ          0
 5        2  2010 HE          0
 6        2  2010 SZ          0
 7        2  2011 HE         NA
 8        2  2011 SZ         NA
 9        3  2010 HE          0
10        3  2010 SZ          0
11        3  2011 HE          1
12        3  2011 SZ          0
13        2  2011 NA         NA
Sign up to request clarification or add additional context in comments.

1 Comment

The only additional thing that needs to be accounted for is that the years are not the same across uniqueids. So uniqueid 1 might have years 2010 and 2011, but id 2 might have years 2011 and 2012. So the repitition of the agencies would need to occur within the years noted for a uniqueid.
0

You could break this question into 3 steps, give it a try:

0. Let's call your raw data as raw_data:

library(tidyverse)

raw_data <- data.frame(
  uniqueid = c(1, 1, 2, 2, 3, 3),
  year = c(2010, 2011, 2010, 2011, 2010, 2011),
  agency = c("SZ", "SZ", "SZ", NA, "SZ", "HE"),
  switch = c(0, 0, 0, NA, 0, 1)
)

1. Isolate the row(s) where the agency column is NA:

NA_rows <- raw_data |> filter(is.na(agency))

2. Process the filled rows:

filled_rows <- raw_data |> 
  
  complete(uniqueid, year, agency) |> 
  
  select(-switch) |> 
  
  filter(!is.na(agency)) |> 
  
  left_join(raw_data, join_by(uniqueid, year, agency)) |> 
  
  mutate(switch = case_when(
    is.na(switch) ~ 0,
    TRUE ~ switch
  ))

3. Bind the filled_rows together with the NA_rows:

bind_rows(filled_rows, NA_rows)

# A tibble: 13 × 4
   uniqueid  year agency switch
      <dbl> <dbl> <chr>   <dbl>
 1        1  2010 HE          0
 2        1  2010 SZ          0
 3        1  2011 HE          0
 4        1  2011 SZ          0
 5        2  2010 HE          0
 6        2  2010 SZ          0
 7        2  2011 HE          0
 8        2  2011 SZ          0
 9        3  2010 HE          0
10        3  2010 SZ          0
11        3  2011 HE          1
12        3  2011 SZ          0
13        2  2011 NA         NA

I am sure there are more elegant approaches, hope it is helpful.

Comments

0
  1. Filter rows with NA out
  2. Complete the data with missing combinations
  3. Perform dplyr::union(not base::union) on the completed and original dataframes to include those rows with missing agency.
  4. Re-arrange.
library(tidyverse)

data %>%
  drop_na() %>%
  complete(uniqueid, year, agency, fill = list(switch = 0)) %>%
  union(data) %>%
  mutate(switch = case_when(!anyNA(agency) ~ switch), .by = c(uniqueid, year)) %>%
  arrange(uniqueid, year)

# A tibble: 13 × 4
   uniqueid  year agency switch
      <dbl> <dbl> <chr>   <dbl>
 1        1  2010 HE          0
 2        1  2010 SZ          0
 3        1  2011 HE          0
 4        1  2011 SZ          0
 5        2  2010 HE          0
 6        2  2010 SZ          0
 7        2  2011 HE         NA
 8        2  2011 SZ         NA
 9        2  2011 NA         NA
10        3  2010 HE          0
11        3  2010 SZ          0
12        3  2011 HE          1
13        3  2011 SZ          0

2 Comments

why use case_when when you have only one condition? 🤨
@Mark Just to make the line shorter than using ifelse.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.