How to create new data rows for each unique value in a column (potentially grouped by other variables) in R

Question

I have data that has the following structure:

data <- data.frame(
  uniqueid = c(1, 1, 2, 2, 3, 3),
  year = c(2010, 2011, 2010, 2011, 2010, 2011),
  agency = c("SZ", "SZ", "SZ", NA, "SZ", "HE"),
  switch = c(0, 0, 0, NA, 0, 1)
)

As you can see, the data is organized by uniqueids appearing in a given year. Keep in mind that for the agency column there can be 13 different unique strings appearing across different uniqueids. I would like the data to look like the following:

data <- data.frame(
  uniqueid = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3),
  year = c(2010, 2010, 2011, 2011, 2010, 2010, 2011, 2011, 2011, 2010, 2010, 2011, 2011),
  agency = c("SZ", "HE", "SZ", "HE", "SZ", "HE", "SZ", NA, "HE", "SZ", "HE", "SZ", "HE"),
  switch = c(0, 0, 0, 0, 0, 0, NA, NA, NA, 0, 0, 0, 1)
)

In this transformation, the rows take on the different unique values of the agency variable, and the switch variable largely maps what it had previously looked like. I'm not really sure how to accomplish this in R, though I would prefer the solution to be within tidyverse. Thank you!

I've been trying something like the following but I don't seem to be getting exactly what I want:

data1 <- data %>%
      pivot_wider(names_from = agency, values_from = lead, names_prefix = "agency_", values_fill = "0") %>%
      gather(key = agency, value = lead, starts_with("agency_")) %>%
      arrange(uniqueid, year, agency)

Mark · Accepted Answer · 2023-12-01 09:01:31Z

2

Here's yet another way:

Steps:

complete the data (add a row for every variation of id, year, and just the "SZ" and "HE" agencies)
If none of the agency values in each uniqueid and year combination are NA, then change the new row switch values in that group to be 0. Otherwise, keep them NA

data |>
  complete(uniqueid, year, agency = c("SZ", "HE")) |>
  mutate(switch = ifelse(!any(is.na(agency)) & is.na(switch), 0, switch), .by = c(uniqueid, year))

Output:

# A tibble: 13 × 4
   uniqueid  year agency switch
      <dbl> <dbl> <chr>   <dbl>
 1        1  2010 HE          0
 2        1  2010 SZ          0
 3        1  2011 HE          0
 4        1  2011 SZ          0
 5        2  2010 HE          0
 6        2  2010 SZ          0
 7        2  2011 HE         NA
 8        2  2011 SZ         NA
 9        3  2010 HE          0
10        3  2010 SZ          0
11        3  2011 HE          1
12        3  2011 SZ          0
13        2  2011 NA         NA

edited Dec 1, 2023 at 9:01

answered Dec 1, 2023 at 8:06

Mark

12.7k3 gold badges24 silver badges51 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

sheri Over a year ago

The only additional thing that needs to be accounted for is that the years are not the same across uniqueids. So uniqueid 1 might have years 2010 and 2011, but id 2 might have years 2011 and 2012. So the repitition of the agencies would need to occur within the years noted for a uniqueid.

Grasshopper_NZ · Accepted Answer · 2023-12-01 08:48:43Z

You could break this question into 3 steps, give it a try:

0. Let's call your raw data as `raw_data`:

library(tidyverse)

raw_data <- data.frame(
  uniqueid = c(1, 1, 2, 2, 3, 3),
  year = c(2010, 2011, 2010, 2011, 2010, 2011),
  agency = c("SZ", "SZ", "SZ", NA, "SZ", "HE"),
  switch = c(0, 0, 0, NA, 0, 1)
)

1. Isolate the row(s) where the `agency` column is NA:

NA_rows <- raw_data |> filter(is.na(agency))

2. Process the filled rows:

filled_rows <- raw_data |> 
  
  complete(uniqueid, year, agency) |> 
  
  select(-switch) |> 
  
  filter(!is.na(agency)) |> 
  
  left_join(raw_data, join_by(uniqueid, year, agency)) |> 
  
  mutate(switch = case_when(
    is.na(switch) ~ 0,
    TRUE ~ switch
  ))

3. Bind the `filled_rows` together with the `NA_rows`:

bind_rows(filled_rows, NA_rows)

# A tibble: 13 × 4
   uniqueid  year agency switch
      <dbl> <dbl> <chr>   <dbl>
 1        1  2010 HE          0
 2        1  2010 SZ          0
 3        1  2011 HE          0
 4        1  2011 SZ          0
 5        2  2010 HE          0
 6        2  2010 SZ          0
 7        2  2011 HE          0
 8        2  2011 SZ          0
 9        3  2010 HE          0
10        3  2010 SZ          0
11        3  2011 HE          1
12        3  2011 SZ          0
13        2  2011 NA         NA

I am sure there are more elegant approaches, hope it is helpful.

Darren Tsai · Accepted Answer · 2023-12-01 11:01:42Z

0

Filter rows with NA out
Complete the data with missing combinations
Perform dplyr::union(not base::union) on the completed and original dataframes to include those rows with missing agency.
Re-arrange.

library(tidyverse)

data %>%
  drop_na() %>%
  complete(uniqueid, year, agency, fill = list(switch = 0)) %>%
  union(data) %>%
  mutate(switch = case_when(!anyNA(agency) ~ switch), .by = c(uniqueid, year)) %>%
  arrange(uniqueid, year)

# A tibble: 13 × 4
   uniqueid  year agency switch
      <dbl> <dbl> <chr>   <dbl>
 1        1  2010 HE          0
 2        1  2010 SZ          0
 3        1  2011 HE          0
 4        1  2011 SZ          0
 5        2  2010 HE          0
 6        2  2010 SZ          0
 7        2  2011 HE         NA
 8        2  2011 SZ         NA
 9        2  2011 NA         NA
10        3  2010 HE          0
11        3  2010 SZ          0
12        3  2011 HE          1
13        3  2011 SZ          0

edited Dec 1, 2023 at 11:01

answered Dec 1, 2023 at 7:19

Darren Tsai

36.6k6 gold badges27 silver badges58 bronze badges

2 Comments

Mark Over a year ago

why use case_when when you have only one condition? 🤨

Darren Tsai Over a year ago

@Mark Just to make the line shorter than using ifelse.

Collectives™ on Stack Overflow

How to create new data rows for each unique value in a column (potentially grouped by other variables) in R

3 Answers 3

1 Comment

0. Let's call your raw data as `raw_data`:

1. Isolate the row(s) where the `agency` column is NA:

2. Process the filled rows:

3. Bind the `filled_rows` together with the `NA_rows`:

Comments

2 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

1 Comment

0. Let's call your raw data as raw_data:

1. Isolate the row(s) where the agency column is NA:

2. Process the filled rows:

3. Bind the filled_rows together with the NA_rows:

Comments

2 Comments

Your Answer

Sign up or log in

Post as a guest

Related

0. Let's call your raw data as `raw_data`:

1. Isolate the row(s) where the `agency` column is NA:

3. Bind the `filled_rows` together with the `NA_rows`: