1

I would like to find a tidy way to carry out a data cleaning step that I have to do for multiple pairs of columns.

df <- data.frame(apple = c("Yes", NA, NA, "Yes", NA),
                 apple_NO = c(NA, "No_1", "No_1", NA, "No_2"),
                 berry = c("Yes", "Yes", NA, NA, "Yes"),
                 berry_NO = c(NA, NA, "No_1", "No_1", NA),
                 coconut = c(NA, "Yes", "Yes", "Yes", NA),
                 coconut_NO = c("No_2", NA, NA, NA, "No_2"),
                 dinosaur = c("Yes", NA, NA, NA, "Yes"),
                 dinosaur_NO = c(NA, "No_2", "No_1", "No_2", NA))
> df
  apple apple_NO berry berry_NO coconut coconut_NO dinosaur dinosaur_NO
1   Yes     <NA>   Yes     <NA>    <NA>       No_2      Yes        <NA>
2  <NA>     No_1   Yes     <NA>     Yes       <NA>     <NA>        No_2
3  <NA>     No_1  <NA>     No_1     Yes       <NA>     <NA>        No_1
4   Yes     <NA>  <NA>     No_1     Yes       <NA>     <NA>        No_2
5  <NA>     No_2   Yes     <NA>    <NA>       No_2      Yes        <NA>

cols <- c("apple", "berry", "coconut", "dinosaur")
cols_NO <- c("apple_NO", "berry_NO", "coconut_NO", "dinosaur_NO")

I would like to clean the values in columns in cols_NO and assign new values to the columns in cols

For example, if I just had one column pair to clean, I would do something like:

df <- df %>%
  mutate(apple = case_when(apple_NO == "No_1" ~ "None left",
                           apple_NO == "No_2" ~ "Finished",
                           TRUE ~ apple))

I would also like to do this with berry and berry_NO, and coconut and coconut_NO etc.

The output I want would look something like this:

      apple apple_NO     berry berry_NO  coconut coconut_NO  dinosaur dinosaur_NO
1       Yes     <NA>       Yes     <NA> Finished       No_2       Yes        <NA>
2 None left     No_1       Yes     <NA>      Yes       <NA>  Finished        No_2
3 None left     No_1 None left     No_1      Yes       <NA> None left        No_1
4       Yes     <NA> None left     No_1      Yes       <NA> None left        No_2
5  Finished     No_2       Yes     <NA> Finished       No_2       Yes        <NA>

I think there's a solution somewhere along the lines of using map or map2 or mapply and parallel lists, but I've not used those before and can't seem to find similar solutions that I can use, featuring a list of columns on the left and right hand side of the = in mutate.

Thanks!

EDIT:

This gets me close but I would still need to replace or mutate_at this to my main dataframe. My real data would benefit from using grepl so I've just left that in.

fun.casewhen <- function(cols, cols_NO){
  case_when(grepl("No_1", cols_NO) == TRUE ~ "None left",
            grepl("No_2", cols_NO) == TRUE ~ "Finished",
            TRUE ~ cols)
  }


dftest <- map2(df %>% select(cols), df1 %>% select(cols_NO), ~ fun.casewhen (.x, .y))

The resulting dftest is made up of lists of each of the columns in cols, but with the correct values.

2
  • Can you please share at least a few rows of your data? Commented Mar 3, 2021 at 9:34
  • Sure, just edited the above. Thanks Commented Mar 3, 2021 at 9:41

3 Answers 3

1

here is a data.table + rlist approach

library( data.table )
library( rlist )
data.table::setDT(df)
L <- split.default( df, gsub( "_NO", "", names(df) ) )
rlist::list.cbind( lapply( L, function(x) x[,1 := data.table:: fcoalesce(x) ] ) )

enter image description here

Sign up to request clarification or add additional context in comments.

Comments

1

Here's a tidyverse solution relying on multiple pivots. I'm sure this isn't the most concise method, but including as another option.

df %>%
  mutate(row = row_number()) %>%
  pivot_longer(-row) %>%
  separate(name, c("group", "keep"), sep = "_") %>%
  pivot_wider(names_from = keep, values_from = value) %>%
  mutate(`NA` = case_when(NO == "No_1" ~ "None left",
                          NO == "No_2" ~ "Finished",
                          TRUE ~ `NA`)) %>%
  pivot_longer(-c(row, group)) %>%
  unite("col", c(group, name)) %>%
  pivot_wider(names_from = col, values_from = value)



    row apple_NA  apple_NO berry_NA  berry_NO coconut_NA coconut_NO dinosaur_NA dinosaur_NO
  <int> <chr>     <chr>    <chr>     <chr>    <chr>      <chr>      <chr>       <chr>      
1     1 Yes       NA       Yes       NA       Finished   No_2       Yes         NA         
2     2 None left No_1     Yes       NA       Yes        NA         Finished    No_2       
3     3 None left No_1     None left No_1     Yes        NA         None left   No_1       
4     4 Yes       NA       None left No_1     Yes        NA         Finished    No_2       
5     5 Finished  No_2     Yes       NA       Finished   No_2       Yes         NA      

Comments

0

I have found a way to assign the values to my df, albeit not in the best tidyr style:

fun.casewhen <- function(cols, cols_NO){
  case_when(grepl("No_1", cols_NO) == TRUE ~ "None left",
            grepl("No_2", cols_NO) == TRUE ~ "Finished",
            TRUE ~ cols)
  }


df[cols] <- map2(df %>% select(cols), df1 %>% select(cols_NO), ~ fun.casewhen (.x, .y))

I was missing the [cols] in df[cols] before. Nonetheless, it would be great to know if there is a solution more in-line with tidyr standards where the data can be cleaned using mutate and pipes.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.