1

Suppose I have this dataset:

set.seed (1234); 
data.frame(cbind(a=rep(c("si","no"),30),b=rnorm(60)),
           c=rep(c("d","e","f"),20))  %>% head()

original dataset, first 6 cases

Then I want to add many columns (in this example I only added two), to identify distinct cases between each group (in this case, column "a").

set.seed(1234); 
data.frame(cbind(a=rep(c("si","no"),30),b=rnorm(60)),c=rep(c("d","e","f"),20)) %>% 
group_by(a) %>% dplyr::mutate_at(vars(c(b,c)), .funs= list(dups_hash_ing= ~n_distinct(.)))

This code leaves the following dataset:

dataset with distinct cases

If I set the dataset with dput, the outcome is

structure(list(a = structure(c(2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 
2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 2L, 1L, 
2L, 1L, 2L, 1L), .Label = c("no", "si"), class = "factor"), b = structure(c(22L, 
1L, 51L, 34L, 50L, 57L, 53L, 10L, 47L, 3L, 11L, 23L, 15L, 38L, 
58L, 39L, 41L, 17L, 28L, 21L, 37L, 45L, 29L, 46L, 32L, 48L, 56L, 
52L, 26L, 19L, 35L, 8L, 55L, 20L, 9L, 36L, 2L, 12L, 6L, 42L, 
49L, 43L, 59L, 54L, 31L, 13L, 60L, 44L, 14L, 30L, 7L, 5L, 16L, 
27L, 33L, 18L, 24L, 4L, 25L, 40L), .Label = c("-0.0997905884418961", 
"-0.151736536534977", "-0.198416273822079", "-0.254874652654534", 
"-0.274704218225806", "-0.304721068966714", "-0.324393300483657", 
"-0.400235237343163", "-0.415751788401515", "-0.50873701541522", 
"-0.538070788884863", "-0.60615111526422", "-0.659770093821306", 
"-0.684320344136007", "-0.789646852263761", "-0.933503340589868", 
"-0.965903210133575", "-1.07754212275943", "-1.11444896479736", 
"-1.60708093984972", "-2.07823754188738", "-2.7322195229558", 
"-2.85575865501923", "-3.23315213292314", "0.0295178303214797", 
"0.0326639575014441", "0.116845344986082", "0.162654708118265", 
"0.185513915583057", "0.186492083080971", "0.287709728313787", 
"0.311681028661359", "0.319160238648117", "0.413868915451097", 
"0.418057822385083", "0.42200837321742", "0.485226820569252", 
"0.487814635163685", "0.500694614280786", "0.594273774110513", 
"0.62021020366732", "0.629536099884472", "0.660212631820405", 
"0.677415500438328", "0.696768778564913", "0.700733515544461", 
"0.704180178465512", "0.760462361967838", "0.895171980275539", 
"0.912322161610113", "0.976031734922396", "1.1123628412626", 
"1.16910851401363", "1.17349757263239", "1.49349310261748", "1.84246362620766", 
"1.98373220068438", "2.16803253951933", "2.27348352044748", "2.91914013071762"
), class = "factor"), c = structure(c(1L, 2L, 3L, 1L, 2L, 3L, 
1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 
2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 
3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 
1L, 2L, 3L, 1L, 2L, 3L), .Label = c("d", "e", "f"), class = "factor"), 
    a_dups_hash_ing = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L, 1L), b_dups_hash_ing = c(30L, 30L, 30L, 30L, 
    30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 
    30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 
    30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 
    30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L, 
    30L, 30L, 30L, 30L, 30L, 30L, 30L, 30L), c_dups_hash_ing = c(3L, 
    3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
    3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
    3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 
    3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L)), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -60L), groups = structure(list(
    a = structure(1:2, .Label = c("no", "si"), class = "factor"), 
    .rows = list(c(2L, 4L, 6L, 8L, 10L, 12L, 14L, 16L, 18L, 20L, 
    22L, 24L, 26L, 28L, 30L, 32L, 34L, 36L, 38L, 40L, 42L, 44L, 
    46L, 48L, 50L, 52L, 54L, 56L, 58L, 60L), c(1L, 3L, 5L, 7L, 
    9L, 11L, 13L, 15L, 17L, 19L, 21L, 23L, 25L, 27L, 29L, 31L, 
    33L, 35L, 37L, 39L, 41L, 43L, 45L, 47L, 49L, 51L, 53L, 55L, 
    57L, 59L))), row.names = c(NA, -2L), class = c("tbl_df", 
"tbl", "data.frame"), .drop = TRUE))

What I need to do, is replace, column by column, if the number of distinct cases is more than one per group, with the value of the original column. I have to do this for more than 50 columns. An example of this will be provided for only one column with mutate:

  dplyr::mutate(b_dups_hash_ing= ifelse(>1,b,0))

I need to repeat the code provided above for many variables. This is very similar to a mutate_at (words in brackets is what I would do). The following example does not work, but is something I would do in an ideal world, just for your better understanding of my problem.

dplyr::mutate_at(vars(contains('_dups_hash_ing')), .funs = list(~ifelse(.>1,vars([original]),0)))
8
  • 3
    Please make this reproducible by setting a seed with set.seed([some number here]) Commented Feb 20, 2020 at 20:32
  • Thank you for your response. However, the output of the example is not relevant. I'm particularly interested in the function that makes it possible to replace values in the columns. I will do my best trying to edit the question, I don't know how to do it. Commented Feb 20, 2020 at 20:51
  • @JasonAizkalns I did what you asked. I added the seed. Thank you. Commented Feb 20, 2020 at 20:58
  • 1
    Try using dput(dataset), where dataset might be a dataframe. Commented Feb 20, 2020 at 20:59
  • 1
    That's not me, just tried to make the question more comprehensible, with the suggestion. Commented Feb 20, 2020 at 22:03

1 Answer 1

2

Is this what you're looking for?

df %>% dplyr::mutate_at(vars(contains('_dups_hash_ing')), ~ ifelse(. > 1, ., 0)) %>% head
#> # A tibble: 6 x 6
#> # Groups:   a [2]
#>   a     b                  c     a_dups_hash_ing b_dups_hash_ing c_dups_hash_ing
#>   <fct> <fct>              <fct>           <dbl>           <int>           <int>
#> 1 si    -2.7322195229558   d                   0              30               3
#> 2 no    -0.09979058844189… e                   0              30               3
#> 3 si    0.976031734922396  f                   0              30               3
#> 4 no    0.413868915451097  d                   0              30               3
#> 5 si    0.912322161610113  e                   0              30               3
#> 6 no    1.98373220068438   f                   0              30               3
Sign up to request clarification or add additional context in comments.

5 Comments

Thank you, but I need to do something like this: df %>% dplyr::mutate(b_dups_hash_ing=ifelse(. > 1,b, 0)) %>% head, but for more than 50 columns (I was wondering to do this by the function mutate_at, but I can't imagine how to do it).
Is there a predicate function that can be passed to mutate_if to select the columns you want or can you simply pass a numeric vector of positions to mutate_at(vars())?
I can simply pass a numeric vector of positions. I have a vector of names of columns, which are the original ones (c(b,c)), and a vector of column names that I created that should be of the same length, which are the ones that contain the suffix "_dup_hash_ing" (contains('_dups_hash_ing')). This is what I did in the previous step: dplyr::mutate_at(vars(c(b,c)), .funs= list(dups_hash_ing= ~n_distinct(.))). Now that i can detect if there are more than one case per group (in column a), I can replace values with the original ones if there are more than one case per group.
Sorry. I dont speak english very well. I need to replace values in variables that contains "_dup_hash_ing" ("b_dup_hash_ing" and "c_dup_hash_ing") with the original variable ("b" and "c"), for each one, if the values of "_dup_hash_ing" are greater than 1.
I need a function that can do something like this: dplyr::mutate_at(vars(c(b,c)), .funs= funs(dups_hash_ing=ifelse(n_distinct(.)>1, as.character(.),""))) but by giving me the distinct values per row, no by group (a) (group_by(a))

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.