Drop columns that are replicated in a data frame

Question

I have a large data frame with repeated variables. This is just a sample of my data to illustrate the question:

df <- data.frame(
  ID = rep(1:4, each = 1),
  CMW = rep(c(10, 20, 30, 30), each = 1),
  D_D = c(rep(100, 3), 200), 
  D_D = c(rep(100, 3), 200),
  D_D = c(rep(100, 1), 200),
  Eref = rep(4:4, each = 1),
  Eref = rep(4:4, each = 1),
  Eref = rep(1:4, each = 1),
  Eref = rep(1:4, each = 1)
)


  ID CMW D_D D_D.1 D_D.2 Eref Eref.1 Eref.2 Eref.3
1  1  10 100   100   100    4      4      1      1
2  2  20 100   100   200    4      4      2      2
3  3  30 100   100   100    4      4      3      3
4  4  30 200   200   200    4      4      4      4

R appends numbers to the variable names to make them unique, but variables that share the same "root name" (the string before the dot) are actually the same variable. What I am trying to do is this: if a variable is repeated, compare the values across its copies, and if the values are identical, keep only one column of that variable. However, if the copies fall into two sets with different values, keep one column from each set. I want to do this for all the repeated variables in my data frame. For example, from the sample data frame above (df) I want the following result:

  ID CMW D_D D_D.1 Eref Eref.1
1  1  10 100   100    4      1
2  2  20 100   200    4      2
3  3  30 100   100    4      3
4  4  30 200   200    4      4

So far I was able to check if there are repeated variables in my data frame with this code:

duplicated_col <- unique(sub("\\.\\d+$", "", names(df))[duplicated(sub("\\.\\d+$", "", names(df)))])

But I am not sure how to compare the repeated variables and decide which to drop or keep to obtain the result above. Any help is very welcome. Thank you!

LMc · Accepted Answer · 2024-06-27 20:03:21Z

Here is a base R solution:

df |> 
  split.default(gsub("\\.\\d+$", "", names(df))) |>
  lapply(\(x) {
    l <- as.list(x)
    distinct <- unique(l)
    setNames(distinct, names(l)[seq_along(distinct)])
  }) |>
  unname() |>
  data.frame()
#   CMW D_D D_D.1 Eref Eref.1 ID
# 1  10 100   100    4      1  1
# 2  20 100   200    4      2  2
# 3  30 100   100    4      3  3
# 4  30 200   200    4      4  4
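
To see what the pipeline is doing, the intermediate steps can be inspected one at a time. This is only a sketch for illustration: it rebuilds the question's sample data with the recycled vectors written out in full, and the names `root`, `groups` and `d_d_cols` are made up here, not part of the answer's code.

```r
# Rebuild the sample data. check.names = TRUE is the data.frame() default,
# so repeated names are made unique as D_D.1, D_D.2, Eref.1, ...
df <- data.frame(
  ID = 1:4,
  CMW = c(10, 20, 30, 30),
  D_D = c(100, 100, 100, 200),
  D_D = c(100, 100, 100, 200),
  D_D = c(100, 200, 100, 200),
  Eref = 4,
  Eref = 4,
  Eref = 1:4,
  Eref = 1:4
)

# Step 1: strip the numeric suffix to get each column's root name.
root <- sub("\\.\\d+$", "", names(df))

# Step 2: split the columns (not the rows) into one data frame per root name.
groups <- split.default(df, root)
names(groups)
# "CMW" "D_D" "Eref" "ID"

# Step 3: within one group, unique() on the list of columns drops any column
# whose values are identical to an earlier column in that group.
d_d_cols <- unique(as.list(groups$D_D))
length(d_d_cols)
# 2
```

Because the columns are compared only within their own group, a column is never dropped for matching a column with a different root name.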


Onyambu · Accepted Answer · 2024-06-27 20:09:39Z

split.default(df, sub("\\.\\d+$", "", names(df))) |>
  lapply(\(x)unique(as.matrix(unname(x)), MARGIN = 2)) |>
  data.frame()

  CMW D_D.1 D_D.2 Eref.1 Eref.2 ID
1  10   100   100      4      1  1
2  20   100   200      4      2  2
3  30   100   100      4      3  3
4  30   200   200      4      4  4

If you want to maintain the order of appearance, add another step to the pipe:

fn <- function(x,d) x[order(match(names(x), names(d)))]

split.default(df, sub("\\.\\d+$", "", names(df))) |>
   lapply(\(x)unique(as.matrix(unname(x)), MARGIN = 2)) |>
   data.frame() |> fn(df)

  ID CMW D_D.1 D_D.2 Eref.1 Eref.2
1  1  10   100   100      4      1
2  2  20   100   200      4      2
3  3  30   100   100      4      3
4  4  30   200   200      4      4

zx8754 · Accepted Answer · 2024-06-27 20:12:08Z

Transpose the data frame and check which columns are duplicated:

df[, c("ID", names(df[, -1])[ !duplicated(t(df[, -1 ]))]) ]
#   ID CMW D_D D_D.2 Eref Eref.2
# 1  1  10 100   100    4      1
# 2  2  20 100   200    4      2
# 3  3  30 100   100    4      3
# 4  4  30 200   200    4      4

Note that I am excluding ID from the transpose step, as it has the same values as Eref.2 and Eref.3. If your IDs never match the values in other columns, the code is even simpler:

df[, names(df)[ !duplicated(t(df)) ] ]


Onyambu Over a year ago

This assumes other columns won't duplicate the existing ones. E.g., if there were two other columns A and A.1 with values 1:4, they would be dropped.
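
A small sketch of the caveat in this comment, using made-up data (`df2`, with roots `A` and `B`, is hypothetical and not from the question). It contrasts comparing all columns at once with comparing columns only within the same root name:

```r
# Hypothetical data: two different roots (A and B) that happen to hold the
# same values.
df2 <- data.frame(ID = 11:14, A = 1:4, A = 1:4, B = 1:4, B = 1:4)
names(df2)
# "ID" "A" "A.1" "B" "B.1"

# Whole-frame comparison: B and B.1 count as duplicates of A, so B is lost.
global <- df2[, !duplicated(t(df2)), drop = FALSE]
names(global)
# "ID" "A"

# Per-root comparison: duplicates are removed only within the same root name,
# so one column of A and one column of B survive, in their original order.
root <- sub("\\.\\d+$", "", names(df2))
keep <- logical(ncol(df2))
for (r in unique(root)) {
  idx <- which(root == r)
  keep[idx] <- !duplicated(as.list(df2[idx]))
}
per_root <- df2[, keep, drop = FALSE]
names(per_root)
# "ID" "A" "B"
```

Whether the whole-frame behaviour is a problem depends on the data: it only matters when columns with different root names can legitimately contain identical values.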
