
I have a large data frame with repeated variables. This is just a sample of my data to illustrate the question:

df <- data.frame(
  ID = rep(1:4, each = 1),
  CMW = rep(c(10, 20, 30, 30), each = 1),
  D_D = c(rep(100, 3), 200), 
  D_D = c(rep(100, 3), 200),
  D_D = c(rep(100, 1), 200),
  Eref = rep(4:4, each = 1),
  Eref = rep(4:4, each = 1),
  Eref = rep(1:4, each = 1),
  Eref = rep(1:4, each = 1)
)


  ID CMW D_D D_D.1 D_D.2 Eref Eref.1 Eref.2 Eref.3
   1  10 100   100   100    4      4      1      1
   2  20 100   100   200    4      4      2      2
   3  30 100   100   100    4      4      3      3
   4  30 200   200   200    4      4      4      4

R appends numbers to the variable names to make them unique, but variables that share the same "root name" (the string before the dot) are actually the same variable. What I am trying to do is this: if a variable is repeated, compare the values in its columns, and if they are identical, keep only one column for that variable. However, if the repeats form two distinct sets of identical columns, keep one column from each set. I want to do this for all the repeated variables in my data frame. For example, from the sample data frame above (df), I want the following result:

  ID CMW D_D D_D.1 Eref Eref.1
   1  10 100   100    4      1
   2  20 100   200    4      2
   3  30 100   100    4      3
   4  30 200   200    4      4

So far I was able to check if there are repeated variables in my data frame with this code:

duplicated_col <- unique(sub("\\.\\d+$", "", names(df))[duplicated(sub("\\.\\d+$", "", names(df)))])

But I am not sure how to compare the repeated variables and drop or keep columns to obtain the result shown above. Any help is very welcome. Thank you!

3 Answers


Here is a base R solution:

df |> 
  split.default(gsub("\\.\\d+$", "", names(df))) |>
  lapply(\(x) {
    l <- as.list(x)
    distinct <- unique(l)
    setNames(distinct, names(l)[seq_along(distinct)])
  }) |>
  unname() |>
  data.frame()
#   CMW D_D D_D.1 Eref Eref.1 ID
# 1  10 100   100    4      1  1
# 2  20 100   200    4      2  2
# 3  30 100   100    4      3  3
# 4  30 200   200    4      4  4
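
To make the pipeline above easier to follow, here is a minimal sketch of its first steps run one at a time. The sample data is rebuilt with the same values as in the question, and the intermediate names `roots`, `groups`, and `eref_cols` are introduced here for illustration only.

```r
# Sample data with the same values as in the question; data.frame() makes
# the repeated names unique (D_D, D_D.1, D_D.2, Eref, Eref.1, ...)
df <- data.frame(
  ID = 1:4,
  CMW = c(10, 20, 30, 30),
  D_D = c(100, 100, 100, 200),
  D_D = c(100, 100, 100, 200),
  D_D = c(100, 200, 100, 200),
  Eref = 4L,
  Eref = 4L,
  Eref = 1:4,
  Eref = 1:4
)

# Step 1: strip the ".<digits>" suffix to recover each column's root name
roots <- sub("\\.\\d+$", "", names(df))

# Step 2: split the columns (not the rows) into one data frame per root name
groups <- split.default(df, roots)
names(groups)
# "CMW" "D_D" "Eref" "ID"

# Step 3: within one group, unique() on the list of columns drops repeats
eref_cols <- as.list(groups$Eref)
length(unique(eref_cols))
# 2
```

Because `split.default()` orders the groups by root name, the result comes out in alphabetical order rather than in the original column order.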

split.default(df, sub("\\.\\d+$", "", names(df))) |>
  lapply(\(x) unique(as.matrix(unname(x)), MARGIN = 2)) |>
  data.frame()

  CMW D_D.1 D_D.2 Eref.1 Eref.2 ID
1  10   100   100      4      1  1
2  20   100   200      4      2  2
3  30   100   100      4      3  3
4  30   200   200      4      4  4

If you want to maintain the order of appearance, add another step to the pipe:

fn <- function(x,d) x[order(match(names(x), names(d)))]

split.default(df, sub("\\.\\d+$", "", names(df))) |>
  lapply(\(x) unique(as.matrix(unname(x)), MARGIN = 2)) |>
  data.frame() |> fn(df)

  ID CMW D_D.1 D_D.2 Eref.1 Eref.2
1  1  10   100   100      4      1
2  2  20   100   200      4      2
3  3  30   100   100      4      3
4  4  30   200   200      4      4
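
A note on the suffix pattern used to find root names: the dot has to be escaped if it should match only a literal `.` followed by digits. The small check below uses hypothetical column names (`X12` is not in the question's data) to show the difference.

```r
# Hypothetical column names; "X12" has no ".<digits>" suffix
nms <- c("Eref", "Eref.1", "X12")

# Unescaped dot matches any character, so "X12" is wiped out entirely
sub(".\\d+$", "", nms)
# "Eref" "Eref" ""

# Escaped dot matches only a literal ".", so "X12" is left alone
sub("\\.\\d+$", "", nms)
# "Eref" "Eref" "X12"
```

With the question's data both patterns give the same root names, so the difference only matters for names that end in digits without a preceding dot.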


Transpose and check if it is duplicated:

df[, c("ID", names(df[, -1])[ !duplicated(t(df[, -1 ]))]) ]
#   ID CMW D_D D_D.2 Eref Eref.2
# 1  1  10 100   100    4      1
# 2  2  20 100   200    4      2
# 3  3  30 100   100    4      3
# 4  4  30 200   200    4      4

Note that I exclude ID from the transpose step, because it has the same values as Eref.2 and Eref.3. If your IDs never match the values in other columns, the code is even simpler:

df[, names(df)[ !duplicated(t(df)) ] ]

1 Comment

This assumes that columns with a different root name never duplicate existing ones. E.g., if there were two other columns A and A.1 with values 1:4, they would both be dropped.
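
The caveat can be reproduced with a small sketch. The data frame `d` and its columns are hypothetical, and the group-aware variant at the end is only one possible way to restrict the comparison to columns sharing a root name, not part of the answer above.

```r
# Hypothetical data: A and A.1 repeat the values that Eref already holds
d <- data.frame(
  ID = 11:14,
  Eref = 1:4,
  A = 1:4,
  A = 1:4
)
names(d)
# "ID" "Eref" "A" "A.1"

# Deduplicating across the whole frame ignores root names, so A disappears
names(d)[!duplicated(t(d))]
# "ID" "Eref"

# Pairing each column with its root name limits the comparison to columns
# that share a root, so one A column survives
roots <- sub("\\.\\d+$", "", names(d))
keep <- !duplicated(lapply(seq_along(d), \(i) list(roots[i], d[[i]])))
names(d)[keep]
# "ID" "Eref" "A"
```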
