I have a dataset with thousands of columns, some of which share the same column name. I want to merge the columns that share a name so that their values are appended as rows. For columns that have no duplicate, 0 should be appended to fill the remaining rows.

Clarification: the data below is just an example. The real dataset has thousands of columns; many of the column names are duplicated and many are not.

Sample Input Data

Col_1 Col_1 Col_1 Col_1 Col_2
  1     2     3     4   5
  5     6     7     8   5
  9    10    11    12   5
 13    14    15    16   5

Sample Output Data

Col_1 Col_2
  1    5
  2    5
  3    5
  4    5
  5    0
  6    0
  7    0
  8    0
  9    0
 10    0
 11    0
 12    0
 13    0
 14    0
 15    0
 16    0

3 Answers

Here is an approach that involves some manual work. Assume your dataset is stored in the variable test.

# may only require some of the packages of tidyverse
library(tidyverse)

# this will give all column unique names
renamed_test <- test %>%
                set_names(str_c(names(test), 1:ncol(test)))

# then for each duplicated column name, they now start with the same prefix;
# so select all these columns and use gather to append them one after another,
# and finally rename the merged column back to the original name
bound_col_1 <- renamed_test %>%
               select(starts_with("Col_1")) %>%
               gather %>%
               transmute(Col_1 = value)

# repeat this for 'Col_2'
# .....

# last, column-bind all these results (every piece must have the same number of rows);
# add any other merged columns as further arguments
bind_cols(bound_col_1, bound_col_2)

Edit:

I generalized the solution so that it automatically finds all duplicated column names and row-binds the columns for each one.

library(tidyverse)

# testing data
test <- data.frame(c(1,2,3), c(7,8,9), c(4,5,6), c(10,11,12), c(100, 101, 102)) %>%
  set_names(c("Col_1", "Col_2", "Col_1", "Col_2", "Col_3"))

col_names <- names(test)

# find all column names that occur more than once
# (unique() avoids repeats when a name appears three or more times)
dup_names <- unique(col_names[duplicated(col_names)])

# make the column names unique so it will work with tidyr
renamed_test <- test %>%
  set_names(str_c(col_names, "-", 1:ncol(test)))

unique_data <- test[!(duplicated(col_names) | duplicated(col_names, fromLast = TRUE))]

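# for each duplicated column name, merge all columns that have the same name
tmp_result <- dup_names %>% map(function(col_name) {
  renamed_test %>%
    # match "<name>-" so that, e.g., Col_1 does not also pick up Col_10
    select(starts_with(str_c(col_name, "-"))) %>%
    gather %>% # bind rows
    select(-1) %>% # drop the key column; the merged values are in the last column
    set_names(col_name) # rename the column back to its original name
}) %>% bind_cols # requires every merged column to have the same number of rows

# stack the merged columns and the non-duplicated columns;
# cells that exist on only one side are filled with NA
result <- bind_rows(tmp_result, unique_data)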

Binding the columns is the tricky part, because the merged columns can end up with different numbers of rows. One way around this is to compare the lengths each time you merge and pad the shorter column by appending 0s.
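
A minimal sketch of that padding step, in base R. The helper name pad_with_zeros and the small merged list are made up for illustration; the sketch assumes the merged columns are already available as a named list of numeric vectors.

```r
# Hypothetical helper: pad every vector in a named list with trailing zeros
# so that all of them reach the length of the longest one.
pad_with_zeros <- function(cols) {
  target <- max(lengths(cols))
  padded <- lapply(cols, function(v) c(v, rep(0, target - length(v))))
  # check.names = FALSE keeps the original column names untouched
  data.frame(padded, check.names = FALSE)
}

# Illustrative input: one merged column with 6 values, one with 3
merged <- list(Col_1 = c(1, 4, 2, 5, 3, 6), Col_3 = c(100, 101, 102))
result <- pad_with_zeros(merged)
```

Once every vector has the same length, column binding no longer fails on a row-count mismatch.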

5 Comments

.@KeqiangLi - Sorry, the example I gave is not the real scenario. I have thousands of columns; some of them have duplicate column names and some don't. I simply want to merge every duplicated column name (and its data) in the same way as in the example in the question.
This is still doable; you just need to build on my example. First, it's easy to find all columns that have duplicate names with dup_names <- names(test)[duplicated(names(test))]. Then you can iterate through that vector and apply the logic in my answer.
@ChetanArvindPatil I updated the solution and now it should generalize well.
.@KeqiangLi - Thanks. I am running into Error in cbind_all(x) : Argument 2 must be length 6, not 102. The values 6 and 102 are specific to my large dataset. Since I am not yet good with tidyverse, I don't know how to solve this. Any suggestions, please?
@ChetanArvindPatil when using bind_cols, the two data frames must have the same number of rows, as you can see if you imagine how column binding works (this is not specific to 'tidyverse'; base R requires it too). The 'automatic merging' won't happen by magic; it needs the real dataset and a fair amount of debugging. If you put in a couple of breakpoints, print the variables generated along the way, and append rows so that the two data frames have the same number of rows, you should be able to figure it out.

Try this. The logic of the expected output isn't entirely clear. EDIT: I think the best one can do is simply melt the data like this:

library(tidyverse)
df1<-df %>% 
  gather("ID","Value") %>% 
  group_by(ID) %>% 
  arrange(Value)

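# strip the ".1", ".2", ... suffix that read.table adds to duplicated names
# (the dot is escaped and \\d+ also covers suffixes with several digits)
df1$ID<-str_replace_all(df1$ID,"Col_1\\.\\d+","Col_1")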
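
The pattern above hard-codes Col_1. As a rough sketch of how the same cleanup could work without knowing the column names in advance, the suffix can be stripped from every name at once. This assumes the suffixes were added by read.table (or make.unique) and that no original column name itself ends in a dot followed by digits; the vector deduped is illustrative only.

```r
# Illustrative names as read.table would produce them for duplicated headers
deduped <- c("Col_1", "Col_1.1", "Col_1.2", "Col_1.10", "Col_2")

# Remove a trailing ".<digits>" from each name, whatever the base name is
original <- sub("\\.\\d+$", "", deduped)
```

Applied to the ID column of the melted data, this maps every de-duplicated name back to the name it came from.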

You could proceed like this but I feel leaving the data melted is better.

library(reshape2)
df1 %>% 
  ungroup() %>% 
  dcast(Value~ID,fun=mean) %>% 
  mutate(Col_2=ifelse(Col_1<=4,5,0)) %>% 
  select(-Value)

Result (melted). The question then is how to deal with the duplicates:

   ID    Value
   <chr> <int>
 1 Col_1     1
 2 Col_1     2
 3 Col_1     3
 4 Col_1     4
 5 Col_1     5
 6 Col_2     5
 7 Col_2     5
 8 Col_2     5
 9 Col_2     5
10 Col_1     6
11 Col_1     7
12 Col_1     8
13 Col_1     9
14 Col_1    10
15 Col_1    11
16 Col_1    12
17 Col_1    13
18 Col_1    14
19 Col_1    15
20 Col_1    16

Original:

library(tidyverse)
df %>% 
  gather(key,value,-Col_2) %>% 
  arrange(value) %>% 
  rename(Col_1=value) %>% 
  mutate(Col_2=ifelse(Col_1<=4,5,0)) %>% 
  select(Col_1,everything(),-key)

Result:

      Col_1 Col_2
1      1     5
2      2     5
3      3     5
4      4     5
5      5     0
6      6     0
7      7     0
8      8     0
9      9     0
10    10     0
11    11     0
12    12     0
13    13     0
14    14     0
15    15     0
16    16     0

7 Comments

.@NelsonGon - The example I gave is not the real scenario. I have thousands of columns; some of them have duplicate column names and some don't.
How do you want to deal with the duplicates?
.@NelsonGon - Sorry if this is confusing. I simply want to merge every duplicated column name (and its data) in the same way as in the example in the question.
.@NelsonGon - Thank you, but you are still assuming the column name is Col_1. I don't think this will work on large data where I don't know the column names and simply want it to happen dynamically.
@NelsonGon - Thanks. Sorry, as I read more I am realizing this works well with tidyverse. However, I am just getting started with that package. Maybe the answer by @KeqiangLi will help. I am trying to resolve an issue I am running into with it.

Here's a pretty complicated answer. Some of the code is a bit clunky, but it is a general solution.

Solution

library(tidyverse)
library(magrittr)

# function to create lookup table, matching duplicate column names to syntactically valid names 
rel <- function(x) {x %>% 
  colnames %>% 
  make.names(., unique = TRUE) %>% 
  as.data.frame() %>% 
  mutate(names(x)) %>% 
  setNames(c("New", "Old")) }

# create lookup table to match old and new column names
lookup <- rel(df)

# gather df into long format
df_long <- df %>% 
  setNames(lookup$New) %>% 
  gather(var, value)

# match new names to original names
df_colnames <- lapply(1:length(unique(lookup$Old)), function(x) grepl(unique(lookup$Old)[x], df_long$var)) %>% 
  setNames(unique(lookup$Old)) %>% 
  as.data.frame

# vector replacing new syntactically valid names with original names
column <- lapply(names(df_colnames), function(x) ifelse(df_colnames[, x], x, F)) %>% 
  setNames(unique(lookup$Old)) %>% 
  as.data.frame %>% 
  unite(comb, sep = "") %>% 
  magrittr::extract(, "comb") %>% 
  gsub("FALSE", "", .)

# put original columns into lists
final_list <- df_long %>% 
  mutate(var = column) %>% 
  arrange(var, value) %>% 
  split(.$var) %>% 
  map(~select_at(.x, c("value"))) %>% 
  lapply(function(x) x$value)

# create vectors of zeros to append to original data
final_list_extend <- sapply(abs(unlist(lapply(final_list, length)) - max(unlist(lapply(final_list, length)))), function(x) rep(0, x))

# append zeros to original data and rename columns to match original names
# (as_data_frame is deprecated in tibble, so convert via as.data.frame and as_tibble)
output <- sapply(1:length(final_list), function(x) c(final_list[[x]], final_list_extend[[x]])) %>% 
  as.data.frame %>% 
  as_tibble %>% 
  setNames(unique(lookup$Old))

#show result
output

# A tibble: 16 x 2
   Col_1 Col_2
   <dbl> <dbl>
 1     1     5
 2     2     5
 3     3     5
 4     4     5
 5     5     0
 6     6     0
 7     7     0
 8     8     0
 9     9     0
10    10     0
11    11     0
12    12     0
13    13     0
14    14     0
15    15     0
16    16     0

Data

df <- read.table(header = TRUE, text = "
Col_1 Col_1 Col_1 Col_1 Col_2
  1     2     3     4   5
  5     6     7     8   5
  9    10    11    12   5
 13    14    15    16   5") %>% 
  setNames(c("Col_1", "Col_1", "Col_1", "Col_1", "Col_2"))
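
For comparison, here is a much shorter base R sketch of the same idea: group the column positions by name, flatten each group row by row, and pad the shorter results with zeros. It is not the code used above, just an alternative under the assumption that all columns are numeric; the variable names (groups, stacked, padded) are chosen for illustration.

```r
# Rebuild the sample data with its duplicated column names
df <- setNames(
  as.data.frame(cbind(matrix(1:16, nrow = 4, byrow = TRUE), 5)),
  c("Col_1", "Col_1", "Col_1", "Col_1", "Col_2")
)

# Group column positions by name, keeping the original order of first appearance
groups <- split(seq_along(df), factor(names(df), levels = unique(names(df))))

# Flatten each group in row-major order (row 1 of all its columns, then row 2, ...)
stacked <- lapply(groups, function(idx) as.vector(t(as.matrix(df[idx]))))

# Pad the shorter vectors with zeros and assemble the result
n <- max(lengths(stacked))
padded <- lapply(stacked, function(v) c(v, rep(0, n - length(v))))
output <- data.frame(padded, check.names = FALSE)
```

Indexing by position rather than by name sidesteps the duplicated names entirely, so no renaming or lookup table is needed.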
