R remove duplicate data from each column

Question

I get CSV's with hundreds of different columns and would like to be able to output a new file with the duplicate values removed from each column. Everything that I have seen and tried uses a specific column. I just need each column to be unique values.

For Example My Data:

df <- data.frame(A = c(1, 2, 3, 4, 5, 6), B = c(1, 0, 1, 0, 0, 1), C = c("Mr.","Mr.","Mrs.","Miss","Mr.","Mrs."))
df
    A B    C
  1 1 1  Mr.
  2 2 0  Mr.
  3 3 1 Mrs.
  4 4 0 Miss
  5 5 0  Mr.
  6 6 1 Mrs.

I would like:

Then I can:

write.csv(df, file = file.path(df, "df_No_Dupes.csv"), na="")

So I can use it as a reference for my next task.

MrFlick · Accepted Answer · 2016-07-27 18:04:13Z

1

read.csv and write.csv work best with tabular data. Your desired output is not a good example of this (every row does not have the same number of columns).

You can easily get all the unique value for your columns with

vals <- sapply(df, unique)

Then you'd be better off saving this object with save() and load() to preserve the list as an R object.

answered Jul 27, 2016 at 18:04

MrFlick

209k19 gold badges300 silver badges324 bronze badges

Sign up to request clarification or add additional context in comments.

Comments

PhilC · Accepted Answer · 2016-07-27 20:06:50Z

1

Code snippet to work with a flexible number of columns, remove duplicate columns, and preserve column names:

require(rowr)

df <- data.frame(A = c(1, 2, 3, 4, 5, 6), B = c(1, 0, 1, 0, 0, 1), C = c("Mr.","Mr.","Mrs.","Miss","Mr.","Mrs."))

#get the number of columns in the dataframe
n <- ncol(df)

#loop through the columns
for(i in 1:ncol(df)){

  #replicate column i without duplicates, fill blanks with NAs
  df <-  cbind.fill(df,unique(df[,1]), fill = NA)
  #rename the new column
  colnames(df)[n+1] <- colnames(df)[1]
  #delete the old column
  df[,1] <- NULL
}

answered Jul 27, 2016 at 20:06

PhilC

7873 silver badges8 bronze badges

Comments

Hack-R · Accepted Answer · 2016-07-27 19:08:33Z

0

df <- data.frame(A = c(1, 2, 3, 4, 5, 6), B = c(1, 0, 1, 0, 0, 1), C = c("Mr.","Mr.","Mrs.","Miss","Mr.","Mrs."))


for(i in 1:ncol(df)){
  assign(paste("df_",i,sep=""), unique(df[,i]))
}

require(rowr)
df <- cbind.fill(df_1,df_2,df_3, fill = NA)

  V1 V1   V1
1  1  1  Mr.
2  2  0 Mrs.
3  3 NA Miss
4  4 NA <NA>
5  5 NA <NA>
6  6 NA <NA>

or you could do

require(rowr)
df <- cbind.fill(df_1,df_2,df_3, fill = "")
df

  V1 V1   V1
1  1  1  Mr.
2  2  0 Mrs.
3  3    Miss
4  4        
5  5        
6  6

If you want to avoid typing the name of each intermediate dataframe you can just use ls(pattern="df_") and get the objects named in that vector or use another loop.

If you want to change the column names back to their original values you can use:

colnames(output_df) <- colnames(input_df)

Then you can save the results however you, like, i.e.

saveRDS()

save()

or write it to a file.

Putting it all together:

df <- data.frame(A = c(1, 2, 3, 4, 5, 6), B = c(1, 0, 1, 0, 0, 1), C = c("Mr.","Mr.","Mrs.","Miss","Mr.","Mrs."))


for(i in 1:ncol(df)){
  assign(paste("df_",i,sep=""), unique(df[,i]))
}

require(rowr)
files     <- ls(pattern="df_")

df_output <- data.frame()
for(i in files){
  df_output <- cbind.fill(df_output, get(i), fill = "")
}

df_output <- df_output[,2:4] # fix extra colname from initialization
colnames(df_output) <- colnames(df)
write.csv(df_output, "df_out.csv",row.names = F)

verify_it_worked <- read.csv("df_out.csv")
verify_it_worked

  A  B    C
1 1  1  Mr.
2 2  0 Mrs.
3 3    Miss
4 4      
5 5      
6 6

edited Jul 27, 2016 at 19:08

answered Jul 27, 2016 at 17:25

Hack-R

23.5k15 gold badges82 silver badges138 bronze badges

4 Comments

Trigs Over a year ago

This works for the current data set however I sometimes have 100 or more columns so typing out df_1, df_2... would not work. So after the For loop when I have every column outputted into values can I run another loop to to grab every value starting with df_ and combine into 1 file? Also if the headers could be the original name that would be perfect.

Hack-R Over a year ago

@Trigs Yes, sure. You can also us ls() to get a list of the objects in your environment with a certain pattern, i.e. ls(pattern="df_"). If you want to change the colnames it's just colnames(output_df) <- colnames(input_df)

Trigs Over a year ago

Looks like it is adding another column of NA so everything is shifted right. Column A is all NA. Also this is a huge help so thank you!

Hack-R Over a year ago

@Trigs Oh I see what you're saying. A weird result of how I created the object. I will add one line to fix it right now. You're welcome for the help. Please feel free to upvote if it was helpful.

Collectives™ on Stack Overflow

R remove duplicate data from each column

3 Answers 3

Comments

Comments

4 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

Comments

Comments

4 Comments

Your Answer

Sign up or log in

Post as a guest

Related