Removing duplicate rows on the basis of specific columns

Question

How can I remove duplicate rows on the basis of specific columns while keeping the rest of the dataset intact? I tried the approaches from two other posts (link1, link2).

I want to check for duplicates on the basis of columns 3 to 6. If two rows have the same values in those columns, the processed dataset should drop the repeated row, as shown in the example:

I used this code, but it gave me only half the result:

Data <- unique(Data[, 3:6])

Let's suppose my dataset looks like this:

 A  B  C  D  E  F  G  H  I  J  K  L  M
 1  2  2  1  5  4 12  A  3  5  6  2  1
 1  2  2  1  5  4 12  A  2 35 36 22 21
 1 22 32 31  5 34 12  A  3  5  6  2  1

What I want in my output is:

 A  B  C  D  E  F  G  H  I  J  K  L  M
 1  2  2  1  5  4 12  A  3  5  6  2  1
 1 22 32 31  5 34 12  A  3  5  6  2  1

akrun · Accepted Answer · 2015-08-07 06:33:46Z

2

Another option is the unique method from data.table, which has a by argument. We convert the 'data.frame' to a 'data.table' with setDT(df1) (this converts by reference, so df1 itself becomes a data.table), then call unique and specify the columns in by:

 library(data.table)
 unique(setDT(df1), by= names(df1)[3:6])
 #   A  B  C  D E  F  G H I J K L M
 #1: 1  2  2  1 5  4 12 A 3 5 6 2 1
 #2: 1 22 32 31 5 34 12 A 3 5 6 2 1

Here unique returns a data.table in which only the first row of each duplicated combination of the by columns is kept.
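If df1 should remain an ordinary data.frame, something along the following lines may work. This is only a minimal sketch: it assumes the data.table package is installed, and the reduced example data and the names dt and res are illustrative, not part of the original post. It relies on as.data.table() making a copy, whereas setDT() converts in place.

```r
library(data.table)

# Illustrative data: a reduced version of the example from the question
df1 <- data.frame(
  A = c(1, 1, 1),
  B = c(2, 2, 22),
  C = c(2, 2, 32),
  D = c(1, 1, 31),
  E = c(5, 5, 5),
  F = c(4, 4, 34),
  G = c(12, 12, 12)
)

# as.data.table() copies, so df1 remains a plain data.frame
dt <- as.data.table(df1)

# Keep the first row for each distinct combination of columns 3 to 6
res <- unique(dt, by = names(dt)[3:6])
nrow(res)  # 2
```

Naming the columns directly, for example by = c("C", "D", "E", "F"), is equivalent here and may be safer if the column order could change.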

edited Aug 7, 2015 at 6:33

answered Aug 7, 2015 at 6:18

akrun


3 Comments

akrun Over a year ago

@ayush What is the other question?

ayush Over a year ago

I already have the solution for the dummy data, but it doesn't work on my original dataset. I tried every possible permutation in my code, but it won't work. Can I mail you the question? Or you can ping me over mail so that I can do that.

akrun Over a year ago

@ayush I am using a SIM card to connect to the internet, so downloading big datasets is costly for me. Can't you provide a dummy example that mimics your original dataset as a new post?

RHertel · Accepted Answer · 2015-08-07 06:47:28Z

2

Assuming that your data is stored as a data frame, you could try:

Data <- Data[!duplicated(Data[,3:6]),]
#> Data
#  A  B  C  D E  F  G H I J K L M
#1 1  2  2  1 5  4 12 A 3 5 6 2 1
#3 1 22 32 31 5 34 12 A 3 5 6 2 1

The function duplicated() returns a logical vector with one entry per row. In this case, an entry is TRUE if the combination of values in columns 3 to 6 has already appeared in an earlier row, so the first occurrence of each combination is FALSE. The negation ! of this vector is used to select rows from your dataset, which leaves one row (the first) for each unique combination of the entries in columns 3 to 6, with all other columns retained.
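To make the intermediate step visible, here is a small sketch in base R using the example data from the question. The names dup, result, dup_last, and result_last are illustrative only. The fromLast = TRUE variant is an addition not discussed in the answer; it keeps the last occurrence of each combination instead of the first.

```r
# Example data from the question
Data <- data.frame(
  A = c(1, 1, 1),
  B = c(2, 2, 22),
  C = c(2, 2, 32),
  D = c(1, 1, 31),
  E = c(5, 5, 5),
  F = c(4, 4, 34),
  G = c(12, 12, 12),
  H = c("A", "A", "A"),
  I = c(3, 2, 3),
  J = c(5, 35, 5),
  K = c(6, 36, 6),
  L = c(2, 22, 2),
  M = c(1, 21, 1)
)

# TRUE marks rows whose columns 3 to 6 repeat an earlier row
dup <- duplicated(Data[, 3:6])
dup  # FALSE TRUE FALSE

# Keep the first occurrence of each combination, with all 13 columns
result <- Data[!dup, ]
rownames(result)  # "1" "3"

# Variant: scan from the bottom to keep the last occurrence instead
dup_last <- duplicated(Data[, 3:6], fromLast = TRUE)
result_last <- Data[!dup_last, ]
rownames(result_last)  # "2" "3"
```

Note the trailing comma in Data[!dup, ]: the logical vector selects rows, and leaving the column index empty keeps every column, which is what the unique(Data[, 3:6]) attempt in the question loses.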

Thanks to @thelatemail for pointing out a mistake in my previous post.

edited Aug 7, 2015 at 6:47

answered Aug 7, 2015 at 6:14

RHertel
