Delete row if same value in all columns

Question

I have a large space-delimited file with thousands of rows and columns. I would like to remove all lines that have the same value in every column except the first.

Input:

CHROM   108 139 159 265 350 351
SNP1    -1  -1  -1  -1  -1  -1
SNP2    2   2   2   2   2   2
SNP3    0   0   0   -1  -1  -1
SNP4    1   1   1   1   1   1
SNP5    0   0   0   0   0   0

Desired output:

CHROM   108 139 159 265 350 351
SNP3    0   0   0   -1  -1  -1

There is a similar question for pandas (Delete duplicate rows with the same value in all columns in pandas), and I found a partial solution that removes lines containing only zeros:

awk 'NR > 1{s=0; for (i=3;i<=NF;i++) s+=$i; if (s!=0)print}' input > outfile

but I want to do this for the values -1, 0, 1 and 2 in one go, keeping the header and treating the first column as the identifier.

Any help will be highly appreciated.

kvantour · Accepted Answer · 2018-10-09 15:15:53Z

2

I believe you can do something like this:

awk '{s=$0; gsub(FS $2,FS)} (NF > 1) {print s}' file

Which outputs:

CHROM   108 139 159 265 350 351
SNP3    0   0   0   -1  -1  -1

How does this work?

{s=$0; gsub(FS $2,FS)}: This action has two parts:
- Store the current line in the variable s.
- In the current line $0, replace every occurrence of the second field together with its leading field separator (FS $2) with a single field separator FS. As a side effect, $0 is modified, so all field variables and the field count NF are recomputed. The leading FS is needed so that a value such as xx is not removed entirely when $2=x.
(NF > 1) {print s}: If more than one field is left, the line contains differing values, so the saved original line is printed.
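To make the side effect on NF visible, here is a minimal sketch that prints the first field and the field count left after the substitution, instead of filtering. The function name show_remaining_fields and the variable sample are made up for this illustration; the data is the sample from the question.

```shell
# Sample input from the question.
sample='CHROM   108 139 159 265 350 351
SNP1    -1  -1  -1  -1  -1  -1
SNP2    2   2   2   2   2   2
SNP3    0   0   0   -1  -1  -1
SNP4    1   1   1   1   1   1
SNP5    0   0   0   0   0   0'

# Hypothetical helper (not part of the answer): reads lines on stdin and
# prints the row identifier followed by NF as it stands after the gsub.
show_remaining_fields() {
    awk '{
        id = $1              # remember the identifier before $0 changes
        gsub(FS $2, FS)      # drop every " <second field>" occurrence
        print id, NF         # NF was recomputed because $0 was modified
    }'
}

printf '%s\n' "$sample" | show_remaining_fields
```

With this sample, only CHROM (6 fields left) and SNP3 (4 fields left) keep more than one field, which is why those are the two lines the one-liner prints.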

edited Oct 9, 2018 at 15:15

answered Oct 9, 2018 at 14:54

kvantour


10 Comments

RavinderSingh13 Over a year ago

This will fail if we have a line such as SNP1 -1 -11 -1 -1 -1 -1; it will print this line too.

kvantour Over a year ago

@RavinderSingh13 as it should. -11 is different from -1

RavinderSingh13 Over a year ago

I meant that this line will be printed in the output even though it shouldn't be.

Kent Over a year ago

@kvantour You could use the number with the maximum length as the pattern, but that will be slower than comparing the value in each column.

Waqas Khokhar Over a year ago

@RavinderSingh13 yes i have checked all the solutions and they all are working perfectly fine. I highly appreciate efforts of all posters :-)


oliv · Accepted Answer · 2018-10-09 15:03:00Z

1

You can try this:

awk 'NR==1;NR>1{for(i=2;i<NF;i++)if($(i+1)!=$i) {print;next}}' file

It prints the header line.
For every other line, it loops over the fields until a difference with the next field is found, then prints the line and moves on to the next line.
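The same logic, spread over several lines with comments, might look like the sketch below. The function name keep_varying_rows and the variable sample_rows are chosen for this illustration only; the data is the sample from the question.

```shell
# Sample input from the question.
sample_rows='CHROM   108 139 159 265 350 351
SNP1    -1  -1  -1  -1  -1  -1
SNP2    2   2   2   2   2   2
SNP3    0   0   0   -1  -1  -1
SNP4    1   1   1   1   1   1
SNP5    0   0   0   0   0   0'

# Hypothetical wrapper (name is an assumption): reads lines on stdin.
keep_varying_rows() {
    awk '
        NR == 1 { print; next }        # always keep the header line
        {
            # Compare each data field with its right-hand neighbour.
            for (i = 2; i < NF; i++) {
                if ($(i + 1) != $i) {  # first difference found
                    print              # keep the row
                    next               # go straight to the next input line
                }
            }
            # No difference found: the row is uniform and is not printed.
        }
    '
}

printf '%s\n' "$sample_rows" | keep_varying_rows
```

For the sample input this prints the header and the SNP3 row unchanged.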

answered Oct 9, 2018 at 15:03

oliv


RavinderSingh13 · Accepted Answer · 2018-10-09 15:22:11Z

1

Could you please try the following:

awk '{val=$2;count=1;for(i=3;i<=NF;i++){if(val==$i){count++}};if(count!=(NF-1)){print}}'  Input_file

edited Oct 9, 2018 at 15:22

answered Oct 9, 2018 at 15:02

RavinderSingh13


2 Comments

Barmar Over a year ago

Why not just set val = $2 and then loop for i starting from 3? There's also no need for a count -- as soon as you find a column not equal to val, print the line and break out of the loop.

RavinderSingh13 Over a year ago

@Barmar, sure sir done that now, thank you for making me aware.
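Barmar's suggestion (take $2 as the reference, loop from column 3, print and stop at the first mismatch) could be sketched roughly as follows. This is an illustration, not the code from the answer; the function name keep_rows_early_exit and the small made-up input, which includes the -1 / -11 case raised under the first answer, are assumptions.

```shell
# Made-up rows for illustration; SNP2 mixes -1 and -11.
near_match_rows='CHROM 108 139 159
SNP1 -1 -1 -1
SNP2 -1 -11 -1
SNP3 0 0 -1'

# Hypothetical wrapper (name is an assumption): reads lines on stdin.
keep_rows_early_exit() {
    awk '{
        val = $2                     # reference value: the first data column
        for (i = 3; i <= NF; i++) {
            if ($i != val) {         # a column differs from the reference
                print                # keep the row
                break                # no need to look at the other columns
            }
        }
    }'
}

printf '%s\n' "$near_match_rows" | keep_rows_early_exit
```

Because whole fields are compared, SNP2 is kept (-11 differs from -1), along with the header and SNP3; only SNP1 is dropped.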

stack0114106 · Accepted Answer · 2018-10-10 05:45:45Z

0

Portable Perl solution:

$ cat all_row
CHROM   108 139 159 265 350 351
SNP1    -1  -1  -1  -1  -1  -1
SNP2    2   2   2   2   2   2
SNP3    0   0   0   -1  -1  -1
SNP4    1   1   1   1   1   1
SNP5    0   0   0   0   0   0

$ perl -lane 'print if grep { $_ ne $F[1] } @F[2 .. $#F]' all_row
CHROM   108 139 159 265 350 351
SNP3    0   0   0   -1  -1  -1

$

If instead you want to keep only the rows that have the same value in all columns:

$ perl -lane 'print unless grep { $_ ne $F[1] } @F[2 .. $#F]' all_row
SNP1    -1  -1  -1  -1  -1  -1
SNP2    2   2   2   2   2   2
SNP4    1   1   1   1   1   1
SNP5    0   0   0   0   0   0

answered Oct 10, 2018 at 5:45

stack0114106
