
I have a large space-delimited file with thousands of rows and columns. I would like to remove every line that has the same value in all columns except the first.

Input:

CHROM   108 139 159 265 350 351
SNP1    -1  -1  -1  -1  -1  -1
SNP2    2   2   2   2   2   2
SNP3    0   0   0   -1  -1  -1
SNP4    1   1   1   1   1   1
SNP5    0   0   0   0   0   0

Desired output:

CHROM   108 139 159 265 350 351
SNP3    0   0   0   -1  -1  -1

There is a similar question for pandas (Delete duplicate rows with the same value in all columns in pandas), and I found a partial solution that removes lines containing only zeros:

awk 'NR > 1{s=0; for (i=3;i<=NF;i++) s+=$i; if (s!=0)print}' input > outfile

but I want to do this for the values -1, 0, 1 and 2 in one go, keeping the header and using the first column as the identifier.

Any help will be highly appreciated.

4 Answers

2

I believe you can do something like this:

awk '{s=$0; gsub(FS $2,FS)} (NF > 1) {print s}' file

Which outputs:

CHROM   108 139 159 265 350 351
SNP3    0   0   0   -1  -1  -1

How does this work?

  1. {s=$0; gsub(FS $2,FS)}: This action has two parts:

    • Store the current line in the variable s.
    • In the current line $0, replace every occurrence of the second field together with its leading field separator (FS $2) with a bare field separator FS. As a side effect, modifying $0 makes awk re-split the record, so the field variables and the field count NF are recomputed. The leading FS anchors the match to the start of a field; without it, $2=x would also erase a field xx entirely.
  2. (NF > 1) {print s}: If more than one field is left, the line contained at least one value different from $2, so print the saved line s.
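To make the re-splitting step concrete, here is a small sketch that prints NF before and after the gsub for three rows of the sample input. The variable names sample and nf_trace are only illustrative, and the default FS (a single space) is assumed.

```shell
# Three rows taken from the sample input in the question.
sample='CHROM   108 139 159 265 350 351
SNP1    -1  -1  -1  -1  -1  -1
SNP3    0   0   0   -1  -1  -1'

# For each row, report the identifier, NF before the gsub and NF after it.
nf_trace=$(printf '%s\n' "$sample" | awk '{
    id = $1; before = NF
    gsub(FS $2, FS)          # changes $0, so awk re-splits the record
    print id, before, NF
}')
printf '%s\n' "$nf_trace"
```

Only the row whose NF drops to 1 (SNP1 here) is uniform; the other rows keep at least one extra field and would be printed by the one-liner.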


10 Comments

This will fail for a line such as SNP1 -1 -11 -1 -1 -1 -1; it will print that line too.
@RavinderSingh13 as it should. -11 is different from -1
I meant that this line will be printed in the output even though it shouldn't be.
@kvantour you could use the longest number as the pattern, but that would be slower than comparing the value in each column.
@RavinderSingh13 yes, I have checked all the solutions and they all work perfectly fine. I highly appreciate the efforts of all posters :-)
1

You can try this:

awk 'NR==1;NR>1{for(i=2;i<NF;i++)if($(i+1)!=$i) {print;next}}' file

It prints the header line.
For every other line, it loops over the fields until one differs from the next; it then prints the line and moves on to the next line.
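The same neighbour comparison can be spelled out over several lines, which may be easier to follow. This is only a sketch run against the sample input from the question; the variable names sample and filtered are illustrative.

```shell
# Sample input from the question.
sample='CHROM   108 139 159 265 350 351
SNP1    -1  -1  -1  -1  -1  -1
SNP2    2   2   2   2   2   2
SNP3    0   0   0   -1  -1  -1
SNP4    1   1   1   1   1   1
SNP5    0   0   0   0   0   0'

filtered=$(printf '%s\n' "$sample" | awk '
NR == 1 { print; next }           # always keep the header
{
    for (i = 2; i < NF; i++)      # compare each field with its right-hand neighbour
        if ($i != $(i + 1)) {     # first difference found
            print
            next                  # stop scanning this row
        }
}')
printf '%s\n' "$filtered"
```

Rows that reach the end of the loop without a difference are uniform and are silently dropped.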


1

Could you please try the following.

awk '{val=$2;count=1;for(i=3;i<=NF;i++){if(val==$i){count++}};if(count!=(NF-1)){print}}'  Input_file

2 Comments

Why not just set val = $2 and then loop for i starting from 3? There's also no need for a count -- as soon as you find a column not equal to val, print the line and break out of the loop.
@Barmar, sure sir done that now, thank you for making me aware.
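One way the suggestion above could look, as a sketch rather than the answerer's actual revision: remember $2, scan from the third field, and print at the first mismatch. The SNP6 row is a hypothetical addition modelled on the -1/-11 case raised in the comments on another answer; the variable names are illustrative.

```shell
# Sample rows from the question plus a hypothetical SNP6 row containing -11.
sample='CHROM   108 139 159 265 350 351
SNP1    -1  -1  -1  -1  -1  -1
SNP3    0   0   0   -1  -1  -1
SNP6    -1  -11 -1  -1  -1  -1'

kept=$(printf '%s\n' "$sample" | awk '{
    val = $2                       # reference value for this row
    for (i = 3; i <= NF; i++)
        if ($i != val) {           # first field that differs from $2
            print
            break                  # no need to look at the remaining fields
        }
}')
printf '%s\n' "$kept"
```

The header needs no special case here, because its column labels already differ from one another. Whole fields are compared, so -11 counts as different from -1 and the SNP6 row is kept.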
0

Portable Perl solution:

$ cat all_row
CHROM   108 139 159 265 350 351
SNP1    -1  -1  -1  -1  -1  -1
SNP2    2   2   2   2   2   2
SNP3    0   0   0   -1  -1  -1
SNP4    1   1   1   1   1   1
SNP5    0   0   0   0   0   0

$ perl -F"\s+" -ane 'print if grep { $_ ne $F[1] } @F[2 .. $#F]' all_row
CHROM   108 139 159 265 350 351
SNP3    0   0   0   -1  -1  -1

$

If you instead want to keep only the rows that have the same value in all columns:

$ perl -F"\s+" -ane 'print unless grep { $_ ne $F[1] } @F[2 .. $#F]' all_row
SNP1    -1  -1  -1  -1  -1  -1
SNP2    2   2   2   2   2   2
SNP4    1   1   1   1   1   1
SNP5    0   0   0   0   0   0
