Extract lines with multiple patterns occurring in one column using awk

Question

I have a quite large text file with genetic data (94,807,000 rows). I want to extract the rows in which specific patterns occur in a specific column. I tried using awk and grep in various ways but did not find a way to get the job done. The file is space-delimited and looks like this:

   V1     V2 V3 V4   V5      V6
1: 10 179406  T  . HPGM T,T,T,T
2: 10 179407  T  . HPGM T,T,T,T
3: 10 179408  G  . HPGM G,G,G,G
4: 10 179409  A  . HPGM A,A,A,A
5: 10 179410  A  . HPGM A,A,A,A
6: 10 179411  T  . HPGM T,T,T,T

V5 and V6 can have more than the four entries shown here, and the entries can look pretty weird, like:

   V1        V2 V3 V4   V5                    V6
1:  1 158154514  A  . HPGO A,AAAA..204..TTTT,A,A

I want to keep the lines where the entries for H and P (the first two comma-delimited values in V6) are each exactly one character, and that character is A, C, T or G. H and P do not have to have the same character, though. In V5 multiple combinations can occur, but all start with HP. I do not care whether, or how many, entries come afterwards, and all rows have entries for H and P, so I do not have to deal with missing entries.

I found some answers that show how to search for multiple patterns using logical or ||, some that show how to look in a specific field using $6 ~ '/A,.' and some that show how to look for exact matches using == "pattern". However, I did not find answers that combine these things and could not figure it out by myself. Help would be very much appreciated.

What do you consider an "entry for H and P"? V5 looks like HPGM or HPGO; are those considered "entries for H and P"? And how does one tell if those H and P entries are exactly either A, C, T or G? Do you mean that no value in V6 can contain anything other than A, C, T or G? I think the logic makes perfect sense to you, but it's not explained in a way that someone unfamiliar with this genetic data file will understand.
– JNevill, Commented Jun 2, 2016 at 14:20
Sorry for being unclear. Entries for H and P are the first two comma-delimited values in column V6. In the first example (6 rows) everything looks good. However, in the second example (1 row) the entry for P is AAAA..204..TTTT; such rows should be excluded. In V5 multiple combinations can occur, but all start with HP. In the end, all rows should look somewhat like the 6 rows above. I just want to exclude lines that have funny stuff like, for example, the AAAA..204..TTTT. More generally, each of the first two comma-delimited positions in V6 should hold exactly one character that is either A, C, T or G.
– AlexDeLarge, Commented Jun 2, 2016 at 14:21

anubhava · Accepted Answer · 2016-06-02 14:51:01Z

You can use this awk command:

awk 'split($NF, a, /,/) && a[1] a[2] ~ /^[ACTG]{2}$/' file

1: 10 179406  T  . HPGM T,T,T,T
2: 10 179407  T  . HPGM T,T,T,T
3: 10 179408  G  . HPGM G,G,G,G
4: 10 179409  A  . HPGM A,A,A,A
5: 10 179410  A  . HPGM A,A,A,A
6: 10 179411  T  . HPGM T,T,T,T

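split($NF, a, /,/) splits the last column on commas into the array a and returns the number of pieces, so it is always true here.
a[1] a[2] ~ /^[ACTG]{2}$/ concatenates the first and second sub-fields and uses a regex to check that the result is exactly two characters, each of them A, C, T or G.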
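
A stricter variant can test the two sub-fields separately instead of concatenating them. This is only a sketch, and the function name filter_hp is made up for illustration. Unlike the concatenated test, it would also reject a row whose first entry is two letters and whose second entry is empty (the question says such rows do not occur), and it avoids the {2} interval expression, which some older awk implementations do not support.

```shell
# filter_hp is a hypothetical helper name, not part of the original answer.
# It reads the files given as arguments, or standard input if there are none.
filter_hp() {
  awk '{
    split($NF, a, ",")   # a[1] is the H entry, a[2] is the P entry
    # keep the row only if each entry is a single A, C, G or T
    if (a[1] ~ /^[ACGT]$/ && a[2] ~ /^[ACGT]$/) print
  }' "$@"
}

# Three rows copied from the question: two clean ones and the odd one.
filter_hp <<'EOF'
1: 10 179406  T  . HPGM T,T,T,T
3: 10 179408  G  . HPGM G,G,G,G
1:  1 158154514  A  . HPGO A,AAAA..204..TTTT,A,A
EOF
```

With these sample rows, only the first two are printed. Because the function passes its arguments on to awk, filter_hp file works on the real data as well.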
