I have a fairly large text file with genetic data (94,807,000 rows). I want to extract the rows in which specific patterns occur in a specific column. I tried awk and grep in various ways but could not get the job done. The file is space-delimited and looks like this:

   V1     V2 V3 V4   V5      V6
1: 10 179406  T  . HPGM T,T,T,T
2: 10 179407  T  . HPGM T,T,T,T
3: 10 179408  G  . HPGM G,G,G,G
4: 10 179409  A  . HPGM A,A,A,A
5: 10 179410  A  . HPGM A,A,A,A
6: 10 179411  T  . HPGM T,T,T,T

V5 and V6 can have more than the four entries shown here, and the entries can look pretty weird, like:

   V1        V2 V3 V4   V5                    V6
1:  1 158154514  A  . HPGO A,AAAA..204..TTTT,A,A

I want to keep only the lines where the entries for H and P (the first two comma-delimited entries in V6) are each exactly one character: A, C, T or G. H and P do not have to be the same character, though. Several combinations can occur in V5, but all of them start with HP. I do not care whether or how many entries follow, and every row has entries for H and P, so I do not have to deal with missing entries.

I found some answers that show how to search for multiple patterns using logical or ||, some that show how to look in a specific field using $6 ~ '/A,.', and some that show how to look for exact matches using == "pattern". However, I did not find answers that combine these things and could not figure it out by myself. Help would be very much appreciated.

  • What do you consider an "entry for H and P"? V5 looks like HPGM or HPGO; are those the "entries for H and P"? And how does one tell whether those H and P entries are exactly A, C, T or G? Do you mean that no value in V6 may contain anything other than A, C, T or G? I think the logic makes perfect sense to you, but it's not explained in a way that someone unfamiliar with this genetic data file will understand. Commented Jun 2, 2016 at 14:20
  • Sorry for being unclear. The entries for H and P are the first two comma-delimited entries in column V6. In the first example (6 rows) everything looks good. In the second example (1 row), however, the entry for P is AAAA..204..TTTT; such rows should be excluded. In V5 multiple combinations can occur, but all start with HP. In the end, all rows should look somewhat like the 6 rows above. I just want to exclude lines that have funny stuff like, for example, the AAAA..204..TTTT. More generally, each of the first two comma-delimited positions in V6 should hold exactly one character that is either A, C, T or G. Commented Jun 2, 2016 at 14:21

1 Answer

You can use this awk command:

awk 'split($NF, a, /,/) && a[1] a[2] ~ /^[ACTG]{2}$/' file

1: 10 179406  T  . HPGM T,T,T,T
2: 10 179407  T  . HPGM T,T,T,T
3: 10 179408  G  . HPGM G,G,G,G
4: 10 179409  A  . HPGM A,A,A,A
5: 10 179410  A  . HPGM A,A,A,A
6: 10 179411  T  . HPGM T,T,T,T
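  • split($NF, a, /,/) splits the last column on commas into the array a; it returns the number of pieces, so it is true for every non-empty line
  • a[1] a[2] ~ /^[ACTG]{2}$/ concatenates the first and second sub-fields and uses a regex to check that the result is exactly two characters, each one of A, C, T or G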
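
As a variation, here is a minimal sketch that matches directly on the last field instead of splitting it. It anchors on the comma between the first two entries, so it does not depend on both entries being non-empty, and it avoids the {2} interval expression, which some older awk builds (for example gawk before 4.0 unless run with --re-interval) do not enable by default. The sample file is an assumption made up for illustration: rows 1 and 2 follow the question, while row 3 (different bases for H and P) and row 4 (V5 being just HP) are hypothetical edge cases.

```shell
# Hypothetical sample data; only rows 1 and 2 mirror the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
   V1        V2 V3 V4   V5                    V6
1: 10    179406  T  . HPGM               T,T,T,T
2:  1 158154514  A  . HPGO A,AAAA..204..TTTT,A,A
3: 10    179408  G  . HPGM               G,C,G,G
4: 10    179409  A  . HP                     A,T
EOF

# Keep lines whose last field starts with two single-base entries.
# ^[ACGT],[ACGT] matches the H and P entries; (,|$) requires the
# second entry to end right there (next comma or end of the field).
result=$(awk '$NF ~ /^[ACGT],[ACGT](,|$)/' "$sample")
printf '%s\n' "$result"

rm -f "$sample"
```

On this sample the header line and row 2 are dropped, because V6 and A,AAAA..204..TTTT,A,A do not start with two single-base entries; rows 1, 3 and 4 are kept. On the real file, redirect the output to a new file rather than capturing it in a variable.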