I have a quite large text file with genetic data (94,807,000 rows). I want to extract the rows in which specific patterns occur in a specific column. I tried using awk and grep in various ways but did not find a way to get the job done. The file is space-delimited and looks like this:
V1 V2 V3 V4 V5 V6
1: 10 179406 T . HPGM T,T,T,T
2: 10 179407 T . HPGM T,T,T,T
3: 10 179408 G . HPGM G,G,G,G
4: 10 179409 A . HPGM A,A,A,A
5: 10 179410 A . HPGM A,A,A,A
6: 10 179411 T . HPGM T,T,T,T
V5 and V6 can have more than the four entries shown here, and some rows look pretty weird, like:
V1 V2 V3 V4 V5 V6
1: 1 158154514 A . HPGO A,AAAA..204..TTTT,A,A
I want to keep the lines where the entries for H and P (the first two comma-delimited entries in V6) are each exactly one of A, C, T or G, i.e. each consists of a single one of those four characters. H and P do not have to be the same character, though. Multiple combinations can occur in V5, but all of them start with HP. I do not care whether, or how many, entries come after the first two, and every row has entries for H and P, so I do not have to deal with missing entries.
I found some answers that show how to search for multiple patterns using logical or || , some that show how to look in a specific field using $6 ~ '/A,.', and some that show how to look for exact matches using == "pattern". However, I did not find answers that combine these things and could not figure it out by myself. Help would be very much appreciated.
V5 looks like HPGM or HPGO; are those considered "Entries for H and P"? And how does one tell if those H and P entries are exactly either A, C, T or G? Like all values in V6 can't contain anything other than A, C, T or G values? I think the logic makes perfect sense to you, but it's not explained in a way that someone who is unfamiliar with this genetic data file will understand.
AAAA..204..TTTT - these rows should be excluded. In V5 multiple combinations can occur, but all start with HP. In the end, all rows should look somewhat like the 6 rows above. I just want to exclude lines that have funny stuff like, for example, the AAAA..204..TTTT. More generally, the first two comma-delimited positions in V6 should each have exactly one character that is either A, C, T or G.
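The rule as stated (the first two comma-delimited positions in V6 are each a single A, C, G or T, and anything may follow) could be sketched as one awk regular expression on that field. This is only a sketch under a few assumptions: V6 is taken to be the last field of each line (so $NF is used instead of $6, which also works if the row labels such as "1:" are really in the file), the function name filter_hp is made up for illustration, and the header line is dropped because "V6" itself does not match.

```shell
# Hypothetical helper: keep only rows whose last field (assumed to be V6)
# starts with two single-letter entries, each one of A, C, G or T.
# Anything after the second entry is ignored.
filter_hp() {
  # ^[ACGT],[ACGT]  first two entries are exactly one allowed letter each
  # (,.*)?$         then either the field ends or further entries follow
  awk '$NF ~ /^[ACGT],[ACGT](,.*)?$/' "$@"
}
```

It would be called as filter_hp input.txt > filtered.txt, where both file names are placeholders. awk reads the file line by line, so the size of the file should not be a memory problem. To keep the header line as well, the condition could be extended to NR == 1 || $NF ~ /.../.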