I have a large file which I want to format in a certain manner. File input example:
DVL1 03220 NP_004412.2 VANGL2 02758 Q9ULK5 in vitro 12490194
PAX3 09421 NP_852124.1 MEOX2 02760 NP_005915.2 in vitro;yeast 2-hybrid 11423130
VANGL2 02758 Q9ULK5 MAGI3 11290 NP_001136254.1 in vitro;in vivo 15195140
And this is how I want it to become:
DVL1 03220 NP_004412 VANGL2 02758 Q9ULK5
PAX3 09421 NP_852124 MEOX2 02760 NP_005915
VANGL2 02758 Q9ULK5 MAGI3 11290 NP_001136254
To summarize:
- if the line has 1 dot, that dot is deleted along with the number after it, and the output line keeps only the first 6 tab-separated values
- if the line has 2 dots, both dots are deleted along with the numbers after them, and the output line keeps only the first 6 tab-separated values
- if the line has no dots, keep the first 6 tab-separated values unchanged
My idea is currently something like this:
    for line in infile:
        if "." in line:  # line.count('.') might be better here, but I couldn't get it to work
            transformed_line = line.replace('.', '\t', 2)  # only replaces the dot; I want to remove the dot plus the number after it
            columns = transformed_line.split('\t')
            outfile.write('\t'.join(columns[:8]) + '\n')  # if I knew the position of the dot(s), I could join only the desired columns
        else:
            columns = line.split('\t')
            outfile.write('\t'.join(columns[:6]) + '\n')  # first 6 columns; this case is fine
I hope I explained myself OK. Thanks for your effort.
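One possible approach, as a rough sketch rather than a definitive solution: keep the first 6 tab-separated fields of every line and strip a trailing ".<digits>" suffix from each of them with a regular expression. That covers the 0, 1 and 2 dot cases without counting dots. It assumes the file really is tab-separated and that a dot only ever appears as a version suffix at the end of a field. The names `reformat_line`, `convert`, `VERSION_SUFFIX` and the file paths are made up for illustration.

```python
import re

# Matches a trailing ".<digits>" version suffix on a field, e.g. "NP_004412.2".
# Assumption: a dot in the first 6 fields only ever appears in this form.
VERSION_SUFFIX = re.compile(r"\.\d+$")


def reformat_line(line, keep=6):
    """Keep the first `keep` tab-separated fields and drop any version suffix."""
    fields = line.rstrip("\n").split("\t")[:keep]
    return "\t".join(VERSION_SUFFIX.sub("", field) for field in fields)


def convert(in_path, out_path):
    """Apply reformat_line to every line of in_path (hypothetical paths)."""
    with open(in_path) as infile, open(out_path, "w") as outfile:
        for line in infile:
            outfile.write(reformat_line(line) + "\n")


sample = "DVL1\t03220\tNP_004412.2\tVANGL2\t02758\tQ9ULK5\tin vitro\t12490194\n"
print(reformat_line(sample))
```

Because the file is read line by line, this should also work for a large input without loading it all into memory. Calling `convert("input.txt", "output.txt")` would process a whole file, with both paths being placeholders.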
sed. I guess you want python because it's part of a bigger program (?)