I am new to Python. I am trying to parse a file, extract certain columns, and write them to an output file. I can parse the file and extract the columns I want, but I am having trouble writing them to the output file.

Here is the original test file:

EGW05759        Pld5    I79_005987      GO_function: GO:0003824 - catalytic activity [Evidence IEA]; GO_process: GO:0008152 - metabolic process [Evidence IEA]                                  
EGW05760        Exo1    I79_005988      GO_function: GO:0003677 - DNA binding [Evidence IEA]; GO_function: GO:0003824 - catalytic activity [Evidence IEA]; GO_function: GO:0004518 - nuclease activity [Evidence IEA]; GO_process: GO:0006281 - DNA repair [Evidence IEA] 

Here is my Python code:

f = open('test_parsing.txt', 'rU')
f1 = open('test_parsing_out.txt', 'a')
for line in f:
   match = re.search('\w+\s+(\w+)\s+\w+\s+\w+\:', line)
   match1 = re.findall('GO:\d+', line)
   f1.write(match.group(1), match1)
f1.close()

Basically, I want the output to look like this (though I know my code is not yet complete enough to achieve it):

Pld5 GO:0003824:GO:0008152
Exo1 GO:0003677:GO:0003824:GO:0004518:GO:0006281

Thanks

Upendra

Looks like you have a TSV file. Look into Python's csv module to parse it more accurately. Commented Sep 14, 2014 at 20:13

2 Answers

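import re

f = open('test_parsing.txt', 'r')  # plain 'r': the 'U' flag was removed in Python 3.11
f1 = open('test_parsing_out.txt', 'a')
for line in f:
    # group(1) captures the second whitespace-separated column (the gene name)
    match = re.search(r'\w+\s+(\w+)\s+\w+\s+\w+:', line)
    # every GO identifier on the line, in order of appearance
    match1 = re.findall(r'GO:\d+', line)
    # write() takes one string; ''.join adds no separator, so use ':'.join(match1)
    # to get the colon-separated form shown in the question
    f1.write('%s %s \n' % (match.group(1), ''.join(match1)))
f.close()
f1.close()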
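
For context, the original `f1.write(match.group(1), match1)` fails because a file's `write` method takes exactly one string argument, so the name and the list of IDs have to be combined into a single string first. Here is a minimal sketch of that idea. It uses an in-memory `io.StringIO` buffer in place of the real output file, and `format_line` is a made-up helper name, so treat both as assumptions for illustration rather than part of the original code.

```python
import io
import re

def format_line(line, sep=":"):
    """Return 'NAME ID<sep>ID...' for one input line, or None if it does not match."""
    match = re.search(r'\w+\s+(\w+)\s+\w+\s+\w+:', line)
    if match is None:
        return None  # blank or malformed line: nothing to extract
    go_ids = re.findall(r'GO:\d+', line)
    return '%s %s' % (match.group(1), sep.join(go_ids))

sample = ("EGW05759        Pld5    I79_005987      "
          "GO_function: GO:0003824 - catalytic activity [Evidence IEA]; "
          "GO_process: GO:0008152 - metabolic process [Evidence IEA]")

out = io.StringIO()  # stands in for the real output file
try:
    out.write("Pld5", ["GO:0003824"])  # two arguments: raises TypeError
except TypeError as exc:
    print("write failed:", exc)

out.write(format_line(sample) + "\n")  # one string: works
print(out.getvalue(), end="")
```

Returning None for lines that do not match also avoids an AttributeError on `match.group(1)` if the input contains blank lines.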

1 Comment

This is awesome. I am happy that my code was mostly on the right track. I made one slight edit to your code to suit the desired output: f1.write('%s %s \n'%(match.group(1), ','.join(match1)))

Using the csv module:

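import csv, re

# plain 'r': the 'U' flag was removed in Python 3.11
with open('test_parsing.txt', 'r') as infile, open('test_parsing_out.txt', 'a') as outfile:
    # the input must be genuinely tab-separated for line[1] and line[3] to exist
    reader = csv.reader(infile, delimiter="\t")
    for line in reader:
        # line[1] is the gene name; line[3] holds the GO annotations
        # GO IDs have 7 digits, so match \d+ rather than \d{6}, which truncates them
        result = line[1] + " " + ':'.join(re.findall(r"GO:\d+", line[3]))
        outfile.write(result + "\n")

# OUTPUT
Pld5 GO:0003824:GO:0008152
Exo1 GO:0003677:GO:0003824:GO:0004518:GO:0006281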

3 Comments

I think there is a problem with this code. I'm getting the error "IndexError: list index out of range". Can you please check?
When the OP copied and pasted the text into SO, the tabs were converted to spaces. Replace each run of spaces between columns with an actual tab and it works beautifully.
@upendra re.sub(r"\s{2,}", "\t", txt) :)
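
To expand on that last comment: if the file really does contain runs of spaces rather than tabs, one option is to normalize each line before handing it to `csv.reader`. This is a rough sketch. It assumes columns are always separated by two or more whitespace characters and that no field contains such a run itself. `normalize_tabs` and `extract` are hypothetical helper names, not anything from the answers above.

```python
import csv
import re

def normalize_tabs(text):
    """Replace every run of two or more whitespace characters with one tab."""
    return re.sub(r"\s{2,}", "\t", text)

def extract(lines):
    """Yield 'NAME ID:ID:...' for each row that has at least four columns."""
    # rstrip first so trailing padding does not become an extra empty column
    cleaned = (normalize_tabs(line.rstrip()) for line in lines)
    for row in csv.reader(cleaned, delimiter="\t"):
        if len(row) < 4:
            continue  # skip blank or malformed rows instead of raising IndexError
        yield row[1] + " " + ":".join(re.findall(r"GO:\d+", row[3]))

sample = [
    "EGW05759        Pld5    I79_005987      GO_function: GO:0003824 - catalytic activity [Evidence IEA]; "
    "GO_process: GO:0008152 - metabolic process [Evidence IEA]      ",
    "EGW05760        Exo1    I79_005988      GO_function: GO:0003677 - DNA binding [Evidence IEA]; "
    "GO_function: GO:0003824 - catalytic activity [Evidence IEA]; "
    "GO_function: GO:0004518 - nuclease activity [Evidence IEA]; "
    "GO_process: GO:0006281 - DNA repair [Evidence IEA] ",
]

for result in extract(sample):
    print(result)
```

With a real file, `extract(open('test_parsing.txt'))` should work the same way, since `extract` accepts any iterable of lines.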
