1

I have a large file which I want to format in a certain manner. File input example:

DVL1    03220   NP_004412.2 VANGL2  02758   Q9ULK5  in vitro    12490194
PAX3    09421   NP_852124.1 MEOX2   02760   NP_005915.2 in vitro;yeast 2-hybrid 11423130
VANGL2  02758   Q9ULK5  MAGI3   11290   NP_001136254.1  in vitro;in vivo    15195140

And this is how I want it to become:

DVL1    03220   NP_004412   VANGL2  02758   Q9ULK5
PAX3    09421   NP_852124   MEOX2   02760   NP_005915
VANGL2  02758   Q9ULK5  MAGI3   11290   NP_001136254

To summarize:

  • if the line has 1 dot, that dot is deleted along with the number after it and a \t is added, so the output line will only have 6 tab-separated values
  • if the line has 2 dots, those dots are deleted along with the numbers after them and a \t is added, so the output line will only have 6 tab-separated values
  • if the line has no dots, maintain the first 6 tab-separated values

My idea is currently something like this:

for line in infile:
    if "." in line:  # line.count('.') might be better, but I couldn't get it to work
        transformed_line = line.replace('.', '\t', 2)  # only replaces the dot; I want to replace the dot plus the character after it
        columns = transformed_line.split('\t')
        outfile.write('\t'.join(columns[:8]) + '\n')  # if I knew the position of the dot(s), I could join only the desired columns
    else:
        columns = line.split('\t')
        outfile.write('\t'.join(columns[:6]) + '\n')  # keeps the first 6 columns

I hope I explained myself well enough. Thanks for your effort.

2
  • This can easily be done with sed. I guess you want Python because it's part of a bigger program? Commented Jul 13, 2012 at 16:31
  • Yup, this is just part of a function. Commented Jul 13, 2012 at 16:37

4 Answers

3
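import re
with open(filename, 'r') as f:  # filename: path to the input file
    # Pass 1 (lazy generator): strip every ".<digits>" suffix
    newlines = (re.sub(r'\.\d+', '', old_line) for old_line in f)
    # Pass 2: keep the first six whitespace-separated fields, tab-joined
    newlines = ['\t'.join(line.split()[:6]) for line in newlines]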

Now newlines is a list of lines with the '.number' portions removed and only the first six columns kept. As far as I can tell, the problem isn't constrained well enough to do the whole thing in one regex pass, but it works in two.
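
If it helps to see where the two steps go inside a read/write loop, here is a minimal sketch. The function name convert is hypothetical, io.StringIO objects stand in for the real input and output files, and the sample lines are assumed to be tab-separated as described in the question.

```python
import io
import re

def convert(infile, outfile):
    """Copy infile to outfile, keeping six version-free columns per line."""
    for old_line in infile:
        # Step 1: drop every ".<digits>" run anywhere in the line.
        new_line = re.sub(r'\.\d+', '', old_line)
        # Step 2: split on whitespace and keep only the first six fields.
        outfile.write('\t'.join(new_line.split()[:6]) + '\n')

# In-memory stand-ins for the real files (an assumption for this sketch).
infile = io.StringIO(
    "DVL1\t03220\tNP_004412.2\tVANGL2\t02758\tQ9ULK5\tin vitro\t12490194\n"
    "VANGL2\t02758\tQ9ULK5\tMAGI3\t11290\tNP_001136254.1\tin vitro;in vivo\t15195140\n"
)
outfile = io.StringIO()
convert(infile, outfile)
print(outfile.getvalue(), end='')
```

With real files, the same function could be called with two objects returned by open(), one opened for reading and one for writing.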


4 Comments

It's a regex that replaces a "." followed by one or more digits with nothing.
It doesn't quite give the desired output (yet). Still working on it... (I didn't realize that everything after the second dot should be truncated).
I got the idea, but I can't seem to find the correct spot to add the line to. Should I just do: "import re for line in infile: new_line=re.sub(r'\.\d+','',old_line)" ?
Wow, that's really well thought out. Thanks, I would never have figured it out by myself!
2

you can try something like this:

    with open('data1.txt') as f:
        for line in f:
            line=line.split()[:6]
            line=map(lambda x:x[:x.index('.')] if '.' in x else x,line)  #if an element has a '.', keep only the part
                                                                         #before it; otherwise keep the element as it is
            print('\t'.join(line))

output:

DVL1    03220   NP_004412   VANGL2  02758   Q9ULK5
PAX3    09421   NP_852124   MEOX2   02760   NP_005915
VANGL2  02758   Q9ULK5  MAGI3   11290   NP_001136254

Edit:

As @mgilson suggested, the line line=map(lambda x:x[:x.index('.')] if '.' in x else x,line) can be replaced simply by line=map(lambda x:x.split('.')[0],line).
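
For reference, a sketch of the simplified version wrapped in a small function. The name first_six is hypothetical, and a generator expression is used in place of map and lambda. Note that split('.')[0] cuts each field at its first dot whether or not digits follow it, which is fine for the sample data but is an assumption about the rest of the file.

```python
def first_six(line):
    """Return the first six fields of line, each cut at its first dot."""
    fields = line.split()[:6]  # split on any whitespace, keep six fields
    return '\t'.join(field.split('.')[0] for field in fields)

sample = "PAX3\t09421\tNP_852124.1\tMEOX2\t02760\tNP_005915.2\tin vitro;yeast 2-hybrid\t11423130\n"
print(first_six(sample))
```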

6 Comments

Can you explain step by step what you did? I'm not that good at programming.
Thanks for the comments. But I'm guessing, why after 'in'?
I edited my solution, I removed those in related lines. Just use line.split()[0:6] to fetch the first 6 columns.
It is amazing man, just had to add a + '\n' after print('\t'.join(line), cause the output was just one big line. Thanks a lot!
In your lambda, why not just use x.split('.')[0]?
1

I figured somebody should do this with a single regex, so...

import re
beast_regex = re.compile(r'(\S+)\s+(\S+)\s+(\S+?)(?:\.\d+)?\s+(\S+)\s+(\S+)\s+(\S+?)(?:\.\d+)?\s+in.*')
with open('data.txt') as infile:
    for line in infile:
        match = beast_regex.match(line)
        if match:  # skip lines that do not fit the expected layout
            print('\t'.join(match.groups()))

1 Comment

(+1) -- although, this is pretty sensitive to the position of the '.'. e.g. (if I'm reading it correctly), you couldn't have 'foo.1' in the first column.
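
A small check of that point, as a sketch: the input line below is made up (the 'foo.1' value comes from the comment above, the rest from the question's first sample line). Only the third and sixth columns have the optional '.digits' group in the pattern, so a suffix in the first column is left in place.

```python
import re

beast_regex = re.compile(
    r'(\S+)\s+(\S+)\s+(\S+?)(?:\.\d+)?\s+(\S+)\s+(\S+)\s+(\S+?)(?:\.\d+)?\s+in.*'
)

# Hypothetical line with a versioned identifier in column 1.
line = "foo.1\t03220\tNP_004412.2\tVANGL2\t02758\tQ9ULK5\tin vitro\t12490194"
fields = beast_regex.match(line).groups()

print(fields[0])  # column 1 keeps its suffix
print(fields[2])  # column 3 has it stripped
```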
0

you can do this with a simple regex:

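import re
for line in infile:
    line = re.sub(r'\.\d+', '', line)  # delete each "." and the digits after it
    columns = line.rstrip('\n').split('\t')
    outfile.write('\t'.join(columns[:6]) + '\n')  # keep the first 6 columns

This deletes any "." followed by one or more digits, then keeps the first six tab-separated columns. Replacing the match with a tab instead would leave an empty extra column, because the original tab after the identifier is still there.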
