Using Python to distinguish between lines with one dot and lines with two dots

Question

I have a large file which I want to format in a certain manner. File input example:

DVL1    03220   NP_004412.2 VANGL2  02758   Q9ULK5  in vitro    12490194
PAX3    09421   NP_852124.1 MEOX2   02760   NP_005915.2 in vitro;yeast 2-hybrid 11423130
VANGL2  02758   Q9ULK5  MAGI3   11290   NP_001136254.1  in vitro;in vivo    15195140

And this is how I want it to become:

DVL1    03220   NP_004412   VANGL2  02758   Q9ULK5
PAX3    09421   NP_852124   MEOX2   02760   NP_005915
VANGL2  02758   Q9ULK5  MAGI3   11290   NP_001136254

To summarize:

if the line has 1 dot, that dot is deleted along with the number after it and a \t is added, so the output line will only have 6 tab-separated values
if the line has 2 dots, those dots are deleted along with the numbers after them and a \t is added, so the output line will only have 6 tab-separated values
if the line has no dots, maintain the first 6 tab-separated values

My idea is currently something like this:

for line in infile:
    if "." in line: # thought about this and a line.count('.') might be better, just wasn't capable to make it work
        transformed_line = line.replace('.', '\t', 2) # only replaces the dot; want to replace dot plus next first character
        columns = transformed_line.split('\t')
        outfile.write('\t'.join(columns[:8]) + '\n') # if i had a way to know the position of the dot(s), i could join only the desired columns
    else:
        columns = line.split('\t')
        outfile.write('\t'.join(columns[:5]) + '\n') # this is fine

I hope I explained myself clearly. Thanks for your effort.

This can easily be done with sed. I guess you want Python because it's part of a bigger program (?)
– c00kiemon5ter, Commented Jul 13, 2012 at 16:31

mgilson · Accepted Answer · 2012-07-13 16:58:40Z

3

import re
with open(filename,'r') as f:
    newlines=(re.sub(r'\.\d+','',old_line) for old_line in f)
    newlines=['\t'.join(line.split()[:6]) for line in newlines]

Now you have a list of lines with the '.number' portions removed. As far as I can tell, your problem isn't well enough constrained to make this whole thing work in 1 pass with regex, but it'll work with 2.
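As a minimal sketch of how the two steps above could be wrapped up and written back to a file: the names `clean_lines` and `convert`, and any paths passed to `convert`, are hypothetical and not part of the original answer.

```python
import re

# Matches a dot followed by one or more digits, e.g. the ".2" in "NP_004412.2".
VERSION_SUFFIX = re.compile(r'\.\d+')

def clean_lines(lines):
    """Yield each line with '.<digits>' removed and only the first six columns kept."""
    for old_line in lines:
        stripped = VERSION_SUFFIX.sub('', old_line)
        # split() with no argument splits on any run of whitespace,
        # so this assumes the first six columns contain no spaces.
        yield '\t'.join(stripped.split()[:6])

def convert(in_path, out_path):
    """Hypothetical helper: read in_path and write the cleaned lines to out_path."""
    with open(in_path) as infile, open(out_path, 'w') as outfile:
        for line in clean_lines(infile):
            outfile.write(line + '\n')

# Quick check on the first sample line from the question.
sample = ["DVL1\t03220\tNP_004412.2\tVANGL2\t02758\tQ9ULK5\tin vitro\t12490194\n"]
print(list(clean_lines(sample)))
```

Note that the regex removes a dot plus digits anywhere in the line, so a value such as 1.5 in one of the first six columns would also be shortened. For the sample data in the question this does not matter.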

edited Jul 13, 2012 at 16:58

answered Jul 13, 2012 at 16:34

mgilson


4 Comments

Joran Beasley Over a year ago

It's a regex that replaces a "." followed by one or more digits with nothing.

mgilson Over a year ago

It doesn't quite give the desired output (yet). Still working on it... (I didn't realize that everything after the second dot should be truncated).

Edward Coelho Over a year ago

I got the idea, but I can't seem to find the correct spot to add the line to. Should I just do: "import re for line in infile: new_line=re.sub(r'\.\d+','',old_line)" ?

Edward Coelho Over a year ago

Wow, that's pretty well thought. Thanks, would never realize it by myself!

Ashwini Chaudhary · Accepted Answer · 2012-07-13 17:06:26Z

2

you can try something like this:

    with open('data1.txt') as f:
        for line in f:
            line=line.split()[:6]
            line=map(lambda x:x[:x.index('.')] if '.' in x else x,line)  #if an element has '.' then keep only the part
                                                                         #before the first dot, else keep the element as it is
            print('\t'.join(line))

output:

DVL1    03220   NP_004412   VANGL2  02758   Q9ULK5
PAX3    09421   NP_852124   MEOX2   02760   NP_005915
VANGL2  02758   Q9ULK5  MAGI3   11290   NP_001136254

Edit:

as @mgilson suggested the line line=map(lambda x:x[:x.index('.')] if '.' in x else x,line) can be replaced by simply line=map(lambda x:x.split('.')[0],line)
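A small sketch to check that the two forms agree on the sample data. The function names `truncate_with_index` and `truncate_with_split` are made up for illustration. In Python 3, `map` returns an iterator, which `'\t'.join` accepts directly, so either form works with the loop above.

```python
def truncate_with_index(field):
    # Original form: cut at the first dot if there is one.
    return field[:field.index('.')] if '.' in field else field

def truncate_with_split(field):
    # Suggested form: split('.') always returns at least one piece,
    # so [0] is the whole field when no dot is present.
    return field.split('.')[0]

fields = "PAX3\t09421\tNP_852124.1\tMEOX2\t02760\tNP_005915.2".split()

# Both forms should give the same result for every field.
assert [truncate_with_index(f) for f in fields] == [truncate_with_split(f) for f in fields]
print('\t'.join(truncate_with_split(f) for f in fields))
```

Unlike the regex `\.\d+`, both forms cut at the first dot whatever follows it, digits or not.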

edited Jul 13, 2012 at 17:06

answered Jul 13, 2012 at 16:35

Ashwini Chaudhary


6 Comments

Edward Coelho Over a year ago

Can you explain step by step what you did? I'm not that good at programming.

Edward Coelho Over a year ago

Thanks for the comments. But I'm guessing, why after 'in'?

Ashwini Chaudhary Over a year ago

I edited my solution, I removed those in related lines. Just use line.split()[0:6] to fetch the first 6 columns.

Edward Coelho Over a year ago

It is amazing man, just had to add a + '\n' after print('\t'.join(line), cause the output was just one big line. Thanks a lot!

mgilson Over a year ago

In your lambda, why not just use x.split('.')[0]?


Robbie Rosati · Accepted Answer · 2012-07-13 18:18:13Z

1

I figured somebody should do this with a single regex, so...

import re
beast_regex = re.compile(r'(\S+)\s+(\S+)\s+(\S+?)(?:\.\d+)?\s+(\S+)\s+(\S+)\s+(\S+?)(?:\.\d+)?\s+in.*')
with open('data.txt') as infile:
    for line in infile:
        match = beast_regex.match(line)
        if match:  # skip lines that do not fit the expected layout
            print('\t'.join(match.groups()))

edited Jul 13, 2012 at 18:18

answered Jul 13, 2012 at 16:39

Robbie Rosati


1 Comment

mgilson Over a year ago

(+1) -- although, this is pretty sensitive to the position of the '.'. e.g. (if I'm reading it correctly), you couldn't have 'foo.1' in the first column.
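A quick way to check that observation, as a sketch: the regex is copied unchanged from the answer, while the helper `first_six` and the probe value `foo.1` are made up for the test.

```python
import re

# The single regex from the answer above, unchanged.
beast_regex = re.compile(r'(\S+)\s+(\S+)\s+(\S+?)(?:\.\d+)?\s+(\S+)\s+(\S+)\s+(\S+?)(?:\.\d+)?\s+in.*')

def first_six(line):
    """Return the six captured columns, or None when the line does not match."""
    match = beast_regex.match(line)
    return list(match.groups()) if match else None

# 'foo.1' is a made-up value used only to probe the first column.
probe = "foo.1\t03220\tNP_004412.2\tVANGL2\t02758\tQ9ULK5\tin vitro\t12490194\n"
print(first_six(probe))
```

The first column comes back as `foo.1` with its suffix intact, because only the third and sixth groups are followed by the optional `(?:\.\d+)?`. The pattern also requires the seventh column to begin with `in`, so a line without that does not match at all.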

larissa · Accepted Answer · 2012-07-13 16:50:05Z

0

you can do this with a simple regex:

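import re
# infile and outfile are the already-open file objects from the question
for line in infile:
    # drop each ".<digits>" suffix rather than turning it into an extra, empty column
    line=re.sub(r'\.\d+','',line)
    columns = line.rstrip('\n').split('\t')
    outfile.write('\t'.join(columns[:6]) + '\n')

This removes any "." followed by one or more digits, then keeps the first six tab-separated columns.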

edited Jul 13, 2012 at 16:50

answered Jul 13, 2012 at 16:43

larissa
