How to read this file using Python?

Question

I have a DNA file in the following format:

>gi|5524211|gb|AAD44166.1| cytochrome
ACCAGAGCGGCACAGCAGCGACATCAGCACTAGCACTAGCATCAGCATCAGCATCAGC
CTACATCATCACAGCAGCATCAGCATCGACATCAGCATCAGCATCAGCATCGACGACT
ACACCCCCCCCGGTGTGTGTGGGGGGTTAAAAATGATGAGTGATGAGTGAGTTGTGTG
CTACATCATCACAGCAGCATCAGCATCGACATCAGCATCAGCATCAGCATCGACGACT
TTCTATCATCATTCGGCGGGGGGATATATTATAGCGCGCGATTATTGCGCAGTCTACG
TCATCGACTACGATCAGCATCAGCATCAGCATCAGCATCGACTAGCATCAGCTACGAC

How do I read this file and extract the DNA sequence part (ACCAGAGCGG...) without any newlines, for example:

ACCAGAGCGGCACAGCAGCGACATCAGCACTAGCACTAGCATCAGCATCAGCATCAGCCTACATCATCACAGCAGCATCA

Maybe regex isn't needed?

You're asking a lot of questions about Python on this DNA project.
– Jed Smith, Commented Nov 15, 2009 at 19:35
@jed - but at least answers are being marked as accepted (and hopefully upvoted).
– Kev, Commented Nov 15, 2009 at 19:43
While I love Python, if you want speed for this type of calculation you should be using ncbi.nlm.nih.gov/staff/tao/URLAPI/blastall/…; while it may be a little slow to pick up, it will surely be better than reinventing the wheel. Here is a tutorial (bips.u-strasbg.fr/fr/Tutorials/Comparison/Blast/…) that looks good.
– snarkyname77, Commented Nov 16, 2009 at 13:28

Pierre Bourdon · Accepted Answer · 2009-11-15 19:34:42Z

If there is always exactly one header line:

dnalines = text.split('\n')[1:]
dna = ''.join(dnalines)

Here text is the contents of your file, for example text = open('yourfile').read().
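A minimal sketch of the same idea applied directly to a file, assuming the file holds a single record with exactly one header line. The function name read_sequence is hypothetical, not something from the answer. It opens the file in a with block so the file gets closed, and strips each line so that a trailing newline or a Windows line ending does not end up in the result.

```python
def read_sequence(path):
    """Return the sequence lines of a one-record file joined into one string."""
    with open(path) as handle:
        # Skip the single header line (the one starting with '>').
        next(handle, None)
        # strip() drops the newline (and any '\r') at the end of each line.
        return ''.join(line.strip() for line in handle)
```

Calling read_sequence('yourfile') should then give the sequence as one string with no newlines.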

Community · Accepted Answer · 2017-05-23 11:47:46Z

I did some tests, and it appears that the following is more efficient than delroth's answer (the split-and-join approach):

text.split('\n', 1)[1].replace('\n', '')
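To illustrate what that one-liner does, here is a small sketch on a made-up string (the header text and the short sequence are placeholders, not data from the question). Passing 1 as the second argument to split cuts only at the first newline, so the header is separated from everything else without building a list of every line.

```python
text = ">header line\nACCA\nGGTT\n"

# maxsplit=1 cuts only at the first newline: [header, everything else]
header, body = text.split('\n', 1)

# Remove the remaining newlines from the body in one pass.
dna = body.replace('\n', '')

print(header)  # >header line
print(dna)     # ACCAGGTT
```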

Edit: wait, it's not so simple. I timed both methods, twice, using Python 2.6.4 and 3.1.1, on an ~30MB file:

Python 2.6.4, my version:

$ python -m timeit -c "open('x').read().split('\n', 1)[1].replace('\n', '')"
10 loops, best of 3: 221 msec per loop
$ python -m timeit -c "open('x').read().split('\n', 1)[1].replace('\n', '')"
10 loops, best of 3: 219 msec per loop

Python 2.6.4, delroth's version:

$ python -m timeit -c "''.join(open('x').read().split('\n')[1:])"
10 loops, best of 3: 392 msec per loop
$ python -m timeit -c "''.join(open('x').read().split('\n')[1:])"
10 loops, best of 3: 390 msec per loop

Python 3.1.1, my version:

$ python3 -m timeit -c "open('x').read().split('\n', 1)[1].replace('\n', '')"
10 loops, best of 3: 803 msec per loop
$ python3 -m timeit -c "open('x').read().split('\n', 1)[1].replace('\n', '')"
10 loops, best of 3: 798 msec per loop

Python 3.1.1, delroth's version:

$ python3 -m timeit -c "''.join(open('x').read().split('\n')[1:])"
10 loops, best of 3: 610 msec per loop
$ python3 -m timeit -c "''.join(open('x').read().split('\n')[1:])"
10 loops, best of 3: 610 msec per loop

Conclusion: in these runs Python 3.1.1 was much slower than 2.6.4 for both snippets, and which snippet is faster depends on the Python version: the replace() version wins on 2.6.4, the join() version on 3.1.1.
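For anyone who wants to repeat the comparison on their own interpreter, a rough sketch using the timeit module from a script might look like this. It is not the setup used above: it assumes a synthetic in-memory string instead of the ~30 MB file, so it measures only the string operations and not the file read, and the names with_replace and with_join are made up for the example.

```python
import timeit

# Synthetic input: one header line followed by many 60-character lines.
# The size is an arbitrary assumption, far smaller than the ~30 MB file above.
text = ">header\n" + ("ACGT" * 15 + "\n") * 20000


def with_replace(s):
    return s.split('\n', 1)[1].replace('\n', '')


def with_join(s):
    return ''.join(s.split('\n')[1:])


# Both approaches should produce the same sequence.
assert with_replace(text) == with_join(text)

for func in (with_replace, with_join):
    # Best of 3 repeats, 10 calls each, mirroring "10 loops, best of 3".
    best = min(timeit.repeat(lambda: func(text), number=10, repeat=3))
    print("%s: %.1f msec per loop" % (func.__name__, best / 10 * 1000))
```

The absolute numbers will differ from those quoted above, since they depend on the machine, the Python version, and the input size.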
