I have a DNA file in the following format:

>gi|5524211|gb|AAD44166.1| cytochrome
ACCAGAGCGGCACAGCAGCGACATCAGCACTAGCACTAGCATCAGCATCAGCATCAGC
CTACATCATCACAGCAGCATCAGCATCGACATCAGCATCAGCATCAGCATCGACGACT
ACACCCCCCCCGGTGTGTGTGGGGGGTTAAAAATGATGAGTGATGAGTGAGTTGTGTG
CTACATCATCACAGCAGCATCAGCATCGACATCAGCATCAGCATCAGCATCGACGACT
TTCTATCATCATTCGGCGGGGGGATATATTATAGCGCGCGATTATTGCGCAGTCTACG
TCATCGACTACGATCAGCATCAGCATCAGCATCAGCATCGACTAGCATCAGCTACGAC

How do I read this file and extract the DNA sequence part (ACCAGAGCGG...) without any newlines, for example:

ACCAGAGCGGCACAGCAGCGACATCAGCACTAGCACTAGCATCAGCATCAGCATCAGCCTACATCATCACAGCAGCATCA

Maybe regex isn't needed?

  • You're asking a lot of questions about Python on this DNA project. Commented Nov 15, 2009 at 19:35
  • @jed - but at least answers are being marked as accepted (and hopefully upvoted). Commented Nov 15, 2009 at 19:43
  • I'm kind of a noob at python is why. Commented Nov 15, 2009 at 19:46
  • While I love Python, if you want speed for this type of calculation you should be using blastall (ncbi.nlm.nih.gov/staff/tao/URLAPI/blastall/…). While it may be a little slow to pick up, it will surely be better than reinventing the wheel. Here is a tutorial (bips.u-strasbg.fr/fr/Tutorials/Comparison/Blast/…) that looks good. Commented Nov 16, 2009 at 13:28

2 Answers

7

If there is always exactly one header line:

dnalines = text.split('\n')[1:]
dna = ''.join(dnalines)

Here text is the contents of your file (for example, text = open('yourfile').read()).
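As a minimal sketch of the same idea wrapped in a function, assuming the text holds a single record with one header line: the name extract_sequence and the shortened sample string are hypothetical, and splitlines() is swapped in for split('\n') so that Windows line endings do not end up in the result.

```python
def extract_sequence(text):
    """Join every line after the first (header) line into one string."""
    # splitlines() splits on '\n' and '\r\n' and drops the line endings.
    lines = text.splitlines()
    # lines[0] is the '>' header; everything after it is sequence data.
    return ''.join(lines[1:])


# Shortened, made-up sample in the same layout as the question's file.
sample = ">gi|5524211|gb|AAD44166.1| cytochrome\nACCAGAGCGG\nCTACATCATC\n"
print(extract_sequence(sample))  # prints ACCAGAGCGGCTACATCATC
```

For a real file you would pass in the contents, for example extract_sequence(open('yourfile').read()).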


3

I did some tests, and it appears that the following is more efficient than delroth's answer:

text.split('\n', 1)[1].replace('\n', '')

Edit: wait, it's not so simple. I timed both methods, twice each, using Python 2.6.4 and 3.1.1, on a ~30MB file:

  • Python 2.6.4, my version:

    $ python -m timeit -c "open('x').read().split('\n', 1)[1].replace('\n', '')"
    10 loops, best of 3: 221 msec per loop
    $ python -m timeit -c "open('x').read().split('\n', 1)[1].replace('\n', '')"
    10 loops, best of 3: 219 msec per loop
    
  • Python 2.6.4, delroth's version:

    $ python -m timeit -c "''.join(open('x').read().split('\n')[1:])"
    10 loops, best of 3: 392 msec per loop
    $ python -m timeit -c "''.join(open('x').read().split('\n')[1:])"
    10 loops, best of 3: 390 msec per loop
    
  • Python 3.1.1, my version:

    $ python3 -m timeit -c "open('x').read().split('\n', 1)[1].replace('\n', '')"
    10 loops, best of 3: 803 msec per loop
    $ python3 -m timeit -c "open('x').read().split('\n', 1)[1].replace('\n', '')"
    10 loops, best of 3: 798 msec per loop
    
  • Python 3.1.1, delroth's version:

    $ python3 -m timeit -c "''.join(open('x').read().split('\n')[1:])"
    10 loops, best of 3: 610 msec per loop
    $ python3 -m timeit -c "''.join(open('x').read().split('\n')[1:])"
    10 loops, best of 3: 610 msec per loop
    

Conclusion: in these runs Python 3.1.1 was much slower than Python 2.6.4, and which of the two code snippets is faster depends on the Python version!
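To rerun the comparison on whatever interpreter you have, here is a minimal sketch using the standard timeit module. The synthetic text, the function names join_version and replace_version, and the loop counts are all assumptions; it does not read the ~30MB file used above and works on an in-memory string, so the absolute numbers will not match the ones I posted.

```python
import timeit

# Synthetic stand-in for the file contents: one header line followed by
# many 58-character sequence lines (an assumption, not the original file).
text = ">header\n" + ("ACGT" * 14 + "AC\n") * 100000


def join_version(t):
    # delroth's approach: split every line, drop the header, join the rest.
    return ''.join(t.split('\n')[1:])


def replace_version(t):
    # Split off only the header line, then delete the remaining newlines.
    return t.split('\n', 1)[1].replace('\n', '')


# Both approaches should produce the same string.
assert join_version(text) == replace_version(text)

for func in (join_version, replace_version):
    # Best of 3 repeats, 10 calls each, reported per call like the CLI does.
    best = min(timeit.repeat(lambda: func(text), number=10, repeat=3))
    print("%s: %.1f msec per loop" % (func.__name__, best / 10 * 1000))
```

Expect the ranking to change between Python versions, as it did in the timings above.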
