0

given a FASTA text file (Rosalind_gc.txt), I am supposed to go through each DNA record and identify the percentage (%) of Guanine-Cytosine (GC) content.

Example of this is :

Sample Dataset:

>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG    
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT

Sample output:

Rosalind_0808 60.919540

So basically go through each string, count the amt of times G/C show up and then divide that total by the length of each string. My issue is learning how to identify the breaks in code (i.e. >Rosalind_6404 ). I would like an example of this code without using Biopython and also with the biopython approach.

1
  • 1
    I think there are already some tools developed to read fasta files, is there a particular case you want to write it on your own? WGS data can be large. They were typically implemented in C. Commented May 30, 2019 at 21:21

3 Answers 3

2

Since you're looking for a Biopython solution, here is a very simple one:

from Bio import SeqIO
from Bio.SeqUtils import GC

for r in SeqIO.parse('Rosalind_gc.fa', 'fasta'):
    print(r.id, GC(r.seq))

Outputs:

Rosalind_6404 53.75
Rosalind_5959 53.57142857142857
Rosalind_0808 60.91954022988506
Sign up to request clarification or add additional context in comments.

1 Comment

Thank you! Biopython really does simplify this for you. Apologies on late response!
2

You could read the file line by line and accumulate sequence data up to the next line that starts with ">" (plus one more time for the end of the file)

def getCount(seq):
    return seq.count("G")+seq.count("C") 

with open("input.txt","r") as file:
    sequence = ""
    name     = ""
    for line in file:
        line = line.strip()
        if not line.startswith(">"):
            sequence += line
            continue
        if name != "":
            print(name, 100*getCount(sequence)/len(sequence))
        name     = line[1:]
        sequence = ""
    print(name, 100*getCount(sequence)/len(sequence))

# Rosalind_6404 53.75
# Rosalind_5959 53.57142857142857
# Rosalind_0808 60.91954022988506

1 Comment

Thank you! This is the type of approach I was looking for
1

You may want to make use of precompiled C modules as much as possible for performance issue. There's one solution using regex:

seq = 'CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGG'

import re
perc = re.subn(r'[GC]', '', seq) / len(seq)

And also handle the ">" lines:

seq = []
name = ''

for line in open('Rosalind_gc.txt'):
    if not line.startswith('>'):
        seq.append(line.strip())
    else:
        if seq:
            seq = ''.join(seq)
            perc = re.subn(r'[GC]', '', seq) / len(seq)
            print('{} has GC percent: {}'.format(name, perc * 100))
            seq = []
        name = line.strip()

5 Comments

Python's str.count() is implemented in C (for CPython), so why use a regex?
@Chris_Rands In this case str.count() is fast and concise but regex is more flexible (can match multiple patterns in one run). Yet there are also Biopython implementations fast enough so speaking of this question, it's just providing another option.
@Chris_Rands Have you found any post answered your question? It has been already pretty much answered in my view. Yet I don't see any feedback.
Regex doesn't work; matches "GC" exactly instead of matching characters "G" or "C" (patterns "[GC]" or "G|C" would work). Also the re.subn() returns a tuple that needs to be indexed. Looks good otherwise.
@Ghoti I misread the question. I thought was finding consequent GC. I'd fix the answer.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.