How to write a string algorithm

Question

Given a FASTA text file (Rosalind_gc.txt), I am supposed to go through each DNA record and compute its percentage (%) of guanine-cytosine (GC) content.

For example:

Sample Dataset:

>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT

Sample output:

Rosalind_0808 60.919540

So basically: go through each record, count the number of times G or C shows up, and divide that total by the length of the record's sequence. My issue is learning how to identify the breaks between records (i.e. header lines such as >Rosalind_6404). I would like an example of this code without using Biopython and also one with the Biopython approach.
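
One way to picture the record breaks described above is to split the whole text at each ">" character, so that every piece starts with a header line followed by its sequence lines. The following is a minimal sketch, not a definitive parser: the function name `split_records` is hypothetical, the data is the sample from the question pasted inline, and it assumes ">" never appears inside a header or sequence.

```python
# Sample data copied from the question; in practice this text would come
# from reading the file, e.g. open("Rosalind_gc.txt").read().
fasta_text = """>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT
"""


def split_records(text):
    """Return {record_name: sequence} by splitting the text at each '>'."""
    records = {}
    # The piece before the first '>' is empty (or junk), so skip it.
    for block in text.split(">")[1:]:
        if not block.strip():
            continue  # ignore empty pieces, e.g. from a stray '>'
        # First line of the piece is the header, the rest is sequence data.
        header, *seq_lines = block.splitlines()
        records[header.strip()] = "".join(s.strip() for s in seq_lines)
    return records


records = split_records(fasta_text)
for name, seq in records.items():
    print(name, len(seq))
```

Each value in `records` is then a single unbroken string, so the per-record G/C count and length can be taken directly from it.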

I think there are already some tools developed to read FASTA files; is there a particular reason you want to write it on your own? WGS data can be large, and such tools are typically implemented in C.
– knh190, Commented May 30, 2019 at 21:21

Chris_Rands · Accepted Answer · 2019-05-31 11:45:06Z

2

Since you're looking for a Biopython solution, here is a very simple one:

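from Bio import SeqIO
# Note: newer Biopython releases provide gc_fraction (returns a fraction
# from 0 to 1) in place of GC, which returns a percentage.
from Bio.SeqUtils import GC

# Parse each FASTA record and print its ID with its GC percentage
for r in SeqIO.parse('Rosalind_gc.txt', 'fasta'):
    print(r.id, GC(r.seq))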

Outputs:

Rosalind_6404 53.75
Rosalind_5959 53.57142857142857
Rosalind_0808 60.91954022988506

answered May 31, 2019 at 11:45

Chris_Rands

1 Comment

Thomas William Dunn Over a year ago

Thank you! Biopython really does simplify this for you. Apologies on late response!

Alain T. · Accepted Answer · 2019-05-31 13:19:16Z

2

You could read the file line by line and accumulate sequence data up to the next line that starts with ">", then report the final record once more after the loop, since no header line follows it.

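def getCount(seq):
    # Number of G and C bases in the sequence
    return seq.count("G") + seq.count("C")

with open("input.txt", "r") as file:
    sequence = ""
    name     = ""
    for line in file:
        line = line.strip()
        if not line.startswith(">"):
            # Sequence line: append it to the current record
            sequence += line
            continue
        # Header line: report the previous record (if any), then start a new one
        if name != "":
            print(name, 100*getCount(sequence)/len(sequence))
        name     = line[1:]
        sequence = ""
    # The last record has no header after it, so report it here
    # (the check avoids dividing by zero when the file is empty)
    if sequence:
        print(name, 100*getCount(sequence)/len(sequence))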

# Rosalind_6404 53.75
# Rosalind_5959 53.57142857142857
# Rosalind_0808 60.91954022988506
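
The sample output in the question lists only the record with the highest GC content, formatted to six decimal places. The following is a minimal sketch of that last step, built on the same line-by-line idea as above. The function names `gc_percent` and `highest_gc` are hypothetical, and the sample lines are pasted inline instead of being read from a file.

```python
def gc_percent(seq):
    """GC content of one sequence as a percentage (0.0 for an empty one)."""
    if not seq:
        return 0.0
    return 100 * (seq.count("G") + seq.count("C")) / len(seq)


def highest_gc(lines):
    """Return (name, percent) for the record with the highest GC content."""
    sequences = {}
    name = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # Header line: start collecting a new record
            name = line[1:]
            sequences[name] = []
        elif line and name is not None:
            # Sequence line: belongs to the most recent header
            sequences[name].append(line)
    percents = {n: gc_percent("".join(parts)) for n, parts in sequences.items()}
    # max() raises ValueError if no records were found
    best = max(percents, key=percents.get)
    return best, percents[best]


# Sample data from the question; a file object opened on the FASTA file
# could be passed to highest_gc() in the same way.
sample = [
    ">Rosalind_6404",
    "CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC",
    "TCCCACTAATAATTCTGAGG",
    ">Rosalind_5959",
    "CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT",
    "ATATCCATTTGTCAGCAGACACGC",
    ">Rosalind_0808",
    "CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC",
    "TGGGAACCTGCGGGCAGTAGGTGGAAT",
]

name, percent = highest_gc(sample)
print(f"{name} {percent:.6f}")  # Rosalind_0808 60.919540
```

Storing the sequence lines in a list and joining them once avoids repeated string concatenation, which can matter for long records.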

edited May 31, 2019 at 13:19

answered May 30, 2019 at 21:33

Alain T.

1 Comment

Thomas William Dunn Over a year ago

Thank you! This is the type of approach I was looking for

knh190 · Accepted Answer · 2019-06-03 23:09:11Z

1

You may want to make use of precompiled C modules as much as possible for performance. Here is one solution using a regex:

import re

seq = 'CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCCTCCCACTAATAATTCTGAGG'

# re.subn() returns (new_string, number_of_substitutions); index 1 is the G/C count
perc = re.subn(r'[GC]', '', seq)[1] / len(seq)

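To also handle the ">" header lines:

seq = []
name = ''

def report(name, parts):
    # Join the collected lines and print the record's GC percentage
    seq = ''.join(parts)
    perc = re.subn(r'[GC]', '', seq)[1] / len(seq)
    print('{} has GC percent: {}'.format(name, perc * 100))

with open('Rosalind_gc.txt') as f:
    for line in f:
        if not line.startswith('>'):
            seq.append(line.strip())
        else:
            if seq:
                report(name, seq)
                seq = []
            name = line.strip()[1:]  # drop the leading ">"

# The final record is not followed by a header, so report it after the loop
if seq:
    report(name, seq)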
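
As a quick sanity check (not part of the original answer), the regex-based count can be compared with plain `str.count()` on the first sample record. This is only a sketch to show that the approaches agree. The variable names are illustrative.

```python
import re

# First record from the question, with its two lines joined
seq = ("CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC"
       "TCCCACTAATAATTCTGAGG")

# Three ways to count G and C characters
count_based = seq.count("G") + seq.count("C")
findall_based = len(re.findall(r"[GC]", seq))
subn_based = re.subn(r"[GC]", "", seq)[1]  # index 1 is the substitution count

print(count_based, findall_based, subn_based)
print(100 * count_based / len(seq))  # 53.75
```

All three counts should be equal; the character class `[GC]` matches a single G or a single C, whereas the pattern `GC` would only match the two letters in sequence.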

edited Jun 3, 2019 at 23:09

answered May 30, 2019 at 21:35

knh190

5 Comments

Chris_Rands Over a year ago

Python's str.count() is implemented in C (for CPython), so why use a regex?

knh190 Over a year ago

@Chris_Rands In this case str.count() is fast and concise, but a regex is more flexible (it can match multiple patterns in one run). There are also Biopython implementations that are fast enough, so for this question the regex is just another option.

knh190 Over a year ago

@Chris_Rands Have you found that any post answered your question? In my view it has already been pretty much answered, yet I don't see any feedback.

Ghoti Over a year ago

Regex doesn't work; it matches "GC" exactly instead of matching the characters "G" or "C" (the patterns "[GC]" or "G|C" would work). Also, re.subn() returns a tuple that needs to be indexed. Looks good otherwise.

knh190 Over a year ago

@Ghoti I misread the question. I thought it was about finding consecutive "GC". I'll fix the answer.
