1

I am new to Python, so please bear with me.

I can't get this little script to work properly:

genome = open('refT.txt','r')

datafile - a reference genome with a bunch (2 million) of contigs:

Contig_01
TGCAGGTAAAAAACTGTCACCTGCTGGT
Contig_02
TGCAGGTCTTCCCACTTTATGATCCCTTA
Contig_03
TGCAGTGTGTCACTGGCCAAGCCCAGCGC
Contig_04
TGCAGTGAGCAGACCCCAAAGGGAACCAT
Contig_05
TGCAGTAAGGGTAAGATTTGCTTGACCTA

The file is opened:

cont_list = open('dataT.txt','r')

a list of contigs that I want to extract from the dataset listed above:

Contig_01
Contig_02
Contig_03
Contig_05

My hopeless script:

for line in cont_list:
    if genome.readline() not in line:
        continue
    else:
        a=genome.readline()
        s=line+a    
        data_out = open ('output.txt','a')
        data_out.write("%s" % s)
        data_out.close()

input('Press ENTER to exit')

The script successfully writes the first three contigs to the output file, but for some reason it doesn't seem able to skip "contig_04", which is not in the list, and move on to "Contig_05".

I might seem a lazy bastard for posting this, but I've spent all afternoon on this tiny bit of code -_-

4
  • 1
    the problem is that your continue makes you skip the line in cont_list. you need to loop on genome only until you find line Commented Feb 18, 2014 at 15:50
  • 1
    Aside from skipped lines, are the line names guaranteed to appear in the same order in the cont_list and genome files? Commented Feb 18, 2014 at 15:52
  • 1
    you can solve it simply by replacing the if by a while Commented Feb 18, 2014 at 15:52
  • Thanks guys! By changing the "if" with a "while" and ensuring that the names in the reference file was in the same order as in the datafile my script worked perfectly! Bless you fine people :) Commented Feb 19, 2014 at 9:29

3 Answers 3

1

I would first try to generate an iterable which gives you a tuple: (contig, gnome):

def pair(file_obj):
    for line in file_obj:
        yield line, next(file_obj)

Now, I would use that to get the desired elements:

wanted = {'Contig_01', 'Contig_02', 'Contig_03', 'Contig_05'}
with open('filename') as fin:
    pairs = pair(fin)
    while wanted:
        p = next(pairs)
        if p[0] in wanted:
            # write to output file, store in a list, or dict, ...
            wanted.forget(p[0])
Sign up to request clarification or add additional context in comments.

Comments

0

I would recommend several things:

  • Try using with open(filename, 'r') as f instead of f = open(...)/f.close(). with will handle the closing for you. It also encourages you to handle all of your file IO in one place.

  • Try to read in all the contigs you want into a list or other structure. It is a pain to have many files open at once. Read all the lines at once and store them.


Here's some example code that might do what you're looking for

from itertools import izip_longest

# Read in contigs from file and store in list
contigs = []
with open('dataT.txt', 'r') as contigfile:
    for line in contigfile:
        contigs.append(line.rstrip()) #rstrip() removes '\n' from EOL

# Read through genome file, open up an output file
with open('refT.txt', 'r') as genomefile, open('out.txt', 'w') as outfile:
    # Nifty way to sort through fasta files 2 lines at a time
    for name, seq in izip_longest(*[genomefile]*2):
        # compare the contig name to your list of contigs
        if name.rstrip() in contigs:
            outfile.write(name) #optional. remove if you only want the seq
            outfile.write(seq)

Comments

0

Here's a pretty compact approach to get the sequences you'd like.

def get_sequences(data_file, valid_contigs):
    sequences = []

    with open(data_file) as cont_list:
        for line in cont_list:
            if line.startswith(valid_contigs):
                sequence = cont_list.next().strip()
                sequences.append(sequence)

    return sequences

if __name__ == '__main__':
    valid_contigs = ('Contig_01', 'Contig_02', 'Contig_03', 'Contig_05')
    sequences = get_sequences('dataT.txt', valid_contigs)
    print(sequences)

The utilizes the ability of startswith() to accept a tuple as a parameter and check for any matches. If the line matches what you want (a desired contig), it will grab the next line and append it to sequences after stripping out the unwanted whitespace characters. From there, writing the sequences grabbed to an output file is pretty straightforward.

Example output:

['TGCAGGTAAAAAACTGTCACCTGCTGGT',
 'TGCAGGTCTTCCCACTTTATGATCCCTTA',
 'TGCAGTGTGTCACTGGCCAAGCCCAGCGC',
 'TGCAGTAAGGGTAAGATTTGCTTGACCTA']

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.