I am trying to combine the entries from several files in a folder (1.20.1_Indel_allEff.vcf, 1.20.2_Indel_allEff.vcf, ..., 1.200.1_Indel_allEff.vcf) into a matrix that looks something like this:

Fm  Chromosome  Position  Ref  Alt  Gene  X1.20.1  X1.20.2  X1.20.3
Fm  chrI        100007    AT   A    CAR2  0        0        0
Fm  chrX        3000676   G    T    HYM1  0        0        0.5

Here X1.20.1, X1.20.2, X1.20.3, ..., X1.200.3 are columns named after the individual files in the folder, and each column holds the frequency values from that file.

I wrote a script in Python (F1_comparison.py):

import sys
from os import listdir
from collections import defaultdict

snps = defaultdict(lambda: defaultdict(str))
myfiles=listdir(str(sys.argv[1]))
for f1 in myfiles:
    f = open(f1)
    tpp = f1.split("_")[0].split(".")
    tp=tpp[0]+'.'+tpp[1]+'.'+tpp[2]
    for l in f:
        ls = l.split()
        if l.find("#") == -1 and len(ls) > 6: 
            chrom = ls[0]
            pos = ls[1]
            ref = ls[2]
            alt = ls[3]
            freq = ls[4]
            typ = ls[5]
            gene = ls[6]
            if len(alt) == 1:
                snps[pos+"_"+ref+"-"+alt+"_"+chrom+"_"+gene+"_"+typ][tp] = freq
            elif len(alt) > 1:
                for k in range(0, len(alt.split(","))):
                    snps[pos+"_"+ref+alt.split(",")[k]+"_"+chrom+"_"+gene+"_"+typ][tp] = freq.split(",")[k]

    f.close()

traj = 1
tp_list = ['1.20.1','1.20.2','1.20.3','1.30.1','1.30.2','1.30.3','1.40.1','1.40.2','1.40.3','1.50.1','1.50.2','1.50.3','1.60.1','1.60.2','1.60.3','1.90.1','1.90.2','1.90.3','1.100.1','1.100.2','1.100.3','1.130.1','1.130.2','1.130.3','1.200.1','1.200.2','1.200.3']
print "Fermentor\tTrajectory\tChromosome\tPosition\tMutation\tGene\tEffect\t1.20.1\t1.20.2\t1.20.3\t1.30.1\t1.30.2\t1.30.3\t1.40.1\t1.40.2\t1.40.3\t1.50.1\t1.50.2\t1.50.3\t1.60.1\t1.60.2\t1.60.3\t1.90.1\t1.90.2\t1.90.3\t1.100.1\t1.100.2\t1.100.3\t1.130.1\t1.130.2\t1.130.3\t1.200.1\t1.200.2\t1.200.3"
for pos in sorted(snps.keys()):
    pos1 = pos.split("_")[0]
    mut = pos.split("_")[1]
    chrom = pos.split("_")[2] 
    gene = pos.split("_")[3]
    typ = pos.split("_")[4]
    tp_string = ""
    for tp in tp_list:
        if len(snps[pos][tp])>0:
            tp_string += "\t"+str(snps[pos][tp])
        else:
            tp_string += "\t"+str("0/0")

    print "F1"+"\t"+str(traj)+"\t"+chrom+"\t"+pos1+"\t"+mut+"\t"+gene+"\t"+typ+"\t"+tp_string
    traj += 1

However, I get an error: the script fails to find some of the files in the folder, even though they all have the same format.

My command and the error I get:

python F1_comparison.py Fer1 > output.csv

Traceback (most recent call last):
    File "Fer1_comparison.py", line 18, in <module>
    f = open(f1)
    IOError: [Errno 2] No such file or directory: '1.30.2_INDEL_allEff.vcf' 

Can someone help me figure out this problem please? It will be a great help. Thanks

  • Hi Padraic, I cross checked my command. I ran the code from the same directory. But I still get an error. Commented Jan 23, 2015 at 12:47
  • @PM 2Ring: I corrected the indentation. I introduced that indentation error while copying the code here. Could you suggest an alternative, please? Commented Jan 23, 2015 at 12:57
  • Thanks PM 2Ring. I have updated the code with the correction. Commented Jan 23, 2015 at 13:04

1 Answer

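You need to join the file name to the directory path. listdir returns bare file names, not paths, so open(f1) looks for the file in the current working directory instead of in the folder you passed: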

import sys
from os import path, listdir

pth = sys.argv[1]  # directory given on the command line
myfiles = listdir(pth)  # bare file names inside that directory, without the directory part
for f1 in myfiles:
    with open(path.join(pth, f1)) as f:  # join -> pth/f1; with also closes your file
        tpp = f1.split("_", 1)[0].split(".")
        tp = ".".join(tpp[0:3])  # same as tp=tpp[0]+'.'+tpp[1]+'.'+tpp[2]
        for line in f:
            pass  # continue your code ...
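
To see why the join matters, here is a minimal sketch that builds a throwaway directory and compares what listdir returns with the joined paths. The helper name list_full_paths and the temporary directory are assumptions made up for this illustration; only os.listdir, os.path.join and tempfile.mkdtemp are standard library calls.

```python
import os
import tempfile


def list_full_paths(directory):
    """Return the full path of every entry in `directory`, sorted by name."""
    return [os.path.join(directory, name) for name in sorted(os.listdir(directory))]


# Build a throwaway directory holding one stand-in file (hypothetical content).
demo_dir = tempfile.mkdtemp()
demo_name = "1.30.2_INDEL_allEff.vcf"
with open(os.path.join(demo_dir, demo_name), "w") as handle:
    handle.write("#header\n")

bare_names = os.listdir(demo_dir)        # file names only, no directory part
full_paths = list_full_paths(demo_dir)   # names joined onto the directory

print(bare_names)
print(full_paths)
```

Opening an entry of bare_names only works if the script happens to run from inside that directory, while the entries of full_paths can be opened from anywhere.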

You can also make your code a little more concise and efficient by using slicing, unpacking, str.format and enumerate, and by not splitting the same string repeatedly:

from os import path, listdir
import sys
from collections import defaultdict

snps = defaultdict(lambda: defaultdict(str))
pth = sys.argv[1]  # get full path
myfiles = listdir(pth)  # get list of all files in that path/directory

with open("Fer1_INDELs_clones_filtered.csv","w") as out: # file to write all filtered data to
    out.write("Fermentor\tTrajectory\tChromosome\tPosition\tMutation\tGene\tEffect\t1.20.1\t1.20.2\t1.20.3\t1.30.1\t1.30.2\t1.30.3\t1.40.1\t1.40.2\t1.40.3\t1.50.1\t1.50.2\t1.50.3\t1.60.1\t1.60.2\t1.60.3\t1.90.1\t1.90.2\t1.90.3\t1.100.1\t1.100.2\t1.100.3\t1.130.1\t1.130.2\t1.130.3\t1.200.1\t1.200.2\t1.200.3\n")
    for f1 in myfiles:
        with open(path.join(pth, f1)) as f:  # join -> pth/f1
            tpp = f1.split("_", 1)[0].split(".")
            tp = ".".join(tpp[0:3])  # same as tp=tpp[0]+'.'+tpp[1]+'.'+tpp[2]
            for line in f:
                ls = line.split()
                if line.find("#") == -1 and len(ls) > 6: 
                    # use unpacking and slicing
                    chrom, pos, ref, alt, freq, typ, gene = ls[:7]
                    if len(alt) == 1:
                        # use str.format
                        snps["{}_{}-{}_{}_{}_{}".format(pos, ref, alt, chrom, gene, typ)][tp] = freq
                    elif len(alt) > 1:
                        # use enumerate; keep the same key layout as above so that
                        # pos.split("_") below still yields exactly five fields
                        for ind, k in enumerate(alt.split(",")):
                            snps["{}_{}-{}_{}_{}_{}".format(pos, ref, k, chrom, gene, typ)][tp] = freq.split(",")[ind]
    traj = 1
    tp_list = ['1.20.1', '1.20.2', '1.20.3', '1.30.1', '1.30.2', '1.30.3', '1.40.1', '1.40.2', '1.40.3', '1.50.1', '1.50.2',
               '1.50.3', '1.60.1', '1.60.2', '1.60.3', '1.90.1', '1.90.2', '1.90.3', '1.100.1', '1.100.2', '1.100.3',
               '1.130.1', '1.130.2', '1.130.3', '1.200.1', '1.200.2', '1.200.3']
    for pos in sorted(snps):
        # split once and again use unpacking and slicing 
        pos1, mut, chrom, gene, typ = pos.split("_")[:5]
        tp_string = ""
        for tp in tp_list:
            #print(tp)
            if snps[pos][tp]: # empty value will be False no need to check len
                tp_string += "\t{}".format(snps[pos][tp])
            else:
                tp_string += "\t0/0"

        # tp_string already starts with a tab, so no separator is needed before it
        out.write("F1\t{}\t{}\t{}\t{}\t{}\t{}{}\n".format(traj, chrom, pos1, mut, gene, typ, tp_string))
        traj += 1 
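
If the columns ever look shifted, one thing worth trying is to derive the header and the data cells from the same tp_list, and to key the dictionary on a tuple instead of an underscore-joined string, so a gene name containing "_" cannot move fields around. The following is only a sketch under those assumptions: build_table, FIXED_COLUMNS, the effect labels and the sample records are all made up for illustration and are not part of your data.

```python
from collections import defaultdict

FIXED_COLUMNS = ["Fermentor", "Trajectory", "Chromosome", "Position",
                 "Mutation", "Gene", "Effect"]


def build_table(snps, tp_list, fermentor="F1", missing="0/0"):
    """Return the table as a list of tab-separated lines, header first.

    `snps` maps (chrom, pos, mutation, gene, effect) tuples to
    {time_point: frequency} dictionaries.
    """
    # Header and rows both take their time-point columns from tp_list.
    lines = ["\t".join(FIXED_COLUMNS + list(tp_list))]
    # Sort by chromosome name, then numerically by position.
    ordered = sorted(snps, key=lambda k: (k[0], int(k[1]), k[2:]))
    for traj, key in enumerate(ordered, start=1):
        chrom, pos, mut, gene, effect = key
        freqs = [snps[key].get(tp, missing) for tp in tp_list]
        lines.append("\t".join([fermentor, str(traj), chrom, pos, mut, gene, effect] + freqs))
    return lines


# Hypothetical records, only to exercise the function.
snps = defaultdict(dict)
snps[("chrI", "100007", "AT-A", "CAR2", "frameshift")]["1.20.1"] = "0/1"
snps[("chrX", "3000676", "G-T", "HYM1", "missense")]["1.20.3"] = "1/1"
snps[("chrI", "9500", "C-CA", "MY_GENE", "frameshift")]["1.20.2"] = "0/1"

tp_list = ["1.20.1", "1.20.2", "1.20.3"]
table = build_table(snps, tp_list)
for line in table:
    print(line)
```

Because every row is assembled from the same two lists as the header, each line ends up with the same number of tab-separated fields, and a time point can no longer be present in tp_list but missing from the header.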

6 Comments

Hi Padraic, I am still facing some problems with the code and am not able to figure out what's going wrong. Could you help, please?
Sure, what is the problem?
I get the same error that I started off with. The code does not recognize some of the files, though they are present in the folder.
This is what I was looking for bro. Thanks a ton! Appreciate your time and help.
There is just one minor issue. If you observe, the columns are mixed up in some of the cases. What I mean is that the order is not maintained throughout. Could you suggest a way to fix that?