I have a file containing a dictionary of values on each line that I grab and use to query a mysql database using each key as a query. The results of each query get placed in a dict and once all values for the query dict have been generated the line gets written out.
IN > foo bar someotherinfo {1: 'query_val', 2: 'query_val', 3: 'query_val'
OUT > foo bar someotherinfo 1_result 2_result 3_result
This whole process appears to be somewhat slow because I'm performing around 200,000 mysql queries per file and have around 10 files per sample and around 30 samples in total, so I'm looking to speed up the whole process.
I'm just wondering if the fileIO could be creating a bottleneck. Instead of writing the line_info (foo,bar,somblah) followed by each result dict as it's returned, would I be better of chunking these results into memory before writing them to file in batches?
Or is this simply a case of having to just wait it out... ?
Example Input line and output line
INPUT
XM_006557349.1 1 - exon XM_006557349.1_exon_2 10316 10534 {1: 10509:10534', 2: '10488:10508', 3: '10467:10487', 4: '10446:10466', 5: '10425:10445', 6: '10404:10424', 7: '10383:10403', 8: '10362:10382', 9: '10341:10361', 10: '10316:10340'}
OUTPUT
XM_006557349.1 1 - exon XM_006557349.1_exon_2 10316 105340.7083 0.2945 0.2 0.2931 0.125 0.1154 0.2095 0.5833 0.0569 0.0508
CODE
def array_2_meth(sample,bin_type,type,cur_meth):
bins_in = open('bin_dicts/'+bin_type,'r')
meth_out = open('meth_data/'+bin_type+'_'+sample+'_plus_'+type+'_meth.tsv','w')
for line in bins_in.readlines():
meth_dict = {}
# build array of data from each line
array = line.strip('\n').split('\t')
mrna_id = array[0]
assembly = array[1]
strand = array[2]
bin_dict = ast.literal_eval(array[7])
for bin in bin_dict:
coords = bin_dict[bin].split(':')
start = int(coords[0]) -1
end = int(coords[1]) +1
cur_meth.execute('select sum(mc)/sum(h) from allc_'+str(sample)+'_'+str(assembly) + ' where strand = \'' +str(strand) +'\' and class = \''+str(type)+'\' and position between '+str(start)+' and ' +str(end) + ' and h >= 5')
for row in cur_meth.fetchall():
if str(row[0]) == 'None':
meth_dict[bin] = 'no_cov'
else:
meth_dict[bin] = float(row[0])
meth_out.write('\t'.join(array[:7]))
for k in sorted(meth_dict.keys()):
meth_out.write('\t'+str(meth_dict[k]))
meth_out.write('\n')
meth_out.close()
Not sure if adding this code is going to be a massive help, but it should show the way I'm approaching this.. Any advice you could provide on mistakes I'm making in my approach or tips on how to optimise would be greatly appreciated!!!
Thanks ^_^