I have a pandas dataframe and I want to plot slices of it inside a function, using multiprocessing. The function "process_expression" works when I call it directly, but when I run it through multiprocessing it does not produce any plots.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import scipy
import seaborn as sns
import sys
from multiprocessing import Pool
import os
os.system("taskset -p 0xff %d" % os.getpid())


pool = Pool()  
gn = pool.map(process_expression, gene_ids)
pool.close()
pool.join()

def process_expression(gn_name, df_gn=df_coding):
    df_part = df_gn.loc[df_gn['Gene_id'] == gn_name]
    df_part = df_part.drop('Gene_id', axis=1)
    df_part = df_part.drop('Transcript_biotype', axis=1)

    COUNT100 = df_part[df_part >100 ].count()
    COUNT10 = (df_part[df_part >10 ].count()) - COUNT100
    COUNT1 = (df_part[df_part >1].count())- COUNT100 - COUNT10 
    COUNT0 = (df_part[df_part >0].count())- COUNT100-COUNT10- COUNT1
    result = pd.concat([COUNT0,COUNT1,COUNT10,COUNT100], axis=1)
    result.columns = [ '0 TO 1', '1 TO 10','10 TO 100', '>100']
    result.plot( kind='bar', figsize=(50, 20), fontsize=7, stacked=True) 
    plt.savefig('./expression_levels/all_genes/'+gn_name+'.png')#,bbox_inches='tight')  
    plt.close() 

The df_coding table looks something like this (it has more columns; I removed some):

 Isoform_name,heart,heart.1,lung.3,Gene_id,Transcript_biotype
 ENST00000296782,0.14546900000000001,0.161245,0.09479889999999999,ENSG00000164327,protein_coding
 ENST00000357387,6.53902,5.86969,7.057689999999999,ENSG00000164327,protein_coding
 ENST00000514735,0.0,0.0,0.0,ENSG00000164327,protein_coding

The input dataframe df_coding has a column Gene_id, which holds the gn_name values. For each gn_name, I want to take only the rows of the dataframe whose Gene_id equals that gn_name and draw a bar plot from that slice.

For example, if I call process_expression('ENSG00000164327'), which is a specific gn_name, the output is something like this:

[Image: bar plot of the dataframe when gn_name is ENSG00000164327]

What am I doing wrong? I know that the process stops at the plotting command when I run it with multiprocessing.

  • sorry, I meant "result", the dataframe that I create. I edited the code. Commented Jun 16, 2015 at 14:23
  • Can you provide more context? Import lines and a small subset of the data to be evaluated would be helpful (enough to demonstrate the functioning non-multithreaded version). Also, the working code for the non-multithreaded case would be helpful. Commented Jun 16, 2015 at 14:25
  • I updated the initial code again. Is it clear now? Commented Jun 16, 2015 at 16:21

1 Answer

The problem lies in the interaction between multiprocessing and matplotlib. With multiprocessing, each worker process runs in a completely new context. That new context does not (and cannot) successfully initialize the plotting context, because it was already initialized in the parent process.
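A workaround that is often suggested for this kind of clash, offered here as a sketch and not as the method this answer describes, is to select matplotlib's non-interactive Agg backend before pyplot is imported, so that the workers only ever write image files. It assumes the plots do not need to be shown on screen. The names plot_gene, render_all, and TOY_COUNTS are hypothetical and stand in for the question's function and data.

```python
import os
import tempfile
from multiprocessing import Pool

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, selected before pyplot is imported
import matplotlib.pyplot as plt

# Hypothetical toy data standing in for per-gene counts.
TOY_COUNTS = {"geneA": [1, 2, 3], "geneB": [3, 0, 1]}


def plot_gene(job):
    """Draw one bar plot and save it as a PNG; intended to run in a worker."""
    gene, out_dir = job
    values = TOY_COUNTS[gene]
    fig, ax = plt.subplots()
    ax.bar(range(len(values)), values)
    ax.set_title(gene)
    path = os.path.join(out_dir, gene + ".png")
    fig.savefig(path)
    plt.close(fig)  # release the figure so memory does not grow across jobs
    return path


def render_all(out_dir, processes=2):
    """Render every toy gene in a worker pool and return the saved paths."""
    jobs = [(gene, out_dir) for gene in sorted(TOY_COUNTS)]
    with Pool(processes) as pool:
        return pool.map(plot_gene, jobs)


if __name__ == "__main__":
    # The pool is created only after plot_gene is defined at module level.
    print(render_all(tempfile.mkdtemp()))
```

Note that the worker function is defined at module level before the pool is created, and the pool is started under the `if __name__ == "__main__":` guard. The code in the question creates the pool before process_expression is defined, which may be worth ruling out as a separate cause.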

If you are trying to overcome a performance issue then you may be on the right track. However, plotting back to the correctly initialized context of the parent process will require you to go a lot deeper into matplotlib's internals. Here is an example of setting up a data pipe back to the original application. Really, this will only help if you do a lot of processing on the data before it is plotted, and that does not look like what you are doing here.
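One way to read the idea of sending data back to the original application is to let the workers return plain numbers and keep every matplotlib call in the parent process. The sketch below is an illustration under that assumption, not the linked example. The bin labels follow the question; count_bins, plot_counts, and the sample values are hypothetical.

```python
import os
import tempfile
from multiprocessing import Pool

# Bin labels from the question: (0, 1], (1, 10], (10, 100], and above 100.
LABELS = ["0 TO 1", "1 TO 10", "10 TO 100", ">100"]


def count_bins(item):
    """Worker: count how many values fall into each bin. No plotting here."""
    gene, values = item
    counts = [0, 0, 0, 0]
    for value in values:
        if value > 100:
            counts[3] += 1
        elif value > 10:
            counts[2] += 1
        elif value > 1:
            counts[1] += 1
        elif value > 0:
            counts[0] += 1
    return gene, counts


def plot_counts(gene, counts, out_dir):
    """Parent only: draw and save one bar plot from precomputed counts."""
    # Imported here so that worker processes never need matplotlib.
    import matplotlib
    matplotlib.use("Agg")
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.bar(LABELS, counts)
    ax.set_title(gene)
    path = os.path.join(out_dir, gene + ".png")
    fig.savefig(path)
    plt.close(fig)
    return path


if __name__ == "__main__":
    # Hypothetical sample values, not taken from the question's data.
    data = {"geneA": [0.1, 6.5, 5.9, 7.1], "geneB": [0.0, 150.0, 12.0]}
    with Pool(2) as pool:
        results = pool.map(count_bins, list(data.items()))
    out_dir = tempfile.mkdtemp()
    for gene, counts in results:
        print(plot_counts(gene, counts, out_dir))
```

This split only pays off when the counting is the slow part. If drawing dominates the run time, the parent process remains the bottleneck.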

If you are trying to get a visual effect like stacked / overlaid results then you probably want to look into repeating the plot function or modifying the data structure to better represent what you want to visualize.

So. What problem are you trying to solve? A performance problem, or a visualization problem? If it is a visualization problem then you do NOT want to use multiprocessing.

2 Comments

  • Thank you for your elaborate answer. My main issue is performance. I am creating a great many plots and I would like to make it faster. The thing is that I am pretty sure the problem comes from using the plot wrapper of pandas. So I am wondering whether the problem is the matplotlib wrapper of pandas or matplotlib to begin with.
  • matplotlib is flexible and easy, but not particularly fast. Running on an i3 it takes a few seconds for me to update a panel of 96 plots. How many plots are you displaying in your data set? Multiprocessing is the only way you are going to improve performance with a number crunching bottleneck (likely the issue when processing many data points for visualization). The linked example code probably won't help because it is still one thread doing all of the processing. Does all of your data need to be plotted in the same figure? Does it all need to be displayed interactively?
