
I'm fairly new to coding (completely self-taught) and have started using it at my job as a research assistant in a cancer lab. I need some help setting up a few line graphs in matplotlib.

I have a dataset that includes next-generation sequencing data for about 80 patients. For each patient, we have different timepoints of analysis, different genes detected (out of 40), and the associated % mutation for each gene.

My goal is to write two scripts. The first will generate a "by patient" plot: a line graph with % mutation on the y-axis and time of measurement on the x-axis, with a different colored line for each of the patient's associated genes. The second will generate a "by gene" plot: one plot per gene, with a different colored line for each patient's x/y values for that specific gene.

Here is an example dataframe for one patient (pt# 3):

gene     yaxis  xaxis  pt#  gene#
ASXL1-3  34     1      3    1
ASXL1-3  0      98     3    1
IDH1-3   24     1      3    11
IDH1-3   0      98     3    11
RUNX1-3  38     1      3    21
RUNX1-3  0      98     3    21
U2AF1-3  33     1      3    26
U2AF1-3  0      98     3    26

I have set up a groupby that, when I iterate over it, gives me a dataframe containing every gene-timepoint row for each patient.

grouped = df.groupby('pt #')
for groupObject in grouped:
    group = groupObject[1]

For patient 1, this gives the following output:

        y     x   gene  patientnumber patientgene  genenumber  dxtotransplant  \
0    40.0  1712  ASXL1              1     ASXL1-1           1            1857   
1    26.0  1835  ASXL1              1     ASXL1-1           1            1857   
302   7.0  1835  RUNX1              1     RUNX1-1          21            1857   

I need help writing a script that will create either of the plots described above. Using the by-patient example, my general idea is that I need to create a different subplot for every gene a patient has, where each subplot is the line graph for that one gene.

Using matplotlib this is about as far as I have gotten:

plt.figure()

grouped = df.groupby('patient number')

for groupObject in grouped:
    group = groupObject[1]
    df = group #may need to remove this
    for element in range(len(group)): 
        xs = np.array(df[df.columns[1]]) #"x" column
        ys= np.array(df[df.columns[0]]) #"y" column
        gene = np.array(df[df.columns[2]])[element] #"gene" column
        plt.subplot(1,1,1) 
        plt.scatter(xs,ys, label=gene)
        plt.plot(xs,ys, label=gene)
        plt.legend()
    plt.show()

This produces the following output:

[image: resulting plot, with one line circled]

In this output, the circled line is not supposed to be connected to the other 2 points. In this case, this is patient 1, who has the following data points:

x       y   gene
1712    40  ASXL1
1835    26  ASXL1
1835    7   RUNX1

Using seaborn I have gotten close to my desired graph using this code:

grouped = df.groupby(['patientnumber'])
for groupObject in grouped:
    group = groupObject[1]
    g = sns.FacetGrid(group, col="patientgene", col_wrap=4, size=4, ylim=(0,100))  
    g = g.map(plt.scatter, "x", "y", alpha=0.5)
    g = g.map(plt.plot, "x", "y", alpha=0.5)
    plt.title= "gene:%s"%element

Using this code I get the following:

If I adjust the line:

g = sns.FacetGrid(group, col="patientnumber", col_wrap=4, size=4, ylim=(0,100))

I get the following result:

[image: resulting plot]

As you can see in the second example, the plot treats every point as if it belongs to the same line (but there are actually 4 separate lines).

How can I tweak my iterations so that each patient-gene is treated as a separate line on the same graph?

  • This might be a bit broad for here - you have a good level of detail, but tag communities on Stack Overflow generally try to discourage posts seeking broad guidance or get-me-started. Since you have spent some time on this, will you show us what you have tried, even if that did not work? I have however removed your deadlines - anything that tries to rush volunteers is not generally well received ;-) Commented Jul 13, 2016 at 11:56
  • @halfer Hey, I'm fairly new to this website (and the coding community in general), thanks for calling me out on my faux pas. I have tried using seaborn, matplotlib, and bokeh and seem to run into the same error with each of them (i.e. every point on my line graph is treated as if they are connected, rather than representing data from multiple lines). I will update my question with some more detail on what I tried and what the output was. Thanks. I didn't mean to make it sound like I was rushing the community for help, rather that I was desperate for help. I apologize that it came off that way. Commented Jul 13, 2016 at 16:38
  • No worries, yes if you can update your question with your approach, that often will guide answers, since it shows the approach/strategy you are taking, and perhaps someone will spot a mistake. Commented Jul 13, 2016 at 17:04
  • @halfer I just added 3 examples of code I have written, the outputs, and how they differ from what I am trying to accomplish. I also shortened my prose a bit and tried to remove some unnecessary detail. I hope this is more in line with what is expected in a post. Thanks again and hopefully this will make it easier for volunteers to help. Commented Jul 13, 2016 at 18:40
  • Great effort, that's what we like to see. I can't advise on this topic myself, but this looks like good prep. Commented Jul 13, 2016 at 21:11

1 Answer


I wrote a subplot function that may give you a hand. I modified the data a tad to help illustrate the plotting functionality.

gene,yaxis,xaxis,pt #,gene #
ASXL1-3,34,1,3,1
ASXL1-3,3,98,3,1
IDH1-3,24,1,3,11
IDH1-3,7,98,3,11
RUNX1-3,38,1,3,21
RUNX1-3,2,98,3,21
U2AF1-3,33,1,3,26
U2AF1-3,0,98,3,26
ASXL1-3,39,1,4,1
ASXL1-3,8,62,4,1
ASXL1-3,0,119,4,1
IDH1-3,27,1,4,11
IDH1-3,12,62,4,11
IDH1-3,1,119,4,11
RUNX1-3,42,1,4,21
RUNX1-3,3,62,4,21
RUNX1-3,1,119,4,21
U2AF1-3,16,1,4,26
U2AF1-3,1,62,4,26
U2AF1-3,0,119,4,26
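
If you want to try this without saving a data.csv file first, the rows above can be read straight from a string. This is only a convenience sketch: the `CSV_TEXT` name is made up for illustration, and the summary at the end is just a quick check that both patients and all four genes were loaded.

```python
import io

import pandas as pd

# Sample rows copied from the data above (CSV_TEXT is a hypothetical name)
CSV_TEXT = """gene,yaxis,xaxis,pt #,gene #
ASXL1-3,34,1,3,1
ASXL1-3,3,98,3,1
IDH1-3,24,1,3,11
IDH1-3,7,98,3,11
RUNX1-3,38,1,3,21
RUNX1-3,2,98,3,21
U2AF1-3,33,1,3,26
U2AF1-3,0,98,3,26
ASXL1-3,39,1,4,1
ASXL1-3,8,62,4,1
ASXL1-3,0,119,4,1
IDH1-3,27,1,4,11
IDH1-3,12,62,4,11
IDH1-3,1,119,4,11
RUNX1-3,42,1,4,21
RUNX1-3,3,62,4,21
RUNX1-3,1,119,4,21
U2AF1-3,16,1,4,26
U2AF1-3,1,62,4,26
U2AF1-3,0,119,4,26
"""

# io.StringIO lets read_csv treat the string like an open file
df = pd.read_csv(io.StringIO(CSV_TEXT))

# Number of distinct genes per patient, as a quick sanity check
genes_per_patient = df.groupby("pt #")["gene"].nunique()
print(genes_per_patient)
```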

This is the subplotting function...with some extra bells and whistles :)

def plotByGroup(df, group, xCol, yCol, title = "", xLabel = "", yLabel = "", lineColors = ["red", "orange", "yellow", "green", "blue", "purple"], lineWidth = 2, lineOpacity = 0.7, plotStyle = 'ggplot', showLegend = False):
    """
    Plot multiple lines from a Pandas Data Frame for each group using DataFrame.groupby() and MatPlotLib PyPlot.
    @params
        df          - Required  - Data Frame    - Pandas Data Frame
        group       - Required  - String        - Column name to group on           
        xCol        - Required  - String        - Column name for X axis data
        yCol        - Required  - String        - Column name for y axis data
        title       - Optional  - String        - Plot Title
        xLabel      - Optional  - String        - X axis label
        yLabel      - Optional  - String        - Y axis label
        lineColors  - Optional  - List          - Colors to plot multiple lines
        lineWidth   - Optional  - Integer       - Width of lines to plot
        lineOpacity - Optional  - Float         - Alpha of lines to plot
        plotStyle   - Optional  - String        - MatPlotLib plot style
        showLegend  - Optional  - Boolean       - Show legend
    @return
        MatPlotLib Plot Object

    """
    # Import MatPlotLib Plotting Function & Set Style
    from matplotlib import pyplot as plt
    plt.style.use(plotStyle)                # Use pyplot's style module (only pyplot is imported here)
    figure = plt.figure()                   # Initialize Figure
    axis = figure.add_subplot(1,1,1)        # Create one set of axes shared by every line
    grouped = df.groupby(group)             # Set Group
    i = 0                                   # Set iteration to determine line color indexing
    for idx, grp in grouped:
        colorIndex = i % len(lineColors)    # Define line color index
        lineLabel = grp[group].values[0]    # Get a group label from first position
        xValues = grp[xCol]                 # Get x vector
        yValues = grp[yCol]                 # Get y vector
        # Plot this group's line on the shared axes
        axis.plot(xValues, yValues, label = lineLabel, color = lineColors[colorIndex], lw = lineWidth, alpha = lineOpacity)
        i += 1
    # Plot legend once, after every line has been drawn
    if showLegend:
        axis.legend()
    # Set title & Labels
    axis.set_title(title)
    axis.set_xlabel(xLabel)
    axis.set_ylabel(yLabel)
    # Return plot for saving, showing, etc.
    return plt
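
The part of the function that matters for your question is the loop: each group gets its own plot call, so matplotlib never joins points that belong to different genes. Here is a stripped-down sketch of just that step, using the three patient 1 rows from the question. The column names `x`, `y` and `gene` come from that output; the `Agg` backend, the sort by `x`, and the markers are my own assumptions (the markers are there so that a gene with a single measurement still shows up).

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# The three patient 1 rows from the question
patient1 = pd.DataFrame({
    "x": [1712, 1835, 1835],
    "y": [40.0, 26.0, 7.0],
    "gene": ["ASXL1", "ASXL1", "RUNX1"],
})

fig, ax = plt.subplots()
# One ax.plot call per gene: points are only connected within a gene
for gene, rows in patient1.groupby("gene"):
    rows = rows.sort_values("x")  # keep each line running left to right
    ax.plot(rows["x"], rows["y"], marker="o", label=gene)
ax.set_ylim(0, 100)
ax.legend()
```

In the question's loop, `xs` and `ys` were built from every row of the patient's dataframe on each pass, which is why the RUNX1 point ended up on the ASXL1 line.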

And to use it...

import pandas

# Load the Data into Pandas
df = pandas.read_csv('data.csv')    

#
# Plotting - by Patient
#

# Create Patient Grouping
patientGroup = df.groupby('pt #')

# Iterate Over Groups
for idx, patientDF in patientGroup:
    # Let's give them specific titles
    plotTitle = "Gene Frequency over Time by Gene (Patient %s)" % str(patientDF['pt #'].values[0])
    # Call the subplot function
    plot = plotByGroup(patientDF, 'gene', 'xaxis', 'yaxis', title = plotTitle, xLabel = "Days", yLabel = "Gene Frequency")
    # Add Vertical Lines at Assay Timepoints
    timepoints = set(patientDF.xaxis.values)
    for timepoint in timepoints:
        plot.axvline(x = timepoint, linewidth = 1, linestyle = "dashed", color='gray', alpha = 0.4)
    # Let's see it
    plot.show()

[image: example by-patient plot]
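
As a possible alternative for the by-patient plot (not what the function above does), pandas can reshape one patient's rows so that each gene becomes a column, and then draw one line per column. A rough sketch using the patient 4 rows from the sample data; the `wide` name and the `Agg` backend are assumptions for illustration.

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: headless backend so this runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Patient 4 rows from the sample data
rows = pd.read_csv(io.StringIO("""gene,yaxis,xaxis,pt #,gene #
ASXL1-3,39,1,4,1
ASXL1-3,8,62,4,1
ASXL1-3,0,119,4,1
IDH1-3,27,1,4,11
IDH1-3,12,62,4,11
IDH1-3,1,119,4,11
RUNX1-3,42,1,4,21
RUNX1-3,3,62,4,21
RUNX1-3,1,119,4,21
U2AF1-3,16,1,4,26
U2AF1-3,1,62,4,26
U2AF1-3,0,119,4,26
"""))

# Reshape to one row per timepoint and one column per gene
wide = rows.pivot(index="xaxis", columns="gene", values="yaxis")

# DataFrame.plot draws one line per column and returns the matplotlib Axes
ax = wide.plot(marker="o")
ax.set_xlabel("Days")
ax.set_ylabel("Gene Frequency")
```

This only works cleanly when a patient has at most one value per gene per timepoint; `pivot` raises an error on duplicates, and genes measured at different timepoints leave gaps (NaN) in the lines.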

And of course, we can do the same by gene.

#
# Plotting - by Gene
#

# Create Gene Grouping
geneGroup   = df.groupby('gene')

# Generate Plots for Groups
for idx, geneDF in geneGroup:
    plotTitle = "%s Gene Frequency over Time by Patient" % str(geneDF['gene'].values[0])
    plot = plotByGroup(geneDF, 'pt #', 'xaxis', 'yaxis', title = plotTitle, xLabel = "Days", yLabel = "Frequency")
    plot.show()

[image: example by-gene plot]
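
With around 80 patients and 40 genes, you may prefer to write each plot to a file instead of showing them one at a time. Below is a minimal self-contained sketch of the by-gene loop that saves one PNG per gene. The temporary output folder, the file naming, and the `saved` list are all assumptions for illustration, and the data is a two-gene subset of the sample above.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, figures go to files only
import matplotlib.pyplot as plt
import pandas as pd

# Two genes for patients 3 and 4, taken from the sample data
df = pd.DataFrame({
    "gene":  ["ASXL1-3"] * 5 + ["IDH1-3"] * 5,
    "yaxis": [34, 3, 39, 8, 0, 24, 7, 27, 12, 1],
    "xaxis": [1, 98, 1, 62, 119, 1, 98, 1, 62, 119],
    "pt #":  [3, 3, 4, 4, 4, 3, 3, 4, 4, 4],
})

out_dir = tempfile.mkdtemp()  # hypothetical output folder
saved = []
for gene, gene_df in df.groupby("gene"):
    fig, ax = plt.subplots()
    # One line per patient within this gene
    for patient, rows in gene_df.groupby("pt #"):
        ax.plot(rows["xaxis"], rows["yaxis"], marker="o", label="Patient %s" % patient)
    ax.set_title("%s Gene Frequency over Time by Patient" % gene)
    ax.legend()
    path = os.path.join(out_dir, "%s.png" % gene)
    fig.savefig(path)
    plt.close(fig)  # release the figure so many plots do not pile up in memory
    saved.append(path)
```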

If this isn't what you're looking for, provide a clarification and I'll take another crack at it.


2 Comments

You're a rockstar, I can't wait to get into work tomorrow to try this out. Thank you!
Thanks! If it works out please accept the answer : )
