I have searched Stack Overflow at length for an answer to my question. One existing question comes close to mine, but I have been unable to modify its code to fix my graph.

I have data, reshaped in long format, that looks like this:

ID          Var1      GenePosition   ContinuousOutcomeVar
1           control      X20068492 0.092813611
2           control      X20068492 0.001746708
3           case         X20068492 0.069251157
4           case         X20068492 0.003639304

Each ID has one value for ContinuousOutcomeVar per position, and there are 86 positions and 10 IDs. I want to plot a line graph with position on the x axis and the continuous outcome variable on the y axis. I want two groups: a case group and control group, so there should be two dots for every position: one is the mean value for cases, and one is the mean value for controls. Then I want a line that connects the cases, and a line that connects the controls. I know this is easy, but I'm new to R - I've been working at it for 8 hours and I can't quite get it right. Below is what I have; I'd really appreciate some insight. If this exists somewhere in the stacks, I really apologize...I honestly looked all over and tried modifying a lot of code but still haven't gotten it right.

My code: This code plots all the values for all IDs at each position, and connects them for the two groups. It gives me a black dot at the mean of all 10 values per position (I think):

lineplot <- ggplot(data=seq.long, aes(x=Position, y=PMethyl, 
    group=CACO, colour=CACO)) +
    stat_summary (fun.y=mean, geom="point", aes(group=1), color="black") +      
    geom_line() + geom_point()

I can't get R to not plot all 10 points; just two means (one per case/control group) per position, with cases' & controls' values each connected by a line across the x axis.

2 Answers

First, I adjusted your sample data so that it contains more than one unique GenePosition.

dput(seq.long)
structure(list(ID = 1:8, Var1 = structure(c(2L, 2L, 1L, 1L, 2L, 
2L, 1L, 1L), .Label = c("case", "control"), class = "factor"), 
    GenePosition = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L
    ), .Label = c("X20068492", "X20068493"), class = "factor"), 
    ContinuousOutcomeVar = c(0.092813611, 0.001746708, 0.069251157, 
    0.003639304, 0.112813611, 0.002746708, 0.089251157, 0.004639304
    )), .Names = c("ID", "Var1", "GenePosition", "ContinuousOutcomeVar"
), class = "data.frame", row.names = c(NA, -8L))

If you want to show a single value for each combination of GenePosition and Var1, it is easier to calculate the means before plotting. You can do that with ddply() from the plyr package.

library(plyr)    
seq.long.sum<-ddply(seq.long,.(Var1,GenePosition),
       summarize, value = mean(ContinuousOutcomeVar))
seq.long.sum
     Var1 GenePosition      value
1    case    X20068492 0.03644523
2    case    X20068493 0.04694523
3 control    X20068492 0.04728016
4 control    X20068493 0.05778016

Now, with this new data frame, you only need to supply the x and y values. Map Var1 to both colour= and group= so that each group gets its own colour and the points within each group are connected by a line.

library(ggplot2)
# One point per group mean, joined by one line per Var1 group
ggplot(seq.long.sum, aes(x = GenePosition, y = value, colour = Var1, group = Var1)) +
   geom_point() + geom_line()
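
If you would rather not load plyr, the same per-group means can be computed in base R. This is only a minimal sketch: it rebuilds the sample data from the dput() output above, and the name seq.long.agg is illustrative. Note that aggregate() keeps the original column name ContinuousOutcomeVar instead of creating a column called value.

```r
# Rebuild the sample data from the dput() output above
seq.long <- data.frame(
  ID = 1:8,
  Var1 = factor(rep(c("control", "control", "case", "case"), 2)),
  GenePosition = factor(rep(c("X20068492", "X20068493"), each = 4)),
  ContinuousOutcomeVar = c(0.092813611, 0.001746708, 0.069251157,
                           0.003639304, 0.112813611, 0.002746708,
                           0.089251157, 0.004639304)
)

# One mean per Var1 x GenePosition combination (base R, no packages needed)
seq.long.agg <- aggregate(ContinuousOutcomeVar ~ Var1 + GenePosition,
                          data = seq.long, FUN = mean)
seq.long.agg
```

To plot this version, use y = ContinuousOutcomeVar in the aes() call instead of y = value.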

1 Comment

Thanks for teaching me something new! Also learned that if I use "transform" with ddply instead of summarize it keeps all the other vars in my dataframe. I appreciate your help!
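
The difference between summarize and transform mentioned in this comment can be sketched as follows. This assumes the plyr package is installed and reuses the sample data from the answer; the names with.summarize and with.transform are illustrative only.

```r
library(plyr)

# Sample data, rebuilt from the dput() output in the answer
seq.long <- data.frame(
  ID = 1:8,
  Var1 = factor(rep(c("control", "control", "case", "case"), 2)),
  GenePosition = factor(rep(c("X20068492", "X20068493"), each = 4)),
  ContinuousOutcomeVar = c(0.092813611, 0.001746708, 0.069251157,
                           0.003639304, 0.112813611, 0.002746708,
                           0.089251157, 0.004639304)
)

# summarize: one row per group, holding only the grouping columns
# and the newly computed column
with.summarize <- ddply(seq.long, .(Var1, GenePosition), summarize,
                        value = mean(ContinuousOutcomeVar))

# transform: every original row and column is kept, and the group mean
# is repeated on each row of its group
with.transform <- ddply(seq.long, .(Var1, GenePosition), transform,
                        value = mean(ContinuousOutcomeVar))

dim(with.summarize)  # 4 rows, 3 columns
dim(with.transform)  # 8 rows, 5 columns
```

For the plot itself the summarize version is the more convenient one, since it contains exactly one row per point to draw.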

1. First, create the data as in Didzis Elferts's answer:

data <- structure(list(ID = 1:8, Var1 = structure(c(2L, 2L, 1L, 1L, 2L, 
2L, 1L, 1L), .Label = c("case", "control"), class = "factor"), 
    GenePosition = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L
    ), .Label = c("X20068492", "X20068493"), class = "factor"), 
    ContinuousOutcomeVar = c(0.092813611, 0.001746708, 0.069251157, 
    0.003639304, 0.112813611, 0.002746708, 0.089251157, 0.004639304
    )), .Names = c("ID", "Var1", "GenePosition", "ContinuousOutcomeVar"
), class = "data.frame", row.names = c(NA, -8L))

2. Create the plot with the code below:

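library(ggplot2)
# stat_summary() computes the mean per group and position, once for each layer
ggplot(data, aes(x = GenePosition, y = ContinuousOutcomeVar, color = Var1, group = Var1)) +
    stat_summary(fun = 'mean', geom = 'point') +
    stat_summary(fun = 'mean', geom = 'line')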
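
One way to check that stat_summary() plots the same values as a pre-summarized data frame is to inspect what ggplot2 computed for the layer. This is a rough sketch, assuming a ggplot2 version that accepts the fun argument (3.3.0 or later); the names p, computed and expected are illustrative.

```r
library(ggplot2)

# Sample data, rebuilt from the dput() output above
data <- data.frame(
  ID = 1:8,
  Var1 = factor(rep(c("control", "control", "case", "case"), 2)),
  GenePosition = factor(rep(c("X20068492", "X20068493"), each = 4)),
  ContinuousOutcomeVar = c(0.092813611, 0.001746708, 0.069251157,
                           0.003639304, 0.112813611, 0.002746708,
                           0.089251157, 0.004639304)
)

p <- ggplot(data, aes(x = GenePosition, y = ContinuousOutcomeVar,
                      colour = Var1, group = Var1)) +
  stat_summary(fun = mean, geom = "point") +
  stat_summary(fun = mean, geom = "line")

# Values ggplot2 computed for the first layer (the points)
computed <- layer_data(p, 1)

# Group means computed independently in base R
expected <- aggregate(ContinuousOutcomeVar ~ Var1 + GenePosition,
                      data = data, FUN = mean)

# Compare the two sets of means, ignoring row order
all.equal(sort(computed$y), sort(expected$ContinuousOutcomeVar))
```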

2 Comments

Hi 孟泽楷, it's probably better to create the summary stats separately, as suggested here, because then you don't have to calculate the values twice (once for the points and once for the lines). Regardless of that, your answer is missing the summary function within stat_summary(). I would suggest adjusting it and sharing an output figure as well, thanks!
Hi Samson, I appreciate your kind advice. I think the code I provided is another solution that needs no pre-summary. As for the missing argument you mentioned, leaving it at its default would be fine for this question. However, I have edited the answer to state the mean explicitly. The output figure is exactly the same as the accepted answer's.
