I have the following directory structure:

Folder named A contains txt files named 1, 2, 3, .., 5
Folder named B contains txt files named 1, 2, 3, .., 5
|
--A (Folder)
  |---1.txt
  |---2.txt
  ....
  |---5.txt

--B (Folder)
  |---1.txt
  |---2.txt
  ....
  |---5.txt

I am reading these text files into data frames with two nested for loops. A single data frame looks like this:

df <- data.frame(Comp.1 = c(0.3, -0.2, -1, NA, 1),
         Comp.2 = c(-0.4, -0.1, NA, 0, 0.6),
         Comp.3 = c(0.2, NA, -0.4, 0.3, NA))
row.names(df) <- c("Param1", "Param2", "Param3", "Param4", "Param5")

Values always lie between -1 and +1. The number of rows (parameters) and the number of columns (components) differ between data frames. For example, the data frame above is 3x5 (3 components x 5 parameters); others can be 5x15, 4x10, 5x40, etc.

I want a plot that has:

1. parameters on x-axis
2. components on y-axis
3. values as points in the above graph 
4. shape of point representing folder name (A = square, B = triangle, C = circle, .., E)
5. color inside the point shape representing file name (1, 2, 3, .., 5)
6. color intensity describing value (e.g. light red [almost white] for values close to -1, like -0.98; dark red for values close to 1, like 0.98)

I have this code:

alphabets = c("A", "B", "C", "D", "E", "F")
numbers = c(1, 2, 3, 4, 5)

pca.plot <- ggplot(data = NULL, aes(xlab="Principal Components",ylab="Parameters"))

for (alphabet in alphabets){
   for(number in numbers){

   filename=paste("/filepath/",alphabet,"/",number,".txt", sep="")

   df <- read.table(filename)

   #Making all row dimensions = 62. Adding rows with NAs
   if(length(row.names.data.frame(df))<62){
      row_length = length(row.names.data.frame(df))
      for(i in row_length:61){
          new_row = c(NA, NA, NA, NA, NA, NA)
          df<-rbind(df, new_row)  
      }
   }

   df$row.names<-rownames(df)
   long.df<-melt(df,id=c("row.names"), na.rm = TRUE)
   pca.plot<-pca.plot+geom_point(data=long.df,aes(x=variable,y=row.names, shape = number, color=alphabet, size = value))
   }
}

The output of this code is: [image: plot produced by the code above]

EDIT: After following @Gregor's steps mentioned in the comments, I have a big_data data frame like this:

head(big_data, 3)

  Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 params alphabet number
1     NA     NA     NA     NA     NA param1        A      1
2     NA     NA     NA   0.89     NA param2        A      1
3     NA  -0.95     NA     NA     NA param3        A      1

5 Comments

  • Combine your data into one data frame - one tidy data frame - and this will be trivial. I would recommend reading your data into a list of data frames and then combining them all at once. Commented Feb 14, 2017 at 22:48
  • I have list of data frames ready. Filled with NAs wherever rows/columns weren't there. How should I plot now? How do we access attribute names of data frame list? Commented Feb 14, 2017 at 23:45
  • Please notice the first sentence of my comment: Combine your data into one data frame. If you need help with this, see the section called Combining a list of data frames into a single data frame in the answer I linked above. Make sure that the attributes you want to plot, including the file name and folder name, are columns in your data frame. If the file names are the names of your list, then, as stated in the link, dplyr::bind_rows or data.table::rbindlist will automatically add them as columns. Commented Feb 15, 2017 at 0:16
  • Great. Can you show it in your question? If you post dput(droplevels(head(your_data, 10))) we will get a copy/pasteable version of the first 10 rows of your data. Commented Feb 15, 2017 at 21:57
  • When I try to melt this big_data frame with big_data.long <- melt(big_data,id=c("params"), na.rm = TRUE) and then plot using final.plot<-ggplot(data=big_data.long, aes(xlab = "COMPONENTS", ylab = "PARAMETERS"))+geom_point( aes(x=(variable),y=(params))) I don't get what I want. Tried a lot! Commented Feb 15, 2017 at 22:18

1 Answer

You need to melt the data frame to collapse all the Comp columns. The other columns should stay the same:

long_data = reshape2::melt(
    big_data,
    id.vars = c("params", "alphabet", "number"),
    variable.name = "comp",
    value.name = "value",
    na.rm = TRUE
)
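To see what this reshaping does, here is a minimal sketch on a toy stand-in for big_data. The values are copied from the head() in the question, but only three of the Comp columns are included, which is an assumption made for brevity.

```r
library(reshape2)

# Toy stand-in for big_data (hypothetical subset of the columns in the question).
big_data <- data.frame(
  Comp.1   = c(NA_real_, NA_real_, NA_real_),
  Comp.2   = c(NA, NA, -0.95),
  Comp.4   = c(NA, 0.89, NA),
  params   = c("param1", "param2", "param3"),
  alphabet = "A",
  number   = 1
)

# The id columns are repeated for every value; the Comp.* columns are
# stacked into a "comp"/"value" pair, and rows with NA values are dropped.
long_data <- melt(
  big_data,
  id.vars = c("params", "alphabet", "number"),
  variable.name = "comp",
  value.name = "value",
  na.rm = TRUE
)

print(long_data)
```

Only the two non-missing values survive, each on its own row together with its params, alphabet, and number.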

Now, most of your requirements are easy:

  1. parameters on x-axis
  2. components on y-axis
  3. values as points in the above graph
  4. shape of point representing folder name (A = square, B = triangle, C = circle, .., E)
  5. color inside the point shape representing file name (1, 2, 3, .., 5)
  6. color intensity describing value (e.g. light red [almost white] for values close to -1, like -0.98; dark red for values close to 1, like 0.98)

ggplot(long_data, aes(
    x = params, y = comp, size = value,
    shape = alphabet, color = factor(number), alpha = value
)) +
    geom_point()

The tricky part is the requirement for both color intensity and overall color. The only way I know to approximate this using standard ggplot is to use transparency, as I did above. This is the approach taken in, e.g., this question.
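As a variation on the transparency idea, the sketch below uses the fillable point shapes (21 to 25) so that the file number drives the fill color while alpha carries the value. This is an illustration on made-up data, not the plot from the question: the toy long_data, the jitter amount, and the alpha range are all assumptions.

```r
library(ggplot2)

# Toy long-format data (hypothetical values), with the same columns as long_data.
long_data <- data.frame(
  params   = rep(c("param1", "param2", "param3"), times = 4),
  comp     = rep(c("Comp.1", "Comp.2"), each = 6),
  alphabet = rep(rep(c("A", "B"), each = 3), times = 2),
  number   = rep(c(1, 2), times = 6),
  value    = seq(-0.98, 0.98, length.out = 12)
)

p <- ggplot(long_data, aes(
  x = params, y = comp,
  shape = alphabet, fill = factor(number), alpha = value
)) +
  # Jitter so points sharing a (params, comp) cell do not sit on top of each other.
  geom_point(size = 4, colour = "grey30",
             position = position_jitter(width = 0.15, height = 0.15, seed = 1)) +
  # Shapes 22 (square) and 24 (triangle) accept a fill; 21 would be a circle.
  scale_shape_manual(values = c(A = 22, B = 24)) +
  # Fix the alpha scale to the known value range: -1 is faint, +1 is opaque.
  scale_alpha_continuous(limits = c(-1, 1), range = c(0.1, 1)) +
  # Draw the fill legend with a fillable shape so the colors are visible.
  guides(fill = guide_legend(override.aes = list(shape = 21))) +
  labs(x = "Parameters", y = "Components", fill = "File", shape = "Folder")

print(p)
```

Pinning limits to c(-1, 1) keeps the intensity comparable across data sets, since otherwise the alpha scale would stretch to whatever range the current data happens to cover.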


Note this is untested as your data isn't shared reproducibly. Share data with dput as suggested in the comments if there are issues that need testing.
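
For completeness, the combining step suggested in the comments might look roughly like the sketch below. It writes a few toy files to a temporary directory (a hypothetical stand-in for "/filepath/") so that it runs on its own; the folder names, file counts, and toy values are assumptions.

```r
library(dplyr)

# Hypothetical stand-in for "/filepath/": a temp directory with toy files.
base_dir <- tempfile("pca_")
folders <- c("A", "B")
files <- 1:2
set.seed(1)
for (f in folders) {
  dir.create(file.path(base_dir, f), recursive = TRUE)
  for (n in files) {
    k <- n + 1  # file 1 has 2 components, file 2 has 3
    toy <- as.data.frame(matrix(
      runif(3 * k, -1, 1), nrow = 3,
      dimnames = list(paste0("Param", 1:3), paste0("Comp.", seq_len(k)))
    ))
    write.table(toy, file.path(base_dir, f, paste0(n, ".txt")))
  }
}

# One row per (folder, file) combination.
combos <- expand.grid(alphabet = folders, number = files,
                      stringsAsFactors = FALSE)

# Read each file and store row names, folder, and file number as columns.
df_list <- lapply(seq_len(nrow(combos)), function(i) {
  df <- read.table(file.path(base_dir, combos$alphabet[i],
                             paste0(combos$number[i], ".txt")))
  df$params <- rownames(df)
  df$alphabet <- combos$alphabet[i]
  df$number <- combos$number[i]
  df
})

# bind_rows() fills columns that a file lacks with NA,
# so no manual padding of rows or columns is needed.
big_data <- bind_rows(df_list)
str(big_data)
```

Because the metadata is attached before binding, the resulting big_data already has the params, alphabet, and number columns that the melt call expects.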

3 Comments

Worked for me. Thanks Gregor. I can tweak the fancy part. Liked your way of leading me to solution. :)
Thanks! Glad it worked out - and glad you appreciated the approach. Not everyone loves it but I'm convinced you learn more from it :D
Next time you share data though, do it with dput. Makes it so much easier to reproduce for the people trying to help you.
