
I am struggling to loop through the rows of a column in a dataframe and use the current row to define the arguments passed to a function. Here is the sample dataframe:

df <- 
structure(list(child = c("A268", "A268497", "A268497BOX", "A268497BOX2", 
"A268497BOX218", "A277", "A277A79", "A277A79091", "A277A790911", 
"A277A79091144", "A492", "A492586", "A492586BOX", "A492586BOX1", 
"A492586BOX144", "A492A69", "A492A69027", "A492A690271", "A492A69027144", 
"A492A6902715K", "A492A6902719Y", "A492A690271BH", "A492A690271BI", 
"A492A690271CQ", "A492A690271CS", "A492A690271CT", "A492A690271CU", 
"A492A690271CV", "A492A690271CW", "A492A690271CX", "A492A690271CY", 
"A492A690271DA", "A492A69028", "A492A690281", "A492A69028144", 
"A492A69402", "A492A694021", "A492A69402144", "A492A70", "A492A70033", 
"A492A700331", "A492A70033144", "A492A700332", "A492A70033244", 
"A492A70034", "A492A700341", "A492A70034144", "A492A70035", "A492A700351", 
"A492A70035144"), clvl = c(2, 3, 4, 5, 6, 2, 3, 4, 5, 6, 2, 3, 
4, 5, 6, 3, 4, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 4, 
5, 6, 4, 5, 6, 3, 4, 5, 6, 5, 6, 4, 5, 6, 4, 5, 6), parent = c("A", 
"A268", "A268497", "A268497BOX", "A268497BOX2", "A", "A277", 
"A277A79", "A277A79091", "A277A790911", "A", "A492", "A492586", 
"A492586BOX", "A492586BOX1", "A492", "A492A69", "A492A69027", 
"A492A690271", "A492A690271", "A492A690271", "A492A690271", "A492A690271", 
"A492A690271", "A492A690271", "A492A690271", "A492A690271", "A492A690271", 
"A492A690271", "A492A690271", "A492A690271", "A492A690271", "A492A69", 
"A492A69028", "A492A690281", "A492A69", "A492A69402", "A492A694021", 
"A492", "A492A70", "A492A70033", "A492A700331", "A492A70033", 
"A492A700332", "A492A70", "A492A70034", "A492A700341", "A492A70", 
"A492A70035", "A492A700351"), plvl = c(1, 2, 3, 4, 5, 1, 2, 3, 
4, 5, 1, 2, 3, 4, 5, 2, 3, 4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 
5, 5, 5, 3, 4, 5, 3, 4, 5, 2, 3, 4, 5, 4, 5, 3, 4, 5, 3, 4, 5
)), row.names = c(NA, 50L), class = "data.frame")

[image: sample dataframe]

My goal is to generate this:

[image: expected output]

I tried to do this with a loop and with different versions of the apply functions inside the loop, but I could not get it right. On each iteration, x and y should be the child and pathString values from the current row. Is there a clean and easy way to do this?

df[] <- apply(df,1,function(x,y) sub(x,y,x))
What is the logic for creating the pathString variable? Commented Jan 26, 2020 at 9:13

2 Answers


Assuming the number of characters in child (or pathString) keeps increasing within each group, as in the shared data, one way is to use purrr::accumulate, which feeds each output into the next step, and to apply it by group.

library(dplyr)

df %>%
  group_by(gr = cumsum(c(TRUE, diff(nchar(child)) < 0))) %>%
  mutate(ans = purrr::accumulate(pathString, ~sub(".*(/.*)",paste0(.x, "\\1"),.y))) 

#   child         pathString        gr ans               
#   <chr>         <chr>          <int> <chr>             
# 1 A268          A/268              1 A/268             
# 2 A268497       A268/497           1 A/268/497         
# 3 A268497BOX    A268497/BOX        1 A/268/497/BOX     
# 4 A268497BOX2   A268497BOX/2       1 A/268/497/BOX/2   
# 5 A268497BOX218 A268497BOX2/18     1 A/268/497/BOX/2/18
# 6 A277          A/277              2 A/277             
# 7 A277A79       A277/A79           2 A/277/A79         
# 8 A277A79091    A277A79/091        2 A/277/A79/091     
# 9 A277A790911   A277A79091/1       2 A/277/A79/091/1   
#10 A277A79091144 A277A790911/44     2 A/277/A79/091/1/44

I kept the gr column in the final output to show how the groups are created.
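To make the grouping expression concrete, here is a minimal sketch on a handful of child values taken from the question's data. It only illustrates when the group counter increments; the vector below is a hand-picked subset, not the full dataframe.

```r
# Hand-picked subset of child values from the question's data
child <- c("A268", "A268497", "A268497BOX",
           "A492", "A492586", "A492586BOX",
           "A492A69", "A492A69027")

# A new group starts whenever the string length drops
# compared with the previous row
gr <- cumsum(c(TRUE, diff(nchar(child)) < 0))

data.frame(child, nchar = nchar(child), gr)
```

Note that in this subset "A492A69" opens a new group even though its parent "A492" sits in the previous one, because the length drops from 10 to 7. The grouping therefore depends on the assumption stated above, that lengths only increase within a branch.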


We can implement the same logic in base R as well, using Reduce:

apply_fun <- function(x, y) sub(".*(/.*)", paste0(x, "\\1"), y)

df$ans <- with(df, ave(pathString, cumsum(c(TRUE, diff(nchar(child)) < 0)), 
FUN = function(x) Reduce(apply_fun, x, accumulate = TRUE)))
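As a small illustration of what Reduce does with accumulate = TRUE, the sketch below runs the same helper on the pathString values of a single group, copied from the output shown earlier. The vector name steps is only used for this example.

```r
# Same helper as above: replace everything before the last "/" in y
# with the accumulated path x
apply_fun <- function(x, y) sub(".*(/.*)", paste0(x, "\\1"), y)

# pathString values of one group (illustrative input)
steps <- c("A/268", "A268/497", "A268497/BOX")

# Each result is passed on as x for the next element
res <- Reduce(apply_fun, steps, accumulate = TRUE)
res
# [1] "A/268"         "A/268/497"     "A/268/497/BOX"
```

The first element is returned unchanged, so the result is only a full path if the first row of each group already starts at the root.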

5 Comments

So df must be sorted? The real df has more than 35k rows. I will check your answer tomorrow and get back to you.
@Ibo Yes. This is what I came up with by looking at the expected output. No logic was shared on how to reach the output.
I tried it with a more extensive data sample and it did not generate the right output. I went one step back and edited the data sample so that you have access to both the child and parent values with their levels (not sure it helps). If you apply your answer, you will see that where gr resets at any level other than level 2, the forward slashes are not created properly. In some cases it also adds segments from the rows above, while we are only allowed to add forward slashes to the values. The aim is to create a path so that I can build a data.tree.
This is the original post, which nobody answered. Maybe there was a better way to get to the final answer, but this is as far as I could get: stackoverflow.com/questions/59870536/…
I managed to find a solution, but I am sure there is a smarter way!

I managed to get it done using the following code block, but the loop takes 75-80 seconds. I guess there could be a faster way:

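for (row in seq_len(nrow(df5))) {

  x <- df5[row, 2]      # child
  y <- df5[row, 3]      # pathString
  g <- df5[row, "gr"]   # group of the current row

  # replace the child string with its pathString in every row of the same group
  df5$pathString[df5$gr == g] <- sub(x, y, df5$pathString[df5$gr == g])
  df5$child[df5$gr == g] <- sub(x, y, df5$child[df5$gr == g])

}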

Note that gr was populated based on clvl=2:

library(zoo)
df4$gr <- ifelse(df4$clvl==2,df4$child,NA)
df4$gr <- na.locf(df4$gr)

This is how df4 is made:

library(sqldf)
df4 <- sqldf("select  *, parent || replace(child,parent,'/') AS pathString FROM df ORDER BY child")
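As a possible alternative to the row-by-row sub loop, the sketch below builds the full path directly from the child and parent columns in base R. It assumes that every parent is a prefix of its child, that each parent row appears before its children (which sorting by child gives), and that the root is "A". The names segment, full and fullPath are made up for this example, the data is a small subset of the question's dataframe, and the approach has not been timed on the real 35k-row data.

```r
# Small subset of the question's data (child/parent pairs only)
df <- data.frame(
  child  = c("A268", "A268497", "A268497BOX",
             "A492", "A492A69", "A492A69027", "A492A70"),
  parent = c("A", "A268", "A268497",
             "A", "A492", "A492A69", "A492"),
  stringsAsFactors = FALSE
)

# The part of each child that comes after its parent prefix
segment <- substring(df$child, nchar(df$parent) + 1)

# One-level path, comparable to the pathString built with sqldf
df$pathString <- paste0(df$parent, "/", segment)

# Full path: look up the parent's full path and append the segment.
# `full` is a named character vector keyed by node; "A" is the assumed root.
full <- c(A = "A")
for (i in seq_len(nrow(df))) {
  full[df$child[i]] <- paste0(full[df$parent[i]], "/", segment[i])
}
df$fullPath <- unname(full[df$child])

df
```

Because each row only looks up its own parent, no grouping column is needed here. Unlike the SQL replace, substring removes only the leading prefix, so the two can differ if the parent string occurs more than once inside a child.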
