1

So I've solved my initial problem using a suggestion from G. Grothendieck, thanks again, exactly the clean way of doing that I was after. Initial post is here. Now reality is that my file is just a little more subtle .

It actually looks like this (see also the data section at end of post for reproducible format):

A1
100
200
txt 
A2
STRING
300
400
txt txt
txt
txt txt txt
A3
STRING
STRING
150
250
A2
.
.
.

Preliminary data wrangling looks like this :

type <- cumsum(raw_data[[1]] %in% c("A1","A2","A3"))
v <- tapply(raw_data[[1]], type, c, simplify = FALSE)
m <- t(do.call(cbind, lapply(v, ts)))

raw_data<- as.data.frame(m, stringsAsFactors = FALSE)
raw_data[] <- lapply(raw_data, type.convert, as.is = TRUE)

raw_data$Occurences <- 0

giving :

  V1     V2     V3   V4      V5   V6          V7
1 A1    100    200 txt     <NA> <NA>        <NA>
2 A2 String    300  400 txt txt  txt txt txt txt
3 A3 String String  150     250 <NA>        <NA>
4 A2   <NA>   <NA> <NA>    <NA> <NA>        <NA>

The issue is that df[3,4] should be in df[3,2], and I should state "2" in a new column. Same goes on line 2, where df[2,3] should be in df[2,2] and state "1" in that same addiotional column. In other words, I'm chasing this :

  V1  V2  V3      V4   V5          V6 Occurences
1 A1 100 200    txt  <NA>        <NA>          0
2 A2 300 400 txt txt  txt txt txt txt          1
3 A3 150 250    <NA> <NA>        <NA>          2
4 A2  NA  NA    <NA> <NA>        <NA>          0

The STRING is right after A something, sometimes it does not occur and sometimes just one time or several occurences. This is what I did to handle the problem :

#Count "STRING" occurences and readjust values in expected columns
formatString <- function(df) {
  z <<- which(df[,2] %in% "STRING")
  if (length(z) > 0){
    for (i in z){
      df$Occurences = df$Occurences + 1
      for (j in 2:ncol(df)-1){
            if (is.na(df[i,j]) | is.na(df[i,j+1])){
          df[i,j] = NA
        } else {
          df[i,j] = df[i,j+1]
        }
      }
    }
  }
  z <<- which(df[,2] %in% "STRING")
  if(length(z) > 0){formatString(df)}
}

This function is supposed to process exclusively the lines where STRING is found in column 2, it would increment the last column (Occurences) and then shift all the values one column to the left so that they all go back to where they are expected to be. The IS.NA thing is just to try and stop the loop when we're starting to see NAs. Once we've processed these lines, we look again if STRING is in the column 2, and if so call again the function.

Now my problem is that the function looks like it is processing (takes like 20 sec on almost 19k observations and 261 columns, not really sure it's the best in terms of processing time), expect my dataframe was not updated at the end of the loop. However z got updated so it seems to be working the way it should.

What am I missing ?

The data

Data in reproductible form :

DF <- structure(list(V1 = c("A1", "100", "200", "txt ", "A2", "String",
"300","400", "txt txt", "txt", "txt txt txt", "A3",
"String", "String", "150", "250", "A2")), .Names = "V1",
row.names = c(NA, -14L), class = "data.frame")
8
  • Please show your data in reproducible form. See my original post where I did this in a note. Commented Nov 12, 2016 at 23:49
  • Is it correct that you want to remove (in the original one-column long format) all rows directly following a group header which aren't integer numbers? Commented Nov 13, 2016 at 2:44
  • Hi there. I have edited my post so that there is a reproductible form of my data input. It is correct that I want to remove all rows (original one column format) following a group header IF the following rows are "STRING". If it is not "STRING" then it should be an integer. Real data is a bit different, but not relevant. So assuming the following rows are only integer or STRING, then anything that is not an integer should indeed be removed (and counted). Hope it's clearer now... Thanks again for your help. Commented Nov 13, 2016 at 9:16
  • @homer3018 The raw_data[] <- lapply(raw_data, type.convert, as.is = TRUE) statement will not have the intended effect as long as the integer values are not aligned in columns V2and V3: All columns will still be of type character. Commented Nov 13, 2016 at 10:05
  • How do you call formatString initially? When I do formatString(raw_data) I do get the error message: Error in $<-.data.frame(*tmp*, "Occurences", value = numeric(0)) : replacement has 0 rows, data has 4 Commented Nov 13, 2016 at 10:48

2 Answers 2

1

Opposed to the OP's approach, the general idea of this suggestion is to do all data cleaning in long form and to reshape the data from to long to wide form as last step.

It appears that the main objective is to have the integer values aligned in columns V2 and V3 of the final wide format table while keeping track of the number of removed STRING rows between the group header and the appearence of the first integer row within each group.

Therefore, the data.table (development version 1.9.7) approach below is looking for the first row containing an integer value within each group rather than to remove any rows containing explicitely the character string STRING. Thus, the approach is more flexible.

In addition, it is assumed that the same group header may appear multiple times.

library(data.table)

# read data (to make it a reproducible example)
dt <- fread("A1
            100
            200
            txt 
            A2
            STRING
            300
            400
            txt txt
            txt
            txt txt txt
            A3
            STRING
            STRING
            150
            250
            A2
            ", header = FALSE, sep = "\n")

# Identify group headers by regular expression and push them down
dt <- dt[V1 %like% "^A[1-3]$", grp := V1][, grp := zoo::na.locf(grp)]
# Count groups in case of multiple appearances of the same group headers
dt[V1 == grp, grp_cnt := .I][, grp_cnt := zoo::na.locf(grp_cnt)]

# Remove "STRING" rows
# Add row count within each individual group
dt[, id := seq_len(.N), by = grp_cnt]

# find first occurrence of an integer in each group by regex
first_int <- dt[V1 %like% "^\\d+$", .(min_id = min(id)), by = grp_cnt]

# non-equi join to start each group with a row containing the first integer
# (requires data.table development version 1.9.7)
dt <- dt[first_int, on = c("grp_cnt", "id>=min_id")]

# compute Occurences as the number of dropped "STRING" rows
dt[, Occurences := id - 2L]

print(dt)
#    grp_cnt          V1 grp id
# 1:       1         100  A1  2
# 2:       1         200  A1  3
# 3:       1         txt  A1  4
# 4:       2         300  A2  3
# 5:       2         400  A2  4
# 6:       2     txt txt  A2  5
# 7:       2         txt  A2  6
# 8:       2 txt txt txt  A2  7
# 9:       3         150  A3  4
#10:       3         250  A3  5

# prepare for dcast: add column names for each group
# (one added to have the column names in line with Q)
dt[, col := paste0("V", seq_len(.N) + 1), by = grp]

# reshape from long to wide form
z <- dcast(dt, grp_cnt + grp + Occurences ~ col, value.var = "V1")[, grp_cnt := NULL]

# do type conversion on the new columns
new_cols <- dt[, unique(col)]
z[, (new_cols) := lapply(.SD, type.convert, as.is = TRUE), .SDcols = new_cols]

print(z)
#   grp Occurences  V2  V3      V4  V5          V6
#1:  A1          0 100 200     txt  NA          NA
#2:  A2          1 300 400 txt txt txt txt txt txt
#3:  A3          2 150 250      NA  NA          NA

str(z)
#Classes ‘data.table’ and 'data.frame': 3 obs. of  7 variables:
# $ grp       : chr  "A1" "A2" "A3"
# $ Occurences: int  0 1 2
# $ V2        : int  100 300 150
# $ V3        : int  200 400 250
# $ V4        : chr  "txt" "txt txt" NA
# $ V5        : chr  NA "txt" NA
# $ V6        : chr  NA "txt txt txt" NA
# - attr(*, ".internal.selfref")=<externalptr> 

Note, that the second appearance of A2 has been dropped because there aren't any rows below the second A2 (in the original one-column file) which contain an integer value.

If the production data contain other group headers than A1, A2, and A3 the regular expression to identify group headers has to be amended accordingly.

Column names are in line with the OP's expected result (except V1 is called grp for clarity). Column order is slightly different which shouldn't be relevant.

Sign up to request clarification or add additional context in comments.

4 Comments

Hi Uwe. Thanks for this answer. It is an interesting approach indeed, but it does not answer my question (why my function is not processing my DF or at least why DF is not updated accordingly). I did not test this yet (since i don't have 1.9.7 just yet). Thank you !
@homer3018 I see :-( I thought performance was an issue. Would be interesting to see how long the data.table approach would need on your 19k x 261 problem.
@homer3018 Installation instructions for data.table 1.9.7 can be found here (and also to revert back to CRAN version.) Looks as if it won't take too long until 1.9.7 will appear officially on CRAN.
I'll give it a try when I can, and compare it with my reccursive function when I get a chance, and I'll let you know :) Thanks for the clarification !
0

So, because my example was not as accurate as I thought, the data.table approach does not suit my need. Actually the STRING is not necessarily between integer.

I came up with this function :

#Count "STRING" occurences and readjust values in expected columns
formatSearching <- function(df) {
  z <<- which(df[,2] %in% "STRING")
  if (length(z) > 0){
    for (i in z){
      df[i,"String_occurences"] = df[i,"String_occurences"] + 1
      for (j in 2:(ncol(raw_data)-1)){
        if (is.na(df[i,j]) | is.na(df[i,j+1])){
          df[i,j] = NA
        } else {
          df[i,j] = df[i,j+1]
        }
      }
    }
  }
  z <<- which(df[,2] %in% "STRING")
  #if(length(z) > 0){formatSearching(df)} This somehow does work, but does not update df...
  return(df)
}

Because of the comment at the end, I'm calling it like this :

raw_data <- formatSearching(raw_data)
while(length(z) > 0){raw_data <- formatSearching(raw_data)}

There are several issues with all this. First my intent was not to put a while in the middle of the processing of my data, but have a fully functional recursive function. I am probably missing an assignment somewhere so that my RAW_DATA dataframe gets updated.

Secondly, this process is taking time, especially the while step. It is possible that on some lines, I'll have up to 10 occurrences, but on others just 1. I am convinced that we could do better and by better i mean faster, more efficient.

Now this is doing the job the way I want to, I'm just trying to gain some processing speed.

Thank you all.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.