So I've solved my initial problem using a suggestion from G. Grothendieck, thanks again, exactly the clean way of doing that I was after. Initial post is here. Now reality is that my file is just a little more subtle .
It actually looks like this (see also the data section at end of post for reproducible format):
A1
100
200
txt
A2
STRING
300
400
txt txt
txt
txt txt txt
A3
STRING
STRING
150
250
A2
.
.
.
Preliminary data wrangling looks like this :
type <- cumsum(raw_data[[1]] %in% c("A1","A2","A3"))
v <- tapply(raw_data[[1]], type, c, simplify = FALSE)
m <- t(do.call(cbind, lapply(v, ts)))
raw_data<- as.data.frame(m, stringsAsFactors = FALSE)
raw_data[] <- lapply(raw_data, type.convert, as.is = TRUE)
raw_data$Occurences <- 0
giving :
V1 V2 V3 V4 V5 V6 V7
1 A1 100 200 txt <NA> <NA> <NA>
2 A2 String 300 400 txt txt txt txt txt txt
3 A3 String String 150 250 <NA> <NA>
4 A2 <NA> <NA> <NA> <NA> <NA> <NA>
The issue is that df[3,4] should be in df[3,2], and I should state "2" in a new column. Same goes on line 2, where df[2,3] should be in df[2,2] and state "1" in that same addiotional column. In other words, I'm chasing this :
V1 V2 V3 V4 V5 V6 Occurences
1 A1 100 200 txt <NA> <NA> 0
2 A2 300 400 txt txt txt txt txt txt 1
3 A3 150 250 <NA> <NA> <NA> 2
4 A2 NA NA <NA> <NA> <NA> 0
The STRING is right after A something, sometimes it does not occur and sometimes just one time or several occurences. This is what I did to handle the problem :
#Count "STRING" occurences and readjust values in expected columns
formatString <- function(df) {
z <<- which(df[,2] %in% "STRING")
if (length(z) > 0){
for (i in z){
df$Occurences = df$Occurences + 1
for (j in 2:ncol(df)-1){
if (is.na(df[i,j]) | is.na(df[i,j+1])){
df[i,j] = NA
} else {
df[i,j] = df[i,j+1]
}
}
}
}
z <<- which(df[,2] %in% "STRING")
if(length(z) > 0){formatString(df)}
}
This function is supposed to process exclusively the lines where STRING is found in column 2, it would increment the last column (Occurences) and then shift all the values one column to the left so that they all go back to where they are expected to be. The IS.NA thing is just to try and stop the loop when we're starting to see NAs. Once we've processed these lines, we look again if STRING is in the column 2, and if so call again the function.
Now my problem is that the function looks like it is processing (takes like 20 sec on almost 19k observations and 261 columns, not really sure it's the best in terms of processing time), expect my dataframe was not updated at the end of the loop. However z got updated so it seems to be working the way it should.
What am I missing ?
The data
Data in reproductible form :
DF <- structure(list(V1 = c("A1", "100", "200", "txt ", "A2", "String",
"300","400", "txt txt", "txt", "txt txt txt", "A3",
"String", "String", "150", "250", "A2")), .Names = "V1",
row.names = c(NA, -14L), class = "data.frame")
raw_data[] <- lapply(raw_data, type.convert, as.is = TRUE)statement will not have the intended effect as long as the integer values are not aligned in columnsV2andV3: All columns will still be of typecharacter.formatStringinitially? When I doformatString(raw_data)I do get the error message: Error in$<-.data.frame(*tmp*, "Occurences", value = numeric(0)) : replacement has 0 rows, data has 4