The platform from which I'm importing data to R does not support specifying the data type, hence all my columns are character. I have an Excel file that specifies which columns are factor, including the relevant labels and levels. Now, I'm trying to write a function to dynamically change the data type of various columns of my data.frame
Thanks to the excellent answer to this question (dplyr - mutate: use dynamic variable names), I managed to write the following function, in which I dynamically set the column name to the mutate function.
readFactorData <- function(filepath) {
t <- read.xlsx(filepath)
sapply(nrow(t), function(i) {
colname <- as.character(t[i, "Item"])
factorLevels <- t[i, 3:ncol(t)][which(!is.na(t[i, 3:ncol(t)]))]
totalLevels <- length(factorLevels)
listOfLabels <- as.character(unlist(factorLevels))
mutate(d, !!colname := factor(d[[colname]], labels=(1:totalLevels), levels=listOfLabels))
# requires dplyr v.0.7+
# the syntax `!!variablename:=` forces evaluation of the variablename before evaluating the rest of the function
})
}
It works, and each iteration returns the entire data frame, with the relevant column (colname) changed to factor. But, each iteration overwrites the previous, so this function only returns the last result of i. How do I make sure that I end up with 1 single data frame, in which all the relevant columns are saved?
Sample data (make sure to comment out the first line of the function above, since we're defining t here):
d <- data.frame("id" = sample(100:999, 10), "age" = sample(18:80, 10), "factor1" = c(rep("a", 3), rep("b", 3), rep("c", 4)), "factor2" = c("x","y","y","y","y","x","x","x","x","y"), stringsAsFactors = FALSE)
t <- data.frame("Item" = c("factor1","factor2"), "Label" = c("This is factor 1", "This is factor 2"), "level1" = c("a","x"), "level2" = c("b","y"), "level3" = c("c","NA"))