I have a big .csv file to read in. Unfortunately, some lines are corrupt, meaning that something is wrong in the formatting, e.g. a number written as 0.-02 instead of -0.02. Sometimes even the line break (\n) is missing, so that two lines merge into one.
I want to read the .csv file with read.table and set all colClasses to the types that I expect the file to have (except, of course, for the corrupt lines). This is a minimal example:
colNames <- c("date", "parA", "parB")
colClasses <- c("character", "numeric", "numeric")
inputText <- "2015-01-01;123;-0.01\n
2015-01-02;421;-0.022015-01-03;433;-0.04\n
2015-01-04;321;-0.03\n
2015-01-05;230;-0.05\n
2015-01-06;313;0.-02"
con <- textConnection(inputText, "r")
mydata <- read.table(con, sep = ";", fill = TRUE, colClasses = colClasses)
At the first corrupt line, read.table stops with the error:
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
scan() expected 'a real', got '-0.022015-01-03'
This error message gives me no idea in which line of the input the error occurred. Hence my only option is to copy the string -0.022015-01-03 and search for it in the file. This is really annoying if you have to do it for a lot of lines and have to re-execute read.table each time until it hits the next corrupt line.
So my questions are:
- Is there a way to get read.table to tell me the line where the error occurred (and maybe save it for further processing)?
- Is there a way to get read.table to just skip lines with improper formatting (instead of stopping at an error)?
- Did anyone figure out a way to display these lines for manual correction during the read process? I mean, maybe display the whole corrupt line in plain csv format for manual correction (maybe including the line before and after) and then continue the read-in process including the corrected lines.
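One way to get at the first two points without relying on read.table's error message is to validate the raw lines before parsing them. The following is only a minimal sketch under stated assumptions: the names rawLines, numPattern, linePattern and badLineNumbers are made up for illustration, the sample input is written without blank lines, and the regular expression assumes a YYYY-MM-DD date followed by two plain decimal numbers (no exponents, no leading "."). It would need adjusting for a real file.

```r
colNames   <- c("date", "parA", "parB")
colClasses <- c("character", "numeric", "numeric")

# Sample input from the question; for a real file use readLines("file.csv").
rawLines <- c("2015-01-01;123;-0.01",
              "2015-01-02;421;-0.022015-01-03;433;-0.04",
              "2015-01-04;321;-0.03",
              "2015-01-05;230;-0.05",
              "2015-01-06;313;0.-02")

# Assumed line format: date;number;number (plain decimals only).
numPattern  <- "-?[0-9]+(\\.[0-9]+)?"
linePattern <- paste0("^[0-9]{4}-[0-9]{2}-[0-9]{2};",
                      numPattern, ";", numPattern, "$")

isGood         <- grepl(linePattern, rawLines)
badLineNumbers <- which(!isGood)       # line numbers of the corrupt lines
badLines       <- rawLines[!isGood]    # kept for later inspection or correction

# Parse only the well-formed lines in a single read.table call.
mydata <- read.table(text = rawLines[isGood], sep = ";",
                     col.names = colNames, colClasses = colClasses)
```

For the sample input, badLineNumbers contains 2 and 5, and mydata holds the three remaining rows. Because the check is vectorized over all lines, it avoids calling read.table once per line.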
What I tried so far is to read everything with colClasses="character" to avoid format checking in the first place. Then I do the format checking while converting every column to the right type. Finally I use which() to find all lines where the value could not be converted or the result is NA, and just delete them.
I have a solution, but it is very slow
With ideas I got from some of the comments, the next thing I tried is to read the input line by line with readLines and pass the result to read.table via the text argument. If read.table fails, the line is presented to the user via edit() for correction and resubmission. Here is my code:
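The convert-and-check idea can be sketched roughly as below. This is a variation rather than the exact code I used: it splits the lines with strsplit instead of read.table(colClasses = "character"), so that a merged line with too many fields cannot shift the columns. The names parts, okCount, chr, lineNo and badLines are illustrative, and the sketch assumes that NA never occurs as a legitimate value in the file.

```r
colNames <- c("date", "parA", "parB")

rawLines <- c("2015-01-01;123;-0.01",
              "2015-01-02;421;-0.022015-01-03;433;-0.04",
              "2015-01-04;321;-0.03",
              "2015-01-05;230;-0.05",
              "2015-01-06;313;0.-02")

# Split every line into character fields; no type checking happens here.
parts   <- strsplit(rawLines, ";", fixed = TRUE)
okCount <- lengths(parts) == length(colNames)

# Character data frame of the lines with the expected number of fields,
# remembering the original line number of each row.
chr <- as.data.frame(do.call(rbind, parts[okCount]), stringsAsFactors = FALSE)
names(chr) <- colNames
chr$lineNo <- which(okCount)

# Convert the columns; values that cannot be converted become NA.
chr$parA <- suppressWarnings(as.numeric(chr$parA))
chr$parB <- suppressWarnings(as.numeric(chr$parB))
badValue <- is.na(chr$parA) | is.na(chr$parB)

# Corrupt lines: wrong number of fields or a failed conversion.
badLines <- sort(c(which(!okCount), chr$lineNo[badValue]))
clean    <- chr[!badValue, ]
```

Here badLines holds the original line numbers of the rejected lines, so they can be looked up in rawLines afterwards instead of being deleted silently.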
con <- textConnection(inputText, "r")
mydata <- data.frame()
while (length(text <- readLines(con, n = 1)) > 0) {
  correction <- TRUE
  while (correction) {
    err <- tryCatch(part <- read.table(text = text, sep = ";", fill = TRUE,
                                       col.names = colNames,
                                       colClasses = colClasses),
                    error = function(e) e)
    if (inherits(err, "error")) {
      # try to correct this line
      message(err, "\n")
      text <- edit(text)
    } else {
      correction <- FALSE
    }
  }
  mydata <- rbind(mydata, part)
}
If the user makes the corrections properly, this returns:
> mydata
date parA parB
1 2015-01-01 123 -0.01
2 2015-01-02 421 -0.02
3 2015-01-03 433 -0.04
4 2015-01-04 321 -0.03
5 2015-01-05 230 -0.05
6 2015-01-06 313 -0.02
The input text had 5 lines, since one linefeed was missing. The corrected output has 6 lines and the 0.-02 is corrected to -0.02.
What I would still change in this solution is to present all corrupt lines together for correction after everything has been read in. This way the user can run the script and, after it has finished, make all corrections at once. But for a minimal example this should be enough.
The really bad thing about this solution is that it is really slow, too slow to handle big datasets. Hence I would still like to have another solution using more standard methods or perhaps a special package.
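A batch variant of the loop above might look like the following sketch. It collects all corrupt lines first, hands them to a correction function in one go, and then calls read.table only once, which avoids both the per-line read.table call and the repeated rbind. I have not benchmarked it. The function name readWithBatchFix, its fix argument and the helper fixDemo are hypothetical. The sketch assumes that fix returns a vector of the same length as its input and that a merged line is repaired by inserting "\n" into the string. The line pattern is an assumption about the file format (YYYY-MM-DD date plus two plain decimals).

```r
# Hypothetical helper: validate all lines, correct the corrupt ones in one
# batch, then parse everything with a single read.table call.
readWithBatchFix <- function(rawLines, linePattern, fix = edit, ...) {
  bad <- which(!grepl(linePattern, rawLines))
  while (length(bad) > 0) {
    message("Corrupt lines: ", paste(bad, collapse = ", "))
    # fix() must return one corrected string per corrupt line; a merged
    # line is repaired by putting "\n" where the line break was missing.
    rawLines[bad] <- fix(rawLines[bad])
    rawLines <- unlist(strsplit(rawLines, "\n", fixed = TRUE))
    bad <- which(!grepl(linePattern, rawLines))
  }
  read.table(text = rawLines, ...)
}

rawLines <- c("2015-01-01;123;-0.01",
              "2015-01-02;421;-0.022015-01-03;433;-0.04",
              "2015-01-04;321;-0.03",
              "2015-01-05;230;-0.05",
              "2015-01-06;313;0.-02")

numPattern  <- "-?[0-9]+(\\.[0-9]+)?"
linePattern <- paste0("^[0-9]{4}-[0-9]{2}-[0-9]{2};",
                      numPattern, ";", numPattern, "$")

# Non-interactive stand-in for edit(), only for this sample input.
fixDemo <- function(x) {
  x <- sub("-0.022015", "-0.02\n2015", x, fixed = TRUE)
  sub("0.-02", "-0.02", x, fixed = TRUE)
}

mydata <- readWithBatchFix(rawLines, linePattern, fix = fixDemo,
                           sep = ";",
                           col.names = c("date", "parA", "parB"),
                           colClasses = c("character", "numeric", "numeric"))
```

With the default fix = edit, the corrupt lines are opened together in the editor and the loop repeats until every line matches the pattern. Note that empty lines do not match the pattern either, so they would be presented for correction as well.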
try(read.table(code), SILENT = True). This keeps your program running even when an error occurs
try(read.table(code), silent = TRUE) would it not?
try. I can continue with my script, but read.table will crash anyway and give me no data. If I use it like try(data <- read.table("file.csv"), silent=TRUE), data will stay undefined when read.table throws an error. Hence technically I can continue my script, but in practice I can't, because data is undefined. I just tested this to be sure.
fread from the data.table package
gsub modifications to the character representation of the data, then run it through read.table using the text parameter. You need to tell us how 019.06.2015 should be interpreted. At the moment it looks like a Date.