
I have a big .csv file to read in. Unfortunately, some lines are corrupt, meaning that something is wrong with the formatting, e.g. a number reads 0.-02 instead of -0.02. Sometimes even the line break (\n) is missing, so that two lines merge into one.

I want to read the .csv file with read.table and define all colClasses to the format I expect the file to have (except, of course, for the corrupt lines). Here is a minimal example:

colNames <- c("date", "parA", "parB")
colClasses <- c("character", "numeric", "numeric")

inputText <- "2015-01-01;123;-0.01
2015-01-02;421;-0.022015-01-03;433;-0.04
2015-01-04;321;-0.03
2015-01-05;230;-0.05
2015-01-06;313;0.-02"

con <- textConnection(inputText, "r")

mydata <- read.table(con, sep = ";", fill = TRUE,
                     col.names = colNames, colClasses = colClasses)

At the first corrupt line, read.table stops with the error:

Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  : 
  scan() expected 'a real', got '-0.022015-01-03'

From this error message I have no idea in which line of the input the error occurred. Hence my only option is to copy the string -0.022015-01-03 and search for it in the file. But this is really annoying if you have to do it for many lines, and you always have to re-execute read.table until it hits the next corrupt line.

So my question is:

  1. Is there a way to get read.table to tell me the line where the error occurred (and maybe save it for further processing)?
  2. Is there a way to get read.table to just skip lines with improper formatting instead of stopping at an error?
  3. Did anyone figure out a way to display these lines for manual correction during the read process? I mean, display the whole corrupt line in plain csv format for manual correction (maybe including the lines before and after), and then continue the read-in process including the corrected lines.
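A partial answer to question 1: base R's count.fields() reports the number of fields per input line, so lines where a missing line break merged two records (and therefore the field count is wrong) can be located before read.table is ever called. It will not catch purely numeric corruption like 0.-02, which still has the right number of fields. A sketch on the example input:

```r
# A clean copy of the example input (line 2 is two merged records,
# line 5 contains the malformed number 0.-02)
inputText <- "2015-01-01;123;-0.01
2015-01-02;421;-0.022015-01-03;433;-0.04
2015-01-04;321;-0.03
2015-01-05;230;-0.05
2015-01-06;313;0.-02"

# count.fields() returns one field count per input line
counts <- count.fields(textConnection(inputText), sep = ";")

# lines whose field count differs from the expected 3 are suspect
badLines <- which(counts != 3)
```

Here badLines points at line 2, the merged record, which can then be inspected or corrected by hand.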

What I have tried so far is to read everything with colClasses="character" to avoid format checking in the first place, then do the format checking myself while converting every column to the right type, then which() all rows where the conversion failed or produced NA, and delete them.
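That approach can be sketched as follows. Column names and defaults from the minimal example are assumed; with fill = TRUE the merged line spills into extra columns, so the check also flags rows with unexpected extra fields:

```r
# the minimal example input, with one merged line and one malformed number
inputText <- "2015-01-01;123;-0.01
2015-01-02;421;-0.022015-01-03;433;-0.04
2015-01-04;321;-0.03
2015-01-05;230;-0.05
2015-01-06;313;0.-02"

# read everything as character so read.table performs no type checking;
# fill = TRUE pads short lines, and the merged line yields extra columns
raw <- read.table(textConnection(inputText), sep = ";",
                  colClasses = "character", fill = TRUE)

# convert the numeric columns; malformed values become NA with a warning
parA <- suppressWarnings(as.numeric(raw$V2))
parB <- suppressWarnings(as.numeric(raw$V3))

# rows where a conversion failed, or extra fields appeared, are corrupt
ok <- !is.na(parA) & !is.na(parB) & raw$V4 == ""
badRows <- which(!ok)

# keep only the clean rows
mydata <- data.frame(date = raw$V1[ok], parA = parA[ok], parB = parB[ok])
```

This drops the corrupt rows rather than repairing them, but badRows gives the row indices for manual inspection.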


I have a solution, but it is very slow

With ideas from some of the comments, the next thing I tried was to read the input line by line with readLines and pipe each line to read.table via the text argument. If read.table fails, the line is presented to the user via edit() for correction and re-submission. Here is my code:

con <- textConnection(inputText, "r")
mydata <- data.frame()
while (length(text <- readLines(con, n = 1)) > 0) {

    correction <- TRUE
    while (correction) {

        err <- tryCatch(part <- read.table(text = text, sep = ";", fill = TRUE,
                                           col.names = colNames,
                                           colClasses = colClasses),
                        error = function(e) e)

        if (inherits(err, "error")) {

            # let the user correct this line and try again
            message(err, "\n")
            text <- edit(text)

        } else {

            correction <- FALSE

        }
    }

    mydata <- rbind(mydata, part)

}
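One contributor to the slowness is the rbind() inside the loop, which copies the growing data frame on every iteration. Collecting the per-line results in a list and binding once at the end avoids that quadratic copying. A minimal sketch of just the accumulation pattern, on clean input and with the interactive error handling elided:

```r
colNames <- c("date", "parA", "parB")
colClasses <- c("character", "numeric", "numeric")
con <- textConnection("2015-01-01;123;-0.01\n2015-01-04;321;-0.03")

parts <- list()    # accumulate pieces instead of growing a data frame
i <- 0L
while (length(text <- readLines(con, n = 1)) > 0) {
    i <- i + 1L
    # the tryCatch/edit() correction step from the loop above would go here
    parts[[i]] <- read.table(text = text, sep = ";",
                             col.names = colNames, colClasses = colClasses)
}
close(con)

mydata <- do.call(rbind, parts)    # one rbind at the end
```

This helps, but the per-line read.table calls themselves remain expensive.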

If the user makes the corrections correctly, this returns:

> mydata
        date parA  parB
1 2015-01-01  123 -0.01
2 2015-01-02  421 -0.02
3 2015-01-03  433 -0.04
4 2015-01-04  321 -0.03
5 2015-01-05  230 -0.05
6 2015-01-06  313 -0.02

The input text had 5 lines, since one linefeed was missing. The corrected output has 6 lines and the 0.-02 is corrected to -0.02.

What I would still change in this solution is to present all corrupt lines together for correction after everything has been read in. That way the user can run the script and do all corrections at once after it finishes. But for a minimal example this should be enough.

The really bad thing about this solution is that it is far too slow to handle big datasets. Hence I would still like another solution using more standard methods, or perhaps a special package.
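One faster alternative, along the lines suggested in the comments: read the whole file with readLines() once, validate every line with a single vectorized regular expression, and hand only the conforming lines to read.table() in one call. The pattern below is an assumption based on the three-column format of the minimal example (date; integer; signed decimal) and would need to be adapted to the real file:

```r
colNames <- c("date", "parA", "parB")
colClasses <- c("character", "numeric", "numeric")

inputText <- "2015-01-01;123;-0.01
2015-01-02;421;-0.022015-01-03;433;-0.04
2015-01-04;321;-0.03
2015-01-05;230;-0.05
2015-01-06;313;0.-02"

lines <- readLines(textConnection(inputText))

# assumed line shape: date;integer;signed decimal -- anything else is corrupt
pattern <- "^\\d{4}-\\d{2}-\\d{2};\\d+;-?\\d+\\.\\d+$"
ok <- grepl(pattern, lines)

badLines <- which(!ok)   # line numbers for manual inspection/correction

# parse all valid lines in a single read.table call
mydata <- read.table(text = paste(lines[ok], collapse = "\n"),
                     sep = ";", col.names = colNames,
                     colClasses = colClasses)
```

Both grepl() and the single read.table() call are vectorized, so this scales far better than a per-line loop, and badLines could be fed into an edit() step afterwards.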

  • For your second question please try(read.table(code), SILENT = True). This keeps your program running even when an error occurs Commented Jan 15, 2016 at 18:57
  • @Bharath it would be try(read.table(code), silent = TRUE) would it not? Commented Jan 15, 2016 at 19:04
  • I know try. I can continue with my script, but read.table will crash anyway and give me no data. If I use it like try(data <- read.table("file.csv"), silent=TRUE), data will stay undefined when read.table throws an error. So technically I can continue my script, but in practice I can't, because data is undefined. I just tested this to be sure. Commented Jan 15, 2016 at 19:05
  • 1
    "But this is really annoying" -- That may simply be the price of working with bad data. You could try alternative csv parsers, which may give more flavorful error messages for each bad feature of the input. I use fread from the data.table package Commented Jan 15, 2016 at 19:12
  • 1
    Read it in with readLines. Do your gsub modifications to the character representation of the data, then run it through read.table using the text parameter. You need to tell us how 019.06.2015 should be interpreted. At the moment it looks like a Date. Commented Jan 15, 2016 at 19:26
