
Some of the .csv files with numerical data that I work with contain errors. Each error is marked by a random string, so after reading a file in, the data frame could look like this:

set.seed(123)
rand.str <-  paste0(letters[sample(10)], collapse="")
wrong.output <- data.frame(a=1:5, b=c(4:5, rand.str, 7:8), stringsAsFactors=FALSE)

In this case the proper output is:

proper.output <- data.frame(a=1:5, b=c(4:5, NA, 7:8))

After reading with read.csv, each column containing at least one character value is treated as a character column.

Can I mark the errors (random strings) as NAs while reading in the file? If not, what is the most convenient, proper, or fastest method for replacing them with NAs?

There is the na.strings argument in read.csv, but it only helps in simpler cases where the error markers are known in advance, for example: na.strings=c("-", "unavailable")

(I can't see any duplicate, so I guess there is a simple workaround.)

The colClasses suggestion does not work:
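To illustrate the simpler case, here is a minimal sketch of na.strings with markers that are known in advance. The temporary file and its contents are made up for this example and are not the real data.

```r
# Hypothetical test file: the error markers "-" and "unavailable" are known up front
tmp <- tempfile(fileext = ".csv")
writeLines(c("a,b", "1,4", "2,-", "3,unavailable", "4,7"), tmp)

# Both markers are read as NA, so column b can still be converted to a numeric type
known <- read.csv(tmp, na.strings = c("-", "unavailable"))
str(known)
```

This does not carry over to random strings, because every possible marker would have to be listed in na.strings beforehand.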

read.csv("test.txt", sep=",", colClasses = c("numeric", "numeric"))

Error in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
  scan() expected 'a real', got 'chdgfajibe'
In addition: Warning message:
In read.table(file = file, header = header, sep = sep, quote = quote, :
  incomplete final line found by readTableHeader on 'test.txt'

  • Have you tried setting colClasses=c("numeric") within read.csv? Commented Feb 1, 2017 at 13:16
  • Yes, and it does not work. Commented Feb 1, 2017 at 13:26

2 Answers


I adapted this from an answer to a different CSV-reading question that is about 7 years old. I think it is a cleaner solution, and it gives your desired output.

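# Placeholder class so that colClasses can dispatch to a custom converter
setClass("Alpha")
# Strip alphabetic characters, then convert: an all-letter string becomes "" and hence NA
setAs("character", "Alpha",
      function(from) as.numeric(gsub('[[:alpha:]]+', '', from)))
# Column a is read as numeric; column b goes through the "Alpha" converter
read.csv('data.csv', colClasses = c('numeric', 'Alpha'))

Output: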

  a  b
1 1  4
2 2  5
3 3 NA
4 4  7
5 5  8

Source: How to read data when some numbers contain commas as thousand separator?
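One caveat with stripping letters: a mixed value such as "12abc" would come out as 12 rather than NA, and "1e5" would become 15. If every non-numeric value should be NA, a stricter converter can be sketched as follows. The class name NumOrNA and the temporary file are made up for this example.

```r
library(methods)  # provides setClass() and setAs(); normally attached already

# Hypothetical class name, used only to hook a converter into colClasses
setClass("NumOrNA")

# Anything that as.numeric() cannot parse becomes NA; the coercion warning is silenced
setAs("character", "NumOrNA",
      function(from) suppressWarnings(as.numeric(from)))

# Small test file mirroring the data in the question
tmp <- tempfile(fileext = ".csv")
writeLines(c("a,b", "1,4", "2,5", "3,chdgfajibe", "4,7", "5,8"), tmp)

res <- read.csv(tmp, colClasses = c("numeric", "NumOrNA"))
str(res)
```

The trade-off is that suppressWarnings() also hides coercion warnings you might want to see, so it may be worth counting the NAs afterwards.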



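A simple solution is to convert every column after reading; non-numeric strings become NA (with a coercion warning):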

wrong.output[] <- lapply(wrong.output, as.numeric)
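A slightly more defensive variant, as a sketch only: it converts just the character columns, so columns that are already numeric keep their type, and it silences the expected coercion warning. The names is.chr and fixed are made up for this example.

```r
# Rebuild the example data from the question
set.seed(123)
rand.str <- paste0(letters[sample(10)], collapse = "")
wrong.output <- data.frame(a = 1:5, b = c(4:5, rand.str, 7:8),
                           stringsAsFactors = FALSE)

# Find the character columns and convert only those
is.chr <- vapply(wrong.output, is.character, logical(1))
fixed <- wrong.output
fixed[is.chr] <- lapply(fixed[is.chr],
                        function(x) suppressWarnings(as.numeric(x)))
str(fixed)
```

Here column a stays integer, while column b becomes numeric with NA in place of the random string.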
