This is what I was doing:

csv = CSV.open(file_name, "r")

I used this for testing:

line = csv.shift
while not line.nil?
  puts line
  line = csv.shift
end

And I ran into this:

ArgumentError: invalid byte sequence in UTF-8

I read the answer here, and this is what I tried:

csv = CSV.open(file_name, "r", encoding: "windows-1251:utf-8")

I ran into the following error:

Encoding::UndefinedConversionError: "\x98" to UTF-8 in conversion from Windows-1251 to UTF-8

Then I came across a Ruby gem - charlock_holmes. I figured I'd try using it to find the source encoding.

CharlockHolmes::EncodingDetector.detect(File.read(file_name))
=> {:type=>:text, :encoding=>"windows-1252", :confidence=>37, :language=>"fr"}

So I did this:

csv = CSV.open(file_name, "r", encoding: "windows-1252:utf-8")

And still got this:

Encoding::UndefinedConversionError: "\x8F" to UTF-8 in conversion from Windows-1252 to UTF-8
  • Seems like this might work: stackoverflow.com/a/9361667/724516 Commented Apr 4, 2013 at 22:34
  • Could you upload your CSV file? Commented Apr 5, 2013 at 9:26

1 Answer

It looks like the problem is detecting the correct encoding of your file. CharlockHolmes gives you a useful hint with :confidence=>37, which means the detected encoding may well not be the right one.

Based on the error messages and on test_transcode.rb from https://github.com/MacRuby/MacRuby/blob/master/test-mri/test/ruby/test_transcode.rb, I found an encoding that accepts both of the bytes from your error messages. String#encode makes this easy to test:

"\x8F\x98".encode("UTF-8","cp1256") # => "ڈک"
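The same check can be run across several candidate encodings at once. This is a minimal sketch: the candidate list and the helper name `encodings_accepting` are made up for illustration, and the two bytes are the ones from the error messages in the question. An encoding that accepts both bytes is only consistent with the errors; it is not proof that the file was actually written in that encoding.

```ruby
# Bytes taken from the two error messages in the question.
PROBLEM_BYTES = "\x98\x8F".b

# Hypothetical shortlist of single-byte encodings to probe; adjust as needed.
CANDIDATES = %w[Windows-1250 Windows-1251 Windows-1252 Windows-1256].freeze

# Returns the candidates that can convert every given byte to UTF-8.
def encodings_accepting(bytes, candidates)
  candidates.select do |name|
    begin
      bytes.encode("UTF-8", name) # interpret the bytes as `name`
      true
    rescue EncodingError
      # Raised when a byte has no mapping in `name` (or `name` is unknown).
      false
    end
  end
end

puts encodings_accepting(PROBLEM_BYTES, CANDIDATES).inspect
# => ["Windows-1256"]
```

Of this shortlist, only Windows-1256 maps both bytes, which matches the cp1256 result above.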

Your issue appears to be strictly related to the file, not to Ruby.

If we are not sure which encoding to use and can accept losing some characters, we can pass the :invalid and :undef options to String#encode. In this case:

"\x8F\x98".encode("UTF-8", "CP1250",:invalid => :replace, :undef => :replace, :replace => "?") # => "Ź?"

Another way is to use Iconv's //IGNORE option on the target encoding:

Iconv.iconv("UTF-8//IGNORE","CP1250", "\x8F\x98")

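For the source encoding, the suggestion from CharlockHolmes should be a reasonable starting point.

PS. String#encode was introduced in Ruby 1.9. With Ruby 1.8 you can use Iconv. Note that Iconv was removed from the standard library in Ruby 2.0, so on newer versions String#encode is the way to go.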
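For the CSV file itself, the String#encode options can be applied to the raw bytes before parsing. The following is only a sketch under assumptions: the helper name `lossy_csv_rows` and the sample data are invented, and the source encoding (Windows-1252 here, as CharlockHolmes suggested) is a guess rather than something the file tells us. Bytes that have no mapping are replaced with "?".

```ruby
require "csv"

# Converts raw bytes to UTF-8, replacing anything that cannot be converted,
# then parses the result as CSV. `source_encoding` is a guess about the file.
def lossy_csv_rows(raw_bytes, source_encoding)
  utf8 = raw_bytes.encode("UTF-8", source_encoding,
                          invalid: :replace, undef: :replace, replace: "?")
  CSV.parse(utf8)
end

# Made-up sample data; for a real file use: raw = File.binread(file_name)
# \xE9 is "é" in Windows-1252, while \x8F has no mapping there.
raw = "name,city\nJos\xE9,\x8F\n".b

rows = lossy_csv_rows(raw, "Windows-1252")
rows.each { |row| puts row.inspect }
```

Reading the file in binary mode first keeps Ruby from raising on the invalid bytes before the lossy conversion has had a chance to run.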

2 Comments

Thanks for responding. I'm quite certain the issue is related to my file. I still need to be able to parse it though. I'm okay with losing some of the characters. Any idea?
ooh nice! this looks really useful. I'll give it a shot. Appreciate your efforts!
