0

I have an app that reads large customer-supplied data files. It works perfectly with several of them, but on one file I received today it fails with:

ArgumentError: invalid byte sequence in UTF-8

I am using String#match to look for regex patterns.

When I look at the file, nothing seems different from the ones that work.

Advice?

Edit: it looks like there is an '\xE9' character in a user name.

6
  • Did you look at any of the related questions on the right-hand side of the page? Try reading some of these: stackoverflow.com/search?q=[ruby]+invalid+byte+sequence Commented Dec 4, 2012 at 19:55
  • stackoverflow.com/questions/6374756/… Commented Dec 4, 2012 at 19:59
  • I did. Nothing seemed to apply--to me at least. I am just reading a text file line by line. Commented Dec 4, 2012 at 20:04
  • 1
    A '\xE9' character suggests that you have an ISO 8859-1 file that you're treating as UTF-8. Commented Dec 4, 2012 at 20:27
  • 2
    You could open the file with the appropriate encoding and then use String#encode to switch to UTF-8. For example, if you start with ISO 8859-1 (s = "\xE9".force_encoding('iso8859-1')) and then switch to UTF-8 (s.encode!('utf-8')) then you'll get the é that you're looking for. There are tons of encoding questions kicking around so I'll just leave this as a comment. Commented Dec 5, 2012 at 19:40
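
The suggestion in the last comment can be sketched as follows. This is only an illustration, assuming the file really is ISO 8859-1; the variable names are made up for the example. It shows that relabeling the byte as UTF-8 leaves an invalid string, while labeling it with its real encoding and then calling String#encode produces valid UTF-8.

```ruby
# Byte 0xE9 is "é" in ISO 8859-1 but is not a complete sequence in UTF-8.
raw = "\xE9".b                          # one raw byte, tagged as binary

# Relabeling the byte as UTF-8 does not change it, so the string is invalid.
mislabeled = raw.dup.force_encoding('utf-8')
puts mislabeled.valid_encoding?         # false

# Label the byte with its real encoding, then transcode to UTF-8.
latin1 = raw.dup.force_encoding('iso-8859-1')
utf8   = latin1.encode('utf-8')
puts utf8.valid_encoding?               # true
puts utf8.bytes.inspect                 # [195, 169]
```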

2 Answers

3

Thanks to @muistooshort's help, I opened the file in ISO 8859-1 mode and then, reading line by line, converted each line to UTF-8.

myfile = File.open( 'thefile.txt', 'r:iso8859-1' )
while rawline = myfile.gets
  # encode transcodes the bytes to UTF-8; force_encoding would only relabel them
  line = rawline.encode( 'utf-8' )
  # proceed...
end
myfile.close
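
As an alternative sketch, Ruby can also transcode while reading if the open mode names both the external and the internal encoding. The sample file and its contents below are hypothetical stand-ins for the customer file, and the example assumes the data is ISO 8859-1.

```ruby
require 'tempfile'

# Create a small ISO 8859-1 sample file (a stand-in for the customer file).
sample = Tempfile.new('latin1_sample')
sample.binmode
sample.write("Andr\xE9,42\n".b)
sample.close

# 'r:iso-8859-1:utf-8' means: the file is ISO 8859-1 on disk (external),
# and strings are handed to the program as UTF-8 (internal).
lines = []
File.open(sample.path, 'r:iso-8859-1:utf-8') do |file|
  file.each_line do |line|
    lines << line.chomp
  end
end

first = lines.first
puts first.encoding                 # UTF-8
puts first.valid_encoding?          # true
puts first.match(/,(\d+)\z/)[1]     # 42

sample.unlink
```

With this mode string, each line arrives already transcoded, so no per-line encode call is needed before matching.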

1 Comment

Not saying this is the ideal solution but it seems simple enough and totally resolved my issue on multiple affected data files.
-1

A little rake job that illustrates the solution:

task :reencode, [:filename] => [:environment] do |t, args|
  myfile = File.open( args[:filename], 'r:iso8859-1' )
  outfile = File.open( args[:filename] + ".out", "w+" )
  while rawline = myfile.gets
    # transcode from ISO 8859-1 to UTF-8 (force_encoding would only relabel the bytes)
    line = rawline.encode( 'utf-8' )
    outfile.write line
  end
  myfile.close
  outfile.close
end
