How do I find/fix: ArgumentError: invalid byte sequence in UTF-8?

Question

I have an app that reads large customer-supplied data files. It works perfectly with several but, on one file I received today, it is failing with:

ArgumentError: invalid byte sequence in UTF-8

I am using String#match to look for regex patterns.

When I look at the file, nothing seems different from the ones that work.

Advice?

Edit: it looks like there is a '\xE9' character in a user name.
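
For the "find" half of the question, one possible approach is to scan the file as raw bytes and report every line that is not valid UTF-8. This is only a sketch: the helper name `invalid_utf8_lines` is made up for illustration, and the `StringIO` sample stands in for the real data file.

```ruby
require 'stringio'

# Hypothetical helper (not from the original post): returns
# [line_number, [hex bytes]] pairs for each line that is not valid UTF-8.
def invalid_utf8_lines(io)
  bad = []
  io.each_line.with_index(1) do |line, number|
    # Relabel a copy as UTF-8 so valid_encoding? checks it against UTF-8 rules.
    candidate = line.dup.force_encoding(Encoding::UTF_8)
    next if candidate.valid_encoding?

    # On a broken string, each invalid byte comes back as its own "character".
    bytes = candidate.each_char.reject(&:valid_encoding?).map { |c| c.unpack('H*').first }
    bad << [number, bytes]
  end
  bad
end

# Sample data standing in for the customer file: line 2 holds a lone 0xE9 byte.
sample = StringIO.new("alice\nAndr\xE9\nbob\n".b)
invalid_utf8_lines(sample).each do |number, bytes|
  puts "line #{number}: invalid byte(s) #{bytes.join(' ')}"
end
# prints "line 2: invalid byte(s) e9"
```

With a real file, passing `File.open('thefile.txt', 'rb')` instead of the `StringIO` should behave the same way, since the helper relabels each line itself.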

Did you look at any of the related questions on the right-hand side of the page? Try reading some of these: stackoverflow.com/search?q=[ruby]+invalid+byte+sequence
– the Tin Man, Commented Dec 4, 2012 at 19:55
I did. Nothing seemed to apply--to me at least. I am just reading a text file line by line.
– n8gard, Commented Dec 4, 2012 at 20:04
A '\xE9' character suggests that you have an ISO 8859-1 file that you're treating as UTF-8.
– mu is too short, Commented Dec 4, 2012 at 20:27
You could open the file with the appropriate encoding and then use String#encode to switch to UTF-8. For example, if you start with ISO 8859-1 (s = "\xE9".force_encoding('iso8859-1')) and then switch to UTF-8 (s.encode!('utf-8')) then you'll get the é that you're looking for. There are tons of encoding questions kicking around so I'll just leave this as a comment.
– mu is too short, Commented Dec 5, 2012 at 19:40
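
The difference between relabelling and transcoding in that last comment can be seen in a small sketch. The sample string "Andr\xE9" is an assumption standing in for the user name from the question; the variable names are illustrative only.

```ruby
# Raw bytes as they would come out of an ISO 8859-1 file (assumed sample data).
raw = "Andr\xE9".b

# force_encoding only changes the label on the string; the bytes stay the same.
latin1 = raw.dup.force_encoding('ISO-8859-1')

# encode rewrites the bytes: 0xE9 becomes the two-byte UTF-8 sequence C3 A9.
utf8 = latin1.encode('UTF-8')

puts utf8.valid_encoding?                            # true
puts utf8.bytes.last(2).map { |b| b.to_s(16) }.join(' ')  # c3 a9

# Labelling the raw bytes as UTF-8 without transcoding leaves a broken string.
relabelled = raw.dup.force_encoding('UTF-8')
puts relabelled.valid_encoding?                      # false
```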

n8gard · Accepted Answer · 2012-12-06 16:39:33Z

3

Thanks to @muistooshort's help, I open the file as ISO 8859-1 and then, reading line by line, convert each line to UTF-8.

myfile = File.open( 'thefile.txt', 'r:iso8859-1' )
while rawline = myfile.gets
  # encode transcodes the bytes to UTF-8; force_encoding would only relabel them
  line = rawline.encode( 'utf-8' )
  # proceed...
end
myfile.close
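
As a variation on the above (not what the answer originally used), Ruby can also do the conversion while reading if the open mode names both an external and an internal encoding. A minimal sketch, assuming the file really is ISO 8859-1; the temp file below is made-up sample data standing in for 'thefile.txt'.

```ruby
require 'tempfile'

# Hypothetical sample file: "André" stored as ISO 8859-1 bytes.
sample = Tempfile.new('latin1')
sample.binmode
sample.write("Andr\xE9\n".b)
sample.close

lines = []
# 'r:iso-8859-1:utf-8' means: the file is ISO 8859-1, hand me UTF-8 strings.
File.open(sample.path, 'r:iso-8859-1:utf-8') do |file|
  file.each_line { |line| lines << line }
end
sample.unlink

puts lines.first.encoding         # UTF-8
puts lines.first.valid_encoding?  # true
```

The block form of File.open also closes the file automatically, so there is no separate close call to forget.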

answered Dec 6, 2012 at 16:39

n8gard

1 Comment

n8gard Over a year ago

Not saying this is the ideal solution but it seems simple enough and totally resolved my issue on multiple affected data files.

Jason · Accepted Answer · 2015-08-17 18:18:00Z

-1

A little rake job that illustrates the solution:

task :reencode, [:filename] => [:environment] do |t, args|
  # Read the input as ISO 8859-1 and write a UTF-8 copy alongside it.
  File.open( args[:filename], 'r:iso8859-1' ) do |myfile|
    File.open( args[:filename] + ".out", "w:utf-8" ) do |outfile|
      while rawline = myfile.gets
        # encode transcodes the bytes; force_encoding would only relabel them
        outfile.write rawline.encode( 'utf-8' )
      end
    end
  end
end

answered Aug 17, 2015 at 18:18

Jason
