I have an app that reads large customer-supplied data files. It works perfectly with several but, on one file I received today, it is failing with:
ArgumentError: invalid byte sequence in UTF-8
I am using String.match to look for regex patterns.
When I look at the file, nothing seems different from the ones that work.
Advice?
Edit: it looks like there there is an 'xE9' character in a user name.
'\xE9'character suggests that you have an ISO 8859-1 file that you're treating as UTF-8.String#encodeto switch to UTF-8. For example, if you start with ISO 8859-1 (s = "\xE9".force_encoding('iso8859-1')) and then switch to UTF-8 (s.encode!('utf-8')) then you'll get theéthat you're looking for. There are tons of encoding questions kicking around so I'll just leave this as a comment.