I have to load some data from external sources. When I check the encoding, Ruby reports ASCII-8BIT (binary). However, some of the sources are encoded in ISO-8859-1 and some in UTF-8. When I try to convert the ISO-8859-1 data to UTF-8 directly, I get an error. But when I do something like content.force_encoding('ISO-8859-1').encode('UTF-8'), everything works fine.

However, this doesn't work the other way round. When I try to encode the UTF-8 data to ISO-8859-1, I end up with broken characters.
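
To make the two cases concrete, here is a minimal sketch in plain Ruby. The sample string and the variable names are made up for illustration; the point is only that force_encoding relabels bytes without changing them, while encode transcodes from whatever label the string currently carries.

```ruby
# "caf\xE9" is "café" in ISO-8859-1; .b tags the bytes as ASCII-8BIT (binary).
latin1_bytes = "caf\xE9".b

# encode transcodes from the string's current label. Binary defines no
# characters for bytes >= 0x80, so a direct conversion raises.
error = begin
  latin1_bytes.encode('UTF-8')
  nil
rescue Encoding::UndefinedConversionError => e
  e
end

# force_encoding only relabels the bytes; encode then transcodes them.
utf8 = latin1_bytes.dup.force_encoding('ISO-8859-1').encode('UTF-8')

# The same relabel applied to bytes that are already UTF-8 never raises,
# because every byte is a valid ISO-8859-1 character, but the result is
# mojibake: each multi-byte character turns into two or more characters.
utf8_bytes = "caf\u00E9".b
mojibake = utf8_bytes.dup.force_encoding('ISO-8859-1').encode('UTF-8')
```

Since the wrong label never produces an error in the second case, the broken characters only show up when the output is inspected.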

So, is there a way to detect the "underlying" encoding of the ASCII-8BIT data, and then convert it to UTF-8?

  • A quick search found a library which might solve your problem... github.com/brianmario/charlock_holmes Commented May 11, 2015 at 8:36
  • This is just not possible in a reliable way. Only heuristic approaches exist. Commented May 11, 2015 at 8:39
  • @AJFaraday I tried that gem, works like a charm! If you add your comment as an answer, I'll accept it. Commented May 11, 2015 at 10:18
  • There's no absolutely reliable way to do this; you really need to keep track of which files are in which encoding. If you have to guess, there are some gems that will help, but the result will not be absolutely reliable. Commented Sep 16, 2015 at 3:29

1 Answer

I had a quick google and found the Charlock Holmes gem by Brian Lopez. It looks like it does the detection process you're after.

https://github.com/brianmario/charlock_holmes
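
If adding a gem is not an option and the data can only be UTF-8 or ISO-8859-1, a rough fallback in plain Ruby might look like the sketch below. This is not how Charlock Holmes works; it is a simple heuristic that relies on the fact that non-ASCII ISO-8859-1 text is rarely valid UTF-8. The helper name guess_and_convert_to_utf8 and its fallback parameter are hypothetical, and the guess can be wrong for short or unusual inputs.

```ruby
# Hypothetical helper (not part of any library): returns a UTF-8 string,
# guessing between UTF-8 and a single-byte fallback encoding.
def guess_and_convert_to_utf8(bytes, fallback: 'ISO-8859-1')
  # First assume the bytes are already UTF-8 and check that they decode cleanly.
  as_utf8 = bytes.dup.force_encoding('UTF-8')
  return as_utf8 if as_utf8.valid_encoding?

  # Otherwise treat them as the fallback encoding and transcode to UTF-8.
  bytes.dup.force_encoding(fallback).encode('UTF-8')
end

from_latin1 = guess_and_convert_to_utf8("caf\xE9".b)    # ISO-8859-1 input
from_utf8   = guess_and_convert_to_utf8("caf\u00E9".b)  # UTF-8 input
```

Both calls should return the same UTF-8 string. Pure ASCII input passes the UTF-8 check as well, which is harmless because ASCII is identical in both encodings.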
