I am working on an application where a Ruby Sidekiq process calls a third-party API and parses the data into a database.

I am using Sequel as my ORM.

I am getting some weird characters back in the results, for example:

"Tweets en Ingl\xE9s y en Espa\xF1ol"

When I attempt to save this to Postgres, the following error occurs:

Sequel::DatabaseError: PG::CharacterNotInRepertoire: ERROR: invalid byte sequence for encoding "UTF8": 0xe9 0x73 0x20

The weird thing is that the string thinks it is UTF-8; if I check the encoding, it says:

name.encoding.name # => "UTF-8"

What can I do to ensure that the data is in the correct format for postgres?

1 Answer

Just because the string claims to be UTF-8 doesn't mean that it is UTF-8. \xe9 is é in ISO-8859-1 (AKA Latin-1) but it is invalid in UTF-8; similarly, \xf1 is ñ in ISO-8859-1 but invalid in UTF-8. That suggests that the string is actually encoded in ISO-8859-1 rather than UTF-8. You can fix it with a combination of force_encoding to correct Ruby's confusion about the current encoding and encode to re-encode it as UTF-8:

> "Tweets en Ingl\xE9s y en Espa\xF1ol".force_encoding('iso-8859-1').encode('utf-8')
=> "Tweets en Inglés y en Español" 

So before sending that string to the database you want to:

name = name.force_encoding('iso-8859-1').encode('utf-8')
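If some records come back correctly encoded and others don't, blindly forcing ISO-8859-1 would mangle the good ones. A defensive variant is to re-encode only when the string is not already valid UTF-8. This is a minimal sketch; `ensure_utf8` is a hypothetical helper name (not part of Sequel), and it assumes the bad input is mis-tagged ISO-8859-1 as in the example above:

```ruby
# Re-encode a string as UTF-8 only when its bytes are not already
# valid UTF-8; correctly encoded input passes through untouched.
def ensure_utf8(str)
  return str if str.encoding == Encoding::UTF_8 && str.valid_encoding?

  # Assumption: invalid input is really ISO-8859-1 with a wrong UTF-8 tag.
  str.force_encoding('iso-8859-1').encode('utf-8')
end

ensure_utf8("Tweets en Ingl\xE9s y en Espa\xF1ol") # => "Tweets en Inglés y en Español"
ensure_utf8("already valid")                       # => "already valid"
```

You could call this on each field in the Sidekiq worker before handing the row to Sequel.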

Unfortunately, there is no way to reliably detect a string's real encoding. The various encodings overlap and there's no way to tell if è (\xe8 in ISO-8859-1) or č (\xe8 in ISO-8859-2) is the right character without manual sanity checking.
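To see why detection is ambiguous, you can take the single byte \xE8 and force it into either encoding; both produce a perfectly valid but different character:

```ruby
byte = "\xE8".b  # one raw byte, tagged as ASCII-8BIT

# The same byte decodes to different, equally plausible characters:
byte.dup.force_encoding('iso-8859-1').encode('utf-8') # => "è"
byte.dup.force_encoding('iso-8859-2').encode('utf-8') # => "č"
```

Nothing in the byte itself says which interpretation is right; only knowledge of the data's source (or manual inspection) can settle it.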
