
I have some data which has been imported into Postgres for use in a Rails application. However, somewhere along the way the accented characters have become strangely encoded:

  • ä appears as √§
  • á appears as √°
  • é appears as √©
  • ó appears as √≥

I'm pretty sure the problem is with the integrity of the data, rather than any problem with Rails. It doesn't seem to match any encoding I try:

# Replace "cp1252" with any other encoding, to no effect
"Troll√§ttan".encode("cp1252").force_encoding("UTF-8") #-> junk

If anyone is able to identify what kind of encoding mixup I'm suffering from, that would be great.

As a last resort, I may have to manually replace each corrupted accent character, but if anyone can suggest a programmatic solution (or even a starting point for fixing this - I've found it very hard to debug), I'd be very grateful.

  • Can you check what encoding is used by the database? Also, exactly how was the data imported? Commented Sep 10, 2012 at 16:53
  • The encoding is UTF8 (collation en_US.UTF-8). The data went through quite a complex import process (originally CSV, then went through Google Refine, and then a bunch more transformations). It won't be very easy to reimport the data, so an in-place fix would be ideal. Commented Sep 10, 2012 at 17:11
  • And the original CSV file - what encoding was that? A 'complex import process' adds a lot of variables, and it may be more than one misinterpretation of the encoding causing this... Also, if you can verify the encoding at each stage of the process, that may help pin down the source of the corruption quite a bit (a quick way to check the database side is sketched just below). Commented Sep 10, 2012 at 17:16
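
(To check the encodings mentioned in these comments from a Rails console, a minimal sketch, assuming a standard ActiveRecord connection; it asks Postgres directly for the server, database, and client encodings:)

conn = ActiveRecord::Base.connection
# Encoding the server stores data in, and what this database was created with
puts conn.select_value("SHOW server_encoding")
puts conn.select_value("SELECT pg_encoding_to_char(encoding) FROM pg_database WHERE datname = current_database()")
# Encoding the client session declares; a mismatch here corrupts data in transit
puts conn.select_value("SHOW client_encoding")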

1 Answer


It's hardly possible with recent versions of PostgreSQL to have invalid UTF8 inside a UTF8 database. There are other plausible possibilities that may lead to that output, though.

In the typical case of é appearing as Ã©, either:

  1. The contents of the database are valid, but some client-side layer is interpreting the bytes from the database as if they were iso-latin-something whereas they are UTF8.

  2. The contents are valid and the SQL client-side layer is valid, but the terminal/software/webpage with which you're looking at this is configured for iso-latin1 or a similar single-byte encoding (win1252, iso-latin9...).

  3. The contents of the database consist of the wrong characters with a valid UTF8 encoding. This is what you end up with if you take iso-latin-something bytes, convert them to their UTF8 representation, take the resulting byte stream as if it were still iso-latin, reconvert it to UTF8 once again, and insert that into the database (the sketch after this list walks through exactly this).
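
To make case 3 concrete, here is what that double conversion looks like in Ruby: a minimal demonstration, assuming the misread layer was iso-latin1 (any single-byte encoding behaves the same way):

original = "é"                        # correct UTF-8, bytes c3 a9
# Misread the UTF-8 bytes as iso-latin1, then convert to UTF-8 again
# (.dup so the original string's encoding label is left untouched):
double = original.dup.force_encoding("ISO-8859-1").encode("UTF-8")
puts double                           # => "Ã©" (wrong characters...)
puts double.valid_encoding?           # => true (...but perfectly valid UTF-8)
p double.bytes.map { |b| b.to_s(16) } # => ["c3", "83", "c2", "a9"]

Comparing the raw bytes stored in the column against the expected UTF-8 bytes (c3 a9 for é) is the surest way to tell case 3 apart from cases 1 and 2.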

Note that while the Ã© sequence is typical in UTF8 versus iso-latin confusion, the presence of an additional √ in all your sample strings is uncommon. It may be the result of another misinterpretation on top of the primary one. If you're in case #3, that may mean that an automated fix based on search-replace will be harder than the normal case, which is already tricky.
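
For what it's worth, the √ itself points at Mac Roman: the UTF8 bytes of ä (0xC3 0xA4) displayed as Mac Roman come out as √§. If your data really is case #3 with a Mac Roman layer (an assumption worth verifying on a few rows first), the round-trip from your question, pointed at macRoman, may undo it:

# Assumption: the column holds valid UTF8 that literally spells "√§" etc.,
# i.e. UTF8 bytes that were once misread as Mac Roman and re-encoded.
corrupted = "Troll√§ttan"
repaired = corrupted.encode("macRoman").force_encoding("UTF-8")
puts repaired                 # => "Trollättan" if the assumption holds
puts repaired.valid_encoding? # sanity-check before writing anything back

If a second layer of misinterpretation is involved, rerunning the probe from your question on this partially repaired output should expose it.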
