Perl's use encoding pragma breaking UTF strings

Question

I have a problem with Perl and the encoding pragma.

(I use UTF-8 everywhere: in input, in output, and in the Perl scripts themselves. I never want to use any other encoding.)

However, when I write

binmode(STDOUT, ':utf8');
use utf8;
$r = "\x{ed}";
print $r;

I see the string "í" (which is what I want, and which is the Unicode character U+00ED). But when I add the "use encoding" pragma like this

binmode(STDOUT, ':utf8');
use utf8;
use encoding 'utf8';
$r = "\x{ed}";
print $r;

all I see is a box character. Why?

Moreover, when I add Data::Dumper and let the Dumper print the new string like this

binmode(STDOUT, ':utf8');
use utf8;
use encoding 'utf8';
$r = "\x{ed}";
use Data::Dumper;
print Dumper($r);

I see that perl changed the string to "\x{fffd}". Why?

See also: stackoverflow.com/questions/492838/…

– Eugene Yarmash, Commented Mar 19, 2011 at 16:09

Anomie · Accepted Answer · 2011-03-19 16:07:03Z

10

use encoding 'utf8' is broken. Rather than interpreting \x{ed} as the code point U+00ED, it interprets it as the single byte 237 (0xED) and then tries to decode that byte as UTF-8. A lone 0xED byte is not a valid UTF-8 sequence, so the decode fails and the byte is replaced with the replacement character U+FFFD, literally "�".
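
The decoding step described above can be reproduced without the pragma. This is only a sketch using the core Encode module, not what use encoding does internally; the variable names are made up for illustration. It decodes the lone byte 0xED as UTF-8 and compares the result with decoding the two bytes C3 AD.

```perl
use strict;
use warnings;
use Encode qw(decode);

# A lone byte 0xED (decimal 237) starts a multi-byte UTF-8 sequence
# but has no continuation bytes, so it cannot be decoded on its own.
my $lone_byte = "\xED";
my $broken    = decode('UTF-8', $lone_byte);   # default mode substitutes U+FFFD

# The UTF-8 encoding of U+00ED is the two bytes C3 AD.
my $two_bytes = "\xC3\xAD";
my $ok        = decode('UTF-8', $two_bytes);

printf "lone byte ED -> U+%04X\n", ord($broken);
printf "bytes C3 AD  -> U+%04X\n", ord($ok);
```

With Encode's default error handling, the first line should report U+FFFD and the second U+00ED, which matches the "\x{fffd}" that Data::Dumper showed in the question.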

Just stick with use utf8 to specify that your source is in UTF-8, and binmode or the open pragma to specify the encoding for your file handles.
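
A minimal sketch of that setup, assuming the script file itself is saved as UTF-8. It uses the stricter :encoding(UTF-8) layer in place of the :utf8 layer from the question; that substitution is my choice, not something the question requires.

```perl
use strict;
use warnings;
use utf8;                                # string literals in this file are UTF-8

binmode(STDOUT, ':encoding(UTF-8)');     # encode characters as UTF-8 on output

my $r = "\x{ed}";                        # one character: U+00ED
my $s = "í";                             # the same character written as a literal

print "$r $s\n";
print $r eq $s ? "same\n" : "different\n";
```

Without use encoding in the picture, the escape and the literal are the same one-character string, and the layer on STDOUT takes care of the bytes.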

3 Comments

Karel Bílek Over a year ago

Oh... OK. I can't claim I understand the reason for the reinterpreting, but there are far, far more weird things in perl. Thanks

Anomie Over a year ago

As far as I can tell, the reason is that use encoding was designed so people could write use encoding 'euc-jp'; $r = "\xF1\xD1\xF1\xCC"; and have it interpreted "correctly". But that would mean you'd have to write your UTF-8 string in the same style, as $r = "\xC3\xAD";. That gets confusing when combined with Perl's native support for Unicode escapes like $r = "\x{200b}";, because escapes with codes 0x80-0xff are then interpreted differently from escapes with codes 0x100 and up.

hobbs Over a year ago

Yeah, Perl's support for 8-bit locales (use encoding, use locale) should be kept at the other end of a very long stick.

hobbs · Answer · 2011-03-19 16:20:02Z

5

Your actual code needs neither use encoding nor use utf8 to run properly -- the only thing it depends on is the encoding layer on STDOUT.

binmode(STDOUT, ":utf8");
print "\xed";

is an equally valid complete program that does what you want.
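
To see what the output layer does without relying on a terminal, here is a small sketch that prints the same string to an in-memory filehandle and inspects the resulting bytes. The in-memory handle and the variable names are assumptions made for illustration, and :encoding(UTF-8) stands in for :utf8.

```perl
use strict;
use warnings;

# Write to an in-memory buffer through a UTF-8 encoding layer.
my $buffer = '';
open(my $fh, '>:encoding(UTF-8)', \$buffer)
    or die "Cannot open in-memory handle: $!";
print {$fh} "\xed";                      # one character, U+00ED
close($fh) or die "close failed: $!";

# Show the bytes that the layer actually produced.
my $hex = join ' ', map { sprintf '%02X', ord($_) } split //, $buffer;
print "$hex\n";
```

The buffer should hold the two bytes C3 AD, the UTF-8 encoding of U+00ED, even though the program contains neither use utf8 nor use encoding.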

use utf8 should be used only if you have UTF-8 in literal strings in your program -- e.g. if you had written

my $r = "í";

then use utf8 would cause that string to be interpreted as the single character U+00ED instead of the series of bytes C3 AD.
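
The difference can be checked with length. This sketch assumes the file is saved as UTF-8 and relies on use utf8 being lexically scoped; the variable names are hypothetical.

```perl
use strict;
use warnings;

my ($with_utf8, $without_utf8);
{
    use utf8;               # applies to this block only
    $with_utf8 = "í";       # parsed as one character, U+00ED
}
{
    no utf8;
    $without_utf8 = "í";    # parsed as the two bytes C3 AD
}

printf "with use utf8:    length %d\n", length($with_utf8);
printf "without use utf8: length %d\n", length($without_utf8);
```

The first length should be 1 and the second 2, which is why literals such as UTF-8 text inside regular expressions need use utf8.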

use encoding should never be used, especially by someone who likes Unicode. If you want the encoding of STDIN/STDOUT to be changed, use -C or PERLUNICODE, or binmode them yourself; if you want other handles to be opened with encoding layers automatically, use the open pragma (use open).
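
A sketch of the use open approach, assuming a temporary file created with File::Temp purely for demonstration. The pragma sets the default layer for handles opened without an explicit one, and the file is read back with an explicit :raw layer to show the bytes on disk.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use open qw(:std :encoding(UTF-8));   # default layers for open() here, plus STDIN/STDOUT/STDERR

# Only the file name is needed; the handle from tempfile() is discarded.
my (undef, $path) = tempfile(UNLINK => 1);

# No layer given, so the pragma's :encoding(UTF-8) default applies.
open(my $out, '>', $path) or die "Cannot write $path: $!";
print {$out} "\x{ed}";
close($out) or die "close failed: $!";

# An explicit :raw layer overrides the default, so we get plain bytes.
open(my $in, '<:raw', $path) or die "Cannot read $path: $!";
my $raw = do { local $/; <$in> };
close($in);

my $hex = join ' ', map { sprintf '%02X', ord($_) } split //, $raw;
print "$hex\n";
```

For the standard handles alone, running the script as perl -CS script.pl or setting PERLUNICODE=S has a similar effect without changing the source.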

1 Comment

Karel Bílek Over a year ago

hobbs: yes, I have actual UTF-8 literals in my code (in regular expressions). Thanks.
