I have a problem with Perl and the encoding pragma.

(I use UTF-8 everywhere: in input, in output, and in the Perl scripts themselves. I never want to use any other encoding.)

However, when I write

binmode(STDOUT, ':utf8');
use utf8;
$r = "\x{ed}";
print $r;

I see the string "í", which is what I want: the Unicode character U+00ED. But when I add the "use encoding" pragma like this

binmode(STDOUT, ':utf8');
use utf8;
use encoding 'utf8';
$r = "\x{ed}";
print $r;

all I see is a box character. Why?

Moreover, when I add Data::Dumper and let the Dumper print the new string like this

binmode(STDOUT, ':utf8');
use utf8;
use encoding 'utf8';
$r = "\x{ed}";
use Data::Dumper;
print Dumper($r);

I see that perl changed the string to "\x{fffd}". Why?

2 Answers

10

use encoding 'utf8' is broken. Rather than interpreting \x{ed} as the code point U+00ED, it interprets it as the single byte 237 (0xED) and then tries to decode that byte as UTF-8. That fails, because a lone 0xED byte is not a valid UTF-8 sequence, so it winds up being replaced with the replacement character U+FFFD, literally "�".
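To see the same effect without the pragma, here is a minimal sketch that uses the core Encode module to decode the two byte strings by hand. It is only an illustration of the decoding step described above, not a claim about how the pragma is implemented internally; the variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The single byte 237 on its own is not a complete UTF-8 sequence.
my $lone_byte = "\xED";
my $from_lone = decode('UTF-8', $lone_byte);   # default CHECK: substitute malformed input

# The two-byte sequence C3 AD is the UTF-8 encoding of U+00ED.
my $two_bytes  = "\xC3\xAD";
my $from_valid = decode('UTF-8', $two_bytes);

# Print the code point of the first character of each result.
printf "lone byte -> U+%04X\n", ord $from_lone;
printf "C3 AD     -> U+%04X\n", ord $from_valid;
```

The first line of output should show the replacement character's code point, the second U+00ED.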

Just stick with use utf8 to specify that your source is in UTF-8, and binmode or the open pragma to specify the encoding for your file handles.

3 Comments

Oh... OK. I can't claim I understand the reason for the reinterpreting, but there are far, far more weird things in perl. Thanks
As far as I can tell, the reason is that use encoding was designed so people could write use encoding 'euc-jp'; $r = "\xF1\xD1\xF1\xCC"; and have it interpreted "correctly". But that would mean you'd have to write your UTF-8 string in the same style, as $r = "\xC3\xAD";. That gets confusing when combined with Perl's native support for Unicode escapes like $r = "\x{200b}";, because escapes with codes 0x80-0xff are then interpreted differently from escapes with codes 0x100 and up.
Yeah, Perl's support for 8-bit locales (use encoding, use locale) should be kept at the other end of a very long stick.
5

Your actual code needs neither use encoding nor use utf8 to run properly -- the only thing it depends on is the encoding layer on STDOUT.

binmode(STDOUT, ":utf8");
print "\xed";

is an equally valid complete program that does what you want.

use utf8 should be used only if you have UTF-8 in literal strings in your program -- e.g. if you had written

my $r = "í";

then use utf8 would cause that string to be interpreted as the single character U+00ED instead of the series of bytes C3 AD.
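A rough way to see the difference without depending on how this file itself is saved: the sketch below spells the literal out as its two UTF-8 bytes and then decodes a copy with utf8::decode, which approximates what use utf8 does for a literal in the source. The variable names are only for illustration.

```perl
use strict;
use warnings;

# How the literal "í" looks to Perl without "use utf8": two separate bytes.
my $as_bytes = "\xC3\xAD";

# Roughly how it looks with "use utf8": the bytes decoded into one character.
my $as_chars = $as_bytes;
utf8::decode($as_chars);    # decodes in place; always available, no "use utf8" needed

printf "bytes: length %d\n", length $as_bytes;
printf "chars: length %d, U+%04X\n", length $as_chars, ord $as_chars;
```

This matters for the regular-expression case too: with the decoded form, a single "." or a character class matches the whole letter rather than one of its bytes.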

use encoding should never be used, especially by someone who likes Unicode. If you want the encoding of stdin/out to be changed, you should use -C or PERLUNICODE, or binmode them yourself, and if you want other handles to be automatically opened with encoding layers, you should use open.
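As a sketch of that setup, assuming the script file itself is saved as UTF-8, the open pragma can set the layers for the standard handles and for handles opened later in the same lexical scope:

```perl
use strict;
use warnings;
use utf8;                              # string and regex literals in this file are UTF-8
use open qw(:std :encoding(UTF-8));    # STDIN/STDOUT/STDERR plus handles opened in this scope

my $r = "\x{ed}";                      # one character, U+00ED
print "$r\n";                          # encoded to UTF-8 by the layer on STDOUT
```

Running an unmodified script as perl -CS script.pl, or with PERLUNICODE=S in the environment, is the command-line route to UTF-8 layers on the three standard handles.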

1 Comment

hobbs: yes, I have actual UTF-8 literals in my code (in regular expressions). Thanks.
