Perl recognizing utf8 as Unicode after 4096 bytes

Question

I have an application in Perl/CGI where I receive a utf8 txt file and treat its content.

For some reason (I think that Perl divides the file into 4096 bytes buffers and only the first one has the Byte Order Mark) Perl interprets the content of the file as Unicode after 4096 bytes.

If I spread some en dashes ("–") in the middle of the file (at least one for each block of 4k) the program recognizes it as utf8, probably because Unicode doesn't have en dashes.

I'm receiving the txt from an html page and reading it into a scalar variable like this:

while(my $l = <$fh>){
    $text .= $l;
}

I tried to force utf8 by concatenating each line of the file with an en dash:

while(my $l = <$fh>){
    $text .= "–".$l;
}

But I get this error:

Wide character in print at (eval 12) line 94.

Does anyone have a tip? Thank you!

"Unicode doesn't have en dashes". That is untrue. fileformat.info/info/unicode/char/2013/index.htm
– Dave Cross, Commented Sep 9, 2013 at 15:31

cjm · Accepted Answer · 2013-09-08 06:03:28Z

3

Perl can operate on Unicode codepoints, but all I/O is done with bytes. When you print a string with high code points to a normal file handle, you get the “wide character in print” warning.

You should decode all input data, and encode all your output. The best way to do this is to use PerlIO layers. You can add layers with binmode. For example:

use utf8; # This source file is encoded in UTF-8.
          # Else, the literal "–" would be seen as multiple bytes, not one single character.

binmode STDOUT, ":utf8"; # encode all strings (that get printed to STDOUT)
                         # to the binary UTF-8 representation
print "–\n"; # EN DASH – works.

When opening a file, you can add PerlIO layers in the open mode, e.g.

open my $fh, "<:utf8", $filename or die "Cannot open $filename: $!";

This transparently translates the binary input to codepoints.
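As a minimal sketch of the layer approach described above, the following writes one line through an encoding layer and reads it back through another. The scratch file from File::Temp is only an assumption for the demonstration, and `:encoding(UTF-8)` is used here as the stricter relative of `:utf8`.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical scratch file, used only for this demonstration.
my ($tmp, $filename) = tempfile(UNLINK => 1);
close $tmp;

# Write: the layer encodes characters to UTF-8 bytes.
open my $out, '>:encoding(UTF-8)', $filename
    or die "Cannot write $filename: $!";
print {$out} "a\x{2013}b\n";          # "a", EN DASH, "b"
close $out;

# Read: the layer decodes UTF-8 bytes back to characters.
open my $in, '<:encoding(UTF-8)', $filename
    or die "Cannot read $filename: $!";
my $line = <$in>;
close $in;
chomp $line;

my $chars = length $line;             # counted in characters, not bytes
my $bytes = -s $filename;             # size on disk, counted in bytes

print "characters: $chars, bytes on disk: $bytes\n";
```

The character count is smaller than the byte count because the en dash is a single character that takes three bytes in UTF-8.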

Do not concatenate byte strings that contain binary UTF-8 with properly decoded strings – the result will most likely be invalid data. Of course, such issues don't arise when you decode all input.
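The effect of mixing the two kinds of string can be sketched as follows. The sample data (a 'ç' and an en dash) is chosen for illustration, and Encode's decode and encode stand in for what the PerlIO layers would otherwise do.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $bytes   = "\xC3\xA7";                # 'ç' as raw, undecoded UTF-8 bytes
my $decoded = decode('UTF-8', $bytes);   # one character, U+00E7
my $dash    = "\x{2013}";                # EN DASH as a character

# Wrong: raw bytes appended to a character string.
# Each byte is treated as a separate Latin-1 character.
my $mixed = $dash . $bytes;

# Right: decode first, then concatenate.
my $clean = $dash . $decoded;

my $mixed_out = encode('UTF-8', $mixed);
my $clean_out = encode('UTF-8', $clean);

printf "mixed: %d bytes, clean: %d bytes\n",
    length($mixed_out), length($clean_out);
```

The mixed string comes out longer because the two bytes of 'ç' are each encoded again as if they were characters, which is the classic double-encoding garbage.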

The way Perl buffers the input should not affect your program; it is likely you misdiagnosed that. Perl does not do encoding detection via BOMs on input files.
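Because no BOM detection happens, a UTF-8 BOM at the start of a file simply decodes to the character U+FEFF. A minimal sketch of removing it by hand, assuming an in-memory file with made-up content in place of the real upload:

```perl
use strict;
use warnings;

# In-memory "file": UTF-8 BOM (EF BB BF) followed by 'ç' and a newline,
# all as raw bytes. The content is an assumption for this demonstration.
my $raw = "\xEF\xBB\xBF\xC3\xA7\n";

open my $fh, '<:encoding(UTF-8)', \$raw
    or die "Cannot open in-memory file: $!";
my $first = <$fh>;
close $fh;

# After decoding, the BOM is the single character U+FEFF; strip it if present.
my $had_bom = ($first =~ s/\A\x{FEFF}//) ? 1 : 0;
chomp $first;

printf "BOM present: %d, remaining characters: %d\n",
    $had_bom, length($first);
```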

In the context of web programming, encoding your output as UTF-8 is a good choice, but make sure to also set the charset property in the response headers:

Content-Type: text/html; charset=UTF-8

The HTML document should reiterate this with <meta charset="UTF-8">.
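Put together, a response that declares its charset in both places might look like the sketch below. The helper name build_response is hypothetical, the body text is assumed to be already decoded, and HTML escaping is left out for brevity.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: returns the complete response as UTF-8 bytes.
sub build_response {
    my ($body_text) = @_;                 # decoded character string
    my $html = qq{<!DOCTYPE html>\n}
             . qq{<html><head><meta charset="UTF-8"></head>}
             . qq{<body><p>$body_text</p></body></html>\n};
    # The header is plain ASCII; only the document needs encoding.
    return "Content-Type: text/html; charset=UTF-8\r\n\r\n"
         . encode('UTF-8', $html);
}

my $response = build_response("a\x{2013}b");

binmode STDOUT;                           # raw bytes: already encoded above
print $response;
```

Encoding explicitly and printing to a raw handle is one option; adding an encoding layer to STDOUT with binmode and printing the decoded string is the other. The two should not be combined, or the output is encoded twice.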

answered Sep 8, 2013 at 1:14 by amon · edited Sep 8, 2013 at 6:03 by cjm


5 Comments

gvieira Over a year ago

Hi amon, thanks for the response. What do you mean by "print a string with high code points to a normal file handle"? I've gotten the "wide character in print" warning before and never truly understood it. Every time I open a file from disk I use: open FH, "<:encoding(UTF-8)"; And it goes ok. But how can I do this when CGI gives me only the $fh?

amon Over a year ago

@gvieira That is good, but you also need to encode the data when you print it out. This can be done with the binmode STDOUT, ":utf8" (or whatever filehandle and encoding you want). Then that warning should go away. A high code point is any character not in Latin-1 (IIRC).

amon Over a year ago

@gvieira That response was written before your edit. You need to add the :utf8 layer (or an equivalent layer) on both the input and the output file handle. In both cases, layers can be added with binmode, as shown.

gvieira Over a year ago

I'm doing a test:

binmode $fh, ":utf8";
open (AUXT, ">"."auxtext.txt");
binmode AUXT, ":utf8";
while (<$fh>){
    print AUXT $_;
}
close AUXT;

And I still get the wide character warning just because the uploaded file has a 'ç'. How could this even happen? Edit: The error.log says that the wide character error is on the line print AUXT $_;

amon Over a year ago

@gvieira Are you by chance using the upload function from the CGI module? In that case, try $fh = $fh->handle as soon as you obtain it. Could you please also list all applied PerlIO layers after specifying them with binmode? You can do so with, for example, say for PerlIO::get_layers(AUXT);
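A minimal sketch of that layer check, assuming an in-memory handle in place of AUXT so that it runs without touching the disk:

```perl
use strict;
use warnings;

my $buffer = '';
open my $fh, '>', \$buffer
    or die "Cannot open in-memory handle: $!";

my @before = PerlIO::get_layers($fh);   # layers on the freshly opened handle
binmode $fh, ':encoding(UTF-8)';
my @after  = PerlIO::get_layers($fh);   # same handle after adding the layer
close $fh;

print "before: @before\n";
print "after:  @after\n";
```

If the encoding layer does not show up in the second list for the real handle, the binmode call did not take effect on the handle that is actually being printed to.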

MJ Walsh · Answer · 2013-10-22 08:41:08Z

0

Try:

use Encode qw(encode);

$text = join '', <$fh>;

$text = encode("utf8", $text);

answered Oct 22, 2013 at 8:41 by MJ Walsh
