I'm not sure if this is a Perl issue, an Nginx issue, or an HTTP issue. I know there are a bazillion questions about character encoding, but I just can't figure this out. Anyway, here's the problem.

My web site pulls data from two different types of sources. Some of those sources are utf-8 files. Some of them are files which contain URL encoded data. The problem is that I can't figure out how to output the characters from both of those sources without getting funky characters in the web browser.

The following Perl script demonstrates the problem. You can see this script live and in action at https://www.mikobiko.com/demo.pl

#!/usr/bin/perl -wT
use strict;
use CGI;

# variables
my ($in, $from_file, $from_url);

# HTTP header
print qq|Content-type: text/html; charset=utf-8\n\n|;

# from utf-8 file
open($in, '<', './utf-8.txt') or die $!;
($from_file) = <$in>;
print "<h1>from utf-8 file</h1>\n";
print "<p>character: ", $from_file, "</p>\n";
print '<p>length: ', length($from_file), "</p>\n";

# from url encoded
print "<h1>from url encoded</h1>\n";
$from_url = '%F1';
$from_url = CGI::unescape($from_url);
print "<p>character: ", $from_url, "</p>\n";
print '<p>length: ', length($from_url), "</p>\n";

Here's what this script does. It outputs a standard Content-type header that declares the character set as utf-8.

Then it slurps in a UTF-8 encoded file that contains the character ñ (an "n" with a tilde over it). Then it outputs that character. You can see the source file itself at https://www.mikobiko.com/utf-8.txt . Here's the Linux "file" command output for that file:

utf-8.txt: UTF-8 Unicode text, with no line terminators

Then the script decodes the URL encoded string for ñ (%F1), then outputs that.

Here's a screenshot of what the browser shows. This screenshot is from Chrome, but Firefox does the same thing. The character that is from the utf-8 file is displayed with the little question mark symbol.

If I remove the "charset=utf-8" part of the Content-type, then the problem is reversed, and the URL decoded character is the one that is displayed funky.
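
A quick way to see why flipping the charset only moves the problem is to dump the raw bytes of both strings. This is a minimal sketch, not the script from the site: it assumes the file holds exactly the two UTF-8 bytes for ñ (simulated with a literal so it runs without the file), `hex_bytes` is a hypothetical helper, and `percent_decode` is a core-only stand-in for `CGI::unescape` that handles plain `%XX` escapes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: render each byte of a string as two hex digits.
sub hex_bytes {
    my ($s) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $s;
}

# Assumed stand-in for CGI::unescape: turn each %XX escape into
# the character with that code point.
sub percent_decode {
    my ($s) = @_;
    $s =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $s;
}

# Stand-in for the contents of utf-8.txt: the UTF-8 bytes for ñ.
my $from_file = "\xC3\xB1";

# The URL encoded value from the question.
my $from_url = percent_decode('%F1');

print 'file: ', hex_bytes($from_file), "\n";   # prints "file: C3 B1"
print 'url:  ', hex_bytes($from_url),  "\n";   # prints "url:  F1"
```

The two strings hold the same character in two different encodings (UTF-8 versus a single Latin-1 byte), so whichever charset the header declares, one of them is interpreted wrongly.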

Here's some system info:

nginx: nginx/1.10.3 (Ubuntu)

Perl: perl 5, version 22, subversion 1 (v5.22.1)

Linux on the server:

Distributor ID: Ubuntu
Description:    Ubuntu 16.04.2 LTS
Release:        16.04
Codename:       xenial

Please let me know if there's any other info I can provide to help solve this problem. Thanks!

1 Answer

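OK, so I figured it out. The URL encoded value %F1 decodes to the single byte 0xF1, which is ñ in Latin-1, while the utf-8 file stores ñ as the two bytes 0xC3 0xB1. So after the string is URL decoded, it needs to be encoded as UTF-8. First load the Encode module: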

use Encode 'encode';

Then encode the string:

$from_url = encode('UTF-8', $from_url);

Easy peasy.
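
Put together, the fix can be sketched like this. It is an illustration under assumptions rather than the exact code on the site: `percent_decode` is a core-only stand-in for `CGI::unescape`, and the literal `"\xC3\xB1"` stands in for the contents of utf-8.txt.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode 'encode';

# Assumed stand-in for CGI::unescape: turn each %XX escape into
# the character with that code point.
sub percent_decode {
    my ($s) = @_;
    $s =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $s;
}

my $from_url = percent_decode('%F1');       # one character, U+00F1
my $utf8     = encode('UTF-8', $from_url);  # two bytes, 0xC3 0xB1

print 'length before: ', length($from_url), "\n";   # prints 1
print 'length after:  ', length($utf8),     "\n";   # prints 2
print 'same bytes as the file: ',
    ($utf8 eq "\xC3\xB1" ? 'yes' : 'no'), "\n";     # prints yes
```

This only holds when the escapes represent Latin-1 characters, as `%F1` does here. If a source already percent-encodes UTF-8 bytes (for example `%C3%B1`), encoding the decoded string again would double-encode it.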
