
It seems that LWP::UserAgent always encodes form data as UTF-8, even if I explicitly encode it as ISO-8859-1 as follows:

use Encode;
use LWP::UserAgent;
use utf8;

my $ua = LWP::UserAgent->new;
$ua->post('http://localhost:8080/', {
    text => encode("iso-8859-1", 'è'),
});

Request content is text=%C3%A8. How can I have è encoded as %E8 instead?

  • How did you determine the request content? Network sniffer definitely says text=%E8: i.sstatic.net/rM3xS.png Commented Jun 23, 2011 at 18:34
  • Interesting. I'm running nc on port 8080, and I get text=%C3%A8. Specs: MacOS X 10.6, perl v5.10.0, libwww-perl/5.837. Commented Jun 23, 2011 at 19:40

3 Answers

2

Hehe. :-) This has to do with the incrementally growing support for Unicode over the last dozen or so Perl releases and with the regex feature \C used by the URI module or, to be more precise, by URI::Escape. Read this thread on perl-unicode from 2010 (Don't use the \C escape in regexes - Why not?) to understand the background.

Why the URI module? Because it is used to do form and URL encoding by HTTP::Request::Common.
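To make that connection concrete, here is a minimal sketch that form-encodes the same pair once through URI's query_form and once through HTTP::Request::Common's POST. The value 'a b' is an illustrative assumption, chosen to be plain ASCII so that none of the encoding issues discussed here come into play. If both modules are installed, the two printed strings should be identical.

```perl
use strict;
use warnings;
use URI;
use HTTP::Request::Common qw( POST );

# Illustrative form data (an assumption, not taken from the question).
my %form = ( text => 'a b' );

# Form-encode the pair the way a URI query string is built.
my $uri = URI->new('http:');
$uri->query_form(%form);
my $via_uri = $uri->query;

# Let HTTP::Request::Common build the request body from the same data.
my $via_post = POST( 'http://localhost:8080/', \%form )->content;

print "$via_uri\n$via_post\n";
```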

Meanwhile, here's a script I wrote to remind myself of how tricky this issue is, especially as the URI module is such a frequently used one:

use 5.010;
use utf8;
# Perl and URI.pm might behave differently when you encode your script in
# Latin1 and drop the utf8 pragma.
use Encode;
use URI;
use Test::More;
use constant C3A8 => 'text=%C3%A8';
use constant   E8 => 'text=%E8';
diag "Perl $^V";
diag "URI.pm $URI::VERSION";
my $chars = 'è';
my $octets = encode 'iso-8859-1', $chars;
my $uri = URI->new('http:');

$uri->query_form( text => $chars );
is $uri->query, C3A8, C3A8;

my @exp;
given ( "$^V $URI::VERSION" ) {
        when ( 'v5.12.3 1.56' ) { @exp = (   E8, C3A8 ) }
        when ( 'v5.10.1 1.54' ) { @exp = ( C3A8, C3A8 ) }
        when ( 'v5.10.1 1.58' ) { @exp = ( C3A8, C3A8 ) }
        default                 { die 'not tested :-)' }
}

$uri->query_form( text => $octets );
is $uri->query, $exp[0], $exp[0];

utf8::upgrade $octets;
$uri->query_form( text => $octets );
is $uri->query, $exp[1], $exp[1];

done_testing;

So what I get (on Windows and Cygwin) is:

C:\Windows\system32 :: perl \Opt\Cygwin\tmp\uri.pl
# Perl v5.12.3
# URI.pm 1.56
ok 1 - text=%C3%A8
ok 2 - text=%E8
ok 3 - text=%C3%A8
1..3

And:

MiLu@Dago: ~/comp > perl /tmp/uri.pl
# Perl v5.10.1
# URI.pm 1.54
ok 1 - text=%C3%A8
ok 2 - text=%C3%A8
ok 3 - text=%C3%A8
1..3

UPDATE

You can handcraft the request body:

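use utf8;
use Encode;
use HTTP::Request;
use LWP::UserAgent;
my $chars = 'ölè';
my $octets = encode( 'iso-8859-1', $chars );
# Percent-encode every octet outside ASCII; ASCII octets pass through as is.
my $body = 'text=' .
        join '',
        map { my $o = ord $_; $o < 128 ? $_ : sprintf '%%%02X', $o }
        split //, $octets;
my $uri = 'http://localhost:8080/';
# Declare the body as form data so the server parses it as such.
my $req = HTTP::Request->new( POST => $uri,
        [ 'Content-Type' => 'application/x-www-form-urlencoded' ], $body );
print $req->as_string;
my $ua = LWP::UserAgent->new;
my $rsp = $ua->request( $req );
print $rsp->as_string;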
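
A possible variation, offered as a sketch rather than a tested fix for every Perl and URI.pm combination: the loop above passes ASCII octets through unchanged, so characters such as & or = in the value would not be escaped. Applying uri_escape from URI::Escape to the already-encoded octets should cover those too. The characters are written as \x escapes here so the example does not depend on how the script file is encoded.

```perl
use strict;
use warnings;
use Encode      qw( encode );
use URI::Escape qw( uri_escape );

# Same sample text as above ('ölè'), written with escapes.
my $chars  = "\x{F6}l\x{E8}";
my $octets = encode( 'iso-8859-1', $chars );

# uri_escape works octet by octet on a byte string, so 0xE8 becomes %E8.
my $body = 'text=' . uri_escape($octets);

print "$body\n";
```

The resulting $body can be passed to HTTP::Request->new in place of the hand-built string.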

2 Comments

Good catch, Lumi. That's something ugly ugly ugly. Upgrading URI.pm to 1.58 didn't fix the issue; it seems that I need Perl 5.12 for that to work out of the box. So, back to my question, is building the HTTP::Request manually the only portable way to achieve %E8?
I don't know but I would pursue that road as it is easy enough. I updated my answer to include the code.
1
use strict;
use warnings;
use utf8;  # Script is encoded using UTF-8.

use Encode                qw( encode );
use HTTP::Request::Common qw( POST );  # This is what ->post uses

my $req = POST('http://localhost:8080/', {
    text => encode("iso-8859-1", 'è'),
});

print($req->as_string());

gives

POST http://localhost:8080/
Content-Length: 8
Content-Type: application/x-www-form-urlencoded

text=%E8

Are you sure you are passing «è» and not its UTF-8 encoding? If I use its UTF-8 encoding, I get the same result as you.

...
my $req = POST('http://localhost:8080/', {
    text => encode("iso-8859-1", encode("UTF-8", 'è')),
});
...

gives

POST http://localhost:8080/
Content-Length: 11
Content-Type: application/x-www-form-urlencoded

text=%C3%A8


1

Short answer to myself: just put the variable name (i.e. "text") in quotes instead of writing it as a bareword.

$ua->post('http://localhost:8080/', {
    'text' => encode("iso-8859-1", 'è'),
});

Rationale: this weird behaviour is caused by the combination of the following factors:

  • Perl bug #68812 caused the internal UTF-8 flag to be set on all barewords. This was fixed in later Perl versions (>= 5.12);
  • URI.pm concatenates keys and values (i.e. "text=è") before converting characters, so the value is always promoted to UTF-8 if the key has the internal flag set, even if you passed the value as octets.
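
The second factor can be reproduced without URI.pm. The sketch below is only an illustration of the mechanism: it forces the flag onto the key with utf8::upgrade, as a stand-in for what the bareword bug did, and then inspects the concatenated string. The variable names are made up for the example.

```perl
use strict;
use warnings;
use Encode qw( encode );

# A key carrying the internal UTF-8 flag, standing in for the
# bareword affected by the bug described above.
my $key = 'text';
utf8::upgrade($key);

# The value is a single ISO-8859-1 octet (0xE8).
my $value = encode( 'iso-8859-1', "\x{E8}" );

# Concatenating with a flagged string upgrades the whole result.
my $pair = "$key=$value";
printf "flagged: %s\n", utf8::is_utf8($pair) ? 'yes' : 'no';

# Seen as its internal UTF-8 bytes, the octet 0xE8 is now C3 A8.
my $internal = $pair;
utf8::encode($internal);
printf "internal bytes: %s\n",
    join ' ', map { sprintf '%02X', ord } split //, $internal;
```

The last line prints 74 65 78 74 3D C3 A8, which matches the %C3%A8 seen in the request.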

I don't think that the bug pointed out by @Lumi about URI.pm using \C has any effect on this specific issue.
