
We are dealing with a strange bug on a Joyent Solaris server that has never happened before (it doesn't happen on localhost or on two other Solaris servers with identical PHP configuration). Actually, I'm not sure whether we should be looking at PHP or at Solaris, or whether it is a software or a hardware problem...

I just want to post this in case somebody can point us in the right direction.

So, the problem seems to be in var_export() when dealing with non-ASCII characters. Executing this in the CLI, we get the expected result on our localhost machines and on two of the servers, but not on the third one. All of them are configured to work with UTF-8.

$ php -r "echo var_export('ñu', true);"

Gives this in older servers and localhost (expected):

'ñu'

But in the server we are having problems with (PHP Version => 5.3.6), it adds \0 null characters whenever it encounters an "uncommon" character: è, á, ç, ... you name it.

'' . "\0" . '' . "\0" . 'u'

Any idea where we should be looking? Thanks in advance.


More info:

  • PHP version 5.3.6.
  • setlocale() is not solving anything.
  • default_charset is UTF-8 in php.ini.
  • mbstring.internal_encoding is set to UTF-8 in php.ini.
  • mbstring.func_overload = 0.
  • this happens in both CLI (example) and web application (php-fpm + nginx).
  • iconv encoding is also UTF-8
  • all files utf-8 encoded.

system('locale') returns:

LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_ALL=

Some of the tests done so far (CLI):

Normal behaviour:

$ php -r "echo bin2hex('ñu');" => 'c3b175'
$ php -r "echo mb_strtoupper('ñu');" => 'ÑU'
$ php -r "echo serialize(\"\\xC3\\xB1\");" => 's:2:"ñ";'
$ php -r "echo bin2hex(addcslashes(b\"\\xC3\\xB1\", \"'\\\\\"));" => 'c3b1'
$ php -r "echo ucfirst('iñu');" => 'Iñu'

Not normal:

$ php -r "echo strtoupper('ñu');" => 'U' 
$ php -r "echo ucfirst('ñu');" => '?u' 
$ php -r "echo ucfirst(b\"\\xC3\\xB1u\");" => '?u' 
$ php -r "echo bin2hex(ucfirst('ñu'));" => '00b175'
$ php -r "echo bin2hex(var_export('ñ', 1));" => '2727202e20225c3022202e202727202e20225c3022202e202727'
$ php -r "echo bin2hex(var_export(b\"\\xC3\\xB1\", 1));" => '2727202e20225c3022202e202727202e20225c3022202e202727'

So the problem seems to be in var_export() and in "string functions that use the current locale but operate byte-by-byte", as the PHP docs put it (see @hakre's answer).
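To compare servers without the terminal getting in the way, the tests above can be collected into one script that prints every result as hex. This is only a sketch: hex_report() is a hypothetical helper name, not something from the original tests, and it assumes the mbstring extension is loaded.

```php
<?php
// Hypothetical helper (not from the original post): run each function under
// test on the same input and return the results as hex strings, so the output
// can be diffed between servers regardless of terminal encoding.
function hex_report($input)
{
    $results = array(
        'input'         => $input,
        'ucfirst'       => ucfirst($input),
        'strtoupper'    => strtoupper($input),
        'var_export'    => var_export($input, true),
        'mb_strtoupper' => mb_strtoupper($input, 'UTF-8'),
    );
    $report = array();
    foreach ($results as $name => $value) {
        $report[$name] = bin2hex($value);
    }
    return $report;
}

// "\xC3\xB1u" is the UTF-8 byte sequence for 'ñu', written as escapes so the
// script does not depend on the encoding of the source file.
foreach (hex_report("\xC3\xB1u") as $name => $hex) {
    printf("%-14s %s\n", $name, $hex);
}
```

Running the same file on each server and diffing the output should show exactly which functions diverge.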

  • I'd start by checking the version of the software running on each server, specifically PHP. A function in one version may assume UTF-8 while the same function in a different version assumes ISO-8859-1. Commented Mar 16, 2012 at 17:25
  • Also try comparing the output of locale(1) and/or checking the environment variables that start with LC. Commented Mar 20, 2012 at 7:31
  • Does this only happen on the CLI? That may be some special case of how Solaris' terminal handles Unicode. Or does this happen as well when running from source code files that are guaranteed not to contain NUL bytes? Commented Apr 11, 2012 at 13:38
  • Check two things. First, the php.ini that gets used by the CLI (it might differ from the one used by the web server): set default_charset to "utf-8" there. Second, check /etc/locale.gen to see whether that server even has an en_US.UTF-8 locale. Commented Apr 12, 2012 at 12:05
  • I'm sure this is related to Solaris and the system C libraries that are used by PHP. I'd say that the compiled packages have been messed up by the hoster, otherwise strtoupper would be working. Get proper binaries. Commented Apr 16, 2012 at 21:18
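The checks suggested in these comments can also be run from inside PHP, which shows what the interpreter itself sees rather than what the shell reports. A minimal sketch, where environment_report() is a hypothetical helper name:

```php
<?php
// Hypothetical helper: collect the settings the comments suggest comparing.
function environment_report()
{
    return array(
        // Which php.ini this SAPI actually loaded (false if none was loaded).
        'php_ini'         => php_ini_loaded_file(),
        // Passing "0" asks setlocale() for the current value without changing it.
        'lc_ctype'        => setlocale(LC_CTYPE, '0'),
        'default_charset' => ini_get('default_charset'),
        // Environment variables as seen by the PHP process.
        'LANG'            => getenv('LANG'),
        'LC_ALL'          => getenv('LC_ALL'),
        'php_version'     => PHP_VERSION,
    );
}

var_dump(environment_report());
```

Running it both from the CLI and through the web server on each machine gives a like-for-like comparison.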

5 Answers


I suggest you verify the PHP binary you've got problems with. Check the compiler flags and the libraries it makes use of.

Normally PHP uses binary strings internally, which means that functions like ucfirst work byte by byte and only support what your locale supports (if and as configured). See "Details of the String Type" in the PHP docs.

$ php -r "echo ucfirst('ñu');" 

returns

?u

This makes sense, ñ is

LATIN SMALL LETTER N WITH TILDE (U+00F1)    UTF8: \xC3\xB1

You have some locale configured that makes PHP change \xC3 into something else, breaking the UTF-8 byte sequence and making your shell display the � replacement character (see Wikipedia).
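To make the byte-level picture concrete, here is a small sketch that does not depend on the locale at all. It simulates the corruption by hand, replacing the lead byte with \x00 as seen in the 00b175 result reported in the question, and assumes mbstring is available for the validity check.

```php
<?php
$s = "\xC3\xB1u";                 // UTF-8 bytes for 'ñu'

echo strlen($s), "\n";            // 3: two bytes for ñ, one for u
echo bin2hex($s[0]), "\n";        // c3: the only byte a byte-wise ucfirst() looks at

// Simulate what the broken server does: the lead byte is replaced and the
// continuation byte \xB1 is left behind on its own.
$broken = "\x00" . substr($s, 1);
echo bin2hex($broken), "\n";      // 00b175

var_dump(mb_check_encoding($s, 'UTF-8'));      // bool(true)
var_dump(mb_check_encoding($broken, 'UTF-8')); // bool(false)
```

An orphaned continuation byte is not valid UTF-8, which is why the terminal falls back to a placeholder glyph.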

If you really want to analyze the issue, I suggest you start with hexdumps next to how things get displayed in the shell and elsewhere. Note that you can explicitly define binary strings with b"string" (that's forward compatibility; maybe you've enabled some compile flag and you're on the experimental Unicode build?), and you can also write the bytes literally, here as hex escapes for UTF-8:

 $ php -r "echo ucfirst(b\"\\xC3\\xB1u\");"

There are a lot more settings that can play a role; I started to list some points in an answer to "Preparing PHP application to use with UTF-8".


Example of a multibyte ucfirst variant:

/**
 * multibyte ucfirst
 *
 * @param string $str
 * @param string|null $encoding (optional)
 * @return string
 */
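function mb_ucfirst($str, $encoding = NULL)
{
    // Resolve a missing encoding explicitly instead of passing NULL through.
    if ($encoding === NULL) {
        $encoding = mb_internal_encoding();
    }
    $first = mb_substr($str, 0, 1, $encoding);
    // Measure the remainder in characters (mb_strlen), not bytes (strlen).
    $rest = mb_substr($str, 1, mb_strlen($str, $encoding), $encoding);
    return mb_strtoupper($first, $encoding) . $rest;
}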

See also mb_strtoupper and mb_convert_case in the PHP docs.
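As a rough illustration of the two functions just mentioned, the following sketch applies mb_convert_case to the same test string. It assumes the mbstring extension is loaded and passes the encoding explicitly so the result does not depend on mbstring.internal_encoding.

```php
<?php
$s = "\xC3\xB1u"; // UTF-8 bytes for 'ñu'

// Upper-case the whole string, character by character.
$upper = mb_convert_case($s, MB_CASE_UPPER, 'UTF-8');
echo $upper, "\n";          // ÑU
echo bin2hex($upper), "\n"; // c39155

// Title case: upper-cases the first letter of each word.
$title = mb_convert_case($s, MB_CASE_TITLE, 'UTF-8');
echo $title, "\n";          // Ñu
echo bin2hex($title), "\n"; // c39175
```

Note that MB_CASE_TITLE is not a drop-in replacement for ucfirst: it capitalizes every word and lower-cases the remaining letters, so the mb_ucfirst variant is closer when only the first character should change.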


11 Comments

I've made the 'hexadecimal test': all the servers, including the 'bad guy', return c3b175 when executing $ php -r "echo bin2hex('ñu');". Not sure how I should interpret this...
And $ php -r "echo ucfirst(b\"\\xC3\\xB1u\");" returns ?u.
And what does bin2hex(ucfirst('ñu')); give? (Your report shows that in both cases PHP uses the UTF-8 sequences inside the strings, so that part is the same across those systems.)
bin2hex(ucfirst('ñu')); returns 00b175.
Well, as even strtoupper is affected, I suspect this is related to the underlying C libraries PHP was compiled against. You should check with Joyent support and ask them to provide a properly configured/compiled binary. I also suggest you get a more current PHP 5.3 release, such as PHP 5.3.10.

Try forcing UTF-8 in PHP:

<?php ini_set('default_charset', 'UTF-8'); ?>

Put it at the very top (first line of code) of every page/template. It mostly fixes my problems with special characters. I'm not sure it will help you too, but try it.

1 Comment

default_charset is UTF-8 in php.ini. Thanks anyway.

Probably all your servers are in a good state. In one of the comments you said that you only have issues with ucfirst() and var_export(). Given those results, you might be looking at the problem described in this Stack Overflow question. Most of the PHP string functions will not work properly with multibyte strings. That is why PHP has a separate set of functions to deal with them.

This might be helpful


I normally use utf8_encode('ñu') for all the French characters.

1 Comment

Thanks Vinay, but it seems to be an underlying C problem, maybe a compilation problem. Still trying to find it out, but PHP doesn't seem to be the source of the problem.

phpunit tests for this are being added to https://gist.github.com/68f5781a83a8986b9d30 - can we build up a better unit test suite so that we can figure out what the expected output should be?
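As a starting point that does not need PHPUnit installed, the results the question lists as expected or normal can be encoded as plain comparisons. This is only a sketch and is not taken from the gist: check_expected() is a hypothetical name, the cases are limited to outputs the question explicitly reports as correct, and mbstring is assumed.

```php
<?php
// Hypothetical helper: returns the cases whose output differs from what the
// question reports for the healthy servers, with the actual bytes as hex.
function check_expected()
{
    $n = "\xC3\xB1"; // UTF-8 bytes for 'ñ'
    $cases = array(
        // name => array(actual, expected)
        'bin2hex'       => array(bin2hex($n . 'u'), 'c3b175'),
        'var_export'    => array(var_export($n . 'u', true), "'" . $n . "u'"),
        'serialize'     => array(serialize($n), 's:2:"' . $n . '";'),
        'ucfirst ascii' => array(ucfirst('i' . $n . 'u'), 'I' . $n . 'u'),
        'mb_strtoupper' => array(mb_strtoupper($n . 'u', 'UTF-8'), "\xC3\x91U"),
    );
    $failures = array();
    foreach ($cases as $name => $pair) {
        list($actual, $expected) = $pair;
        if ($actual !== $expected) {
            $failures[$name] = bin2hex($actual);
        }
    }
    return $failures;
}

$failures = check_expected();
echo $failures ? "FAILED:\n" . print_r($failures, true) : "all cases match\n";
```

Each case could later be moved into a PHPUnit test method once the expected values are agreed on.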
