
I have the following code:

<?php
define('PREG_CLASS_SEARCH_EXCLUDE',
'\x{0}-\x{2c}\x{2e}-\x{2f}\x{3a}-\x{40}\x{5b}-\x{60}\x{7b}-\x{bf}\x{d7}\x{f7}\x{2b0}-'.
'\x{385}\x{387}\x{3f6}\x{482}-\x{489}\x{559}-\x{55f}\x{589}-\x{5c7}\x{5f3}-'.
'\x{61f}\x{640}\x{64b}-\x{65e}\x{66a}-\x{66d}\x{670}\x{6d4}\x{6d6}-\x{6ed}'.
'\x{6fd}\x{6fe}\x{700}-\x{70f}\x{711}\x{730}-\x{74a}\x{7a6}-\x{7b0}\x{901}-'.
'\x{903}\x{93c}\x{93e}-\x{94d}\x{951}-\x{954}\x{962}-\x{965}\x{970}\x{981}-'.
'\x{983}\x{9bc}\x{9be}-\x{9cd}\x{9d7}\x{9e2}\x{9e3}\x{9f2}-\x{a03}\x{a3c}-'.
'\x{a4d}\x{a70}\x{a71}\x{a81}-\x{a83}\x{abc}\x{abe}-\x{acd}\x{ae2}\x{ae3}'.
'\x{af1}-\x{b03}\x{b3c}\x{b3e}-\x{b57}\x{b70}\x{b82}\x{bbe}-\x{bd7}\x{bf0}-'.
'\x{c03}\x{c3e}-\x{c56}\x{c82}\x{c83}\x{cbc}\x{cbe}-\x{cd6}\x{d02}\x{d03}'.
'\x{d3e}-\x{d57}\x{d82}\x{d83}\x{dca}-\x{df4}\x{e31}\x{e34}-\x{e3f}\x{e46}-'.
'\x{e4f}\x{e5a}\x{e5b}\x{eb1}\x{eb4}-\x{ebc}\x{ec6}-\x{ecd}\x{f01}-\x{f1f}'.
'\x{f2a}-\x{f3f}\x{f71}-\x{f87}\x{f90}-\x{fd1}\x{102c}-\x{1039}\x{104a}-'.
'\x{104f}\x{1056}-\x{1059}\x{10fb}\x{10fc}\x{135f}-\x{137c}\x{1390}-\x{1399}'.
'\x{166d}\x{166e}\x{1680}\x{169b}\x{169c}\x{16eb}-\x{16f0}\x{1712}-\x{1714}'.
'\x{1732}-\x{1736}\x{1752}\x{1753}\x{1772}\x{1773}\x{17b4}-\x{17db}\x{17dd}'.
'\x{17f0}-\x{180e}\x{1843}\x{18a9}\x{1920}-\x{1945}\x{19b0}-\x{19c0}\x{19c8}'.
'\x{19c9}\x{19de}-\x{19ff}\x{1a17}-\x{1a1f}\x{1d2c}-\x{1d61}\x{1d78}\x{1d9b}-'.
'\x{1dc3}\x{1fbd}\x{1fbf}-\x{1fc1}\x{1fcd}-\x{1fcf}\x{1fdd}-\x{1fdf}\x{1fed}-'.
'\x{1fef}\x{1ffd}-\x{2070}\x{2074}-\x{207e}\x{2080}-\x{2101}\x{2103}-\x{2106}'.
'\x{2108}\x{2109}\x{2114}\x{2116}-\x{2118}\x{211e}-\x{2123}\x{2125}\x{2127}'.
'\x{2129}\x{212e}\x{2132}\x{213a}\x{213b}\x{2140}-\x{2144}\x{214a}-\x{2b13}'.
'\x{2ce5}-\x{2cff}\x{2d6f}\x{2e00}-\x{3005}\x{3007}-\x{303b}\x{303d}-\x{303f}'.
'\x{3099}-\x{309e}\x{30a0}\x{30fb}\x{30fd}\x{30fe}\x{3190}-\x{319f}\x{31c0}-'.
'\x{31cf}\x{3200}-\x{33ff}\x{4dc0}-\x{4dff}\x{a015}\x{a490}-\x{a716}\x{a802}'.
'\x{a806}\x{a80b}\x{a823}-\x{a82b}\x{d800}-\x{f8ff}\x{fb1e}\x{fb29}\x{fd3e}'.
'\x{fd3f}\x{fdfc}-\x{fe6b}\x{feff}-\x{ff0f}\x{ff1a}-\x{ff20}\x{ff3b}-\x{ff40}'.
'\x{ff5b}-\x{ff65}\x{ff70}\x{ff9e}\x{ff9f}\x{ffe0}-\x{fffd}');
$string = preg_replace('/['.PREG_CLASS_SEARCH_EXCLUDE.']+/u', ' ', $string);

Afterwards, $string is null, which means an error occurred (as described in the PHP manual). However, preg_last_error() returns 0, which means no error occurred.

This happens on a server with PHP 5.4. On servers with PHP < 5.4 everything is fine and the code works as intended.

What could be the reason for this behaviour?

  • Have you enabled error reporting? I got a nice error message on some PHP versions. Commented Aug 28, 2013 at 19:33
  • I think so, but maybe I'm wrong. I will check again in a minute. How is it possible that it works everywhere except on one server with 5.4? Anyway, it's not my code: it's the Prestashop 1.5 Search class, taken from Drupal, so it should have been tested. Commented Aug 28, 2013 at 20:00
  • Replace \x{1a1f} by \x{2116} ;) Commented Aug 28, 2013 at 20:13
  • Well, PHP ships with a library called PCRE, and each library comes in different versions and configurations. Maybe that's the cause? I can get rid of the error by removing the piece that triggers it: take a look. If you specify exactly what you want to do, maybe I could come up with a new regex for your purpose. Commented Aug 28, 2013 at 20:13
  • I found online that the problem is the \x{d800} character. However, many other characters in the class are also above 255, and those work. Problematic PCRE library version: 8.32 2012-11-30. Working PCRE library version: 8.02 2010-03-19. Commented Aug 28, 2013 at 21:03
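
The first comment's suggestion about error reporting can be sketched as follows. When a pattern fails to compile, PHP raises a warning instead of reporting it the way it reports runtime matching errors, which would explain why the question saw preg_last_error() return 0. The helper name preg_replace_with_warning() is hypothetical and used only for this illustration; the exact warning text depends on the PCRE version.

```php
<?php
// Hypothetical helper: runs preg_replace() and returns the result together
// with any warning message PHP raised while doing so.
function preg_replace_with_warning($pattern, $replacement, $subject)
{
    $warning = null;
    set_error_handler(function ($errno, $errstr) use (&$warning) {
        $warning = $errstr;
        return true; // do not fall through to the default handler
    });
    $result = preg_replace($pattern, $replacement, $subject);
    restore_error_handler();
    return array($result, $warning);
}

// \x{d800} is a surrogate code point; PCRE 8.30 and later refuse to
// compile it in a /u pattern.
list($result, $warning) = preg_replace_with_warning('/[\x{d800}-\x{f8ff}]+/u', ' ', 'abc');

var_dump($result);   // NULL when the pattern does not compile
echo $warning, "\n"; // the compilation error reported by PCRE, if any
```

With a capturing handler like this, the message is available even when display of warnings is turned off on the server.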

1 Answer


I think I found the reason.

According to the changelog at http://www.pcre.org/changelog.txt:

Version 8.30 04-February-2012

9. The invalid Unicode surrogate codepoints U+D800 to U+DFFF are now rejected if they appear, or are escaped, in patterns.
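
Given that changelog entry, one possible workaround is to start the offending range just past the surrogates. This is a minimal sketch on an excerpt of the character class from the question, not the full constant. It assumes that well-formed UTF-8 input cannot contain surrogate code points, so dropping U+D800 to U+DFFF from the class should not change what matches.

```php
<?php
// Excerpt of the character class from the question (the full list is omitted).
$excerpt = '\x{a806}\x{a80b}\x{a823}-\x{a82b}\x{d800}-\x{f8ff}\x{fb1e}\x{fb29}\x{fd3e}';

// Assumption: beginning the range at U+E000 skips the surrogates
// U+D800..U+DFFF while still covering the rest of the original range.
$patched = str_replace('\x{d800}-\x{f8ff}', '\x{e000}-\x{f8ff}', $excerpt);

// "\xEE\x80\x81" is the UTF-8 encoding of U+E001, a private-use code point
// that lies inside the patched range.
$string = preg_replace('/[' . $patched . ']+/u', ' ', "foo\xEE\x80\x81bar");

var_dump($string); // string(7) "foo bar"
```

In the real code the same one-range change would be made directly in the PREG_CLASS_SEARCH_EXCLUDE definition rather than with str_replace().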


5 Comments

For those not familiar, "surrogate pairs" are the bit patterns used by UTF-16 to represent characters beyond the range which can be represented in 16 bits. To avoid ambiguity, those codepoints are reserved in the Unicode standard and cannot be used as characters in their own right.
If that's the case, what are they for? I can't quite understand their purpose.
Unicode is a list of abstract "code points", each assigned a number and a meaning, but not a way of representing in bits on disk etc. UTF-16 is a way of representing Unicode in groups of 16 bits, which allows the most common characters to be trivially mapped straight to their codepoint "number"; for those codepoints that cannot be mapped that way, "surrogate" bit patterns are used. If these bit patterns also corresponded directly to codepoints, it would complicate things, so the Unicode standard promises never to define those codepoints, and any attempt to treat them as characters is an error.
In other words, those numbers have a meaning in UTF-16, but not in Unicode itself. So searching for them as Unicode characters is an error.
I think I understand, more or less. Character mapping is not my strong point at the moment, so I've got some reading to do. Thank you!
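
To make the comments above concrete, here is a small worked example of how UTF-16 turns one code point above U+FFFF into two surrogate values. The function name to_surrogate_pair() is hypothetical and exists only for this illustration.

```php
<?php
// Splits a code point above U+FFFF into its UTF-16 high and low surrogates.
function to_surrogate_pair($codepoint)
{
    $offset = $codepoint - 0x10000;     // leaves a 20-bit value
    $high = 0xD800 + ($offset >> 10);   // top 10 bits
    $low  = 0xDC00 + ($offset & 0x3FF); // bottom 10 bits
    return array($high, $low);
}

list($high, $low) = to_surrogate_pair(0x1F600);
printf("U+1F600 -> %04X %04X\n", $high, $low); // U+1F600 -> D83D DE00
```

Both halves always fall inside U+D800 to U+DFFF, which is why that block is reserved and never assigned to characters of its own.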
