PHP regex word boundary matching in UTF-8

Question

I have the following PHP code in a UTF-8 encoded PHP file:

var_dump(setlocale(LC_CTYPE, 'de_DE.utf8', 'German_Germany.utf-8', 'de_DE', 'german'));
var_dump(mb_internal_encoding());
var_dump(mb_internal_encoding('utf-8'));
var_dump(mb_internal_encoding());
var_dump(mb_regex_encoding());
var_dump(mb_regex_encoding('utf-8'));
var_dump(mb_regex_encoding());
var_dump(preg_replace('/\bweiß\b/iu', 'weiss', 'weißbier'));

I would like the last regex to replace only whole words, not parts of words.

On my Windows computer, it returns:

string 'German_Germany.1252' (length=19)
string 'ISO-8859-1' (length=10)
boolean true
string 'UTF-8' (length=5)
string 'EUC-JP' (length=6)
boolean true
string 'UTF-8' (length=5)
string 'weißbier' (length=9)

On the web server (Linux), I get:

string(10) "de_DE.utf8"
string(10) "ISO-8859-1"
bool(true)
string(5) "UTF-8"
string(10) "ISO-8859-1"
bool(true)
string(5) "UTF-8"
string(9) "weissbier"

Thus, the regex works as I expected on Windows but not on Linux.

So the main question is: how should I write my regex so that it matches only at word boundaries?

A secondary question is how I can tell Windows that I want to use UTF-8 in my PHP application.

Alan Moore · Accepted Answer · 2015-09-30 10:45:51Z

19

Even in UTF-8 mode, standard class shorthands like \w and \b are not Unicode-aware. You just have to use the Unicode shorthands, as you worked out, but you can make it a little less ugly by using lookarounds instead of alternations:

/(?<!\pL)weiß(?!\pL)/u

Notice also how I left the curly braces out of the Unicode class shorthands; you can do that when the class name consists of a single letter.
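As a rough sketch of how the lookaround pattern could be applied, the snippet below wraps it in a small helper. The function name replaceWholeWord is hypothetical (it does not come from the answer), and the call to preg_quote() is an added assumption so that words containing regex metacharacters are handled; the sample strings are taken from the question.

```php
<?php
// Hypothetical helper (not part of the original answer): replaces $word only
// where it is not directly preceded or followed by a Unicode letter.
function replaceWholeWord(string $word, string $replacement, string $subject): ?string
{
    // preg_quote() escapes regex metacharacters in $word; '/' is the delimiter.
    $pattern = '/(?<!\pL)' . preg_quote($word, '/') . '(?!\pL)/iu';

    // preg_replace() returns null on error, e.g. for invalid UTF-8 input.
    return preg_replace($pattern, $replacement, $subject);
}

var_dump(replaceWholeWord('weiß', 'weiss', 'weißbier'));  // left unchanged
var_dump(replaceWholeWord('weiß', 'weiss', 'weiß bier')); // "weiss bier"
var_dump(replaceWholeWord('weiß', 'weiss', ' weiß'));     // " weiss"
```

Because lookarounds only assert what surrounds the word, nothing has to be captured and put back in the replacement string.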

edited Sep 30, 2015 at 10:45

answered Mar 15, 2010 at 17:12


2 Comments

Álvaro González Over a year ago

+1 - \w and \b appear to work as expected in recent PHP versions, but they're definitely not something you can rely on, since they'll probably break when you deploy your app.

Andreas W. Wylach Over a year ago

Also note the accepted answer here: stackoverflow.com/questions/4781898/… if you want to use the unicode shorthands!

bobble bubble · Accepted Answer · 2016-12-10 10:32:13Z

5

Guess this was related to Bug #52971

PCRE-Meta-Characters like \b \w not working with unicode strings.

and fixed in PHP 5.3.4

PCRE extension: Fixed bug #52971 (PCRE-Meta-Characters not working with utf-8).
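Since the behaviour depends on the PHP version, one way to check a given installation is to probe it at runtime. The following is only a minimal sketch; the function name pcreWordBoundaryIsUnicodeAware is made up for illustration. It assumes that if \b treats ß as a word character, there is no boundary between ß and b in 'weißbier', so the pattern from the question must not match.

```php
<?php
// Hypothetical probe (name chosen for illustration): returns true when \b
// treats "ß" as a word character under the /u modifier.
function pcreWordBoundaryIsUnicodeAware(): bool
{
    // If "ß" counts as a word character, there is no boundary between "ß"
    // and "b", so the pattern finds no match and preg_match() returns 0.
    // A return value of false (error) is reported as "not aware".
    return preg_match('/\bweiß\b/u', 'weißbier') === 0;
}

var_dump(PHP_VERSION);
var_dump(pcreWordBoundaryIsUnicodeAware());
```

A check like this could run in a deployment test, so that a difference between the development machine and the server shows up early.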

answered Dec 10, 2016 at 10:32


tomsv · Accepted Answer · 2010-03-12 14:37:30Z

Here is what I have found so far. By rewriting the search and replacement patterns like this:

$before = '(^|[^\p{L}])';
$after = '([^\p{L}]|$)';
var_dump(preg_replace('/'.$before.'weiß'.$after.'/iu', '$1weiss$2', 'weißbier'));
// Test some other cases:
var_dump(preg_replace('/'.$before.'weiß'.$after.'/iu', '$1weiss$2', 'weiß'));
var_dump(preg_replace('/'.$before.'weiß'.$after.'/iu', '$1weiss$2', 'weiß bier'));
var_dump(preg_replace('/'.$before.'weiß'.$after.'/iu', '$1weiss$2', ' weiß'));

I get the wanted result:

string 'weißbier' (length=9)
string 'weiss' (length=5)
string 'weiss bier' (length=10)
string ' weiss' (length=6)

on both my Windows computer running Apache and the hosted Linux web server running Apache.

I assume there is some better way to do this.
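One reason the lookaround form may be preferable: the $before and $after groups consume the neighbouring character, so two occurrences separated by a single space cannot both match. The sketch below compares the two approaches; the variable names and the sample string 'weiß weiß' are made up for this illustration.

```php
<?php
// Consuming groups, as in the patterns above: the separating space becomes
// part of the first match, so it is no longer available as the "before"
// character of the second occurrence.
$consuming = '/(^|[^\p{L}])weiß([^\p{L}]|$)/iu';

// Lookarounds only assert the context and consume nothing.
$lookaround = '/(?<!\p{L})weiß(?!\p{L})/iu';

$subject = 'weiß weiß'; // sample input made up for this illustration

$viaGroups = preg_replace($consuming, '$1weiss$2', $subject);
$viaLookarounds = preg_replace($lookaround, 'weiss', $subject);

var_dump($viaGroups);      // "weiss weiß": the second occurrence is missed
var_dump($viaLookarounds); // "weiss weiss"
```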

Also, I would still like to set the locale on my Windows computer to UTF-8.

ntd · Accepted Answer · 2010-03-14 14:44:44Z

0

According to this comment, that is a bug in PHP. Does using \W instead of \b give any benefit?

edited Mar 14, 2010 at 14:44

answered Mar 14, 2010 at 14:25


2 Comments

ntd Over a year ago

Yes it was, 10 years ago.

ntd Over a year ago

Yes they were. Better now?
