How can I handle validation of non-latin script input in PHP?

Question

I am trying to adapt a PHP application to handle non-Latin scripts (specifically Japanese, Simplified Chinese, and Arabic). The app's data validation routines make frequent use of regular expressions to check input, but I am not sure how to adapt the \w character type to other languages without installing additional locales on the system (which I cannot rely on).

Previous developers on the app simply added the needed characters to the regexes as the number of supported languages grew (you frequently see "[\wÀÁÂÃÄÅÆÇÈÉ... etc" in the code), but I can't realistically do this for all the alphabets I need to support now.

Does anybody out there have some advice on how to tackle this?

What does "validation" mean? You could use the locale-aware ctype_alnum, but what you're asking for is "what is an alphanumeric character in any locale"...
– Kerrek SB, Commented Jun 26, 2011 at 23:54
Unicode is broken in PHP, I know that much. It would be nice to see links to some libraries that people know to work.
– Ярослав Рахматуллин, Commented Jun 27, 2011 at 0:14
@Ярослав How is Unicode "broken" in PHP? Most basic string functions don't explicitly support it, that's all. The ones that do work fine.
– deceze ♦, Commented Jun 27, 2011 at 1:27

Dr.Molle · Accepted Answer · 2011-06-27 01:18:03Z


See this comment on php.net: http://www.php.net/manual/en/regexp.reference.unicode.php#102756

for example:

// $string may contain only characters from the Arabic script
preg_match('@^\p{Arabic}+$@u', $string);

// $string may contain only characters from the Cyrillic script
preg_match('@^\p{Cyrillic}+$@u', $string);

// $string may contain only word characters and Greek characters
preg_match('@^[\p{Greek}\w]+$@u', $string);

...and so on
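For the scripts named in the question, the same idea can be wrapped in a small helper. This is only a sketch: the function name containsOnlyScripts and the choice of script names are illustrative assumptions, not part of the php.net comment. It uses \z rather than $ because $ also matches just before a trailing newline.

```php
<?php
// Hypothetical helper (not from the original answer): builds a character
// class from Unicode script names and checks that the whole string matches.
function containsOnlyScripts(string $input, array $scripts): bool
{
    if ($scripts === []) {
        throw new InvalidArgumentException('At least one script name is required.');
    }

    $class = '';
    foreach ($scripts as $script) {
        // Accept plain script names only, so nothing unexpected reaches the pattern.
        if (!is_string($script) || preg_match('/^[A-Za-z_]+\z/', $script) !== 1) {
            throw new InvalidArgumentException('Invalid script name.');
        }
        $class .= '\p{' . $script . '}';
    }

    // Under the u modifier preg_match() returns false for invalid UTF-8,
    // so comparing with 1 rejects malformed input as well.
    return preg_match('/^[' . $class . ']+\z/u', $input) === 1;
}

// Example calls: Japanese text usually mixes three scripts.
var_dump(containsOnlyScripts('日本語', ['Han', 'Hiragana', 'Katakana']));
var_dump(containsOnlyScripts('مرحبا', ['Arabic']));
```

Spaces, punctuation and digits are not part of these scripts, so real input will usually need a few extra entries in the class. Some characters used in Japanese text are assigned to the Common script rather than Katakana or Hiragana, so test against realistic data before relying on this.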

demonstration: http://cecb.freephptest.com/

edited Jun 27, 2011 at 1:18

answered Jun 27, 2011 at 0:42


4 Comments

Kerrek SB Over a year ago

All fine and well, but is there anything that matches "any alphabetical character in any language or script"?

Kerrek SB Over a year ago

Ah, according to this website some languages support \p{Letter} to match "any sort of letter". Does PHP?

Alix Axel Over a year ago

@Kerrek SB: PHP supports it (if PCRE engine is compiled with Unicode support - most are) but you still have to use the u modifier.
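Whether a given installation has this support can be probed at runtime. A minimal sketch, assuming a hypothetical function name pcreSupportsUnicodeProperties: if the PCRE build lacks Unicode support, the pattern fails to compile and preg_match() returns false instead of 1.

```php
<?php
// Hypothetical probe (name is an assumption): true when both the u modifier
// and Unicode property escapes such as \p{L} are usable on this build.
function pcreSupportsUnicodeProperties(): bool
{
    // "\xC3\xA9" is the UTF-8 encoding of U+00E9 (e with acute accent).
    // The @ silences the compile warning emitted when support is missing.
    return @preg_match('/^\p{L}\z/u', "\xC3\xA9") === 1;
}

var_dump(pcreSupportsUnicodeProperties());
```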

Matt Over a year ago

Thank you. This works, and your help is greatly appreciated. For those who are wondering, the regex is /\p{L}/u (the "u" modifier enables Unicode support).
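Note that /\p{L}/u on its own only checks that the string contains at least one letter somewhere. For validating a whole field, an anchored variant is probably what is wanted. The sketch below is an assumption about that use case, not Matt's actual code: the name isLettersOnly is hypothetical, and \p{M} is added so that combining marks (for example Arabic vowel signs) are accepted.

```php
<?php
// Hypothetical validator: the whole string must consist of letters
// (\p{L}) and combining marks (\p{M}) from any script.
function isLettersOnly(string $input): bool
{
    // \z anchors at the very end; $ would also accept a trailing newline.
    return preg_match('/^[\p{L}\p{M}]+\z/u', $input) === 1;
}

var_dump(isLettersOnly('Ярослав')); // letters only
var_dump(isLettersOnly('abc123'));  // contains digits

// The unanchored pattern from the comment still matches here, because
// the string contains at least one letter.
var_dump(preg_match('/\p{L}/u', 'abc123'));
```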
