I am trying to adapt a PHP application to handle non-Latin scripts (specifically Japanese, Simplified Chinese and Arabic). The app's data validation routines make frequent use of regular expressions to check input, but I am not sure how to adapt the \w character class to other languages without installing additional locales on the system (which I cannot rely on).

Previous developers who worked on the app simply added the needed characters to the regexes as the number of supported languages grew (you frequently see "[\wÀÁÂÃÄÅÆÇÈÉ..." etc. in the code), but I can't really do this for all the alphabets I need to support now.

Does anybody out there have some advice on how to tackle this?

  • What does "validation" mean? You could use the locale-aware ctype_alnum, but what you're asking for is "what is an alphanumeric character in any locale"... Commented Jun 26, 2011 at 23:54
  • Unicode is broken in PHP, I know that much. It would be nice to see links to some libraries that people know to work. Commented Jun 27, 2011 at 0:14
  • @Ярослав How is Unicode "broken" in PHP? Most basic string functions don't explicitly support it, that's all. The ones that do work fine. Commented Jun 27, 2011 at 1:27

1 Answer

See this comment on php.net: http://www.php.net/manual/en/regexp.reference.unicode.php#102756

For example:

// $string may only contain characters from the Arabic script
preg_match('@^\p{Arabic}+$@u', $string);

// $string may only contain characters from the Cyrillic script
preg_match('@^\p{Cyrillic}+$@u', $string);

// $string may contain word characters and characters from the Greek script
preg_match('@^[\p{Greek}\w]+$@u', $string);

...and so on

demonstration: http://cecb.freephptest.com/
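
For the scripts named in the question, the same idea might look like the sketch below. The helper name is_only_scripts is made up for illustration; Han, Hiragana, Katakana and Arabic are script names that PCRE recognises, and the u modifier assumes the input is valid UTF-8. It uses \z rather than $ so that a trailing newline is rejected.

```php
<?php
// Hypothetical helper (not part of PHP): returns true when $s is non-empty
// and every character belongs to one of the given Unicode scripts.
function is_only_scripts(string $s, array $scripts): bool
{
    // Build a character class such as [\p{Han}\p{Hiragana}\p{Katakana}].
    $class = '';
    foreach ($scripts as $script) {
        $class .= '\p{' . $script . '}';
    }
    // preg_match returns 1 on a match, 0 on no match and false on error
    // (for example, invalid UTF-8 input), so compare strictly against 1.
    return preg_match('/^[' . $class . ']+\z/u', $s) === 1;
}

var_dump(is_only_scripts('漢字', ['Han']));                              // bool(true)
var_dump(is_only_scripts('ひらがな', ['Han', 'Hiragana', 'Katakana']));  // bool(true)
var_dump(is_only_scripts('مرحبا', ['Arabic']));                          // bool(true)
var_dump(is_only_scripts('abc', ['Han']));                               // bool(false)
```

Real-world text usually also contains punctuation, spaces and digits that are not assigned to any one of these scripts, so a production pattern would probably need to allow those explicitly.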


4 Comments

All fine and well, but is there anything that matches "any alphabetical character in any language or script"?
Ah, according to this website some languages support \p{Letter} to match "any sort of letter". Does PHP?
@Kerrek SB: PHP supports it (if the PCRE engine is compiled with Unicode support, which most are), but you still have to use the u modifier.
Thank you. This works, and your help is greatly appreciated. For those who are wondering, the regex is /\p{L}/u (the "u" modifier enables Unicode support).
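
Note that /\p{L}/u on its own only checks that the string contains at least one letter somewhere. For validation in the style of ^\w+$ the pattern needs anchoring. A minimal sketch, assuming UTF-8 input; the function name is_unicode_word is hypothetical, and treating "letter, number or underscore" as the equivalent of \w is my assumption, not something from the answer:

```php
<?php
// Hypothetical helper: a script-agnostic stand-in for /^\w+$/.
function is_unicode_word(string $s): bool
{
    // \p{L} = any kind of letter, \p{N} = any kind of number,
    // "_" is kept to mirror \w. \z anchors at the very end of the
    // string, so a trailing newline does not slip through.
    // Strict comparison with 1 also rejects invalid UTF-8, for which
    // preg_match returns false.
    return preg_match('/^[\p{L}\p{N}_]+\z/u', $s) === 1;
}

var_dump(is_unicode_word('日本語'));       // bool(true)
var_dump(is_unicode_word('مرحبا'));        // bool(true)
var_dump(is_unicode_word('héllo_42'));     // bool(true)
var_dump(is_unicode_word('hello world'));  // bool(false), contains a space
```

Combining marks (\p{M}), such as the vowel marks in fully vocalised Arabic, are not letters, so they may need to be added to the class depending on the input you expect.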
