Split string into array based on a unicode character range in PHP

Question

Sorry for the ambiguous subject. What I'm looking for is a way to split a string with Cyrillic characters, such as

«Добрый день!» - сказал он, потянувшись…

into an array that goes like

[0] => «
[1] => Добрый␠
[2] => день!»␠-␠
[3] => сказал␠
[4] => он,␠
[5] => потянувшись…

So essentially I'm looking for a break to occur on the border between any character and a Cyrillic character ([а-я] range), but only when we transition from a non-Cyrillic character to a Cyrillic one, not vice versa. I've seen examples that successfully solve this for punctuation characters and the Latin alphabet with

preg_split('/([^.:!?]+[.:!?]+)/', 'hello:there.everyone!so.how?are:you', NULL, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY );

but my attempts to repurpose it into something different have so far failed:

preg_split ('/(?<=[^а-я])/ius', $text, NULL, PREG_SPLIT_NO_EMPTY);

almost works but it also splits by regular characters such as spaces and punctuation marks and that is not what I want. Clearly there's something wrong with my regex. How should I modify that to get the result as in the example above?

Why is the « character captured as a separate item, while the closing » is captured as part of the string день!»?
– RomanPerekhrest, Commented Dec 9, 2016 at 22:12
Yes, it's not really the best example; I'm willing to sacrifice the [0] there somehow.
– Захар Joe, Commented Dec 9, 2016 at 22:46

Wiktor Stribiżew · Accepted Answer · 2016-12-09 22:22:05Z

2

Use the following regex solution:

$s = "«Добрый день!» - сказал он, потянувшись…";
// -1 means "no limit"; passing NULL as the limit is deprecated as of PHP 8.1
$res = preg_split('/\b(\p{Cyrillic}+\W*)/u', $s, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
print_r($res);
// Array(
//   [0] => «
//   [1] => Добрый 
//   [2] => день!» - 
//   [3] => сказал 
//   [4] => он, 
//   [5] => потянувшись…
//)

See the PHP demo

Details:

\b(\p{Cyrillic}+\W*) - matches and captures a whole Cyrillic word with 0+ non-word chars after it
The pattern is wrapped with capturing parentheses and PREG_SPLIT_DELIM_CAPTURE will push the captured values into the resulting array
PREG_SPLIT_NO_EMPTY will discard empty values in the array
/u modifier will make the \b (word boundary) and \W Unicode aware, and will allow processing Unicode strings with regex.
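
To make the role of the two flags concrete, here is a minimal sketch that runs the same pattern with and without PREG_SPLIT_DELIM_CAPTURE. The variable names are illustrative only, and it assumes a PHP build whose PCRE treats \b and \W as Unicode-aware under /u.

```php
<?php
// Sample text from the question.
$s = "«Добрый день!» - сказал он, потянувшись…";
$pattern = '/\b(\p{Cyrillic}+\W*)/u';

// Without PREG_SPLIT_DELIM_CAPTURE the matched words act as delimiters
// and are dropped; only the text between matches survives.
$withoutCapture = preg_split($pattern, $s, -1, PREG_SPLIT_NO_EMPTY);

// With both flags the captured words are returned as array elements,
// and the empty strings between adjacent matches are discarded.
$withCapture = preg_split($pattern, $s, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);

print_r($withoutCapture);
print_r($withCapture);
```

If the second array also comes back with a single element, the regex engine is most likely not applying Unicode rules to \b and \W, which would match the behaviour reported in the comment below.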

answered Dec 9, 2016 at 22:22

Wiktor Stribiżew


1 Comment

Захар Joe Over a year ago

I really like this elegant solution but when I try it in my own PHP, all I get is just a single line, no splits. It does work in your demo though. Why could that be?

bobble bubble · Accepted Answer · 2016-12-10 01:07:24Z

2

How about splitting at an initial \b word boundary with the u modifier?

$res = preg_split('/\b(?=\w)(?!^)/u', $str);

The lookahead ensures that \b is followed by a word character. (?!^) prevents an empty match at the start of the string.

See this demo at eval.in

edited Dec 10, 2016 at 1:07

answered Dec 9, 2016 at 22:51

bobble bubble


9 Comments

Захар Joe Over a year ago

It is a logical solution but unfortunately I need the breaks to occur only on cyrillic characters so that, for example, "слово word" doesn't get split into two.

bobble bubble Over a year ago

@ЗахарJoe In this case you could try $res = preg_split('/\b(?=[^\Wa-z])/iu', $str);

Захар Joe Over a year ago

I've just tried the two regexes you provided and unfortunately my version of PHP (5.5.38) for some reason returns just a single array element in both cases.

bobble bubble Over a year ago

@ЗахарJoe Probably the same issue with preg_split('/\b(?=\p{Cyrillic})/u', $str); which is similar to Wiktor's answer.

Захар Joe Over a year ago

It probably is, and it's kinda baffling. I did set mb_internal_encoding ( 'UTF-8' ); and I don't think it should need any other tricks. Wonder what's broken and where.
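
The difference between the variants discussed in these comments can be sketched on the "слово word" case. This is an illustration only: the sample string and variable names are made up, PREG_SPLIT_NO_EMPTY is used here instead of (?!^) to drop the empty piece at the start, and it assumes a PHP build where \b and \w are Unicode-aware under /u.

```php
<?php
// Hypothetical sample mixing Cyrillic and Latin words.
$str = "слово word текст";

// Splits before every word, Latin or Cyrillic.
$anyWord = preg_split('/\b(?=\w)/u', $str, -1, PREG_SPLIT_NO_EMPTY);

// Splits only where a word starts with a Cyrillic letter,
// so "word" stays attached to the preceding piece.
$cyrillicOnly = preg_split('/\b(?=\p{Cyrillic})/u', $str, -1, PREG_SPLIT_NO_EMPTY);

print_r($anyWord);
print_r($cyrillicOnly);
```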


Martin Cup · Accepted Answer · 2016-12-09 22:40:49Z

1

You also have to check with a lookahead whether the next character is a Cyrillic one. This code will do the job:

$t = preg_split('/(?<=[^а-я])(?=[а-я]+)/ius', $text, -1, PREG_SPLIT_NO_EMPTY);

It gives this output:

Array
(
    [0] => «
    [1] => Добрый 
    [2] => день!» - 
    [3] => сказал 
    [4] => он, 
    [5] => потянувшись…
)

Here you can try it.
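
One caveat with the [а-я] range, shown as a hedged sketch rather than a correction of the answer: the letters ё (U+0451) and Ё (U+0401) lie outside U+0430–U+044F, so the lookbehind treats them as non-Cyrillic and words containing them get split in the middle. Replacing the range with \p{Cyrillic}, as suggested in another answer, avoids this. The sample string and variable names below are made up for the illustration.

```php
<?php
// Hypothetical sample with the letter "ё" inside words.
$text = "ёлка зелёная";

// Range-based version from the answer: "ё" is not in [а-я],
// so a split occurs right after each "ё" that precedes a letter.
$byRange = preg_split('/(?<=[^а-я])(?=[а-я]+)/ius', $text, -1, PREG_SPLIT_NO_EMPTY);

// Property-based version: "ё" counts as Cyrillic, words stay whole.
$byProperty = preg_split('/(?<=\P{Cyrillic})(?=\p{Cyrillic})/u', $text, -1, PREG_SPLIT_NO_EMPTY);

print_r($byRange);
print_r($byProperty);
```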

edited Dec 9, 2016 at 22:40

answered Dec 9, 2016 at 22:21

Martin Cup


3 Comments

Martin Cup Over a year ago

Thank you, but I think you should also check out bobble bubble's answer, which seems to be a little more elegant.

bobble bubble Over a year ago

Have already voted for this. Another variant: $res = preg_split('/\b(?=[а-я])/iu', $str);

Захар Joe Over a year ago

Same story. My PHP disrespects something (although I don't see why it would do that) and only the lookahead variant works.

Nicolas · Accepted Answer · 2016-12-09 22:21:44Z

0

Try this regex: [\x{0400}-\x{04FF}]*[^\x{0400}-\x{04FF}]*. All Unicode characters from 0400 to 04FF are considered Cyrillic. It should match exactly what you want. You can also replace \x{0400}-\x{04FF} with \p{Cyrillic}, as suggested in another answer.

This is all the characters in that range:
ЀЁЂЃЄЅІЇЈЉЊЋЌЍЎЏАБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуфхцчшщъыьэюяѐёђѓєѕіїјљњћќѝўџѠѡѢѣѤѥѦѧѨѩѪѫѬѭѮѯѰѱѲѳѴѵѶѷѸѹѺѻѼѽѾѿҀҁ҂҃҄҅҆҇҈҉ҊҋҌҍҎҏҐґҒғҔҕҖҗҘҙҚқҜҝҞҟҠҡҢңҤҥҦҧҨҩҪҫҬҭҮүҰұҲҳҴҵҶҷҸҹҺһҼҽҾҿӀӁӂӃӄӅӆӇӈӉӊӋӌӍӎӏӐӑӒӓӔӕӖӗӘәӚӛӜӝӞӟӠӡӢӣӤӥӦӧӨөӪӫӬӭӮӯӰӱӲӳӴӵӶӷӸӹӺӻӼӽӾӿ

answered Dec 9, 2016 at 22:21

Nicolas


2 Comments

Захар Joe Over a year ago

This regex loses every other word when I try it, only odd words are going into the array, even words are lost.

Nicolas Over a year ago

Don't use it with split, use it with match. This matches a string not a position to split.
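
A minimal sketch of the "match, not split" usage described in this comment, using preg_match_all with the regex from the answer. The /u modifier and the filtering step are additions for this illustration: because both halves of the pattern are optional, it can also match the empty string, so empty matches are removed afterwards.

```php
<?php
// Sample text from the question.
$s = "«Добрый день!» - сказал он, потянувшись…";

// A run of Cyrillic-block characters followed by a run of anything else.
$pattern = '/[\x{0400}-\x{04FF}]*[^\x{0400}-\x{04FF}]*/u';

preg_match_all($pattern, $s, $m);

// Drop empty matches and reindex the array.
$pieces = array_values(array_filter($m[0], function ($p) {
    return $p !== '';
}));

print_r($pieces);
```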
