2

Sorry for the ambiguous subject. What I'm looking for is a way to turn a string with Cyrillic characters, such as

«Добрый день!» - сказал он, потянувшись…

into an array that goes like

[0] => «
[1] => Добрый␠
[2] => день!»␠-␠
[3] => сказал␠
[4] => он,␠
[5] => потянувшись…

So essentially I want a break to occur at the border between any character and a Cyrillic character (the [а-я] range), but only on the transition from a non-Cyrillic character to a Cyrillic one, not vice versa. I've seen examples that solve a similar problem for punctuation characters and the Latin alphabet with

preg_split('/([^.:!?]+[.:!?]+)/', 'hello:there.everyone!so.how?are:you', NULL, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY );

but my attempts to repurpose it into something different have so far failed:

preg_split ('/(?<=[^а-я])/ius', $text, NULL, PREG_SPLIT_NO_EMPTY);

This almost works, but it also splits after ordinary characters such as spaces and punctuation marks, which is not what I want. Clearly there's something wrong with my regex. How should I modify it to get the result shown in the example above?

2
  • Why is the « character captured as a separate item, while the matching » is captured as part of the string день!» - ? Commented Dec 9, 2016 at 22:12
  • Yes, it's not really the best example, I'm willing to sacrifice the [0] there somehow. Commented Dec 9, 2016 at 22:46

4 Answers

2

Use the following regex solution:

$s = "«Добрый день!» - сказал он, потянувшись…";
$res = preg_split('/\b(\p{Cyrillic}+\W*)/u', $s, NULL, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
print_r($res);
// Array(
//   [0] => «
//   [1] => Добрый 
//   [2] => день!» - 
//   [3] => сказал 
//   [4] => он, 
//   [5] => потянувшись…
//)

See the PHP demo

Details:

  • \b(\p{Cyrillic}+\W*) - matches and captures a whole Cyrillic word with 0+ non-word chars after it
  • The pattern is wrapped with capturing parentheses and PREG_SPLIT_DELIM_CAPTURE will push the captured values into the resulting array
  • PREG_SPLIT_NO_EMPTY will discard empty values in the array
  • The /u modifier makes PHP treat the pattern and the subject as UTF-8 and, when the underlying PCRE library supports it, makes \b (word boundary) and \W Unicode-aware.
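To make the roles of the two flags concrete, here is a minimal sketch that runs the same pattern with and without PREG_SPLIT_NO_EMPTY. The helper name splitKeepingDelimiters() is hypothetical, the code assumes PHP 7 or later for the type declarations, and it passes -1 (also "no limit") instead of NULL as the limit.

```php
<?php
// Hypothetical helper for illustration; it is not part of the answer above.
function splitKeepingDelimiters(string $s, bool $dropEmpty): array
{
    $flags = PREG_SPLIT_DELIM_CAPTURE | ($dropEmpty ? PREG_SPLIT_NO_EMPTY : 0);
    // -1 means "no limit", the same meaning NULL has in the answer's call.
    $parts = preg_split('/\b(\p{Cyrillic}+\W*)/u', $s, -1, $flags);
    // preg_split() returns false on failure (for example, invalid UTF-8 input).
    return $parts === false ? [] : $parts;
}

$s = "«Добрый день!» - сказал он, потянувшись…";

// Without NO_EMPTY: the captured delimiters are interleaved with the
// (mostly empty) text found between consecutive matches.
$raw = splitKeepingDelimiters($s, false);

// With NO_EMPTY: only the non-empty pieces remain.
$clean = splitKeepingDelimiters($s, true);

print_r($raw);
print_r($clean);
```

Because every character of the input ends up either in a captured delimiter or in the text between delimiters, joining $raw back together reproduces the original string.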

1 Comment

I really like this elegant solution but when I try it in my own PHP, all I get is just a single line, no splits. It does work in your demo though. Why could that be?
2

How about splitting at an initial \b word boundary with the u modifier?

$res = preg_split('/\b(?=\w)(?!^)/u', $str);

The lookahead ensures that \b is followed by a word character, and (?!^) prevents an empty match at the start of the string.

See this demo at eval.in

9 Comments

It is a logical solution but unfortunately I need the breaks to occur only on cyrillic characters so that, for example, "слово word" doesn't get split into two.
@ЗахарJoe In this case you could try $res = preg_split('/\b(?=[^\Wa-z])/iu', $str);
I've just tried the two regexes you provided and unfortunately my version of PHP (5.5.38) for some reason returns just a single array element in both cases.
@ЗахарJoe Probably the same issue with preg_split('/\b(?=\p{Cyrillic})/u', $str); which is similar to Wiktor's answer.
It probably is, and it's kinda baffling. I did set mb_internal_encoding ( 'UTF-8' ); and I don't think it should need any other tricks. Wonder what's broken and where.
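The difference between the answer's pattern and the Cyrillic-only variant from the comments can be sketched as follows. The sample string is made up for illustration, and the (?!^) guard from the answer is added to the variant as an assumption so that both calls behave the same at the start of the string.

```php
<?php
// Made-up sample mixing Cyrillic and Latin words.
$str = "слово word привет";

// The answer's pattern: split before every word, whatever its script.
$anyWord = preg_split('/\b(?=\w)(?!^)/u', $str);

// The variant from the comments: split only before a Cyrillic word.
$cyrOnly = preg_split('/\b(?=\p{Cyrillic})(?!^)/u', $str);

print_r($anyWord); // "слово ", "word ", "привет"
print_r($cyrOnly); // "слово word ", "привет"
```

Both calls rely on \b being Unicode-aware, which is the behavior the comments above report as missing on the asker's installation.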
1

You also have to check, with a lookahead, that the next character is a Cyrillic one. This code will do the job:

$t = preg_split ('/(?<=[^а-я])(?=[а-я]+)/ius', $text, NULL, PREG_SPLIT_NO_EMPTY);

It gives this output:

Array
(
    [0] => «
    [1] => Добрый 
    [2] => день!» - 
    [3] => сказал 
    [4] => он, 
    [5] => потянувшись…
)

Here you can try it.
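One caveat worth sketching: the [а-я] range does not include ё or Ё, so words containing them are treated as partly non-Cyrillic. The comparison below uses a made-up sample string and a \p{Cyrillic}-based pattern that is a swapped-in alternative, not the answer's own code.

```php
<?php
// Made-up sample containing Ё and ё, which lie outside the [а-я] range.
$text = "Ёлка и ёж";

// The answer's pattern: Ё and ё count as "not [а-я]", so the split
// lands inside the words that contain them.
$byRange = preg_split('/(?<=[^а-я])(?=[а-я]+)/ius', $text, -1, PREG_SPLIT_NO_EMPTY);

// Alternative using the Unicode script property: split where a
// non-Cyrillic character is followed by a Cyrillic one.
$byScript = preg_split('/(?<=\P{Cyrillic})(?=\p{Cyrillic})/u', $text, -1, PREG_SPLIT_NO_EMPTY);

print_r($byRange);  // "Ё", "лка ", "и ё", "ж"
print_r($byScript); // "Ёлка ", "и ", "ёж"
```

On the question's sample sentence, which contains no ё, the two patterns produce the same six pieces.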

3 Comments

Thank you, but I think you should also check out bobble bubble's answer, which seems to be a little more elegant.
Have already voted for this. Another variant: $res = preg_split('/\b(?=[а-я])/iu', $str);
Same story. My PHP disrespects something (although I don't see why it would do that) and only the lookahead variant works.
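One possible explanation for the behavior described in these comments, offered as an assumption rather than something confirmed in the thread, is that \b is not Unicode-aware on that installation, so no word boundary is ever found next to a Cyrillic letter. A minimal diagnostic sketch (the function name is hypothetical):

```php
<?php
// Hypothetical diagnostic helper; the name is illustrative only.
function wordBoundaryIsUnicodeAware(): bool
{
    // If \b is Unicode-aware, "я" is a word character and there is a
    // boundary between the space and "я". If \b only knows ASCII word
    // characters, both neighbors are non-word and the match fails.
    return preg_match('/\bя/u', ' я') === 1;
}

echo wordBoundaryIsUnicodeAware()
    ? "\\b is Unicode-aware here\n"
    : "\\b is ASCII-only here; prefer explicit lookarounds\n";
```

If the check fails, patterns built only from lookarounds and explicit character classes, like the one in this answer, do not depend on \b at all.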
0

Try this regex: [\x{0400}-\x{04FF}]*[^\x{0400}-\x{04FF}]*. The code points U+0400 to U+04FF make up the Unicode Cyrillic block, so the pattern matches a run of Cyrillic characters followed by a run of non-Cyrillic ones. It should match exactly what you want. You can also replace \x{0400}-\x{04FF} with \p{Cyrillic}, as suggested in another answer.

This is all the characters in that range:
ЀЁЂЃЄЅІЇЈЉЊЋЌЍЎЏАБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЪЫЬЭЮЯабвгдежзийклмнопрстуфхцчшщъыьэюяѐёђѓєѕіїјљњћќѝўџѠѡѢѣѤѥѦѧѨѩѪѫѬѭѮѯѰѱѲѳѴѵѶѷѸѹѺѻѼѽѾѿҀҁ҂҃҄҅҆҇҈҉ҊҋҌҍҎҏҐґҒғҔҕҖҗҘҙҚқҜҝҞҟҠҡҢңҤҥҦҧҨҩҪҫҬҭҮүҰұҲҳҴҵҶҷҸҹҺһҼҽҾҿӀӁӂӃӄӅӆӇӈӉӊӋӌӍӎӏӐӑӒӓӔӕӖӗӘәӚӛӜӝӞӟӠӡӢӣӤӥӦӧӨөӪӫӬӭӮӯӰӱӲӳӴӵӶӷӸӹӺӻӼӽӾӿ

2 Comments

This regex loses every other word when I try it, only odd words are going into the array, even words are lost.
Don't use it with split, use it with match. This matches a string, not a position to split at.
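Following the advice in the last comment, a minimal sketch of using the pattern with preg_match_all() rather than preg_split() might look like this. Since both halves of the pattern are optional, it can also match the empty string at the end of the input, so empty matches are filtered out here; that filtering step is an addition, not part of the answer.

```php
<?php
$s = "«Добрый день!» - сказал он, потянувшись…";

// Each match is a run of Cyrillic characters followed by a run of
// non-Cyrillic ones; either run may be empty.
preg_match_all('/[\x{0400}-\x{04FF}]*[^\x{0400}-\x{04FF}]*/u', $s, $m);

// Drop zero-length matches and reindex the array from 0.
$parts = array_values(array_filter($m[0], 'strlen'));

print_r($parts);
```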
