6

I'm trying to split a UTF-8 encoded string into an array of chars. The function that I use now used to work, but for some reason it doesn't work anymore. What could be the reason? And, better yet, how can I fix it?

This is my string:

Zelf heb ik maar één vraag: wie ben jij?

This is my function:

function utf8Split($str, $len = 1)
{
  $arr = array();
  $strLen = mb_strlen($str);
  for ($i = 0; $i < $strLen; $i++)
  {
    $arr[] = mb_substr($str, $i, $len);
  }
  return $arr;
}

This is the result:

Array
(
    [0] => Z
    [1] => e
    [2] => l
    [3] => f
    [4] =>  
    [5] => h
    [6] => e
    [7] => b
    [8] =>  
    [9] => i
    [10] => k
    [11] =>  
    [12] => m
    [13] => a
    [14] => a
    [15] => r
    [16] =>  
    [17] => e
    [18] => ́
    [19] => e
    [20] => ́
    [21] => n
    [22] =>  
    [23] => v
    [24] => r
    [25] => a
    [26] => a
    [27] => g
    [28] => :
    [29] =>  
    [30] => w
    [31] => i
    [32] => e
    [33] =>  
    [34] => b
    [35] => e
    [36] => n
    [37] =>  
    [38] => j
    [39] => i
    [40] => j
    [41] => ?
)
3 Comments

  • Define "doesn't work". What is it doing that it's not supposed to be doing and/or what is it not doing that it's supposed to be doing? Commented Feb 24, 2012 at 21:20
  • The éé part isn't split as it should be. Commented Feb 25, 2012 at 7:35
  • SOLUTION: stackoverflow.com/a/21654160/2377343 Commented Jan 24, 2016 at 17:14

7 Answers

18

This is the best solution:

I've found this nice solution in the PHP manual pages.

preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);

It works really fast:

In PHP 5.6.18 it split a 6 MB big text file in a matter of seconds.

Best of all, it doesn't need multibyte (mb_) support!

There is also a similar answer here.
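
As a minimal sketch of how this call behaves on the sentence from the question: the example below assumes PHP 7.0 or later (for the `\u{...}` escape) and assumes that each é is the single precomposed code point U+00E9, which may not be true of the asker's actual input.

```php
<?php
// The sentence from the question, with each é written as the
// precomposed code point U+00E9 (an assumption about the input).
$str = "Zelf heb ik maar \u{00E9}\u{00E9}n vraag: wie ben jij?";

// An empty pattern with the /u modifier matches between every
// UTF-8 code point; PREG_SPLIT_NO_EMPTY drops the empty pieces
// at the start and end of the string.
$chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);

echo count($chars), "\n";   // number of code points
echo $chars[17], "\n";      // the first é, as a single element
```

With precomposed input this yields 40 elements, two fewer than the 42 shown in the question, because each é stays in one piece.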

13

For the mb_... functions you should specify the charset encoding.

In your example code, this applies to the following two lines in particular:

$strLen = mb_strlen($str, 'UTF-8');
$arr[] = mb_substr($str, $i, $len, 'UTF-8');

The full picture:

function utf8Split($str, $len = 1)
{
  $arr = array();
  $strLen = mb_strlen($str, 'UTF-8');
  for ($i = 0; $i < $strLen; $i++)
  {
    $arr[] = mb_substr($str, $i, $len, 'UTF-8');
  }
  return $arr;
}

The encoding is passed explicitly because you're using UTF-8 here. However, if the input is not properly encoded as UTF-8, this won't work "any longer", simply because the function was not designed for anything else.

Alternatively, you can process UTF-8 encoded strings with PCRE regular expressions. For example, this returns what you're looking for in less code:

$str = 'Zelf heb ik maar één vraag: wie ben jij?';

$chars = preg_split('/(?!^)(?=.)/u', $str);

Besides preg_split there is also mb_split.

3 Comments

I specify the encoding globally with: mb_internal_encoding('UTF-8');
That should set it (but it also sets the HTTP input and output encoding). You could analyze the string (e.g. with a hexdump) and check its encoding firsthand. I suspect that either the encoding setting is not right or the string is encoded in something other than UTF-8.
The performance cost of this approach is really high. Please consider this one, which is much faster (something like 10000 times): preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
4

If you are not sure whether the mbstring function library is available, use:

Version 1:

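function utf8_str_split($str = '', $len = 1) {
    // Match every UTF-8 character; the s modifier lets "." match newlines too.
    preg_match_all('/./su', $str, $arr);
    // Group the characters into chunks of $len and join each chunk into a string.
    $arr = array_chunk($arr[0], $len);
    $arr = array_map('implode', $arr);
    return $arr;
}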

Version 2:

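function utf8_str_split($str = '', $len = 1) {
    // Split after every $len characters; the cast keeps $len from altering the pattern.
    return preg_split('/(?<=\G.{' . (int) $len . '})/su', $str, -1, PREG_SPLIT_NO_EMPTY);
}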

Both functions were tested in PHP 5.

3

There is a multibyte split function in PHP, mb_split.

1 Comment

Be certain to set the mb_regex_encoding(), too!
1

I found out the é was not the character I expected. Apparently there is a difference between né and ńe. I got it working by normalizing the string first.
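
To illustrate the difference, here is a sketch under stated assumptions, not necessarily what was done in this answer. The same visible é can be one code point (U+00E9) or two (e followed by the combining acute accent U+0301). The example normalizes with `Normalizer::normalize` from the intl extension when that extension is available. The `\X` pattern at the end, which matches a whole grapheme cluster, is an alternative added here for comparison and is not part of the answer. PHP 7.0 or later is assumed for the `\u{...}` escapes.

```php
<?php
$composed   = "\u{00E9}";   // é as a single code point
$decomposed = "e\u{0301}";  // e + combining acute accent

// The two look identical on screen but differ byte for byte.
echo bin2hex($composed), "\n";    // c3a9
echo bin2hex($decomposed), "\n";  // 65cc81

// Splitting per code point separates the accent from its letter.
$parts = preg_split('//u', $decomposed, -1, PREG_SPLIT_NO_EMPTY);
echo count($parts), "\n";         // 2

// Normalizing to NFC first (requires the intl extension) merges them.
if (class_exists('Normalizer')) {
    $nfc = Normalizer::normalize($decomposed, Normalizer::FORM_C);
    var_dump($nfc === $composed);
}

// Alternative without intl: \X matches one grapheme cluster,
// so a letter and its combining mark stay together.
preg_match_all('/\X/u', "{$decomposed}{$decomposed}n", $m);
$clusters = $m[0];
echo count($clusters), "\n";      // 3
```

This matches the output in the question, where indexes 17 to 20 hold an e and a separate accent twice over.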

0
mb_internal_encoding("UTF-8"); 

46 arrays - off 41 arrays

1 Comment

Could you clarify this answer?
0

Since PHP 7.4, you can use mb_str_split:

$arr = mb_str_split($str);
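
A hedged sketch of a wrapper that prefers `mb_str_split` and falls back to `preg_split` on older PHP versions or when mbstring is missing. The name `utf8Chars` is hypothetical and not from this answer. Both branches split by code point, so a decomposed é would still come back as two elements.

```php
<?php
// Hypothetical helper: split a UTF-8 string into code points.
function utf8Chars(string $str): array
{
    if (function_exists('mb_str_split')) {
        // PHP 7.4+ with mbstring; pass the encoding explicitly
        // rather than relying on mb_internal_encoding().
        return mb_str_split($str, 1, 'UTF-8');
    }
    // Fallback that only needs PCRE with UTF-8 support.
    $chars = preg_split('//u', $str, -1, PREG_SPLIT_NO_EMPTY);
    // preg_split returns false when the input is not valid UTF-8.
    return $chars === false ? [] : $chars;
}

$chars = utf8Chars("\u{00E9}\u{00E9}n vraag");
echo count($chars), "\n";
```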
