Saturday, May 9, 2009

str_split and UTF-8 (and other encodings)

If you've ever dealt with UTF-8 in PHP, you probably know how much hassle it involves, since PHP has no internal support for character encodings (until PHP 6, that is). Luckily, the mbstring extension provides the basic functionality for handling various encodings in PHP, but even that library is missing some useful and often needed functions.

One of these missing functions is a multibyte version of str_split(). That function splits a string into an array of strings of a specified length (the last piece may be shorter). It can come in quite handy at times, even though it has only been among the normal string functions since PHP 5. Here's how to achieve the same functionality with UTF-8 and other encodings.

Dealing with UTF-8

If you're working with UTF-8, there's a relatively easy solution, since the PCRE functions support UTF-8 through the u modifier. If you just need to separate the entire string into an array of characters, you can simply use preg_split():

$chars = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);

The empty regular expression matches between all characters, so the string is split into an array of characters. Because of the u modifier, the string is treated as UTF-8 and each character in that encoding is matched as a whole. PREG_SPLIT_NO_EMPTY is required, because the regular expression also matches between the beginning of the string and the first character and between the last character and the end, which would otherwise produce an empty string at the beginning and the end of the array.
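To see what the u modifier buys you, here is a small sketch that compares the byte-oriented str_split() with the preg_split() call above. The sample string and variable names are only illustrative; the string is written with byte escapes so the result does not depend on how the file is saved.

```php
<?php
// "häß" encoded as UTF-8: h is one byte, ä and ß are two bytes each.
$string = "h\xC3\xA4\xC3\x9F";

// str_split() works on bytes, so the two-byte characters are cut in half.
$bytes = str_split($string);

// preg_split() with the u modifier splits between whole characters.
$chars = preg_split('//u', $string, -1, PREG_SPLIT_NO_EMPTY);

echo count($bytes), "\n"; // 5
echo count($chars), "\n"; // 3
```

The five elements returned by str_split() are single bytes, and four of them are not valid UTF-8 on their own.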

You could also use preg_match_all() to just create a list of all characters like:

preg_match_all('/./us', $string, $match);
$chars = $match[0];
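The s modifier in that pattern matters if the string can contain line breaks, because without it the dot does not match a newline. A minimal sketch of the difference, using an illustrative sample string and variable names:

```php
<?php
$string = "a\nb";

// With s, the dot matches every character, including the newline.
preg_match_all('/./us', $string, $with_s);

// Without s, the newline is not matched and silently disappears from the list.
preg_match_all('/./u', $string, $without_s);

var_dump($with_s[0] === array("a", "\n", "b")); // bool(true)
var_dump($without_s[0] === array("a", "b"));    // bool(true)
```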

If you want to replicate the str_split functionality, allowing you to split the string into longer character sequences, you could use the following function:

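function str_split_utf8 ($string, $split_length = 1)
{
 $length = (int) $split_length;
 $string = (string) $string;
 
 // A chunk length below 1 makes no sense
 if ($length < 1)
 {
  return false;
 }
 
 // Capture each run of $length characters as a delimiter and keep the
 // delimiters; the empty pieces between them are dropped
 return preg_split("/(.{{$length}})/us", $string, -1,
  PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
}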

This function works just like str_split(), except that it handles UTF-8 strings correctly. I prefer preg_split() here over preg_match_all(), because my testing indicates it is slightly faster. The function works in a somewhat roundabout way: each full-length chunk is captured as a delimiter rather than returned as a split piece (hence the use of PREG_SPLIT_DELIM_CAPTURE). The pieces between the delimiters are empty and get discarded by PREG_SPLIT_NO_EMPTY; only the trailing "overflow" characters, too few to fill a whole chunk, are returned as an actual nonempty piece.
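The role of the two flags is easier to see when the same preg_split() call is run once without PREG_SPLIT_NO_EMPTY. This is only an illustrative sketch with a fixed chunk length of 3 and a plain ASCII sample string:

```php
<?php
$string = "abcdefg";

// Delimiters are kept, but so are the empty pieces between them.
$raw = preg_split('/(.{3})/us', $string, -1, PREG_SPLIT_DELIM_CAPTURE);
// $raw is array("", "abc", "", "def", "g")

// Adding PREG_SPLIT_NO_EMPTY leaves only the chunks and the remainder.
$chunks = preg_split('/(.{3})/us', $string, -1,
    PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);
// $chunks is array("abc", "def", "g")

print_r($raw);
print_r($chunks);
```

Here "abc" and "def" are captured delimiters, while "g" is the only real split piece that is not empty.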

You could use preg_split() to actually split the string into an array of properly sized strings by replacing the above preg_split() call with:

preg_split(
 "/(?<=\G.{" . ($length - 1) . "}|\A.{{$length}})(?<=.{{$length}})(?=.)/us",
 $string);

However, this regular expression is considerably slower, and I think you'll agree that it's not the most obvious regular expression either. If you need practice in understanding regular expressions, feel free to try to figure out how and why it works.

Encodings other than UTF-8

If you want an mbstring equivalent of str_split() that works with different encodings, you'll have to extract the parts of the string manually with the mbstring functions. While it isn't really much harder, it is significantly slower. To replicate the str_split() function, you could use a function like:

function mb_str_split ($string, $split_length = 1, $encoding = null)
{
 $chunk = (int) $split_length;
 $string = (string) $string;
 
 if ($chunk < 1)
 {
  return false;
 }
 
 // Use internal encoding if none provided
 if ($encoding === null)
 {
  $encoding = mb_internal_encoding();
 } 
 
 $len = mb_strlen($string, $encoding);
 $return = array();
 for ($i = 0; $i < $len; $i += $chunk)
 {
  $return[] = mb_substr($string, $i, $chunk, $encoding);
 }
 return $return;
}

This would work just fine with UTF-8 too, but because the code iterates over the string chunk by chunk, it's much slower than calling a single function that does the entire operation for you. UTF-8 is used much more often in PHP than any other multibyte encoding, which is why I provided a separate way of working with UTF-8 strings.
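As a rough usage sketch, here is the same loop applied to a UTF-16BE string. The function name demo_mb_split is hypothetical and was chosen on purpose: mbstring has shipped its own mb_str_split() since PHP 7.4, so declaring a function with that name on those versions fails. The sample string is also just an assumption for the demonstration.

```php
<?php
// Hypothetical name, to avoid clashing with the built-in mb_str_split()
// that exists in PHP 7.4 and later.
function demo_mb_split($string, $split_length = 1, $encoding = null)
{
    $chunk = (int) $split_length;
    if ($chunk < 1)
    {
        return false;
    }
    if ($encoding === null)
    {
        $encoding = mb_internal_encoding();
    }
    $len = mb_strlen($string, $encoding);
    $parts = array();
    for ($i = 0; $i < $len; $i += $chunk)
    {
        $parts[] = mb_substr($string, $i, $chunk, $encoding);
    }
    return $parts;
}

// "häß€x" as UTF-8 bytes, converted to UTF-16BE for the demonstration.
$utf8  = "h\xC3\xA4\xC3\x9F\xE2\x82\xACx";
$utf16 = mb_convert_encoding($utf8, 'UTF-16BE', 'UTF-8');

// Five characters in chunks of two give three parts.
$parts = demo_mb_split($utf16, 2, 'UTF-16BE');

// Convert each chunk back to UTF-8 before printing it.
foreach ($parts as $part)
{
    echo mb_convert_encoding($part, 'UTF-8', 'UTF-16BE'), "\n";
}
```

The chunks are counted in characters of the given encoding, so each two-character part here is four bytes long in UTF-16BE.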
