PHP mb_split(), capturing delimiters
preg_split has an optional PREG_SPLIT_DELIM_CAPTURE flag, which returns all delimiters in the returned array. mb_split does not.
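To make the difference concrete, here is a small sketch on a plain ASCII string (the sample text and variable names are only for illustration). It shows what the flag adds to the preg_split result, and that mb_split has no flags argument to ask for the same thing:

```php
<?php
$text = "one\ntwo\r\nthree";

// Without the flag, the line breaks are thrown away.
$plain = preg_split('/\r\n|\n|\r/', $text);

// With the flag, whatever the capturing group matched is returned as well.
$captured = preg_split('/(\r\n|\n|\r)/', $text, -1, PREG_SPLIT_DELIM_CAPTURE);

// mb_split() only accepts a pattern, a string and a limit, so the
// delimiters are always dropped.
$mb = mb_split('\r\n|\n|\r', $text);

var_dump($plain, $captured, $mb);
```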
Is there any way to split a multibyte string (not just UTF-8, but all kinds) and capture the delimiters?
I'm trying to make a multibyte-safe linebreak splitter that keeps the linebreaks, but I would prefer a more generically usable solution.
Solution: thanks to user Casimir et Hippolyte, I built a solution and posted it on GitHub (https://github.com/vanderlee/php-multibyte-functions/blob/master/functions/mb_explode.php). It allows all of the preg_split flags:
/**
 * A cross between mb_split and preg_split, adding the preg_split flags
 * to mb_split.
 * @param string $pattern
 * @param string $string
 * @param int $limit
 * @param int $flags
 * @return array
 */
function mb_explode($pattern, $string, $limit = -1, $flags = 0) {
    $strlen = strlen($string); // bytes!
    mb_ereg_search_init($string);

    // Each entry is array(length in bytes, is delimiter, is captured)
    $lengths = array();
    $position = 0;
    while (($array = mb_ereg_search_pos($pattern)) !== false) {
        // capture split
        $lengths[] = array($array[0] - $position, false, null);

        // move position
        $position = $array[0] + $array[1];

        // capture delimiter
        $regs = mb_ereg_search_getregs();
        $lengths[] = array($array[1], true, isset($regs[1]) && $regs[1]);

        // continue on?
        if ($position >= $strlen) {
            break;
        }
    }

    // add last bit, if not ending with split
    $lengths[] = array($strlen - $position, false, null);

    // substrings
    $parts = array();
    $position = 0;
    $count = 1;
    foreach ($lengths as $length) {
        $is_delimiter = $length[1];
        $is_captured = $length[2];

        if ($limit > 0 && !$is_delimiter && ($length[0] || ~$flags & PREG_SPLIT_NO_EMPTY) && ++$count > $limit) {
            // limit reached: the rest of the string becomes the last part
            if ($length[0] > 0 || ~$flags & PREG_SPLIT_NO_EMPTY) {
                $parts[] = $flags & PREG_SPLIT_OFFSET_CAPTURE
                    ? array(mb_strcut($string, $position), $position)
                    : mb_strcut($string, $position);
            }
            break;
        } elseif ((!$is_delimiter || ($flags & PREG_SPLIT_DELIM_CAPTURE && $is_captured)) && ($length[0] || ~$flags & PREG_SPLIT_NO_EMPTY)) {
            $parts[] = $flags & PREG_SPLIT_OFFSET_CAPTURE
                ? array(mb_strcut($string, $position, $length[0]), $position)
                : mb_strcut($string, $position, $length[0]);
        }

        $position += $length[0];
    }

    return $parts;
}
Capturing delimiters is only possible with preg_split and is not available in other functions.
So there are three possibilities:
1) Convert your string to UTF-8, use preg_split with PREG_SPLIT_DELIM_CAPTURE, and then use array_map to convert each item back to the original encoding.
This is the simplest way, which is not the case for the second one. (Note that in general it is simpler to work in UTF-8 than to deal with exotic encodings.)
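A minimal sketch of this first approach, assuming the mbstring extension is loaded. The helper name split_keep_delimiters and the UTF-16LE sample input are made up for illustration, and the pattern itself has to be written in UTF-8 with the /u modifier:

```php
<?php
/**
 * Hypothetical helper (not a built-in): split a string in any encoding
 * known to mbstring, keeping whatever the capturing group matches.
 */
function split_keep_delimiters(string $pattern, string $string, string $encoding): array
{
    // 1. Convert to UTF-8 so that PCRE can work on it in /u mode.
    $utf8 = mb_convert_encoding($string, 'UTF-8', $encoding);

    // 2. Split and keep the captured delimiters.
    $parts = preg_split($pattern, $utf8, -1, PREG_SPLIT_DELIM_CAPTURE);

    // 3. Convert every item back to the original encoding.
    return array_map(
        function ($part) use ($encoding) {
            return mb_convert_encoding($part, $encoding, 'UTF-8');
        },
        $parts
    );
}

// Sample input in UTF-16LE ("café", line break, "thé").
$input = mb_convert_encoding("caf\u{00E9}\nth\u{00E9}", 'UTF-16LE', 'UTF-8');
$parts = split_keep_delimiters('/(\r\n|\n|\r)/u', $input, 'UTF-16LE');
```

Each item in $parts is again a UTF-16LE string, and joining the items gives back the input unchanged.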
2) Instead of a split-like function, you need to use, for example, mb_ereg_search_regs to get the matched parts, and build a pattern like this:
delimiter|all_that_is_not_the_delimiter
(Note that the two branches of the alternation must be mutually exclusive, and take care to write them in a way that makes gaps between the results impossible. The first part must be at the beginning of the string and the last part must be at the end. Each part must be contiguous to the previous one, and so on.)
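A rough sketch of this second approach for line breaks, assuming the mbstring extension with mbregex support. The helper name mb_tokenize_lines is invented for the example, and EUC-JP is only chosen as a sample non-UTF-8 encoding in which the ASCII-only pattern has the same bytes:

```php
<?php
/**
 * Hypothetical helper: return every piece of $string, line breaks included,
 * by repeatedly matching "delimiter OR run of non-delimiter characters".
 */
function mb_tokenize_lines(string $string, string $encoding): array
{
    $previous = mb_regex_encoding();   // remember the global setting
    mb_regex_encoding($encoding);

    // Branch 1 is the delimiter, branch 2 is everything that is not one.
    // The two branches cannot match the same text, so there are no gaps.
    mb_ereg_search_init($string, '\r\n|\n|\r|[^\r\n]+');

    $parts = [];
    while (($regs = mb_ereg_search_regs()) !== false) {
        $parts[] = $regs[0];           // the whole match
    }

    mb_regex_encoding($previous);      // restore the global setting
    return $parts;
}

// Sample input converted to EUC-JP: three kanji, a line break, three katakana.
$input = mb_convert_encoding(
    "\u{65E5}\u{672C}\u{8A9E}\n\u{30C6}\u{30B9}\u{30C8}",
    'EUC-JP',
    'UTF-8'
);
$parts = mb_tokenize_lines($input, 'EUC-JP');
```

Because every character belongs to exactly one of the two branches, joining the items of $parts reproduces the input.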
3) Use mb_split with lookarounds. By definition, lookarounds are zero-width assertions: they don't match any characters, only positions in the string. So you can use this kind of pattern, which matches the positions after or before the delimiter:
(?=delimiter)|(?<=delimiter)
(The limitation of this way is that the subpattern in the lookbehind can't have a variable length (in other words, you can't use a quantifier inside it), but it can be an alternation of fixed-length subpatterns: (?<=subpat1|subpat2|subpat3))
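A small sketch of the lookaround idea, assuming mbstring with mbregex support and a UTF-8 sample string. It only uses the lookbehind half of the pattern, so each piece should keep its trailing line break. How mb_split treats zero-width matches is not something I would rely on without checking, so print the result on your own PHP version:

```php
<?php
mb_regex_encoding('UTF-8');

// "café", line break, "thé", line break, "lait"
$text = "caf\u{00E9}\nth\u{00E9}\nlait";

// Split at every position that directly follows a line feed.
// The lookbehind consumes nothing, so no characters are lost.
$parts = mb_split('(?<=\n)', $text);

var_dump($parts);
```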