PHP mb_split(), capturing delimiters


preg_split has an optional PREG_SPLIT_DELIM_CAPTURE flag, which also returns the delimiters in the returned array. mb_split does not.

Is there a way to split a multibyte string (not just UTF-8, but all kinds) and capture the delimiters?

I'm trying to make a multibyte-safe linebreak splitter that keeps the linebreaks, but I would prefer a more generically usable solution.

Solution: thanks to user Casimir et Hippolyte, I built a solution and posted it on GitHub (https://github.com/vanderlee/php-multibyte-functions/blob/master/functions/mb_explode.php), which allows all the preg_split flags:

    /**
     * A cross between mb_split and preg_split, adding the preg_split flags
     * to mb_split.
     * @param string $pattern
     * @param string $string
     * @param int $limit
     * @param int $flags
     * @return array
     */
    function mb_explode($pattern, $string, $limit = -1, $flags = 0) {
        $strlen = strlen($string);      // bytes!
        mb_ereg_search_init($string);

        $lengths = array();
        $position = 0;
        while (($array = mb_ereg_search_pos($pattern)) !== false) {
            // capture split
            $lengths[] = array($array[0] - $position, false, null);

            // move position
            $position = $array[0] + $array[1];

            // capture delimiter
            $regs = mb_ereg_search_getregs();
            $lengths[] = array($array[1], true, isset($regs[1]) && $regs[1]);

            // continue on?
            if ($position >= $strlen) {
                break;
            }
        }

        // add last bit, if not ending with split
        $lengths[] = array($strlen - $position, false, null);

        // substrings
        $parts = array();
        $position = 0;
        $count = 1;
        foreach ($lengths as $length) {
            $is_delimiter = $length[1];
            $is_captured  = $length[2];

            if ($limit > 0 && !$is_delimiter && ($length[0] || ~$flags & PREG_SPLIT_NO_EMPTY) && ++$count > $limit) {
                // limit reached: return the remainder of the string as the last part
                if ($length[0] > 0 || ~$flags & PREG_SPLIT_NO_EMPTY) {
                    $parts[] = $flags & PREG_SPLIT_OFFSET_CAPTURE
                             ? array(mb_strcut($string, $position), $position)
                             : mb_strcut($string, $position);
                }
                break;
            } elseif ((!$is_delimiter || ($flags & PREG_SPLIT_DELIM_CAPTURE && $is_captured))
                   && ($length[0] || ~$flags & PREG_SPLIT_NO_EMPTY)) {
                $parts[] = $flags & PREG_SPLIT_OFFSET_CAPTURE
                         ? array(mb_strcut($string, $position, $length[0]), $position)
                         : mb_strcut($string, $position, $length[0]);
            }

            $position += $length[0];
        }

        return $parts;
    }

Capturing delimiters is only possible with preg_split and is not available in the other functions.

So there are three possibilities:

1) Convert your string to UTF-8, use preg_split with PREG_SPLIT_DELIM_CAPTURE, and use array_map to convert each item back to the original encoding.

This way is the simplest. That is not the case for the second way. (Note that in general, it is simpler to work in UTF-8 than to deal with exotic encodings.)
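The first possibility can be sketched as follows. This is only a minimal illustration, assuming the mbstring extension is loaded and the source encoding is known; the helper name split_keep_delimiters and its signature are hypothetical and not part of the original answer.

```php
<?php
// Hypothetical helper (name and signature are illustrative only).
// Converts to UTF-8, splits with PCRE while keeping the captured delimiters,
// then converts every piece back to the original encoding.
function split_keep_delimiters(string $pattern, string $string, string $encoding): array
{
    // Work in UTF-8 so the pattern can use the /u modifier.
    $utf8 = mb_convert_encoding($string, 'UTF-8', $encoding);

    // The pattern must wrap the delimiter in a capture group for
    // PREG_SPLIT_DELIM_CAPTURE to return it.
    $parts = preg_split($pattern, $utf8, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
    if ($parts === false) {
        return array(); // the split failed (for example, invalid UTF-8)
    }

    // Convert each item back to the original encoding.
    return array_map(
        function ($part) use ($encoding) {
            return mb_convert_encoding($part, $encoding, 'UTF-8');
        },
        $parts
    );
}

// Example: a Shift_JIS string containing two kinds of line break.
$sjis  = mb_convert_encoding("日本\r\n語\nテスト", 'SJIS', 'UTF-8');
$parts = split_keep_delimiters('/(\R)/u', $sjis, 'SJIS');
```

Here \R matches any line break sequence, so "\r\n" comes back as a single element, and joining $parts again gives the original byte string.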

2) In place of a split-like function, you need to use, for example, mb_ereg_search_regs to get the matched parts, and build a pattern like this:

delimiter|all_that_is_not_the_delimiter 

(Note that the two branches of the alternation must be mutually exclusive, and take care to write them in a way that makes gaps between results impossible. The first part must be at the beginning of the string and the last part must be at the end. Each part must be contiguous with the previous one, and so on.)
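A rough sketch of this second possibility for the linebreak case, assuming the mbstring regex functions are available and the string's encoding is one that mb_regex_encoding accepts. The function name mb_match_all_parts is hypothetical, and the pattern is just one way to write the two mutually exclusive branches.

```php
<?php
// Hypothetical helper (name is illustrative only).
// Walks the string with a pattern of the form
// delimiter|all_that_is_not_the_delimiter and collects every match in order.
function mb_match_all_parts(string $string, string $encoding): array
{
    // Remember the current regex encoding so it can be restored afterwards.
    $previous = mb_regex_encoding();
    mb_regex_encoding($encoding);

    // First three branches: a line break (the delimiter).
    // Last branch: a run of characters that are not line breaks.
    $pattern = '\r\n|\r|\n|[^\r\n]+';

    $parts = array();
    mb_ereg_search_init($string, $pattern);
    while (($regs = mb_ereg_search_regs()) !== false) {
        $parts[] = $regs[0]; // the whole match
    }

    mb_regex_encoding($previous);
    return $parts;
}

// Example: a Shift_JIS string containing two kinds of line break.
$sjis  = mb_convert_encoding("日本\r\n語\nテスト", 'SJIS', 'UTF-8');
$parts = mb_match_all_parts($sjis, 'SJIS');
```

Because the branches cover every character and never overlap, the matches are contiguous and joining $parts reproduces the input.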

3) Use mb_split with lookarounds. By definition, lookarounds are zero-width assertions and don't match any characters, only positions in the string. So you can use this kind of pattern, which matches positions before or after the delimiter:

(?=delimiter)|(?<=delimiter)

(The limitation of this way is that the subpattern in the lookbehind can't have a variable length (in other words, you can't use a quantifier inside), but it can be an alternation of fixed-length subpatterns: (?<=subpat1|subpat2|subpat3).)

