python - Regex taking forever on short string. -
i'm looking through bunch of strings, trying match following pattern.
location_pattern = re.compile( r""" \b (?p<location> ([a-z]\w*[ -]*)+[, ]+ ( [a-z]{2} | [a-z]\w+\ *\d ## ) ) \b """, flags=re.verbose) now regex runs ok on data set, takes forever (well, 5 seconds) on this particular string:
' javascript software architect, successful serial' there bunch of strings 1 (all caps, lots of space chars) @ point in input data , program slowed when hits it. tried taking out different parts of regex, , turns out culprit the
\ *\d @ end of commented line.
i'd understand how causing regex validation take long.
can help?
the reason why removing \ *\d works because turn example in question non-matching case matching case. in backtracking engine, matching case takes (much) less time non-matching case, since in non-matching case, engine must exhaust search space come conclusion.
where problem lies correctly pointed out fede (the explanation rather hand-waving , inaccurate, though).
([a-z]\w*[ -]*)+ since [ -]* optional , \w can match [a-z], regex above degenerates ([a-z][a-z]*)+, matches classic example of catastrophic backtracking (a*)*. degenerate form shows problem manifests on long strings of uppercase letter.
the fragment doesn't cause harm. however, long sequel (whatever follows after fragment above) fails, cause catastrophic backtracking.
here 1 way rewrite (without knowing exact requirement):
[a-z]\w*(?:[ -]+[a-z]\w*)*[ -]* by forcing [ -]+ @ least once, pattern can no longer match string of uppercase letter in multiple ways.
Comments
Post a Comment