python - Regex taking forever on short string. -
i'm looking through bunch of strings, trying match following pattern.
location_pattern = re.compile( r""" \b (?p<location> ([a-z]\w*[ -]*)+[, ]+ ( [a-z]{2} | [a-z]\w+\ *\d ## ) ) \b """, flags=re.verbose)
now regex runs ok on data set, takes forever (well, 5 seconds) on this particular string:
' javascript software architect, successful serial'
there bunch of strings 1 (all caps, lots of space chars) @ point in input data , program slowed when hits it. tried taking out different parts of regex, , turns out culprit the
\ *\d @ end of commented line.
i'd understand how causing regex validation take long.
can help?
the reason why removing \ *\d
works because turn example in question non-matching case matching case. in backtracking engine, matching case takes (much) less time non-matching case, since in non-matching case, engine must exhaust search space come conclusion.
where problem lies correctly pointed out fede (the explanation rather hand-waving , inaccurate, though).
([a-z]\w*[ -]*)+
since [ -]*
optional , \w
can match [a-z]
, regex above degenerates ([a-z][a-z]*)+
, matches classic example of catastrophic backtracking (a*)*
. degenerate form shows problem manifests on long strings of uppercase letter.
the fragment doesn't cause harm. however, long sequel (whatever follows after fragment above) fails, cause catastrophic backtracking.
here 1 way rewrite (without knowing exact requirement):
[a-z]\w*(?:[ -]+[a-z]\w*)*[ -]*
by forcing [ -]+
@ least once, pattern can no longer match string of uppercase letter in multiple ways.
Comments
Post a Comment