python - Regex taking forever on short string. -


i'm looking through bunch of strings, trying match following pattern.

location_pattern = re.compile( r"""                                 \b                                 (?p<location>                                     ([a-z]\w*[ -]*)+[, ]+                                     (                                       [a-z]{2}                                         |                                       [a-z]\w+\ *\d  ##                                     )                                 )                                 \b                                  """, flags=re.verbose) 

now regex runs ok on data set, takes forever (well, 5 seconds) on this particular string:

' javascript   software   architect,   successful   serial' 

there bunch of strings 1 (all caps, lots of space chars) @ point in input data , program slowed when hits it. tried taking out different parts of regex, , turns out culprit the
\ *\d @ end of commented line.

i'd understand how causing regex validation take long.

can help?

the reason why removing \ *\d works because turn example in question non-matching case matching case. in backtracking engine, matching case takes (much) less time non-matching case, since in non-matching case, engine must exhaust search space come conclusion.

where problem lies correctly pointed out fede (the explanation rather hand-waving , inaccurate, though).

([a-z]\w*[ -]*)+ 

since [ -]* optional , \w can match [a-z], regex above degenerates ([a-z][a-z]*)+, matches classic example of catastrophic backtracking (a*)*. degenerate form shows problem manifests on long strings of uppercase letter.

the fragment doesn't cause harm. however, long sequel (whatever follows after fragment above) fails, cause catastrophic backtracking.

here 1 way rewrite (without knowing exact requirement):

[a-z]\w*(?:[ -]+[a-z]\w*)*[ -]* 

by forcing [ -]+ @ least once, pattern can no longer match string of uppercase letter in multiple ways.


Comments

Popular posts from this blog

python - TypeError: start must be a integer -

c# - DevExpress RepositoryItemComboBox BackColor property ignored -

django - Creating multiple model instances in DRF3 -