python - Most efficient way to remove multiple substrings from string?


What's an efficient method to remove a list of substrings from a string?

I'd like a cleaner, quicker way of doing the following:

    words = 'word1 word2 word3 word4, word5'
    replace_list = ['word1', 'word3', 'word5']

    def remove_multiple_strings(cur_string, replace_list):
        # Strip each unwanted substring in turn (one pass over the string per word)
        for cur_word in replace_list:
            cur_string = cur_string.replace(cur_word, '')
        return cur_string

    remove_multiple_strings(words, replace_list)

Using a regex:

    >>> import re
    >>> re.sub(r'|'.join(map(re.escape, replace_list)), '', words)
    ' word2  word4, '
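The one-liner can be wrapped in a small reusable function. This is a minimal sketch, not part of the original answer: the name `remove_substrings` is hypothetical, and sorting the alternatives longest-first is an added assumption so that a substring is never shadowed by its own prefix (for example `word1` inside `word10`).

```python
import re

def remove_substrings(text, substrings):
    """Remove every occurrence of each substring in a single regex pass."""
    # Drop empty strings and duplicates; try longer alternatives first so that
    # 'word10' is matched before its prefix 'word1'.
    ordered = sorted({s for s in substrings if s}, key=len, reverse=True)
    if not ordered:
        return text
    # re.escape keeps characters such as '.' or '*' literal.
    pattern = re.compile('|'.join(map(re.escape, ordered)))
    return pattern.sub('', text)

words = 'word1 word2 word3 word4, word5'
replace_list = ['word1', 'word3', 'word5']
result = remove_substrings(words, replace_list)
print(repr(result))
```

Unlike the loop of `str.replace` calls, a single pass never re-scans text that earlier removals have joined together, so the two approaches can differ when removing one substring creates a new occurrence of another.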

The above one-liner is actually not as fast as your string.replace version, but it is definitely shorter:

    >>> import hashlib, random
    >>> words = ' '.join([hashlib.sha1(str(random.random())).hexdigest()[:10] for _ in xrange(10000)])
    >>> replace_list = words.split()[:1000]
    >>> random.shuffle(replace_list)
    >>> %timeit remove_multiple_strings(words, replace_list)
    10 loops, best of 3: 49.4 ms per loop
    >>> %timeit re.sub(r'|'.join(map(re.escape, replace_list)), '', words)
    1 loops, best of 3: 623 ms per loop
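The session above is Python 2 under IPython (`xrange`, `%timeit`). For readers on Python 3, a comparable measurement could be sketched with the standard `timeit` module as follows. The helper name `remove_with_regex`, the fixed seed, and the smaller input sizes are assumptions chosen to keep the run short; absolute timings will differ from those quoted above.

```python
import hashlib
import random
import re
import timeit

def remove_multiple_strings(cur_string, replace_list):
    for cur_word in replace_list:
        cur_string = cur_string.replace(cur_word, '')
    return cur_string

def remove_with_regex(cur_string, replace_list):
    return re.sub('|'.join(map(re.escape, replace_list)), '', cur_string)

random.seed(0)  # fixed seed so repeated runs use the same data
# In Python 3, sha1 needs bytes, hence the .encode()
tokens = [hashlib.sha1(str(random.random()).encode()).hexdigest()[:10]
          for _ in range(2000)]
words = ' '.join(tokens)
replace_list = tokens[:200]
random.shuffle(replace_list)

for func in (remove_multiple_strings, remove_with_regex):
    seconds = timeit.timeit(lambda f=func: f(words, replace_list), number=3)
    print('{}: {:.4f} s for 3 runs'.format(func.__name__, seconds))
```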

Gosh! Roughly 12.6x slower.

But can we improve it? Yes.

As we are only concerned with words, we can simply pick the words out of the words string using \w+ and compare each one against a set built from replace_list (yes, an actual set: set(replace_list)):

    >>> def sub(m):
    ...     return '' if m.group() in s else m.group()
    ...
    >>> %%timeit
    ... s = set(replace_list)
    ... re.sub(r'\w+', sub, words)
    ...
    100 loops, best of 3: 7.8 ms per loop

For an even larger string and a longer list of words, the string.replace approach and my first solution will end up taking quadratic time, whereas this solution should run in linear time.
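A self-contained Python 3 sketch of the set-based approach is shown below. The function name `remove_words` and the second sample input are hypothetical, added for illustration. Note that this version removes whole `\w+` tokens only, so it is not a drop-in replacement when true substring removal is required.

```python
import re

def remove_words(text, words_to_remove):
    """Remove whole words (runs of \\w characters) found in words_to_remove."""
    blocked = set(words_to_remove)  # O(1) average-time membership test

    def drop_if_blocked(match):
        token = match.group()
        return '' if token in blocked else token

    # One pass over the text; each token costs a single set lookup.
    return re.sub(r'\w+', drop_if_blocked, text)

words = 'word1 word2 word3 word4, word5'
replace_list = ['word1', 'word3', 'word5']
print(repr(remove_words(words, replace_list)))       # ' word2  word4, '

# Whole-word matching leaves 'word10' intact, whereas str.replace
# would strip the 'word1' prefix and leave '0'.
print(repr(remove_words('word10 word1', ['word1']))) # 'word10 '
```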

