
I'm looking for the fastest way to replace a large number of sub-strings inside a very large string. Here are two examples I've used.

findall() feels simpler and more elegant, but it takes an astounding amount of time.

finditer() blazes through a large file, but I'm not sure this is the right way to do it.

Here's some sample code. Note that the actual text I'm interested in is a single string around 10 MB in size, and there's a huge difference between these two methods.

import re

def findall_replace(text, reg, rep):
    for match in reg.findall(text):
        output = text.replace(match, rep)
    return output

def finditer_replace(text, reg, rep):
    cursor_pos = 0
    output = ''
    for match in reg.finditer(text):
        output += "".join([text[cursor_pos:match.start(1)], rep])
        cursor_pos = match.end(1)
    output += "".join([text[cursor_pos:]])
    return output

reg = re.compile(r'(dog)')
rep = 'cat'
text = 'dog cat dog cat dog cat'

finditer_replace(text, reg, rep)

findall_replace(text, reg, rep)

UPDATE: Added the re.sub method to the tests:

def sub_replace(reg, rep, text):
    output = re.sub(reg, rep, text)
    return output

Results

re.sub() - 0:00:00.031000
finditer() - 0:00:00.109000
findall() - 0:01:17.260000
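(For reference, timings in that format can be produced with datetime; the following is only a minimal sketch of such a harness, not the exact one used, and the 10 MB input itself isn't shown.)

from datetime import datetime

start = datetime.now()
sub_replace(reg, rep, text)
print('re.sub() -', datetime.now() - start)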

  • And is the second one really much faster? Seems strange to me; they should take approximately the same time. And I think both ways are correct. Commented Feb 4, 2011 at 1:04
  • why aren't you using re's sub method? Commented Feb 4, 2011 at 1:04
  • Using += with strings is an O(n^2) operation, compared to the O(n) of building a list and using "".join() (see the sketch after these comments). Commented Feb 4, 2011 at 1:14
  • Sören: yes, the difference between the two methods is profound. Commented Feb 4, 2011 at 1:15
  • Is it a very simplified example? Because you could just use the_string.replace('dog', 'cat'). If it is, how complicated is the actual regex? Commented Feb 4, 2011 at 1:41
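A minimal sketch of the list-and-join approach from the comment above (the name finditer_replace_join is mine, not from the thread; like the original, it assumes a pattern with one capturing group, such as r'(dog)'):

def finditer_replace_join(text, reg, rep):
    # Collect the pieces in a list and join once at the end,
    # instead of growing a string with += on every match.
    pieces = []
    cursor_pos = 0
    for match in reg.finditer(text):
        pieces.append(text[cursor_pos:match.start(1)])
        pieces.append(rep)
        cursor_pos = match.end(1)
    pieces.append(text[cursor_pos:])
    return ''.join(pieces)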

4 Answers

28

The standard method is to use the built-in

re.sub(reg, rep, text)

Incidentally, the reason for the performance difference between your versions is that each replacement in your first version causes the entire string to be recopied. Copies are fast, but when you're copying 10 MB at a go, enough of them add up to something slow.
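Applied to the example in the question, that looks like the following minimal sketch (re.sub also accepts a compiled pattern, and pattern.sub() is equivalent):

import re

reg = re.compile(r'(dog)')
rep = 'cat'
text = 'dog cat dog cat dog cat'

# re.sub builds the result in a single pass, so the full string is
# never recopied once per match.
print(re.sub(reg, rep, text))  # cat cat cat cat cat cat
print(reg.sub(rep, text))      # same result via the compiled pattern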


1 Comment

Thank you. I didn't use re.sub() because I thought it operated in the same way as findall. I ran my tests again and re.sub is clearly the fastest method. The results have been added to the question.
6

You can, and I think you should since it is certainly an optimized function, use

re.sub(pattern, repl, string[, count, flags])

The reason why your findall_replace() function is slow is that at each match a new string object is created, as you will see by executing the following code:

ch = '''qskfg qmohb561687ipuygvnjoihi2576871987uuiazpoieiohoihnoipoioh
opuihbavarfgvipauhbi277auhpuitchpanbiuhbvtaoi541987ujptoihbepoihvpoezi 
abtvar473727tta aat tvatbvatzeouithvbop772iezubiuvpzhbepuv454524522ueh'''

import re

def findall_replace(text, reg, rep):
    for match in reg.findall(text):
        text = text.replace(match, rep)
        print(id(text))  # a new string object (new id) at every replacement
    return text

pat = re.compile(r'\d+')
rep = 'AAAAAAA'

print(id(ch))
print()
print(findall_replace(ch, pat, rep))

Note that in this code I replaced output = text.replace(match, rep) with text = text.replace(match, rep), otherwise only the last occurrence would be replaced.

finditer_replace() is slow for the same reason as findall_replace(): the repeated creation of string objects. But the former uses an iterator, re.finditer(), while the latter builds the whole list of matches beforehand with findall(), so it is slower still. That's the difference between an iterator and a non-iterator.
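A small sketch of that difference, using a toy string of my own:

import re

pat = re.compile(r'\d+')
s = 'a1 b22 c333'

# findall materializes the whole list of matched strings up front...
print(pat.findall(s))          # ['1', '22', '333']

# ...while finditer yields match objects lazily, one at a time.
for m in pat.finditer(s):
    print(m.group(), m.start(), m.end())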


4

By the way, your code with findall_replace() isn't safe; it can return unexpected results:

ch = 'sea sun ABC-ABC-DEF bling ranch micABC-DEF fish'

import re

def findall_replace(text, reg, rep):
    for gr in reg.findall(text):
        text = text.replace(gr, rep)
        print('group==', gr)
        print('text==', text)
    return '\nresult is : ' + text

pat = re.compile('ABC-DE')
rep = 'DEFINITION'

print('ch==', ch)
print()
print(findall_replace(ch, pat, rep))

Output:

ch== sea sun ABC-ABC-DEF bling ranch micABC-DEF fish

group== ABC-DE
text== sea sun ABC-DEFINITIONF bling ranch micDEFINITIONF fish
group== ABC-DE
text== sea sun DEFINITIONFINITIONF bling ranch micDEFINITIONF fish

result is : sea sun DEFINITIONFINITIONF bling ranch micDEFINITIONF fish
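For comparison, here is a sketch (mine, not from the answer) of the same data run through re.sub, which replaces in a single left-to-right pass and never rescans text it has already produced:

import re

ch = 'sea sun ABC-ABC-DEF bling ranch micABC-DEF fish'
pat = re.compile('ABC-DE')
rep = 'DEFINITION'

print(pat.sub(rep, ch))
# sea sun ABC-DEFINITIONF bling ranch micDEFINITIONF fish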


1

For your case, it is better to use the plain str.replace:

text.replace('dog', 'cat')

Methods with regex are slower.

+----------------------------------------------------------------------------+
|    full   |   max    |   min    |   avg    |           f           |   %   |
| 25.217 ms | 4.260 ms | 2.056 ms | 2.522 ms | test_re_finditer      | 163.4 |
| 22.624 ms | 2.945 ms | 2.096 ms | 2.262 ms | test_re_search        | 146.6 |
| 20.606 ms | 2.210 ms | 1.981 ms | 2.061 ms | test_re_sub_lambda    | 133.5 |
| 15.435 ms | 1.581 ms | 1.495 ms | 1.544 ms | test_reduce_function2 | 100.0 |
| 15.740 ms | 1.644 ms | 1.501 ms | 1.574 ms | test_str_replace      | 102.0 |
+----------------------------------------------------------------------------+
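A rough sketch of how such a comparison can be reproduced with timeit (the sample text, function names and counts here are mine; absolute numbers will differ from the table above):

import re
import timeit

text = 'dog cat ' * 100000       # synthetic sample, roughly 0.8 MB
pat = re.compile('dog')

def with_str_replace():
    return text.replace('dog', 'bird')

def with_re_sub():
    return pat.sub('bird', text)

for fn in (with_str_replace, with_re_sub):
    print(fn.__name__, timeit.timeit(fn, number=10))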

By the way, if you need to replace multiple one-character strings, the best way is str.translate. For example:

trans = str.maketrans({
            '0': '!',
            'b': '@',
            'c': '#',
            'd': '$',
        })
text.translate(trans)
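A self-contained version of that snippet, with a sample string of my own:

sample = 'abcd0 abcd0'
trans = str.maketrans({
    '0': '!',
    'b': '@',
    'c': '#',
    'd': '$',
})
print(sample.translate(trans))   # a@#$! a@#$!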

Methods with regex are much slower.

+------------------------------------------------------------------------------------+
|     full    |    max     |    min     |    avg     |           f           |   %   |
| 1757.060 ms | 182.222 ms | 169.885 ms | 175.706 ms | test_re_finditer      |  999+ |
| 1959.685 ms | 202.098 ms | 190.872 ms | 195.968 ms | test_re_search        |  999+ |
| 1245.943 ms | 149.822 ms | 116.749 ms | 124.594 ms | test_re_sub_lambda    |  999+ |
|   60.170 ms |   6.356 ms |   5.710 ms |   6.017 ms | test_reduce_function2 | 365.4 |
|   58.681 ms |   6.050 ms |   5.569 ms |   5.868 ms | test_str_replace      | 356.4 |
|   16.466 ms |   2.264 ms |   1.267 ms |   1.647 ms | test_translate        | 100.0 |
+------------------------------------------------------------------------------------+

This was tested on a text file (1.8 MB, 1,771,184 chars) under Python 3.13, with cotests; the full test is here.

For a single replacement (for example just {'0': '!'}), the difference is smaller but still large. str.replace also showed a good result.

+--------------------------------------------------------------------------------+
|    full    |    max    |    min    |    avg    |           f           |   %   |
| 359.318 ms | 41.078 ms | 34.620 ms | 35.932 ms | test_re_finditer      |  999+ |
| 393.215 ms | 40.254 ms | 38.788 ms | 39.322 ms | test_re_search        |  999+ |
| 262.919 ms | 32.364 ms | 24.856 ms | 26.292 ms | test_re_sub_lambda    |  999+ |
|  15.471 ms |  2.093 ms |  1.352 ms |  1.547 ms | test_reduce_function2 | 119.8 |
|  14.745 ms |  1.547 ms |  1.325 ms |  1.475 ms | test_str_replace      | 114.2 |
|  12.914 ms |  1.553 ms |  1.204 ms |  1.291 ms | test_translate        | 100.0 |
+--------------------------------------------------------------------------------+

As a result, do not use regex for such tasks.

1 Comment

Thanks. str.translate is faster than str.replace; I converted my list of (old, new) string pairs [('"', ' '), ('♪', ' ')] to a dictionary {'"': ' ', '♪': ' '} first.
