
I have a few hundred thousand wonky values in a fixed-width file. I want to find the strings in old_values and replace them with the strings at the corresponding positions in new_values. I could loop through and do this one at a time, but I'm nearly certain there is a much faster way that I am not expert enough to know about.

old_values = ('0000}', '0000J', '0000K', '0000L', '0000M', '0000N')  # and many more
new_values = ('   -0', '   -1', '   -2', '   -3', '   -4', '   -5')  # and many more
file_snippet = '00000000000000010000}0000000000000000000200002000000000000000000030000J0000100000000000000500000000000000000000000' # each line is >7K chars long and there are over 6 gigs of text data

Looping through each value and running .replace on each line seems slow, e.g.:

for x in range(len(old_values)):
    line = line.replace(old_values[x], new_values[x])

Any tips for speeding things up?

  • Please post your current method that is slow. Commented Sep 7, 2013 at 19:22
  • How many is "many many more"? Are they all of the same length? Do they all come at the same length offset? Commented Sep 7, 2013 at 19:25
  • You will eventually have to iterate through your data if you want to change it. Commented Sep 7, 2013 at 19:28
  • Are all of the data broken up into such 5-character values? If so, you could simply split the data and set up a dictionary for the old/new values. Commented Sep 7, 2013 at 19:33
  • @LennartRegebro About a hundred. And they are all the same length and come at the same offsets (1500 or so variables in a fixed width file). Commented Sep 7, 2013 at 19:37

2 Answers


Here is code that goes through the data character by character and replaces a value whenever it finds a mapping. It assumes, though, that each value that needs to be replaced is unambiguous.

def replacer(instring, mapping):
    item = ''
    for char in instring:
        item += char
        # Emit everything that can no longer be part of a 5-character match,
        # keeping only the last 5 characters as the current candidate.
        yield item[:-5]
        item = item[-5:]
        if item in mapping:
            # The candidate is a known old value: emit its replacement and start over.
            yield mapping[item]
            item = ''
    # Emit whatever is left over at the end of the input.
    yield item


old_values = ('0000}', '0000J', '0000K', '0000L', '0000M', '0000N')
new_values = ('   -0', '   -1', '   -2', '   -3', '   -4', '   -5')
value_map = dict(zip(old_values, new_values))

file_snippet = '00000000000000010000}0000000000000000000200002000000000000000000030000J0000100000000000000500000000000000000000000' # each line is >7K chars long and there are over 6 gigs of text data

result = ''.join(replacer(file_snippet, value_map))
print result

On your example data this gives:

0000000000000001   -0000000000000000000020000200000000000000000003   -10000100000000000000500000000000000000000000
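
The generator works on any string, so the 6 GB file never has to be read into memory at once; a minimal sketch of streaming it line by line (the file names are placeholders, not from the question):

with open('input.txt') as src, open('output.txt', 'w') as dst:
    for line in src:
        # Each line is converted independently; the trailing newline simply
        # passes through the generator unchanged.
        dst.write(''.join(replacer(line, value_map)))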

A faster way would be to split the data into 5-character chunks, if the data fits that way:

old_values = ('0000}', '0000J', '0000K', '0000L', '0000M', '0000N')
new_values = ('   -0', '   -1', '   -2', '   -3', '   -4', '   -5')
value_map = dict(zip(old_values, new_values))

file_snippet = '00000000000000010000}0000000000000000000200002000000000000000000030000J0000100000000000000500000000000000000000000' # each line is >7K chars long and there are over 6 gigs of text data

result = []
for chunk in [ file_snippet[i:i+5] for i in range(0, len(file_snippet), 5) ]:
    if chunk in value_map:
        result.append(value_map[chunk])
    else:
        result.append(chunk)

result = ''.join(result)
print result

This results in no replacements on your example data, because the old values don't fall on 5-character boundaries there; remove one leading zero so they line up and you get:

000000000000001   -0000000000000000000020000200000000000000000003   -10000100000000000000500000000000000000000000

Same as above.
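
The same idea applies to the whole file one line at a time, assuming (as this answer does) that every line really is a sequence of 5-character fields. A rough sketch, again with placeholder file names and a hypothetical helper name:

def replace_fields(line, mapping, width=5):
    # Map each fixed-width field through the dict, keeping the field
    # unchanged when there is no replacement for it.
    return ''.join(mapping.get(line[i:i + width], line[i:i + width])
                   for i in range(0, len(line), width))

with open('input.txt') as src, open('output.txt', 'w') as dst:
    for line in src:
        dst.write(replace_fields(line.rstrip('\n'), value_map) + '\n')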



Making a substitution mapping (dict) makes things faster:

import timeit

input_string = '00000000000000010000}0000000000000000000200002000000000000000000030000J0000100000000000000500000000000000000000000'
old_values = ('0000}', '0000J', '0000K', '0000L', '0000M', '0000N')
new_values = ('   -0', '   -1', '   -2', '   -3', '   -4', '   -5')
mapping = dict(zip(old_values,new_values))


def test_replace_tuples(input_string, old_values, new_values):
    for x in xrange(len(old_values)):
        input_string = input_string.replace(old_values[x], new_values[x])
    return input_string


def test_replace_mapping(input_string, mapping):
    for k, v in mapping.iteritems():
        input_string = input_string.replace(k, v)
    return input_string


print timeit.Timer('test_replace_tuples(input_string, old_values, new_values)',
                   'from __main__ import test_replace_tuples, input_string, old_values, new_values').timeit(10000)

print timeit.Timer('test_replace_mapping(input_string, mapping)',
                   'from __main__ import test_replace_mapping, input_string, mapping').timeit(10000)

prints:

0.0547060966492
0.048122882843

Note that the results may differ for other inputs; test it on your real data.
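
For comparison, another common way to do many fixed replacements in one pass (not part of this answer, just a sketch that plugs into the same timeit harness above) is a single compiled regular expression whose alternation is built from the mapping keys:

import re

# Sketch only: one pattern that matches any old value; re.escape keeps
# characters such as '}' from being treated as regex syntax.
pattern = re.compile('|'.join(re.escape(k) for k in mapping))

def test_replace_regex(input_string, mapping):
    # Single scan of the string; each match is swapped via a dict lookup.
    return pattern.sub(lambda m: mapping[m.group(0)], input_string)

print timeit.Timer('test_replace_regex(input_string, mapping)',
                   'from __main__ import test_replace_regex, input_string, mapping').timeit(10000)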

3 Comments

This small difference is likely to disappear with the real data, which has much larger strings. What you are seeing is likely the lookup in the tuples taking a bit of time; with longer strings that difference will in practice disappear.
@LennartRegebro Good to know, thank you. I was actually thinking about suggesting PyPy, if it's applicable for the OP, of course.
PyPy may indeed help, but choosing the right algorithm will help more. :-)
