
I have a few hundred thousand wonky values in a fixed-width file. I want to find the strings in old_values and replace them with the strings at the corresponding positions in new_values. I could loop through and do this one at a time, but I'm nearly certain there is a much faster way that I am not expert enough to know about.

old_values = ('0000}', '0000J', '0000K', '0000L', '0000M', '0000N')  # and many more
new_values = ('   -0', '   -1', '   -2', '   -3', '   -4', '   -5')  # and many more
file_snippet = '00000000000000010000}0000000000000000000200002000000000000000000030000J0000100000000000000500000000000000000000000' # each line is >7K chars long and there are over 6 gigs of text data

Looping through each value and running .replace on each line seems slow, e.g.:

for x in range(len(old_values)):
    line = line.replace(old_values[x], new_values[x])

Any tips for speeding things up?

  • Please post your current method that is slow. Commented Sep 7, 2013 at 19:22
  • How many is "many many more"? Are they all of the same length? Do they all come at the same length offset? Commented Sep 7, 2013 at 19:25
  • You will eventually have to iterate through your data if you want to change it. Commented Sep 7, 2013 at 19:28
  • Are all of the data broken up into such 5-character values? If so, you could simply split the data and set up a dictionary for the old/new values. Commented Sep 7, 2013 at 19:33
  • @LennartRegebro About a hundred. And they are all the same length and come at the same offsets (1500 or so variables in a fixed width file). Commented Sep 7, 2013 at 19:37

2 Answers


Here is code that goes through the data character by character and replaces a value whenever it finds a mapping. It assumes, though, that each value that needs to be replaced is unambiguous.

def replacer(instring, mapping):
    item = ''
    for char in instring:
        item += char
        # Emit everything that can no longer be part of a 5-character match,
        # keeping only the last 5 characters as the current candidate.
        yield item[:-5]
        item = item[-5:]
        if item in mapping:
            # The candidate is a known old value: emit its replacement and start over.
            yield mapping[item]
            item = ''
    # Emit whatever is left over at the end of the input.
    yield item


old_values = ('0000}', '0000J', '0000K', '0000L', '0000M', '0000N')
new_values = ('   -0', '   -1', '   -2', '   -3', '   -4', '   -5')
value_map = dict(zip(old_values, new_values))

file_snippet = '00000000000000010000}0000000000000000000200002000000000000000000030000J0000100000000000000500000000000000000000000' # each line is >7K chars long and there are over 6 gigs of text data

result = ''.join(replacer(file_snippet, value_map))
print result

On your example data this gives:

0000000000000001   -0000000000000000000020000200000000000000000003   -10000100000000000000500000000000000000000000
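
The generator works on any string, so the 6 GB file never has to be read into memory at once; a minimal sketch of streaming it line by line (the file names are placeholders, not from the question):

with open('input.txt') as src, open('output.txt', 'w') as dst:
    for line in src:
        # Each line is converted independently; the trailing newline simply
        # passes through the generator unchanged.
        dst.write(''.join(replacer(line, value_map)))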

A faster way would be to split the data into 5-character chunks, if the data fits that way:

old_values = ('0000}', '0000J', '0000K', '0000L', '0000M', '0000N')
new_values = ('   -0', '   -1', '   -2', '   -3', '   -4', '   -5')
value_map = dict(zip(old_values, new_values))

file_snippet = '00000000000000010000}0000000000000000000200002000000000000000000030000J0000100000000000000500000000000000000000000' # each line is >7K chars long and there are over 6 gigs of text data

result = []
for chunk in [ file_snippet[i:i+5] for i in range(0, len(file_snippet), 5) ]:
    if chunk in value_map:
        result.append(value_map[chunk])
    else:
        result.append(chunk)

result = ''.join(result)
print result

This results in no replacements on your example data, because the old values don't fall on 5-character boundaries there; remove one leading zero so they line up and you get:

000000000000001   -0000000000000000000020000200000000000000000003   -10000100000000000000500000000000000000000000

Same as above.
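
The same idea applies to the whole file one line at a time, assuming (as this answer does) that every line really is a sequence of 5-character fields. A rough sketch, again with placeholder file names and a hypothetical helper name:

def replace_fields(line, mapping, width=5):
    # Map each fixed-width field through the dict, keeping the field
    # unchanged when there is no replacement for it.
    return ''.join(mapping.get(line[i:i + width], line[i:i + width])
                   for i in range(0, len(line), width))

with open('input.txt') as src, open('output.txt', 'w') as dst:
    for line in src:
        dst.write(replace_fields(line.rstrip('\n'), value_map) + '\n')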



Making a substitution mapping (dict) makes things faster:

import timeit

input_string = '00000000000000010000}0000000000000000000200002000000000000000000030000J0000100000000000000500000000000000000000000'
old_values = ('0000}', '0000J', '0000K', '0000L', '0000M', '0000N')
new_values = ('   -0', '   -1', '   -2', '   -3', '   -4', '   -5')
mapping = dict(zip(old_values,new_values))


def test_replace_tuples(input_string, old_values, new_values):
    for x in xrange(len(old_values)):
        input_string = input_string.replace(old_values[x], new_values[x])
    return input_string


def test_replace_mapping(input_string, mapping):
    for k, v in mapping.iteritems():
        input_string = input_string.replace(k, v)
    return input_string


print timeit.Timer('test_replace_tuples(input_string, old_values, new_values)',
                   'from __main__ import test_replace_tuples, input_string, old_values, new_values').timeit(10000)

print timeit.Timer('test_replace_mapping(input_string, mapping)',
                   'from __main__ import test_replace_mapping, input_string, mapping').timeit(10000)

prints:

0.0547060966492
0.048122882843

Note that the results may differ for other inputs; test it on your real data.
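
For comparison, another common way to do many fixed replacements in one pass (not part of this answer, just a sketch that plugs into the same timeit harness above) is a single compiled regular expression whose alternation is built from the mapping keys:

import re

# Sketch only: one pattern that matches any old value; re.escape keeps
# characters such as '}' from being treated as regex syntax.
pattern = re.compile('|'.join(re.escape(k) for k in mapping))

def test_replace_regex(input_string, mapping):
    # Single scan of the string; each match is swapped via a dict lookup.
    return pattern.sub(lambda m: mapping[m.group(0)], input_string)

print timeit.Timer('test_replace_regex(input_string, mapping)',
                   'from __main__ import test_replace_regex, input_string, mapping').timeit(10000)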

3 Comments

This small difference is likely to disappear with the real data, which has much larger strings. What you are seeing is likely the lookup in the tuples taking a bit of time; with longer strings that difference will in practice disappear.
@LennartRegebro Good to know, thank you. I was actually thinking about suggesting PyPy, if it's applicable for the OP, of course.
PyPy may indeed help, but choosing the right algorithm will help more. :-)
