I have a huge text file that I handle as one big string for efficiency reasons (I don't read the file line by line). In it, I want to delete every character that comes after -swf and before the next ||.

I have a huge text which looks like this:

bla bla bla ||NULL||abc-swf||NULL||NULL
bla bla bla ||NULL||cdacda-swfend%23wrapclass||NULL||NULL
bla bla bla ||NULL||bgdbgdbgd-swf%28ML%29endBeliefnet.Web.UI.S||NULL||NULL

I want the final result to look like this:

bla bla bla ||NULL||abc-swf||NULL||NULL
bla bla bla ||NULL||cdacda-swf||NULL||NULL
bla bla bla ||NULL||bgdbgdbgd-swf||NULL||NULL

I can do this line by line using the partition function in Python, but that takes a lot of time because the file has more than 10M rows. Is there any way to do this without examining the file line by line?

Your problem has nothing to do with the question title. I'd recommend rewriting it to refer to using a regex to substitute text in a big text file. Commented Apr 1, 2014 at 21:29

2 Answers

This should do what you want

import re

s = '''bla bla bla ||NULL||abc-swf||NULL||NULL
bla bla bla ||NULL||cdacda-swfend%23wrapclass||NULL||NULL
bla bla bla ||NULL||bgdbgdbgd-swf%28ML%29endBeliefnet.Web.UI.S||NULL||NULL'''

# bad_regex = re.compile(r'(?<=swf)[^|]+') # stops at a single pipe character |
regex = re.compile(r'(?<=-swf).*?(?=\|\|)') # matches everything between -swf and the next ||
s = regex.sub('', s) # sub() returns a new string, so assign the result back

Output:

>>> print(s)
bla bla bla ||NULL||abc-swf||NULL||NULL
bla bla bla ||NULL||cdacda-swf||NULL||NULL
bla bla bla ||NULL||bgdbgdbgd-swf||NULL||NULL

Edit 1: The regex I gave in the original answer fails if the text for removal has a '|' character in it. I've replaced it with a regex that doesn't have this problem.
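
To illustrate the difference described in Edit 1, here is a minimal sketch that runs both patterns on a made-up line (the sample line is my own assumption, not from the asker's data) where the text to remove contains a single '|':

```python
import re

# Hypothetical sample line: the text after -swf contains a single '|'.
line = 'bla ||NULL||abc-swfend|more||NULL'

old_regex = re.compile(r'(?<=swf)[^|]+')         # original pattern
new_regex = re.compile(r'(?<=-swf).*?(?=\|\|)')  # replacement pattern

old_result = old_regex.sub('', line)
new_result = new_regex.sub('', line)

print(old_result)  # bla ||NULL||abc-swf|more||NULL  (stops at the single pipe)
print(new_result)  # bla ||NULL||abc-swf||NULL
```

Because `.` does not match a newline by default, the lazy `.*?` cannot run past the end of a line, so a line with no `||` after `-swf` should be left untouched.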

To make it really quick, you could try Cython. You could also first check whether this performs better:

def test_speed():
    row_text = 'bla bla bla ||NULL||cdacda-swfend%23wrapclass||NULL||NULL'
    string_list = row_text.split('||') # which gives a list
    # Then only partition in the string_list[2] area -> 
    string_list[2] = ''.join(string_list[2].partition('-swf')[0:2])
    # then join it together again: 
    row_text = '||'.join(string_list)

%timeit test_speed()
100000 loops, best of 3: 1.36 µs per loop

Just some ideas. It seems to be quite fast.

Edit: looking at Kevin's regex example:

import re
regex = re.compile(r'(?<=swf)[^|]+')
def test_regex_speed(regex):
    row_text = 'bla bla bla ||NULL||cdacda-swfend%23wrapclass||NULL||NULL'
    regex.sub('', row_text)

%timeit test_regex_speed(regex)
100000 loops, best of 3: 2.16 µs per loop

So that's a bit slower, but you could do the entire file at once with the regex.

Edit 2: Sorry, I missed that the entire file is already in memory. For optimal memory usage, though, I would still suggest going through large files row by row.

1 Comment

Thank you for your reply. Yes, I was looking for something that wouldn't require me to go through the file row by row. I have implemented the row-by-row code and it took 4 hours to go through the entire file. The regex takes a few minutes!
