Python Regex Replace Matching Text

Question

I have a huge text file that I handle as one big string for efficiency reasons (I don't read the file line by line). I want to delete every character that comes after -swf and before the next ||.

The text looks like this:

bla bla bla ||NULL||abc-swf||NULL||NULL
bla bla bla ||NULL||cdacda-swfend%23wrapclass||NULL||NULL
bla bla bla ||NULL||bgdbgdbgd-swf%28ML%29endBeliefnet.Web.UI.S||NULL||NULL

I want the final result to look like this:

bla bla bla ||NULL||abc-swf||NULL||NULL
bla bla bla ||NULL||cdacda-swf||NULL||NULL
bla bla bla ||NULL||bgdbgdbgd-swf||NULL||NULL

I can do this line by line using the partition function in Python, but that takes a lot of time because the file has more than 10M rows. Is there any way to do this without examining the file line by line?

Your problem has nothing to do with the question title. I'd recommend rewriting it to refer to using regex to substitute text in a big text file.
– aldux, Commented Apr 1, 2014 at 21:29

Kevin Gori · Accepted Answer · 2014-04-01 22:08:57Z

3

This should do what you want:

import re

s = '''bla bla bla ||NULL||abc-swf||NULL||NULL
bla bla bla ||NULL||cdacda-swfend%23wrapclass||NULL||NULL
bla bla bla ||NULL||bgdbgdbgd-swf%28ML%29endBeliefnet.Web.UI.S||NULL||NULL'''

# bad_regex = re.compile(r'(?<=swf)[^|]+')  # original attempt: stops at a single pipe character |
regex = re.compile(r'(?<=-swf).*?(?=\|\|)')  # matches everything between -swf and the next ||
s = regex.sub('', s)  # sub() returns a new string, so assign the result back to s

Output:

>>> print(s)
bla bla bla ||NULL||abc-swf||NULL||NULL
bla bla bla ||NULL||cdacda-swf||NULL||NULL
bla bla bla ||NULL||bgdbgdbgd-swf||NULL||NULL

Edit 1: The regex I gave in the original answer fails if the text for removal has a '|' character in it. I've replaced it with a regex that doesn't have this problem.
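
To illustrate the difference described in Edit 1, here is a small sketch comparing the two patterns. The sample row is made up for this example (it is not from the original data); it simply places a single '|' inside the text that should be removed.

```python
import re

# Original attempt: the match stops at the first single pipe character
bad_regex = re.compile(r'(?<=swf)[^|]+')
# Replacement: lazily matches everything up to the next double pipe
good_regex = re.compile(r'(?<=-swf).*?(?=\|\|)')

# Hypothetical row: the text to remove contains a single '|'
row = 'bla bla bla ||NULL||abc-swfend|extra||NULL||NULL'

bad_result = bad_regex.sub('', row)
good_result = good_regex.sub('', row)

print(bad_result)   # '|extra' is left behind after -swf
print(good_result)  # everything between -swf and || is removed
```

With the first pattern the row keeps the '|extra' fragment, while the second pattern leaves only 'abc-swf' in that field.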

edited Apr 1, 2014 at 22:08

answered Apr 1, 2014 at 21:40

Kevin Gori


Carst · Accepted Answer · 2014-04-01 21:53:59Z

1

To make it really quick you could probably try Cython. You could also first check whether this performs better:

def test_speed():
    row_text = 'bla bla bla ||NULL||cdacda-swfend%23wrapclass||NULL||NULL'
    string_list = row_text.split('||')  # gives a list of fields
    # Only partition the third field (string_list[2]) and keep the text up to and including '-swf'
    string_list[2] = ''.join(string_list[2].partition('-swf')[0:2])
    # Then join the fields together again
    row_text = '||'.join(string_list)
    return row_text

%timeit test_speed()
100000 loops, best of 3: 1.36 µs per loop

Just some ideas. It seems to be quite fast.

Edit: looking at Kevin's regex example:

import re
regex = re.compile(r'(?<=swf)[^|]+')
def test_regex_speed(regex):
    row_text = 'bla bla bla ||NULL||cdacda-swfend%23wrapclass||NULL||NULL'
    regex.sub('', row_text)

%timeit test_regex_speed(regex)
100000 loops, best of 3: 2.16 µs per loop

So that's a bit slower, but you could do the entire file at once with the regex.

Edit 2: sorry, I missed that the entire file is already in memory. For optimal memory usage I would still suggest going through large files row by row, though.

edited Apr 1, 2014 at 21:53

answered Apr 1, 2014 at 21:41

Carst


1 Comment

Georgia2004 Over a year ago

Thank you for your reply. Yes, I was looking for something that wouldn't require me to go through the file row by row. I implemented the row-by-row code and it took 4 hours to go through the entire file. The regex takes a few minutes!
