Find and replace for millions of rows with regex in Python

Asked 8 years, 6 months ago

Modified 8 years, 6 months ago

Viewed 514 times

I have two sets of data.

The first, which serves as a dictionary, has two columns, keyword and id, and 180,000 rows. Below is some sample data.

Note that keywords can be as short as 2 characters or as long as 700 characters; there is no fixed keyword length. The id, however, follows a fixed pattern: a 3-digit number with a hash symbol before and after it.

keyword         id
salesman        #123#
painter         #486#
senior painter  #215#

The second file has a single column, corpus, with 22 million records; the length of each record varies between 10 and 1000. Below is sample data, which can be considered the input.

corpus
I am working as a salesman. salesmanship is not my forte, however i have become a good at it
I have been a painter since i was 19
are you the salesman?

Output

corpus
I am working as a #123#. salesmanship is not my forte, however i have become a good at it
I have been a #486# since i was 19
are you the #123#?

Please note that I want to replace complete words only, not partial matches inside longer words. In the first sentence, salesman was replaced with #123#, whereas salesmanship was not replaced with #123#ship. This requires adding the regular expression '\b' before and after each keyword, which is why regex is important for the search.
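
To illustrate the whole-word requirement, here is a minimal sketch using Python's standard `re` module on the first sample sentence. The variable names are only for illustration, and `re.escape` is added on the assumption that keywords are literal text rather than regex patterns.

```python
import re

sentence = ("I am working as a salesman. salesmanship is not my forte, "
            "however i have become a good at it")

# \b matches only at a word boundary, so "salesman" inside
# "salesmanship" is left untouched.
pattern = re.compile(r"\b" + re.escape("salesman") + r"\b")
replaced = pattern.sub("#123#", sentence)

print(replaced)
# I am working as a #123#. salesmanship is not my forte, however i have become a good at it
```

Note that `\b` is defined relative to word characters, so a keyword that begins or ends with punctuation may not match the way this example suggests.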

So this is a regex search-and-replace operation over millions of rows. I have read Mass string replace in python? and Speed up millions of regex replacements in Python 3; however, it is taking me days to do this find and replace, which I can't afford because this is a weekly task. I want to be able to do this much faster. Below is my code:

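Id = df_dict.Id.tolist()
# build whole-word regex patterns from the keywords
keyword = [r'\b' + x + r'\b' for x in df_dict.keyword]
# free the dictionary DataFrame to save memory
del df_dict
# replace each pattern with its id; regex=True is needed for \b to act as a word boundary
df_corpus["corpus_text"].replace(keyword, Id, regex=True, inplace=True)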
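
For comparison, one possible alternative to passing a long list of separate patterns is to compile a single alternation and look up each match in a dictionary. The following is only a sketch under stated assumptions: `build_replacer`, `mapping`, and `replace_all` are hypothetical names, keywords are treated as literal text, and it has not been measured on 180,000 keywords or 22 million rows, so whether it is fast enough at that scale is an open question.

```python
import re

def build_replacer(mapping):
    """Return a function that replaces all whole-word keywords in one pass."""
    # Put longer keywords first so that, when two keywords share a prefix,
    # the longer one is tried before the shorter one.
    ordered = sorted(mapping, key=len, reverse=True)
    alternation = "|".join(re.escape(k) for k in ordered)
    pattern = re.compile(r"\b(?:" + alternation + r")\b")
    # The replacement callback looks up the id for whichever keyword matched.
    return lambda text: pattern.sub(lambda m: mapping[m.group(0)], text)

# Sample dictionary and corpus taken from the question.
mapping = {"salesman": "#123#", "painter": "#486#", "senior painter": "#215#"}
corpus = [
    "I am working as a salesman. salesmanship is not my forte, "
    "however i have become a good at it",
    "I have been a painter since i was 19",
    "are you the salesman?",
]

replace_all = build_replacer(mapping)
result = [replace_all(line) for line in corpus]

for line in result:
    print(line)
```

On a pandas column, the same function could be applied with something like `df_corpus["corpus_text"].map(replace_all)`, assuming every value in the column is a string.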

edited Jun 3, 2017 at 12:03

asked Jun 3, 2017 at 11:54

StatguyUser

Parallelize it using multiprocessing. If you have an 8-core machine you should be able to split the job into 8 or more parts and do them concurrently.

– John Zwinck, Jun 3, 2017 at 12:32

I have a 4-core machine with 8 GB of RAM, so that is not the best option.

– StatguyUser, Jun 3, 2017 at 12:34

Is writing C, C++, Java, etc. an option?

– John Zwinck, Jun 3, 2017 at 12:35

I only know R and Python.

– StatguyUser, Jun 3, 2017 at 12:35

If you check closely, df_dict.Id.tolist() is pandas code: df_dict is a DataFrame, Id is the name of a column, and tolist() is used as in data_frame.column_name.tolist().

– StatguyUser, Jun 3, 2017 at 13:42
