
I'm trying to find a way to do this process more efficiently, since my dataframe has about 100k rows and the tweet column can contain up to 20k strings per row.

I want to remove words from a list if that word appears in another list. On top of the size of my data, my remove list contains about 600k words.

I was hoping for some sort of vectorized solution, but I'm not sure it's possible.

What I'm currently doing:

removelist = df2.words.tolist()
for row in df.itertuples():
    df.at[row.Index, 'tweet'] = [x for x in row.tweet if x not in removelist]

I know I can convert them to a set and do

set(row.tweet).intersection(screen)

but maintaining duplicates is pretty important to me. Can anyone point me in the right direction?
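To make the duplicates problem concrete, here is a minimal sketch (using made-up sample values) of why the set route loses information, while a plain membership filter keeps repeats:

```python
words = ["#c", "#d", "#e", "#f", "#e"]
removeset = {"#d", "@a"}

# set difference collapses the repeated '#e' into one element
kept_set = set(words) - removeset

# a list comprehension with a set membership test keeps duplicates and order
kept_list = [w for w in words if w not in removeset]

print(sorted(kept_set))   # ['#c', '#e', '#f']
print(kept_list)          # ['#c', '#e', '#f', '#e']
```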

Edit: Sample data

df
                  tweet  user
0                  [@a]     1
1                  [@b]     2
2  [#c, #d, #e, #f, #e]     3
3                  [@g]     4

df2
    words
0  #d
1  @a

Desired output:

               tweet  user
0                 []     1
1               [@b]     2
2   [#c, #e, #f, #e]     3
3               [@g]     4
  • It would be good if you provided a sample data set and your desired output. Commented Mar 12, 2018 at 5:42

1 Answer

Iterating with itertuples is slow. I'd recommend a list comprehension for maximum speed; since this isn't an operation you can vectorise, it's likely your best bet. Converting the remove list to a set also makes each membership test O(1) instead of O(n):

removeset = set(df2.words.tolist())
df['tweet'] = [
    [j for j in i if j not in removeset] for i in df.tweet.tolist()
]
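Applied to the sample data from the question, a self-contained sketch of the same approach:

```python
import pandas as pd

# the sample frames from the question
df = pd.DataFrame({
    "tweet": [["@a"], ["@b"], ["#c", "#d", "#e", "#f", "#e"], ["@g"]],
    "user": [1, 2, 3, 4],
})
df2 = pd.DataFrame({"words": ["#d", "@a"]})

# build the lookup set once, then filter each row's list
removeset = set(df2.words.tolist())
df["tweet"] = [
    [j for j in i if j not in removeset] for i in df.tweet.tolist()
]

print(df.tweet.tolist())
# [[], ['@b'], ['#c', '#e', '#f', '#e'], ['@g']]
```

Note the duplicate '#e' in row 2 survives, which is what the set-intersection approach couldn't give you.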

