I'm trying to make this process more efficient. My dataframe has about 100k rows, and the tweet column can contain up to 20k strings per row. I want to remove words from each row's list if they appear in another list; on top of the data size, that remove list has about 600k entries.
I was hoping for some sort of vectorized solution, but I'm not sure it's possible.
What I'm currently doing:
removelist = df2.words.tolist()
for row in df.itertuples():
df.at[row.Index, 'tweet'] = [x for x in row.tweet if x not in removelist]
I know I can convert them to a set and do
set(row.tweet).intersection(removelist)
but maintaining duplicates is pretty important. Can anyone point me in the right direction?
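For reference, converting the remove list to a set doesn't force you to lose duplicates: you can keep the list comprehension exactly as it is and only change the container you test membership against, which turns each `x not in removelist` from an O(n) scan into an O(1) hash lookup. A minimal sketch (with made-up values standing in for the real data):

```python
# Build the lookup structure once; a set gives O(1) membership tests.
removeset = {"#d", "@a"}

# The list comprehension itself is unchanged, so order and
# duplicates in the tweet list are preserved.
tweet = ["#c", "#d", "#e", "#f", "#e"]
kept = [x for x in tweet if x not in removeset]
# kept == ["#c", "#e", "#f", "#e"]  (the duplicate "#e" survives)
```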
Edit: Sample data
df
tweet user
0 [@a] 1
1 [@b] 2
2 [#c, #d, #e, #f, #e] 3
3 [@g] 4
df2
words
0 #d
1 @a
Desired output:
tweet user
0 [] 1
1 [@b] 2
2 [#c, #e, #f, #e] 3
3 [@g] 4
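Putting it together on the sample data above, a sketch of the same filtering done with `Series.apply` instead of `itertuples` plus `df.at` (assuming the tweet column holds Python lists, as in the sample):

```python
import pandas as pd

# Reconstruct the sample frames from the question.
df = pd.DataFrame({
    "tweet": [["@a"], ["@b"], ["#c", "#d", "#e", "#f", "#e"], ["@g"]],
    "user": [1, 2, 3, 4],
})
df2 = pd.DataFrame({"words": ["#d", "@a"]})

# One set built up front replaces the 600k-element list;
# each membership test is then O(1) instead of a linear scan.
removeset = set(df2.words)
df["tweet"] = df["tweet"].apply(lambda t: [x for x in t if x not in removeset])
print(df)
```

This isn't vectorized in the NumPy sense (the column holds ragged Python lists, so pandas can't avoid per-row Python work), but the set lookup is where nearly all the time went, so in practice this is the speedup that matters.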