
I'm trying to find a way to do this process more efficiently, since my dataframe has about 100k rows and the tweet column can contain up to 20k strings per row.

I want to remove words from a list if that word appears in another list. On top of the size of my data, my remove list contains about 600k words.

I was hoping for some sort of vectorized solution, but I'm not sure it's possible.

What I'm currently doing:

removelist = df2.words.tolist()
for row in df.itertuples():
    df.at[row.Index, 'tweet'] = [x for x in row.tweet if x not in removelist]

I know I can convert them to a set and do

set(row.tweet).intersection(screen)

but maintaining duplicates is pretty important to me. Can anyone point me in the right direction?
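To make the duplicates problem concrete, here is a minimal sketch (using made-up sample values) of why the set route loses information, while a plain membership filter keeps repeats:

```python
words = ["#c", "#d", "#e", "#f", "#e"]
removeset = {"#d", "@a"}

# set difference collapses the repeated '#e' into one element
kept_set = set(words) - removeset

# a list comprehension with a set membership test keeps duplicates and order
kept_list = [w for w in words if w not in removeset]

print(sorted(kept_set))   # ['#c', '#e', '#f']
print(kept_list)          # ['#c', '#e', '#f', '#e']
```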

Edit: Sample data

df
                  tweet  user
0                  [@a]     1
1                  [@b]     2
2  [#c, #d, #e, #f, #e]     3
3                  [@g]     4

df2
    words
0  #d
1  @a

Desired output:

               tweet  user
0                 []     1
1               [@b]     2
2   [#c, #e, #f, #e]     3
3               [@g]     4
  • It would be good if you provided a sample data set and your desired output. Commented Mar 12, 2018 at 5:42

1 Answer

Iterating with itertuples is slow. I'd recommend a list comprehension for maximum speed; since this isn't an operation you can vectorise, it's likely your best bet. Converting the remove list to a set also makes each membership test O(1) instead of O(n):

removeset = set(df2.words.tolist())
df['tweet'] = [
    [j for j in i if j not in removeset] for i in df.tweet.tolist()
]
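Applied to the sample data from the question, a self-contained sketch of the same approach:

```python
import pandas as pd

# the sample frames from the question
df = pd.DataFrame({
    "tweet": [["@a"], ["@b"], ["#c", "#d", "#e", "#f", "#e"], ["@g"]],
    "user": [1, 2, 3, 4],
})
df2 = pd.DataFrame({"words": ["#d", "@a"]})

# build the lookup set once, then filter each row's list
removeset = set(df2.words.tolist())
df["tweet"] = [
    [j for j in i if j not in removeset] for i in df.tweet.tolist()
]

print(df.tweet.tolist())
# [[], ['@b'], ['#c', '#e', '#f', '#e'], ['@g']]
```

Note the duplicate '#e' in row 2 survives, which is what the set-intersection approach couldn't give you.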

