Python Pyspark - Text Analysis / Removing rows if word (value of row) is in a dictionary of stopwords

Question

I hope someone can help with a simple sentiment analysis task in PySpark. I have a PySpark DataFrame where each row contains a word. I also have a collection of common stopwords (a Python set).

I want to remove the rows where the word (the value of the row) is in the stopwords set.

Input:

+-------+
|   word|
+-------+
|    the|
|   food|
|     is|
|amazing|
|    and|
|  great|
+-------+

stopwords = {'the', 'is', 'and'}

Expected Output:

+-------+
|   word|
+-------+
|   food|
|amazing|
|  great|
+-------+

vladsiv · Accepted Answer · 2021-11-12 09:53:23Z

2

Use a negated isin:

from pyspark.sql import functions as F

# Keep only the rows whose word is NOT in stop_words
df = df.filter(~F.col("word").isin(stop_words))

where stop_words:

stop_words = {"the", "is", "and"}

Result:

+-------+
|word   |
+-------+
|food   |
|amazing|
|great  |
+-------+


blackbishop · Accepted Answer · 2021-11-12 09:55:17Z

1

You can create a DataFrame from the set of stopwords, then join it with the input DataFrame using a left_anti join:

stopwords_df = spark.createDataFrame([[w] for w in stopwords], ["word"])

result_df = input_df.join(stopwords_df, ["word"], "left_anti")

result_df.show()
#+-------+
#|   word|
#+-------+
#|amazing|
#|   food|
#|  great|
#+-------+
