I hope someone can help with a simple sentiment analysis task in PySpark. I have a PySpark DataFrame where each row contains a single word. I also have a set of common stopwords.

I want to remove the rows where the word (the value of the row) is in the stopwords set.

Input:

+-------+
|  word |
+-------+
|    the|
|   food|
|     is|
|amazing|
|    and|
|  great|
+-------+

stopwords = {'the', 'is', 'and'}

Expected Output:

+-------+
|  word |
+-------+
|   food|
|amazing|
|  great|
+-------+

2 Answers

Use a negated isin filter:

from pyspark.sql import functions as F

df = df.filter(~F.col("word").isin(stop_words))

where stop_words is:

stop_words = {"the", "is", "and"}

Result:

+-------+
|word   |
+-------+
|food   |
|amazing|
|great  |
+-------+

You can create a DataFrame from the set of stopwords, then join it with the input DataFrame using a left_anti join:

stopwords_df = spark.createDataFrame([[w] for w in stopwords], ["word"])

result_df = input_df.join(stopwords_df, ["word"], "left_anti")

result_df.show()
#+-------+
#|   word|
#+-------+
#|amazing|
#|   food|
#|  great|
#+-------+
