
I have a dataframe containing maybe ~200k words and phrases. Many of the phrases are duplicative (e.g., the word "adventure" appears thousands of times). I'd like to get a count of each word, and then dedupe. My attempts at doing so take a really long time--longer than anything else in the entire script--and doing anything with the resulting dataframe after I've gotten my counts also takes forever.

My initial input looks like this:

Keyword_Type    Keyword     Clean_Keyword
Not Geography   cat         cat
Not Geography   cat         cat
Not Geography   cat         cat
Not Geography   cats        cat
Not Geography   cave        cave
Not Geography   celebrity   celebrity
Not Geography   celebrities celebrity

I'm looking for an output like this, which counts all of the times that a given word appears in the dataframe.

Keyword_Type    Keyword     Clean_Keyword   Frequency
Not Geography   cat         cat             4 
Not Geography   cat         cat             4
Not Geography   cat         cat             4
Not Geography   cats        cat             4
Not Geography   cave        cave            1
Not Geography   celebrity   celebrity       2
Not Geography   celebrities celebrity       2

The code snippet I'm currently using is as follows:

from pyspark.sql import Window
from pyspark.sql.functions import count
w = Window.partitionBy("Clean_Keyword")
key_count = key_lemma.select("Keyword_Type", "Keyword", "Clean_Keyword", count("Clean_Keyword").over(w).alias("Frequency")).dropDuplicates()

I also tried a groupBy-count on the Clean_Keyword column and joining the counts back to the original dataframe (roughly the sketch below), but this also takes a disgustingly large amount of time. The functions I run on this same dataset before counting and deduping finish in a second or two, and any subsequent functions I run on the resulting df are faster in straight Python on my computer than in PySpark on a decent-sized cluster. So, needless to say, something I'm doing is definitely not optimal...
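For reference, the join variant I tried looked roughly like this (a sketch, not the exact code I ran):

# groupBy-count on Clean_Keyword, then join the counts back onto the original rows
import pyspark.sql.functions as F

freq = key_lemma.groupBy("Clean_Keyword").agg(F.count("*").alias("Frequency"))
key_count = key_lemma.join(freq, on="Clean_Keyword", how="left")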

  • I don't understand how those frequency numbers are made. Where is the text? In the keyword column? Commented Sep 17, 2017 at 21:19
  • I created them with the window partition. The dataset here isn't sorted, but "adventure" might appear thousands of times further down in my dataset; I windowed on the keyword and counted the rows within each window. Hope that answers your question--not sure if that was the source of confusion or not! Commented Sep 17, 2017 at 21:24
  • Unless I'm missing something, you shouldn't need to join after the groupby and count. Group by all three columns, not just Clean_Keyword. Commented Sep 17, 2017 at 21:26
  • Are you trying to count from anywhere in the dataframe? Does case matter? Can you make a smaller example, perhaps? Commented Sep 17, 2017 at 21:29
  • Just updated with a hopefully better example. I don't think grouping by all 3 columns will work, since I'm interested in the count of each Clean_Keyword value specifically, and I also need to keep the other two columns for later. Commented Sep 17, 2017 at 21:43

1 Answer


Your window function solution is going to be the most efficient: a join implies two sorts, whereas a window only implies one. You may be able to optimize by doing a groupBy before windowing instead of a dropDuplicates after:

import pyspark.sql.functions as psf
from pyspark.sql import Window

w = Window.partitionBy("Clean_Keyword")
# pre-aggregate to one row per distinct (Keyword_Type, Keyword, Clean_Keyword),
# then sum the partial counts per Clean_Keyword over the window
key_count = key_lemma\
    .groupBy(key_lemma.columns)\
    .agg(psf.count("*").alias("Frequency"))\
    .withColumn("Frequency", psf.sum("Frequency").over(w))
key_count.show()

    +-------------+-----------+-------------+---------+
    | Keyword_Type|    Keyword|Clean_Keyword|Frequency|
    +-------------+-----------+-------------+---------+
    |Not Geography|       cave|         cave|        1|
    |Not Geography|        cat|          cat|        4|
    |Not Geography|       cats|          cat|        4|
    |Not Geography|  celebrity|    celebrity|        2|
    |Not Geography|celebrities|    celebrity|        2|
    +-------------+-----------+-------------+---------+

This will be more efficient, especially if you have a lot of lines but not many distinct keys (i.e., most Keywords are equal to their Clean_Keyword).
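If you want to compare the two strategies yourself, looking at the physical plans is a quick sanity check. A rough sketch, assuming the same key_lemma DataFrame (the exact plans depend on your Spark version and join strategy):

import pyspark.sql.functions as psf
from pyspark.sql import Window

w = Window.partitionBy("Clean_Keyword")

# window version: typically a single shuffle and sort on Clean_Keyword
key_lemma.withColumn("Frequency", psf.count("*").over(w)).explain()

# join version: the groupBy needs its own shuffle, and a sort-merge join
# then sorts both sides on the join key
freq = key_lemma.groupBy("Clean_Keyword").agg(psf.count("*").alias("Frequency"))
key_lemma.join(freq, "Clean_Keyword").explain()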


2 Comments

Thanks! Unfortunately, this approach takes even longer :( I killed it after 10 minutes; the approach I was using took ~3-4 minutes. Is the efficiency meant to be gained through the groupBy? (Not asking this in an accusatory way, just curious.)
It will depend on your data, but if you have a lot of Keywords equal to their Clean_Keyword, the groupBy will reduce the number of lines significantly, so you'll have fewer lines for the window function computation (see the quick check below).
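A quick (hypothetical) way to gauge how much the pre-aggregation shrinks the input before the window step:

# hypothetical check: total rows vs. distinct rows fed to the window
total_rows = key_lemma.count()
distinct_rows = key_lemma.dropDuplicates().count()
print(total_rows, "rows ->", distinct_rows, "distinct rows before the window step")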
