
I have a dataframe containing maybe ~200k words and phrases. Many of the phrases are duplicative (e.g., the word "adventure" appears thousands of times). I'd like to get a count of each word, and then dedupe. My attempts at doing so take a really long time--longer than anything else in the entire script--and doing anything with the resulting dataframe after I've gotten my counts also takes forever.

My initial input looks like this:

Keyword_Type    Keyword     Clean_Keyword
Not Geography   cat         cat
Not Geography   cat         cat
Not Geography   cat         cat
Not Geography   cats        cat
Not Geography   cave        cave
Not Geography   celebrity   celebrity
Not Geography   celebrities celebrity

I'm looking for an output like this, which counts all of the times that a given word appears in the dataframe.

Keyword_Type    Keyword     Clean_Keyword   Frequency
Not Geography   cat         cat             4 
Not Geography   cat         cat             4
Not Geography   cat         cat             4
Not Geography   cats        cat             4
Not Geography   cave        cave            1
Not Geography   celebrity   celebrity       2
Not Geography   celebrities celebrity       2

The code snippet I'm currently using is as follows:

from pyspark.sql import Window
from pyspark.sql.functions import count
w = Window.partitionBy("Clean_Keyword")
key_count = key_lemma.select("Keyword_Type", "Keyword", "Clean_Keyword", count("Clean_Keyword").over(w).alias("Frequency")).dropDuplicates()

I also tried a groupBy-count on the Clean_Keyword column and joining the counts back to the original dataframe (roughly the sketch below), but this also takes a disgustingly large amount of time. The functions I run on this same dataset before counting and deduping finish in a second or two, and any subsequent functions I run on the resulting df are faster in straight Python on my computer than in PySpark on a decent-sized cluster. So, needless to say, something I'm doing is definitely not optimal...
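For reference, the join variant I tried looked roughly like this (a sketch, not the exact code I ran):

# groupBy-count on Clean_Keyword, then join the counts back onto the original rows
import pyspark.sql.functions as F

freq = key_lemma.groupBy("Clean_Keyword").agg(F.count("*").alias("Frequency"))
key_count = key_lemma.join(freq, on="Clean_Keyword", how="left")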

  • I don't understand how those frequency numbers are made. Where is the text? In the keyword column? Commented Sep 17, 2017 at 21:19
  • I created them with the window partition. The dataset here isn't sorted, but "adventure" might appear thousands of times further down in my dataset; I windowed on the keyword and counted the rows within each window. Hope that answers your question--not sure if that was the source of confusion or not! Commented Sep 17, 2017 at 21:24
  • Unless I'm missing something, you shouldn't need to join after the groupby and count. Group by all three columns, not just Clean_Keyword. Commented Sep 17, 2017 at 21:26
  • Are you trying to count from anywhere in the dataframe? Does case matter? Can you make a smaller example, perhaps? Commented Sep 17, 2017 at 21:29
  • Just updated with a hopefully better example. I don't think grouping by all 3 columns will work, since I'm interested in the count of each Clean_Keyword value specifically, and I also need to keep the other two columns for later. Commented Sep 17, 2017 at 21:43

1 Answer


Your window function solution is going to be the most efficient: a join implies two sorts, whereas a window only implies one. You may be able to optimize by doing a groupBy before windowing instead of a dropDuplicates after:

import pyspark.sql.functions as psf
from pyspark.sql import Window

w = Window.partitionBy("Clean_Keyword")
# pre-aggregate to one row per distinct (Keyword_Type, Keyword, Clean_Keyword),
# then sum the partial counts per Clean_Keyword over the window
key_count = key_lemma\
    .groupBy(key_lemma.columns)\
    .agg(psf.count("*").alias("Frequency"))\
    .withColumn("Frequency", psf.sum("Frequency").over(w))
key_count.show()

    +-------------+-----------+-------------+---------+
    | Keyword_Type|    Keyword|Clean_Keyword|Frequency|
    +-------------+-----------+-------------+---------+
    |Not Geography|       cave|         cave|        1|
    |Not Geography|        cat|          cat|        4|
    |Not Geography|       cats|          cat|        4|
    |Not Geography|  celebrity|    celebrity|        2|
    |Not Geography|celebrities|    celebrity|        2|
    +-------------+-----------+-------------+---------+

This will be more efficient, especially if you have a lot of lines but not many distinct keys (i.e., most Keywords are equal to their Clean_Keyword).
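If you want to compare the two strategies yourself, looking at the physical plans is a quick sanity check. A rough sketch, assuming the same key_lemma DataFrame (the exact plans depend on your Spark version and join strategy):

import pyspark.sql.functions as psf
from pyspark.sql import Window

w = Window.partitionBy("Clean_Keyword")

# window version: typically a single shuffle and sort on Clean_Keyword
key_lemma.withColumn("Frequency", psf.count("*").over(w)).explain()

# join version: the groupBy needs its own shuffle, and a sort-merge join
# then sorts both sides on the join key
freq = key_lemma.groupBy("Clean_Keyword").agg(psf.count("*").alias("Frequency"))
key_lemma.join(freq, "Clean_Keyword").explain()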


2 Comments

Thanks! Unfortunately, this approach takes even longer :( I killed it after 10 minutes; the approach I was using took ~3-4 minutes. Is the efficiency meant to be gained through the groupBy? (Not asking this in an accusatory way, just curious.)
It will depend on your data, but if you have a lot of Keywords equal to their Clean_Keyword, the groupBy will reduce the number of lines significantly, so you'll have fewer lines for the window function computation (see the quick check below).
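A quick (hypothetical) way to gauge how much the pre-aggregation shrinks the input before the window step:

# hypothetical check: total rows vs. distinct rows fed to the window
total_rows = key_lemma.count()
distinct_rows = key_lemma.dropDuplicates().count()
print(total_rows, "rows ->", distinct_rows, "distinct rows before the window step")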
