Calculate quantile on grouped data in spark Dataframe

Question

I have the following Spark dataframe :

+--------+--------------+
|agent_id|payment_amount|
+--------+--------------+
|       a|          1000|
|       b|          1100|
|       a|          1100|
|       a|          1200|
|       b|          1200|
|       b|          1250|
|       a|         10000|
|       b|          9000|
+--------+--------------+

My desired output would be something like:

agent_id   95_quantile
  a          whatever the 0.95 quantile is for agent a's payments
  b          whatever the 0.95 quantile is for agent b's payments

For each agent_id group I need to calculate the 0.95 quantile. I tried the following approach:

test_df.groupby('agent_id').approxQuantile('payment_amount',0.95)

but I get the following error:

'GroupedData' object has no attribute 'approxQuantile'

I need the 0.95 quantile (percentile) in a new column so that it can be used later for filtering.

I am using Spark 2.0.0

approxQuantile isn't available in Spark versions below 2.0

– eliasah, Commented Sep 22, 2016 at 9:54

eliasah · Accepted Answer · 2018-05-18 16:38:43Z

16

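One solution would be to use percentile_approx:

>>> # Register the DataFrame as a temporary table so it can be queried with SQL
>>> # (on Spark 2+, createOrReplaceTempView supersedes the deprecated registerTempTable).
>>> test_df.registerTempTable("df")
>>> df2 = sqlContext.sql("select agent_id, percentile_approx(payment_amount,0.95) as approxQuantile from df group by agent_id")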

>>> df2.show()
# +--------+-----------------+
# |agent_id|   approxQuantile|
# +--------+-----------------+
# |       a|8239.999999999998|
# |       b|7449.999999999998|
# +--------+-----------------+
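
Neither value in the output above is a payment that appears in the table. As a rough sanity check, the numbers are consistent with a linear interpolation between the two largest payments of each group, with the k-th smallest value placed at cumulative count k. The sketch below is plain Python, only reproduces that arithmetic, and is an assumption about how such numbers can arise, not a description of Hive's or Spark's implementation. The helper name `interpolated_quantile` is hypothetical.

```python
from collections import defaultdict

# Rows copied from the table in the question.
rows = [("a", 1000), ("b", 1100), ("a", 1100), ("a", 1200),
        ("b", 1200), ("b", 1250), ("a", 10000), ("b", 9000)]

def interpolated_quantile(values, q):
    """Hypothetical helper: place the k-th smallest value at cumulative
    count k and interpolate linearly between neighbouring counts."""
    xs = sorted(values)
    target = q * len(xs)                 # 0.95 * 4 = 3.8 for these groups
    if target <= 1:
        return float(xs[0])
    k = min(int(target), len(xs) - 1)    # last whole count at or below the target, capped at n - 1
    return xs[k - 1] + (target - k) * (xs[k] - xs[k - 1])

# Collect the payments of each agent, mirroring "group by agent_id".
groups = defaultdict(list)
for agent_id, amount in rows:
    groups[agent_id].append(amount)

result = {agent: interpolated_quantile(vals, 0.95) for agent, vals in groups.items()}
print(result)  # a is about 8240, b is about 7450 (up to floating-point rounding)
```

Other Spark versions or percentile implementations may interpolate differently, or return an actual element of the group, so it is worth checking the result on a small sample before using it as a filtering threshold.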

Note 1 : This solution was tested with spark 1.6.2 and requires a HiveContext.

Note 2 : approxQuantile isn't available in Spark < 2.0 for pyspark.

Note 3 : percentile_approx returns an approximate pth percentile of a numeric column (including floating point types) in the group. It accepts an optional third argument B that controls the approximation accuracy; when the number of distinct values in col is smaller than B, this gives an exact percentile value.

EDIT : From Spark 2+, HiveContext is not required.

edited May 18, 2018 at 16:38

answered Sep 22, 2016 at 9:53


8 Comments

chessosapiens Over a year ago

Thanks, I am going to test it. Please correct me if I am wrong: is the reason I get that error that approxQuantile is not an aggregate function?

eliasah Over a year ago

approxQuantile is a stat function, indeed it's not an aggregate function.

chessosapiens Over a year ago

Thanks. 1. Is there any way to apply stat functions to groups of data? 2. Is it possible to create a Python wrapper for HiveContext?

eliasah Over a year ago

I'm not sure, I need to test first. If I'm not mistaken, HiveContext should be available in pyspark; you just need the right build.

eliasah Over a year ago

@Nabid check whether your package versions are compatible (the Spark package versions must be the same).
