
I have data like the following in a file named babynames.csv:

year    name    percent     sex
1880    John    0.081541    boy
1880    William 0.080511    boy
1880    James   0.050057    boy

I need to sort the input by year and sex, and I want the output aggregated as shown below (this output is to be assigned to a new RDD):

year    sex   avg(percentage)   count(rows)
1880    boy   0.070703         3

I am not sure how to proceed after the following step in PySpark. I need your help with this:

testrdd = sc.textFile("babynames.csv")
rows = testrdd.map(lambda y:y.split(',')).filter(lambda x:"year" not in x[0])
aggregatedoutput = ????

1 Answer
  1. Follow the instructions from the README to include spark-csv package
  2. Load data

    df = (sqlContext.read
        .format("com.databricks.spark.csv")
        .options(inferSchema="true", delimiter=",", header="true")
        .load("babynames.csv"))
    
  3. Import required functions

    from pyspark.sql.functions import count, avg
    
  4. Group by and aggregate (optionally use Column.alias to name the output columns):

    df.groupBy("year", "sex").agg(avg("percent"), count("*"))
    
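As a quick sanity check (plain Python, no Spark needed), the expected output row for the (1880, boy) group can be reproduced by hand from the three sample percentages in the question:

```python
# Reproduce avg(percentage) and count(rows) for the (1880, boy) group
# from the three sample rows in the question.
percents = [0.081541, 0.080511, 0.050057]

avg_percent = sum(percents) / len(percents)
row_count = len(percents)

print(round(avg_percent, 6), row_count)  # → 0.070703 3
```

This matches the 0.070703 / 3 row in the desired output, so the groupBy/agg result can be verified against it.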

Alternatively:

  • cast percent to numeric
  • reshape to a format ((year, sex), percent)
  • aggregateByKey using pyspark.statcounter.StatCounter
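A minimal sketch of that alternative's mechanics, runnable in plain Python without a cluster: a (count, sum) pair stands in for StatCounter so the fold can be demonstrated locally, and the commented lines show the actual PySpark calls you would use on the rows RDD from the question:

```python
# In real PySpark this alternative would look like:
#
#   from pyspark.statcounter import StatCounter
#   stats = (rows
#       .map(lambda r: ((r[0], r[3]), float(r[2])))   # ((year, sex), percent)
#       .aggregateByKey(StatCounter(),
#                       StatCounter.merge,            # fold one value in
#                       StatCounter.mergeStats))      # merge two partials
#
# Below, the same fold is simulated with a (count, sum) accumulator.
rows = [
    ("1880", "John", "0.081541", "boy"),
    ("1880", "William", "0.080511", "boy"),
    ("1880", "James", "0.050057", "boy"),
]

def reshape(r):
    # Cast percent to numeric and key by (year, sex).
    return ((r[0], r[3]), float(r[2]))

zero = (0, 0.0)                                   # (count, sum)
seq = lambda acc, v: (acc[0] + 1, acc[1] + v)     # fold one value into the accumulator
comb = lambda a, b: (a[0] + b[0], a[1] + b[1])    # merge two partial accumulators

groups = {}
for key, value in map(reshape, rows):
    groups[key] = seq(groups.get(key, zero), value)

for (year, sex), (n, total) in groups.items():
    print(year, sex, round(total / n, 6), n)      # → 1880 boy 0.070703 3
```

The `comb` merge is what Spark applies across partitions; with StatCounter you additionally get variance, min, and max for free.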
