
I have data like the following in a file named babynames.csv:

year    name    percent     sex
1880    John    0.081541    boy
1880    William 0.080511    boy
1880    James   0.050057    boy

I need to sort the input by year and sex, and I want the output aggregated as shown below (this output is to be assigned to a new RDD):

year    sex   avg(percentage)   count(rows)
1880    boy   0.070703         3

I am not sure how to proceed after the following step in PySpark. I need your help with this:

testrdd = sc.textFile("babynames.csv")
rows = testrdd.map(lambda y:y.split(',')).filter(lambda x:"year" not in x[0])
aggregatedoutput = ????

1 Answer
  1. Follow the instructions from the README to include spark-csv package
  2. Load data

    df = (sqlContext.read
        .format("com.databricks.spark.csv")
        .options(inferSchema="true", delimiter=",", header="true")
        .load("babynames.csv"))
    
  3. Import required functions

    from pyspark.sql.functions import count, avg
    
  4. Group by and aggregate (optionally use Column.alias to name the output columns):

    df.groupBy("year", "sex").agg(avg("percent"), count("*"))
    
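As a quick sanity check (plain Python, no Spark needed), the expected output row for the (1880, boy) group can be reproduced by hand from the three sample percentages in the question:

```python
# Reproduce avg(percentage) and count(rows) for the (1880, boy) group
# from the three sample rows in the question.
percents = [0.081541, 0.080511, 0.050057]

avg_percent = sum(percents) / len(percents)
row_count = len(percents)

print(round(avg_percent, 6), row_count)  # → 0.070703 3
```

This matches the 0.070703 / 3 row in the desired output, so the groupBy/agg result can be verified against it.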

Alternatively:

  • cast percent to numeric
  • reshape to a format ((year, sex), percent)
  • aggregateByKey using pyspark.statcounter.StatCounter
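A minimal sketch of that alternative's mechanics, runnable in plain Python without a cluster: a (count, sum) pair stands in for StatCounter so the fold can be demonstrated locally, and the commented lines show the actual PySpark calls you would use on the rows RDD from the question:

```python
# In real PySpark this alternative would look like:
#
#   from pyspark.statcounter import StatCounter
#   stats = (rows
#       .map(lambda r: ((r[0], r[3]), float(r[2])))   # ((year, sex), percent)
#       .aggregateByKey(StatCounter(),
#                       StatCounter.merge,            # fold one value in
#                       StatCounter.mergeStats))      # merge two partials
#
# Below, the same fold is simulated with a (count, sum) accumulator.
rows = [
    ("1880", "John", "0.081541", "boy"),
    ("1880", "William", "0.080511", "boy"),
    ("1880", "James", "0.050057", "boy"),
]

def reshape(r):
    # Cast percent to numeric and key by (year, sex).
    return ((r[0], r[3]), float(r[2]))

zero = (0, 0.0)                                   # (count, sum)
seq = lambda acc, v: (acc[0] + 1, acc[1] + v)     # fold one value into the accumulator
comb = lambda a, b: (a[0] + b[0], a[1] + b[1])    # merge two partial accumulators

groups = {}
for key, value in map(reshape, rows):
    groups[key] = seq(groups.get(key, zero), value)

for (year, sex), (n, total) in groups.items():
    print(year, sex, round(total / n, 6), n)      # → 1880 boy 0.070703 3
```

The `comb` merge is what Spark applies across partitions; with StatCounter you additionally get variance, min, and max for free.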
