
I'm new to Spark with Python and I'm trying to do some basic things to get an understanding of both.

I have a file like below -

empid||deptid||salary
1||10||500
2||10||200
3||20||300
4||20||400
5||20||100

I want to write a small PySpark program that reads the file and prints the count of employees in each department.

I've been working with databases, and this is quite simple in SQL, but I'm trying to do it with PySpark. I don't have any code to share as I'm completely new to Python and Spark, but I wanted to understand how it works through a simple hands-on example.

I've installed PySpark and did some quick reading here: https://spark.apache.org/docs/latest/quick-start.html

From my understanding there are dataframes on which one can perform SQL-like operations such as group by, but I'm not sure how to write proper code for it.

1 Answer


You can read the text file as a dataframe using:

df = spark.createDataFrame(
    sc.textFile("path/to/my/file")
      .filter(lambda l: not l.startswith("empid"))  # skip the header line
      .map(lambda l: l.split("||")),                # split on the || delimiter
    ["empid", "deptid", "salary"]
)

textFile loads each line of the file into an RDD with a single column. We filter out the header row, split each line on the "||" delimiter through a map, and convert the result to a dataframe. Note that all columns come out as strings; cast them if you need numeric types.

Starting from a python list of lists:

df = spark.createDataFrame(
    sc.parallelize([[1,10,500],
                    [2,10,200],
                    [3,20,300],
                    [4,20,400],
                    [5,20,100]]),
    ["empid","deptid","salary"]
)

df.show()

    +-----+------+------+
    |empid|deptid|salary|
    +-----+------+------+
    |    1|    10|   500|
    |    2|    10|   200|
    |    3|    20|   300|
    |    4|    20|   400|
    |    5|    20|   100|
    +-----+------+------+

Now to count the number of employees by department we'll use a groupBy and then use the count aggregation function:

df_agg = df.groupBy("deptid").count()
df_agg.show()

    +------+-----+
    |deptid|count|
    +------+-----+
    |    10|    2|
    |    20|    3|
    +------+-----+

For the max:

import pyspark.sql.functions as psf
df_agg.agg(psf.max("count")).show()

2 Comments

df.groupBy("deptid").count().show() -> gives me all the counts, but what if I need the max value of count? df.groupBy("deptid").count().max().show() -> doesn't work @Marie
I added the part with the maximum value of count.
