
I'm new to Spark with Python and I'm trying to do some basic things to get an understanding of both.

I have a file like below -

empid||deptid||salary
1||10||500
2||10||200
3||20||300
4||20||400
5||20||100

I want to write a small PySpark program that reads the file and prints the count of employees in each department.

I've been working with databases, and this is quite simple in SQL, but I'm trying to do it with PySpark. I don't have any code to share as I'm completely new to Python and Spark, but I wanted to understand how it works through a simple hands-on example.

I've installed PySpark and did some quick reading here: https://spark.apache.org/docs/latest/quick-start.html

From my understanding there are dataframes on which one can perform SQL-like operations such as group by, but I'm not sure how to write proper code for it.

1 Answer


You can read the text file as a dataframe using:

df = spark.createDataFrame(
    sc.textFile("path/to/my/file")
      .filter(lambda l: not l.startswith("empid"))  # skip the header line
      .map(lambda l: l.split("||")),                # split on the || delimiter
    ["empid", "deptid", "salary"]
)

textFile loads each line of the file into an RDD with a single column. We filter out the header row, split each line on the "||" delimiter through a map, and convert the result to a dataframe. Note that all columns come out as strings; cast them if you need numeric types.

Starting from a python list of lists:

df = spark.createDataFrame(
    sc.parallelize([[1,10,500],
                    [2,10,200],
                    [3,20,300],
                    [4,20,400],
                    [5,20,100]]),
    ["empid","deptid","salary"]
)

df.show()

    +-----+------+------+
    |empid|deptid|salary|
    +-----+------+------+
    |    1|    10|   500|
    |    2|    10|   200|
    |    3|    20|   300|
    |    4|    20|   400|
    |    5|    20|   100|
    +-----+------+------+

Now to count the number of employees by department we'll use a groupBy and then use the count aggregation function:

df_agg = df.groupBy("deptid").count()
df_agg.show()

    +------+-----+
    |deptid|count|
    +------+-----+
    |    10|    2|
    |    20|    3|
    +------+-----+

For the max:

import pyspark.sql.functions as psf
df_agg.agg(psf.max("count")).show()

2 Comments

df.groupBy("deptid").count().show() -> gives me all the counts, but what if I need the max value of count? df.groupBy("deptid").count().max().show() -> doesn't work @Marie
I added the part with the maximum value of count.
