pyspark sql query equivalent functions

Question

I'm just starting to dive into Pyspark.

I have a dataset, a sample of which is shown below, and there is one query on it that I can't work out how to write.

This is a sample of the actual dataset, which contains roughly 20K rows. I'm reading the CSV file into a data frame in the pyspark shell and trying to convert some basic SQL queries to PySpark to get hands-on practice. Below is one query I haven't been able to write:

1. Which country has the least number of Government Type entries (4th column)?

There are other queries I can already write in SQL, but I'm stuck on how to express this one in PySpark. If I get an idea for this one, I should be able to work out the others.

This is the only line I've managed to come up with after a lot of trial and error:

df.filter(df.Government=='Democratic').select('Country').show()

I'm not sure how to approach this problem statement. Any ideas?

My question: finding the country that has the least number of Government Type entries. For example, country A has its corresponding cities listed, and every city has a Govt Type value such as 'Democratic' or 'Republic'. Let's say a value in the Government column appears 3 times for country A, 4 times for country B, and 8 times for country C. Then the country with the smallest count (country A) should be the answer.
– knowone, Commented Nov 11, 2019 at 13:22

Jayadeep Jayaraman · Accepted Answer · 2019-11-11 15:03:35Z

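Here is how you can do it:

# `spark` is the SparkSession that the pyspark shell creates for you
from pyspark.sql import Row
from pyspark.sql import functions as f

Demography = Row("City", "Country", "Population", "Government")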

demo1 = Demography("a","AD",1.2,"Democratic")
demo2 = Demography("b","AD",1.2,"Democratic")
demo3 = Demography("c","AD",1.2,"Democratic")
demo4 = Demography("m","XX",1.2,"Democratic")
demo5 = Demography("n","XX",1.2,"Democratic")
demo6 = Demography("o","XX",1.2,"Democratic")
demo7 = Demography("q","XX",1.2,"Democratic")

demographic_data = [demo1,demo2,demo3,demo4,demo5,demo6,demo7]

demographic_data_df = spark.createDataFrame(demographic_data)
demographic_data_df.show(10)

+----+-------+----------+----------+
|City|Country|Population|Government|
+----+-------+----------+----------+
|   a|     AD|       1.2|Democratic|
|   b|     AD|       1.2|Democratic|
|   c|     AD|       1.2|Democratic|
|   m|     XX|       1.2|Democratic|
|   n|     XX|       1.2|Democratic|
|   o|     XX|       1.2|Democratic|
|   q|     XX|       1.2|Democratic|
+----+-------+----------+----------+

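# Count the rows per country
counts = demographic_data_df.groupBy('Country').count().select('Country', f.col('count').alias('n'))

# Smallest per-country count (the question asks for the least)
min_count = counts.agg(f.min('n').alias('n'))

# Keep every country whose count equals the minimum, so ties are preserved
counts.join(min_count, on="n", how="inner").show()

+---+-------+
|  n|Country|
+---+-------+
|  3|     AD|
+---+-------+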

The other option is to register the dataframe as a temporary view and run normal SQL queries against it. To register it, you can do the following (registerTempTable still works but has been deprecated since Spark 2.0):

demographic_data_df.createOrReplaceTempView("demographic_data_table")

Hope that helps
