
I'm just starting to dive into Pyspark.

I have a dataset, a sample of which is shown below, and there's a query on it that I can't work out how to write.

[Image: sample of the dataset]

This is a sample of the actual dataset, which contains roughly 20K rows. I'm reading the CSV file into a DataFrame in the pyspark shell and trying to translate some basic SQL queries into PySpark to get hands-on practice. Below is one query I haven't been able to write:

1. Which country has the least number of Government Type entries (4th column)?

There are other queries I've written myself and can do in SQL, but I'm stuck on this one. If I get an idea for it, I should be able to work out the others along the same lines.

This is the only line I've managed to come up with after a lot of trial and error:

df.filter(df.Government=='Democratic').select('Country').show()

I'm not sure how to approach this problem statement. Any ideas?

  • Can you please add the code sample you have used? Commented Nov 11, 2019 at 13:12
  • @MaheshGupta: Added. Commented Nov 11, 2019 at 13:15
  • What do you want as the expected output? Commented Nov 11, 2019 at 13:15
  • My question: finding the country that has the least number of Government Type entries. For example, country A has its corresponding cities listed, and every city has a Government value such as 'Democratic' or 'Republic'. Let's say a value in the Government column appears 3 times for country A, 4 times for country B, and 8 times for country C. Then the country with the lowest count should be the answer. Commented Nov 11, 2019 at 13:22

1 Answer


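Here is how you can do it:

from pyspark.sql import Row
from pyspark.sql import functions as f

# Build a small sample DataFrame with the same columns as the dataset in the question
Demography = Row("City", "Country", "Population", "Government")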

demo1 = Demography("a","AD",1.2,"Democratic")
demo2 = Demography("b","AD",1.2,"Democratic")
demo3 = Demography("c","AD",1.2,"Democratic")
demo4 = Demography("m","XX",1.2,"Democratic")
demo5 = Demography("n","XX",1.2,"Democratic")
demo6 = Demography("o","XX",1.2,"Democratic")
demo7 = Demography("q","XX",1.2,"Democratic")

demographic_data = [demo1,demo2,demo3,demo4,demo5,demo6,demo7]

demographic_data_df = spark.createDataFrame(demographic_data)
demographic_data_df.show(10)

+----+-------+----------+----------+
|City|Country|Population|Government|
+----+-------+----------+----------+
|   a|     AD|       1.2|Democratic|
|   b|     AD|       1.2|Democratic|
|   c|     AD|       1.2|Democratic|
|   m|     XX|       1.2|Democratic|
|   n|     XX|       1.2|Democratic|
|   o|     XX|       1.2|Democratic|
|   q|     XX|       1.2|Democratic|
+----+-------+----------+----------+

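# Count the rows per country
country_counts = demographic_data_df.groupBy('Country').count().select('Country', f.col('count').alias('n'))

# The question asks for the least count, so take the minimum rather than the maximum
min_count = country_counts.agg(f.min('n').alias('n'))

# Join back so that every country whose count equals the minimum is kept
country_counts.join(min_count, on="n", how="inner").show()

+---+-------+
|  n|Country|
+---+-------+
|  3|     AD|
+---+-------+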
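
Since the question asks for the country with the least rows, it can help to cross-check the expected result on the small sample outside Spark. The following is only a plain-Python sketch of the same idea (count per country, then keep the countries at the minimum); the names `rows`, `counts`, `lowest` and `least` are illustrative and not part of any PySpark API.

```python
from collections import Counter

# Same (City, Country, Population, Government) rows as the sample DataFrame
rows = [
    ("a", "AD", 1.2, "Democratic"),
    ("b", "AD", 1.2, "Democratic"),
    ("c", "AD", 1.2, "Democratic"),
    ("m", "XX", 1.2, "Democratic"),
    ("n", "XX", 1.2, "Democratic"),
    ("o", "XX", 1.2, "Democratic"),
    ("q", "XX", 1.2, "Democratic"),
]

# Count rows per country (the plain-Python analogue of groupBy('Country').count())
counts = Counter(country for _, country, _, _ in rows)

# Keep every country whose count equals the minimum, so ties are preserved
lowest = min(counts.values())
least = sorted((country, n) for country, n in counts.items() if n == lowest)

print(least)  # [('AD', 3)]
```

On the sample data, AD has 3 rows and XX has 4, so AD is the country with the least.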

The other option is to register the DataFrame as a temporary view and run normal SQL queries against it. registerTempTable has been deprecated since Spark 2.0 in favor of createOrReplaceTempView, so you can register it like this:

demographic_data_df.createOrReplaceTempView("demographic_data_table")
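
Once the table is registered, the question comes down to a GROUP BY with an ordering on the count. The sketch below runs such a query against an in-memory SQLite table, purely so that it can be tried without a Spark session. The table name matches the one registered above, and the assumption is that the same query string could be passed to spark.sql(...) in the pyspark shell. Note that LIMIT 1 returns a single country, so ties at the minimum are not reported.

```python
import sqlite3

# Query assumed to be portable to Spark SQL: count rows per country,
# then take the country with the smallest count
QUERY = """
SELECT Country, COUNT(*) AS n
FROM demographic_data_table
GROUP BY Country
ORDER BY n ASC, Country ASC
LIMIT 1
"""

# Same sample rows as in the DataFrame example
rows = [
    ("a", "AD", 1.2, "Democratic"),
    ("b", "AD", 1.2, "Democratic"),
    ("c", "AD", 1.2, "Democratic"),
    ("m", "XX", 1.2, "Democratic"),
    ("n", "XX", 1.2, "Democratic"),
    ("o", "XX", 1.2, "Democratic"),
    ("q", "XX", 1.2, "Democratic"),
]

# SQLite stands in for the registered Spark table here
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE demographic_data_table "
    "(City TEXT, Country TEXT, Population REAL, Government TEXT)"
)
conn.executemany("INSERT INTO demographic_data_table VALUES (?, ?, ?, ?)", rows)

result = conn.execute(QUERY).fetchall()
conn.close()

print(result)  # [('AD', 3)]
```

In the pyspark shell, the equivalent call would presumably be spark.sql(QUERY).show().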

Hope that helps
