Modifying Pandas code for PySpark DataFrame

Question

I have the following snippet, which builds the traces for a graph. I want to modify it to work in PySpark but am not sure how to proceed. The problem is that I can't iterate over a column in PySpark, and my attempts to turn this into a function have not worked.

Context: the DataFrame has a column called City, which holds the name of the city as a string.

cities = [i.City for i in df.select('City').distinct().collect()]

stack = [] 

for city in cities:
    df = sqlContext.sql(   'SELECT `Complaint Type`, COUNT(*) as `counts` '
                           'FROM c311 '
                           'WHERE City = "{}" COLLATE NOCASE '
                           'GROUP BY `Complaint Type` '
                           'ORDER BY counts DESC'.format(city))

    stack.append(Bar(x=df['Complaint Type'], y=df.counts, name=city.capitalize()))

My goal is to then call toPandas() on the result and graph it locally. However, I get errors because a Column is not iterable. How should I approach this in PySpark?

user6022341 · Accepted Answer · 2016-12-13 13:54:24Z

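You can skip the loop entirely: normalize the city names with upper, then group by complaint type and pivot on city: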

from pyspark.sql.functions import upper, col

pdf = df.withColumn("city", upper(col("city"))) \
    .groupBy("Complaint Type").pivot("city").count() \
    .toPandas()

Alternatively, group by city and pivot on complaint type. Either way, you end up with a pandas DataFrame that you can plot from locally.
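
The pivoted frame has one row per complaint type and one column per city, so each city column maps onto one bar trace. Here is a minimal sketch of that last step, assuming a small hand-made pandas frame as a stand-in for the toPandas() result. The complaint types, cities and counts are invented for illustration, and bar_kwargs is a hypothetical helper name. It builds plain dicts so it runs without a plotting library; each dict could then be passed to the question's Bar as Bar(**kw).

```python
import pandas as pd

# Stand-in for .groupBy("Complaint Type").pivot("city").count().toPandas().
# All values are invented; a city/type combination with no rows shows up as NaN.
pdf = pd.DataFrame({
    "Complaint Type": ["Noise", "Heating", "Street Light"],
    "BROOKLYN": [120, 80, None],
    "QUEENS": [95, None, 40],
})


def bar_kwargs(pdf, type_col="Complaint Type"):
    """Build one dict of Bar keyword arguments per city column (hypothetical helper)."""
    traces = []
    for city in pdf.columns.drop(type_col):
        # Treat missing combinations as 0 and sort like the original ORDER BY counts DESC.
        counts = (
            pdf[[type_col, city]]
            .fillna({city: 0})
            .sort_values(city, ascending=False)
        )
        traces.append({
            "x": counts[type_col].tolist(),
            "y": counts[city].astype(int).tolist(),
            "name": city.capitalize(),
        })
    return traces


stack = bar_kwargs(pdf)
# With a plotting library available, for example: stack = [Bar(**kw) for kw in bar_kwargs(pdf)]
```

Because the data is already in pandas at this point, iterating over columns is fine here; the "Column is not iterable" error only arises when a Spark Column is handed to code that expects a local sequence.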

community wiki