I have the following snippet of code, which is used to create a graph. I want to modify it to work in PySpark but am not sure how to proceed. The issue is that I can't iterate over a column in PySpark, and my attempts to turn this into a function have been unsuccessful.

Context: the DataFrame has a column called City, which holds the name of the city as a string.

cities = [i.City for i in df.select('City').distinct().collect()]

stack = [] 

for city in cities:
    df = sqlContext.sql(   'SELECT `Complaint Type`, COUNT(*) as `counts` '
                           'FROM c311 '
                           'WHERE City = "{}" COLLATE NOCASE '
                           'GROUP BY `Complaint Type` '
                           'ORDER BY counts DESC'.format(city))

    stack.append(Bar(x=df['Complaint Type'], y=df.counts, name=city.capitalize()))

My goal is to then convert the result with toPandas() and graph it locally. However, I am encountering errors because a Column is not iterable. How should I approach this in PySpark?

1 Answer

You can do the whole aggregation in one pass with a pivot, instead of running one query per city:

from pyspark.sql.functions import upper, col

pdf = df.withColumn("city", upper(col("city"))) \
    .groupBy("Complaint Type").pivot("city").count() \
    .toPandas()

(or group by city and pivot by complaint type) and work with the resulting pandas DataFrame from there.
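As a rough sketch of what "use it from there" might look like: once the pivoted result is a pandas DataFrame, each city column can be turned into one bar series. The small frame below is a made-up stand-in for the output of toPandas(), and city_traces is a hypothetical helper name, not part of PySpark or pandas. It assumes the first layout (one row per complaint type, one column per city).

```python
import pandas as pd

# Stand-in for the pivoted result of toPandas(); the values are invented.
pdf = pd.DataFrame({
    "Complaint Type": ["Noise", "Heating", "Parking"],
    "BROOKLYN": [5, 3, None],
    "BRONX": [2, None, 4],
})

def city_traces(pdf, key="Complaint Type"):
    """Hypothetical helper: one {'name', 'x', 'y'} dict per city column."""
    traces = []
    by_type = pdf.set_index(key)
    for city in by_type.columns:
        # pivot() leaves nulls (NaN in pandas) where a city has no rows
        # for a complaint type, so treat those as a count of zero.
        counts = by_type[city].fillna(0).astype(int)
        # Mirror the ORDER BY counts DESC of the original query.
        counts = counts.sort_values(ascending=False)
        traces.append({
            "name": city.capitalize(),
            "x": counts.index.tolist(),
            "y": counts.tolist(),
        })
    return traces

traces = city_traces(pdf)
```

Each dict carries the same x, y and name arguments that the question passes to Bar, so something like stack = [Bar(**t) for t in traces] should rebuild the original list, assuming Bar is the plotting class used in the question.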
