I have the following snippet of code, which is used to create a graph. I want to modify it to work in PySpark but am not sure how to proceed. The issue is that I can't iterate over a column in PySpark, and my attempts to turn this into a function have been unsuccessful.
Context: the DataFrame has a column called City, which holds the name of the city as a string.
cities = [i.City for i in df.select('City').distinct().collect()]
stack = []
for city in cities:
    df = sqlContext.sql('SELECT `Complaint Type`, COUNT(*) as `counts` '
                        'FROM c311 '
                        'WHERE City = "{}" COLLATE NOCASE '
                        'GROUP BY `Complaint Type` '
                        'ORDER BY counts DESC'.format(city))
    stack.append(Bar(x=df['Complaint Type'], y=df.counts, name=city.capitalize()))
My goal is to then convert the result with toPandas() and graph it locally. However, I get an error saying that Column is not iterable. How should I approach this in PySpark?
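
For context on the error: in PySpark, df['Complaint Type'] is a lazy Column expression rather than the actual values, so there is nothing for the plotting call to iterate over until the data has been brought to the driver. One possible direction, as a minimal sketch rather than a definitive answer, is to aggregate once for all cities, collect the small result, and build plain Python lists per city. The helper names collect_counts and traces_by_city are hypothetical, and the sketch assumes df still refers to the original DataFrame with the City and Complaint Type columns. It does not reproduce the case-insensitive match from the SQL version.

```python
from collections import defaultdict


def collect_counts(df):
    """Spark side: one aggregation over all cities instead of one query per city.

    Assumes `df` is the PySpark DataFrame from the question, with `City` and
    `Complaint Type` columns. collect() returns the aggregated result to the
    driver as a list of Row objects, so the result should be small.
    """
    return (
        df.groupBy("City", "Complaint Type")
        .count()                            # adds a column named "count"
        .orderBy("count", ascending=False)  # largest counts first
        .collect()
    )


def traces_by_city(rows):
    """Driver side: turn collected rows into plain Python lists per city.

    Works on anything indexable by column name (PySpark Row objects or dicts).
    Rows are assumed to arrive already sorted by count, descending.
    Returns {city: (complaint_types, counts)}.
    """
    traces = defaultdict(lambda: ([], []))
    for row in rows:
        xs, ys = traces[row["City"]]
        xs.append(row["Complaint Type"])
        # Index by name: attribute access (row.count) would hit tuple.count.
        ys.append(row["count"])
    return dict(traces)
```

With these in place, the loop from the question could become something like stack = [Bar(x=xs, y=ys, name=city.capitalize()) for city, (xs, ys) in traces_by_city(collect_counts(df)).items()], where Bar is the same plotting class used in the original snippet. Calling toPandas() on the aggregated DataFrame and grouping by City in pandas would be an alternative to collect().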