I'm trying to do a simple count in PySpark programmatically but keep getting errors. `.count()` works at the end of the statement if I drop `AS (count(city))`, but I need the count to appear inside the query, not outside it.

result = spark.sql("SELECT city AS (count(city)) AND business_id FROM business WHERE city = 'Reading'") 

One of many errors:

Py4JJavaError: An error occurred while calling o24.sql.
: org.apache.spark.sql.catalyst.parser.ParseException: 
mismatched input '(' expecting ')'(line 1, pos 21)

== SQL ==
SELECT city AS (count(city)) AND business_id FROM business WHERE city = 'Reading'
---------------------^^^

Your syntax is incorrect. Maybe you want to do this instead:

result = spark.sql("""
    SELECT 
        count(city) over(partition by city) AS city_count, 
        business_id 
    FROM business 
    WHERE city = 'Reading'
""")

If you use `count` without a `GROUP BY`, you need to provide a window. In this case, you probably want the count for each city.
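To make the window-count semantics concrete without spinning up a Spark session, here is a plain-Python sketch of what `count(city) over(partition by city)` produces: every row keeps its `business_id` and is paired with the total count for its city. The table rows and ids below are hypothetical toy data, not from the original question.

```python
from collections import Counter

# Toy rows standing in for the business table (hypothetical data).
rows = [
    {"city": "Reading", "business_id": "b1"},
    {"city": "Reading", "business_id": "b2"},
    {"city": "Reading", "business_id": "b3"},
]

# A window count partitioned by city attaches the per-city total to every row,
# without collapsing the rows the way GROUP BY would.
city_counts = Counter(r["city"] for r in rows)
result = [(city_counts[r["city"]], r["business_id"]) for r in rows]
# Each business_id is paired with the total count for its city.
```

This mirrors the behavior described in the comment below: one column holds the (repeated) count, the other holds every business id.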

1 Comment

Boss, this solution doesn't work for me, but it's close to where I need to be with the code I'm writing: it gives me the count in one column and all the business ids in the other. I'm going to add my solution below, but if I can get this to work, kudos to you. And thank you.

Just my solution to the problem I'm trying to solve. The solution above is where I would like to end up.

result = spark.sql("SELECT count(*) FROM business WHERE city='Reading'")
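For comparison, a plain-Python sketch of what this simpler `SELECT count(*) ... WHERE city='Reading'` returns: a single scalar count rather than one row per business. The toy rows are hypothetical, chosen only to show the filter-then-count behavior.

```python
# Toy rows standing in for the business table (hypothetical data).
rows = [
    {"city": "Reading", "business_id": "b1"},
    {"city": "Reading", "business_id": "b2"},
    {"city": "Henley", "business_id": "b3"},
]

# count(*) with a WHERE clause: filter first, then count the surviving rows.
reading_count = sum(1 for r in rows if r["city"] == "Reading")
```

Unlike the window version, this collapses everything to one number, so the individual business ids are lost.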
