Can Spark SQL not count correctly or can I not write SQL correctly?

Question

In a Python notebook on Databricks "Community Edition", I'm experimenting with the City of San Francisco open data about emergency calls to 911 requesting firefighters. (The old 2016 copy of the data used in "Using Apache Spark 2.0 to Analyze the City of San Francisco's Open Data" (YouTube) and made available on S3 for that tutorial.)

After mounting the data and reading it with the explicitly defined schema into a DataFrame fire_service_calls_df, I aliased that DataFrame as an SQL table:

sqlContext.registerDataFrameAsTable(fire_service_calls_df, "fireServiceCalls")

With that and the DataFrame API, I can count the call types that occurred:

fire_service_calls_df.select('CallType').distinct().count()

Out[n]: 34

... or with SQL in Python:

spark.sql("""
SELECT count(DISTINCT CallType)
FROM fireServiceCalls
""").show()

+------------------------+
|count(DISTINCT CallType)|
+------------------------+
|                      33|
+------------------------+

... or with an SQL cell:

%sql

SELECT count(DISTINCT CallType)
FROM fireServiceCalls

Why do I get two different count results? (It seems like 34 is the correct one, even though the talk in the video and the accompanying tutorial notebook mention "35".)

If your CallType is a String, can you check perhaps that SQL is (or is not) making the char vs varchar distinction? What happens when you count distinct trimmed values? — ernest_k
– ernest_k, Commented Mar 13, 2018 at 20:40
Count distinct in SQL will ignore nulls usually. I bet DataFrame counts them as a distinct value. — CharlesC
– CharlesC, Commented Mar 13, 2018 at 20:42
With only 30 something values, you could just sort and print all the distinct items to see where the difference is. — pault
– pault, Commented Mar 13, 2018 at 20:52

das-g · Accepted Answer · 2018-03-13 21:19:37Z

4

To answer the question

Can Spark SQL not count correctly or can I not write SQL correctly?

from the title: I can't write SQL correctly.

Rule <insert number> of writing SQL: Think about NULL and UNDEFINED.

%sql
SELECT count(*)
FROM (
  SELECT DISTINCT CallType
  FROM fireServiceCalls 
)

34

Also, i apparently can't read:

pault suggested in a comment

With only 30 something values, you could just sort and print all the distinct items to see where the difference is.

Well, I actually thought of that myself. (Minus the sorting.) Except, there wasn't any difference, there were always 34 call types in the output, whether I generated it with SQL or DataFrame queries. I simply didn't notice that one of them was ominously named null:

+--------------------------------------------+
|CallType                                    |
+--------------------------------------------+
|Elevator / Escalator Rescue                 |
|Marine Fire                                 |
|Aircraft Emergency                          |
|Confined Space / Structure Collapse         |
|Administrative                              |
|Alarms                                      |
|Odor (Strange / Unknown)                    |
|Lightning Strike (Investigation)            |
|null                                        |
|Citizen Assist / Service Call               |
|HazMat                                      |
|Watercraft in Distress                      |
|Explosion                                   |
|Oil Spill                                   |
|Vehicle Fire                                |
|Suspicious Package                          |
|Train / Rail Fire                           |
|Extrication / Entrapped (Machinery, Vehicle)|
|Other                                       |
|Transfer                                    |
|Outside Fire                                |
|Traffic Collision                           |
|Assist Police                               |
|Gas Leak (Natural and LP Gases)             |
|Water Rescue                                |
|Electrical Hazard                           |
|High Angle Rescue                           |
|Structure Fire                              |
|Industrial Accidents                        |
|Medical Incident                            |
|Mutual Aid / Assist Outside Agency          |
|Fuel Spill                                  |
|Smoke Investigation (Outside)               |
|Train / Rail Incident                       |
+--------------------------------------------+

edited Mar 13, 2018 at 21:19

answered Mar 13, 2018 at 20:57

das-g

10.1k6 gold badges44 silver badges95 bronze badges

Sign up to request clarification or add additional context in comments.

2 Comments

stevel Over a year ago

always good to start teaching people about dirty data from day 1. Interesting about the difference with DF though; probably need to do a filtering there first

das-g Over a year ago

To me, the DataFrame behavior is much more intuitive than SQL's. Sure, NULL is a rather special (non-)value, but why shouldn't it be considered another distinct value? But yeah, if you want to count without NULL, you could do so easily by injecting a .filter(...) that removes NULL entries.

Collectives™ on Stack Overflow

Can Spark SQL not count correctly or can I not write SQL correctly?

1 Answer 1

2 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

2 Comments

Your Answer

Sign up or log in

Post as a guest

Related