5

In a Python notebook on Databricks "Community Edition", I'm experimenting with the City of San Francisco open data about emergency calls to 911 requesting firefighters. (The old 2016 copy of the data used in "Using Apache Spark 2.0 to Analyze the City of San Francisco's Open Data" (YouTube) and made available on S3 for that tutorial.)

After mounting the data and reading it with the explicitly defined schema into a DataFrame fire_service_calls_df, I aliased that DataFrame as an SQL table:

sqlContext.registerDataFrameAsTable(fire_service_calls_df, "fireServiceCalls")

With that and the DataFrame API, I can count the call types that occurred:

fire_service_calls_df.select('CallType').distinct().count()
Out[n]: 34

... or with SQL in Python:

spark.sql("""
SELECT count(DISTINCT CallType)
FROM fireServiceCalls
""").show()
+------------------------+
|count(DISTINCT CallType)|
+------------------------+
|                      33|
+------------------------+

... or with an SQL cell:

%sql

SELECT count(DISTINCT CallType)
FROM fireServiceCalls

Databricks table output with column "count(DISTINCT CallType)" and single value "33".

Why do I get two different count results? (It seems like 34 is the correct one, even though the talk in the video and the accompanying tutorial notebook mention "35".)

4
  • If your CallType is a String, can you check perhaps that SQL is (or is not) making the char vs varchar distinction? What happens when you count distinct trimmed values? Commented Mar 13, 2018 at 20:40
  • 2
    Count distinct in SQL will ignore nulls usually. I bet DataFrame counts them as a distinct value. Commented Mar 13, 2018 at 20:42
  • 1
    With only 30 something values, you could just sort and print all the distinct items to see where the difference is. Commented Mar 13, 2018 at 20:52
  • 1
    @CharlesC seems to be right: NULL seems to be the issue. Commented Mar 13, 2018 at 20:53

1 Answer 1

4

To answer the question

Can Spark SQL not count correctly or can I not write SQL correctly?

from the title: I can't write SQL correctly.

Rule <insert number> of writing SQL: Think about NULL and UNDEFINED.

%sql
SELECT count(*)
FROM (
  SELECT DISTINCT CallType
  FROM fireServiceCalls 
)

34

Also, i apparently can't read:

pault suggested in a comment

With only 30 something values, you could just sort and print all the distinct items to see where the difference is.

Well, I actually thought of that myself. (Minus the sorting.) Except, there wasn't any difference, there were always 34 call types in the output, whether I generated it with SQL or DataFrame queries. I simply didn't notice that one of them was ominously named null:

+--------------------------------------------+
|CallType                                    |
+--------------------------------------------+
|Elevator / Escalator Rescue                 |
|Marine Fire                                 |
|Aircraft Emergency                          |
|Confined Space / Structure Collapse         |
|Administrative                              |
|Alarms                                      |
|Odor (Strange / Unknown)                    |
|Lightning Strike (Investigation)            |
|null                                        |
|Citizen Assist / Service Call               |
|HazMat                                      |
|Watercraft in Distress                      |
|Explosion                                   |
|Oil Spill                                   |
|Vehicle Fire                                |
|Suspicious Package                          |
|Train / Rail Fire                           |
|Extrication / Entrapped (Machinery, Vehicle)|
|Other                                       |
|Transfer                                    |
|Outside Fire                                |
|Traffic Collision                           |
|Assist Police                               |
|Gas Leak (Natural and LP Gases)             |
|Water Rescue                                |
|Electrical Hazard                           |
|High Angle Rescue                           |
|Structure Fire                              |
|Industrial Accidents                        |
|Medical Incident                            |
|Mutual Aid / Assist Outside Agency          |
|Fuel Spill                                  |
|Smoke Investigation (Outside)               |
|Train / Rail Incident                       |
+--------------------------------------------+
Sign up to request clarification or add additional context in comments.

2 Comments

always good to start teaching people about dirty data from day 1. Interesting about the difference with DF though; probably need to do a filtering there first
To me, the DataFrame behavior is much more intuitive than SQL's. Sure, NULL is a rather special (non-)value, but why shouldn't it be considered another distinct value? But yeah, if you want to count without NULL, you could do so easily by injecting a .filter(...) that removes NULL entries.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.