In a Python notebook on Databricks "Community Edition", I'm experimenting with the City of San Francisco open data about emergency calls to 911 requesting firefighters. (The old 2016 copy of the data used in "Using Apache Spark 2.0 to Analyze the City of San Francisco's Open Data" (YouTube) and made available on S3 for that tutorial.)
After mounting the data and reading it into a DataFrame fire_service_calls_df with the explicitly defined schema, I registered that DataFrame as a SQL table:
sqlContext.registerDataFrameAsTable(fire_service_calls_df, "fireServiceCalls")
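For reference, the read step looks roughly like this (the mount path is a placeholder and every schema field except CallType is elided; the tutorial notebook defines the full schema):

from pyspark.sql.types import StructType, StructField, StringType

# Placeholder schema -- the real notebook lists every column explicitly;
# only CallType matters for this question.
fire_schema = StructType([
    StructField('CallType', StringType(), True),
    # ... remaining fields from the explicitly defined schema ...
])

fire_service_calls_df = (
    spark.read
         .option('header', True)
         .schema(fire_schema)
         .csv('/mnt/sf-fire-calls/fire_service_calls.csv')  # placeholder mount path
)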
With that and the DataFrame API, I can count the call types that occurred:
fire_service_calls_df.select('CallType').distinct().count()
Out[n]: 34
... or with SQL in Python:
spark.sql("""
SELECT count(DISTINCT CallType)
FROM fireServiceCalls
""").show()
+------------------------+
|count(DISTINCT CallType)|
+------------------------+
|                      33|
+------------------------+
... or with an SQL cell:
%sql
SELECT count(DISTINCT CallType)
FROM fireServiceCalls
Why do I get two different count results? (It seems like 34 is the correct one, even though the talk in the video and the accompanying tutorial notebook mention "35".)

CallType is a String; can you check whether SQL is (or is not) making the char vs varchar distinction? What happens when you count distinct trimmed values? NULL seems to be the issue.
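A quick way to check both ideas; the toy table name and its three rows are made up just for this test, while the last query runs the trimmed-distinct count against the real table:

from pyspark.sql.types import StructType, StructField, StringType

toy_schema = StructType([StructField('CallType', StringType(), True)])
toy_df = spark.createDataFrame([('Fire',), ('Medical',), (None,)], toy_schema)
toy_df.createOrReplaceTempView('toy')

toy_df.select('CallType').distinct().count()
# -> 3: DataFrame distinct() keeps the NULL row as its own value

spark.sql("SELECT count(DISTINCT CallType) FROM toy").show()
# -> 2: SQL's COUNT(DISTINCT ...) skips NULLs

spark.sql("SELECT count(DISTINCT trim(CallType)) FROM fireServiceCalls").show()
# compares against the untrimmed SQL count to rule out stray whitespace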