1

I load parquet file into sql context like this:

sqlCtx = SQLContext(sc)
rdd_file = sqlCtx.read.parquet("hdfs:///my_file.parquet")
rdd_file.registerTempTable("type_table")

Then I run this simple query:

sqlCtx.sql('SELECT count(name), name from type_table group by name order by count(name)').show()

The result:

+----------------+----------+
|count(name)     |name      |
+----------------+----------+
|               0|      null|
|          226307|         x|
+----------------+----------+

However, if I use groupBy from rdd set. I got a different result:

sqlCtx.sql("SELECT name FROM type_table").groupBy("name").count().show()

+----------+------+
| name     | count|
+----------+------+
|         x|226307|
|      null|586822|
+----------+------+

The count of x is the same for the two methods but null is quite different. It seems like the sql statement doesn't count null with group by correctly. Can you point out what I did wrong?

Thanks,

1
  • Can you print out your DataFrame schema ? Commented Sep 30, 2016 at 6:28

1 Answer 1

3

count(name) will exclude null values , if you give count(*) it will give you the null values as well .

Try below.

sqlCtx.sql('SELECT count(*), name from type_table group by name order by count(*)').show()
Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.