
NOTE: I am running this query on Azure Databricks in a serverless Notebook.

I have two tables with identical schema: foo and bar. They have the same number of columns, with the same names, in the same order, and the columns are the same data type.

"foo" has 758,104 records. "bar" has 213,094 records. When I run the following, I get 6,092 records returned.

WITH combined AS (
SELECT col1, col2, col3, ...
FROM catalog.schema.foo
UNION ALL
SELECT col1, col2, col3, ...
FROM catalog.schema.bar
)
SELECT * 
FROM combined
WHERE DateCollected = '2025-10-27';

When I run the following, I get 971,198 records:

WITH combined AS (
SELECT pkid, col1, col2, col3, ...
FROM catalog.schema.foo
UNION ALL
SELECT pkid, col1, col2, col3, ...
FROM catalog.schema.bar
)
SELECT COUNT(*)
FROM combined
WHERE DateCollected = '2025-10-27';

According to Databricks, "UNION Returns the result of subquery1 plus the rows of subquery2. If ALL is specified duplicate rows are preserved." - see Set operators. I would expect the first query to return all records.

There are 36 columns in my data set. If I remove the last 12 columns from each of the SELECT statements in the UNION, I see all 971,198 records when querying "combined" with a SELECT *.

These columns have the following data types and all contain NULLs:

col25 decimal(10,2)
col26 decimal(10,2)
col27 string
col28 date
col29 decimal(10,2)
col30 decimal(10,2)
col31 timestamp
col32 string
col33 string
col34 string
col35 string
col36 decimal(38,2)

The missing rows occur whether I wrap the 12 columns in a COALESCE to avoid NULLs or explicitly CAST them to the desired data type. They also occur whether the SELECT statement that queries the CTE uses SELECT * or names each column explicitly.

Why would excluding columns change the number of records returned by the "combined" CTE with the UNION ALL?

  • SELECT * FROM foo is iffy because it can burn you in several interesting ways; one of them is running afoul of column-level access restrictions. Commented Oct 28 at 11:49

1 Answer

The issue wasn't with the query. The issue was with how I interpreted the number of rows shown in the output pane. The pane showed 6,092 records because of the limit on notebook cell output - see Known limitations Databricks notebooks. If I download the results from the output pane that shows 6,092 rows, I get the complete result set of 971,198 records. Mystery solved. Hope this helps someone.
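
One way to check the true row count without relying on how many rows the output pane displays is to aggregate instead of returning the rows themselves. This is a minimal sketch reusing the table names and filter from the question; the source_table label and row_count alias are names added here for illustration only:

```sql
-- Count rows per source table so the result is a tiny result set
-- that the notebook output pane cannot truncate.
WITH combined AS (
  SELECT 'foo' AS source_table, DateCollected  -- source_table is an illustrative label
  FROM catalog.schema.foo
  UNION ALL
  SELECT 'bar' AS source_table, DateCollected
  FROM catalog.schema.bar
)
SELECT source_table, COUNT(*) AS row_count
FROM combined
WHERE DateCollected = '2025-10-27'
GROUP BY source_table;
```

If the per-table counts add up to the COUNT(*) from the second query in the question, the UNION ALL is preserving every row and any smaller number shown for a SELECT * is a display limit rather than missing data.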
