Union Two Datasets Causes Records to Unexpectedly Filter

Question

NOTE: I am running this query on Azure Databricks in a serverless Notebook.

I have two tables with identical schema: foo and bar. They have the same number of columns, with the same names, in the same order, and the columns are the same data type.

"foo" has 758,104 records. "bar" has 213,094 records. When I run the following, I get 6,092 records returned.

WITH combined AS (
SELECT col1, col2, col3, ...
FROM catalog.schema.foo
UNION ALL
SELECT col1, col2, col3, ...
FROM catalog.schema.bar
)
SELECT * 
FROM combined
WHERE DateCollected = '2025-10-27';

When I run the following, I get 971,198 records:

WITH combined AS (
SELECT pkid, col1, col2, col3, ...
FROM catalog.schema.foo
UNION ALL
SELECT pkid, col1, col2, col3, ...
FROM catalog.schema.bar
)
SELECT COUNT(*)
FROM combined
WHERE DateCollected = '2025-10-27';

According to Databricks, "UNION Returns the result of subquery1 plus the rows of subquery2. If ALL is specified duplicate rows are preserved." - see Set operators. I would expect the first query to return all records.

There are 36 columns in my data set. If I remove the last 12 columns from each of the SELECT statements in the UNION, I see all 971,198 records when querying "combined" with a SELECT *.

These columns have the following data types and all contain NULLs:

col25 decimal(10,2)
col26 decimal(10,2)
col27 string
col28 date
col29 decimal(10,2)
col30 decimal(10,2)
col31 timestamp
col32 string
col33 string
col34 string
col35 string
col36 decimal(38,2)

The rows go missing whether I wrap the 12 columns in COALESCE to avoid NULLs or explicitly CAST them to the desired data type. It also happens whether the SELECT statement calling the CTE uses SELECT * or names each column explicitly.

Why would excluding columns change the number of records returned by the "combined" CTE with the UNION ALL?

SELECT * FROM foo is iffy because it can burn you in several interesting ways; one of them is running afoul of column-level access restrictions.
– DarthGizka, Commented Oct 28 at 11:49

Adam · Accepted Answer · 2025-10-28 18:43:28Z

The issue wasn't with the query. The issue was with how I interpreted the number of rows in the output pane. The pane showed 6,092 records because of the limit on notebook cell output - see Known limitations Databricks notebooks. If I download the results from the output frame that shows 6,092 rows, I get the complete result set of 971,198 records. Mystery solved. Hope this helps someone.
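
One way to check the real row count without relying on what the output pane displays is to aggregate inside the query itself. This is a minimal sketch, not something I ran as part of the original troubleshooting. It assumes DateCollected is one of the columns in both tables (the queries in the question filter on it), and source_table is a hypothetical label column added only for illustration.

```sql
-- Count rows per source table. An aggregate returns only a couple of rows,
-- so the notebook's cell output limit cannot truncate the answer.
WITH combined AS (
  SELECT 'foo' AS source_table, DateCollected  -- source_table is a made-up label
  FROM catalog.schema.foo
  UNION ALL
  SELECT 'bar' AS source_table, DateCollected
  FROM catalog.schema.bar
)
SELECT source_table, COUNT(*) AS row_count
FROM combined
WHERE DateCollected = '2025-10-27'
GROUP BY source_table;
```

If the per-table counts add up to the COUNT(*) over the whole CTE, the UNION ALL is not dropping anything, and a smaller number in the output pane is most likely just display truncation.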
