NOTE: I am running this query on Azure Databricks in a serverless Notebook.
I have two tables with identical schemas: foo and bar. They have the same number of columns, with the same names, in the same order, and corresponding columns have the same data types.
"foo" has 758,104 records. "bar" has 213,094 records. When I run the following, I get 6,092 records returned.
WITH combined AS (
SELECT col1, col2, col3, ...
FROM catalog.schema.foo
UNION ALL
SELECT col1, col2, col3, ...
FROM catalog.schema.bar
)
SELECT *
FROM combined
WHERE DateCollected = '2025-10-27';
When I run the following, I get 971,198 records:
WITH combined AS (
SELECT pkid, col1, col2, col3, ...
FROM catalog.schema.foo
UNION ALL
SELECT pkid, col1, col2, col3, ...
FROM catalog.schema.bar
)
SELECT COUNT(*)
FROM combined
WHERE DateCollected = '2025-10-27';
According to Databricks, "UNION Returns the result of subquery1 plus the rows of subquery2. If ALL is specified duplicate rows are preserved." - see Set operators. I would expect the first query to return all records.
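To make that expectation concrete, here is a minimal sketch on toy data. It is not Databricks: it uses Python's built-in sqlite3 module, and the tiny foo and bar tables (with made-up rows and only a few of the columns) are hypothetical stand-ins. It only illustrates the documented UNION ALL semantics, where SELECT * and COUNT(*) over the same CTE agree and rows with NULL columns are kept.

```python
import sqlite3

# In-memory SQLite database as a stand-in; the real tables live in Databricks.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE foo (pkid INTEGER, DateCollected TEXT, col25 NUMERIC, col27 TEXT);
    CREATE TABLE bar (pkid INTEGER, DateCollected TEXT, col25 NUMERIC, col27 TEXT);
    -- Hypothetical rows; some have NULLs in the trailing columns.
    INSERT INTO foo VALUES (1, '2025-10-27', NULL, NULL), (2, '2025-10-27', 1.50, 'x');
    INSERT INTO bar VALUES (3, '2025-10-27', NULL, NULL);
""")

# Same CTE shape as in the question; only the outer projection changes.
query = """
    WITH combined AS (
        SELECT pkid, DateCollected, col25, col27 FROM foo
        UNION ALL
        SELECT pkid, DateCollected, col25, col27 FROM bar
    )
    SELECT {projection} FROM combined WHERE DateCollected = '2025-10-27'
"""

rows = conn.execute(query.format(projection="*")).fetchall()
count = conn.execute(query.format(projection="COUNT(*)")).fetchone()[0]

# Both numbers should equal the sum of the matching rows in foo and bar.
print(len(rows), count)
```

In this toy setup the number of rows fetched by SELECT * matches COUNT(*), which is the behavior I expected from the real tables as well.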
There are 36 columns in my data set. If I remove the last 12 columns from each of the SELECT statements in the UNION ALL, I see all 971,198 records when querying "combined" with a SELECT *.
These columns have the following data types and all contain NULLs:
col25 decimal(10,2)
col26 decimal(10,2)
col27 string
col28 date
col29 decimal(10,2)
col30 decimal(10,2)
col31 timestamp
col32 string
col33 string
col34 string
col35 string
col36 decimal(38,2)
The rows go missing whether I wrap the 12 columns in COALESCE to avoid NULLs or explicitly CAST them to the desired data type. It also happens whether the SELECT statement that queries the CTE uses SELECT * or names each column explicitly.
Why would the exclusion of columns change the number of records returned by the "combined" CTE with the UNION ALL?
SELECT * FROM foo is iffy because it can burn you in several interesting ways; one of them is running afoul of column-level access restrictions.