I'm working on a large-scale ETL pipeline processing ~500 GB daily across multiple data sources. We currently use Great Expectations for data quality validation, but we're hitting performance bottlenecks because every run validates the entire dataset.

Current Setup:

- Apache Spark 3.5 on Databricks
- Delta Lake for storage, with schema evolution enabled
- Great Expectations v0.18 running full-table validations
- ~30 data quality rules (nullability, uniqueness, referential integrity, statistical distributions)
The Problem: Full-dataset validation takes 45+ minutes, which breaks our SLA. I need to implement incremental validation that only checks new/modified records, but I'm struggling with:

- Referential integrity on incremental data: foreign keys in a batch may reference records outside that batch.
- Schema evolution: when new columns are added, should I re-validate historical data against the new rules?
- Statistical distribution checks (mean, stddev, quantiles) require full-dataset context. How can these be approximated incrementally?
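To make the last point concrete, here's the kind of thing I'm imagining: reduce each batch to a mergeable summary (count, mean, M2) so mean/stddev checks never rescan the full table. This is a plain-Python sketch using Welford's online update and Chan's merge formula; in Spark the same reduction would run as an aggregation.

```python
from dataclasses import dataclass
from math import sqrt

@dataclass
class RunningStats:
    n: int = 0         # record count
    mean: float = 0.0  # running mean
    m2: float = 0.0    # sum of squared deviations from the mean

    def merge(self, other: "RunningStats") -> "RunningStats":
        """Combine two summaries without revisiting the underlying rows (Chan's formula)."""
        n = self.n + other.n
        if n == 0:
            return RunningStats()
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return RunningStats(n, mean, m2)

    @property
    def stddev(self) -> float:
        """Population standard deviation of everything summarized so far."""
        return sqrt(self.m2 / self.n) if self.n > 1 else 0.0

def summarize(batch: list[float]) -> RunningStats:
    """Summarize one batch with Welford's single-pass update."""
    stats = RunningStats()
    for x in batch:
        stats.n += 1
        d = x - stats.mean
        stats.mean += d / stats.n
        stats.m2 += d * (x - stats.mean)
    return stats
```

Each run I'd persist yesterday's `RunningStats` and merge today's batch summary into it, then assert the merged mean/stddev against thresholds. Quantiles don't merge exactly this way, but mergeable sketches (t-digest and similar) give approximate quantiles with the same pattern.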
What I've Tried:
# NOTE: history() returns commit *metadata*, not rows -- read the changed
# rows via Delta Change Data Feed (requires delta.enableChangeDataFeed=true)
new_records = (spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", last_validated_version)
    .table("events")
    .where("_change_type IN ('insert', 'update_postimage')"))
validation_results = ge_context.run_checkpoint(
    checkpoint_name="incremental_check",
    batch_request=runtime_batch_request,  # RuntimeBatchRequest wrapping new_records
)
This works for row-level checks but fails for:
- Cross-batch uniqueness constraints
- Historical trend comparisons
- Aggregate validations
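The cross-batch uniqueness failure is the clearest one: the incremental batch has to be checked against every previously validated key, not just against itself. A toy sketch of the logic I think is needed (the function and its in-memory key set are illustrative; in Spark this would be a join against a persisted keys table):

```python
def check_uniqueness(batch_keys, seen_keys):
    """Find duplicates both within the batch and against historical keys.

    Returns (violations, updated_seen_keys). `seen_keys` stands in for a
    persisted store of all previously validated keys.
    """
    violations = []
    batch_seen = set()
    for k in batch_keys:
        # Duplicate if already validated in a past batch, or repeated in this one
        if k in seen_keys or k in batch_seen:
            violations.append(k)
        batch_seen.add(k)
    return violations, seen_keys | batch_seen
```

At 500 GB/day the "seen keys" side obviously can't be an in-memory set; I assume the real options are a Delta keys table joined against the batch, or a Bloom filter accepting some false-positive rate.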
Questions:
1. What's the industry best practice for incremental data quality validation in streaming/batch hybrid architectures?
2. Should I maintain a separate "validation state" table tracking metrics over time?
3. Are there alternatives to Great Expectations better suited for incremental validation at scale?
4. How do others handle the tradeoff between validation coverage and performance?
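For question 2, the version I have in mind is: persist one metrics row per validated run, then gate each new batch against the trailing window. A minimal in-memory sketch (the class, the 30-run window, and the 3-sigma threshold are all illustrative choices, not an established design; in production the history would live in a Delta table):

```python
import statistics

class ValidationState:
    """Tracks per-run metrics and flags values that break from recent history."""

    def __init__(self):
        # One dict per validated run; would be a persisted table in practice
        self.history = []

    def record(self, run_id, metrics):
        """Append this run's metrics (e.g. {"row_count": ..., "null_rate": ...})."""
        self.history.append({"run_id": run_id, **metrics})

    def anomalous(self, metric, value, window=30, sigmas=3.0):
        """Flag `value` if it falls outside mean +/- sigmas*stddev of recent runs."""
        recent = [r[metric] for r in self.history[-window:] if metric in r]
        if len(recent) < 2:
            return False  # not enough history to judge
        mu = statistics.mean(recent)
        sd = statistics.stdev(recent)
        return sd > 0 and abs(value - mu) > sigmas * sd
```

This would let the historical-trend and aggregate checks run on a tiny metrics table instead of the raw data, which is the tradeoff I'm hoping someone has validated in production.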
Any insights from production implementations would be greatly appreciated.