
I'm working on a large-scale ETL pipeline processing ~500GB daily across multiple data sources. We're currently using Great Expectations for data quality validation, but we're hitting performance bottlenecks because we validate the entire dataset on each run. Current Setup:

- Apache Spark 3.5 on Databricks
- Delta Lake for storage, with schema evolution enabled
- Great Expectations v0.18 running full-table validations
- ~30 data quality rules (nullability, uniqueness, referential integrity, statistical distributions)

The Problem: Full dataset validation takes 45+ minutes, which is unacceptable for our SLA. I need to implement incremental validation that only checks new/modified records, but I'm struggling with:

- Maintaining referential integrity checks when only validating incremental data (e.g., foreign keys might reference records outside the incremental batch)
- Handling schema evolution: when new columns are added, should I re-validate historical data against the new rules?
- Statistical distribution checks (mean, stddev, quantiles) require full-dataset context; how can these be approximated incrementally?
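On the statistical-distribution point, mean and variance turn out not to need full-table context: they can be kept as per-table `(count, mean, M2)` state and merged with each incremental batch using the parallel update formulas from Chan et al. A minimal plain-Python sketch of that merge (function names are illustrative, not from any library):

```python
def summarize(batch):
    """Compute a (count, mean, M2) summary for one batch of values.
    M2 is the sum of squared deviations from the batch mean."""
    n = len(batch)
    mean = sum(batch) / n
    m2 = sum((x - mean) ** 2 for x in batch)
    return n, mean, m2

def merge_stats(n_a, mean_a, m2_a, n_b, mean_b, m2_b):
    """Merge two summaries (Chan et al.'s parallel algorithm), so stored
    historical state can absorb a new batch without a full-table scan.
    Sample variance of the merged data is m2 / (n - 1)."""
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2

# Stored state for all previously validated data, plus today's increment:
historical = summarize([1.0, 2.0, 3.0, 4.0])
increment = summarize([5.0, 6.0])
n, mean, m2 = merge_stats(*historical, *increment)
# Matches summarize([1.0, 2.0, 3.0, 4.0, 5.0, 6.0]) exactly
```

In Spark the per-batch summaries would come from an aggregation over the incremental DataFrame, with only the tiny summary tuples persisted between runs.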

What I've Tried:

# history() returns commit *metadata*, not the written rows; reading the
# Change Data Feed (delta.enableChangeDataFeed=true) gets the actual records
new_records = (spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", last_validated_version + 1)
    .table(table_name)
    .where("_change_type IN ('insert', 'update_postimage')"))
validation_results = ge_context.run_checkpoint(
    checkpoint_name="incremental_check",
    # a RuntimeBatchRequest with runtime_parameters={"batch_data": new_records}
    validations=[{"batch_request": runtime_batch_request}],
)

This works for row-level checks but fails for:

- Cross-batch uniqueness constraints
- Historical trend comparisons
- Aggregate validations
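For cross-batch uniqueness, the pattern I'm considering is: check the batch against itself for intra-batch duplicates, then join the batch's keys against the existing table's key column (a left-semi join in Spark) instead of re-scanning everything. The logic, sketched with plain Python sets for clarity (the function name is illustrative):

```python
def uniqueness_violations(existing_keys, batch_keys):
    """Cross-batch uniqueness: a new key is a violation if it duplicates a
    key already in the table, or appears more than once within the batch.
    In Spark this maps to a left-semi join of the batch against the target
    table's key column, plus a groupBy/count > 1 within the batch."""
    seen_in_batch = set()
    violations = set()
    for k in batch_keys:
        if k in existing_keys or k in seen_in_batch:
            violations.add(k)
        seen_in_batch.add(k)
    return violations

# Existing table holds keys 1-3; the increment re-inserts 3 and repeats 5
print(uniqueness_violations({1, 2, 3}, [3, 4, 5, 5]))  # {3, 5}
```

The semi-join only touches the key column, so with Delta's data skipping and a Z-order on the key it stays far cheaper than a full-table uniqueness expectation.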

Questions:

1. What's the industry best practice for incremental data quality validation in hybrid streaming/batch architectures?
2. Should I maintain a separate "validation state" table tracking metrics over time?
3. Are there alternatives to Great Expectations better suited for incremental validation at scale?
4. How do others handle the tradeoff between validation coverage and performance?
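On the "validation state" idea, the shape I have in mind is an append-only Delta table with one row per (run, table, metric), so each batch's metrics are compared against stored history rather than recomputed over the full table. A minimal plain-Python sketch of the record shape and a drift check (names and the 3-sigma threshold are illustrative, not from any framework):

```python
from datetime import datetime, timezone

def metric_record(table, metric, value, run_id):
    """One row of a hypothetical append-only validation-state table."""
    return {
        "table": table,
        "metric": metric,
        "value": value,
        "run_id": run_id,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }

def drifted(history_values, new_value, n_sigmas=3.0):
    """Flag new_value if it falls outside n_sigmas of the stored history."""
    n = len(history_values)
    mean = sum(history_values) / n
    var = sum((v - mean) ** 2 for v in history_values) / n
    return abs(new_value - mean) > n_sigmas * var ** 0.5

history = [100.0, 102.0, 98.0, 101.0, 99.0]  # e.g. daily row counts
print(drifted(history, 100.5))  # False: within the normal range
print(drifted(history, 150.0))  # True: flag this run for investigation
```

In production the history would be a windowed read of the state table (last N runs), which keeps the comparison O(N) regardless of table size.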

Any insights from production implementations would be greatly appreciated.

  • Just wondering out loud: for the statistical checks, can you not store the latest context somewhere and compare against it? For example, for mean, standard deviation, and quantiles, you could store the existing mean and use it to calculate the new mean instead of doing a whole-table scan. Commented Nov 25 at 5:15
  • And as for referential integrity, I haven't heard of extensive checks like that being used much for data lakes. Individual row-level checks are what's commonly used. Commented Nov 25 at 5:21
  • @VindhyaG Yeah, storing the running stats is definitely on my radar. The challenge I'm hitting is that storing just the mean/stddev isn't quite enough for our use case - we also track percentiles (p50, p95, p99) which are harder to update incrementally without losing accuracy. I've looked into t-digest and similar algorithms but haven't found a good integration with Great Expectations yet. Have you used anything like that in practice? On the referential integrity point - that's fair, and honestly it's making me question whether we're over-engineering this. Commented Nov 25 at 6:07
  • No, I haven't used GE or done any statistical data quality checks. Commented Nov 25 at 10:13
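Following up on the percentile problem from this thread: short of wiring a t-digest into Great Expectations, a cruder option that's trivial to maintain incrementally is a fixed-size uniform reservoir sample per column; percentiles computed from the reservoir approximate the true ones, and the state is just an array stored alongside the other metrics. A sketch (plain Python; the class name and reservoir size are illustrative):

```python
import random

class Reservoir:
    """Fixed-size uniform sample of a stream (Vitter's Algorithm R).
    Percentiles are read off the stored sample, so no full-table scan."""
    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.seen = 0
        self.sample = []
        self.rng = random.Random(seed)

    def update(self, value):
        """Each element survives with probability capacity / seen."""
        self.seen += 1
        if len(self.sample) < self.capacity:
            self.sample.append(value)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.sample[j] = value

    def percentile(self, p):
        """Approximate p-th percentile (p in 0..100) from the sample."""
        s = sorted(self.sample)
        idx = min(len(s) - 1, round(p / 100 * (len(s) - 1)))
        return s[idx]

r = Reservoir(capacity=1000, seed=42)
for x in range(10_000):   # stand-in for historical data plus today's batch
    r.update(float(x))
# p50 and p95 land near 5000 and 9500 for this uniform 0..9999 stream
```

A t-digest gives much tighter tails (p99 especially) for the same state size, but the reservoir needs no dependencies and its sample merges naively across batches.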
