
Rule-Driven Data Quality Framework for a Multi-Source Azure Lakehouse

Challenges

Fragmented validation and reliance on legacy Attacama led to inconsistent, unreliable data across lakehouse layers, directly impacting reporting accuracy and timelines. The absence of pipeline-integrated quality checks meant issues were detected too late, forcing heavy manual reconciliation and reducing trust in reporting outputs.

Outcome

The implementation of a rule-driven framework using Soda improved data accuracy from ~60% to ~85% and identified ~150 data quality failures within the first 3 months. Early detection saved ~1,700 hours of manual effort.

Solution

Lakehouse Build & Modernisation


M&S's multi-source reporting platform, built on an Azure lakehouse architecture, faced severe data quality issues caused by inconsistent validation across data layers.


Cloudaeon's data quality experts implemented a rule-driven data quality framework using Soda, replacing the legacy Attacama tooling. The new framework applied validation checkpoints across the bronze, silver, gold and platinum layers, improving reporting data accuracy, reducing incidents and significantly lowering manual reconciliation effort.

Client Problem 

M&S operated a reporting platform built on Azure, where data from multiple source systems was ingested into a lakehouse environment and transformed across all layers before being used for reporting purposes.  

Pain Points

Data ingestion and processing worked as expected; the real challenge was trust. Data quality validations were not applied consistently across all layers, which resulted in:

  • Severe reporting issues

  • Duplicate or inconsistent records

  • Missing data files

  • Heavy manual effort by reporting teams

  • Data incidents impacting reporting timelines

The challenge was not data ingestion or processing. The problem was ensuring data remained reliable as it moved across lakehouse layers. 


Root Cause Analysis 

The key issues identified were: 

  1. Multi-source data inconsistency: Data from different source systems arrived with different schemas, formats and patterns, leading to inconsistencies during transformation.

  2. Lack of validation checkpoints between lakehouse layers: Data moved from bronze to silver to gold without structured quality gates.

  3. Reactive data quality approach: Issues were often discovered only after data reached the reporting layers.

  4. Manual reconciliation effort: Reporting teams spent significant time validating and fixing data issues.

  5. Tool-based rather than pipeline-integrated quality checks: Data quality checks existed, but they were not fully integrated into the lakehouse architecture.

Initially, M&S maintained data quality with Attacama, a legacy data quality tool. It could not meet the platform's data quality requirements because it executed outside the lakehouse transformation flow and was not integrated into layer-wise data processing. This produced a reactive approach in which issues were identified only after data had already reached the reporting datasets.

Solution Architecture 

Cloudaeon's solution introduced a rule-driven data quality framework integrated into the Azure lakehouse architecture. Each transition between layers became a data quality validation checkpoint.
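The checkpoint-per-transition idea can be sketched in plain Python. This is a minimal illustration only, not the Soda implementation; all names here (`run_checkpoint`, `no_empty_ids`) are hypothetical:

```python
from typing import Callable

# A rule takes a batch of records and returns failure messages (empty = pass).
Rule = Callable[[list], list]

def run_checkpoint(transition: str, records: list, rules: list) -> list:
    """Run every rule against the batch; collect failures for this transition."""
    failures = []
    for rule in rules:
        failures.extend(f"{transition}: {msg}" for msg in rule(records))
    return failures

def no_empty_ids(records):
    return [f"row {i} missing id" for i, r in enumerate(records) if not r.get("id")]

batch = [{"id": "A1"}, {"id": ""}]
failures = run_checkpoint("bronze->silver", batch, [no_empty_ids])
# promotion to the next layer proceeds only when `failures` is empty
```

In the real framework the rules are declarative Soda checks rather than hand-written functions, but the gating behaviour at each layer transition is the same.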

Validation by Layer:

Ingestion Layer 

  • Schema validation 

  • Mandatory field checks 

  • Record integrity checks 
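As a hedged illustration of the ingestion-layer checks above, a record-level validator might look like this; the schema and field names are assumptions for illustration, not the platform's actual contract:

```python
# Illustrative ingestion checks: schema (type) validation plus mandatory fields.
EXPECTED_SCHEMA = {"transaction_id": str, "site_id": str, "amount": float}
MANDATORY = ["transaction_id", "site_id"]

def check_record(record: dict) -> list:
    """Return a list of issues; an empty list means the record passes."""
    issues = []
    # schema validation: wrongly typed columns
    for col, typ in EXPECTED_SCHEMA.items():
        if col in record and not isinstance(record[col], typ):
            issues.append(f"{col}: expected {typ.__name__}")
    # mandatory field checks: must be present and non-empty
    for col in MANDATORY:
        if not record.get(col):
            issues.append(f"{col}: missing")
    return issues
```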


Transformation Layers (Bronze → Silver → Gold) 

  • Data standardisation checks 

  • Referential integrity checks 

  • Business rule validations 


Refined Layer (Gold) 

  • Reporting readiness checks 

  • Completeness checks 

  • Cross-dataset consistency checks 
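A cross-dataset consistency check at the gold layer can be sketched as follows; the dataset shapes (sales and stock keyed by `site_id`) are illustrative assumptions:

```python
# Illustrative gold-layer reconciliation: sites reporting sales must also
# have a stock snapshot before the data is considered reporting-ready.
def cross_dataset_consistency(sales: list, stock: list) -> list:
    """Report sites that appear in sales but have no stock snapshot."""
    sales_sites = {r["site_id"] for r in sales}
    stock_sites = {r["site_id"] for r in stock}
    return [f"site {s} has sales but no stock snapshot"
            for s in sorted(sales_sites - stock_sites)]

sales = [{"site_id": "S1"}, {"site_id": "S2"}]
stock = [{"site_id": "S1"}]
issues = cross_dataset_consistency(sales, stock)
```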


Business Rule Validation Framework 


To implement a robust data quality framework, Cloudaeon's experts automated business-rule validations across revenue, transaction, pricing, stock and operational datasets, ensuring reporting accuracy and data reliability.


Consistency checks 

  • Revenue consistency: To flag abnormal daily revenue variance compared to historical trends. 

  • Transaction volume consistency: To detect unexpected drops or spikes in partner transaction volumes. 
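One simple way to express a variance rule like the revenue check is a z-score against the historical window. This is a sketch of the idea, not the rule actually deployed:

```python
from statistics import mean, pstdev

def revenue_variance_flag(history: list, today: float, z_limit: float = 3.0) -> bool:
    """Flag today's revenue when it sits more than z_limit standard
    deviations from the historical mean (a simple z-score rule)."""
    mu, sigma = mean(history), pstdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) / sigma > z_limit

history = [100.0, 102.0, 98.0, 101.0, 99.0]
```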


Uniqueness checks 

  • Transaction record uniqueness: Used to prevent duplicate transaction records. 

  • Selling price uniqueness: To make sure there is only one active price per product and site. 
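Both uniqueness rules reduce to counting keys; a minimal sketch, assuming hypothetical column names `transaction_id`, `product_id`, `site_id` and `active`:

```python
from collections import Counter

def duplicate_transaction_ids(records: list) -> list:
    """Transaction IDs that occur more than once."""
    counts = Counter(r["transaction_id"] for r in records)
    return sorted(tid for tid, n in counts.items() if n > 1)

def conflicting_active_prices(prices: list) -> list:
    """(product, site) pairs with more than one active price."""
    counts = Counter((p["product_id"], p["site_id"]) for p in prices if p["active"])
    return sorted(key for key, n in counts.items() if n > 1)
```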


Timeliness checks 

  • Retailer file timeliness: To show alerts when expected retailer files are not received or processed. 
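The timeliness rule amounts to comparing arrival times against an agreed cutoff; a hedged sketch with made-up file names:

```python
from datetime import datetime

def late_or_missing_files(expected: set, received: dict, cutoff: datetime) -> set:
    """Expected files that never arrived or arrived after the cutoff."""
    return {f for f in expected if f not in received or received[f] > cutoff}

cutoff = datetime(2024, 1, 1, 6, 0)
received = {"retailer_a.csv": datetime(2024, 1, 1, 5, 30)}
alerts = late_or_missing_files({"retailer_a.csv", "retailer_b.csv"}, received, cutoff)
```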


Validity checks 

  • Email format validation: To ensure email addresses meet defined format rules. 

  • Stock quantity validation: To flag abnormal stock reductions beyond expected sales movement. 
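Both validity rules can be sketched directly; the email regex and the tolerance parameter below are illustrative assumptions, not the production rules:

```python
import re

# A deliberately simple format rule; real-world email validation is looser.
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.-]+$")

def invalid_emails(emails: list) -> list:
    """Addresses that fail the defined format rule."""
    return [e for e in emails if not EMAIL_RE.fullmatch(e)]

def abnormal_stock_drop(opening: int, closing: int, units_sold: int,
                        tolerance: int = 0) -> bool:
    """Flag when stock fell by more than recorded sales plus a tolerance."""
    return (opening - closing) > units_sold + tolerance
```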


Accuracy checks 

  • In-transit quantity accuracy: To validate in-transit stock against transfer order and goods receipt data. 
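The accuracy rule is a three-way reconciliation: reported in-transit stock should equal dispatched minus received quantities. A minimal sketch, with hypothetical parameter names:

```python
def in_transit_mismatch(transfer_qty: int, received_qty: int,
                        reported_in_transit: int) -> bool:
    """In-transit stock should equal the quantity dispatched on transfer
    orders minus the quantity booked in goods receipts; anything else
    indicates a reconciliation gap."""
    return reported_in_transit != transfer_qty - received_qty
```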


These business rule checks ensured data quality issues were identified early in the pipeline rather than later at the reporting stage. 

Attacama (legacy) vs Soda

  • Tool-based validation → Pipeline-integrated validation

  • Reactive issue detection → Proactive validation

  • Checks mostly at reporting layer → Checks at each lakehouse layer

  • Manual reconciliation → Automated rule enforcement

  • Low trust in reporting data → Trusted reporting datasets

How We Delivered 

Cloudaeon’s Databricks and data quality experts delivered the entire solution in structured phases: 

  1. Mapped end-to-end reporting data flow from source systems to reporting. 

  2. Identified data quality failure points across lakehouse layers. 

  3. Defined validation rule categories: integrity, business, timeliness, accuracy, consistency, uniqueness and completeness.

  4. Implemented Soda checks across ingestion and transformation pipelines. 

  5. Integrated data quality checks with governance and lineage. 

  6. Established monitoring and incident tracking. 

  7. Transitioned the solution into POD ownership and managed operations. 

Technology Stack 

  • Microsoft Azure (Cloud Platform)  

  • Azure Lakehouse Architecture  

  • Bronze / Silver / Gold / Platinum Data Layers  

  • Soda 

  • Attacama (legacy data quality tool)  

  • Microsoft Purview (Governance, Lineage, Monitoring)

Outcomes 

Cloudaeon's implementation of a rule-driven data quality framework delivered measurable improvements in data accuracy and incident reduction, resulting in: 


  • ~150 data quality failures identified within the first 3 months 

  • 5 Sev2 incidents prevented, saving ~40 hours of effort

  • 130 Sev3 incidents detected early, saving ~1,500 hours

  • 15 Sev4 incidents identified, saving ~230 hours

  • Data accuracy improved from ~60% to ~85% 

  • Significant reduction in manual reconciliation and reporting issues 

Cloudaeon's data quality framework automates validation and rule checks within the data platform itself, improving reporting trust and platform reliability.

 

POD & Managed Operations Transition 

Cloudaeon's engagement with M&S followed a structured transition model:

Solution → POD → Managed Operations 

  • The solution phase implemented the data quality framework. 

  • The POD team took ownership of rule updates, pipeline tuning and new data source onboarding. 

Managed Operations provided SLA-based monitoring, incident response and continuous improvement. 

Conclusion 

Cloudaeon's implementation of a rule-driven data quality framework and the transition from Attacama to Soda transformed M&S's reporting environment from a reactive setup into a controlled and reliable system. By introducing validation checkpoints across the lakehouse layers and automating business-rule checks, data accuracy improved significantly. The implementation also established a scalable data quality operating model to support future data growth.

We're ready to help you!

Take the first step with a structured, engineering-led approach.
