
Simplify Streaming ETL with Databricks Lakeflow Pipelines

Databricks Lakeflow Pipelines

Batch ETL still holds its place for historical reporting, and many organisations rely on it to this day.

However, it falls short when you need the real-time insights that today's businesses demand.

That's where Lakeflow Pipelines (formerly Delta Live Tables) comes in: a governed, AI-ready framework for streaming and incremental ETL.

With Lakeflow Pipelines, you can design dependable data pipelines in SQL or Python without the burden of hand-coded orchestration, CDC management, or manual cluster tuning.

Data pipeline engineering is a complex task, but it doesn’t have to be. Lakeflow Declarative Pipelines simplify the process, enabling powerful, production-grade ETL with just a few lines of code.

Author

Nikhil Mohod

I'm a Data Engineer with 8 years of experience specialising in the Azure data ecosystem. I design and implement scalable data pipelines, lakes and ETL/ELT solutions using tools like ADF, Airflow, Databricks, Synapse and SQL Server. Focused on building high-quality, secure, and optimised cloud data architecture.

Connect with Nikhil Mohod



Streaming ETL: How Lakeflow Pipelines Make a Difference

Streaming ETL processes your data as soon as it arrives, instead of waiting around to batch everything together. It transforms data on the fly, almost instantly.

Streaming ETL is the right choice when you need answers straight away, with no more waiting hours for batch jobs to finish.


Lakeflow Pipelines preserves everything that was good about Delta Live Tables and adds more on top:


  • Built-in change data capture that handles SCD Type 1 and Type 2 automatically.

  • Three tiers to pick from (Core, Pro, or Advanced), depending on what you actually need and want to pay for.

  • Unity Catalog integration, so your governance and lineage tracking are already sorted.

  • Flexibility to trigger individual tables, instead of having to refresh everything at once.

How It Works: Technical Walkthrough


  1. Creating Streaming Tables

Python
import dlt

@dlt.table(
    name="ucsales.bronze.orders_tbl",
    comment="Read raw order data from cloud storage with Auto Loader"
)
def bronze_orders():
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .option("cloudFiles.schemaLocation", "/mnt/checkpoints/schema/orders")
        .option("rescuedDataColumn", "_rescued_data")
        .load("/mnt/raw/orders")
    )

SQL
CREATE OR REFRESH STREAMING TABLE ucsales.bronze.orders_tbl
COMMENT "Read raw order data from cloud storage"
AS SELECT *
FROM STREAM read_files(
  '/mnt/raw/orders',
  format => 'json',
  schemaLocation => '/mnt/checkpoints/schema/orders',
  rescuedDataColumn => '_rescued_data'
);

  2. Automated Data Quality and Error Handling in Lakeflow Pipelines


Lakeflow Pipelines bakes data quality checks right into your ETL instead of making them an afterthought. It does this with expectations: rules you write to check your data as it moves through the pipeline.


Here's how an expectation works:


  • You give it a name and write a condition that says what good data should look like.

  • You decide what happens when data doesn't pass: drop those rows, fail the whole pipeline, or just log the violation so someone can look at it later (see the warn-only sketch after this list).

  • Everything gets recorded in the pipeline event log, so you can see what happened.
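
For the "just log it" option, you can declare an expectation without an ON VIOLATION clause: failing rows are kept in the output, and the violation counts show up in the pipeline event log. A minimal sketch (the orders_flagged table name and the customer_id column are hypothetical, not part of the examples that follow):

SQL
CREATE OR REFRESH STREAMING TABLE ucsales.silver.orders_flagged (
  CONSTRAINT has_customer EXPECT (customer_id IS NOT NULL)
)
COMMENT "Rows are kept even when customer_id is missing; violations are only logged"
AS SELECT * FROM STREAM(ucsales.bronze.orders_tbl);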


Why are expectations useful?


  • Your data quality checks happen right inside the ETL, no separate process to manage.

  • You don't need extra tools or scripts to filter out bad data.

  • Everything gets logged automatically for when auditors come knocking.

  • Works the same way whether you're doing streaming or batch.


Actions:


  • Drop Rows That Fail: expect_or_drop


Remove any record that does not meet the expectation.

Python
import dlt

@dlt.table(
    name="ucsales.silver.orders_clean",
    comment="Cleaned orders with positive amounts"
)
@dlt.expect_or_drop("positive_amount", "total_amount > 0")
def silver_orders_clean():
    return dlt.read_stream("ucsales.bronze.orders_tbl")

SQL
CREATE OR REFRESH STREAMING TABLE ucsales.silver.orders_clean (
  CONSTRAINT positive_amount EXPECT (total_amount > 0) ON VIOLATION DROP ROW
)
COMMENT "Cleaned orders with positive amounts"
AS SELECT * FROM STREAM(ucsales.bronze.orders_tbl);

  • Fail the Pipeline on Violation: expect_or_fail


If any record fails the check, the pipeline run stops. Use this for critical business rules.


Python
import dlt

@dlt.table(
    name="ucsales.silver.high_value_orders",
    comment="Orders with total_amount >= 500"
)
@dlt.expect_or_fail("min_amount", "total_amount >= 500")
def high_value_orders():
    return dlt.read_stream("ucsales.bronze.orders_tbl")

SQL
CREATE OR REFRESH STREAMING TABLE ucsales.silver.high_value_orders (
  CONSTRAINT min_amount EXPECT (total_amount >= 500) ON VIOLATION FAIL UPDATE
)
COMMENT "Orders with total_amount >= 500"
AS SELECT * FROM STREAM(ucsales.bronze.orders_tbl);

  • Multiple Checks at Once: expect_all


Combine several expectations into one declaration.

Python
@dlt.table(name="ucsales.silver.products_clean",
comment="Cleaned products with valid price, category, and SKU")
@dlt.expect_all({
    "valid_price": "price > 0",
    "valid_category": "category IS NOT NULL",
    "valid_sku": "length(sku) = 8"
})
def silver_products_clean():
    return dlt.read_stream("ucsales.bronze.products_tbl")

SQL

CREATE OR REFRESH STREAMING TABLE ucsales.silver.products_clean (
  CONSTRAINT valid_price EXPECT (price > 0),
  CONSTRAINT valid_category EXPECT (category IS NOT NULL),
  CONSTRAINT valid_sku EXPECT (length(sku) = 8)
)
COMMENT "Cleaned products with valid price, category, and SKU"
AS SELECT *
FROM STREAM(ucsales.bronze.products_tbl);

  • Quarantining Bad Data (Custom Pattern)


Instead of dropping bad records, you can route them to a separate quarantine table for debugging.


Python
import dlt

@dlt.table(name="ucsales.quarantine.invalid_orders",
    comment="Orders with non-positive total_amount quarantined for review")
def invalid_orders():
    return (dlt.read_stream("ucsales.bronze.orders_tbl")
            .filter("total_amount <= 0"))

  3. Unity Catalog & Data Governance in Lakeflow Pipelines


Lakeflow Pipelines works well with Unity Catalog, which is Databricks' way of keeping everything organised and governed. When you build pipelines through Lakeflow, every table, materialised view and data flow gets automatically registered with full lineage tracking.


Advantages:


  • Control who sees what: Set permissions at different levels (catalog, schema, table) so you can decide exactly who gets to view or mess with your data.

  • Track where everything comes from: You can follow your data from where it started all the way to where it ends up, which makes troubleshooting way easier and keeps auditors happy.

  • Same rules everywhere: Your governance policies work the same whether you're running streaming, batch, or just doing some one-off analysis.


When you combine data quality checks (through expectations) with governance controls in Unity Catalog, Lakeflow Pipelines help you deliver auditable and secure data to whoever needs it downstream.
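
The access-control side is ordinary Unity Catalog SQL. As a rough sketch (the group and service principal names here are made up, and the privileges you actually grant will depend on your own setup), securing a downstream gold table could look like this:

SQL
-- Hypothetical principals: analysts get read access, the ETL service principal gets write access
GRANT USE SCHEMA ON SCHEMA ucsales.gold TO `data_analysts`;
GRANT SELECT ON TABLE ucsales.gold.sales_by_region TO `data_analysts`;
GRANT MODIFY ON TABLE ucsales.gold.sales_by_region TO `etl_service_principal`;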


  4. Per-Table Triggers


Per-table triggers in Lakeflow Pipelines let you set independent refresh intervals for specific tables or materialised views inside the same pipeline.


Python
import dlt

@dlt.table(
    name="daily_summary",
    comment="Daily aggregated order totals",
    spark_conf={"pipelines.trigger.interval": "1 hour"}
)
def daily_summary():
    return (
        spark.read.table("ucsales.silver.orders_clean")
        .groupBy("order_date")
        .sum("total_amount")
    )


SQL
SET pipelines.trigger.interval = '1 hour';

CREATE OR REFRESH MATERIALIZED VIEW daily_summary
COMMENT "Daily aggregated order totals"
AS
SELECT order_date,
       SUM(total_amount) AS total
FROM ucsales.silver.orders_clean
GROUP BY order_date;

  5. Change Data Capture with AUTO CDC


Python
import dlt
from pyspark.sql.functions import col

# SCD Type 1: keep only the current version of each customer
dlt.create_streaming_table(
    "ucsales.silver.customers",
    comment="Current snapshot of customers (SCD Type 1 updates applied)"
)
dlt.create_auto_cdc_flow(
    target="ucsales.silver.customers",
    source="ucsales.cdc_raw.customers",
    keys=["customer_id"],
    sequence_by=col("event_time"),
    stored_as_scd_type="1"
)

# SCD Type 2: keep the full, versioned history of every change
dlt.create_streaming_table(
    "ucsales.silver.customers_history",
    comment="Full customer history with SCD Type 2 tracking (versioned records)"
)
dlt.create_auto_cdc_flow(
    target="ucsales.silver.customers_history",
    source="ucsales.cdc_raw.customers",
    keys=["customer_id"],
    sequence_by=col("event_time"),
    stored_as_scd_type="2"
)

SQL
CREATE OR REFRESH STREAMING TABLE ucsales.silver.customers
COMMENT "Current snapshot of customers (SCD Type 1 updates applied)";
CREATE FLOW customer_cdc_flow 
COMMENT "CDC flow applying SCD Type 1 logic into customers table"
AS
AUTO CDC INTO ucsales.silver.customers
FROM STREAM ucsales.cdc_raw.customers
KEYS (customer_id)
SEQUENCE BY event_time
STORED AS SCD TYPE 1;
CREATE OR REFRESH STREAMING TABLE ucsales.silver.customers_history
COMMENT "Full customer history with SCD Type 2 tracking (versioned records)";

CREATE FLOW customer_history_flow 
COMMENT "CDC flow applying SCD Type 2 logic into customers_history table"
AS
AUTO CDC INTO ucsales.silver.customers_history
FROM STREAM ucsales.cdc_raw.customers
KEYS (customer_id)
SEQUENCE BY event_time
STORED AS SCD TYPE 2;

  6. Materialised Views and Aggregations


Python
@dlt.table(name="ucsales.gold.sales_by_region",
comment="Aggregated total sales by region from cleaned orders")
def sales_by_region():
    return dlt.read_stream("ucsales.silver.orders_clean") \
        .groupBy("region") \
        .sum("total_amount")

SQL
CREATE OR REFRESH MATERIALIZED VIEW ucsales.gold.sales_by_region
COMMENT "Aggregated total sales by region from cleaned orders"
AS SELECT region, SUM(total_amount) AS total
FROM ucsales.silver.orders_clean
GROUP BY region;

Materialised views are useful because they store precomputed query results, which makes repeated queries much faster. They refresh automatically on a schedule or when the underlying data changes, so results stay up to date without manual recomputation.


This reduces compute cost and latency for reports and downstream analytics.


  7. Backfill with CREATE FLOW


In Lakeflow Pipelines, backfill with CREATE FLOW is a way to process historical data on-demand, outside of the pipeline’s normal streaming or scheduled refresh.


SQL
CREATE FLOW backfill_sales
COMMENT "One-time backfill of sales_by_region from historical cleaned orders"
AS
INSERT INTO ONCE ucsales.gold.sales_by_region
SELECT region, SUM(total_amount) AS total
FROM ucsales.silver.orders_clean
GROUP BY region;

  8. Pipeline Monitoring and Observability


Lakeflow Pipelines give you visibility into what's happening with your data, both through the UI and when you need to dig deeper:


Watching Your Pipelines Through the UI


  • The Lakeflow interface lets you track what's going on with your pipelines, like progress updates, the DAG layout, plus details like row counts, runtime and whether your quality checks worked out.

  • Want to know when something breaks? Just set up email alerts in the pipeline settings and you'll get pinged when runs succeed, fail, or something weird happens.


Event Logs When You Need to Troubleshoot


  • Everything that happens (quality issues, lineage changes, whether a run succeeded or crashed) gets written to event log tables in Delta format. You can query them yourself or feed them into whatever monitoring tools you're already using.


Here's an example of how you'd query them:


SELECT timestamp, event_type, message
FROM system.pipelines.event_log
WHERE pipeline_id = 'my_pipeline'
ORDER BY timestamp DESC;

Watching Streaming Performance


  • For streaming tables, you can see important information like how much backlog you've got, record counts and processing delays right in the UI (works with Kafka, Auto Loader, that kind of thing).

  • If you need to dive deeper into performance, just write some SQL to grab whatever metrics matter to you:

SELECT
  (unix_timestamp(MAX(timestamp)) - unix_timestamp(MIN(timestamp))) / COUNT(*) AS avg_batch_duration_seconds,
  SUM(CAST(details:flow_progress.metrics.num_output_rows AS BIGINT)) AS total_rows
FROM system.pipelines.event_log
WHERE event_type = 'flow_progress';

Connecting to Other Tools


  • Send your event logs and metrics to tools like Datadog, Splunk, or Prometheus, especially if your team already uses them for monitoring and alerts.

  • Add in cluster metrics (CPU, memory, network) plus Spark UI information, and you can see what's happening across everything.


When You Need to Figure Out What Went Wrong


  • The UI gives you a clickable view where you can explore table lineage, see which schemas changed and track how your incremental updates are behaving.

  • The system tables let you follow your data's path and catch bottlenecks or schema drift before they turn into real headaches, as in the lineage query sketched below.
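
If Unity Catalog system tables are enabled in your workspace, you can also query lineage directly. A sketch, assuming the system.access.table_lineage system table is available to you and reusing the gold table from the earlier examples:

SQL
SELECT source_table_full_name,
       target_table_full_name,
       event_time
FROM system.access.table_lineage
WHERE target_table_full_name = 'ucsales.gold.sales_by_region'
ORDER BY event_time DESC;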


  9. Integrated Pipeline Development


Lakeflow Pipelines gives you a much cleaner interface for building and managing your ETL pipelines, instead of the old Delta Live Tables setup that had you jumping around between different screens.


What you can actually do:


  • Work from one screen: Write your SQL or Python code, see how your pipeline connects together and check what your data looks like without opening a million different tabs.

  • See changes happen live: Your DAG updates as you type; dependencies, table types (streaming tables, materialised views), all of it refreshes in real time.

  • Quick data checks: Want to see if your output looks right? Just preview the table data right there instead of going somewhere else to check.

  • Keep your pipelines organised: All your pipeline assets, like scripts and datasets, are laid out in folders so you can actually find what you need.


Why this actually helps:


Building pipelines gets a lot more visual when you're not constantly switching between tools and you end up iterating way faster. You can take an idea and get it running in production without losing track of governance or monitoring. And since it works well with Lakeflow's expectations and event logging, you get control and visibility all in the same place instead of hunting around for information.


Case Study


Real-Time Retail Analytics with Lakeflow Pipelines


Business Challenge


M&S, the retail giant with hundreds of stores plus a growing online business, was stuck waiting 12-24 hours for sales reports because everything ran on overnight batch jobs.


Batch processing meant:


  • Marketing couldn't launch same-day promotions when they spotted trends.

  • Operations had no idea what inventory levels looked like in real time.

  • Finance was always working with yesterday's revenue numbers.

  • Data governance was all over the place and tracking where data came from was mostly manual work.


How They Fixed it with Lakeflow Pipelines


They moved everything over to Databricks Lakeflow Pipelines to get a unified streaming ETL setup that actually worked:


  • Bronze Layer: POS and online sales data flows in continuously from cloud storage and Kafka, so there's no waiting around for data to show up.

  • Silver Layer: Built-in data quality checks through expectations automatically kick out bad or duplicate records, while flagging ambiguous transactions for someone to review.

  • Gold Layer: Set up materialised views for hourly and daily sales numbers, plus added historical customer data that updates automatically through AUTO CDC (keeps the full history).

  • Governance: Everything gets registered in Unity Catalog, so permissions are centralised, you can trace where data came from and audits are no big deal.

  • Monitoring: Used Lakeflow's event logs and dashboards to keep an eye on pipeline health, how fast things are running and when schemas change.

  • Development: The new Lakeflow UI allowed their engineers to write, test, and deploy transformations all from one place instead of jumping between tools.


Results


  • Cut reporting time from 24 hours down to under 5 minutes.

  • Marketing can now launch same-day promotions based on what's happening with sales.

  • Way more trust in the data with 99% fewer manual reconciliations needed.

  • Complete audit trails through Unity Catalog lineage and event logging.



Conclusion


Databricks Lakeflow Pipelines (formerly Delta Live Tables) simplify streaming and incremental ETL with a declarative, AI-ready framework. Instead of complex orchestration and manual tuning, teams can build reliable data pipelines in SQL or Python with built-in data quality checks, Unity Catalog governance, per-table triggers and automated CDC.


From real-time insights to faster reporting, Lakeflow Pipelines help enterprises replace slow batch jobs with governed, production-grade streaming pipelines.


Ready to modernise your ETL with real-time streaming?


Get in touch with Cloudaeon to accelerate your Databricks Lakeflow journey today.

Don't forget to download this article or share it with your colleagues to help your organisation navigate these trends.
