
Simplify Streaming ETL with Databricks Delta Live Tables

By Nikhil Mohod
Summary


Streaming ETL systems promise real-time insights, but in practice, most pipelines degrade under operational complexity, schema drift, and data quality failures.

Databricks Delta Live Tables (DLT) introduces a declarative pipeline model that shifts responsibility for orchestration, reliability, and lineage management into the platform layer, reducing operational entropy in large-scale streaming pipelines.

Failure Modes

Most streaming ETL architectures fail not because of compute limitations but due to operational fragility in pipeline design.


Stateful Pipeline Fragility

Streaming pipelines accumulate hidden state across micro-batches. When checkpoint state is corrupted or jobs restart unexpectedly, pipelines must replay event streams, introducing delays and inconsistent downstream tables.


Pipeline Orchestration Drift

Traditional Spark streaming implementations rely on external schedulers and loosely coupled jobs. Over time, this leads to:

  • Dependency ordering issues

  • Retry logic fragmentation

  • Partial pipeline refreshes

DLT avoids this by compiling dataset declarations into a managed pipeline graph.
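To make the idea concrete (this is an illustration, not DLT's actual implementation): declared datasets and their upstream dependencies form a DAG, and the platform derives a valid execution order from it. A minimal pure-Python sketch using the standard library:

```python
from graphlib import TopologicalSorter

# Hypothetical dataset declarations: dataset name -> upstream dependencies.
declarations = {
    "bronze_orders": set(),
    "silver_orders": {"bronze_orders"},
    "gold_sales_summary": {"silver_orders"},
}

def compile_pipeline(decls):
    """Resolve declared dependencies into a valid execution order."""
    return list(TopologicalSorter(decls).static_order())

order = compile_pipeline(declarations)
print(order)  # bronze before silver before gold
```

Because ordering, retries, and refresh scope are derived from one declared graph rather than scattered across scheduler jobs, dependency drift has nowhere to accumulate.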


Silent Data Quality Degradation

Without embedded validation, corrupt records propagate silently through downstream datasets. DLT embeds data quality expectations directly inside table definitions, allowing validation logic to live within the pipeline DAG.


Engineering Deep Dive

Delta Live Tables changes the typical Spark pipeline pattern by introducing declarative dataset definitions.

Instead of orchestrating jobs manually, engineers define datasets and dependencies and allow the platform to compile the execution graph.


Pipeline Execution Modes 

DLT supports two pipeline execution modes: triggered and continuous.


Triggered Pipelines

Triggered pipelines execute once and stop after refreshing datasets.

Used for:

  • Scheduled refresh pipelines

  • Incremental batch workloads
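In the pipeline settings JSON, the execution model is controlled by the continuous flag (field names per the Databricks pipeline settings; the name and notebook path below are illustrative):

```json
{
  "name": "orders_pipeline",
  "continuous": false,
  "libraries": [{"notebook": {"path": "/pipelines/orders"}}]
}
```

Setting "continuous": true switches the same pipeline definition to continuous execution without changing any dataset code.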


Continuous Pipelines

Continuous pipelines process data as it arrives.

This mode minimises latency but requires an always-running cluster to maintain streaming state. 

Engineers can tune pipeline frequency using the pipelines.trigger.interval configuration.


Example (Python):

import dlt

@dlt.table(
  name="orders_stream",
  spark_conf={"pipelines.trigger.interval": "10 seconds"}
)
def orders_stream():
    return spark.readStream.format("delta").table("raw_orders")

SQL equivalent:

SET pipelines.trigger.interval = "10 seconds";

This configuration is typically applied at the table level, allowing fine-grained control over streaming update cadence.


Transforming Data with Streaming Tables and Materialised Views

DLT pipelines commonly mix streaming ingestion with batch-style transformations.

Example pipeline pattern:

Bronze (Streaming Table) → Silver (Cleansed Table) → Gold (Aggregated View)

Example implementation:

import dlt
from pyspark.sql.functions import col

# Bronze: raw streaming ingestion via Auto Loader.
@dlt.table
def bronze_orders():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/data/orders")
    )

# Silver: cleansed stream, dropping rows without an order_id.
@dlt.table
def silver_orders():
    return (
        dlt.read_stream("bronze_orders")
        .filter(col("order_id").isNotNull())
    )

# Gold: aggregated view over the cleansed data.
@dlt.view
def gold_sales_summary():
    return (
        dlt.read("silver_orders")
        .groupBy("region")
        .sum("order_amount")
    )


By combining streaming tables and materialised views in a single pipeline, engineers avoid costly reprocessing while enabling incremental aggregation layers.
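The value of incremental aggregation can be pictured outside Spark: rather than recomputing the gold layer from scratch, each micro-batch folds into running state. A toy pure-Python sketch (no DLT APIs involved):

```python
from collections import defaultdict

# Running aggregation state: region -> total order amount.
gold_sales_summary = defaultdict(float)

def apply_micro_batch(state, batch):
    """Fold one micro-batch of (region, amount) rows into the running totals."""
    for region, amount in batch:
        state[region] += amount
    return state

apply_micro_batch(gold_sales_summary, [("EU", 120.0), ("US", 80.0)])
apply_micro_batch(gold_sales_summary, [("EU", 30.0)])
print(dict(gold_sales_summary))  # {'EU': 150.0, 'US': 80.0}
```

Each batch touches only its own rows; the full history is never rescanned, which is the property that keeps gold-layer refreshes cheap as data volumes grow.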


Data Ingestion with Auto Loader

For ingestion from cloud object storage, DLT pipelines typically use Auto Loader.

Auto Loader incrementally detects new files and guarantees idempotent ingestion of streaming file sources.

Example ingestion from cloud storage:

import dlt

@dlt.table
def bronze_csv_orders():
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "csv")
        .option("header", "true")
        .load("s3://data/orders/")
    )

JSON ingestion example:

@dlt.table
def bronze_json_orders():
    return (
        spark.readStream
        .format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("s3://data/orders_json/")
    )

Auto Loader maintains ingestion metadata, preventing duplicate processing of previously ingested files.
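That idempotency guarantee can be pictured as a ledger of already-processed files. The following is a simplified pure-Python sketch of the idea, not Auto Loader's actual mechanism (which persists its file-discovery state in a scalable checkpoint store):

```python
def discover_new_files(listing, processed):
    """Return only files not yet recorded in the ingestion ledger."""
    return [f for f in sorted(listing) if f not in processed]

def ingest(listing, processed):
    """Ingest newly discovered files and record them in the ledger."""
    new_files = discover_new_files(listing, processed)
    processed.update(new_files)  # recorded transactionally in a real system
    return new_files

ledger = set()
first = ingest(["orders_001.json", "orders_002.json"], ledger)
second = ingest(["orders_001.json", "orders_002.json", "orders_003.json"], ledger)
print(first, second)  # previously seen files are not reprocessed
```

Re-listing the same directory is therefore safe: only files absent from the ledger are handed to the pipeline.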


Message Bus Ingestion

DLT can also ingest streaming data from message brokers.

Example using Kafka:

@dlt.table
def kafka_events():
    return (
        spark.readStream
        .format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("subscribe", "events")
        .load()
    )

Streaming tables combined with continuous execution allow pipelines to process event streams with minimal latency.
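Note that the Kafka source exposes the record payload as a binary value column, so downstream logic must decode it. A minimal pure-Python sketch of that decode step, assuming JSON-encoded payloads (the field names are illustrative):

```python
import json

def decode_event(raw_value: bytes) -> dict:
    """Decode one Kafka record value: UTF-8 bytes carrying a JSON document."""
    return json.loads(raw_value.decode("utf-8"))

event = decode_event(b'{"event_id": 1, "type": "order_created"}')
print(event["type"])  # order_created
```

In a Spark pipeline the equivalent step is typically expressed by casting value to a string and applying from_json with an explicit schema.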


Embedded Data Quality with Expectations

DLT expectations allow engineers to enforce validation rules inside the pipeline DAG.

Example expectation:

@dlt.table(
  comment="Validated orders table"
)
@dlt.expect("valid_amount", "order_amount > 0")
def validated_orders():
    return dlt.read("silver_orders")

DLT supports three enforcement behaviours: retain and report (the default @dlt.expect shown above), drop invalid records, and fail the pipeline.

Drop invalid records:

@dlt.expect_or_drop("valid_customer", "customer_id IS NOT NULL")

Fail the pipeline on invalid data:

@dlt.expect_or_fail("valid_amount", "order_amount > 0")

Expectations can also be stacked to apply multiple rules to one table:

@dlt.expect("valid_customer", "customer_id IS NOT NULL")
@dlt.expect("valid_amount", "order_amount > 0")

These expectations enforce quality constraints during ingestion and transformation stages. 
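The three enforcement behaviours can be mimicked in plain Python to make their semantics concrete (a sketch of the semantics only, not DLT internals):

```python
def apply_expectation(rows, predicate, mode="warn"):
    """Apply one expectation to a batch of rows.

    mode='warn' keeps all rows and counts violations (default behaviour);
    mode='drop' removes violating rows;
    mode='fail' raises on any violation.
    """
    violations = [r for r in rows if not predicate(r)]
    if mode == "fail" and violations:
        raise ValueError(f"{len(violations)} rows violated expectation")
    kept = rows if mode == "warn" else [r for r in rows if predicate(r)]
    return kept, len(violations)

rows = [{"order_amount": 10}, {"order_amount": -5}]
kept, bad = apply_expectation(rows, lambda r: r["order_amount"] > 0, mode="drop")
print(len(kept), bad)  # one row kept, one violation recorded
```

In all three modes the violation count is recorded, which is what makes expectation metrics observable in the pipeline event log rather than silently lost.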


Best Practices & Anti-Patterns

What Works

• Use Auto Loader for streaming file ingestion

• Separate ingestion and transformation pipelines

• Apply expectations early in the pipeline

• Configure trigger intervals per table 

• Monitor backlog and pipeline health metrics


What Fails

• Combining batch and streaming transformations in a single stage 

• Running continuous pipelines unnecessarily 

• Ignoring schema evolution in streaming datasets 

• Implementing validation logic outside the pipeline graph


How Cloudaeon Approaches This

Large enterprise streaming platforms rarely fail due to compute limitations. Most failures arise from operational complexity and pipeline drift.

Cloudaeon’s engineering approach emphasises:

Pipeline topology discipline

  • Bronze ingestion isolation

  • Domain-based pipeline ownership

Operational observability

  • Expectation failure monitoring

  • Data freshness SLAs

  • Streaming backlog tracking
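As an illustration of what a freshness SLA check amounts to (a hypothetical monitor, not a Cloudaeon or Databricks API):

```python
from datetime import datetime, timedelta, timezone

def freshness_breached(last_update: datetime, sla: timedelta, now=None) -> bool:
    """True when a dataset's most recent update is older than its SLA window."""
    now = now or datetime.now(timezone.utc)
    return now - last_update > sla

# Example: table last updated 30 minutes ago against a 15-minute SLA.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
last = datetime(2024, 1, 1, 11, 30, tzinfo=timezone.utc)
print(freshness_breached(last, timedelta(minutes=15), now))  # True
```

In practice the last-update timestamp would come from pipeline event metadata, and breaches would feed alerting rather than a print statement.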

Incremental streaming adoption

Streaming capabilities are typically introduced at ingestion layers first, before extending into downstream transformation systems.

This ensures stability while gradually enabling real-time analytics across the platform.


Technology Stack

Streaming Infrastructure:

• Kafka 

• Azure Event Hubs 


Processing:

• Databricks Delta Live Tables 

• Spark Structured Streaming


Storage:

• Delta Lake


Ingestion:

• Databricks Auto Loader

• Cloud Object Storage


Governance:

• Unity Catalog

• DLT Expectations


Conclusion

Streaming ETL delivers significant value when real-time data is handled with the right architectural discipline. Platforms like Databricks Delta Live Tables simplify many operational challenges by providing declarative pipeline management, built-in data quality enforcement, and scalable streaming ingestion patterns. However, designing reliable streaming architectures still requires careful planning around pipeline topology, execution modes, and governance. If you're evaluating how to implement or optimise streaming ETL in your environment, speaking with an experienced data engineering expert can help you identify the right architecture and operational patterns for your workloads.

