Simplify Streaming ETL with Databricks Delta Live Tables

As businesses increasingly rely on real-time data, traditional batch ETL processes are no longer sufficient to meet modern analytics needs. Streaming ETL offers a transformative approach, allowing organisations to process and transfer data continuously, ensuring instant insights and faster decision-making.
Unlike batch ETL, which processes data in chunks and introduces delays, streaming ETL works on a constant flow of events, making it ideal for use cases such as fraud detection, real-time stock market analysis, and live website traffic monitoring. However, setting up and maintaining streaming ETL pipelines can be complex.
This is where Databricks Delta Live Tables (DLT) comes in. DLT simplifies the process of building and operating ETL pipelines, reducing complexity while ensuring efficiency and scalability.
This article explores how DLT can enhance streaming ETL, its key features, and best practices for implementation.
Author: Nikhil Mohod
I'm a Data Engineer with 8 years of experience specialising in the Azure data ecosystem. I design and implement scalable data pipelines, lakes and ETL/ELT solutions using tools like ADF, Airflow, Databricks, Synapse and SQL Server. Focused on building high-quality, secure, and optimised cloud data architecture.
What is Streaming ETL?
Streaming ETL is a way to handle real-time data by continuously processing and transferring it between different systems for analysis. Unlike traditional batch processing, which works in chunks that can cause delays, streaming ETL transforms and loads data as it arrives, ensuring updates happen instantly. This makes it especially valuable for industries that rely on real-time insights, such as banking, e-commerce and IoT applications in manufacturing.
At its core, streaming ETL is an evolved form of the extract, transform and load (ETL) process. Instead of waiting for data to accumulate before processing, it works on a constant flow of events. These events can be anything from a user clicking on a website to a new social media post or a temperature reading from a sensor.
Whenever data is generated by a source system, the streaming ETL platform picks it up, processes it on the fly and moves it to the target system.
Along the way, it can apply various operations such as filtering, routing and mapping, allowing businesses to gain immediate insights and act on the most up-to-date information.
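As a simple illustration of these on-the-fly operations, the sketch below uses Spark Structured Streaming to filter and map a stream of JSON events as they arrive; the paths, schema and column names are illustrative assumptions, not part of any specific pipeline.
Python:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("streaming-etl-sketch").getOrCreate()

# Read events continuously from a (hypothetical) landing folder of JSON files.
events = (
  spark.readStream.format("json")
    .schema("event_id STRING, event_type STRING, amount DOUBLE, ts TIMESTAMP")
    .load("/tmp/events/incoming")
)

# Filter (keep purchases), map (derive a new column) and route the result to a Delta target.
purchases = (
  events.filter(F.col("event_type") == "purchase")
    .withColumn("is_high_value", F.col("amount") > 100)
    .select("event_id", "amount", "is_high_value", "ts")
)

(purchases.writeStream
  .format("delta")
  .option("checkpointLocation", "/tmp/events/_checkpoint")
  .start("/tmp/events/purchases"))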
How is Streaming ETL different from Batch ETL?
When choosing an ETL pipeline, it all comes down to your specific needs and how you plan to use the data. The decision between batch ETL and streaming ETL depends on a few key factors:
Purpose
Batch ETL is great when you need to process large amounts of historical data at scheduled times, like running sales reports at the end of the day or financial data at the end of a quarter. On the other hand, streaming ETL handles data in real-time, processing and moving it as soon as it’s generated.
Speed
Since batch ETL works with data in chunks, there’s always some delay before you can access insights. Streaming ETL, however, delivers data instantly with minimal lag, making it the better choice for real-time updates.
Best Use Cases
If you're working on fraud detection, real-time stock market analysis, or tracking live website traffic, streaming ETL is the way to go. But if you’re generating monthly sales reports, analysing customer trends over time, or managing large historical datasets, batch ETL is the better fit.

Why Streaming ETL?
Streaming ETL speeds up data-driven decision-making for businesses by processing information as soon as it is received. It provides the following primary advantages for companies that depend on real-time data:
Real-Time Analytics Provide Instant Insights
Because streaming ETL continuously processes data, businesses always have access to the most recent insights. This is especially useful in circumstances where prompt decision-making is crucial, such as when adjusting supply chain operations in real time.
Reliable Data Quality
Streaming ETL enables consistent detection and resolution of inconsistencies as they appear. This is key for maintaining accurate, clean data that reduces errors and ensures business leaders can trust the information they use to make decisions.
Scalability to Handle Growing Amounts of Data
Streaming ETL platforms are designed to scale easily. By using in-memory computing and dividing workloads among multiple processing units, they can manage unforeseen spikes in data without compromising performance.
Smooth Integration Across Multiple Platforms
Whether the data comes from traditional databases, cloud apps, or IoT devices, streaming ETL ensures smooth integration. This unified approach removes redundant steps in data management and keeps everything linked in real time.
Deeper, More Useful Information
Streaming ETL enhances data rather than just processing it. By merging real-time data with historical records or external sources, businesses can uncover trends, spot anomalies, and improve predictive analytics for more informed decision-making.
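As a hedged sketch of this enrichment pattern, the example below joins a stream of transactions with a static table of historical customer averages inside a DLT pipeline; the table and column names are hypothetical, chosen only to illustrate the stream-static join.
Python:
import dlt
from pyspark.sql import functions as F

@dlt.table
def enriched_transactions():
  # Streaming source of new transactions (hypothetical table name).
  txns = spark.readStream.table("raw_transactions")
  # Static/historical reference data (hypothetical table name).
  history = spark.read.table("customer_history")
  # Stream-static join plus a simple anomaly flag for illustration.
  return (
    txns.join(history, on="customer_id", how="left")
      .withColumn("is_unusual_amount", F.col("amount") > F.col("avg_historic_amount") * 3)
  )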
How to Simplify Streaming ETL?
Databricks Delta Live Tables (DLT) is the answer
Building and operating data pipelines is not an easy task, but DLT makes it simpler. It is designed for ease of use, letting you build robust ETL pipelines with just a few lines of code.
How does Databricks DLT help simplify Streaming ETL?
Choose between Pipeline Processing Modes
Databricks DLT offers two pipeline modes:
Triggered
Continuous
Because DLT builds on Spark's unified batch and streaming API, you can toggle between processing modes without rewriting your pipeline code.
Triggered Pipeline Mode
If the pipeline runs in triggered mode, it stops processing once all tables or selected tables have been successfully refreshed. This ensures that each table is updated based on the data available at the start of the update cycle.
Continuous Pipeline Mode
If the pipeline runs in continuous execution mode, DLT processes new data as it arrives, ensuring that tables remain up to date.
To prevent unnecessary processing, pipelines automatically monitor dependent Delta tables and only update them when their contents change. Continuous pipelines require an always-running cluster to reduce processing latency.
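In the pipeline settings JSON, the execution mode is controlled by the continuous flag; a minimal sketch (other settings omitted, pipeline name illustrative):
JSON:
{
  "name": "my_streaming_pipeline",
  "continuous": true
}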
Set trigger interval for continuous pipelines
When setting up pipelines in continuous mode, you can define trigger intervals to control how often the pipeline updates each data flow.
pipelines.trigger.interval allows you to adjust the trigger interval for updating a specific table or the entire pipeline. Since triggered pipelines update each table only once per run, this setting applies only to continuous pipelines.
Databricks recommends setting pipelines.trigger.interval at the table level, as streaming and batch queries have different default behaviours. Only set this value at the pipeline level if you need to manage updates across the entire pipeline graph.
You set pipelines.trigger.interval on a table using spark_conf in Python or SET in SQL:
Python:
@dlt.table(
  spark_conf={"pipelines.trigger.interval": "10 seconds"}
)
def <function-name>():
  return (<query>)
SQL:
SET pipelines.trigger.interval=10 seconds;
CREATE OR REFRESH MATERIALIZED VIEW TABLE_NAME
AS SELECT ...
To set pipelines.trigger.interval on a pipeline, add it to the configuration object in the pipeline settings:
JSON:
{
  "configuration": {
    "pipelines.trigger.interval": "10 seconds"
  }
}
Transform Data with Pipelines
DLT simplifies pipeline optimisation by allowing you to define a complete incremental data pipeline using streaming tables and materialised views.
You can define a dataset using any query that returns a DataFrame. In your DLT pipeline, you can apply Apache Spark's built-in operations, UDFs, custom logic, and MLflow models as transformations. Once data is ingested, you can create new datasets from upstream sources to generate streaming tables, materialised views, and standard views.
Combine streaming tables and materialised views in a single pipeline:
By combining streaming tables and materialised views within a single pipeline, you can streamline data processing, reduce the need for costly re-ingestion or re-processing, and leverage SQL for complex aggregations on efficiently encoded and filtered datasets. The following example demonstrates this mixed processing approach:
Python:
@dlt.table
def streaming_bronze():
  return (
    # Since this is a streaming source, this table is incremental.
    spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "json")
      .load("s3://path/to/raw/data")
  )

@dlt.table
def streaming_silver():
  # Since we read the bronze table as a stream, this silver table is also
  # updated incrementally.
  return spark.readStream.table("streaming_bronze").where(...)

@dlt.table
def live_gold():
  # This table will be recomputed completely by reading the whole silver table
  # when it is updated, so it is read as a batch (non-streaming) source.
  return spark.read.table("streaming_silver").groupBy("user_id").count()
SQL:
CREATE OR REFRESH STREAMING TABLE streaming_bronze
AS SELECT * FROM STREAM read_files(
  "s3://path/to/raw/data",
  format => "json"
)

CREATE OR REFRESH STREAMING TABLE streaming_silver
AS SELECT * FROM STREAM(streaming_bronze) WHERE ...

CREATE OR REFRESH MATERIALIZED VIEW mv_gold
AS SELECT count(*) FROM streaming_silver GROUP BY user_id
Load Data from anywhere with DLT
With DLT, you can load data from any source supported by Apache Spark on Databricks.
You can define datasets using any query that returns a Spark DataFrame, including streaming DataFrames and Pandas on Spark DataFrames.
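For instance, here is a minimal sketch of a dataset built from a pandas-on-Spark DataFrame; the table name and sample values are purely illustrative.
Python:
import dlt
import pyspark.pandas as ps

@dlt.table
def pandas_on_spark_example():
  # Build a small pandas-on-Spark DataFrame and return it as a Spark DataFrame.
  psdf = ps.DataFrame({"id": [1, 2, 3], "category": ["a", "b", "c"]})
  return psdf.to_spark()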

Load files from cloud object storage
Databricks recommends using Auto Loader with DLT for most data ingestion tasks from cloud object storage. Auto Loader and DLT work together to incrementally and idempotently load continuously growing data as it arrives. The following examples demonstrate how to use Auto Loader to create datasets from CSV and JSON files:
Python:
@dlt.table
def customers():
  return (
    spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "csv")
      .load("/databricks-datasets/retail-org/customers/")
  )

@dlt.table
def sales_orders_raw():
  return (
    spark.readStream.format("cloudFiles")
      .option("cloudFiles.format", "json")
      .load("/databricks-datasets/retail-org/sales_orders/")
  )
SQL:
CREATE OR REFRESH STREAMING TABLE customers
AS SELECT * FROM STREAM read_files(
  "/databricks-datasets/retail-org/customers/",
  format => "csv"
)

CREATE OR REFRESH STREAMING TABLE sales_orders_raw
AS SELECT * FROM STREAM read_files(
  "/databricks-datasets/retail-org/sales_orders/",
  format => "json"
)
Load data from a message bus
You can set up DLT pipelines to ingest data from message buses using streaming tables. For optimal efficiency and low-latency data loading, Databricks recommends combining streaming tables with continuous execution and enhanced autoscaling (a configuration sketch follows the examples below).
Python:
import dlt

@dlt.table
def kafka_raw():
  return (
    spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "<server:ip>")
      .option("subscribe", "topic1")
      .option("startingOffsets", "latest")
      .load()
  )
SQL:
CREATE OR REFRESH STREAMING TABLE streaming_silver_table
AS SELECT *
FROM STREAM(kafka_raw)
WHERE ...
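The enhanced autoscaling recommended above is enabled in the pipeline settings rather than in the query itself; the sketch below shows the relevant cluster configuration, with worker counts as illustrative assumptions.
JSON:
{
  "clusters": [
    {
      "label": "default",
      "autoscale": {
        "min_workers": 1,
        "max_workers": 5,
        "mode": "ENHANCED"
      }
    }
  ]
}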
Load small or static datasets from cloud object storage
You can load small or static datasets using Apache Spark load syntax. DLT supports all of the file formats supported by Apache Spark on Databricks.
The following examples demonstrate loading JSON to create DLT tables:
Python:
@dlt.table
def clickstream_raw():
  return (
    spark.read.format("json")
      .load("/databricks-datasets/wikipedia-datasets/data-001/clickstream/raw-uncompressed-json/2015_2_clickstream.json")
  )
SQL:
CREATE OR REFRESH MATERIALIZED VIEW clickstream_raw
AS SELECT * FROM read_files(
  "/databricks-datasets/wikipedia-datasets/data-001/clickstream/raw-uncompressed-json/2015_2_clickstream.json"
)
Data Quality
Manage data quality with pipeline expectations
Use expectations to enforce data quality constraints as data moves through ETL pipelines. They help monitor data quality metrics and enable you to fail updates or filter out invalid records when issues are detected.
What are expectations?
Expectations are optional clauses used in materialised views, streaming tables, or view creation statements to enforce data quality checks on each record processed in a query. They rely on standard SQL Boolean expressions to define constraints. You can apply multiple expectations to a single dataset and configure them across all dataset declarations within a pipeline.
The constraint clause is a SQL conditional statement that evaluates to either true or false for each record. It defines the validation logic for the expectation. If a record does not meet the condition, the expectation is triggered. Constraints must use valid SQL syntax and cannot contain the following:
Custom Python functions
External service calls
Subqueries referencing other tables
The following are examples of constraints that could be added to dataset creation statements:
Python:
# Simple constraint
@dlt.expect("non_negative_price", "price >= 0")

# SQL functions
@dlt.expect("valid_date", "year(transaction_date) >= 2020")

# CASE statements
@dlt.expect("valid_order_status", """
  CASE
    WHEN type = 'ORDER' THEN status IN ('PENDING', 'COMPLETED', 'CANCELLED')
    WHEN type = 'REFUND' THEN status IN ('PENDING', 'APPROVED', 'REJECTED')
    ELSE false
  END
""")

# Multiple constraints
@dlt.expect("non_negative_price", "price >= 0")
@dlt.expect("valid_purchase_date", "date <= current_date()")

# Complex business logic
@dlt.expect(
  "valid_subscription_dates",
  """start_date <= end_date
    AND end_date <= current_date()
    AND start_date >= '2020-01-01'"""
)

# Complex boolean logic
@dlt.expect("valid_order_state", """
  (status = 'ACTIVE' AND balance > 0)
  OR (status = 'PENDING' AND created_date > current_date() - INTERVAL 7 DAYS)
""")
SQL:
-- Simple constraint
CONSTRAINT non_negative_price EXPECT (price >= 0)

-- SQL functions
CONSTRAINT valid_date EXPECT (year(transaction_date) >= 2020)

-- CASE statements
CONSTRAINT valid_order_status EXPECT (
  CASE
    WHEN type = 'ORDER' THEN status IN ('PENDING', 'COMPLETED', 'CANCELLED')
    WHEN type = 'REFUND' THEN status IN ('PENDING', 'APPROVED', 'REJECTED')
    ELSE false
  END
)

-- Multiple constraints
CONSTRAINT non_negative_price EXPECT (price >= 0)
CONSTRAINT valid_purchase_date EXPECT (date <= current_date())

-- Complex business logic
CONSTRAINT valid_subscription_dates EXPECT (
  start_date <= end_date
  AND end_date <= current_date()
  AND start_date >= '2020-01-01'
)

-- Complex boolean logic
CONSTRAINT valid_order_state EXPECT (
  (status = 'ACTIVE' AND balance > 0)
  OR (status = 'PENDING' AND created_date > current_date() - INTERVAL 7 DAYS)
)
Retaining invalid records is the default behaviour for expectations
Python:
@dlt.expect("valid timestamp", "timestamp > '2012-01-01'")
SQL:
CONSTRAINT valid_timestamp EXPECT (timestamp > '2012-01-01')
Drop invalid records
Python:
@dlt.expect_or_drop("valid_current_page", "current_page_id IS NOT NULL AND current_page_title IS NOT NULL")
SQL:
CONSTRAINT valid_current_page EXPECT (current_page_id IS NOT NULL and current_page_title IS NOT NULL) ON VIOLATION DROP ROW
Fail on invalid records
Python:
@dlt.expect_or_fail("valid_count", "count > 0")
SQL:
CONSTRAINT valid_count EXPECT (count > 0) ON VIOLATION FAIL UPDATE
Multiple expectations management
Python:
valid_pages = {
  "valid_count": "count > 0",
  "valid_current_page": "current_page_id IS NOT NULL AND current_page_title IS NOT NULL"
}

@dlt.table
@dlt.expect_all(valid_pages)
def raw_data():
  # Create a raw dataset
  ...

@dlt.table
@dlt.expect_all_or_drop(valid_pages)
def prepared_data():
  # Create a cleaned and prepared dataset
  ...

@dlt.table
@dlt.expect_all_or_fail(valid_pages)
def customer_facing_data():
  # Create a cleaned and prepared dataset to share
  ...
Case Study
A global, London-based retailer faced major supply chain challenges, including inconsistent data from multiple vendors, poor forecasting and outdated stock information.
Handling this multi-tiered global supply chain presented several challenges, including inconsistencies in order data, stock management and real-time data streaming.
To address these, Cloudaeon implemented a robust data pipeline leveraging technologies such as Databricks Delta Live Tables, Kafka, Azure Data Factory and Power BI.
Their streaming ETL was streamlined and simplified by using Databricks DLT to its full potential.
As a result, the retailer achieved higher data accuracy, better forecasting, secure data handling and significantly improved operational efficiency across its global supply chain.
Conclusion
In a world where real-time data is essential for business success, streaming ETL provides a competitive edge by delivering continuous, up-to-date insights. However, managing real-time data pipelines can be challenging without the right tools.
Databricks Delta Live Tables (DLT) simplifies streaming ETL by offering automated pipeline management, real-time data transformations and built-in quality checks. Whether you need real-time analytics, reliable data quality, or seamless integration across platforms, DLT provides a robust solution to streamline ETL processes.
By leveraging DLT, businesses can focus more on deriving actionable insights rather than managing infrastructure, making real-time data processing more accessible and efficient.
Ready to enhance your streaming ETL capabilities with Databricks DLT?
Cloudaeon can help you build the most efficient streaming ETL process you have ever imagined. Click here to learn more.