
Streamlining ETL with Databricks Lakeflow

Databricks Lakeflow streamlines ETL by unifying ingestion, transformation, and orchestration into an AI-powered pipeline framework on the Databricks Data Intelligence Platform. It combines Lakeflow Connect for low-code ingestion, Declarative Pipelines (DLT) for batch/streaming transformations with quality checks, and Lakeflow Jobs for native orchestration with retries, CI/CD, and parameter passing.

With AI-driven optimisation, unified governance via Unity Catalog, and serverless scalability, Lakeflow eliminates infrastructure overhead while cutting costs.

Lakeflow is ideal for data engineers, ML engineers, and platform teams building production-grade pipelines.

Author

Nikhil Mohod

I'm a Data Engineer with 8 years of experience specialising in the Azure data ecosystem. I design and implement scalable data pipelines, data lakes and ETL/ELT solutions using tools such as ADF, Airflow, Databricks, Synapse and SQL Server, with a focus on building high-quality, secure and optimised cloud data architecture.


Building and maintaining ETL pipelines across ingestion, transformation and orchestration is complex, especially when data is constantly changing. Databricks Lakeflow solves this by unifying previously separate components, such as Delta Live Tables and Workflows, into a fully native, AI-powered pipeline framework built on the Databricks Data Intelligence Platform.


Architecture


Lakeflow brings together three main pieces that work together:


  • Lakeflow Connect: handles getting your data in from databases, applications and real-time streams.


  • Declarative Pipelines: you might know this as Delta Live Tables (DLT); it takes care of both batch and streaming transformations.


  • Lakeflow Jobs: formerly called Workflows; it manages your entire workflow and automatically retries tasks when they fail.


Built on the Databricks Data Intelligence Platform, Lakeflow streamlines ETL with AI-driven intelligence that optimises scheduling, resolves issues and enhances pipeline visibility. Unified under Unity Catalog, governance, lineage and access controls remain consistent across the entire workflow, eliminating silos. Its serverless architecture further simplifies operations by automating scaling, removing infrastructure management hassles and ensuring cost efficiency through pay-as-you-go usage.



Technical walkthrough


By coding


  1. Lakeflow Connect: Getting data in without the hassle


You can either write code or use the point-and-click interface to pull data from places like Salesforce, ServiceNow, SQL Server, S3 and tons of other sources. The system figures out your schema automatically, reads data incrementally, and handles the writes for you.


from databricks.connectors import ConnectClient

client = ConnectClient()

# Create an ingestion pipeline that pulls Salesforce data via CDC
# and lands it in the bronze layer.
pipeline = client.create_pipeline(
    name="salesforce_ingestion",
    source="salesforce",
    options={
        "enable_cdc": True,
        "replication_interval": "1h",
    },
    target="bronze.salesforce_raw",
)

pipeline.start()

Explanation:


  • The system grabs data from Salesforce incrementally through change data capture (CDC)

  • It automatically infers your schema and lands everything in Delta Lake

  • No more building your own ingestion code or paying for expensive third-party connectors (a quick sanity check on the landed table is sketched below)
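
Below is a minimal sanity check (not part of the Lakeflow Connect API, and purely illustrative) that could be run in a Databricks notebook once the pipeline has landed data, assuming the target table bronze.salesforce_raw from the example above:

# 'spark' is available by default in a Databricks notebook.
raw = spark.read.table("bronze.salesforce_raw")

raw.printSchema()                     # schema inferred during ingestion
print("Rows ingested:", raw.count())  # quick volume check
raw.limit(5).show()                   # peek at a few records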


  2. Declarative Pipelines: Making transformations simple


Whether you prefer SQL or Python, this component has you covered. It has built-in data quality checks and compute that scales itself. You can handle incremental updates and backfill operations without having to rewrite your code.


import dlt
from pyspark.sql.functions import col

# Silver-layer table: cleaned and filtered order data read from the bronze layer.
@dlt.table(
    comment="Cleaned and filtered order data"
)
def silver_orders():
    return (
        spark.read.table("bronze.salesforce_raw")
        .filter(col("event_date") >= "2024-01-01")
        .select("order_id", "customer_id", "total_amount")
    )

Explanation:


  • Defines a Delta Live Tables (DLT) table using Python.

  • Supports incremental logic, in streaming or batch mode.

  • Includes built-in data quality checks and metadata tracking (see the expectations sketch below).
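
As a hedged illustration of those quality checks, the silver table could be extended with DLT expectations; the rule names and constraints below are assumptions for the sake of example, not part of the original pipeline:

import dlt

# Expectations drop rows that violate the stated rules and record
# violation counts in the pipeline's event log.
@dlt.table(comment="Orders that pass basic quality rules")
@dlt.expect_or_drop("non_null_order_id", "order_id IS NOT NULL")
@dlt.expect_or_drop("positive_amount", "total_amount > 0")
def silver_orders_validated():
    # dlt.read references another table defined in the same pipeline
    return dlt.read("silver_orders")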


These Declarative Pipelines can be a game changer: a single pipeline can manage the end-to-end medallion architecture (bronze, silver and gold layers), covering ingestion, transformation, data quality management and more. A simple gold-layer example is sketched below.
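
A minimal sketch of what a gold-layer table on top of silver_orders might look like, reusing the column names from the earlier example (the aggregation itself is illustrative, not part of the original pipeline):

import dlt
from pyspark.sql.functions import sum as sum_, countDistinct

# Gold-layer table: revenue and order counts per customer,
# derived from the silver table defined in the same pipeline.
@dlt.table(comment="Customer revenue summary (gold layer)")
def gold_customer_revenue():
    return (
        dlt.read("silver_orders")
        .groupBy("customer_id")
        .agg(
            sum_("total_amount").alias("total_revenue"),
            countDistinct("order_id").alias("order_count"),
        )
    )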



  3. Lakeflow Jobs: Orchestration made native


  • Create a job chaining ingestion → transform → downstream jobs (notebooks, ML scoring, dashboards).

  • Built-in control flow features like retries, partial reruns, and parameter passing via SQL task outputs. Enables CI/CD and modular task orchestration.


{
  "name": "lakeflow_etl_job",
  "tasks": [
    {
      "task_key": "run_silver_pipeline",
      "pipeline_task": {
        "pipeline_id": "your-dlt-pipeline-id"
      }
    },
    {
      "task_key": "run_validation_notebook",
      "depends_on": [{"task_key": "run_silver_pipeline"}],
      "notebook_task": {
        "notebook_path": "/analytics/validate_orders"
      }
    }
  ],
  "schedule": {
    "quartz_cron_expression": "0 0 * * * ?",
    "timezone_id": "UTC"
  }
}

Explanation:


  • Defines a job with dependent tasks (pipeline → notebook).

  • Supports retry policies, scheduling and parameter passing.

  • Can be created via the REST API, CLI, or Terraform (a minimal REST call is sketched below).
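
As an illustrative sketch of the REST option (workspace URL, token and file name are placeholders to substitute with your own), the job definition above could be submitted from Python:

import json
import requests

DATABRICKS_HOST = "https://<your-workspace>.azuredatabricks.net"  # placeholder
TOKEN = "<personal-access-token>"                                  # placeholder

# Load the job definition shown above from a local file.
with open("lakeflow_etl_job.json") as f:
    job_spec = json.load(f)

# Create the job via the Jobs 2.1 API.
resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
resp.raise_for_status()
print("Created job_id:", resp.json()["job_id"])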


By Databricks UI


All of these functionalities can also be configured through the Databricks UI. In the left panel you'll find a consolidated ‘Jobs & Pipelines’ option, from which all three components can be accessed:



  1. Lakeflow Connect: can be accessed via the ‘Ingestion Pipeline’ tile in the Databricks workspace. It allows you to easily ingest data from sources such as databases, SaaS apps and streams using a low-code interface, with support for CDC, schema inference and incremental loads for both batch and streaming pipelines.



  2. Declarative Pipelines: can be accessed via the ‘ETL Pipeline’ tile on the workspace homepage. These pipelines allow you to define transformations in SQL or Python using a simplified, declarative format, with support for incremental processing, built-in data quality checks and seamless orchestration with Lakeflow.



  3. Lakeflow Jobs: can be accessed through the ‘Job’ tile in the Databricks workspace. Jobs orchestrate complex workflows by chaining pipelines, notebooks and other tasks, with built-in support for retries, scheduling and parameter passing, providing a unified and scalable way to manage production-grade data workflows.



Key Learnings


  • Unified design eliminates the need for multiple third-party tools

  • Declarative style simplifies logic and accelerates development

  • Built-in AI improves observability and troubleshooting

  • Unity Catalog ensures end-to-end data governance and lineage


Results


  • 80 to 90% faster development of ETL pipelines.

  • Reduced dependency on Airflow, custom code, and manual monitoring.

  • Lower compute costs via intelligent scheduling and incremental loads.

  • Real-time pipeline readiness through unified batch and streaming.



Don’t forget to download or share with your colleagues and help your organisation navigate these trends.
