
Orchestrating AI-driven data pipelines with Azure ADF and Databricks: An architectural evolution

Thursday, July 17, 2025, 01:21 PM, from InfoWorld
In the fast-evolving landscape of enterprise data management, the integration of artificial intelligence (AI) into data pipelines has become a game-changer. In “Designing a metadata-driven ETL framework with Azure ADF: An architectural perspective,” I laid the groundwork for a scalable, metadata-driven ETL framework using Azure Data Factory (ADF). This approach streamlined data workflows by leveraging metadata to dynamically configure extraction, transformation and loading processes, minimizing manual coding and enhancing adaptability. However, as organizations increasingly rely on AI and machine learning (ML) to unlock insights from their data, the demands on data architectures have grown more complex. 

Today’s enterprises face mounting pressure to process vast datasets, deliver real-time analytics and adapt to shifting business needs, all while maintaining scalability and governance. Industry trends, such as the rise of big data, the proliferation of cloud-native technologies and the growing adoption of AI-driven decision-making, underscore the need for pipelines that go beyond traditional ETL. 

Here, I am introducing an evolved version of that framework, integrating Azure Databricks for AI capabilities, a metadata-driven approach to MLOps and a feedback loop for continuous analytics. These enhancements transform the architecture into a robust system capable of meeting modern demands. I’ll walk through each component in detail. 

AI integration: Extending the metadata schema 

The heart of the original framework was its metadata schema, stored in Azure SQL Database, which allowed for dynamic configuration of ETL jobs. To incorporate AI, I extended this schema to orchestrate machine learning tasks alongside data integration, creating a unified pipeline that handles both. This required adding several new tables to the metadata repository: 

ML_Models: This table captures details about each ML model, including its type (e.g., regression, clustering), training datasets and inference endpoints. For instance, a forecasting model might reference a specific Databricks notebook and a Delta table containing historical sales data.

Feature_Engineering: Defines preprocessing steps like scaling numerical features or one-hot encoding categorical variables. By encoding these transformations in metadata, the framework automates data preparation for diverse ML models. 

Pipeline_Dependencies: Ensures tasks execute in the correct sequence (e.g., ETL before inference, storage after inference), maintaining workflow integrity across stages. 

Output_Storage: Specifies destinations for inference results, such as Delta tables for analytics or Azure SQL for reporting, ensuring outputs are readily accessible. 

Consider this metadata example for a job combining ETL and ML inference: 

{
  "job_id": 101,
  "stages": [
    {
      "id": 1,
      "type": "ETL",
      "source": "SQL Server",
      "destination": "ADLS Gen2",
      "object": "customer_transactions"
    },
    {
      "id": 2,
      "type": "Inference",
      "source": "ADLS Gen2",
      "script": "predict_churn.py",
      "output": "Delta Table"
    },
    {
      "id": 3,
      "type": "Storage",
      "source": "Delta Table",
      "destination": "Azure SQL",
      "table": "churn_predictions"
    }
  ]
}


This schema enables ADF to manage a pipeline that extracts transaction data, runs a churn prediction model in Databricks and stores the results, all driven by metadata. The benefits are twofold: it eliminates the need for bespoke coding for each AI use case, and it allows the system to adapt to new models or datasets by simply updating the metadata. This flexibility is crucial for enterprises aiming to scale AI initiatives without incurring significant technical debt. 
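To make the orchestration concrete, the dispatch logic that a metadata-driven parent pipeline performs can be sketched in a few lines of Python. This is a simplified stand-in, not ADF itself: the handler functions and the `execute_job` helper are hypothetical, and real stages would invoke copy activities and Databricks jobs rather than return strings.

```python
# Hypothetical sketch of metadata-driven stage dispatch. The metadata
# shape mirrors the JSON example above; the handlers are placeholders
# for real ADF child pipelines.

def run_etl(stage):
    return f"copied {stage['object']} from {stage['source']} to {stage['destination']}"

def run_inference(stage):
    return f"ran {stage['script']} on {stage['source']}, wrote {stage['output']}"

def run_storage(stage):
    return f"loaded {stage['source']} into {stage['destination']}.{stage['table']}"

HANDLERS = {"ETL": run_etl, "Inference": run_inference, "Storage": run_storage}

def execute_job(job):
    """Run each stage in metadata order, as the ADF parent pipeline would."""
    results = []
    for stage in sorted(job["stages"], key=lambda s: s["id"]):
        results.append(HANDLERS[stage["type"]](stage))
    return results
```

Because the stage types and their parameters live in metadata, onboarding a new job is a data change, not a code change.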

Metadata-driven MLOps: Simplifying the ML lifecycle 

MLOps, or Machine Learning Operations, bridges the gap between model development and production deployment, encompassing training, inference, monitoring and iteration. In large organizations, MLOps often involves multiple teams: data engineers building pipelines, data scientists crafting models and IT ensuring operational stability. To streamline this, I embedded MLOps into the framework using metadata, making the ML lifecycle more manageable and efficient. 

Here’s how metadata drives each phase: 

Model Training: The ML_Models table can trigger Databricks training jobs based on schedules or data updates. For example, a metadata entry might specify retraining a fraud detection model every month, automating the process entirely.

Inference: Metadata defines the model, input data and output location, allowing seamless execution of predictions. Data scientists can swap models (e.g., from version 1.0 to 2.0) by updating the metadata, avoiding pipeline rewrites.

Monitoring: Integrated with Azure Monitor or Databricks tools, the framework tracks metrics like model accuracy or data drift, with thresholds set in metadata. Alerts can trigger retraining or human review as needed. 
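The retraining decision described above can be sketched as a small metadata check. The column names (`retrain_every_days`, `drift_threshold`) are illustrative, not the actual `ML_Models` schema, and in practice the drift value would come from Azure Monitor or Databricks monitoring rather than being passed in directly.

```python
from datetime import date, timedelta

# Illustrative ML_Models metadata row; field names are hypothetical.
model_meta = {
    "model_name": "fraud_detector",
    "retrain_every_days": 30,
    "drift_threshold": 0.15,
}

def needs_retraining(meta, last_trained, observed_drift):
    """Retrain when the schedule has elapsed or observed drift exceeds
    the threshold stored in metadata."""
    schedule_due = date.today() - last_trained >= timedelta(days=meta["retrain_every_days"])
    drift_breached = observed_drift > meta["drift_threshold"]
    return schedule_due or drift_breached
```

Keeping the schedule and threshold in metadata lets data scientists tune them without touching pipeline code.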

This approach delivers significant advantages: 

Team collaboration: Metadata acts as a shared interface, enabling engineers and scientists to work independently yet cohesively.

Operational efficiency: New models or use cases can be onboarded rapidly, reducing deployment timelines from weeks to days.

Governance: Centralized metadata ensures version control, compliance and auditability, critical for regulated industries. 

By standardizing MLOps through metadata, the framework turns a traditionally fragmented process into a cohesive, scalable system, empowering enterprises to operationalize AI effectively. 

Feedback loop: Enabling continuous analytics 

A standout feature of this architecture is its feedback loop, which leverages inference outputs to trigger further analysis. Unlike traditional pipelines, where data flows linearly from source to sink, this system treats ML outputs (predictions, scores or classifications) as inputs for additional ETL or analytics tasks. This creates a cycle of continuous improvement and insight generation. 

Here are two practical scenarios:

A demand forecasting model predicts a supply shortage for a product. The prediction, stored in a Delta table, triggers an ETL job to extract inventory and supplier data, enabling procurement teams to act swiftly.

An anomaly detection model identifies unusual network traffic. This output initiates a job to pull logs and user activity data, aiding security teams in investigating potential breaches. 

Implementing this required enhancing the Pipeline_Dependencies table with conditional triggers. For instance, a rule might state: “If anomaly_score > 0.9, launch job_id 102.” This automation ensures the pipeline responds dynamically to AI outputs, maximizing their business impact. Over time, this feedback loop refines predictions and uncovers deeper insights, making the system proactive rather than reactive. 
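A minimal sketch of evaluating such conditional triggers might look like the following. The rule format (metric, operator, threshold, job to launch) is an assumption for illustration, not the actual `Pipeline_Dependencies` schema.

```python
import operator

# Map rule operators to comparison functions.
OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt, "==": operator.eq}

# Illustrative trigger rules, e.g. "if anomaly_score > 0.9, launch job 102".
rules = [
    {"metric": "anomaly_score", "op": ">", "threshold": 0.9, "launch_job_id": 102},
    {"metric": "shortage_units", "op": ">", "threshold": 0, "launch_job_id": 205},
]

def jobs_to_launch(inference_output, rules):
    """Return the job IDs whose trigger condition the model output satisfies."""
    launched = []
    for r in rules:
        value = inference_output.get(r["metric"])
        if value is not None and OPS[r["op"]](value, r["threshold"]):
            launched.append(r["launch_job_id"])
    return launched
```

In the architecture described here, the launched job IDs would be handed back to ADF to start the corresponding downstream pipelines.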

Technical implementation: ADF and Databricks integration 

The synergy between ADF and Databricks powers this architecture. ADF orchestrates workflows across hybrid environments, while Databricks handles compute-intensive ML tasks. Here’s how they integrate:

ADF parent pipeline: Parameterized by a job ID, it queries the metadata repository and executes tasks in sequence (ETL, inference and storage) via child pipelines.

ETL stage: ADF uses linked services to connect to sources (e.g., SQL Server) and sinks (e.g., ADLS Gen2), transforming data as defined in metadata.

Inference stage: ADF invokes Databricks notebooks through the REST API, passing parameters like script paths and data locations. Databricks auto-scaling clusters optimize performance for large jobs.

Storage stage: Post-inference, ADF stores results in Delta tables or Azure SQL, ensuring accessibility for downstream use. 

For hybrid setups, ADF's self-hosted integration runtimes handle on-premises data, with metadata selecting the appropriate runtime. This integration balances ADF's control-flow strengths with Databricks' analytical prowess, creating a cohesive system. 
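As a sketch of the inference stage, the call to Databricks can go through the Jobs API 2.1 `runs/submit` endpoint. The workspace URL, token, notebook path and cluster settings below are placeholders, and in this architecture the parameters would come from the metadata repository rather than literals:

```python
# Sketch of submitting a one-time Databricks notebook run via the
# Jobs API 2.1 runs/submit endpoint. All values here are placeholders;
# in the metadata-driven framework they are read from the ML_Models
# and Feature_Engineering tables.

def build_submit_payload(notebook_path, params, cluster_spec):
    """Build the JSON body for POST /api/2.1/jobs/runs/submit."""
    return {
        "run_name": f"adf-invoke-{notebook_path.rsplit('/', 1)[-1]}",
        "tasks": [{
            "task_key": "inference",
            "notebook_task": {
                "notebook_path": notebook_path,
                "base_parameters": params,
            },
            "new_cluster": cluster_spec,
        }],
    }

payload = build_submit_payload(
    "/pipelines/predict_churn",
    {"input_path": "abfss://data@lake.dfs.core.windows.net/customer_transactions"},
    {"spark_version": "14.3.x-scala2.12", "node_type_id": "Standard_DS3_v2",
     "autoscale": {"min_workers": 2, "max_workers": 8}},
)
# The actual call, omitted here, would be something like:
# requests.post(f"{workspace_url}/api/2.1/jobs/runs/submit",
#               headers={"Authorization": f"Bearer {token}"}, json=payload)
```

In production, ADF's native Databricks Notebook activity can replace the raw REST call; the payload shape is shown to make the parameter passing explicit.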

Why it matters 

This architecture addresses key enterprise challenges:

Agility: Metadata-driven design accelerates AI adoption, adapting to new requirements without overhauls.

Scalability: It handles growing data and model complexity effortlessly.

Value: The feedback loop ensures continuous insight generation, enhancing decision-making. 

Data as a strategic asset

By extending the original ETL framework with AI, MLOps and a feedback loop, this architecture empowers enterprises to harness data as a strategic asset. It’s a testament to the power of metadata-driven design in bridging data engineering and AI. Explore the Azure Data Factory documentation and Databricks MLflow guide for more. 

This article is published as part of the Foundry Expert Contributor Network.
