
PrachiJain_2025
Advocate I

Best Approach for Real-Time Ingestion from Azure SQL CDC to Fabric

I need to ingest data in real-time from Azure SQL Database with CDC enabled into a Microsoft Fabric Lakehouse. The setup involves:

  • 1,000+ tables with millions of records.
  • Handling schema evolution dynamically to ensure multi-version compatibility.
  • Ensuring near real-time performance without relying on pre-built data pipelines due to resource and management constraints.

Key requirements:

  1. Real-Time Ingestion: Data must be ingested continuously or with minimal latency.
  2. Schema Evolution: Must dynamically handle changes in source table schemas (e.g., new columns, type changes).
  3. Scalability: The solution should handle high data volume efficiently.
  4. Automation: Minimal manual intervention, even for new tables.

Questions:

  1. Architecture:
    What would be the best architecture within Microsoft Fabric to handle such a large-scale ingestion task efficiently? Is there an established pattern for managing this volume and velocity of data in real-time?

  2. Schema Evolution:
    Are there any built-in tools or best practices in Delta Lake or Lakehouse for dynamically handling schema evolution (e.g., column additions, type changes) across multiple tables?

  3. Real-Time Streaming with Fabric Notebooks:
    If I run a Fabric Notebook continuously (24x7) to ingest real-time data using Spark Structured Streaming, could this approach introduce bottlenecks or operational challenges at scale? (A rough sketch of what I mean follows after this list.)
  4. Parallelism and Automation:
    What is the best way to achieve parallelism and automation for processing 1000+ tables in real time? Is there a preferred way to orchestrate multiple table ingestion tasks while maintaining low latency and high performance?
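
For context, here is the kind of always-on notebook I have in mind for question 3. This is a sketch only, assuming a Fabric notebook where `spark` and Delta are available; the rate source merely simulates a CDC feed, and the table and checkpoint names are placeholders:

```python
# Sketch only: continuous ingestion in a Fabric notebook. Assumes `spark` is
# the ambient session and the target Delta table was created by an initial load.
from pyspark.sql.functions import col

# Let new source columns propagate into the Delta table instead of failing
# the stream (Delta Lake schema evolution for MERGE).
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

# The rate source simulates a CDC feed; a real pipeline would read from
# Kafka/Event Hubs or from files landed by a CDC extractor.
cdc = (
    spark.readStream.format("rate")
         .option("rowsPerSecond", 10)
         .load()
         .withColumn("id", col("value") % 100)
)

def upsert(batch_df, batch_id):
    # MERGE each micro-batch into the Lakehouse table by key.
    from delta.tables import DeltaTable
    target = DeltaTable.forName(spark, "cdc_demo")  # placeholder table name
    (target.alias("t")
           .merge(batch_df.alias("s"), "t.id = s.id")
           .whenMatchedUpdateAll()
           .whenNotMatchedInsertAll()
           .execute())

query = (
    cdc.writeStream
       .foreachBatch(upsert)
       .option("checkpointLocation", "Files/checkpoints/cdc_demo")  # placeholder
       .start()
)
query.awaitTermination()  # keeps the notebook running 24x7
```

The foreachBatch MERGE pattern is what I would have to extend to every table, with autoMerge letting new source columns flow into the Delta tables.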

Thank you

1 ACCEPTED SOLUTION
DataBard
Resolver II

Hi @PrachiJain_2025 ,

Let me answer your questions specifically.

 

  1. For your scenario, I'd consider Fabric Mirroring of Azure SQL DB as your first option. After an initial population of the Delta Lake tables in Fabric, it replicates changes to your database tables to Fabric in near real-time.
    1. Note that documentation today says you are limited to 500 tables, but if you hit a limitation here you could possibly set up multiple mirrors to solve for this.
    2. Also note that Mirroring for Azure SQL DB currently requires using the public endpoint. It cannot be set up using a private endpoint currently.
  2. Mirroring would account for schema evolution across multiple tables.
  3. This approach theoretically would work, but keeping Spark clusters up and running 24/7 might require more CUs (capacity units) than alternative options. It would also tie up some percentage of your Spark cluster, limiting the Spark resources available to other users. Without knowing the full details of your situation, I'd point to Fabric's other options that solve this without requiring you to write a notebook:
    1. Mirroring - It is the simplest approach if you are working with Azure SQL DB as your source.
    2. Streaming Azure SQL DB CDC into a Real-time Hub - If you have CDC already set up, you could configure a Real-time Hub to ingest the CDC activities into a Fabric destination (documentation for this option can be found here: Add Azure SQL Database CDC as source in Real-Time hub - Microsoft Fabric | Microsoft Learn).
  4. Again, I'd encourage you to consider the suggestions above before building your own solution to ingest data from Azure SQL DB CDC in real time. This is a scenario Fabric is trying to make easier to solve without you having to write all of the code yourself. It would be fully automated and could run with the full parallelism your capacity allows (if you max out your capacity, you'd have to upgrade to a larger SKU to accommodate the load).
    1. If none of these options work and you have to write the tasks yourself, you'd want to look at Fabric solutions that support triggering activities based on events occurring (e.g., CDC events). Data Activator is one option that lets you kick off specific Fabric items when a process event or CDC change occurs; see the rough sketch after this list for one DIY pattern. Check out this link for intro information on Data Activator: Introduction to Activator - Microsoft Fabric | Microsoft Learn
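
If you do go down the DIY route in 4.1, one pattern that works is to fan per-table work out from a single Spark session with a thread pool, since Spark schedules the independent jobs concurrently. A minimal sketch, assuming `spark` is the ambient notebook session and `ingest_table` stands in for your own CDC merge logic (all names are placeholders):

```python
# Sketch only: fan per-table ingestion out from one Spark session. Assumes
# `spark` is the ambient Fabric notebook session; ingest_table() is a
# placeholder for your own read-CDC-slice-and-MERGE logic.
from concurrent.futures import ThreadPoolExecutor

tables = [f"dbo_table_{i}" for i in range(1000)]  # placeholder table list

def ingest_table(name: str) -> None:
    # Read this table's landed CDC slice (placeholder path); a real
    # implementation would MERGE by key rather than append.
    df = spark.read.format("delta").load(f"Files/cdc_landing/{name}")
    df.write.format("delta").mode("append").saveAsTable(name)

# A bounded pool keeps the driver responsive while Spark's scheduler runs the
# independent per-table jobs concurrently across the cluster.
with ThreadPoolExecutor(max_workers=16) as pool:
    list(pool.map(ingest_table, tables))
```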

If the suggestions above do not provide an acceptable solution, please share the specific scenarios you are encountering and the challenges you are facing with these options.


2 REPLIES
PrachiJain_2025
Advocate I

Hello @DataBard ,

 

Thank you so much for the suggestions!

 

1. Database Mirroring will not work because of the 500-table limitation; we have 500+ tables, and I don't want to split the main database into two subsets. I also want to set this up for multiple databases.

 

2. Streaming Azure SQL DB CDC into a Real-time Hub - Will this option handle schema evolution? The event stream can handle row-level changes and will keep updating the data incrementally. From Azure SQL CDC through the event stream, my destination is the Lakehouse. How will it handle schema evolution?
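
For example, if the Eventstream Lakehouse destination only appends rows using the current table schema, I assume I would have to evolve the schema myself, roughly like this (a sketch only; table and path names are placeholders):

```python
# Hypothetical fallback if the Eventstream destination does not evolve the
# Lakehouse table schema itself: diff schemas and add any missing columns.
incoming = spark.read.json("Files/cdc_landing/orders_sample.json")  # placeholder
existing = spark.table("orders")  # placeholder Lakehouse table

for field in incoming.schema.fields:
    if field.name not in existing.columns:
        spark.sql(
            f"ALTER TABLE orders ADD COLUMNS "
            f"({field.name} {field.dataType.simpleString()})"
        )
```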

 

 

Thank you so much!

 

