PhilBrown
Frequent Visitor

Suggestions for CDC driven data ingestion

We are attempting to drive our data ingestion via CDC from SQL Server in support of (near) real-time reporting. Generic mirroring seemed dicey because the high watermarks in various tables are unreliable, among other reasons. The generic Fabric streaming solution also became dicey because large legacy DBs would be locked for extended, and often unacceptable, periods during snapshot reads. What we've currently developed is a custom Debezium instance writing to Event Hubs, then a Fabric Spark job that reads the change events and translates them into the appropriate Delta tables, with a high watermark based on the CDC event.
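For readers less familiar with this pattern, the sketch below shows the general shape of such a job: Debezium events read from Event Hubs over its Kafka-compatible endpoint, then merged into a Delta table inside foreachBatch with a watermark guard. This is not the actual implementation described above; the namespace, topic, table name (silver_customers), columns (id, name, HighWatermark), and checkpoint path are all placeholders.

```python
# Minimal sketch of the pattern described above -- NOT the actual job. Debezium change
# events are read from Event Hubs over its Kafka-compatible endpoint and upserted into a
# Delta table, skipping events at or below each row's stored high watermark.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, LongType
from pyspark.sql.window import Window
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Placeholder connection details -- substitute your namespace, hub, and secret handling.
EH_BOOTSTRAP = "<namespace>.servicebus.windows.net:9093"
EH_TOPIC = "<debezium-topic>"
EH_CONN_STR = "<event-hubs-connection-string>"
JAAS = ('org.apache.kafka.common.security.plain.PlainLoginModule required '
        f'username="$ConnectionString" password="{EH_CONN_STR}";')

# Simplified Debezium envelope; real payloads also carry "source", schema metadata, etc.
event_schema = StructType([
    StructField("op", StringType()),       # c=create, u=update, d=delete, r=snapshot read
    StructField("ts_ms", LongType()),      # event timestamp, used as the high watermark
    StructField("after", StructType([StructField("id", LongType()),
                                     StructField("name", StringType())])),
    StructField("before", StructType([StructField("id", LongType())])),
])

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", EH_BOOTSTRAP)
       .option("subscribe", EH_TOPIC)
       .option("kafka.security.protocol", "SASL_SSL")
       .option("kafka.sasl.mechanism", "PLAIN")
       .option("kafka.sasl.jaas.config", JAAS)
       .load())

events = (raw.select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
          .select("e.op", "e.ts_ms", "e.after.*",
                  F.col("e.before.id").alias("before_id")))

def merge_batch(batch_df, batch_id):
    """Apply one micro-batch of change events to the target Delta table."""
    # Deletes carry the key in "before"; inserts/updates carry it in "after".
    keyed = batch_df.withColumn("key", F.coalesce("id", "before_id"))
    # Keep only the newest event per key within this batch.
    w = Window.partitionBy("key").orderBy(F.col("ts_ms").desc())
    latest = keyed.withColumn("rn", F.row_number().over(w)).filter("rn = 1").drop("rn")

    (DeltaTable.forName(spark, "silver_customers").alias("t")
     .merge(latest.alias("s"), "t.id = s.key")
     .whenMatchedDelete(condition="s.op = 'd'")
     .whenMatchedUpdate(condition="s.op <> 'd' AND s.ts_ms > t.HighWatermark",
                        set={"name": "s.name", "HighWatermark": "s.ts_ms"})
     .whenNotMatchedInsert(condition="s.op <> 'd'",
                           values={"id": "s.key", "name": "s.name", "HighWatermark": "s.ts_ms"})
     .execute())

(events.writeStream
 .foreachBatch(merge_batch)
 .option("checkpointLocation", "Files/checkpoints/silver_customers")  # placeholder path
 .trigger(processingTime="2 minutes")
 .start())
```

In practice the target table, key columns, and schema would be driven by metadata rather than hard-coded per table, but the merge-with-watermark shape is the same.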

 

All well and good. Scaling is where this begins to seem shaky. We have several hundred client DBs in Azure, and we're also incorporating on-prem DBs from other parts of our company. So...

  1. Are long-running Spark jobs going to be effective here? After 8 hours, one seems to have consumed about 30% of an F4 over 130k change events.
  2. Converting to an Azure Function might be an option, but I'd assume the various Delta merge actions would still consume the same amount of CU on the Fabric capacity itself. Is landing the data directly in Delta even a good option?
  3. Given the amount of work it seems to take to process all these updates into Delta, would we simply be better off replicating into a Fabric SQL DB (with an added HighWatermark column) and letting the rest of the pipeline/ADF handle the conversion?
1 ACCEPTED SOLUTION

Hi @PhilBrown , 

I’d encourage you to submit your detailed feedback and ideas via Microsoft's official feedback channels, such as the Microsoft Fabric Ideas portal.

Feedback submitted here is often reviewed by the product teams and can lead to meaningful improvement.

Thanks,

Prashanth Are

MS Fabric community support


7 REPLIES
v-prasare
Community Support

May I ask if you have resolved this issue? If so, please mark the helpful reply and accept it as the solution. This will help other community members with similar problems find the solution faster.


If we don’t hear back, we’ll go ahead and close this thread. For any further discussions or questions, please start a new thread in the Microsoft Fabric Community Forum; we’ll be happy to assist.
Thank you for being part of the Microsoft Fabric Community.

v-prasare
Community Support

Hi @PhilBrown,

As we have not heard back since our last conversation, could you please share your feedback on the suggestions below? Did they help, or are you still facing any other challenges?

Thanks,

Prashanth Are

MS Fabric community support

v-prasare
Community Support

Hi @PhilBrown,

As we have not heard back since our last conversation, could you please share your feedback on the suggestions below? Did they help, or are you still facing any other challenges?

@lbendlin, thanks for your input on this topic.

Thanks,

Prashanth Are

MS Fabric community support

v-prasare
Community Support

Hi @PhilBrown ,

Please consider the steps below for your use case and let me know if they help:

  1. Minimize Spark Job Duration
    Long-running Spark jobs are difficult to manage, consume significant Fabric Capacity, and are prone to failure over time. Instead of streaming jobs running for hours, use micro-batch jobs that execute every 5 to 15 minutes. These short-lived jobs process smaller chunks of data, allowing better scaling and monitoring. This approach also enables retry logic and isolates failures to a specific time window.
  2. Decouple Change Capture from Delta Write
    Separating the ingestion of CDC data from the logic that writes to Delta tables adds flexibility and improves fault tolerance. First, write raw Debezium change events into a Bronze Delta table as-is, without any transformation or merge logic. Then, in a separate step or job, read from the Bronze table and apply upserts or deletions into a Silver Delta table. This separation allows you to buffer, inspect, and reprocess raw data without needing to re-ingest from Event Hubs. A sketch of this Bronze-to-Silver pattern follows this list.
  3. Leverage Micro-Batching with Delta Lake
    Delta Lake performs best when updates and inserts are handled in small batches. Use micro-batch processing to append raw data into Bronze tables and then periodically run merge jobs into Silver tables. This reduces memory pressure and avoids large-scale shuffles. You can also optimize performance by partitioning on client ID or date and using Z-Ordering on frequently queried fields.
  4. Avoid Snapshot Reads on Large Databases
    Taking full snapshots from large SQL Server databases — especially on-prem — can result in long table locks and degraded source system performance. Instead, use Debezium in CDC-only mode to read directly from transaction logs without triggering heavy queries. If an initial load is required, perform it once offline or outside business hours, and switch to CDC for all incremental loads thereafter.
  5. Use Delta Lake Only When Necessary
    While Delta Lake provides powerful features like versioning and ACID compliance, it’s not needed at every stage of the pipeline. Use it selectively in Silver and Gold layers where data needs merging, analytics, or time travel. For raw ingestion or high-velocity logs, storing as Parquet or using Fabric Lakehouse SQL may be more efficient. Avoid unnecessary overhead by choosing the right format for the right layer.

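To make points 2 and 3 concrete, here is a minimal sketch of the Bronze-to-Silver step, assuming the ingestion job is already appending raw Debezium events to a Bronze Delta table (called bronze_customers_cdc here purely for illustration, with placeholder columns id, name, ts_ms, and HighWatermark). Each scheduled run drains whatever is new, merges it into Silver, and exits, so the Spark session stays short-lived.

```python
# Sketch of the Bronze -> Silver micro-batch pattern (illustrative names only).
# Each run streams only the rows appended to Bronze since the last checkpoint,
# merges them into Silver, then stops, so the Spark session stays short-lived.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

def upsert_to_silver(batch_df, batch_id):
    # Deduplicate to the newest event per key, then MERGE (delete handling omitted for brevity).
    w = Window.partitionBy("id").orderBy(F.col("ts_ms").desc())
    latest = batch_df.withColumn("rn", F.row_number().over(w)).filter("rn = 1").drop("rn")
    (DeltaTable.forName(spark, "silver_customers").alias("t")
     .merge(latest.alias("s"), "t.id = s.id")
     .whenMatchedUpdate(condition="s.ts_ms > t.HighWatermark",
                        set={"name": "s.name", "HighWatermark": "s.ts_ms"})
     .whenNotMatchedInsert(values={"id": "s.id", "name": "s.name", "HighWatermark": "s.ts_ms"})
     .execute())

(spark.readStream.table("bronze_customers_cdc")      # Bronze: raw Debezium events, appended as-is
 .writeStream
 .foreachBatch(upsert_to_silver)
 .option("checkpointLocation", "Files/checkpoints/bronze_to_silver")  # placeholder path
 .trigger(availableNow=True)                         # drain the backlog, then exit
 .start()
 .awaitTermination())
```

Scheduled every 5 to 15 minutes, this gives the micro-batch behaviour described in points 1 to 3 without keeping a session alive between runs.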
Thanks,

Prashanth Are

MS Fabric community support

#1 5 minutes seems to be the sweet/sour spot for spinning up and tearing down Spark sessions. If you need processing more rapidly than every 5 minutes, then keeping the session alive is more economical. If less frequently, then the spin-up/tear-down cost is more palatable. Similar to the start/stop feature of some cars.

#3 Not sure what you mean by micro-batching, but generally Delta Lake likes larger files, around the 1 GB mark. Lots of small files make for messy maintenance. So while CDC is nice for the data source, it is not necessarily good for Fabric.

#5 Whatever you do in Fabric will end up in Delta Lake, whether you want it or not.
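On the small-files point in #3, frequent CDC merges do tend to leave many small files behind. A periodic maintenance job along these lines is one common mitigation; the table name, Z-Order column, and retention window are placeholders, not a recommendation for this specific workload.

```python
# Hypothetical maintenance job for a Delta table receiving frequent small CDC merges.
# Table name, Z-Order column, and retention period are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("OPTIMIZE silver_customers ZORDER BY (ClientId)")  # compact small files, co-locate a hot column
spark.sql("VACUUM silver_customers RETAIN 168 HOURS")        # remove unreferenced files older than 7 days
```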

 

Thanks for the reply, Prashanth. A couple of notes while I research:

In general, there always seems to be some minor to significant latency in spinning up batch jobs, often adding between 2 and 4 minutes to any notebook, pipeline execution, or Spark job. I was attempting to avoid this accumulated latency with something closer to a long-running job, but this may be better suited to an Azure Function or service.

Regarding points 2 & 3, my existing Spark job is already doing this, effectively caching in memory and batching table upserts/deletes every 2-5 minutes.

On #4, there are config settings in Debezium that prevent full table locks during the snapshot, yet still provide acceptable concurrency in the majority of cases. They just aren't made available within the generic Eventstream setup in Fabric. Probably a big miss, in my opinion.
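For reference, the Debezium SQL Server connector settings being alluded to are along these lines; the exact property names and permitted values should be confirmed against the Debezium documentation for the connector version in use.

```python
# Illustrative only: the kind of Debezium SQL Server connector snapshot settings referred to
# above. Verify names and values against the Debezium docs for the connector version in use.
snapshot_settings = {
    "snapshot.mode": "initial",             # take the initial snapshot once, then stream from the log
    "snapshot.isolation.mode": "snapshot",  # snapshot isolation instead of exclusive table locks;
                                            # assumed to require ALLOW_SNAPSHOT_ISOLATION on the source DB
}
```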

