Re: Pipeline | REST API | High CU % Usage

__maca__ · ‎06-17-2024

Hi,

I have set up a simple process to query an external API and retrieve data using Copy Data Activity. The issue I'm facing is that the CU % Usage keeps increasing linearly and eventually reaches 100% (F64), even though the task seems simple and should not require such high usage. Here are the details of my setup:

I retrieve a list of 1,000 IDs from the REST API using a Copy Data Activity
I then use two ForEach blocks in parallel, each containing a Copy Data Activity, to query two separate endpoints with these IDs in batches of 15
The Intelligent throughput optimization is set to 4, and the Degree of copy parallelism is set to 1 (i.e., all settings to minimum)
Each file, between 100KB and 200KB in size, is saved as a JSON file (raw) in a Lakehouse / Files. Hence, the data is not processed in any way.
The whole process is contained within an Until activity and repeats itself until no more IDs are available (in total, probably about 20,000 IDs)
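
For scale, the setup above can be turned into a rough count of Copy Data runs. This is only a back-of-the-envelope sketch in Python, not pipeline code: the constants come from the post, `count_copy_runs` is a hypothetical helper, and it assumes every page holds a full 1,000 IDs and that the total really is about 20,000.

```python
from math import ceil

# Figures taken from the post; the total is the poster's own rough estimate.
IDS_PER_PAGE = 1_000     # IDs returned by each ID-list call
IDS_PER_REQUEST = 15     # IDs sent to an endpoint per Copy Data run
ENDPOINTS = 2            # the two parallel ForEach blocks
TOTAL_IDS = 20_000       # "probably about 20,000 IDs"


def batches(ids, size):
    """Split a list of IDs into consecutive batches of at most `size`."""
    return [ids[i:i + size] for i in range(0, len(ids), size)]


def count_copy_runs(total_ids=TOTAL_IDS):
    """Count Copy Data runs for the whole Until loop (hypothetical helper)."""
    pages = ceil(total_ids / IDS_PER_PAGE)          # Until iterations
    page_ids = list(range(IDS_PER_PAGE))            # placeholder IDs
    per_page = len(batches(page_ids, IDS_PER_REQUEST))
    detail_runs = pages * per_page * ENDPOINTS      # runs inside the ForEach blocks
    list_runs = pages                               # one ID-list copy per page
    return pages, per_page, detail_runs + list_runs


pages, per_page, total_runs = count_copy_runs()
print(pages, per_page, total_runs)  # 20 67 2700
```

So although each file is small, a full run is on the order of a few thousand separate Copy Data executions under these assumptions.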

Despite the really simple nature of these operations, the CU % Usage continues to rise. I expected it to remain stable at a low percentage for the entire run. Is there an auto-scaling function that might be causing this? I believe this Data Pipeline should be able to run efficiently on an F64 capacity for days without issues since the same process can easily run on any laptop, or even a Raspberry Pi.

Any help would be greatly appreciated!

Cheers!

v-cboorla-msft · ‎06-18-2024

Hi @__maca__

Thanks for using Microsoft Fabric Community.

The high and increasing CU % Usage in your Microsoft Fabric Data Pipeline, despite a seemingly simple process, could be due to several factors.

Potential Culprits for High CU Usage:

ForEach Loop Execution: Even with a Degree of parallelism set to 1, executing ForEach loops in parallel might trigger resource allocation for both simultaneously. Consider changing them to sequential execution.

Batch Processing Overhead: Retrieving data in batches of 15 (67 iterations per 1,000 IDs) adds overhead on every iteration. Experiment with larger batch sizes (e.g., 50 or 100) within the ForEach loops.

Accumulating Data Volume: While individual files are small, processing 20,000 IDs might accumulate data volume over time. Monitor memory usage and consider data compression if applicable.

Hidden Processing: Double-check Copy Data Activity settings for unintended transformations or validations that could increase resource consumption.

Metrics Analysis: Utilize Azure Monitor to track specific metrics like API call latency, data transfer, and memory usage. This can pinpoint the resource bottleneck.
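
The iteration counts behind the batch-size suggestion can be checked with a few lines of Python. This is plain arithmetic on the numbers quoted above; `iterations` is a hypothetical helper, and the sketch says nothing about how much CU each iteration actually consumes.

```python
from math import ceil


def iterations(total_ids, batch_size):
    """Number of ForEach iterations needed to cover `total_ids`."""
    return ceil(total_ids / batch_size)


# Batch sizes mentioned in the reply, applied to one page of 1,000 IDs.
for size in (15, 50, 100):
    print(size, iterations(1000, size))
# 15 67
# 50 20
# 100 10
```

Larger batches only help if the external API accepts them, since the total number of IDs retrieved stays the same.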

Optimizing Your Data Pipeline:

Refine ForEach Loop Execution: Change the ForEach loop execution to sequential to potentially limit the initial resource allocation spike.

Adjust Batch Size: Experiment with larger batch sizes (50 or 100) within the ForEach loop to reduce iterations and context switching.

Explore Bulk Retrieval: If the external API allows, consider retrieving all 1000 IDs in a single call for a simpler logic and potentially less context switching.

Monitor Key Metrics: Use Azure Monitor to identify resource bottlenecks by tracking API call latency, data transfer, and memory usage.

Review Activity Configuration: Double-check Copy Data Activity settings for any unnecessary transformations or validations.

Leverage Compression: Depending on the data format, consider compressing the retrieved JSON data before saving it to the lakehouse to potentially reduce storage and CU usage.

By implementing these suggestions and analyzing the specific metrics, you should be able to pinpoint the root cause of the high CU usage and significantly improve your Data Pipeline's efficiency.

I hope this information helps.

Thank you.

DanielSDavis · ‎10-05-2024

We're having a similar issue for a similar cause, and I have to say your response doesn't make all that much sense to me. Wouldn't increasing the batch size so something runs faster consume MORE compute?

It's not uncommon for APIs to have concurrency limits and record count limits to ensure their own resources aren't exceeded. In OUR case it's NetSuite. Try as we may, the most we can pull at any given time is 10 concurrent requests at 1,000 records each.

We've implemented a FOREACH within Fabric Data Factory that sources the data in 10 batches before iterating to the next 10, so at any given time I would expect to only see CU usage for 10 batches. When the ForEach iterates to the next loop it's a new set of 10 requests and the last request is complete. I would expect burndown to handle closing the last execution so the CU shouldn't grow out of control.
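
To make the "10 at a time" expectation concrete, here is a minimal Python sketch of the same wave pattern outside Fabric. It is an illustration only: `fetch_page` is a stand-in that fabricates records rather than calling NetSuite, `fetch_in_waves` is a hypothetical helper, and the 35 pages are an arbitrary example value.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT = 10          # concurrency limit described in the post
RECORDS_PER_REQUEST = 1_000  # record limit per request described in the post

_lock = threading.Lock()
_in_flight = 0
peak_in_flight = 0


def fetch_page(page):
    """Stand-in for one API request; returns fake records for `page`."""
    global _in_flight, peak_in_flight
    with _lock:
        _in_flight += 1
        peak_in_flight = max(peak_in_flight, _in_flight)
    try:
        start = page * RECORDS_PER_REQUEST
        return list(range(start, start + RECORDS_PER_REQUEST))
    finally:
        with _lock:
            _in_flight -= 1


def fetch_in_waves(pages, wave_size=MAX_CONCURRENT):
    """Finish one wave of `wave_size` requests before starting the next."""
    records = []
    with ThreadPoolExecutor(max_workers=wave_size) as pool:
        for i in range(0, len(pages), wave_size):
            wave = pages[i:i + wave_size]
            # list() blocks until every request in this wave has completed.
            for result in list(pool.map(fetch_page, wave)):
                records.extend(result)
    return records


records = fetch_in_waves(list(range(35)))  # 35 pages is an arbitrary example
print(len(records), peak_in_flight <= MAX_CONCURRENT)
```

In this sketch the number of requests in flight never exceeds 10 and nothing is held once a wave completes, which is the behaviour I expected from the ForEach.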

That's not what happens. It starts off low and just keeps climbing and climbing. It's like no burndown is happening at all and the staging of the copy step is just holding on to resources.

v-cboorla-msft · ‎06-19-2024

Hi @__maca__

We haven’t heard back from you on the last response and are just checking to see if you have a resolution yet.
If you have a resolution, please share it with the community, as it can be helpful to others.
Otherwise, please respond with more details and we will try to help.

Thank you.

__maca__ · ‎06-19-2024

Hi @v-cboorla-msft,

Thanks for the detailed answer and sorry for the late reply. I’ve been busy trying out all the ideas you suggested. Unfortunately, I haven’t seen any improvement yet. The API we’re using limits us to 15 IDs at a time, which makes bulk retrieval somewhat impossible. I’ve set up my pipeline so that all the ForEach loops are triggered one after another and tried using higher batch sizes (the max allowed by Fabric is 50, so I couldn’t go beyond that). Do you think setting the ForEach activities to sequential within the loops could solve the problem? It might take forever, though...

There must be another way to make this work. I don’t see how switching to an even higher capacity makes sense for such a basic operation. I’d really appreciate it if you could provide me with other ideas or solutions.

Thanks!

v-cboorla-msft · ‎06-25-2024

Hi @__maca__

Apologies for the inconvenience and the delay in response.

Please reach out to our support team to gain deeper insights and explore potential solutions. Their expertise will be invaluable in suggesting the most appropriate approach.

Please go ahead and raise a support ticket to reach our support team:

https://support.fabric.microsoft.com/support

After creating a support ticket, please provide the ticket number, as it will help us track the issue.

Thank you.

v-cboorla-msft · ‎06-27-2024

Hi @__maca__

We haven’t heard back from you on the last response and are just checking to see if you've had a chance to submit a support ticket. If you have, a reference to the ticket number would be greatly appreciated. This will allow us to track the progress of your request and ensure you receive the most efficient support possible.

Thank you.