Solved: Re: Poor parallel processing problem in pipelines ...

dbeavon3 · 01-14-2025

When Microsoft built parallel loops into ADF pipelines, they chose static partitioning. The work sent to a parallel loop is divided up in advance and is never rebalanced, even when some items finish much faster than others.
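To make the static behavior concrete, here is a small Python simulation. It assumes round-robin assignment (item i goes to queue i mod N). I have not confirmed that this is the exact rule ADF uses; Microsoft's troubleshooting guide, as I read it, only says that ForEach pre-creates its queues (one per `batchCount`) and does not rebalance them at run time. The function name and the workload are hypothetical.

```python
from typing import List


def static_round_robin(durations: List[float], workers: int) -> List[float]:
    """Pre-assign item i to worker i % workers and never move it.

    Returns the total busy time of each worker. Every worker starts at time 0
    and runs its own items back to back, so this is also its finish time.
    """
    if workers < 1:
        raise ValueError("workers must be >= 1")
    finish = [0.0] * workers
    for i, d in enumerate(durations):
        # The worker is decided by the item's index alone, not by who is free.
        finish[i % workers] += d
    return finish


if __name__ == "__main__":
    # Hypothetical workload (minutes): one 10-minute item, eight 1-minute items.
    durations = [10.0] + [1.0] * 8
    per_worker = static_round_robin(durations, workers=4)
    print(per_worker)       # [12.0, 2.0, 2.0, 2.0]
    print(max(per_worker))  # 12.0, the loop's wall-clock time
```

Under this assumption the 10-minute item and two short items land on the same queue, so that queue needs 12 minutes while the other three finish after 2.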

In most scheduling engines for parallel work, load balancing is done dynamically. It is very frustrating that Microsoft did things this way. On many occasions I have wasted time working around this silly limitation in ADF. It would be far easier for Microsoft to solve it once, centrally, for the benefit of all their customers.
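Dynamic load balancing usually means one shared queue: whenever a worker becomes free, it takes the next pending item. The Python sketch below simulates that with a heap of worker free-times, using a hypothetical workload of one 10-minute item and eight 1-minute items on four workers (a fixed round-robin split of the same workload would leave one worker busy for 12 minutes). The function name is illustrative, not an ADF or Fabric API.

```python
import heapq
from typing import List


def dynamic_shared_queue(durations: List[float], workers: int) -> List[float]:
    """Simulate a shared work queue: each item goes to whichever worker frees
    up first. Returns each worker's total busy time (also its finish time,
    since no worker is ever idle while items are still pending)."""
    if workers < 1:
        raise ValueError("workers must be >= 1")
    # Heap of (time this worker becomes free, worker id); ties go to the lower id.
    free_at = [(0.0, w) for w in range(workers)]
    heapq.heapify(free_at)
    busy = [0.0] * workers
    for d in durations:
        t, w = heapq.heappop(free_at)  # earliest-available worker
        busy[w] += d
        heapq.heappush(free_at, (t + d, w))
    return busy


if __name__ == "__main__":
    durations = [10.0] + [1.0] * 8  # hypothetical minutes
    per_worker = dynamic_shared_queue(durations, workers=4)
    print(per_worker)       # [10.0, 3.0, 3.0, 2.0]
    print(max(per_worker))  # 10.0
```

The long item still takes 10 minutes, but nothing else waits behind it, so the loop ends when that item ends.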

About three years ago I reported this as an ADF bug, because it was mind-boggling to me that the partitioning would be so unfriendly.
Is there any plan to fix this in Fabric, now that Microsoft has moved the same technology over from their ADF platform?

NOTE: Below is what this problem looks like in Fabric. Notice that one item in the parallel loop is larger than the others, and it causes "stragglers" that keep running long after all the other parallel workers have gone to sleep.
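The cost of a straggler can be put into numbers from each worker's finish time: the loop lasts as long as its slowest worker (the makespan), and every other worker sits idle for the difference. A minimal Python sketch, with hypothetical finish times in minutes and an illustrative function name:

```python
from typing import Dict, List


def straggler_report(finish_times: List[float]) -> Dict[str, float]:
    """Report the makespan, the total idle worker-time spent waiting for the
    slowest worker, and the fraction of worker capacity actually used."""
    if not finish_times or max(finish_times) <= 0:
        raise ValueError("need at least one positive finish time")
    makespan = max(finish_times)
    idle = sum(makespan - t for t in finish_times)
    capacity = makespan * len(finish_times)
    return {"makespan": makespan, "idle": idle, "utilization": 1 - idle / capacity}


if __name__ == "__main__":
    # Hypothetical 4-way loop where one queue got the large item.
    print(straggler_report([12.0, 2.0, 2.0, 2.0]))
    # {'makespan': 12.0, 'idle': 30.0, 'utilization': 0.375}
```

In this made-up case three workers are idle for 10 minutes each, so only 37.5% of the reserved capacity does any work.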

I'm sure this P5 issue is familiar to others. Has anyone else contacted Microsoft? If you are reading this, would you please take a turn opening a support ticket about the scheduling bug? Given that they have been aware of the problem for years, I'm not optimistic it will be fixed until every customer is complaining. The bug is described in their docs, but they won't fix it for whatever reason.

I'm guessing here, but fixing this would probably reduce customer spend in ADF, because integration runtimes and virtual networks would not stay active as long as before. It may seem immaterial, but if every parallel loop for every ADF customer were shortened by just ~10 minutes, it would certainly add up to hundreds of thousands of dollars per year (possibly even millions). Normally I don't even start looking at workarounds until my pipelines are running an hour longer than they should. What a waste.
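The cost argument is simple multiplication: hours of runtime saved per year times the hourly cost of whatever stays active while a loop waits on stragglers. The sketch below uses made-up inputs purely to show the arithmetic; they are not real ADF prices or usage figures, and the function name is hypothetical.

```python
def annual_savings(loop_runs_per_day: float, minutes_saved_per_run: float,
                   cost_per_hour: float, days_per_year: int = 365) -> float:
    """Back-of-envelope estimate: runtime hours no longer billed per year,
    multiplied by an hourly rate for the resources that stay active
    (for example an integration runtime or a managed virtual network)."""
    hours_saved_per_year = (loop_runs_per_day * days_per_year
                            * minutes_saved_per_run / 60)
    return hours_saved_per_year * cost_per_hour


if __name__ == "__main__":
    # Hypothetical: 1,000 parallel-loop runs per day, 10 minutes saved each,
    # at an assumed $2 per hour of active runtime.
    print(f"${annual_savings(1000, 10, 2.0):,.0f} per year")  # $121,667 per year
```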


v-kpoloju-msft · 01-21-2025

Hi @dbeavon3,

We regret the inconvenience you are experiencing and acknowledge your requirements. However, we are unable to raise the support ticket on your behalf.

Kindly submit the support ticket using the link provided below.
https://learn.microsoft.com/en-us/power-bi/support/create-support-ticket

Thank you for your understanding.

v-kpoloju-msft · 02-03-2025

Hi @dbeavon3,

Since we haven't heard back from you, we wanted to follow up regarding your ticket.

Could you please provide an update on the status of your ticket? It will help other members of the community with similar problems solve them faster.

Thank you.

dbeavon3 · 02-04-2025

Hi @v-kpoloju-msft

The update is from Microsoft:
https://learn.microsoft.com/en-us/azure/data-factory/pipeline-trigger-troubleshoot-guide#degree-of-p...

The problem is that this is an obvious bug, and they have chosen not to fix it, even though customers have struggled with it for years:

Any concurrent or multithreaded programming language nowadays lets tasks be rebalanced while processing is underway. ADF customers expect it to do dynamic load balancing, especially given the excessive cost of the underlying compute and network components.
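For comparison, here is how a mainstream thread pool behaves. Python's `concurrent.futures.ThreadPoolExecutor` puts submitted tasks on one shared queue, and whichever thread is free picks up the next task, so a long task ties up only one thread. The `run_item` stand-in just sleeps, and the durations (in seconds) are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def run_item(seconds: float) -> float:
    """Stand-in for one loop iteration: sleep to simulate work."""
    time.sleep(seconds)
    return seconds


def run_loop(durations: List[float], workers: int) -> Tuple[List[float], float]:
    """Run all items on a pool of `workers` threads and time the whole loop.
    Executor.map returns results in input order, but idle threads pull the
    next queued item as soon as they finish their current one."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(run_item, durations))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    durations = [0.5] + [0.05] * 8  # one long item, eight short ones
    _, elapsed = run_loop(durations, workers=4)
    print(f"loop took {elapsed:.2f}s")
```

With four threads, the 0.5 s item occupies one thread while the other three share the eight short items, so the run should take a little over 0.5 s rather than the 0.6 s that a fixed round-robin split (0.5 + 0.05 + 0.05 on the first queue) would need.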

The workarounds are often complex and involve predicting how long something will take to run before you run it. The prediction is not always accurate, and producing it can take even more programming effort than the work done inside the loop.
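One common prediction-based workaround is the longest-processing-time-first (LPT) heuristic: sort items by estimated duration, largest first, and give each one to the batch with the least estimated work so far. A Python sketch follows; the item names and estimates are hypothetical, and this is a generic scheduling heuristic, not something built into ADF or Fabric. Its output is only as good as the estimates, which is exactly the weakness described above.

```python
import heapq
from typing import Dict, List, Tuple


def lpt_batches(estimates: Dict[str, float], batches: int) -> List[List[str]]:
    """Longest-processing-time-first: visit items from largest to smallest
    estimate and assign each to the batch with the smallest total so far."""
    if batches < 1:
        raise ValueError("batches must be >= 1")
    # Heap of (estimated load so far, batch index); ties go to the lower index.
    heap: List[Tuple[float, int]] = [(0.0, b) for b in range(batches)]
    heapq.heapify(heap)
    out: List[List[str]] = [[] for _ in range(batches)]
    # sorted() is stable, so items with equal estimates keep their input order.
    for name, est in sorted(estimates.items(), key=lambda kv: kv[1], reverse=True):
        load, b = heapq.heappop(heap)
        out[b].append(name)
        heapq.heappush(heap, (load + est, b))
    return out


if __name__ == "__main__":
    # Hypothetical predicted minutes per item.
    est = {"big_table": 10.0, "a": 1.0, "b": 1.0, "c": 1.0, "d": 1.0, "e": 1.0}
    print(lpt_batches(est, 3))  # [['big_table'], ['a', 'c', 'e'], ['b', 'd']]
```

One way to use the result (an assumption about pipeline design, not a documented pattern) is to pass each batch as a single item to the parallel loop and process that batch's contents sequentially inside it, for example in a child pipeline.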

v-kpoloju-msft · 02-07-2025

Hi @dbeavon3,

We apologize for the inconvenience. Unfortunately, we do not have an immediate solution currently. However, we will escalate this issue to our internal team to gather insights from various perspectives and resolve it as soon as possible.

Thank you.

v-kpoloju-msft · 06-05-2025

Hi @dbeavon3,

If the issue has been resolved, we kindly request you to share the resolution or key insights here to help others in the community. If we don’t hear back, we’ll go ahead and close this thread.

Should you need further assistance in the future, we encourage you to reach out via the Microsoft Fabric Community Forum and create a new thread. We’ll be happy to help.

Thank you for your understanding and participation.