dbeavon3
Memorable Member

Dataflow gen1 not beginning at scheduled time

I have a dedicated P1 capacity that is brand new and basically idle.

I have a very basic gen 1 dataflow that is supposed to start at midnight.

The portal tells me it starts at midnight (04:00Z)

But I look in the mashup logs and the mashup container doesn't even get launched until 04:29Z!

 

Why is this product lying to me? Why is it so opaque? What am I waiting 30 minutes for? Why can't it do something as simple as start a job at a predictable time of day? How do we get to the bottom of this? I've opened countless PBI cases and I'm no closer to understanding the principles that guide the development of this product. It seems to be a black box, with hidden parts on the inside that work in random ways and never produce the expected outcome. Keeping a schedule is a concept that has been known to mankind for thousands of years. Why is it so hard for PBI to properly implement job-scheduling software?


I'm running the latest gateway software (on prem enterprise gateway).

 

Please let me know if there is anything that can be deciphered in the following log.  I see that the mashup logs say something happens at 4:04Z but I can't make out what it is, or why it is blocking the other work that I had planned for this time of night.  Click to zoom the following.
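Since the mashup log is timestamped line by line, one way to spot the dead interval is to script the gap-finding rather than eyeball it. A minimal sketch in Python, assuming each log line carries an ISO-8601-style timestamp near its start (the real mashup log format varies by gateway version, so the regex is an assumption to adapt):

```python
from datetime import datetime, timedelta
import re

# Assumed line shape: "2024-07-16T04:04:12 ..." somewhere near the start.
# Adjust the regex to match your actual mashup log format.
TS_RE = re.compile(r"(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})")

def find_gaps(lines, threshold_minutes=10):
    """Return (previous_ts, next_ts, gap) tuples wherever consecutive
    log entries are further apart than the threshold."""
    stamps = []
    for line in lines:
        m = TS_RE.search(line)
        if m:
            stamps.append(datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S"))
    limit = timedelta(minutes=threshold_minutes)
    return [(prev, nxt, nxt - prev)
            for prev, nxt in zip(stamps, stamps[1:])
            if nxt - prev > limit]

# Hypothetical sample lines illustrating the 04:04Z -> 04:29Z dead zone:
sample = [
    "2024-07-16T04:00:01 scheduler fired",
    "2024-07-16T04:04:02 container prepared",
    "2024-07-16T04:29:15 mashup container launched",
]
for prev, nxt, gap in find_gaps(sample):
    print(f"gap of {gap} between {prev} and {nxt}")
```

Running this over the exported mashup log for the refresh window should at least pin down exactly where the silent 25+ minute hole sits.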

dbeavon3_0-1721142812974.png

 

  

Anonymous
Not applicable

Hello @SaiTejaTalasila, thanks for your concern about this issue.

Your answer is excellent, and I would like to share some additional thoughts below.

Hi @dbeavon3, I am glad to help you, and I sympathize with your experience.

Refresh timing on the Power BI Service is affected by many factors. For a scheduled refresh, the actual start time depends on the load of the servers in your region: if those servers are under high load, the refresh may be delayed. The chosen time window also matters, because refresh tasks can queue, and a scheduled refresh may have to wait for other scheduled refreshes to finish before it gets its turn.

Therefore a delayed refresh is sometimes unavoidable, even when everything on the user's side is configured correctly and the refresh conditions are met. In that case it may be a performance problem of the service itself, which is difficult to avoid from the user's point of view.
I hope you can understand this situation.
As for the suggestion made by @SaiTejaTalasila, I think it is feasible: you can try to bypass the previous operation by creating a dataflow. A dataflow acts as a data source that preprocesses the data, similar to a kind of online Power Query. You can try rebuilding the logic as a dataflow, refresh that dataflow, and see if the same thing happens.

I hope my suggestions give you good ideas, if you have any more questions, please clarify in a follow-up reply.

Best Regards,

Carson Jian,

If this post helps, then please consider Accepting it as the solution to help other members find it more quickly.

 

@Anonymous 
Thanks for the reply.

 

>> performance of the server where the user is located
I'm not sure what server you are referring to.

Can you please be specific about what server and what component you are referring to?  We send data to our premium capacity by way of an on-prem gateway.  All servers that we can monitor are idle and have massive amounts of excess capacity.  There is no apparent explanation about where the source of these bottlenecks may be.


Are you saying I am competing with other users of my capacity? Or other users of my gateway? Or are you saying that I am competing with other customers in the same region? The scope of the problem isn't clear to me. Remember that the whole point of dedicated premium capacity is to avoid contending with Microsoft's other customers. We shouldn't be paying for premium capacity and still suffer a 30-minute bottleneck on some unknown component.

 

>> bypass the previous operation by creating a dataflow

 

My dataflow is the asset which is misbehaving and failing to begin on time.
(please read the title of this post).

 

I am working with Mindtree engineers.  It is taking time ... as of now they have been able to share limited information about failures that I wasn't able to detect on my end.

See below...

 

dbeavon3_1-1722036495830.png

 


Unfortunately I don't have access to see this sort of critical information, but there may be other clues that I'm missing.  Can you tell me if there are any clues which customers can directly examine in order to determine if/when the "runwithretries" for a dataflow is failing on one or more iterations? Is there a clue in my gateway logs?

 

Obviously this information is critical to customers who are trying to determine why refresh operations are delayed.

Anonymous
Not applicable

Hi @dbeavon3, thank you for your reply.
Yes, you can view gateway-related refresh logs by using the Power BI on-premises gateway application.

vjtianmsft_0-1722061810796.png
There are three types of gateway logs: GatewayErrors.log, GatewayInfo.log, and GatewayNetwork.log.
You can export the gateway logs after signing in to the application, and then search each log type by keyword to see whether there are any "RunWithRetries" failures recorded.
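If the exported bundle is large, a small script can run the keyword scan across all of the log files at once instead of opening each one. A sketch in Python, assuming the export is a folder of plain UTF-8 `.log` text files (folder path and keyword are placeholders):

```python
import os

def search_gateway_logs(root, keyword):
    """Case-insensitively search every *.log file under `root` for a
    keyword; yield (path, line_number, line) for each hit. Intended for
    a folder of logs exported from the gateway app's diagnostics page."""
    needle = keyword.lower()
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if not name.lower().endswith(".log"):
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as fh:
                for lineno, line in enumerate(fh, 1):
                    if needle in line.lower():
                        yield path, lineno, line.rstrip()

# Hypothetical usage against an exported log bundle:
# for path, lineno, line in search_gateway_logs(r"C:\GatewayLogs", "RunWithRetries"):
#     print(f"{path}:{lineno}: {line}")
```

The same scan works for any other keyword of interest (an activity ID, "Expired", "Spooling", etc.).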

vjtianmsft_1-1722061845595.png

 

vjtianmsft_2-1722061856171.png

vjtianmsft_3-1722061863385.png

I suggest you continue to work with Mindtree engineers to get more detailed information about internal runtime failures.

I hope my suggestions give you good ideas, if you have any more questions, please clarify in a follow-up reply.

Best Regards,

Carson Jian,

If this post helps, then please consider Accepting it as the solution to help other members find it more quickly.

@Anonymous 

I think I have more clues than before.

It appears that our gateway encounters "spooling failures" at a fairly high rate.  Are you familiar with them?

 

... Eg. in the files named "Report\QueryExecutionReport_200123-DESKTOP_20240523T220000.log"

... at location:

systemprofile\AppData\Local\Microsoft\On-premises data gateway\

 

... I'm finding errors like so:



"[""{\""kind\"":\""Web\"",\""path\"":\
""https://power-reporting.ufpi.com/power-reporting-api\""}""]",

,Spooling failed with error: The operation failed due to an explicit cancellation. Exception: System.Threading.Tasks.TaskCanceledException: A task was canceled. at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task) at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at Microsoft.PowerBI.DataMovement.Pipeline.Dataflow.TDFHelpers.<>c__DisplayClass7_0`1.<<GetNextResponseAsync>b__0>d.MoveNext()--- End of stack trace from previous location where exception was thrown --- at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw() at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task task) at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task) at Microsoft.PowerBI.DataMovement.Pipeline.Dataflow.TDFHelpers.<>c__DisplayClass12_0.<<ExecuteBlockOperationAsync>b__0>d.MoveNext()

 

 

There is no good way to understand why these failures are happening.  I have plenty of RAM, disk, and CPU.  I might open a new question in the community.  I also have a support ticket open, but it is taking a very long time to get in touch with the Microsoft product team.
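To at least quantify how often these spooling failures happen, the QueryExecutionReport files can be scanned and the failures grouped by the .NET exception type in the message. A rough sketch, with the message format assumed from the sample above:

```python
import re
from collections import Counter

# Pulls the .NET exception type out of messages shaped like:
#   "Spooling failed with error: ... Exception: Some.Namespace.FooException: ..."
# (format assumed from the log excerpt above; adjust if yours differs).
EXC_RE = re.compile(r"Exception:\s*([\w.]+Exception)")

def tally_spooling_failures(lines):
    """Count 'Spooling failed' lines, keyed by exception type."""
    counts = Counter()
    for line in lines:
        if "Spooling failed with error" in line:
            m = EXC_RE.search(line)
            counts[m.group(1) if m else "<unknown>"] += 1
    return counts

sample = [
    "Spooling failed with error: The operation failed due to an explicit "
    "cancellation. Exception: System.Threading.Tasks.TaskCanceledException: "
    "A task was canceled.",
]
print(tally_spooling_failures(sample))
```

Feeding it every `QueryExecutionReport_*.log` over a few weeks would show whether the failure rate correlates with the delayed refreshes.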

 

 

 

@Anonymous 

The most important type of log is the "mashup" log, which you omitted from your list. That is where I find over 90% of our issues (at least the issues that customers are responsible for troubleshooting ourselves). The mashup log is the one showing custom/hosted workloads running on the gateway.


I am not finding any mention of "retries" in my mashup logs (see original post), or I probably wouldn't have opened a ticket with Mindtree/Microsoft in the first place. You can see in the logs I posted that there is a thirty-minute interval where no custom work is being done (no interactions with our custom API via the web connector, and nothing meaningful in the mashup log).


PBI customer support has so far shared very little explanation of the logs. They only say that something failed during the preparation for my refresh. They haven't shared the timestamps, the number of retries, or even an explanation of the resource that is blocking. I suspect it is a "shared" resource that is not specific to a single customer, and is probably not a "reasonable" or expected point of failure (considering the expense we incur for our own gateway and our own dedicated capacity). I suspect the underlying bottleneck is a problem that Microsoft is responsible for, which would explain why we aren't getting a high level of cooperation and transparency.


I hope you will agree that users/customers have a right to be concerned if our refreshes are delayed by 30 minutes for no particular reason that we can troubleshoot ourselves.  Whatever the bottleneck may be, it should not be hidden from customers...

Hi @dbeavon3 ,

 

What is your source database? Have you tried to restart your gateway app and tried to refresh the dataset/dataflow?

 

Thanks,

Sai Teja 

Hi @Anonymous @SaiTejaTalasila 

 

I appreciate your patience.

The source is a REST api.  I use the "web connector".

 

I spent another hour revisiting my logs, and I think I finally found a new clue on my end.

There is evidence that activity ID "f2b33c01-5684-4ffb-9aae-a1aa4880757c" was abandoned in an unusual way, which prompted an additional gateway mashup operation to be sent to my gateway after roughly 30 minutes had passed.
 

dbeavon3_0-1722220682239.png

 

 

 

Notice that the activity ID (“f2b33c01-5684-4ffb-9aae-a1aa4880757c”) is shown in the “GatewayInfo” log, and the timestamp shows that an “Async ID” was “moved to Expired”. 

 

The frustrating thing to me is that a second mashup container (new PID 316428) launched with activity ID 8e9a7d04-cfa5-439f-9d51-084d8fbc3e3d. This second mashup was running concurrently with the one that was supposedly "moved to Expired". We would never have expected two mashups to run against the same API! The gateway is too aggressive; it should not allow itself to run two activities against the same source for the same dataflow refresh operation.
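One way to confirm (or keep watching for) this double-execution pattern is to reduce the GatewayInfo entries to per-activity time intervals and check them for overlap. A sketch, assuming you can recover (activity ID, start, end) tuples from the log timestamps:

```python
from datetime import datetime

def concurrent_pairs(intervals):
    """intervals: list of (activity_id, start_dt, end_dt).
    Return pairs of activity IDs whose intervals overlap in time."""
    pairs = []
    ordered = sorted(intervals, key=lambda iv: iv[1])  # sort by start
    for i, (id_a, s_a, e_a) in enumerate(ordered):
        for id_b, s_b, e_b in ordered[i + 1:]:
            if s_b < e_a:      # b started before a ended -> concurrent
                pairs.append((id_a, id_b))
            else:
                break          # later starts can't overlap with a either
    return pairs

# Hypothetical intervals mirroring the two containers described above:
acts = [
    ("f2b33c01", datetime(2024, 7, 29, 4, 0), datetime(2024, 7, 29, 4, 31)),
    ("8e9a7d04", datetime(2024, 7, 29, 4, 29), datetime(2024, 7, 29, 4, 45)),
]
print(concurrent_pairs(acts))  # the two activity IDs overlap
```

Any non-empty result for a single dataflow refresh would document exactly the double-run behavior being complained about.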

 

Another thing that bothers me is that the top-level orchestration of this refresh is not actually visible to me. The orchestration of this dataflow refresh happens on the Microsoft PBI servers. All I get in my logs are the low-level activities that are remotely requested. If the PBI servers in the cloud are triggering multiple activities on my gateway (e.g. retries) for the same dataflow, then I need better visibility into that fact. The PBI portal does NOT seem to tell me when a dataflow refresh succeeds only as a result of retrying gateway activities over and over.
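For a partial service-side view, the Power BI REST API does expose a per-dataflow transaction history ("Dataflows - Get Dataflow Transactions"), which at least records when each refresh actually started and ended. The endpoint path and field names below are my best recollection of the public docs and should be verified before relying on them; AAD token acquisition is omitted:

```python
from datetime import datetime, time, timedelta

def transactions_url(group_id, dataflow_id):
    # Assumed endpoint shape for "Get Dataflow Transactions"; verify
    # against the current Power BI REST API reference.
    return (f"https://api.powerbi.com/v1.0/myorg/groups/{group_id}"
            f"/dataflows/{dataflow_id}/transactions")

def late_starts(transactions, scheduled_hhmm, tolerance_minutes=5):
    """Flag transactions whose startTime (assumed UTC ISO-8601 field)
    began later than the scheduled time of day by more than tolerance."""
    sched = time.fromisoformat(scheduled_hhmm)
    late = []
    for t in transactions:
        started = datetime.fromisoformat(t["startTime"].rstrip("Z"))
        planned = datetime.combine(started.date(), sched)
        if started - planned > timedelta(minutes=tolerance_minutes):
            late.append((t.get("id"), started - planned))
    return late

# Hypothetical response payload: one on-time start, one 29 minutes late.
txs = [
    {"id": "t1", "startTime": "2024-07-16T04:29:00Z"},
    {"id": "t2", "startTime": "2024-07-17T04:01:00Z"},
]
print(late_starts(txs, "04:00"))
```

Polling this after each scheduled window would at least give a customer-side record of how late each refresh really started, even if the retry count itself stays hidden.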

 

 

SaiTejaTalasila
Super User

Hi @dbeavon3 ,

 

Recreate the same dataflow in another workspace and see whether the same issue occurs. Since you don't have much activity going on in your capacity, I don't think resource availability is the problem, and I don't think the resource manager would take that much time to allocate memory.

 

 

Thanks,

Sai Teja 

SaiTejaTalasila
Super User

Hi @dbeavon3 ,

 

You can check fabric capacity metrics for more details.

 

 

Thanks,

Sai Teja 

@SaiTejaTalasila 

Nothing is happening on the capacity.  It is new and is totally idle.  The delays seem to be entirely because of whatever is happening in the gateway.

I've opened a support ticket with no luck so far.  The team at Mindtree/Microsoft seems to think it is fine if refresh operations run an hour behind schedule with no explanation.  See below:

dbeavon3_0-1721269159777.png

 
