dbeavon3
Memorable Member

Where to Monitor Spark Pools?

We have some Spark workloads going to production in January.

 

However, I haven't found a place where we can monitor our custom Spark pool.  I don't think there is visibility yet in this "Fabric" flavor of Spark.  I'm mostly worried about the inability to manage or investigate when the pool is overwhelmed, or when costs are elevated.

 

Our Fabric capacity (F64) runs in the North Central US region, and the so-called "capacity metrics app" has not worked for two months.  (This app is very buggy where "timepoint" information is concerned; the related timepoint screen always appears to be empty.)  We are a whole month past the original ETA the support engineers first promised for a fix.  I don't have an updated ETA at the moment, and the Mindtree team wasn't able to persuade their PG to track this outage on the "known issues" list.

 

Even if that app actually worked, it is not tailored to help anyone monitor a Spark pool.  The Synapse platform's monitoring blade was not great either, but at least it offered a minimal U/I where we could try to understand how much activity was taking place in the Spark pools.  See below.

 

[Screenshot: the Synapse monitoring blade for Spark pools]

 

Can anyone please let me know how to visualize the utilization of our Spark pool?  It is a dedicated pool, and it uses special features like managed private endpoints (MPE) to reach resources in our Azure VNet.  It is not a "starter" pool.  At certain hours of the day, I would expect about ten notebooks to be sharing the compute nodes in this custom pool.  In order to better understand the workloads going through our Spark pool, we really need a management tool of some kind.  The Fabric capacity metrics app is not sufficient (assuming they ever get it back up and running again).  Does anyone have a tip?

 

Perhaps I'm overlooking a REST API that could be used to investigate the state of our pool throughout the day?
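
For instance, I would love to be able to run something like the following polling sketch.  The livySessions endpoint and the response fields here are my own assumptions pieced together from the REST docs; I have not verified this end-to-end:

import requests

WORKSPACE_ID = "<workspace-guid>"   # placeholder
TOKEN = "<entra-bearer-token>"      # placeholder, e.g. from az account get-access-token

# Hypothetical: list the Livy sessions in the workspace for a point-in-time
# view of what is running on the pool.
url = f"https://api.fabric.microsoft.com/v1/workspaces/{WORKSPACE_ID}/spark/livySessions"
resp = requests.get(url, headers={"Authorization": f"Bearer {TOKEN}"})
resp.raise_for_status()

for s in resp.json().get("value", []):
    # Field names are guesses based on typical Livy payloads.
    print(s.get("sparkApplicationId"), s.get("state"), s.get("submittedDateTime"))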

 

Other notes:

The only management screens I've found are specific to a single job or notebook.  The docs mention the capacity app, but I have not been able to investigate its behavior for the past two months because of the ongoing outage in the North Central US region.  Here are the links I've found in Microsoft's documentation, but they don't seem to tell me where to manage my Spark pool:

 

https://learn.microsoft.com/en-us/fabric/data-engineering/billing-capacity-management-for-spark

https://learn.microsoft.com/en-us/fabric/data-engineering/monitor-spark-capacity-consumption

 

Any help would be appreciated.  Thanks in advance.

 

frithjof_v
Super User

To add insult to injury - Azure Log Analytics costs extra. A LOT extra.

frithjof_v
Super User

My two cents:

 

There's no such thing as monitoring a Spark pool in Fabric, because a Spark pool in Fabric is just a template for instantiating clusters (sessions).  Depending on the size of your pool (template) and the size of your Fabric SKU, a single pool can be used to instantiate multiple clusters.

 

So what we would want to monitor is the utilization of any active Spark clusters (sessions).

 

You can monitor individual sessions through the Monitoring hub:

https://learn.microsoft.com/en-us/fabric/data-engineering/spark-monitoring-overview

I don't have a lot of practical experience with it myself.

 

For the Capacity Metrics App, you could try deleting and re-installing it.

 

If you want multiple notebooks to share the same cluster (session), check out NotebookUtils run or runMultiple (preview), or Data Pipeline high concurrency.
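
For example, a minimal sketch (the notebook names are placeholders, and the exact DAG schema should be checked against the current preview docs):

# Run two child notebooks in parallel inside the current session:
notebookutils.notebook.runMultiple(["LoadCustomers", "LoadOrders"])

# Or declare dependencies so one notebook waits for another:
dag = {
    "activities": [
        {"name": "LoadCustomers", "path": "LoadCustomers"},
        {"name": "BuildReport", "path": "BuildReport", "dependencies": ["LoadCustomers"]},
    ]
}
notebookutils.notebook.runMultiple(dag)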

Hi @frithjof_v 

 

I agree that the Spark "pool" concept in Fabric is intended to act as a template/metadata.  This is different from the "pool" concept in Synapse.


Furthermore, I would agree that Microsoft is monetizing and billing us for our Spark sessions, NOT our Spark pools.  That is a non-technical concern which I won't dig into (again)...

 

But none of this obviates the need to monitor/manage our custom Spark pool.  The pool has an important presence in our workspace, and we must interact with it whether we like it or not.  Here is our custom pool:

 

[Screenshot: our custom Spark pool configuration]

 

 

Let's consider some real-world scenarios.  Suppose we have a bunch of notebooks running on the cluster (starting and completing) for a period of time.  Then, all of a sudden, new notebooks will not launch and start piling up (queuing).  There are a variety of underlying reasons for this that are directly caused by the underlying Spark pool.  For example, the pool may be in the process of autoscaling.  Or the pool may have unexpectedly reached its max capacity because some notebooks grabbed an excessive number of dynamically allocated executors.  Those are the simplest scenarios for why notebooks won't launch on a pool.  Other scenarios involve PG bugs, configuration issues, operating system issues, "transient network failures", tenant throttling issues, Entra ID errors, and so on.
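
For what it's worth, the only lever I know of for the executor-grabbing scenario is to cap dynamic allocation per notebook in the first cell.  This is just a sketch, assuming the %%configure magic honors the standard Spark settings; I have not confirmed it prevents the queuing:

%%configure -f
{
    "conf": {
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.minExecutors": "1",
        "spark.dynamicAllocation.maxExecutors": "4"
    }
}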

 

 

 

A wide variety of issues can be encountered in cloud-hosted SaaS and PaaS.  Yet it is not possible to troubleshoot them, since Microsoft gives us no visibility into the underlying pool itself.  It would seem that the PG expects us to open support tickets with Mindtree every time our PySpark notebook environment becomes frozen.  Since I already have too many of those tickets, and since those tickets generally take a week of effort (or more), I'm not very eager to continue down that path.  After that week has passed, I would be dealing with LOTS of severe problems instead of the one simple problem I started with.

I'm guessing that the lack of a monitoring console is either an oversight, or still on the roadmap, or Microsoft expects their Mindtree vendor to play the role of a monitoring tool.  None of these seem acceptable to me.  Spark is already a challenging technology for developers to manage, even without these "blinders" we must wear on the Fabric platform.

 

I think Microsoft does mean well in trying to make an "easy" platform in Fabric.  But I think they have taken things way too far in that regard.  (Spark notebooks should not have gone to GA without a monitoring tool.  It violates the basic principle: "make things as simple as possible, but no simpler.")

 

 

I think the Capacity (F SKU) is the limiting factor. The Capacity (F SKU) decides how many Spark VCores you're allowed to use at one point in time. https://learn.microsoft.com/en-us/fabric/data-engineering/spark-job-concurrency-and-queueing

 

The pool is only a template for creating clusters, and thus decides the min and max limits for how many VCores a cluster created from that pool (template) can use.

 

The Capacity limits decide how many clusters you can use concurrently, also depending on how many VCores each cluster uses.

 

I'm not sure if there is a way to monitor how many VCores you're using in total at a single point in time; I only know how to monitor individual clusters (sessions). You would need to add up the VCores used by each cluster (session) to determine whether the Capacity's limits are close to being reached.
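
As a back-of-envelope example (the session numbers are made up; a Medium node is 8 VCores, and the F64 base limit is 128 Spark VCores per the link above):

# Nodes currently held by each active session (hypothetical values):
sessions = {"notebook_a": 5, "notebook_b": 3, "notebook_c": 2}
VCORES_PER_NODE = 8   # Medium node size

used = sum(nodes * VCORES_PER_NODE for nodes in sessions.values())
print(f"{used} of 128 Spark VCores in use")   # -> 80 of 128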

 

Another limit is the CU % limit (also a capacity-level limit). That one is not only about Spark, but the combined load from all Fabric workloads. The CU % utilization is found in the Capacity Metrics App.

Hi @frithjof_v 

Thanks for the link.  That was a missing piece of the puzzle, but there are other pieces missing as well.

Assuming a physical Spark cluster is created under the covers, there is overhead to launching one.  Spark clusters may have head nodes, ZooKeeper nodes, and worker nodes.  These are all implemented as VMs and have boot times.  Generally only the worker nodes appear in the Spark History U/I, so those are the ones most people focus on.

 

In addition to these VMs, I think there is also a hidden layer of Hadoop software (YARN) running on the nodes for the sake of resource management.  That adds overhead when Spark sessions start and finish (YARN containers are launched for executors, and what-not).


I just created another post ("Queued time is elevated - can this trigger a notebook failure").


That post represents a real-world example of where it would be helpful to have a monitoring U/I.  There is a Microsoft bug, and if I want to independently find a workaround, I need a bit of visibility.

 

 

I don't think I'm running into the capacity limits of the F SKUs.  Even if my pool were at its max of 5 nodes, it would only be using 8 × 5 = 40 VCores in the workers, which is not a lot.  I wasn't previously aware of the link you shared, but I think the capacity limits in that documentation were set in place so I wouldn't be able to scale a cluster up to 50 nodes (causing problems for Microsoft, while not necessarily appearing in my synapse-notebook CU meter).  I think the arbitrary capacity limits (e.g. 128 Spark VCores for F64) were put in place by Microsoft for self-preservation, and are only necessary because CU billing is metered at the notebook level rather than at the cluster level.  Otherwise the CU charges themselves would prevent customers from autoscaling to 50 nodes.

 


Not to muddy this discussion, but in that other post, I believe the bug in Livy is due to basic lifecycle events (a sudden autoscaling of the physical cluster from 2 workers to 5 workers or so).  I believe those lifecycle events have buggy side effects if they take longer than ten minutes.  It is just a theory, and it is very hard to prove without some sort of Spark monitoring U/I.

 

It is theoretically possible for customers to build their own U/I by extracting the VM lists of all these executors from all these notebooks, and collating by VM name, to see how long each VM lived within the lifecycle of the cluster as a whole.  It would be a lot of work, and frustrating to do when Microsoft should be responsible for working on their own bugs.
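
Just to illustrate, here is a rough sketch of that collation, assuming the standard open-source Spark monitoring REST API (/api/v1) were reachable for each application; in Fabric the history server sits behind the portal, so the base URL and auth below are placeholders:

import requests
from collections import defaultdict

BASE = "<spark-history-server>/api/v1"            # placeholder
app_ids = ["application_001", "application_002"]  # collected from each notebook run

host_lifetimes = defaultdict(list)
for app_id in app_ids:
    # /allexecutors is a standard Spark REST endpoint; it reports each
    # executor's hostPort, addTime and removeTime.
    execs = requests.get(f"{BASE}/applications/{app_id}/allexecutors").json()
    for e in execs:
        host = e["hostPort"].split(":")[0]
        host_lifetimes[host].append((app_id, e.get("addTime"), e.get("removeTime")))

# How long-lived was each VM across the cluster's lifecycle?
for host, spans in sorted(host_lifetimes.items()):
    print(host, spans)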

cailen_cg
Frequent Visitor

I'm also struggling with this. 

V-yubandi-msft
Community Support

Hi @dbeavon3 ,

Thank you for reaching out to Microsoft Fabric Support about the monitoring issues you're facing with your custom Spark pool in Microsoft Fabric. We understand the trouble you're having, particularly with the capacity metrics app and not being able to see Spark pool utilization and performance.

The documents you shared mainly cover monitoring Apache Spark capacity consumption, billing, and utilization reporting. They don't specifically address monitoring Spark pools.

 

I've attached the official Microsoft documents that directly tackle your concern. These documents explain where and how to monitor your custom Apache Spark pools, and how to view specific details about each Spark pool. Please look at the attached link for clear and detailed guidance.

Link: How to monitor Apache Spark pools in Synapse Studio - Azure Synapse Analytics | Microsoft Learn.

 

Considering your Microsoft Fabric (F64) capacity, whether you need an upgrade depends on your current workload demands and limitations. It might be a good idea to think about upgrading to a higher SKU (like F64 to F64v2 or another high-performance option).

  • Since multiple notebooks share your pool during peak hours, it may be beneficial to increase capacity or optimize resource allocation to ensure smoother performance. Adjusting the auto-scaling settings can also help, if applicable.
  • For a more detailed view of the Spark jobs and tasks in your pool, enable Spark job monitoring through the Spark UI or check Spark logs. This will give you a clearer picture of how the workload is distributed.

        Link: Use the Monitoring hub to manage Apache Spark applications - Microsoft Fabric | Microsoft Learn.

  • You can monitor your Spark pool using custom REST APIs. These APIs provide real-time data on your pool's performance, including resource usage, job execution times, and other metrics. You can integrate this data into a monitoring dashboard for better insights into when the pool might be overwhelmed or when resource costs are high.

Link: Collect Apache Spark applications metrics using APIs - Azure Synapse Analytics | Microsoft Learn.

  • You may want to look into using Azure Monitor for setting up custom metrics and alerts for your Spark pool. This can help you keep track of the pool's health and resource usage over time. Even though there are some ongoing issues with the Fabric capacity metrics app, Azure Monitor is a solid alternative for collecting resource usage data and creating alerts.

Link: Monitor Apache Spark applications with Azure Log Analytics - Azure Synapse Analytics | Microsoft Lea...

 

If my answer addressed your query, kindly mark it as the Accepted Solution to assist others.

I'd also be grateful for a 'Kudos' if you found my response useful!       

@V-yubandi-msft 
Thanks for the reply

>>These documents explain where and how to monitor your custom Apache Spark pools (in synapse)

Yes, I'm already aware of how the Synapse PaaS works, but this is a question that focuses on Fabric.

It is the new Fabric that doesn't seem to be working as expected, or giving visibility to monitor pools.

>> It might be a good idea to think about upgrading to a higher SKU (like F64 to F64v2 or another high-performance option).

We use F64.  There is no such thing as "F64v2". Is that an internal code-word?

 

>> You can monitor your Spark pool using custom REST APIs. These APIs provide real-time data on your pool's performance, including resource usage, job execution times, and other metrics


Can you share a link to the corresponding thing in Fabric?

 

 

Please verify that we are talking about Fabric and not Synapse.  Either that API is not in Fabric, or you are assuming that Fabric and Synapse have the exact same set of features (i.e. that all the docs for Synapse are equally relevant to Fabric).  That second scenario seems pretty unlikely.

 

If there were already some Fabric documentation about monitoring pools in Fabric, I probably would have found it.  I have already seen all of the docs for the Synapse platform, but those do not apply to Fabric (AFAIK).  I would appreciate it if you would only reference information that pertains directly to Fabric, so we don't introduce additional confusion.  This is a topic that other customers will encounter in the future, and it would be best to keep the discussion focused on Fabric, without confusing the Fabric and Synapse platforms.  This discussion may be used by Microsoft partners as well; e.g. I may refer the Mindtree engineers to it when asking follow-up questions about your suggestions.


Hi @dbeavon3 ,

 

Thank you for your feedback, and we apologize for the confusion between Synapse and Fabric earlier.

You are correct that, at this moment, there isn't an equivalent of the Synapse monitoring tools for Spark pools in Microsoft Fabric. My apologies for any misunderstanding. There is also no "F64v2" variant of the F64 (Fabric) SKU; my earlier suggestion was more generic, and we appreciate your patience.

 

A special thanks to @frithjof_v for their valuable contribution. Currently, Fabric lacks a dedicated, unified tool for monitoring Spark pools. We are closely following the progress of Fabric's monitoring capabilities and will keep you updated on any developments that might address your needs. In the meantime, the session-level Monitoring hub remains the available option.

 

Thank you again for your patience. We will continue to watch for any new developments that increase the monitoring options in Microsoft Fabric.
