The docs say that Fabric runtime 1.2 supports "Spark Connect" and that there is a Python client. See below.
https://learn.microsoft.com/en-us/fabric/data-engineering/runtime-1-2
... see section:
Is this a documentation bug? Is there an actual integration with "Spark Connect"? Nothing comes up in my Google results, so I'm not getting my hopes up yet.
We have some apps that would benefit from using a Spark server. It would be great to initiate our Spark workloads to run on Fabric from our on-premises applications and services. Most of our on-premises applications are .NET-based, but we might be able to use Python as a bridge to initiate the Spark workloads, if that type of client is available.
I would love to see a tutorial or example, but haven't found any yet. The docs above may not actually apply to Fabric. Maybe someone copy/pasted from the OSS docs for Apache Spark.
If this were available, it seems like it would be incompatible with the approach that Microsoft is using to bill for our Spark workloads in Fabric. The billing seems to be tied to client notebooks hosted within the Fabric environment, whereas Spark Connect is a client/server technology whose clients might be running remotely.
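For reference, this is roughly what the OSS Spark Connect Python client looks like, per the Apache Spark docs. It is only a sketch of the upstream API: the endpoint URL is hypothetical, and nothing here is confirmed to work against Fabric.

```python
# Sketch of the OSS Spark Connect Python client (pyspark >= 3.4), per the Apache docs.
# NOT confirmed against Fabric; "example-spark-host" is a hypothetical endpoint.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .remote("sc://example-spark-host:15002")  # 15002 is the default Spark Connect port
    .getOrCreate()
)

df = spark.range(10)
print(df.count())  # the plan runs on the remote Spark server; only results come back to the client
```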
@dbeavon3, sorry for the delayed response, and thanks for your patience. I realized that I misunderstood your original question about Spark Connect and shared a video that focused more on Spark runtime updates rather than addressing the specific feature you were asking about. I apologize for any confusion this may have caused.
You're absolutely right that Spark Connect is a significant feature introduced in Apache Spark 3.4, allowing remote clients to interact with Spark clusters. However, in the available Microsoft Fabric documentation and release notes (including the Runtime 1.3 update), there hasn't been any explicit mention of Spark Connect being officially supported in Fabric yet.
The video and docs focus primarily on performance optimizations and runtime upgrades, but Spark Connect seems to be more of an upstream Apache Spark development at this stage, without clear Fabric implementation details. I'll keep a close eye on any future announcements or documentation updates from Microsoft that might confirm Spark Connect support. Please share your feedback and ideas on Fabric here.
In the meantime, you might explore alternative connectors and APIs that are currently available within Fabric, such as the SQL analytics endpoint, for remote access to Spark datasets.
Thanks,
Prashanth Are
MS Fabric community support.
Did we answer your question? Mark this post as a solution; it will help others!
Your feedback is valuable to us; don't forget to drop a "Kudos"!
Hi @v-prasare
Thanks for the reply.
Assuming I explored remote connectivity to our Spark environment (pools) via "Spark Connect", how likely is it that I would find a path forward? It sounds like you have very little hope that this would work. Is that right?
It is too bad that they would include an announcement about "Spark Connect" in a Fabric offering if it is not actually available or supported.
I saw that there is a way to download and run Python notebooks on the desktop in VS Code. Is it possible that this is using "Spark Connect" under the hood? Do you know how that is implemented? Does it support advanced features like UDF declarations in the user's local VS Code notebooks?
The main reason why I don't have that much hope for a path forward with "Spark Connect" is the way Microsoft is monetizing Spark in Fabric. There is not a tangible cluster that customers can interact with, outside of the context of a notebook. Nor is there any console for monitoring/managing the so-called "pool" that is created behind the scenes of our notebooks. I believe the only way "Spark Connect" would be viable on Fabric is if the Spark cluster had a life of its own: the cluster would need to exist independently of any Python notebooks running within Fabric. However, that would probably conflict with the monetization and with the way the Spark integration has been introduced into this product. I.e., there appears to be no mechanism for Microsoft to bill us for a Spark cluster independently of the notebooks that are using it.
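Regarding the UDF question above: in OSS Spark, a Spark Connect client can define a Python UDF locally and have it shipped to the server for execution. The sketch below shows only that upstream pattern; whether Fabric's VS Code notebook experience works this way is unknown, and the endpoint is hypothetical.

```python
# Client-side Python UDF over a Spark Connect session, per the OSS Spark docs.
# Hypothetical endpoint; not confirmed for Fabric's VS Code notebooks.
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.remote("sc://example-spark-host:15002").getOrCreate()

@udf(returnType=IntegerType())
def plus_one(x):
    # Defined on the client; Spark Connect serializes it and runs it on the server's executors.
    return x + 1

spark.range(5).select(plus_one("id").alias("id_plus_one")).show()
```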
Hi @dbeavon3,
Based on my understanding, since Spark Connect requires remote connectivity, it needs a hostname, which would be the IP address of the Spark context. And since there is no authentication mechanism involved with Spark Connect unless you manually set up a redirection URL mechanism (an authentication proxy), I don't believe Fabric will allow that level of configuration in their cloud system.
Using managed virtual networks with Fabric, you might get the URL of the Spark context and use it, but again, this is just my assumption. As you said, there is no documentation, so it is difficult to validate unless we do a PoC.
The following seems true after reading the descriptions on the MS site and the Spark site:
>> Maybe someone copy/pasted from the OSS docs for Apache Spark.
Hi thanks for the update. I was hoping to play with Spark Connect on Fabric, but I may have to revert to Databricks.
I actually use HDI in Azure more than anything else, but it is stuck on an earlier version of Spark. Hoping that changes in the near future. I love HDI, and wish Microsoft would give it a bit more TLC!
If I had time, I would work on a PoC. But even if it worked, it would probably be fragile and would not be future-proof. The problem with Fabric Spark is that the cluster is not really a first-class member of the workspace. It is only created behind the scenes for the purpose of notebook executions ("clients"). All the monetization is handled per notebook, so any of the "server"-oriented features of Spark are probably tucked away under the hood, and there is no surface area for customers to interact with them. It is unlikely that "Spark Connect" will be supported until Microsoft finds a way to bill us for it and make even more money than they earn from notebooks. There is a tipping point where, I'm guessing, they would lose money if customers were able to manage their own cluster and successfully optimize the number of notebooks that run on a given day.
Hi @dbeavon3,
Spark Connect is good when you are running your own Spark cluster on a local network. For example, an organization can run a Spark cluster at 192.168.1.6, and an application that wants to use Spark Connect can sit on the same network and reach it via that local IP address.
But in a proprietary cloud, implementing Spark Connect would be much more difficult because of security and data privacy concerns. The monetization is based on CUs used, so it is still tied to Spark cluster usage rather than to the notebook.
Hi thanks for the post.
>> ... so it is still tied to Spark cluster usage rather than to the notebook.
I don't think this is the case. If cost were tied to the cluster, then there would be a cluster-monitoring UI. But in Fabric there is NOT any independent visibility into the cluster, from what I can see. I think the monetization (meter) in Fabric is measured by the notebook. I.e., the cost in CUs will scale up based on notebook-hours. From what I can tell, it does NOT scale up and down based on vcore-hours (i.e., relative to the size of the cluster and the time the cluster is running).
The notebook-hours may actually be charged as notebook-compute-hours, but the idea is the same. They don't let you cost-optimize at the cluster level, since they want to charge you in relation to your notebooks.
At the end of the day, the Spark clusters will cost a lot more than in a simpler Spark service like HDInsight or the Databricks standard tier.
Hi @dbeavon3,
Well, the reason I said it is the cluster is that you can have a high-concurrency cluster to which you can attach multiple notebooks and run them in parallel. The reason clusters don't have separate cost monitoring is that everything is included in a single bill, which is the Fabric capacity used per second.
You can think of this as a game card that you buy in a mall: it comes with preloaded points that you can spend on various games. Some games cost more than others, and it is up to you how you spend those preloaded points.
In the same way, an F2 capacity has 7,200 CU-seconds per hour. So, let's say you spin up a standard cluster and assume it costs about 5,000 CU-seconds every hour. You can attach only one notebook at a time to it. Now let's say you spin up a high-concurrency cluster of the same size (so the same 5,000 CU-seconds every hour). You can attach multiple notebooks to that cluster and run them, though there might be a performance difference depending on the compute requirements of each notebook. If multiple notebooks run with the same performance as on the standard cluster (which would mean the standard cluster was underutilized), then there is a cost saving. So it would be cluster usage rather than the notebook level.
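A quick back-of-the-envelope check of that scenario, using only the numbers assumed above (the 5,000 CU-seconds/hour cluster figure is an assumption from this post, not a documented Fabric rate):

```python
# Back-of-the-envelope check of the game-card scenario above.
# The 5,000 CU-s/hour cluster figure is an assumption, not a documented rate.
f2_budget_per_hour = 2 * 3600          # F2 = 2 CUs -> 7,200 CU-seconds per hour
assumed_cluster_cu_s_per_hour = 5000   # assumed hourly cost of one cluster

concurrent_notebooks = 3
standard_cost = concurrent_notebooks * assumed_cluster_cu_s_per_hour   # one standard cluster per notebook
high_concurrency_cost = assumed_cluster_cu_s_per_hour                  # notebooks share one cluster

print(f2_budget_per_hour, standard_cost, high_concurrency_cost)        # 7200, 15000, 5000
```

Under those assumptions, three concurrent notebooks on standard clusters would exceed the F2 hourly budget, while sharing one high-concurrency cluster stays within it, which is the cost saving being described.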
You said:
>> So, let's say you spin up a standard cluster and assume it costs about 5,000 CU-seconds every hour.
I think there are some bad assumptions here. Do you have any supporting links to say that a given-sized "cluster" has a fixed CU cost per hour? I have not found any, because I don't think that exists in Fabric. (It would be true on all the other Spark platforms, but it is not true in Fabric.)
I suspect the information you have is wrong, and/or it is subject to change. It is probably something that was told to you verbally by a Microsoft salesperson...
First of all, the word "cluster" in Fabric is replaced with "pool", which is presented to users as metadata rather than a physical entity. That subtle change in terminology creates ambiguity, and the ambiguity works in Microsoft's favor. Secondly, there is no management console for a "cluster" in Fabric. If CU costs could truly be optimized in the way you described, then having a management console would be a very high priority. There is such a console in the Synapse, Databricks, and HDI platforms. But in Fabric there is NOT likely to be one soon, since it is NOT directly relevant to Fabric cost management, and since Microsoft wants their platform to be "easy" and doesn't want users to concern themselves with these superfluous implementation details.
Thirdly, the CU meters for Spark notebooks are NEVER presented in the terms you described ("CU-seconds per hour per cluster"). As I said, the accounting that decrements CUs is based on notebook-hours, or notebook-compute-hours. You can visualize this in their "Capacity Metrics App" (such as it is, given that it seems to be the only administration tool available for Spark in Fabric). See below.
We can probably agree that optimization is accomplished by reducing the length of time that notebooks run, and by using fewer/smaller executors for the Synapse notebooks. Doing these things will allow us to see the positive changes in the Capacity Metrics App.
Where we don't agree is when you say that there is a fixed cost at the "cluster" level, and that customers can optimize their workloads at that level. There is no way for us to get operating leverage by way of a fixed cluster cost, because Fabric billing does not happen that way. The billing happens exclusively via variable costs per notebook-hour.
Hi @dbeavon3,
I could be wrong on this, since there is no definitive documentation on how the CUs are actually calculated for Spark! This is my own logical interpretation based on my experience with Spark.
The logic of Spark session time versus notebook/job running time is the same in non-interactive mode, since the session is terminated soon after the task finishes.
But in interactive mode, how do you think it works in terms of CU calculation? If I run my first command, then after it finishes the notebook is no longer in a running state; only the session is active. So, in that case, is there any CU consumption? If not, then it means MS is running compute without CU consumption. (Coming from a Databricks environment, as soon as you switch on a compute, you are charged whether you attach a notebook or not 🙂.) I am just wondering how MS would run compute without CU consumption, because the session needs to be active for the notebook to maintain state.
A Microsoft person would have a better idea about this; due to the lack of proper documentation, speculation is all I can do at the moment!
>> in interactive mode, how do you think it works in terms of CU calculation? If I run my first command, then after it finishes the notebook is no longer in a running state; only the session is active
If the session is active and connected to the cluster, then I am 100% certain it would keep accumulating CUs. Ideally the cluster would scale down (via autoscale) to save Microsoft some money. And ideally the dynamically allocated executors would die off as well, to save the customer a bit of money in their notebooks.
... in short, the cluster (custom pool) and VMs are the resources that Microsoft has to keep running at their own expense. That cost is somewhat fixed. But the CU meter is accumulated via notebook compute, which is a highly variable cost. Microsoft probably needs to charge a significant premium on this variable cost to ensure that it always covers their own fixed expenses. That is how I understand it.
The notebook will become idle after a period of time, and both the cluster and the executors will die. That will stop the billing. And it will stop the expense to Microsoft in regard to their cluster (custom pool).
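For what it's worth, the executor "die off" behavior described above is governed in OSS Spark by dynamic-allocation settings like the ones sketched below. Whether (and where) Fabric lets you tune these is an assumption on my part, not something documented here.

```python
# OSS Spark dynamic-allocation settings that govern executor scale-down.
# Exposing/tuning these in a Fabric session is an assumption, not a documented guarantee.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "1")
    .config("spark.dynamicAllocation.maxExecutors", "8")
    .config("spark.dynamicAllocation.executorIdleTimeout", "120s")        # release executors idle for 2 minutes
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")    # needed in OSS when no external shuffle service exists
    .getOrCreate()
)
```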
Hi @dbeavon3,
I did this experiment:
I created a new F64 capacity (so that there is no noise in the Capacity Metrics app). I created a new notebook, started a standard session, and set the session timeout period to 45 minutes. But I did not run anything in the notebook, as you can see below.
In the metrics app below, you can see the consumption, which is around 24K CU(s), and the duration, which is 2758 seconds (~46 minutes). So even if we don't run the notebook, there is CU consumption because the Spark session is running. (I don't think MS owns this expense, as I can see the CU(s) consumed in the Capacity Metrics app.)
And I agree with your point that running a notebook would be a variable cost (because there is autoscale and there are dynamically allocated executors), but there is a fixed cost when running the Spark session, and based on my understanding, CU(s) are consumed and MS doesn't own it. Unlike Databricks, where you have the option to start the cluster directly, here in Fabric the only way to turn it on is to start a notebook or run a notebook/Spark job. That way, you wouldn't be inadvertently starting a session and accumulating CU(s).
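A quick sanity check on those figures (both inputs are the numbers reported above; the implied rate is just arithmetic, not a documented billing formula):

```python
# Sanity check using the figures reported in the experiment above.
cu_seconds_reported = 24_000   # ~24K CU(s) shown in the Capacity Metrics app
duration_seconds = 2758        # ~46-minute idle session

implied_rate = cu_seconds_reported / duration_seconds
print(f"Implied steady draw: {implied_rate:.1f} CU for the life of the idle session")
# -> roughly 8.7 CU sustained, even though no cell was ever executed
```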
@govindarajan_d
>> ... fixed cost when running the Spark session, and ...
When I refer to a "fixed cost", I'm basically referring to the cost of the underlying cluster. We see this on all the other Spark products, but not in Fabric.
I.e., in the context of HDInsight I may have a cluster that autoscales from 8 to 10 nodes. In such a cluster, my fixed cost is the 8 minimum nodes. No matter what happens, I'm always paying for those 8 minimum nodes (let's say they are ~$10/hr or so, out the door, including ZooKeeper and head nodes), or about $240 for a day.
On this HDInsight cluster with a fixed cost, I can execute 1000 notebooks in a day that each use 8 executors, or I can execute 1 notebook for the day that uses 2 executors. Either way, I always pay the fixed $10/hr. The cost on my Azure bill will NOT vary depending on the number of notebooks I choose to run.
But Fabric is extremely different in that Microsoft wants to withhold the responsibility for managing the cluster; they want to hide it from the customer entirely. They wish to charge us for notebooks instead. I never get to see my cluster cost, because that is not my business anymore. Instead of the cluster itself, Microsoft wants their Fabric customers to focus only on optimizing their notebooks. Our costs are proportional (variable in a linear way) and increase based on the number of notebooks we run. Going back to the original example, if we use 1000x8 notebook-hours in Fabric, then our CU usage will be about 4000 times more than with 1x2 notebook-hours. These CU costs have virtually nothing to do with the underlying cluster anymore. We can see that Microsoft has created many layers of abstraction, so the customer no longer needs to be aware that VMs are being used, or that there is a physical Spark/YARN cluster. The only thing they want us to focus on is paying the variable charges that are proportional to the number of notebook-hours. And those variable charges to the customer will be very high, to ensure that Microsoft covers all of the fixed costs they incur behind the scenes.
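To make that contrast concrete with the numbers already used in this thread (the ~$10/hr HDInsight figure and the notebook counts are illustrative assumptions, not quotes from any price list):

```python
# Contrast of the two cost models, using the illustrative numbers from this thread.

# Fixed-cluster model (HDInsight-style): cost depends only on cluster size and uptime.
hdi_cost_per_hour = 10                      # assumed ~$10/hr for the 8-node minimum
hdi_cost_per_day = hdi_cost_per_hour * 24   # ~$240/day, whether you run 1 notebook or 1000

# Per-notebook model (Fabric-style, as argued above): cost scales with notebook-compute-hours.
heavy_usage = 1000 * 8    # 1000 notebooks x 8 executors
light_usage = 1 * 2       # 1 notebook x 2 executors

print(hdi_cost_per_day)             # 240
print(heavy_usage / light_usage)    # 4000.0 -> the "about 4000 times" ratio above
```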
>> even if we don't run the notebook, there is CU consumption because the Spark session is running
Right ... for the sake of discussing CU costs, you should assume the notebook is "running" whenever the session is connected. It is irrelevant to Microsoft that you don't happen to be evaluating a cell. E.g., you could run a cell for 45 minutes that says "time.sleep()" and, assuming the session is connected in either case, it will cost you exactly the same amount as if you were NOT running that sleep cell. Microsoft is charging you for the active executors and driver in the notebook. I'm guessing you could extend your testing to use 1x, 2x, or 3x the executors, and that would increase your costs proportionally for the same 45-minute duration.
Hi @dbeavon3,
Thanks for the explanation. I hope we get more detailed documentation on the Spark CU consumption model!
I will try a few more experiments on my side to get a better understanding of what you said!
We see that in our tenant: "finished" notebooks keep consuming (lots of) CUs until we forcibly close the session.
Hi @lbendlin
I'm likely to open the exact same support ticket if this happens on our side. I don't suppose you could share your ticket/SR number (or maybe some tips so I can point my engineer to the related ICM)?
We should probably audit the start and stop times of our notebooks, along with the number of executors used. That should allow us to account for our own notebook-hours, and get refunds when the CUs are calculated improperly. It should be easy in Python notebooks to retrieve the start and stop events and log them to a simple Delta table, or something like that.
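Here is a minimal sketch of that kind of audit logging, assuming the notebook can write to a Lakehouse table (the table name "notebook_audit" and reading the executor count from the session config are illustrative assumptions):

```python
# Minimal sketch: record session start/stop plus executor count in a Delta table.
# The table name and the executor-count lookup are assumptions for illustration.
from datetime import datetime, timezone
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def log_event(event: str) -> None:
    row = [(
        datetime.now(timezone.utc).isoformat(),
        event,
        spark.sparkContext.applicationId,
        int(spark.conf.get("spark.executor.instances", "0")),
    )]
    df = spark.createDataFrame(row, "ts string, event string, app_id string, executors int")
    df.write.format("delta").mode("append").saveAsTable("notebook_audit")

log_event("notebook_start")
# ... workload cells run here ...
log_event("notebook_stop")
```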
> until we forcibly close the session.
Yuck. Managing at the session/notebook level is a pain because there can be hundreds or thousands a day. This is the exact reason why customers need a management console for watching the cluster. Assuming the cluster starts and stops when we expect it to, the notebooks will take care of themselves (they cannot run without a cluster). We shouldn't have to micro-manage them at that level of granularity. Yet I don't really trust Microsoft to micro-manage the notebook-hours as closely as required. They don't care whether a session is closed, and they will typically point the finger back at the customer and blame the problem on some Python library we used, or some "time.sleep()" statement, or whatever.
Hi @dbeavon3,
Thanks for actively participating in MS Fabric Community Support.
Please refer to the video below on the Spark runtime and the latest updates in Fabric:
Apache Spark Runtimes in Microsoft Fabric – Runtime 1.3 based on Apache Spark 3.5 - YouTube
Let me know if this helps. Looking forward to your feedback!
Thanks,
Prashanth Are
MS Fabric community support.
Did we answer your question? Mark this post as a solution; it will help others!
If our response(s) assisted you in any way, don't forget to drop a "Kudos"!
Hi @v-prasare
Are you familiar with the "Spark Connect" functionality in Apache Spark?
This is a question about a specific feature of a Spark environment.
The Fabric announcement that I shared seems to indicate that this is now a capability of Fabric, but I have seen no other evidence of that yet. Have you ever tried to use Spark Connect within a Fabric environment?
Here are the docs, on the OSS/Apache side of things:
https://spark.apache.org/docs/3.4.0/spark-connect-overview.html#spark-connect-overview
Any tips would be appreciated. I checked the video, and it doesn't appear to mention any information relevant to "Spark Connect".