Usually, spark.stop() is recommended as a best practice for releasing resources like memory, CPU, and network connections back to the cluster, among other reasons. I usually do that in all my notebooks.
I am executing a notebook sequentially inside a ForEach activity. If I don't use spark.stop(), does each iteration keep taking advantage of the same session? And if I do use spark.stop(), is it overkill, since the session is shut down and restarted for every ForEach execution?
// pseudocode: each ForEach iteration executes the same notebook NB1
[{id: 1}, {id: 2}, {id: 3}].forEach((_, i) => execute(NB1))
If I don't use spark.stop() (which is what I am currently doing), is there any way to shut down the session at the completion of the loop?
Solved! Go to Solution.
Yes, it sounds like the best option is not to use ForEach in this case; instead, have a master notebook and execute all the other notebook runs from that master notebook.
Look for mssparkutils.notebook.run(), mssparkutils.notebook.runMultiple(), or thread pooling in the Reddit discussion:
https://www.reddit.com/r/MicrosoftFabric/comments/1eolfda/sparkstop_is_it_needed/.
I noticed mssparkutils.notebook.runMultiple() is a preview feature. I haven't checked the status of the other mentioned features.
Also, I think mssparkutils will be replaced by notebookutils going forward:
NotebookUtils (former MSSparkUtils) for Fabric - Microsoft Fabric | Microsoft Learn
Microsoft Spark Utilities (MSSparkUtils) for Fabric - Microsoft Fabric | Microsoft Learn
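For illustration, here is a minimal sketch of that master-notebook approach, assuming the child notebook is named NB1 (from the pseudocode in the question) and reads a hypothetical id parameter from its parameters cell. The child runs share the master notebook's Spark session, so there is no per-iteration session start/stop:

# Master notebook: replaces the pipeline ForEach activity.
# mssparkutils is available by default in Fabric notebooks (no import needed).
ids = [1, 2, 3]

for i in ids:
    # Runs NB1 in the same Spark session as this master notebook.
    # Arguments: notebook name, timeout in seconds, parameters dict.
    result = mssparkutils.notebook.run("NB1", 600, {"id": i})
    print(f"NB1 finished for id={i}, exit value: {result}")

As I understand it, the values in the parameters dict populate the matching variables in the child notebook's parameters cell.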
I don't have enough knowledge about how sessions work in Fabric to answer this properly. Interesting question, though! I will try to learn more about this.
Just to be clear, I understand your current setup as a Data pipeline ForEach activity that runs the same notebook sequentially, once per element of the array.
I think I need to learn more about topics like concurrency, and whether it is necessary to use spark.stop() in Fabric, or whether Fabric manages stopping a session when a notebook run is finished.
Perhaps this blog post is relevant: https://www.fourmoo.com/2024/01/10/microsoft-fabric-notebook-session-usage-explained-and-how-to-save...
I am guessing you don't need to use spark.stop() in Fabric.
Are you also starting the Spark session by using code? Something like this:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Simple DataFrame Example") \
    .getOrCreate()
I don't think that is necessary in Fabric either. I guess sessions are managed by Fabric. When running a notebook interactively (using the notebook editor interface), I guess it's a good idea to click 'Stop session' when finished. However, when running a notebook in a Data pipeline, I think Fabric manages the session and stops it when it's no longer needed. See the blog post link in my previous comment.
I am also guessing that using spark.stop() inside the notebook could prevent you from taking advantage of high concurrency Spark sessions.
However I'm not sure about any of this, as I don't have enough knowledge or experience with this.
Hoping to get others' insights and thoughts on this 😃
I started a discussion on Reddit to try to learn more about the topic:
spark.stop() - is it needed? : r/MicrosoftFabric (reddit.com)
I also noticed there is an alternative to spark.stop(), which is mssparkutils.session.stop().
Anyway, I'm not sure if it's necessary.
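If the goal is simply to release the session explicitly at the end of a notebook, a minimal sketch could look like this (assuming mssparkutils.session.stop() is the Fabric-managed way to end the session; I have not verified how it differs from spark.stop() myself):

# Final cell of the notebook, after all the actual work is done.
# Ends the current notebook session (an alternative to spark.stop()).
mssparkutils.session.stop()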
I am still not entirely sure what to believe regarding session start/stop in Fabric.
There is also the option to use a Master notebook and use that notebook to call other notebooks. Then I think you can share the same session among notebooks. I think this approach utilizes the high concurrency feature.
EDIT: I think the Reddit discussion has made me understand more about it. I recommend checking out the Reddit discussion (link above).
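For completeness, here is a rough sketch of the master notebook idea with runMultiple() (still a preview feature). The notebook names, timeout, and parameters below are made up for illustration, and the DAG format follows the Microsoft Learn examples as I understand them:

# Master notebook: run several child notebooks within the same Spark session.
dag = {
    "activities": [
        {
            "name": "LoadDim",                    # made-up notebook name
            "path": "LoadDim",
            "timeoutPerCellInSeconds": 600,
            "args": {"run_date": "2024-08-01"},   # hypothetical parameter
        },
        {
            "name": "LoadFact",                   # made-up notebook name
            "path": "LoadFact",
            "timeoutPerCellInSeconds": 600,
            "dependencies": ["LoadDim"],          # run after LoadDim completes
        },
    ]
}

mssparkutils.notebook.runMultiple(dag)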
@frithjof_v thanks for this.
High concurrency is not shipped yet (off-topic).
Therefore, if no measures are taken for a large_array, the pipeline will error out when you call a notebook inside ForEach over a large array to perform operations on the same table. E.g.:
// pseudocode
const large_array = [1, 2, ..., 20]
// updates to be utilized in the upsert
const updates = updates
// target table
const target = delta_fact
// ForEach activity in the pipeline: sequential execution on a subset of target
forEach element of large_array {
    perform Delta Table Merge sequentially
    where large_array[element] = target[element]
}
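For what it's worth, here is a rough PySpark sketch of what each iteration's merge could look like if it all ran from one master notebook, so the merges stay sequential within a single session. The table names delta_fact and updates come from the pseudocode above; the id key column and everything else are assumptions:

from delta.tables import DeltaTable

# Target Delta table and the staged updates (names taken from the pseudocode).
target = DeltaTable.forName(spark, "delta_fact")
updates = spark.read.table("updates")

large_array = list(range(1, 21))  # [1, 2, ..., 20]

for element in large_array:
    # Take only the subset of updates belonging to this element,
    # and merge one element at a time so the merges run sequentially.
    subset = updates.filter(updates["id"] == element)

    (
        target.alias("t")
        .merge(subset.alias("u"), "t.id = u.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )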