Usually, spark.stop() is recommended as a best practice for releasing resources like memory, CPU, and network connections back to the cluster, among other reasons. I usually do that in all my notebooks.
I am executing a notebook sequentially inside a ForEach activity. If I don't use spark.stop(), does each iteration keep taking advantage of the same session? And if I do use spark.stop(), is it overkill, since the session is shut down and restarted for every ForEach execution?
// pseudocode: each ForEach iteration executes the same notebook NB1
[{id: 1}, {id: 2}, {id: 3}].forEach((_, i) => execute(NB1))
If I don't use spark.stop() (which is what I am currently doing), is there any way to shut down the session at the completion of the loop?
Solved! Go to Solution.
Yes, it sounds like the best option is not to use ForEach in this case; instead, have a master notebook and execute all the other notebook runs from that master notebook.
Look for mssparkutils.notebook.run(), mssparkutils.notebook.runMultiple(), or thread pooling in the Reddit discussion:
https://www.reddit.com/r/MicrosoftFabric/comments/1eolfda/sparkstop_is_it_needed/.
I noticed mssparkutils.notebook.runMultiple() is a preview feature. I haven't checked the status of the other mentioned features.
Also, I think mssparkutils will be replaced by notebookutils going forward:
NotebookUtils (former MSSparkUtils) for Fabric - Microsoft Fabric | Microsoft Learn
Microsoft Spark Utilities (MSSparkUtils) for Fabric - Microsoft Fabric | Microsoft Learn
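For illustration, here is a minimal sketch of that master-notebook approach, assuming the child notebook is named NB1 (from the pseudocode in the question) and reads a hypothetical id parameter from its parameters cell. The child runs share the master notebook's Spark session, so there is no per-iteration session start/stop:

# Master notebook: replaces the pipeline ForEach activity.
# mssparkutils is available by default in Fabric notebooks (no import needed).
ids = [1, 2, 3]

for i in ids:
    # Runs NB1 in the same Spark session as this master notebook.
    # Arguments: notebook name, timeout in seconds, parameters dict.
    result = mssparkutils.notebook.run("NB1", 600, {"id": i})
    print(f"NB1 finished for id={i}, exit value: {result}")

As I understand it, the values in the parameters dict populate the matching variables in the child notebook's parameters cell.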
I don't have enough knowledge about how sessions work in Fabric to answer this properly. Interesting question, though! I will try to learn more about this.
Just to be clear, I understand your current setup as a Data pipeline ForEach activity that runs the same notebook sequentially, once per element of the array.
I think I need to learn more about topics like concurrency, and whether it is necessary to use spark.stop() in Fabric, or whether Fabric manages stopping a session when a notebook run is finished.
Perhaps this blog post is relevant: https://www.fourmoo.com/2024/01/10/microsoft-fabric-notebook-session-usage-explained-and-how-to-save...
I am guessing you don't need to use spark.stop() in Fabric.
Are you also starting the Spark session by using code? Something like this:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Simple DataFrame Example") \
    .getOrCreate()
I don't think that is necessary in Fabric either. I guess sessions are managed by Fabric. When running a notebook interactively (using the notebook editor interface), I guess it's a good idea to click 'Stop session' when finished. However, when running a notebook in a Data pipeline, I think Fabric manages the session and stops it when it's no longer needed. See the blog post link in my previous comment.
I am also guessing that using spark.stop() inside the notebook could prevent you from taking advantage of high concurrency Spark sessions.
However I'm not sure about any of this, as I don't have enough knowledge or experience with this.
Hoping to get others' insights and thoughts on this 😃
I started a discussion on Reddit to try to learn more about the topic:
spark.stop() - is it needed? : r/MicrosoftFabric (reddit.com)
I also noticed there is an alternative to spark.stop(), which is mssparkutils.session.stop().
Anyway, I'm not sure if it's necessary.
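If the goal is simply to release the session explicitly at the end of a notebook, a minimal sketch could look like this (assuming mssparkutils.session.stop() is the Fabric-managed way to end the session; I have not verified how it differs from spark.stop() myself):

# Final cell of the notebook, after all the actual work is done.
# Ends the current notebook session (an alternative to spark.stop()).
mssparkutils.session.stop()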
I am still not entirely sure what to believe regarding session start/stop in Fabric.
There is also the option to use a Master notebook and use that notebook to call other notebooks. Then I think you can share the same session among notebooks. I think this approach utilizes the high concurrency feature.
EDIT: I think the Reddit discussion has made me understand more about it. I recommend checking out the Reddit discussion (link above).
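For completeness, here is a rough sketch of the master notebook idea with runMultiple() (still a preview feature). The notebook names, timeout, and parameters below are made up for illustration, and the DAG format follows the Microsoft Learn examples as I understand them:

# Master notebook: run several child notebooks within the same Spark session.
dag = {
    "activities": [
        {
            "name": "LoadDim",                    # made-up notebook name
            "path": "LoadDim",
            "timeoutPerCellInSeconds": 600,
            "args": {"run_date": "2024-08-01"},   # hypothetical parameter
        },
        {
            "name": "LoadFact",                   # made-up notebook name
            "path": "LoadFact",
            "timeoutPerCellInSeconds": 600,
            "dependencies": ["LoadDim"],          # run after LoadDim completes
        },
    ]
}

mssparkutils.notebook.runMultiple(dag)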
@frithjof_v thanks for this.
High concurrency is not shipped yet (off-topic).
Therefore, if no measures are taken for a large_array, the pipeline will error out when you call a notebook inside ForEach over a large array to perform operations on the same table. E.g.:
// pseudocode
const large_array = [1, 2, ..., 20]
// updates to be utilized in the upsert
const updates = updates
// target table
const target = delta_fact
// ForEach activity in the pipeline: sequential execution on a subset of target
forEach element of large_array {
    perform Delta Table Merge sequentially
    where large_array[element] = target[element]
}
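For what it's worth, here is a rough PySpark sketch of what each iteration's merge could look like if it all ran from one master notebook, so the merges stay sequential within a single session. The table names delta_fact and updates come from the pseudocode above; the id key column and everything else are assumptions:

from delta.tables import DeltaTable

# Target Delta table and the staged updates (names taken from the pseudocode).
target = DeltaTable.forName(spark, "delta_fact")
updates = spark.read.table("updates")

large_array = list(range(1, 21))  # [1, 2, ..., 20]

for element in large_array:
    # Take only the subset of updates belonging to this element,
    # and merge one element at a time so the merges run sequentially.
    subset = updates.filter(updates["id"] == element)

    (
        target.alias("t")
        .merge(subset.alias("u"), "t.id = u.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )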