I have a notebook in my Fabric Workspace that transforms data and loads it into lakehouse tables. There are about 1,700 tables to copy across; they aren't that big, but it takes the notebook a long time to work through the list (over 2 hours).
At the moment I'm iterating through a dictionary of DataFrames and using the code below to write them to the Lakehouse:
for table_name, df in tables_dict.items():
    df.write.format("delta").mode("overwrite").saveAsTable(table_name)
Being new to this, I'd be glad to hear if there is a quicker way to achieve this.
Hi @russellhq
I'm not sure if this is the best approach, but here's one idea that comes to mind, since I've used the function in a different context:
With Microsoft's mssparkutils package you can run notebooks from another notebook (like an orchestrator) using the runMultiple function (https://learn.microsoft.com/en-us/fabric/data-engineering/microsoft-spark-utilities#reference-run-mu...). The advantage, from my point of view, is that this function executes the called notebooks in parallel (with respect to the available resources), which could help in your case, since the sequential execution via a for loop is likely the bottleneck.
So what you would have to do is split the tables into batches, move the write logic into a worker notebook, and call that notebook once per batch from an orchestrator notebook via runMultiple, roughly as sketched below.
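Just a sketch of the orchestrator side based on the runMultiple documentation linked above (not tested; the worker notebook name "Write_Tables_Batch", its batch_id/num_batches parameters, and the batch count of 8 are assumptions for illustration):

# Orchestrator sketch (untested) - assumes a worker notebook called
# "Write_Tables_Batch" that takes batch_id / num_batches parameters and
# writes its share of the ~1700 tables to the Lakehouse.
num_batches = 8  # assumption: split the work into 8 parallel batches

dag = {
    "activities": [
        {
            "name": f"write_batch_{i}",    # activity name, must be unique
            "path": "Write_Tables_Batch",  # hypothetical worker notebook
            "timeoutPerCellInSeconds": 3600,
            "args": {"batch_id": i, "num_batches": num_batches},
        }
        for i in range(num_batches)
    ],
    "concurrency": num_batches,  # how many notebooks may run at the same time
}

# mssparkutils is available by default in Fabric notebooks
mssparkutils.notebook.runMultiple(dag)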
As I said, I'm not sure if this is the easiest option, but I think it's at least worth a try, since the complexity seems to come from the sequential execution rather than from the size of the copied tables themselves. Keep in mind that this solution uses the Spark session of the "outer" (orchestration) notebook, so you also get a lever to scale performance via the settings of the Spark cluster being used.
Since I've only used runMultiple() in a different context, I haven't tested this for your scenario or had the problem myself, so I'd be happy to hear back if you try the solution and whether it works for you. 🙂
BR
Martin
@Hofpower Hi Martin, thanks for the suggestion and links. I'll look into this and see if there are any performance gains to be had.
I had a look at the resource graph while the cell was running and could see it was using all 8 of its 8 allocated cores, out of a maximum of 72 for the instance. So I'm guessing your suggestion would make use of more of those 72 cores.
Hi @russellhq,
I think you can try the ThreadPool library to write multiple DataFrames at the same time, batching the tasks instead of processing them one by one in the for loop.
For example, here is a simple parallel write of four DataFrames:
from pyspark.sql import SparkSession
from multiprocessing.pool import ThreadPool

def write_table(data):
    # Unpack one task: the shared Spark session, the DataFrame and its target table name
    spark, df, table_name = data
    df.write.format("delta").mode("overwrite").saveAsTable(table_name)

def parallel_processing():
    spark = SparkSession.builder.appName("Parallel Processing").getOrCreate()

    # Define DataFrames
    df1 = spark.read.csv("path_to_csv1")
    df2 = spark.read.csv("path_to_csv2")
    df3 = spark.read.csv("path_to_csv3")
    df4 = spark.read.csv("path_to_csv4")

    # Define table names (read from the dict)
    tb1 = "table_name1"
    tb2 = "table_name2"
    tb3 = "table_name3"
    tb4 = "table_name4"

    # Each task tuple matches the unpacking order in write_table: (spark, df, table_name)
    dataframes = [(spark, df1, tb1), (spark, df2, tb2), (spark, df3, tb3), (spark, df4, tb4)]

    pool = ThreadPool(processes=len(dataframes))

    # Use ThreadPool to write DataFrames in parallel
    results = []
    for data in dataframes:
        result = pool.apply_async(write_table, (data,))
        results.append(result)

    # Wait for all threads to complete
    for result in results:
        result.wait()

    # Close the pool
    pool.close()
    pool.join()

parallel_processing()
You can adapt this pattern so the function runs over your dictionary of tables instead of the sequential 'for' loop, which may help improve performance; a rough adaptation is sketched below.
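For instance, applied to the original tables_dict it could look something like this (untested; the pool size of 8 is just an assumption to tune against the available cores):

# Minimal sketch of the same idea applied to the existing tables_dict.
from multiprocessing.pool import ThreadPool

def write_table(item):
    # Each item is a (table_name, DataFrame) pair from tables_dict
    table_name, df = item
    df.write.format("delta").mode("overwrite").saveAsTable(table_name)

pool = ThreadPool(processes=8)  # assumption: number of tables written concurrently
try:
    # map() blocks until every table has been written and re-raises any worker errors
    pool.map(write_table, list(tables_dict.items()))
finally:
    pool.close()
    pool.join()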
multiprocessing — Process-based parallelism
python - multiprocessing.Pool: When to use apply, apply_async or map? - Stack Overflow
Regards,
Xiaoxin Sheng