russellhq
Regular Visitor

Faster Way to Copy Dataframes to Lakehouse?

I have a notebook in my Fabric Workspace that transforms data and loads it into lakehouse tables. At the moment there are about 1,700 tables to copy across; they aren't that big, but it takes the notebook a long time to work through the list (over 2 hours).

 

At the moment I'm iterating through a dictionary of dataframes and using the code below to write them to the Lakehouse:

for table_name, df in tables_dict.items():
    df.write.format("delta").mode("overwrite").saveAsTable(table_name)

 

Being new to this, I'd be glad to hear if there is a quicker way to achieve this.

1 ACCEPTED SOLUTION
Hofpower
Frequent Visitor

Hi @russellhq 

 

Not sure if this is the best approach, but one idea comes to mind since I have used the function in a different context:

 

With Microsoft's mssparkutils package you can run notebooks from another notebook (like an orchestrator) with the runMultiple function (https://learn.microsoft.com/en-us/fabric/data-engineering/microsoft-spark-utilities#reference-run-mu...). The advantage, from my point of view, is that this function executes the called notebooks in parallel (with respect to the available resources), and I assume this might help with your problem, since the sequential execution via the for loop is likely the bottleneck here.

 

So what you would have to do: split the tables into a number of batches, create a parameterized child notebook that writes one batch, and call it for each batch in parallel from an orchestrator notebook via runMultiple, as sketched below.
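A minimal sketch of that orchestration, assuming a hypothetical parameterized child notebook named "WriteTableBatch" that writes one slice of the tables; the notebook name, batch count and DAG field values are illustrative, so please check the linked docs for the exact runMultiple schema:

# mssparkutils is available by default in a Fabric notebook session
num_batches = 8  # e.g. split the ~1700 tables into 8 slices (illustrative value)

dag = {
    "activities": [
        {
            "name": f"write_batch_{i}",        # unique name per activity
            "path": "WriteTableBatch",         # child notebook to run (assumed name)
            "timeoutPerCellInSeconds": 1800,
            "args": {"batch_id": i, "num_batches": num_batches},  # parameters the child reads
        }
        for i in range(num_batches)
    ]
}

mssparkutils.notebook.runMultiple(dag)

The child notebook would then pick out its slice of the table dictionary from batch_id and run the same overwrite/saveAsTable loop over just that slice.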

As I said, I'm not sure if this is the easiest option, but I think it is at least one possibility, since the complexity seems to come from the sequential execution and not from the size of the copied tables themselves. Keep in mind that this solution uses the Spark session from the "outer" (orchestration) notebook, so you also get a lever to scale performance via the settings of the Spark pool being used.

 

Since I used runMultiple() in a different context, I haven't tested this or had the problem myself, so I would be happy to hear whether the solution works for you if you try it. 🙂

 

BR 
Martin


3 REPLIES
russellhq
Regular Visitor

@Hofpower  Hi Martin, thanks for the suggestion and links. I'll take a look into this and see if there are any performance gains to be had.

 

I had a look at the resource graph when the cell was running and could see it was using 8/8 allocated cores out of a max instance of 72. So I'm guessing what you're suggesting would use more of those 72 cores.

Anonymous
Not applicable

Hi @russellhq,

I think you can try the ThreadPool library to write multiple dataframes at the same time, batching the list of tasks instead of processing them one by one in the for loop.

For example, here is simple parallel processing that writes data from four dataframes:

 

from pyspark.sql import SparkSession
from multiprocessing.pool import ThreadPool

def write_table(data):
    # Each task is a (dataframe, table name) pair written out as a Delta table
    df, table_name = data
    df.write.format("delta").mode("overwrite").saveAsTable(table_name)

def parallel_processing():
    spark = SparkSession.builder.appName("Parallel Processing").getOrCreate()

    # Define dataframes (placeholder paths)
    df1 = spark.read.csv("path_to_csv1")
    df2 = spark.read.csv("path_to_csv2")
    df3 = spark.read.csv("path_to_csv3")
    df4 = spark.read.csv("path_to_csv4")

    # Define target table names (placeholders; in your case these come from the dict)
    tb1 = "table_name1"
    tb2 = "table_name2"
    tb3 = "table_name3"
    tb4 = "table_name4"

    tasks = [(df1, tb1), (df2, tb2), (df3, tb3), (df4, tb4)]

    pool = ThreadPool(processes=len(tasks))

    # Use ThreadPool to submit the writes so they run in parallel
    results = []
    for task in tasks:
        result = pool.apply_async(write_table, (task,))
        results.append(result)

    # Wait for all writes to complete
    for result in results:
        result.wait()

    # Close the pool
    pool.close()
    pool.join()

parallel_processing()

 

You can modify this to execute the writes from your 'for' loop in parallel, and it may help improve performance.
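Applied to the tables_dict from the original post, the same idea might look like this (the thread count of 8 is just an arbitrary starting point to tune against the pool's resources):

from multiprocessing.pool import ThreadPool

def write_table(item):
    table_name, df = item
    df.write.format("delta").mode("overwrite").saveAsTable(table_name)

# Submit every table write to a small thread pool instead of a sequential loop
with ThreadPool(processes=8) as pool:
    pool.map(write_table, tables_dict.items())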

multiprocessing — Process-based parallelism 

python - multiprocessing.Pool: When to use apply, apply_async or map? - Stack Overflow
Regards,

Xiaoxin Sheng
