russellhq
Regular Visitor

Faster Way to Copy Dataframes to Lakehouse?

I have a notebook in my Fabric Workspace that transforms data and loads it into lakehouse tables. At the moment there are about 1,700 tables to copy across; they aren't that big, but it takes the notebook a long time to work through the list (over 2 hours).

 

At the moment I'm iterating through a dictionary of dataframes and using the code below to write them to the Lakehouse:

for table_name, df in tables_dict.items():
    df.write.format("delta").mode("overwrite").saveAsTable(table_name)

 

Being new to this, I'd be glad to hear if there is a quicker way to achieve this.

1 ACCEPTED SOLUTION
Hofpower
Frequent Visitor

Hi @russellhq 

 

Not sure if this is the best approach, but one idea comes to mind since I have used the function in a different context:

 

With Microsoft's mssparkutils package you can run notebooks from another notebook (like an orchestrator) with the runMultiple function (https://learn.microsoft.com/en-us/fabric/data-engineering/microsoft-spark-utilities#reference-run-mu...). The advantage, from my point of view, is that this function executes the called notebooks in parallel (with respect to the available resources), and I assume this might help with your problem, since the sequential execution via the for loop is likely the bottleneck here.

 

So what you would have to do: split the tables into a number of batches, create a parameterized child notebook that writes one batch, and call it for each batch in parallel from an orchestrator notebook via runMultiple, as sketched below.
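A minimal sketch of that orchestration, assuming a hypothetical parameterized child notebook named "WriteTableBatch" that writes one slice of the tables; the notebook name, batch count and DAG field values are illustrative, so please check the linked docs for the exact runMultiple schema:

# mssparkutils is available by default in a Fabric notebook session
num_batches = 8  # e.g. split the ~1700 tables into 8 slices (illustrative value)

dag = {
    "activities": [
        {
            "name": f"write_batch_{i}",        # unique name per activity
            "path": "WriteTableBatch",         # child notebook to run (assumed name)
            "timeoutPerCellInSeconds": 1800,
            "args": {"batch_id": i, "num_batches": num_batches},  # parameters the child reads
        }
        for i in range(num_batches)
    ]
}

mssparkutils.notebook.runMultiple(dag)

The child notebook would then pick out its slice of the table dictionary from batch_id and run the same overwrite/saveAsTable loop over just that slice.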

As I said, I'm not sure if this is the easiest option, but I think it is at least one possibility, since the complexity seems to come from the sequential execution and not from the size of the copied tables themselves. Keep in mind that this solution uses the Spark session from the "outer" (orchestration) notebook, so you also get a lever to scale performance via the settings of the Spark pool being used.

 

Since I used runMultiple() in a different context, I haven't tested this or had the problem myself, so I would be happy to hear whether the solution works for you if you try it. 🙂

 

BR 
Martin


3 REPLIES
russellhq
Regular Visitor

@Hofpower  Hi Martin, thanks for the suggestion and links. I'll take a look into this and see if there are any performance gains to be had.

 

I had a look at the resource graph when the cell was running and could see it was using 8/8 allocated cores out of a max instance of 72. So I'm guessing what you're suggesting would use more of those 72 cores.

Anonymous
Not applicable

Hi @russellhq,

I think you can try the ThreadPool library to write multiple dataframes at the same time, batching the list of tasks instead of processing them one by one in the for loop.

For example, here is simple parallel processing that writes data from four dataframes:

 

from pyspark.sql import SparkSession
from multiprocessing.pool import ThreadPool

def write_table(data):
    # Each task is a (dataframe, table name) pair written out as a Delta table
    df, table_name = data
    df.write.format("delta").mode("overwrite").saveAsTable(table_name)

def parallel_processing():
    spark = SparkSession.builder.appName("Parallel Processing").getOrCreate()

    # Define dataframes (placeholder paths)
    df1 = spark.read.csv("path_to_csv1")
    df2 = spark.read.csv("path_to_csv2")
    df3 = spark.read.csv("path_to_csv3")
    df4 = spark.read.csv("path_to_csv4")

    # Define target table names (placeholders; in your case these come from the dict)
    tb1 = "table_name1"
    tb2 = "table_name2"
    tb3 = "table_name3"
    tb4 = "table_name4"

    tasks = [(df1, tb1), (df2, tb2), (df3, tb3), (df4, tb4)]

    pool = ThreadPool(processes=len(tasks))

    # Use ThreadPool to submit the writes so they run in parallel
    results = []
    for task in tasks:
        result = pool.apply_async(write_table, (task,))
        results.append(result)

    # Wait for all writes to complete
    for result in results:
        result.wait()

    # Close the pool
    pool.close()
    pool.join()

parallel_processing()

 

You can modify this to execute the writes from your 'for' loop in parallel, and it may help improve performance.
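Applied to the tables_dict from the original post, the same idea might look like this (the thread count of 8 is just an arbitrary starting point to tune against the pool's resources):

from multiprocessing.pool import ThreadPool

def write_table(item):
    table_name, df = item
    df.write.format("delta").mode("overwrite").saveAsTable(table_name)

# Submit every table write to a small thread pool instead of a sequential loop
with ThreadPool(processes=8) as pool:
    pool.map(write_table, tables_dict.items())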

multiprocessing — Process-based parallelism 

python - multiprocessing.Pool: When to use apply, apply_async or map? - Stack Overflow
Regards,

Xiaoxin Sheng
