For the project I'm currently working on, I need to copy data into our bronze layer for a relatively large number of dimension tables. When I do this with a notebook, I notice that this is extremely slow.
Even tiny tables with only 10 records or so take about 6 seconds to save. It doesn't matter whether I do a complete overwrite or merge the new data with the existing data.
All options are still at their defaults, so I haven't enabled any write optimization, but I assumed that would only make a difference for larger tables. Please advise.
Sure!
We are on a F4 Fabric capacity.
I added some logging to my pyspark notebook. One example is this table:
Imported 8 records for table aircraft_hours_history query time: 1.9679827690124512, import time: 9.02900743484497 seconds
The table has only 6 columns, all of them numeric or date-time, and only 9 records in it.
But it still takes more than 9 seconds to merge the 9 records from the source.
The code for merging the table is pretty simple:
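(The actual code isn't shown in the thread; below is a minimal sketch of what such a merge commonly looks like in a Fabric PySpark notebook using the Delta Lake DeltaTable.merge API. The table name, the key column `id`, and the staging source query are assumptions for illustration.)

```python
from delta.tables import DeltaTable

# Minimal sketch -- the thread does not include the actual merge code.
# The table name, the key column "id", and the staging query are assumptions.
source_df = spark.sql("SELECT * FROM staging.aircraft_hours_history")

target = DeltaTable.forName(spark, "bronze.aircraft_hours_history")

(
    target.alias("t")
    .merge(source_df.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```

Even for a handful of rows, a merge like this launches several Spark jobs (scan the target, join against the source, rewrite the affected files), so a few seconds of fixed overhead per table is expected regardless of row count.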
Hi @ntimmerman,
I think these times also include the overhead of initializing components and resources. (I tested a table with more columns and records, and it didn't take much longer than the small table.)
For your scenario, I'd suggest handling the data import and Delta table initialization steps with other functions, and using the 'DeltaTable' API only for the more complex record comparison and merge operations (at these data sizes, 'DeltaTable' performs better than the alternatives). A sketch of this split is shown after this reply.
Regards,
Xiaoxin Sheng
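A minimal sketch of the split described above, reusing the hypothetical table, key column, and source_df from the earlier example: a plain write creates the bronze table on the first load, and DeltaTable.merge is used only for incremental updates.

```python
from delta.tables import DeltaTable

# Sketch only: the table name, key column "id", and source_df are assumptions
# carried over from the earlier example.
table_name = "bronze.aircraft_hours_history"

if not spark.catalog.tableExists(table_name):
    # First load: create the Delta table with a simple write
    source_df.write.format("delta").saveAsTable(table_name)
else:
    # Subsequent loads: let DeltaTable handle the compare-and-merge
    (
        DeltaTable.forName(spark, table_name).alias("t")
        .merge(source_df.alias("s"), "t.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute()
    )
```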
Hi @Anonymous, thanks for your quick reply! That's greatly appreciated! Can you elaborate a bit on what other functions are available that would have better performance for those small tables?
Thanks
Hi @ntimmerman,
In fact, I think this processing cost is already close to the fixed baseline overhead, so switching to other functions may not noticeably improve performance when you operate on small tables.
For optimizing Fabric notebook performance, you can take a look at the following document if it helps (a short example of its session-level settings is sketched after this reply):
Delta Lake table optimization and V-Order - Microsoft Fabric | Microsoft Learn
Regards,
Xiaoxin Sheng
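The linked article covers write-time features such as V-Order and optimized write, which can be toggled at the session level. A minimal sketch, assuming the session-scoped config keys described there; the exact key names can vary between Fabric runtime versions, so check the article for your runtime.

```python
# Sketch only: config key names are taken from the linked article and may
# differ between Fabric runtime versions.
spark.conf.set("spark.sql.parquet.vorder.enabled", "false")             # V-Order on write
spark.conf.set("spark.microsoft.delta.optimizeWrite.enabled", "false")  # optimized write
```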
Thank you!
I found out that there is hardly any difference between writing 10 rows and writing 10,000 rows.
It seems there is just a lot of fixed overhead at the start of each write.
So I've now optimized the process to combine tables before dumping them into the bronze layer.
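The post doesn't show how the tables were combined; one possible interpretation is stacking several small dimension tables into a single DataFrame, tagged with the source table name, so the per-write overhead is paid once instead of once per table. A sketch, with hypothetical staging table names:

```python
from functools import reduce
from pyspark.sql import functions as F

# Sketch only: the staging table names and the combined bronze table name
# are assumptions for illustration.
small_tables = ["dim_airport", "dim_carrier", "dim_aircraft_type"]

frames = [
    spark.sql(f"SELECT * FROM staging.{name}").withColumn("source_table", F.lit(name))
    for name in small_tables
]

# Union the small tables into one DataFrame, padding missing columns with nulls
combined = reduce(lambda a, b: a.unionByName(b, allowMissingColumns=True), frames)

combined.write.format("delta").mode("overwrite").saveAsTable("bronze.small_dimensions")
```

Whether this is worth the schema compromise (unionByName with allowMissingColumns pads mismatched columns with nulls) depends on how the bronze layer is consumed downstream; issuing the small writes concurrently from a single session is another way to amortize the fixed overhead.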
Hi @ntimmerman,
Can you please share some more detailed information about this issue? That would help us clarify your scenario and reproduce it for troubleshooting.
How to Get Your Question Answered Quickly
Regards,
Xiaoxin Sheng