Scott_Powell
Advocate III

How to choose optimal spark environment settings?

Hi all, I'm working on a notebook with Spark SQL to move and transform data between our bronze and silver lakehouses. In general, most of our ETL is joining multiple tables and writing an output table with the results (see sample below).

In our specific case, none of these tables has more than 10 or 20 million rows... many are MUCH smaller. But we find some SQL statements still take upwards of 10 minutes to complete.

 

My question is: how do I determine "the best" Spark settings to improve performance? The Fabric documentation tells you the mouse clicks needed to change Spark settings, but doesn't give any guidance on figuring out what the settings should be. I'm completely new to Spark in general... but I feel like these queries should run at least 10x faster, based on other Fabric performance I'm seeing.

 

Does anyone know a good guide on how to read the output of the notebook steps and use it to figure out how the Spark settings could be adjusted?

 

Thanks!

Scott


CREATE OR REPLACE TABLE SCOTT_PO_HDR
AS
select  NULLIF(POFactKeys.POFactKey, -1) as POFactKey,
        NULLIF(BU.BusinessUnitKey, -1) as BusinessUnitKey,
        NULLIF(VEND.VendorKey, -1) as VendorKey,
        NULLIF(po_dt.DateKey, -1) as PODateKey,
        NULLIF(acct_dt.DateKey, -1) as AccountingDateKey,
        NULLIF(enter_dt.DateKey, -1) as EnteredDateKey,
        NULLIF(appr_dt.DateKey, -1) as ApprovalDateKey,
        HDR.BUSINESS_UNIT, 
        HDR.PO_ID, 
        HDR.PO_TYPE,
        po_type.XLATLONGNAME as PO_TYPE_DESC,
        HDR.PO_STATUS, 
        stat.XLATLONGNAME as STATUS_DESC,
        HDR.DISP_METHOD, 
        HDR.BUYER_ID, 
        buyer.OPRDEFNDESC as BUYER_NAME,
        HDR.OPRID_ENTERED_BY as ENTERED_BY_ID, 
        enter.OPRDEFNDESC as ENTERED_BY_NAME,
        amt.TOTAL_PO_AMT
from    Bronze_Peoplesoft_Finance_LH.PS_PO_HDR HDR
inner join Silver_FinanceLH.POFactKeys POFactKeys on HDR.BUSINESS_UNIT = POFactKeys.BUSINESS_UNIT and HDR.PO_ID = POFactKeys.PO_ID
inner join Silver_FinanceLH.BusinessUnitDimKeys BU on HDR.BUSINESS_UNIT = BU.BUSINESS_UNIT
inner join SILVER_FinanceLH.VendorDimKeys VEND on HDR.VENDOR_SETID = VEND.SETID and HDR.VENDOR_ID = VEND.VENDOR_ID
inner join SILVER_FinanceLH.DateDimKeys po_dt on HDR.PO_DT = po_dt.DateTime
inner join SILVER_FinanceLH.DateDimKeys enter_dt on HDR.ENTERED_DT = enter_dt.DateTime
inner join SILVER_FinanceLH.DateDimKeys appr_dt on HDR.APPROVAL_DT = appr_dt.DateTime
inner join SILVER_FinanceLH.DateDimKeys acct_dt on HDR.ACCOUNTING_DT = acct_dt.DateTime
inner join Bronze_Peoplesoft_Finance_LH.PSOPRDEFN buyer on HDR.BUYER_ID = buyer.OPRID
inner join Bronze_Peoplesoft_Finance_LH.PSOPRDEFN enter on HDR.OPRID_ENTERED_BY = enter.OPRID
inner join Bronze_Peoplesoft_Finance_LH.PSXLATITEM stat on HDR.PO_STATUS = stat.FIELDVALUE and stat.FIELDNAME = 'PO_STATUS'
inner join Bronze_Peoplesoft_Finance_LH.PSXLATITEM po_type on HDR.PO_TYPE = po_type.FIELDVALUE and po_type.FIELDNAME = 'PO_TYPE'
inner join TEMP_PO_HDR_AMT amt on HDR.BUSINESS_UNIT = amt.BUSINESS_UNIT and amt.PO_ID = HDR.PO_ID

where   HDR.PO_STATUS <> 'X'

 

10 REPLIES
Srisakthi
Resolver II

Hi @Scott_Powell ,

 

Please try setting these properties:

spark.conf.set("spark.sql.parquet.vorder.enabled", "true")                    # enable V-Order writes
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")        # compact small files at write time
spark.conf.set("spark.databricks.delta.optimizeWrite.binSize", "1073741824")  # target ~1 GB files

 

For Spark optimization and configuration, check your notebook execution logs and the Spark monitoring view. There you can see how many stages ran, how many executors were used, how many sat idle, and so on. Based on those factors you can tune the node configuration.
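As a starting point, you can also print the session's current settings from inside the notebook and compare them with what the monitoring view reports. A minimal sketch (the keys are standard Spark properties; anything not explicitly set falls back to the pool defaults):

# inspect the current Spark session configuration from inside the notebook
for key in [
    "spark.executor.memory",
    "spark.executor.cores",
    "spark.executor.instances",
    "spark.sql.shuffle.partitions",
    "spark.sql.adaptive.enabled",
]:
    print(key, "=", spark.conf.get(key, "<not set / pool default>"))

# total number of cores the session can schedule tasks on
print("defaultParallelism =", spark.sparkContext.defaultParallelism)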

 


 

Srisakthi
Resolver II

Hi @Scott_Powell ,

 

There are a few ways you can optimize performance:

1. Use the Table Maintenance feature for your lakehouse tables.

2. Enable V-Order and optimizeWrite, with a bin size of 1 GB, to minimize the number of files and optimize for read performance.

3. Make sure the session has the nodes it needs to process your data, and adjust Spark settings such as the cores and executors to be used (see the sketch after this list).

4. Upgrade your Fabric runtime to the latest version (1.3).
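For point 3, note that node size and executor counts are set on the Spark pool or environment rather than inside the notebook, but a few join-related settings can be changed per session. A hedged example, with values that are only starting points rather than recommendations for your data:

# session-level settings that often matter for multi-table joins (illustrative values)
spark.conf.set("spark.sql.adaptive.enabled", "true")             # let AQE coalesce shuffle partitions and adjust join plans
spark.conf.set("spark.sql.shuffle.partitions", "200")            # lower this if shuffles produce lots of tiny tasks
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(64 * 1024 * 1024))  # broadcast dimension tables up to ~64 MB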

 

 

Regards,

Srisakthi

 

Hi @Srisakthi ,

 

I have a job developed that goes through each lakehouse table and runs optimize, vacuum, and a delta log commit on it, but I haven't started running it yet. I'll give this a try and see if it makes a difference to performance.

 

Re: enabling V-Order and optimizeWrite with a 1 GB bin size to minimize the number of files — almost all of the SQL runs against our bronze layer lakehouse, where the tables were created via pipeline copy activities. Is it possible to set V-Order and optimizeWrite on those? I thought that was the default. And re: the 1 GB file sizes, I'm completely unsure how to set that when a copy activity is doing the loads. I definitely see output from the copy activities saying things like 8 GB copied but 100+ files created, meaning they're too small; I'm just not sure how to stop this from happening. P.S. We also use dynamic partitioning on the copy activity to improve performance; I suspect this also causes the files to be split up "too small", but I'm not sure.
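As a sanity check on the file sizes, I'm planning to look at one of the bronze tables directly, roughly like this (PS_PO_HDR is just the example from my query; DESCRIBE DETAIL is standard Delta and reports the file count and total size):

# rough check of file count and average file size for one bronze table
detail = spark.sql("DESCRIBE DETAIL Bronze_Peoplesoft_Finance_LH.PS_PO_HDR").collect()[0]
avg_mb = detail["sizeInBytes"] / detail["numFiles"] / (1024 * 1024)
print(f"{detail['numFiles']} files, ~{avg_mb:.1f} MB per file")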

 

Re: making sure we're using the nodes required for processing the data and adjusting Spark settings like cores and executors — that's the crux of my question. Is there a way to make an intelligent guess about whether I need more cores with less memory vs. fewer cores with more memory?

 

And we are using the 1.3 runtime.

 

I really appreciate the help - sorry for the ongoing questions. 

Scott

 

I'd guess that running OPTIMIZE on all of the query's source tables will help improve read performance.

 

I would try that as the first step.

 

%%sql

OPTIMIZE tableName;

 

https://docs.delta.io/latest/optimizations-oss.html
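To cover every input of the query in one go, a small PySpark loop works too; the table list below is copied from the joins in the original post (the TEMP_PO_HDR_AMT temp table is skipped):

# compact each source table used by the query
tables = [
    "Bronze_Peoplesoft_Finance_LH.PS_PO_HDR",
    "Bronze_Peoplesoft_Finance_LH.PSOPRDEFN",
    "Bronze_Peoplesoft_Finance_LH.PSXLATITEM",
    "Silver_FinanceLH.POFactKeys",
    "Silver_FinanceLH.BusinessUnitDimKeys",
    "Silver_FinanceLH.VendorDimKeys",
    "Silver_FinanceLH.DateDimKeys",
]
for t in tables:
    spark.sql(f"OPTIMIZE {t}")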

AndyDDC
Most Valuable Professional

Hi @Scott_Powell, there's a blog here which may help visualise the Spark execution plan:

 

Visualizing Spark Execution Plans – Gerhard Brueckl on BI & Data

Hi @AndyDDC, I just wanted to reply back. The article you linked seems to offer an easier / more graphical way of seeing the explain plans for a query, but it still doesn't give much info on how to adjust Spark settings to make it faster. I'm finding it really hard to get good info on that. I'm going to explore the autotuning @frithjof_v mentioned and hope that helps.

 

Thanks!

Scott

Thanks @AndyDDC , really appreciate it! I'll check it out and see if it helps!

 

Microsoft - it would be awesome if I could run a query in a notebook and then have Copilot suggest optimal settings to make it run quickly...

 

Thanks,

Scott

There is something called Autotune in preview.

 

I must be honest, I haven't tested it. I haven't even read about it. I just heard the name. Perhaps it is relevant to what you're asking for: https://learn.microsoft.com/en-us/fabric/data-engineering/autotune?tabs=sparksql
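From a quick skim of that page, it looks like Autotune can be switched on per session with a single setting. The exact property name below is my reading of the docs, so please verify it there:

spark.conf.set("spark.ms.autotune.enabled", "true")  # property name per the linked Autotune docs - double-check there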

Here's a Fabric Espresso video on the Autotune feature:

 

https://youtu.be/iv1kAz_ShEo?si=arP4FA7wYhJisnvc

 

Unfortunately, I haven't had the time to watch it myself yet.

Hi @frithjof_v, I had heard about Autotune back at the Fabric conference in Vegas, but didn't realize it was actually available yet or that it was disabled by default. I'll definitely turn this on, see what happens, and post back.

 

Thanks for the heads up!

Scott
