Hi all,
I'm working on a diagnostic log ingestion engine built with PySpark and Delta Lake on Microsoft Fabric. My setup parses incoming ZIP logs from a server, transforms signal data per file into rows, and ingests them into different Delta tables—sometimes writing 50–100 tables per log file.
Current bottleneck:
I use the code below to save the data to the lakehouse.
# One sequential Delta write per table buffered from the log file
for tableName, rows in _logsBuffer.items():
    spark.createDataFrame(rows, schema) \
        .write.format("delta") \
        .mode("append") \
        .saveAsTable(tableName)
Each log takes ~20 minutes (each write takes around 7-10 seconds), and most of that time is spent in table writes. I want to bring that down to 5 minutes or less.
I've already explored:
Is there any way I can reduce the time spent saving these file records to the Lakehouse? I know Spark is a powerful engine, but it's taking around 20 minutes to save data for 40 to 50 tables with a varying number of records in each table; some tables have only 10 records while a couple have around 2,000.
Hi @NagaRK ,
Were you able to resolve the issue? Let us know if you need any assistance.
Thank You
Whatever we do, Spark takes more time when storing data directly to the Lakehouse tables. It is faster when storing data as files in the lakehouse, so we went ahead and stored the data as files and then moved it from the files into the warehouse. However, I will check more and mark a solution.
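For context, the files-first pattern we ended up with looks roughly like the sketch below. It assumes a default lakehouse is attached to the notebook so the relative Files/ path resolves, and the raw_logs folder name is just illustrative.
# Write each table's rows as Parquet files under the lakehouse Files area
# instead of saveAsTable; the data is then loaded into the warehouse in a separate step.
for tableName, rows in _logsBuffer.items():
    (spark.createDataFrame(rows, schema)
        .write
        .mode("append")
        .parquet(f"Files/raw_logs/{tableName}"))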
Hi @NagaRK ,
Just wanted to check if you had the opportunity to review the suggestions provided and were able to implement a solution?
Thank you @Shreya_Barhate @BhaveshPatel and @Srisakthi for your valuable inputs.
Hi @NagaRK ,
If your Spark notebook is taking longer to process data in Microsoft Fabric, one effective way to optimize performance is by configuring the Spark environment with resource profiles.
In Microsoft Fabric, resource profiles are integrated directly into the Spark environment. You can apply them by setting the appropriate configuration before executing your workload:
spark.conf.set("spark.fabric.resourceProfile", "readHeavyForSpark")
This sets the Spark session to use the readHeavyForSpark profile, which is optimized for read-intensive operations. Other profiles, such as the writeHeavy default mentioned below, are listed in the reference documentation.
Reference Documentation: Configure Resource Profile Configurations in Microsoft Fabric - Microsoft Fabric | Microsoft Learn
Additionally, consider increasing Spark pool capacity by scaling up nodes or adjusting pool size in Fabric settings.
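If you want to confirm which profile a session is actually running with, you can read the key back as in the sketch below; the writeHeavy fallback is just an illustrative default for when the key has not been set explicitly.
# Read back the active resource profile; the second argument is a fallback value
# returned when the key has not been set in this session.
print(spark.conf.get("spark.fabric.resourceProfile", "writeHeavy"))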
Hi @NagaRK
Always use a single Delta table (optimistic concurrency control) when writing to a Delta table. Also, remove append and use overwrite mode in Delta.
spark.createDataFrame(rows, schema).write.format("delta").mode("overwrite").saveAsTable(tableName)
We are not using Event Hub (streaming data), so use overwrite (batch data). This way it's near real-time data with minimal latency.
Hi @NagaRK ,
The performance depends on your Spark settings and the capacity you are using.
Make sure the following settings are enabled in the Spark environment attached to your notebook (a sketch applying them follows the list):
1. Native Execution Engine - uses a vectorized engine and helps improve query performance
2. Latest Apache Spark runtime
3. Leverage Autotune properties - helps speed up workload execution and performance
4. Leverage Spark resource profiles - Microsoft Fabric provides the flexibility to use these resource profiles. By default, all workspaces are attached to the writeHeavy profile.
5. Make sure to enable V-Order
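As a minimal sketch, these settings can be applied at the session level in a notebook; spark.native.enabled, spark.fabric.resourceProfile and spark.sql.parquet.vorder.default come from the other replies in this thread, while spark.ms.autotune.enabled is my assumption for the Autotune flag, so please verify the exact keys against your Fabric runtime.
# Sketch only - verify these keys against your Fabric runtime documentation.
spark.conf.set("spark.native.enabled", "true")                 # 1. Native Execution Engine
# 2. The Spark runtime version is picked in the environment/workspace settings, not via spark.conf.
spark.conf.set("spark.ms.autotune.enabled", "true")            # 3. Autotune (assumed key)
spark.conf.set("spark.fabric.resourceProfile", "writeHeavy")   # 4. Resource profile for write-heavy ingestion
spark.conf.set("spark.sql.parquet.vorder.default", "true")     # 5. V-Order at write time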
You can also refer to this article on using a thread pool to parallelize the writes.
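As a concrete illustration of that idea, here is a minimal sketch that submits the per-table writes from the original post through a Python thread pool so Spark can schedule the jobs concurrently; max_workers is an assumed value you would tune against your pool size.
from concurrent.futures import ThreadPoolExecutor

def write_table(item):
    # Same per-table write as the original loop, wrapped so it can run on a worker thread
    tableName, rows = item
    (spark.createDataFrame(rows, schema)
        .write.format("delta")
        .mode("append")
        .saveAsTable(tableName))

# Submit the 40-100 small per-table writes concurrently instead of one at a time;
# max_workers = 8 is illustrative and should be tuned against the pool's capacity.
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(write_table, _logsBuffer.items()))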
Regards,
Srisakthi
Hi @NagaRK ,
You can try enabling Optimize Write, which reduces the number of small files and increases individual file sizes, improving write performance and reducing I/O overhead.
You can enable it using:
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
V-Order is a write-time optimization for Parquet files that improves read performance and compression. While it adds roughly 15% overhead to write time, it can significantly reduce read latency and storage costs. Use this if you want to reduce total file size.
spark.conf.set("spark.sql.parquet.vorder.default", "true")
You can even parallelize writes with the Spark Native Execution Engine:
spark.conf.set("spark.native.enabled", "true")
Hope this helps!