NagaRK
Advocate I

Spark in Notebook taking too long to process the data.

Hi all,

 

I'm working on a diagnostic log ingestion engine built with PySpark and Delta Lake on Microsoft Fabric. My setup parses incoming ZIP logs from a server, transforms signal data per file into rows, and ingests them into different Delta tables—sometimes writing 50–100 tables per log file.

Current bottleneck:
I use the code below to save the data to the Lakehouse.

for tableName, rows in _logsBuffer.items():
    spark.createDataFrame(rows, schema).write.format("delta").mode("append").saveAsTable(tableName)

Each log takes ~20 minutes (each write takes around 7-10 seconds), and most of that time is spent in table writes. I want to bring that down to 5 minutes or less.

I've already explored:

  • Using ThreadPoolExecutor to parallelize table writes - it still felt like the writes were executing sequentially (a rough sketch of what I tried is included below).

Is there any way I can reduce the time it takes to save these records to the Lakehouse? I know Spark is a powerful engine, but it is taking around 20 minutes to save data for 40 to 50 tables with a varying number of records per table - some tables have around 10 records while a couple have around 2,000.
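For context, the parallel attempt looked roughly like this (a sketch of what I tried; max_workers is just an illustrative value):

from concurrent.futures import ThreadPoolExecutor

def _write_table(item):
    tableName, rows = item
    # Each thread submits its own Spark job for one table's rows.
    spark.createDataFrame(rows, schema).write.format("delta").mode("append").saveAsTable(tableName)

# Writes are submitted concurrently from the driver; list() forces completion and surfaces errors.
with ThreadPoolExecutor(max_workers=8) as executor:
    list(executor.map(_write_table, _logsBuffer.items()))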

7 REPLIES
v-sdhruv
Community Support

Hi @NagaRK ,

Were you able to resolve the issue? Let us know if you need any assistance.

Thank You

NagaRK
Advocate I

Whatever we do, Spark takes more time to store data directly to the Lakehouse tables. It's faster when storing as files in the Lakehouse. So we went ahead and stored the data as files and then moved it from the files to the warehouse. However, I will check further and mark a solution.
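For reference, the file-based workaround is roughly the following (a sketch; Files/diagnostic_logs is just an illustrative folder, assuming a default Lakehouse is attached to the notebook):

for tableName, rows in _logsBuffer.items():
    # Plain Parquet files in the Lakehouse Files area avoid the per-table Delta commit
    # overhead; the files are then loaded into the warehouse in a separate step.
    spark.createDataFrame(rows, schema) \
        .write.format("parquet") \
        .mode("append") \
        .save(f"Files/diagnostic_logs/{tableName}")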

v-sdhruv
Community Support

Hi @NagaRK ,

Just wanted to check if you had the opportunity to review the suggestions provided and were able to implement a solution?
Thank you @Shreya_Barhate, @BhaveshPatel and @Srisakthi for your valuable inputs.

Shreya_Barhate
Advocate I

Hi @NagaRK ,

If your Spark notebook is taking longer to process data in Microsoft Fabric, one effective way to optimize performance is by configuring the Spark environment with resource profiles.

In Microsoft Fabric, resource profiles are integrated directly into the Spark environment. You can apply them by setting the appropriate configuration before executing your workload:

spark.conf.set("spark.fabric.resourceProfile", "readHeavyForSpark")

This sets the Spark environment to use the readHeavyForSpark profile, which is optimized for read-intensive operations. Other available profiles include:

  • readHeavyForPBI
  • writeHeavy
  • custom

These profiles adjust Spark's internal resource allocation (such as executor memory, cores, and task parallelism) based on the workload type.

 Reference Documentation: Configure Resource Profile Configurations in Microsoft Fabric - Microsoft Fabric | Microsoft Learn

Additionally, consider increasing Spark pool capacity by scaling up nodes or adjusting pool size in Fabric settings.

BhaveshPatel
Community Champion

Hi @NagaRK 

 

Always use a single Delta table (optimistic concurrency control) when writing to a Delta table. Also, remove append and use overwrite mode in Delta.

 spark.createDataFrame(rows, schema).write.format("delta").mode("overwrite").saveAsTable(tableName)

We are not using Event Hub (streaming data), so use overwrite (batch data). This way it is near real-time data with minimal latency.

Thanks & Regards,
Bhavesh

Love the Self Service BI.
Please use the 'Mark as answer' link to mark a post that answers your question. If you find a reply helpful, please remember to give Kudos.
Srisakthi
Super User

Hi @NagaRK ,

 

The performance depends on your Spark settings and the capacity that you are using.

Make sure these settings are enabled in the Spark environment that is attached to your notebook:

1. Native Execution Engine - uses a vectorized engine and helps improve query performance

2. Apache Spark latest runtime

3. Leverage Autotune properties - helps speed up workload execution and improve performance

4. Leverage Spark Resource Profiles - Microsoft Fabric provides the flexibility to utilise these resource profiles. By default, all workspaces are attached to the writeHeavy profile.

5. Make sure to enable V-Order (a combined sketch of these session-level settings follows this list)
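As a quick sketch, items 1, 3, 4 and 5 can be applied at the session level as shown below. The Autotune key is from memory, so please verify it against the documentation; the runtime version and pool capacity are chosen on the environment and capacity settings, not in the notebook.

# Session-level settings; apply before running the write workload.
spark.conf.set("spark.native.enabled", "true")                # 1. Native Execution Engine
spark.conf.set("spark.ms.autotune.enabled", "true")           # 3. Autotune (key from memory - verify in docs)
spark.conf.set("spark.fabric.resourceProfile", "writeHeavy")  # 4. Resource profile for write-heavy workloads
spark.conf.set("spark.sql.parquet.vorder.default", "true")    # 5. V-Order at write time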

 

You can refer to this article on ThreadPool:

https://learn-it-all.medium.com/parallelism-in-spark-notebook-execution-in-microsoft-fabric-8fb6ac3f...

 

Regards,

Srisakthi

 

v-sdhruv
Community Support

Hi @NagaRK ,

You can try enabling Optimize Write, which reduces the number of small files and increases individual file sizes, improving write performance and reducing I/O overhead.

You can enable it using:

spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")

V-Order is a write-time optimization for Parquet files that improves read performance and compression. While it adds roughly 15% overhead to write time, it can significantly reduce read latency and storage costs. Use this if you want to reduce total file size.


spark.conf.set("spark.sql.parquet.vorder.default", "true")
 

You can also enable the Spark Native Execution Engine:


spark.conf.set("spark.native.enabled", "true")

Hope this helps!
