Hi all,
I'm working on a diagnostic log ingestion engine built with PySpark and Delta Lake on Microsoft Fabric. My setup parses incoming ZIP logs from a server, transforms signal data per file into rows, and ingests them into different Delta tables—sometimes writing 50–100 tables per log file.
Current bottleneck:
I use the code below to save the data to the lakehouse.
# One sequential Delta write per table buffered from the log file
for tableName, rows in _logsBuffer.items():
    spark.createDataFrame(rows, schema) \
        .write.format("delta") \
        .mode("append") \
        .saveAsTable(tableName)
Each log takes ~20 minutes (each write takes around 7-10 seconds), and most of that time is spent in table writes. I want to bring that down to 5 minutes or less.
I've already explored:
Is there any way I can reduce the time spent saving these file records to the Lakehouse? I know Spark is a powerful engine, but it's taking around 20 minutes to save data for 40 to 50 tables with a varying number of records in each table; some tables have only 10 records while a couple have around 2,000.
Hi @NagaRK ,
Were you able to resolve the issue? Let us know if you need any assistance.
Thank You
Whatever we do, Spark takes more time when storing data directly to the Lakehouse tables. It is faster when storing data as files in the lakehouse, so we went ahead and stored the data as files and then moved it from the files into the warehouse. However, I will check more and mark a solution.
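For context, the files-first pattern we ended up with looks roughly like the sketch below. It assumes a default lakehouse is attached to the notebook so the relative Files/ path resolves, and the raw_logs folder name is just illustrative.
# Write each table's rows as Parquet files under the lakehouse Files area
# instead of saveAsTable; the data is then loaded into the warehouse in a separate step.
for tableName, rows in _logsBuffer.items():
    (spark.createDataFrame(rows, schema)
        .write
        .mode("append")
        .parquet(f"Files/raw_logs/{tableName}"))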
Hi @NagaRK ,
Just wanted to check if you had the opportunity to review the suggestions provided and were able to implement a solution?
Thank you @Shreya_Barhate @BhaveshPatel and @Srisakthi for your valuable inputs.
Hi @NagaRK ,
If your Spark notebook is taking longer to process data in Microsoft Fabric, one effective way to optimize performance is by configuring the Spark environment with resource profiles.
In Microsoft Fabric, resource profiles are integrated directly into the Spark environment. You can apply them by setting the appropriate configuration before executing your workload:
spark.conf.set("spark.fabric.resourceProfile", "readHeavyForSpark")
This sets the Spark session to use the readHeavyForSpark profile, which is optimized for read-intensive operations. Other profiles, such as the writeHeavy default mentioned below, are listed in the reference documentation.
Reference Documentation: Configure Resource Profile Configurations in Microsoft Fabric - Microsoft Fabric | Microsoft Learn
Additionally, consider increasing Spark pool capacity by scaling up nodes or adjusting pool size in Fabric settings.
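If you want to confirm which profile a session is actually running with, you can read the key back as in the sketch below; the writeHeavy fallback is just an illustrative default for when the key has not been set explicitly.
# Read back the active resource profile; the second argument is a fallback value
# returned when the key has not been set in this session.
print(spark.conf.get("spark.fabric.resourceProfile", "writeHeavy"))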
Hi @NagaRK
Always use a single Delta table (optimistic concurrency control) when writing to a Delta table. Also, remove append and use overwrite mode in Delta.
spark.createDataFrame(rows, schema).write.format("delta").mode("overwrite").saveAsTable(tableName)
We are not using Event Hub (streaming data), so use overwrite (batch data). This way it's near real-time data with minimal latency.
Hi @NagaRK ,
The performance depends on your Spark settings and the capacity you are using.
Make sure the following settings are enabled in the Spark environment attached to your notebook (a sketch applying them follows the list):
1. Native Execution Engine - uses a vectorized engine and helps improve query performance
2. Latest Apache Spark runtime
3. Leverage Autotune properties - helps speed up workload execution and performance
4. Leverage Spark resource profiles - Microsoft Fabric provides the flexibility to use these resource profiles. By default, all workspaces are attached to the writeHeavy profile.
5. Make sure to enable V-Order
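As a minimal sketch, these settings can be applied at the session level in a notebook; spark.native.enabled, spark.fabric.resourceProfile and spark.sql.parquet.vorder.default come from the other replies in this thread, while spark.ms.autotune.enabled is my assumption for the Autotune flag, so please verify the exact keys against your Fabric runtime.
# Sketch only - verify these keys against your Fabric runtime documentation.
spark.conf.set("spark.native.enabled", "true")                 # 1. Native Execution Engine
# 2. The Spark runtime version is picked in the environment/workspace settings, not via spark.conf.
spark.conf.set("spark.ms.autotune.enabled", "true")            # 3. Autotune (assumed key)
spark.conf.set("spark.fabric.resourceProfile", "writeHeavy")   # 4. Resource profile for write-heavy ingestion
spark.conf.set("spark.sql.parquet.vorder.default", "true")     # 5. V-Order at write time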
You can also refer to this article on using a thread pool to parallelize the writes.
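As a concrete illustration of that idea, here is a minimal sketch that submits the per-table writes from the original post through a Python thread pool so Spark can schedule the jobs concurrently; max_workers is an assumed value you would tune against your pool size.
from concurrent.futures import ThreadPoolExecutor

def write_table(item):
    # Same per-table write as the original loop, wrapped so it can run on a worker thread
    tableName, rows = item
    (spark.createDataFrame(rows, schema)
        .write.format("delta")
        .mode("append")
        .saveAsTable(tableName))

# Submit the 40-100 small per-table writes concurrently instead of one at a time;
# max_workers = 8 is illustrative and should be tuned against the pool's capacity.
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(write_table, _logsBuffer.items()))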
Regards,
Srisakthi
Hi @NagaRK ,
You can try enabling Optimize Write, which reduces the number of small files and increases individual file sizes, improving write performance and reducing I/O overhead.
You can enable it using:
spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")
V-Order is a write-time optimization for Parquet files that improves read performance and compression. While it adds roughly 15% overhead to write time, it can significantly reduce read latency and storage costs. Use this if you want to reduce total file size.
spark.conf.set("spark.sql.parquet.vorder.default", "true")
You can even parallelize writes with the Spark Native Execution Engine:
spark.conf.set("spark.native.enabled", "true")
Hope this helps!