March 31 - April 2, 2025, in Las Vegas, Nevada. Use code MSCUST for a $150 discount! Early bird discount ends December 31.
Hi
I have several Fabric pipelines that run a number of Dataflow Gen2 activities, each of which calls a REST API and saves the results to Azure SQL tables.
I have 2 issues:
1) The dataflow staging lakehouse tables have grown enormously. I do not require retention of these staging tables, since I believe they are only used by the dataflows during the dataflow activity itself.
The "remove older / unnecessary files & tables" maintenance Spark script in the lakehouse doesn't do anything: it does not remove tables with no activity for 12 months, nor does it remove files older than 7 days. This option also requires ticking each table and waiting for the script to run on it. There are hundreds of unneeded tables to purge in the staging lakehouse, so ticking each one and waiting 5 minutes per table is not a viable solution.
Can I simply delete the staging tables in the dataflow staging lakehouse (obviously not while dataflows are running), since I don't require retention of them? Or is there some maintenance that can be scheduled to clean these tables and remove unrequired files?
2) The dataflow staging lakehouse CU consumption has increased 10x over the past week, while the dataflows' CU consumption is unchanged. There was a spike in dataflow CU activity on Friday that resolved by end of day, but afterwards the staging lakehouse CU activity did not settle back and has remained 10x higher than before.
Since the staging lakehouse activity and tables are managed and controlled by Fabric, how do I investigate what is causing this spike? The dataflows and the Azure SQL environment are unchanged from last week to this week.
(I do have a support ticket with Microsoft on this)
Thanks
Some interesting blogs about the topic:
https://fabric.guru/calculating-the-size-of-the-dataflow-gen2-staging-lakehouses
https://itsnotaboutthecell.com/2024/07/10/cleaning-the-staging-lakeside/
Hi @shaunmw
Has your problem been resolved? If so, please accept the reply that helped as the solution.
Regards,
Nono Chen
If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.
Hi @shaunmw
Since you don't need to retain the staging tables, you have a few options:
Manual Delete: You can manually delete these tables when they are not in use. This is a straightforward approach, but can be time-consuming if there are many tables.
Automatic maintenance: You can set up an automated process to clean these tables. Since the built-in maintenance script doesn't work as expected, you might consider creating a custom script. For example:
from datetime import datetime, timedelta
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CleanupStagingTables").getOrCreate()

# Drop any table that has not been modified in the last 30 days
unused_threshold_days = 30
unused_threshold_date = datetime.now() - timedelta(days=unused_threshold_days)

for table in spark.catalog.listTables("your_database_name"):
    table_name = table.name
    # DESCRIBE DETAIL returns a single row; its lastModified column
    # is already a datetime, so no string parsing is needed
    detail = spark.sql(f"DESCRIBE DETAIL your_database_name.{table_name}").collect()[0]
    last_modified = detail["lastModified"]
    if last_modified and last_modified < unused_threshold_date:
        spark.sql(f"DROP TABLE your_database_name.{table_name}")
        print(f"Deleted table: {table_name}")

spark.stop()
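The script above drops unused tables outright. If you instead want to reclaim storage from tables you keep, Delta's VACUUM command removes data files that are no longer referenced by the transaction log, which is what the built-in "remove older files" maintenance is meant to do. A minimal sketch that builds one VACUUM statement per table (the database name and table names here are hypothetical placeholders; in a Fabric notebook you would enumerate the real tables with spark.catalog.listTables and execute each statement with spark.sql):

```python
# Build a VACUUM statement for each staging table. VACUUM deletes data
# files no longer referenced by the Delta transaction log that are older
# than the retention window (168 hours = 7 days is the Delta default).
staging_tables = ["StagingTable1", "StagingTable2"]  # hypothetical names

vacuum_statements = [
    f"VACUUM your_database_name.{name} RETAIN 168 HOURS"
    for name in staging_tables
]

for stmt in vacuum_statements:
    print(stmt)  # in a Fabric notebook: spark.sql(stmt)
```

Scheduling such a notebook as a pipeline activity (when no dataflows are running) would give you the recurring cleanup that the manual maintenance UI does not.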
A sudden increase in the staging lakehouse's CU activity can have several causes. You can try the following:
Use Azure Monitor to track resource usage in the staging Lakehouse. Look for any unusual patterns or spikes in activity.
Azure Monitor overview - Azure Monitor | Microsoft Learn
Review the dataflow refresh logs for any changes or anomalies around the time of the spike.
Since your dataflows call a REST API, check whether the number of API calls or the API response times have increased, as either can affect CU usage.
Even though you haven't changed your dataflows or Azure SQL environment, consider external factors that may affect performance, such as network issues or changes in data volume.
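One cheap way to spot a change in API call volume is to count calls per day from whatever request log you have available. A hedged sketch with made-up log lines (in practice the timestamps would come from your dataflow refresh history or your API gateway's logs):

```python
from collections import Counter
from datetime import datetime

# Hypothetical request log: one line per API call, ISO timestamp first.
log_lines = [
    "2025-01-10T09:00:01 GET /api/orders",
    "2025-01-10T09:00:05 GET /api/orders",
    "2025-01-11T09:00:02 GET /api/orders",
    "2025-01-11T09:00:03 GET /api/orders",
    "2025-01-11T09:00:04 GET /api/orders",
]

# Count calls per calendar day; a day whose count jumps well above the
# baseline is a candidate cause for the CU spike.
calls_per_day = Counter(
    datetime.fromisoformat(line.split()[0]).date().isoformat()
    for line in log_lines
)

for day, count in sorted(calls_per_day.items()):
    print(f"{day}: {count} calls")
```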
You mentioned that you have a support ticket open with Microsoft; please share any findings here once you hear back.
Regards,
Nono Chen
If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.