shaunmw
Regular Visitor

Dataflow staging lakehouse table retention

Hi

 

I have several Fabric pipelines that run a number of Dataflow Gen2 dataflows, each of which calls a REST API and writes to Azure SQL tables.

 

I have 2 issues:

 

1) The dataflow staging lakehouse tables have grown enormously and are huge. I do not require retention of these staging tables in the lakehouse, since I believe they are only used by the dataflows during the dataflow activity itself.

 

The "remove older / unnecessary files & tables" maintenance Spark script in the lakehouse doesn't do anything: it does not remove tables with no activity for 12 months, nor does it remove files older than 7 days. This option also requires ticking each table and waiting for the script to run on it. There are hundreds of unneeded tables in the staging lakehouse; ticking each one and waiting five minutes per table is not a viable solution.

 

Can I simply delete the staging tables in the dataflow staging lakehouse (obviously not while dataflows are running), since I don't require retention of them? Or is there some maintenance that can be scheduled to clean these tables and remove unrequired files?

 

2) The dataflow staging lakehouse CU consumption has increased 10x over the past week, while the dataflows' CU consumption is unchanged. There was a spike in dataflow CU activity on Friday which resolved by end of day, but after that the staging lakehouse CU activity did not settle back and has remained 10x higher than before.

 

Since the staging lakehouse activity and tables are managed and controlled by Fabric, how do I investigate what is causing this spike? The dataflows and the Azure SQL environment are unchanged from last week to this week.

 

(I do have a support ticket with Microsoft on this)

 

Thanks

v-nuoc-msft
Community Support

Hi @shaunmw 

 

Could you tell me whether your problem has been solved? If so, please mark the relevant reply as the solution.

 

Regards,

Nono Chen

If this post helps, please consider accepting it as the solution to help other members find it more quickly.

v-nuoc-msft
Community Support

Hi @shaunmw 

 

Since you don't need to keep the staging tables, you have several options:

 

Manual deletion: you can manually delete these tables when they are not in use. This is a straightforward approach, but can be time-consuming if there are many tables.

 

Automated maintenance: you can set up an automated process to clean these tables. Since the built-in maintenance script doesn't work as expected, you might consider a custom script. For example:

 

 

 

from datetime import datetime, timedelta
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CleanupStagingTables").getOrCreate()

database_name = "your_database_name"  # the staging lakehouse / database to clean
unused_threshold_days = 30
unused_threshold_date = datetime.now() - timedelta(days=unused_threshold_days)

for table in spark.catalog.listTables(database_name):
    # DESCRIBE DETAIL returns a single row; for Delta tables its
    # lastModified column is already a datetime, so no string parsing is needed.
    detail = spark.sql(f"DESCRIBE DETAIL {database_name}.{table.name}").collect()[0]
    last_modified = detail["lastModified"]

    if last_modified and last_modified < unused_threshold_date:
        spark.sql(f"DROP TABLE {database_name}.{table.name}")
        print(f"Deleted table: {table.name}")

spark.stop()
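Dropping unused tables clears the catalog, but old data files inside the Delta tables you keep are reclaimed separately with VACUUM. Below is a minimal sketch that only builds the VACUUM statements; "your_database_name" and the table names are placeholders, and the 168-hour window is Delta's default 7-day retention (going lower requires disabling a safety check). In a Fabric notebook each statement would then be executed with spark.sql.

```python
def vacuum_statements(table_names, database="your_database_name", retain_hours=168):
    """Build one Spark SQL VACUUM command per table name.

    168 hours is Delta's default 7-day retention; shorter windows require
    disabling Delta's retention-duration safety check first.
    """
    return [
        f"VACUUM {database}.{name} RETAIN {retain_hours} HOURS"
        for name in table_names
    ]

# In a notebook you would then run: for stmt in vacuum_statements(...): spark.sql(stmt)
print(vacuum_statements(["staging_orders"]))
# ['VACUUM your_database_name.staging_orders RETAIN 168 HOURS']
```

This keeps the cleanup logic testable without a Spark session; only the final loop that executes the statements needs to run inside the lakehouse.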

 

 

 

The sudden increase in CU activity in the staging lakehouse can be caused by a number of factors. You can try the following:

 

Use Azure Monitor to track resource usage in the staging Lakehouse. Look for any unusual patterns or spikes in activity.

 

Azure Monitor overview - Azure Monitor | Microsoft Learn

 

Look at the dataflow logs to see whether there were any changes or anomalies during the peak in activity.

 

Since your dataflows call a REST API, check whether the number of API calls has increased or the API response time has changed; either may affect CU usage.

 

Even if you haven't made changes to your dataflows or the Azure SQL environment, consider external factors that may affect performance, such as network issues or changes in data volume.
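On the data-volume point: one quick check is to snapshot each staging table's on-disk size week over week and see which tables grew. In a Fabric notebook the sizes could come from the sizeInBytes column of DESCRIBE DETAIL; the comparison itself is plain Python, and all table names and byte counts below are illustrative:

```python
def size_growth(previous, current):
    """Return current/previous size ratios for tables that grew;
    tables absent from the previous snapshot map to infinity."""
    growth = {}
    for name, new_size in current.items():
        old_size = previous.get(name, 0)
        if new_size > old_size:
            growth[name] = new_size / old_size if old_size else float("inf")
    return growth

# Made-up byte counts for two weekly snapshots:
last_week = {"staging_orders": 1_000_000, "staging_items": 2_000_000}
this_week = {"staging_orders": 10_000_000, "staging_items": 2_000_000}
print(size_growth(last_week, this_week))  # {'staging_orders': 10.0}
```

A 10x size jump in one staging table would line up with the 10x CU increase you describe and would point at the upstream API payloads rather than at Fabric itself.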

 

You mentioned that you have submitted a support ticket; please do share the answer here if you receive one.

 

Regards,

Nono Chen

If this post helps, please consider accepting it as the solution to help other members find it more quickly.
