March 31 - April 2, 2025, in Las Vegas, Nevada. Use code MSCUST for a $150 discount! Early bird discount ends December 31.
Hi
I have several Fabric pipelines that run a number of Dataflow Gen2 activities, each of which calls a REST API and saves the results to Azure SQL tables.
I have 2 issues:
1) The dataflow staging lakehouse tables have grown enormously. I do not require retention of these staging tables, since I believe they are only used by the dataflows during the dataflow activity itself.
The "remove older / unnecessary files & tables" maintenance Spark script in the lakehouse doesn't do anything: it does not remove tables with no activity for 12 months, nor does it remove files older than 7 days. This option also requires ticking each table and waiting for the script to run on it. There are hundreds of unneeded tables to purge in the staging lakehouse, so ticking each one and waiting 5 minutes per table is not a viable solution.
Can I simply delete the staging tables in the dataflow staging lakehouse (obviously not while dataflows are running), since I don't require retention of them? Or is there some maintenance that can be scheduled to clean these tables and remove unrequired files?
2) The dataflow staging lakehouse CU consumption has increased 10x over the past week, while the dataflows' CU consumption is unchanged. There was a spike in dataflow CU activity on Friday that resolved by end of day, but afterwards the staging lakehouse CU activity did not settle back and has remained 10x higher than before.
Since the staging lakehouse activity and tables are managed and controlled by Fabric, how do I investigate what is causing this spike? The dataflows and the Azure SQL environment are unchanged from last week to this week.
(I do have a support ticket with Microsoft on this)
Thanks
Some interesting blogs about the topic:
https://fabric.guru/calculating-the-size-of-the-dataflow-gen2-staging-lakehouses
https://itsnotaboutthecell.com/2024/07/10/cleaning-the-staging-lakeside/
Hi @shaunmw
Has your problem been resolved? If so, please accept the reply that helped as the solution.
Regards,
Nono Chen
If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.
Hi @shaunmw
Since you don't need to retain the staging tables, you have a few options:
Manual Delete: You can manually delete these tables when they are not in use. This is a straightforward approach, but can be time-consuming if there are many tables.
Automatic maintenance: You can set up an automated process to clean these tables. Since the built-in maintenance script doesn't work as expected, you might consider creating a custom script. For example:
from datetime import datetime, timedelta
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CleanupStagingTables").getOrCreate()

# Drop any table that has not been modified in the last 30 days
unused_threshold_days = 30
unused_threshold_date = datetime.now() - timedelta(days=unused_threshold_days)

for table in spark.catalog.listTables("your_database_name"):
    table_name = table.name
    # DESCRIBE DETAIL returns a single row; its lastModified column
    # is already a datetime, so no string parsing is needed
    detail = spark.sql(f"DESCRIBE DETAIL your_database_name.{table_name}").collect()[0]
    last_modified = detail["lastModified"]
    if last_modified and last_modified < unused_threshold_date:
        spark.sql(f"DROP TABLE your_database_name.{table_name}")
        print(f"Deleted table: {table_name}")

spark.stop()
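The script above drops unused tables outright. If you instead want to reclaim storage from tables you keep, Delta's VACUUM command removes data files that are no longer referenced by the transaction log, which is what the built-in "remove older files" maintenance is meant to do. A minimal sketch that builds one VACUUM statement per table (the database name and table names here are hypothetical placeholders; in a Fabric notebook you would enumerate the real tables with spark.catalog.listTables and execute each statement with spark.sql):

```python
# Build a VACUUM statement for each staging table. VACUUM deletes data
# files no longer referenced by the Delta transaction log that are older
# than the retention window (168 hours = 7 days is the Delta default).
staging_tables = ["StagingTable1", "StagingTable2"]  # hypothetical names

vacuum_statements = [
    f"VACUUM your_database_name.{name} RETAIN 168 HOURS"
    for name in staging_tables
]

for stmt in vacuum_statements:
    print(stmt)  # in a Fabric notebook: spark.sql(stmt)
```

Scheduling such a notebook as a pipeline activity (when no dataflows are running) would give you the recurring cleanup that the manual maintenance UI does not.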
A sudden increase in the staging lakehouse's CU activity can have several causes. You can try the following:
Use Azure Monitor to track resource usage in the staging Lakehouse. Look for any unusual patterns or spikes in activity.
Azure Monitor overview - Azure Monitor | Microsoft Learn
Review the dataflow refresh logs for any changes or anomalies around the time of the spike.
Since your dataflows call a REST API, check whether the number of API calls or the API response times have increased, as either can affect CU usage.
Even though you haven't changed your dataflows or Azure SQL environment, consider external factors that may affect performance, such as network issues or changes in data volume.
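One cheap way to spot a change in API call volume is to count calls per day from whatever request log you have available. A hedged sketch with made-up log lines (in practice the timestamps would come from your dataflow refresh history or your API gateway's logs):

```python
from collections import Counter
from datetime import datetime

# Hypothetical request log: one line per API call, ISO timestamp first.
log_lines = [
    "2025-01-10T09:00:01 GET /api/orders",
    "2025-01-10T09:00:05 GET /api/orders",
    "2025-01-11T09:00:02 GET /api/orders",
    "2025-01-11T09:00:03 GET /api/orders",
    "2025-01-11T09:00:04 GET /api/orders",
]

# Count calls per calendar day; a day whose count jumps well above the
# baseline is a candidate cause for the CU spike.
calls_per_day = Counter(
    datetime.fromisoformat(line.split()[0]).date().isoformat()
    for line in log_lines
)

for day, count in sorted(calls_per_day.items()):
    print(f"{day}: {count} calls")
```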
You mentioned that you have a support ticket open with Microsoft; please share any findings here once you hear back.
Regards,
Nono Chen
If this post helps, then please consider Accept it as the solution to help the other members find it more quickly.