Hi,
I am testing the delta.logRetentionDuration setting on a Lakehouse Delta table.
https://docs.delta.io/latest/delta-batch.html#data-retention
My aim is to clean up log .json files. However, the .json files are not deleted after a new checkpoint has been created.
I have set the logRetentionDuration to 0 days.
people_table.logRetentionDuration = "interval 0 days"
As the docs state: "Each time a checkpoint is written, Delta automatically cleans up log entries older than the retention interval."
However, this doesn't seem to be the case. I am probably missing something or doing something wrong.
I am using the code below to test this behaviour; this is my entire code.
I use a loop to append data to the delta table 32 times (creating 32 parquet files and a lot of log .json files). I optimize and vacuum inside each iteration, so the small parquet files get compacted into a single parquet file.
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
from pyspark.sql import Row
from delta.tables import DeltaTable
# Define the schema for the Delta table using StructType
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])
# Create a list of tuples representing initial data to populate the Delta table
initial_data = [
    (1, "Alice", 30),  # Initial row with id=1, name="Alice", age=30
    (2, "Bob", 25)     # Initial row with id=2, name="Bob", age=25
]
# Create a DataFrame from the initial data using the defined schema
initial_df = spark.createDataFrame(initial_data, schema)
# Define the abfss path for your Delta table
table_abfss_path = "abfss://<workspaceName>@onelake.dfs.fabric.microsoft.com/<lakehouseName>.Lakehouse/Tables/<tableName>"
# Write the initial DataFrame to a Delta table
initial_df.write.format("delta").option("overwriteSchema", "true").mode("overwrite").save(table_abfss_path)
# Load the Delta table for further operations using DeltaTable API
people_table = DeltaTable.forPath(spark, table_abfss_path)
# Set the log retention duration to zero days, indicating immediate log cleanup
people_table.logRetentionDuration = "interval 0 days"
# Disable the retention duration check for Delta Lake, allowing aggressive vacuuming
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")
# Loop to insert new rows into the Delta table and perform optimization and vacuuming
for i in range(32):  # Iterates 32 times, inserting rows with IDs 3 to 34
    # Create a new row with an incremental ID, name as "Person_i", and age starting from 20
    new_row = [(i + 3, f"Person_{i + 3}", 20 + i)]
    new_df = spark.createDataFrame(new_row, schema)  # Convert the new row into a DataFrame
    # Append the new row to the Delta table
    new_df.write.format("delta").mode("append").save(table_abfss_path)
    print(f"Inserted row {i}: {new_row}")  # Print the inserted row for logging
    # Perform optimization to compact small files into larger ones
    people_table.optimize().executeCompaction()
    # Vacuum with '0' hours removes all old parquet files not referenced by the current version of the delta table
    people_table.vacuum(0)
# Re-enable the retention duration check to enforce retention rules
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "true")
# Read the data from the Delta table into a DataFrame to display it
delta_df = spark.read.format("delta").load(table_abfss_path)
display(delta_df) # Display the contents of the Delta table
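To double-check what actually ends up on the table, I also inspect the recorded table properties and the contents of the _delta_log folder after the loop finishes. This is just a sketch on my part; it assumes DeltaTable.detail() is available in the runtime and that mssparkutils can list the OneLake path from a Fabric notebook:
from notebookutils import mssparkutils  # built in to Fabric/Synapse notebooks (my assumption for this environment)
# Show the table's recorded properties; if the retention setting was applied as a
# table property, delta.logRetentionDuration should appear in the 'properties' column.
display(people_table.detail().select("properties"))
# List the files currently in the _delta_log folder to see how many .json files
# and checkpoint files exist.
for f in mssparkutils.fs.ls(table_abfss_path + "/_delta_log"):
    print(f.name, f.size)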
However, no .json files get deleted (cleaned up) after a checkpoint is created.
Of course, my code is a contrived example, because I am doing everything inside a loop just to generate many files quickly. Still, why aren't the .json files getting deleted?
Does anyone have experience with cleaning up (deleting) the .json files?
Does adjusting the logRetentionDuration property really work?
Can the number of .json files in the _delta_log folder affect the performance of read operations significantly?
Or does it not matter, because the checkpoints improve the performance?
Do I need to care about how many .json files there are in the _delta_log folder, if I am concerned about performance?
Hi @frithjof_v
I didn't find any more detailed or helpful documentation on the "logRetentionDuration" property beyond the page you've already read. I tried a few commands, but they told me the delta table doesn't have a "logRetentionDuration" property, no matter whether I enabled or disabled the "retentionDurationCheck" property in the Spark configuration. This is probably why it didn't remove the log files.
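The exact commands aren't shown here, but as a rough sketch of the kind of check I mean (not necessarily what I ran, and assuming the table is registered in the Lakehouse under the <tableName> placeholder from your path):
# Sketch only: list the properties recorded on the Delta table.
# delta.logRetentionDuration should show up here if it has been set as a table property.
spark.sql("SHOW TBLPROPERTIES <tableName>").show(truncate=False)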
From the documentation you linked, it says "Each time a checkpoint is written, Delta automatically cleans up log entries older than the retention interval. If you set this config to a large enough value, many log entries are retained. This should not impact performance as operations against the log are constant time. Operations on history are parallel but will become more expensive as the log size increases."
It seems the performance of read operations won't be affected significantly as long as checkpoints exist. A possible impact is that as more and more log files accumulate, they will occupy more storage space.
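One way to confirm that checkpoints are actually being written (just a sketch, reusing the table_abfss_path from your code) is to read the _last_checkpoint file, which records the version of the most recent checkpoint:
# The _last_checkpoint file in _delta_log points at the latest checkpoint version.
# By default, Delta writes a checkpoint every 10 commits.
last_checkpoint = spark.read.json(table_abfss_path + "/_delta_log/_last_checkpoint")
display(last_checkpoint)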
Here are some blogs that may be helpful:
Microsoft Fabric Table Maintenance - Checkpoint and Statistics (mssqltips.com)
Microsoft Fabric Lakehouse OPTIMIZE and VACUUM for Table Maintenance (mssqltips.com)
Best Regards,
Jing
If this post helps, please Accept it as Solution to help other members find it. Appreciate your Kudos!
Thanks for sharing!
I'm still looking for guidance on the logRetentionDuration property. Either I don't accurately understand how it's supposed to work, or the property might be a bit buggy.
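One thing I still plan to try, based on the docs I linked: they describe delta.logRetentionDuration as a table property, so perhaps it needs to be set on the table itself (for example via TBLPROPERTIES) rather than assigned as an attribute on the DeltaTable object. A sketch of what I have in mind, assuming the table is registered in the Lakehouse under the <tableName> placeholder from my path:
# Set the log retention as a Delta table property via SQL.
# If this takes effect, log entries older than the interval should be cleaned up
# the next time a checkpoint is written.
spark.sql("""
    ALTER TABLE <tableName>
    SET TBLPROPERTIES ('delta.logRetentionDuration' = 'interval 0 days')
""")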