Compaction is one of the most necessary but also challenging aspects of managing a Lakehouse architecture. Similar to file systems and even relational databases, unless closely managed, data will get fragmented over time, which can lead to excessive compute costs. The OPTIMIZE command exists to address this challenge: small files are grouped into bins targeting a specific ideal file size and then rewritten to blob storage. The result is the same data, contained in fewer, larger files.
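To make the binning idea concrete, here is a minimal sketch of greedy bin packing in plain Python. It is an illustration only, not Delta's actual implementation: the function name `pack_into_bins`, the sort order, and the rule that only files smaller than the target are candidates are all assumptions made for this example.

```python
from typing import List

MB = 1024 * 1024


def pack_into_bins(file_sizes: List[int], target_size: int) -> List[List[int]]:
    """Greedily group small files into bins whose total size stays at or under target_size.

    Hypothetical helper for illustration; the real OPTIMIZE binning logic may differ.
    """
    bins: List[List[int]] = []
    current: List[int] = []
    current_total = 0
    # Assumption: only files smaller than the target are compaction candidates.
    for size in sorted(s for s in file_sizes if s < target_size):
        # Close the current bin when adding this file would exceed the target.
        if current and current_total + size > target_size:
            bins.append(current)
            current, current_total = [], 0
        current.append(size)
        current_total += size
    if current:
        bins.append(current)
    return bins


sizes = [10 * MB, 20 * MB, 40 * MB, 60 * MB, 90 * MB, 200 * MB]
bins = pack_into_bins(sizes, target_size=128 * MB)
for b in bins:
    print([s // MB for s in b])
```

In this example the 200MB file is already larger than the target and is left alone, while the 10MB, 20MB, and 40MB files land in one bin. The 60MB and 90MB files each end up alone in a bin, and rewriting a single file gains little, which is the kind of case the rest of this post is concerned with.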
However, imagine this scenario: you have a nightly OPTIMIZE job which runs to keep your tables, all under 1GB, nicely compacted. Upon inspection of the Delta table transaction log, you find that most of your data is being rewritten after every ELT cycle, leading to expensive OPTIMIZE jobs, even though you are only changing a small portion of the overall data every night. Meanwhile, as business requirements lead to more frequent Delta table updates, in between ELT cycles, it appears that jobs get slower and slower until the next scheduled OPTIMIZE job is run. Sound familiar?
If you’ve felt like OPTIMIZE is too slow, rewrites too much data, or in general should be automatically triggered, you’re not alone. We're introducing three features that will transform the efficiency, efficacy, and performance impact of compaction jobs: Fast Optimize, File Level Compaction Targets, and Auto Compaction.
Traditional Delta table maintenance carries hidden costs that tend to compound over time:
Write Amplification: Files can get recompacted repeatedly as target file size configs change or as optimize jobs produce suboptimal files that still qualify as being 'uncompacted'. Tables that have files smaller than 1GB might be recompacted hundreds or even many thousands of times over their lifetime, wasting compute resources and storage I/O.
Manual Intervention Required: Teams spend valuable time scheduling, monitoring, and troubleshooting compaction jobs instead of focusing on business logic.
Performance Degradation: Small files accumulate between maintenance windows, causing query performance to degrade until the next scheduled optimization.
Unpredictable Costs: Without intelligent short-circuiting, users must write their own logic to evaluate whether compaction would be beneficial. Without that logic, compaction jobs can run longer than expected, impacting both performance and cost predictability.
[Figure: Diagram showing that Fast Optimize adds additional checks to evaluate if a bin o…]
Fast Optimize intelligently analyzes your Delta table's files and short-circuits compaction operations that aren’t estimated to meaningfully improve performance.
Instead of blindly compacting files any time more than one small file exists, Fast Optimize evaluates whether each candidate bin (group of small files) is estimated to meet your compaction goals, or whether too many small files exist. If the compaction job isn’t estimated to produce compacted files meeting the defined minimum target file size (i.e. delta.databricks.delta.optimize.minFileSize) and doesn’t have too many small files (delta.microsoft.delta.optimize.fast.minNumFiles), the operation short-circuits or reduces the compaction scope.
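The decision described above can be sketched as a small predicate. This is a simplified model under stated assumptions, not the Fabric implementation: `should_compact_bin` is a hypothetical name, the estimated output size is taken to be the sum of the input file sizes, and the `min_file_size` and `min_num_files` parameters stand in for the two configs mentioned above with made-up values.

```python
from typing import List

MB = 1024 * 1024


def should_compact_bin(
    bin_file_sizes: List[int], min_file_size: int, min_num_files: int
) -> bool:
    """Return True if a candidate bin looks worth compacting.

    Hypothetical sketch: a bin is compacted when its estimated output meets
    the minimum file size, or when it holds too many small files.
    """
    # Assumption: the compacted output is roughly the sum of the inputs.
    estimated_output_size = sum(bin_file_sizes)
    meets_size_goal = estimated_output_size >= min_file_size
    too_many_small_files = len(bin_file_sizes) >= min_num_files
    return meets_size_goal or too_many_small_files


# Illustrative thresholds only; they are not the product defaults.
MIN_FILE_SIZE = 96 * MB
MIN_NUM_FILES = 50

skip_small_bin = should_compact_bin([30 * MB, 40 * MB], MIN_FILE_SIZE, MIN_NUM_FILES)
compact_full_bin = should_compact_bin([60 * MB, 50 * MB], MIN_FILE_SIZE, MIN_NUM_FILES)
compact_many_files = should_compact_bin([1 * MB] * 60, MIN_FILE_SIZE, MIN_NUM_FILES)
print(skip_small_bin, compact_full_bin, compact_many_files)
```

The first bin would only produce a 70MB file, so it is skipped and left to accumulate more data. The second clears the size threshold, and the third is compacted purely because the file count has grown too large.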
While Fast Optimize is disabled by default in Runtime 1.3, Microsoft recommends enabling it at the session level:
spark.conf.set('spark.microsoft.delta.optimize.fast.enabled', True)
With the Fast Optimize session configuration enabled, all existing `OPTIMIZE` code paths are supported, with the following limitations:
In a study mimicking a real-world scenario where `OPTIMIZE` was run at the end of every ELT cycle, Fast Optimize reduced the time spent doing compaction by 80% over 200 ELT cycles without even the slightest regression in performance.
[Figure: Fast Optimize resulted in 5x faster compaction over 200 ELT iterations]
The magic of Fast Optimize is in the long-term avoidance of write amplification.
Example: Illustrates suboptimal bin skipping:
[Figure: Diagram showing how Fast Optimize resulted in suboptimal bins being skipped]
What it does: File Level Compaction Targets tags files with the compaction target used when they were created, preventing already-optimized files from being unnecessarily recompacted if the target file size changes over time.
The problem it solves: Imagine you compact a table with a 128MB target, then later as the table gets bigger, you change your target to 512MB. Without this feature, those perfectly good 128MB files would be recompacted again, despite being well-sized when they were originally compacted.
How it works: Delta automatically stores metadata about the compaction target alongside file statistics (OPTIMIZE_TARGET_SIZE). Future OPTIMIZE operations use this tag to determine whether the file is compacted. If a file's size is at least one half of its OPTIMIZE_TARGET_SIZE value, it is considered compacted.
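The half-of-target rule can be illustrated with a few lines of Python. This is a sketch rather than the actual Delta code: `is_compacted` is a hypothetical helper, and the handling of files with no tag (falling back to the current session target) is an assumption made only to contrast the two cases from the 128MB to 512MB scenario above.

```python
from typing import Optional

MB = 1024 * 1024


def is_compacted(
    file_size: int, optimize_target_size: Optional[int], current_target: int
) -> bool:
    """Apply the rule: a file counts as compacted if it is at least half its tagged target.

    Assumption: an untagged file (optimize_target_size is None) is judged
    against the current target instead.
    """
    target = optimize_target_size if optimize_target_size is not None else current_target
    return file_size >= target / 2


# A 128MB file written when the target was 128MB, evaluated after the
# target has been raised to 512MB.
tagged = is_compacted(128 * MB, optimize_target_size=128 * MB, current_target=512 * MB)
untagged = is_compacted(128 * MB, optimize_target_size=None, current_target=512 * MB)
print(tagged, untagged)
```

With the tag, the file is compared against the 128MB target it was written for and is left alone. Under the fallback assumed here, the same file without a tag would be measured against 512MB and picked up for recompaction.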
The result: Dramatic reduction in write-amplification and more predictable compaction job performance as the target file size changes over time.
While disabled by default in Runtime 1.3, Microsoft recommends enabling file level targets at the session level:
spark.conf.set('spark.microsoft.delta.optimize.fileLevelTarget.enabled', True)
While Auto Compaction isn’t new, we recently revamped the OSS Delta implementation and now recommend it for Spark customers wanting a hands-off approach to table maintenance.
What it does: Auto Compaction monitors your table's file distribution as part of every write operation and automatically triggers compaction when small file accumulation crosses defined thresholds.
Why it matters: Instead of waiting for scheduled maintenance windows, your tables maintain optimal performance automatically. Small files get compacted before they impact query performance.
Smart triggering: The feature uses table-specific heuristics to determine when a table is bordering on having too many small files, eliminating the need to manually trigger or schedule compaction jobs.
Session level:
spark.conf.set('spark.databricks.delta.autoCompact.enabled', True)
Table level:
CREATE TABLE dbo.ac_enabled_table
TBLPROPERTIES ('delta.autoOptimize.autoCompact' = 'true')
It can also be enabled on existing tables with:
ALTER TABLE dbo.ac_enabled_table
SET TBLPROPERTIES ('delta.autoOptimize.autoCompact' = 'true')
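When several existing tables need the property, the same statement can be generated in a loop. The sketch below only builds the SQL strings; the helper name `auto_compact_statements` and the table names are hypothetical, and executing each statement with `spark.sql` assumes a notebook with an active Spark session.

```python
from typing import Iterable, List


def auto_compact_statements(table_names: Iterable[str]) -> List[str]:
    """Build one ALTER TABLE statement per table to enable Auto Compaction."""
    return [
        f"ALTER TABLE {name} "
        "SET TBLPROPERTIES ('delta.autoOptimize.autoCompact' = 'true')"
        for name in table_names
    ]


# Hypothetical table names used purely for illustration.
statements = auto_compact_statements(["dbo.orders", "dbo.customers"])
for stmt in statements:
    print(stmt)
    # In a notebook with an active session, run: spark.sql(stmt)
```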
Performance benefit: Queries maintain consistent performance without the traditional sawtooth pattern of degradation between maintenance windows.
Cost benefit: Auto compaction utilizes the same amount of compute as scheduled optimize operations, but it automatically runs at precisely the right intervals. For most customers who do not schedule optimize at the optimal frequency for each table, auto compaction can lead to substantial cost savings by reducing unnecessary compute usage and minimizing manual intervention.
In a study comparing the performance impact of running 200 small-file-generating merge operations into a table, the merge executed 5x faster by the last iteration when auto compaction was enabled. This was a result of a growing 'small-file problem' being mitigated by the automatically triggered synchronous compaction operations.
[Figure: Chart showing that after 200 iterations merge executed 5x faster when auto compa…]
These features work together to create a compaction strategy that's both intelligent and hands-off:
The result? Tables that maintain optimal performance with minimal operational overhead and predictable resource usage.
All features mentioned are compatible with tables created with other Delta writers.
Ready to eliminate write amplification and automate your compaction strategy? These features are available now in Microsoft Fabric Spark. Check out the Compacting Delta tables documentation for detailed configuration options and best practices.