Miles Cole — Principal Program Manager, Microsoft
Available now in Fabric Runtime 2.0
Liquid Clustering is transforming how data teams manage table layouts in the lakehouse, often replacing rigid Hive-style partitioning with flexible clustering that adapts to your workloads. But until now, every `OPTIMIZE` job that performs clustering rewrote far more data than necessary, making clustering cost unpredictable and increasingly expensive as tables grew.
Today, Microsoft Fabric is introducing Incremental Liquid Clustering. Here's what it looks like in practice — clustering time over 200 ELT (extract, load, transform) iterations comparing the standard algorithm to incremental mode, both on Fabric Runtime 2.0:
Figure: ELT Pipeline (merge, partial-overlapping with 2 cluster columns). Standard clustering time grows linearly with table size. Incremental stays flat.
Liquid Clustering reduces query costs by organizing data so that queries only read the files they need. A well-clustered table means fewer files scanned, less I/O, and faster results. But producing and maintaining that layout has a cost: every `OPTIMIZE` spends compute rewriting files.
The standard algorithm from open-source Delta Lake rewrites all data within each group of clustered files until the group exceeds 100 GB, on every `OPTIMIZE`, regardless of whether the files are already well-clustered. Append 1 KB to a 99 GB table? It rewrites all 99+ GB. This makes clustering cost grow with table size, not with the amount of new data, and for large tables the cost of maintaining clustering can outweigh the query savings it provides.
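To make that arithmetic concrete, here is a toy cost model. It is a sketch under stated assumptions, not Delta Lake's or Fabric's implementation: it assumes a single group of clustered files, that the standard algorithm rewrites the whole group while it is under the 100 GB cap, and that an idealized incremental pass rewrites only the newly appended batch. The function names and the 0.5 GB batch size are hypothetical.

```python
GROUP_CAP_GB = 100.0  # group size limit cited above for the standard algorithm


def standard_rewrite_gb(existing_gb: float, new_gb: float) -> float:
    """Toy model: a group below the cap is rewritten in full, new data included."""
    if existing_gb >= GROUP_CAP_GB:
        # Assumption: a group past the cap is left alone, so only new data is clustered.
        return new_gb
    return existing_gb + new_gb


def incremental_rewrite_gb(existing_gb: float, new_gb: float) -> float:
    """Idealized incremental model: only the new data is rewritten."""
    return new_gb


def total_rewritten(rewrite_fn, batches_gb):
    """Sum the data rewritten when OPTIMIZE runs after every appended batch."""
    table_gb = 0.0
    rewritten = 0.0
    for batch in batches_gb:
        rewritten += rewrite_fn(table_gb, batch)
        table_gb += batch
    return rewritten


batches = [0.5] * 100  # hypothetical workload: 100 appends of 0.5 GB each
standard_total = total_rewritten(standard_rewrite_gb, batches)
incremental_total = total_rewritten(incremental_rewrite_gb, batches)
print(f"standard: {standard_total:.1f} GB, incremental: {incremental_total:.1f} GB")
```

Under these assumptions the standard model rewrites 2,525 GB in total to ingest 50 GB of new data, while the incremental model rewrites 50 GB. The real algorithms are more nuanced, but the shape of the gap is the point.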
Incremental Liquid Clustering fixes this balance: same query performance, fraction of the maintenance cost.
Instead of rewriting everything, `OPTIMIZE` now identifies and processes only the files that actually need clustering.
Already well-clustered, healthy-sized files are skipped entirely. New data is routed into existing groups of clustered files, maintaining layout continuity without rewriting data that's already in the right place. The result is a clustering operation whose cost stays roughly constant for a fixed batch size, because it scales with the size of new data, not the size of the table.
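The selection step can be pictured with a small sketch. This is illustrative only: the `FileInfo` type, the `clustered` flag, and the `MIN_HEALTHY_MB` threshold are hypothetical stand-ins, and the actual criteria Fabric uses are not described in this post.

```python
from dataclasses import dataclass

MIN_HEALTHY_MB = 128  # hypothetical "healthy size" threshold, not a Fabric setting


@dataclass(frozen=True)
class FileInfo:
    path: str
    size_mb: float
    clustered: bool  # True if an earlier OPTIMIZE already clustered this file


def needs_clustering(f: FileInfo) -> bool:
    """A file is a candidate if it was never clustered or is undersized."""
    return (not f.clustered) or f.size_mb < MIN_HEALTHY_MB


def select_candidates(files):
    """Return only the files an incremental pass would rewrite; skip the rest."""
    return [f for f in files if needs_clustering(f)]


files = [
    FileInfo("part-000.parquet", 512, clustered=True),   # well-clustered, skipped
    FileInfo("part-001.parquet", 480, clustered=True),   # well-clustered, skipped
    FileInfo("part-002.parquet", 4, clustered=False),    # newly appended data
    FileInfo("part-003.parquet", 16, clustered=True),    # clustered but too small
]
candidates = select_candidates(files)
print([f.path for f in candidates])
```

Only the two small or unclustered files are selected; the two large, already-clustered files are never touched, which is why the work tracks the size of the new data.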
If you only cluster new files, quality can silently degrade as new data overlaps with existing file ranges. Incremental Liquid Clustering includes Auto Reclustering to handle this: the algorithm identifies clusters of overlapping files and selectively reclusters degraded files as overlap thresholds are exceeded. Quality stays high as data evolves, with no manual intervention required.
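One way to reason about overlap-driven reclustering is to count how many other files share part of each file's key range. The sketch below does that for a single integer clustering key. It is a simplified illustration, not the Auto Reclustering algorithm: `FileRange`, `MAX_OVERLAPS`, and the per-file counting rule are assumptions made for the example.

```python
from dataclasses import dataclass

MAX_OVERLAPS = 1  # hypothetical threshold; the real thresholds are not given here


@dataclass(frozen=True)
class FileRange:
    path: str
    min_key: int
    max_key: int


def overlaps(a: FileRange, b: FileRange) -> bool:
    """Two closed ranges overlap when each starts before the other ends."""
    return a.min_key <= b.max_key and b.min_key <= a.max_key


def overlap_counts(files):
    """Count, for every file, how many other files share part of its key range."""
    return {
        f.path: sum(1 for g in files if g is not f and overlaps(f, g))
        for f in files
    }


def files_to_recluster(files, max_overlaps=MAX_OVERLAPS):
    """Flag files whose overlap count exceeds the threshold."""
    counts = overlap_counts(files)
    return [f.path for f in files if counts[f.path] > max_overlaps]


files = [
    FileRange("a", 0, 99),
    FileRange("b", 100, 199),
    FileRange("c", 200, 299),
    FileRange("d", 50, 250),    # a later batch that straddles three existing files
    FileRange("g", 300, 399),   # a later batch that overlaps nothing
]
print(overlap_counts(files))
print(files_to_recluster(files))
```

In this toy layout only file `d` crosses the threshold, so it is the only one flagged; the non-overlapping file `g` and the lightly overlapped files are left in place.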
The following benchmarks cover three workload patterns that represent the most common ways data lands in Delta tables. Each runs 200 iterations of a write followed by `OPTIMIZE`, then ends with a selective query to test clustering quality:
| Workload | Standard (Avg.) | Incremental (Avg.) | Speedup |
|---|---|---|---|
| Streaming Ingest (append, non-overlapping) | 32.3s | 3.7s | 8.9x |
| Analytics Table (append, full-overlapping) | 34.3s | 6.3s | 5.5x |
| ETL Pipeline (merge, partial-overlapping) | 29.7s | 6.1s | 4.9x |
These averages understate the problem. Within each 100 GB group of files, the standard algorithm's cost keeps growing: as tables accumulate data, `OPTIMIZE` time scales linearly with table size rather than batch size. By iterations 150–200, the standard algorithm peaks at 54–65 seconds per run while incremental stays under 10 seconds.
Clustering quality remains comparable. Rewriting already-clustered files doesn't improve data skipping; it just wastes compute.
Fabric's Incremental Liquid Clustering was also benchmarked against another vendor's liquid clustering implementation across the same three workload patterns, measuring both optimize duration and clustering quality.
The source code for reproducing the benchmark can be found on GitHub.
Across all three workloads, Fabric clusters data 1.6x faster on average while scanning 11% fewer files per selective query. The speed difference is most pronounced in merge-heavy workloads, where Fabric is over 3x faster — the workload pattern a majority of Apache Spark users rely on.
Figure: Average seconds per OPTIMIZE across 200 iterations. Lower is better.
For merge workloads, the per-iteration scatter plot shows the full picture. The other vendor's clustering time degrades sharply starting around iteration 100 and continues to grow as the system tries to maintain clustering quality, while Fabric stays consistent and fast:
Figure: ETL Pipeline (merge, partial-overlapping) — 5.5x faster at iteration 200.
Figure: Average files scanned for selective queries over 200 iterations. Lower is better.
Fabric achieves comparable or better data skipping across all three workload shapes. Both engines produce tight file layouts for streaming and ETL patterns. For the full-overlapping append workload, Fabric scans more than 11% fewer files per query, a result of investing more compute to produce a higher-quality layout.
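The "files scanned" metric follows directly from per-file min/max statistics: a selective query has to read every file whose key range intersects the predicate. The sketch below shows that relationship on two made-up layouts. The ranges and the query are hypothetical and are not taken from the benchmark.

```python
def files_scanned(file_ranges, lo, hi):
    """Count files whose [min, max] range intersects the query range [lo, hi]."""
    return sum(1 for fmin, fmax in file_ranges if fmin <= hi and lo <= fmax)


# Hypothetical layouts over the same key space, four files each.
well_clustered = [(0, 99), (100, 199), (200, 299), (300, 399)]
poorly_clustered = [(0, 390), (5, 399), (10, 350), (200, 299)]

query = (120, 130)  # selective predicate: key BETWEEN 120 AND 130
tight = files_scanned(well_clustered, *query)
loose = files_scanned(poorly_clustered, *query)
print(tight, loose)
```

With tight, non-overlapping ranges the query touches one file; with wide, overlapping ranges it touches three of the four. Fewer files scanned for the same predicate is what a higher-quality layout buys.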
Incremental Liquid Clustering is available in Fabric Runtime 2.0 with Delta 4.1 and Spark 4.1. It's enabled by default with no configuration changes required. Simply run `OPTIMIZE` on your Liquid Clustered tables and the new algorithm takes over automatically.
```python
# Create a liquid clustered table (or alter an existing one)
spark.sql("""
CREATE TABLE events CLUSTER BY (event_date, region)
AS SELECT * FROM ...
""")

# OPTIMIZE now uses Incremental Liquid Clustering automatically
spark.sql("""
OPTIMIZE events
""")
```
If you ever need a full recluster (e.g., after changing clustering keys), you can still run:
```python
spark.sql("""
OPTIMIZE events FULL
""")
```
Incremental Liquid Clustering is a major step forward in making the lakehouse self-optimizing. By combining intelligent file selection with automatic quality maintenance each time `OPTIMIZE` runs, Fabric eliminates the tradeoff between clustering quality and clustering cost.
Try it today in Fabric Runtime 2.0 and experience clustering that scales with your data, not against it.
To learn more about Liquid Clustering in Microsoft Fabric, see the documentation.