This time we’re going bigger than ever. Fabric, Power BI, SQL, AI and more. We're covering it all. You won't want to miss it.
Learn moreDid you hear? There's a new SQL AI Developer certification (DP-800). Start preparing now and be one of the first to get certified. Register now
Data Clustering in Microsoft Fabric Data Warehouse, is a game-changing feature designed to supercharge your queries. Whether you are a data engineer, analyst, or IT professional, Data Clustering is here to help you unlock faster insights, optimize resource usage, and maximize the value of your data.
Data Clustering in Microsoft Fabric Data Warehouse is a technique that organizes and stores your data based on similarity. By grouping similar records together at the storage level, Data Clustering makes queries run faster and more efficiently.
The Data Warehouse engine maintains your Data Clustering tables automatically, just like it does with regular tables. As a result, queries that consume tables with Data Clustering will run much more efficiently, requiring less resources, reducing consumption usage, and making your queries run faster.
Retrieving data from cold storage is one of the most expensive (if not the most expensive) tasks for a database query. Large queries often require scanning massive amounts of data, so if the engine cannot prune irrelevant data early, it ends up reading far more data than necessary.
Imagine your queries scanning only the data that matters, with storage optimized for quick retrieval. Data Clustering makes this possible by keeping related rows close together, allowing the engine to skip irrelevant files and row groups for queries that use WHERE predicates. The result? Faster performance, lower compute costs, and reduced storage access, all thanks to more efficient storage.
The following illustrates an example of a query that uses a regular table (left) compared to a query that uses a table with Data Clustering (right). Note that, in this example, both tables have the exact same data, but the table on the right is ordered for better efficiency.
Side-by-side_comparison_of_two_Microsoft_Fabric_query_windows._The_left_panel_la
Data Clustering introduces a new syntax option to the CREATE TABLE statement: CLUSTER BY.
CREATE TABLE Bands ( ID BIGINT, Name VARCHAR(MAX), Genre VARCHAR(75), Country VARCHAR(56), DateFounded DATE ) WITH (CLUSTER BY (ID, DateFounded))
During data ingestion, Data Clustering applies a sophisticated algorithm to arrange rows with similar values in adjacent storage locations. This method offers advantages over typical clustering implementations that use standard lexicographical sorting, which arranges rows sequentially based on dictionary order beginning with Column A, followed by Column B, and continuing in that manner. When multiple columns are used in Data Clustering (you can use up to four columns), our algorithm considers the value of all columns involved to determine how to store a particular row in storage, keeping rows with similar values close together.
When the Warehouse engine receives a query, it uses column metadata to make smart decisions about which files to access for each query. The process happens automatically, with no extra steps required. By skipping files that do not contain data needed for your query, queries consume less compute resources, retrieve less data from storage, and run faster – especially for highly selective queries.
As an example, consider a simple table Order items table that uses Data Clustering on the primary key column. We can observe how Data Clustering optimizes queries by comparing how it performs on a table that uses Data Clustering versus a query that uses an identical table, but does not use Data Clustering. This comparison can be made by investigating the exec_requests_history view of Query Insights. Using a query such as the following, you can check resource usage for the 'Clustered' and 'Regular' versions of your queries:
SELECT label, row_count, total_elapsed_time_ms, allocated_cpu_time_ms, data_scanned_disk_mb + data_scanned_memory_mb + data_scanned_remote_storage_mb AS total_data_scanned_mb FROM queryinsights.exec_requests_history WHERE label IN ('Clustered','Regular')
In this example, this query produces the following results:
Screenshot_of_a_query_results_table_with_two_rows_labeled_Clustered_and_Regular
Note: both tables in this example (the Clustered and the Regular version) have the exact same data with almost sixty billion rows and approximately 1TB.
From these results, we can observe the following:
Bar_chart_comparing_query_performance_for_a_regular_table_versus_a_Data_Clusteri
The query we used in this example uses predicates that align well with the clustering column, and it is also a highly selective query - which is how it was able to yield the results observed on this chart.
The key here is the total data scanned. Because of how the data is organized in the table that uses Data Clustering, how this particular query aligns its query predicate with the clustering column, and finally, due to the fact that this query is highly selective, the query engine was able to resolve this query by scanning just a fraction of the total data in the table. In this particular case, these factors helped achieve a massive reduction in the time to process the query as well as its cost.
Storage in Fabric Data Warehouse automatically maintains data efficiently in storage to improve performance, reliability, and scalability. We are working to make it even better and more automated every day. Data Clustering is yet another step we are taking to optimizing data layout in disk to align with your workload, and we will continue to invest in innovations to make these optimizations more and more automated, allowing you to focus on your data – not on managing it.
Ready to experience the benefits of Data Clustering for yourself? Check out our in-depth documentation and hands-on tutorial to get started today. Dive into practical examples, explore best practices, and see how Data Clustering can help make your queries run faster and more efficiently.
Your feedback is important to us. As this feature continues to evolve in Preview, we invite you to share your experiences, report any issues, and suggest improvements. Help us make Fabric Data Warehouse even better for everyone!
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.