Hello,
I have datasets residing in the hive_metastore of Azure Databricks storage, in the range of 1–5 million records. I wish to use the Column Profiling feature on the entire dataset after loading the data through the Get Data experience in a Dataflow Gen2 object.
When I change from "Column profiling based on top 1000 rows" to "Column profiling based on entire dataset", processing takes forever on the 1-million-row dataset, while it completes almost instantaneously on the 1000-row sample.
What steps can I take to optimize performance here so that column profiling runs on the entire dataset?
Solved!
This is something that we don't have full control over. It relies on 3 specific components:
I can pass your feedback to the Databricks team, but there's nothing beyond what you're doing today that can improve the performance of profiling for this scenario.
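For context on why the cost scales with row count: full-dataset profiling has to scan every row to compute per-column statistics such as null count, distinct count, min and max, whereas the top-1000-rows mode only touches a fixed-size sample. The sketch below is a plain-Python illustration of that idea (hypothetical helper, not the actual Dataflow Gen2 profiling implementation):

```python
def profile_column(values):
    """Illustrative column profile: stats that require a full scan.

    This mirrors the kind of per-column statistics a profiler
    computes; the real Dataflow Gen2 engine's internals differ.
    """
    non_null = [v for v in values if v is not None]
    return {
        "count": len(values),                       # total rows scanned
        "null_count": len(values) - len(non_null),  # missing values
        "distinct_count": len(set(non_null)),       # unique non-null values
        "min": min(non_null) if non_null else None,
        "max": max(non_null) if non_null else None,
    }

# Every statistic above needs the whole column, so work grows
# linearly with row count -- 1M rows costs ~1000x the 1000-row sample.
print(profile_column([1, 5, None, 5, 9]))
```

This is why switching the profiling scope from 1000 rows to the entire dataset changes the runtime from near-instant to minutes on millions of records.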