Hello,
I have datasets residing in the hive_metastore of Azure Databricks storage, ranging from 1 to 5 million records. I want to use the Column Profiling feature on the entire dataset after loading the data through the Get Data experience in a Dataflow Gen2 object.
When I switch from "Column profiling based on top 1000 rows" to "Column profiling based on entire dataset", the processing takes forever on the 1 million record dataset, but is instantaneous when applied to the top 1000 rows.
What steps can I take to optimize performance so that I can run column profiling on the entire dataset?
This is something that we don't have full control over. It relies on 3 specific components:
I can pass your feedback to the Databricks team, but there's nothing beyond what you're doing today that can impact the performance of the profiling in such a scenario.