I am in the early phase of building ingestion and creating delta tables. I know that partitioning can improve performance, so I am looking for a partitioning strategy for the delta tables.
Should I partition on the column with the highest or the lowest cardinality?
Highest cardinality - more partitions. But at what cost?
Lowest cardinality - fewer partitions.
Also, I am ingesting data incrementally into managed delta tables. How does the code handle the new partitions from new data arriving every day?
Here is some discussion about partitioning that I found interesting: https://www.reddit.com/r/MicrosoftFabric/s/FF0ZV8JzzI
Hi @smpa01,
Thanks for the reply from lbendlin.
Partitioning by columns with low cardinality is often recommended.
High cardinality: more partitions give better parallelism, but at the cost of managing many small files and the performance degradation that comes with too many small partitions.
Low cardinality: fewer partitions are easier to manage.
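A quick way to check whether a candidate column is low or high cardinality before partitioning is sketched below (orders_df and the column names are placeholders; CustomerId is only a hypothetical high-cardinality example):

from pyspark.sql import functions as F

# Approximate distinct counts for candidate partition columns.
# A column with millions of distinct values (e.g. a customer id) would create
# far too many small partitions; a column like Year stays small.
orders_df.select(
    F.approx_count_distinct("Year").alias("year_cardinality"),
    F.approx_count_distinct("CustomerId").alias("customerid_cardinality")
).show()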
Here are some tests I've done on partitions that you can refer to, using a sample orders dataset.
Add a new cell with the following code to save the DataFrame and partition the data by Year and Month:
orders_df.write.partitionBy("Year", "Month").mode("overwrite").parquet("Files/partitioned_data")
print("Transformed data saved!")
Check the Files folder to see whether the partition folders were created successfully.
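If you prefer to check from the notebook instead of the Lakehouse explorer, here is a minimal sketch that lists the partition folders (it assumes a Fabric notebook where mssparkutils is available):

from notebookutils import mssparkutils

# List the top-level partition folders (one per Year value),
# then the Month subfolders inside one of them.
for item in mssparkutils.fs.ls("Files/partitioned_data"):
    print(item.name)

for item in mssparkutils.fs.ls("Files/partitioned_data/Year=2021"):
    print(item.name)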
Add a new cell with the following code to load a new DataFrame from the partitioned orders data:
orders_2021_df = spark.read.format("parquet").load("Files/partitioned_data/Year=2021/Month=*")
display(orders_2021_df)
Note that the partitioned columns specified in the paths (Year and Month) are not included in the DataFrame.
When you append new data to a Delta table, Delta Lake automatically creates a new partition based on the specified partition column. If the partition already exists, the data is appended to the existing partition.
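To tie this back to the daily incremental load in the original question, here is a minimal sketch (the landing path and table name are placeholders); the incoming batch just needs to contain the partition columns:

# Read today's incremental batch (placeholder path).
new_orders_df = spark.read.format("parquet").load("Files/landing/orders_today.parquet")

# Appending is enough: Delta Lake routes rows into the existing Year/Month
# partitions and creates folders for any partition values it has not seen yet.
new_orders_df.write.format("delta").mode("append").saveAsTable("orders_partitioned")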
More information on partitioning can be found in the following documentation:
Adding and Deleting Partitions in Delta Lake tables | Delta Lake
If you have any other questions please feel free to contact me.
Best Regards,
Yang
Community Support Team
If any post helps, please consider accepting it as the solution to help other members find it more quickly.
If I have misunderstood your needs or you still have problems, please feel free to let us know. Thanks a lot!
Thanks, I have managed to create partitions on managed delta tables using both SQL and the Delta table API.
But I see that you are using the DataFrame API. Can you use partitionBy in this API with saveAsTable to Delta to create partitions within the table (as SQL / the Delta table API would do)?
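In other words, can something like the following create the partitions as part of saveAsTable? (The table name is just a placeholder.)

# DataFrame API sketch: partitionBy combined with saveAsTable.
orders_df.write.format("delta") \
    .partitionBy("Year", "Month") \
    .mode("overwrite") \
    .saveAsTable("orders_partitioned")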
@Anonymous
Also, do you know how I can verify the partitions on managed delta tables? I tried the following, but it did not work:
# create table with partition
spark.sql(f"""
CREATE TABLE IF NOT EXISTS {table_name} (
    {query_string}
)
USING DELTA
PARTITIONED BY ({partition_definition})
LOCATION '{table_location}'
""")

# verification
spark.sql("SHOW PARTITIONS StagingLakehouse.tbl").show()

# the above raises the following
AnalysisException: Table spark_catalog.StagingLakehouse.tbl does not support partition management.;
Hi @smpa01,
Thanks for the reply from frithjof_v.
To verify partitions on a managed delta table, there are the following methods:
1. Delta Lake stores partitioned data in a nested folder structure. You can navigate to where the table is stored and examine the folder structure to view the partitions.
2. Check the distinct values of the partition column with a SQL query (in this example, emp_id is the partition column of a table called people):
SELECT DISTINCT emp_id FROM people
3. Use the Delta Lake API to check for partitions.
from delta.tables import DeltaTable

# Load the managed table by name and list the distinct values of the partition column.
delta_table = DeltaTable.forName(spark, "people")
delta_table.toDF().select("emp_id").distinct().show()
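Since SHOW PARTITIONS is not supported on these tables, the table detail output is another option worth trying; it reports the partition columns (a sketch, and the exact behaviour may depend on the Delta Lake version in your runtime):

# DESCRIBE DETAIL returns one row per table, including a partitionColumns field.
spark.sql("DESCRIBE DETAIL StagingLakehouse.tbl").select("partitionColumns").show(truncate=False)

# The same information through the Delta Lake API.
DeltaTable.forName(spark, "StagingLakehouse.tbl").detail().select("partitionColumns").show(truncate=False)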
If you have any other questions please feel free to contact me.
Best Regards,
Yang
Community Support Team
If any post helps, please consider accepting it as the solution to help other members find it more quickly.
If I have misunderstood your needs or you still have problems, please feel free to let us know. Thanks a lot!
I found this article quite helpful
This article is not related to this question.
Delta Lake table optimization and V-Order - Microsoft Fabric | Microsoft Learn
Best practices — Delta Lake Documentation
You can partition a Delta table by a column. The most commonly used partition column is date. Follow these two rules of thumb for deciding on what column to partition by:
If the cardinality of a column will be very high, do not use that column for partitioning. For example, if you partition by a column userId and if there can be 1M distinct user IDs, then that is a bad partitioning strategy.
Amount of data in each partition: You can partition by a column if you expect data in that partition to be at least 1 GB.
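A rough way to sanity-check those two rules of thumb against real data is sketched below (it assumes the orders_df DataFrame from the earlier example and Year as the candidate column):

from pyspark.sql import functions as F

# Row counts per candidate partition value; heavy skew or many tiny groups is a
# warning sign. Multiply by an estimated row size in bytes to judge whether each
# partition would come anywhere near the ~1 GB guideline.
orders_df.groupBy("Year").count().orderBy("Year").show()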