smpa01
Super User

Seeking partition strategy

I am in the early phase of building ingestion and creating delta tables. I know that partitioning can improve performance, so I am seeking a partitioning strategy for the delta tables.

 

Should I partition on the column with the highest or the lowest cardinality?

 

Highest cardinality - more partitions. But at what cost?

Lowest cardinality - fewer partitions

 

Also, I am ingesting data incrementally into managed delta tables. How does the code handle new partitions from the new data arriving every day?

 

 


7 REPLIES
frithjof_v
Super User

Here is some discussion about partitioning that I found interesting: https://www.reddit.com/r/MicrosoftFabric/s/FF0ZV8JzzI

Anonymous
Not applicable

Hi @smpa01,

 

Thanks for the reply from lbendlin.

 

Partitioning by columns with a low cardinality is often recommended.

 

High cardinality: more partitions give better parallelism, but at the cost of managing many small files and the performance degradation that comes with too many small partitions.

Low cardinality: fewer partitions are easier to manage.

 

Here are some tests I've done on partitions that you can refer to. Here's my data:

[Screenshot: sample orders data]

 

Add a new cell with the following code to save the DataFrame, partitioning the data by Year and Month:

 

# Write the orders DataFrame as parquet files partitioned by Year and Month
orders_df.write.partitionBy("Year", "Month").mode("overwrite").parquet("Files/partitioned_data")
print("Transformed data saved!")

 

 

Check the Files folder to see whether the partition folders were created successfully.

[Screenshot: partition folders under Files/partitioned_data]

 

Add a new cell with the following code to load a new DataFrame from the Year 2021 partition folders:

 

# Read only the 2021 data; Month=* matches every month subfolder
orders_2021_df = spark.read.format("parquet").load("Files/partitioned_data/Year=2021/Month=*")
display(orders_2021_df)

 

[Screenshot: orders_2021_df filtered to Year 2021]

 

Note that the partition columns specified in the path (Year and Month) are not included in the resulting DataFrame.
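Conversely (a minimal sketch of standard Spark behavior), reading from the root of the partitioned folder lets Spark discover the partition folders and surface Year and Month as columns again:

# Reading the partition root: Spark parses the Year=.../Month=... folder
# names and adds Year and Month back as regular columns.
all_orders_df = spark.read.parquet("Files/partitioned_data")
all_orders_df.select("Year", "Month").distinct().show()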

 

When you append new data to a Delta table, Delta Lake automatically creates new partitions based on the specified partition columns. If a partition already exists, the data is appended to it.
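For example (a minimal sketch, with a hypothetical new_orders_df holding the day's incremental batch), a plain append is enough; the ingestion code needs no explicit partition bookkeeping:

# Appending an incremental batch to a partitioned Delta table: Delta creates
# Year=.../Month=... folders for unseen values and appends to existing ones.
new_orders_df.write.partitionBy("Year", "Month").mode("append").format("delta").saveAsTable("orders")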

 

More information on partitioning can be found in the following documentation:

Adding and Deleting Partitions in Delta Lake tables | Delta Lake

 

If you have any other questions please feel free to contact me.

 

Best Regards,
Yang
Community Support Team

 

If any post helps, please consider accepting it as the solution to help other members find it more quickly.
If I misunderstood your needs or you still have problems, please feel free to let us know. Thanks a lot!

Thanks, I have managed to create partitions on managed delta tables using both SQL and the Delta table API.

But I see that you are using the DataFrame API. Can you use partitionBy in this API with saveAsTable to Delta to create partitions within the table (as SQL / the Delta table API would do)?

 

df.write.partitionBy('emp_id').mode('append').format('delta').saveAsTable('people')

@Anonymous

 

Also, do you know how I can verify the partitions on managed delta tables? I tried the following, but it did not work:

 

# create table with partition
spark.sql(f"""
    CREATE TABLE IF NOT EXISTS {table_name} (
        {query_string}
    )
    USING DELTA
    PARTITIONED BY ({partition_definition})
    LOCATION '{table_location}'
""")

# verification
spark.sql("SHOW PARTITIONS StagingLakehouse.tbl").show()

# the above raises the following
# AnalysisException: Table spark_catalog.StagingLakehouse.tbl does not support partition management.;

 

Anonymous
Not applicable

Hi @smpa01,

 

Thanks for the reply from frithjof_v.

 

To verify partitions on a managed delta table, you can use the following methods:

 

1. Delta Lake stores partitioned data in a nested folder structure. You can navigate to where the table is stored and examine the folder structure to view the partitions.

[Screenshot: table folder structure showing one subfolder per partition]
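As a programmatic alternative (a hedged sketch assuming a Fabric notebook, where mssparkutils is available and the managed table people lives under the Lakehouse Tables/ area), you can list the table's folder:

# List the table folder; expect one subfolder per partition value,
# e.g. emp_id=1, emp_id=2, ...
for f in mssparkutils.fs.ls("Tables/people"):
    print(f.name)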

 

2. List the distinct values of the partition column with a SQL query; each value corresponds to one partition.

 

SELECT DISTINCT emp_id FROM people

 

 

3. Use the Delta Lake API to check for partitions.

 

from delta.tables import DeltaTable

# Load the managed table and list the distinct partition-column values
delta_table = DeltaTable.forName(spark, "people")
delta_table.toDF().select("emp_id").distinct().show()

 

[Screenshot: distinct emp_id values returned by the query]
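In addition (a hedged sketch; DESCRIBE DETAIL is standard Delta Lake SQL), you can ask the table for its partition columns directly, which works even where SHOW PARTITIONS is unsupported:

# DESCRIBE DETAIL returns one row of table metadata, including the
# partitionColumns field.
spark.sql("DESCRIBE DETAIL people").select("partitionColumns").show(truncate=False)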

 

If you have any other questions please feel free to contact me.

 

Best Regards,
Yang
Community Support Team

 

If any post helps, please consider accepting it as the solution to help other members find it more quickly.
If I misunderstood your needs or you still have problems, please feel free to let us know. Thanks a lot!

lbendlin
Super User

This article is not relevant to this question.


Delta Lake table optimization and V-Order - Microsoft Fabric | Microsoft Learn

Best practices — Delta Lake Documentation  

 

Choose the right partition column

You can partition a Delta table by a column. The most commonly used partition column is date. Follow these two rules of thumb for deciding on what column to partition by:

  • If the cardinality of a column will be very high, do not use that column for partitioning. For example, if you partition by a column userId and if there can be 1M distinct user IDs, then that is a bad partitioning strategy.

  • Amount of data in each partition: You can partition by a column if you expect data in that partition to be at least 1 GB.
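Before committing to a partition column, a quick check of candidate cardinalities can help apply these rules of thumb (a minimal sketch with a hypothetical df and column names):

from pyspark.sql import functions as F

# Count distinct values per candidate column; by the rules above, prefer a
# low-cardinality column such as a date part over a high-cardinality ID.
df.select(
    F.countDistinct("OrderDate").alias("distinct_order_dates"),
    F.countDistinct("CustomerID").alias("distinct_customer_ids"),
).show()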

     
