I am in the early phase of building ingestion and creating delta tables. I know that partitioning can improve performance, so I am looking for a partitioning strategy for the delta tables.
Should I partition on the column with the highest or the lowest cardinality?
Highest cardinality - more partitions. But at what cost?
Lowest cardinality - fewer partitions.
Also, I am ingesting data incrementally into managed delta tables. How does the code handle the new partitions from new data arriving every day?
Here is some discussion about partitioning that I found interesting: https://www.reddit.com/r/MicrosoftFabric/s/FF0ZV8JzzI
Hi @smpa01,
Thanks for the reply from lbendlin.
Partitioning by columns with low cardinality is often recommended.
High cardinality: more partitions give better parallelism, but at the cost of managing many small files and the performance degradation that comes with too many small partitions.
Low cardinality: fewer partitions are easier to manage.
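A quick way to check whether a candidate column is low or high cardinality before partitioning is sketched below (orders_df and the column names are placeholders; CustomerId is only a hypothetical high-cardinality example):

from pyspark.sql import functions as F

# Approximate distinct counts for candidate partition columns.
# A column with millions of distinct values (e.g. a customer id) would create
# far too many small partitions; a column like Year stays small.
orders_df.select(
    F.approx_count_distinct("Year").alias("year_cardinality"),
    F.approx_count_distinct("CustomerId").alias("customerid_cardinality")
).show()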
Here are some tests I've done on partitions that you can refer to, using a sample orders dataset.
Add a new cell with the following code to save the DataFrame and partition the data by Year and Month:
orders_df.write.partitionBy("Year", "Month").mode("overwrite").parquet("Files/partitioned_data")
print("Transformed data saved!")
Check the Files folder to see whether the partition folders were created successfully.
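If you prefer to check from the notebook instead of the Lakehouse explorer, here is a minimal sketch that lists the partition folders (it assumes a Fabric notebook where mssparkutils is available):

from notebookutils import mssparkutils

# List the top-level partition folders (one per Year value),
# then the Month subfolders inside one of them.
for item in mssparkutils.fs.ls("Files/partitioned_data"):
    print(item.name)

for item in mssparkutils.fs.ls("Files/partitioned_data/Year=2021"):
    print(item.name)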
Add a new cell with the following code to load a new DataFrame from the partitioned orders data:
orders_2021_df = spark.read.format("parquet").load("Files/partitioned_data/Year=2021/Month=*")
display(orders_2021_df)
Note that the partitioned columns specified in the paths (Year and Month) are not included in the DataFrame.
When you append new data to a Delta table, Delta Lake automatically creates a new partition based on the specified partition column. If the partition already exists, the data is appended to the existing partition.
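To tie this back to the daily incremental load in the original question, here is a minimal sketch (the landing path and table name are placeholders); the incoming batch just needs to contain the partition columns:

# Read today's incremental batch (placeholder path).
new_orders_df = spark.read.format("parquet").load("Files/landing/orders_today.parquet")

# Appending is enough: Delta Lake routes rows into the existing Year/Month
# partitions and creates folders for any partition values it has not seen yet.
new_orders_df.write.format("delta").mode("append").saveAsTable("orders_partitioned")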
More information on partitioning can be found in the following documentation:
Adding and Deleting Partitions in Delta Lake tables | Delta Lake
If you have any other questions please feel free to contact me.
Best Regards,
Yang
Community Support Team
If any post helps, please consider accepting it as the solution to help other members find it more quickly.
If I have misunderstood your needs or you still have problems, please feel free to let us know. Thanks a lot!
Thanks, I have managed to create partitions on managed delta tables using both SQL and the Delta table API.
But I see that you are using the DataFrame API. Can you use partitionBy in this API with saveAsTable to Delta to create partitions within the table (as SQL / the Delta table API would do)?
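In other words, can something like the following create the partitions as part of saveAsTable? (The table name is just a placeholder.)

# DataFrame API sketch: partitionBy combined with saveAsTable.
orders_df.write.format("delta") \
    .partitionBy("Year", "Month") \
    .mode("overwrite") \
    .saveAsTable("orders_partitioned")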
@Anonymous
Also, do you know how I can verify the partitions on managed delta tables? I tried the following, but it did not work:
# create table with partition
spark.sql(f"""
CREATE TABLE IF NOT EXISTS {table_name} (
    {query_string}
)
USING DELTA
PARTITIONED BY ({partition_definition})
LOCATION '{table_location}'
""")

# verification
spark.sql("SHOW PARTITIONS StagingLakehouse.tbl").show()

# the above raises the following
AnalysisException: Table spark_catalog.StagingLakehouse.tbl does not support partition management.;
Hi @smpa01,
Thanks for the reply from frithjof_v.
To verify partitions on a managed delta table, there are the following methods:
1. Delta Lake stores partitioned data in a nested folder structure. You can navigate to where the table is stored and examine the folder structure to view the partitions.
2. Check the distinct values of the partition column with a SQL query (in this example, emp_id is the partition column of a table called people):
SELECT DISTINCT emp_id FROM people
3. Use the Delta Lake API to check for partitions.
from delta.tables import DeltaTable

# Load the managed table by name and list the distinct values of the partition column.
delta_table = DeltaTable.forName(spark, "people")
delta_table.toDF().select("emp_id").distinct().show()
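Since SHOW PARTITIONS is not supported on these tables, the table detail output is another option worth trying; it reports the partition columns (a sketch, and the exact behaviour may depend on the Delta Lake version in your runtime):

# DESCRIBE DETAIL returns one row per table, including a partitionColumns field.
spark.sql("DESCRIBE DETAIL StagingLakehouse.tbl").select("partitionColumns").show(truncate=False)

# The same information through the Delta Lake API.
DeltaTable.forName(spark, "StagingLakehouse.tbl").detail().select("partitionColumns").show(truncate=False)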
If you have any other questions please feel free to contact me.
Best Regards,
Yang
Community Support Team
If any post helps, please consider accepting it as the solution to help other members find it more quickly.
If I have misunderstood your needs or you still have problems, please feel free to let us know. Thanks a lot!
I found this article quite helpful
This article is not related to this question.
Delta Lake table optimization and V-Order - Microsoft Fabric | Microsoft Learn
Best practices — Delta Lake Documentation
You can partition a Delta table by a column. The most commonly used partition column is date. Follow these two rules of thumb for deciding on what column to partition by:
If the cardinality of a column will be very high, do not use that column for partitioning. For example, if you partition by a column userId and if there can be 1M distinct user IDs, then that is a bad partitioning strategy.
Amount of data in each partition: You can partition by a column if you expect data in that partition to be at least 1 GB.
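A rough way to sanity-check those two rules of thumb against real data is sketched below (it assumes the orders_df DataFrame from the earlier example and Year as the candidate column):

from pyspark.sql import functions as F

# Row counts per candidate partition value; heavy skew or many tiny groups is a
# warning sign. Multiply by an estimated row size in bytes to judge whether each
# partition would come anywhere near the ~1 GB guideline.
orders_df.groupBy("Year").count().orderBy("Year").show()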