topic Re: Seeking partition strategy in Data Engineering

Seeking partition strategy

smpa01 — Thu, 08 Aug 2024 19:40:33 GMT

I am in early phase of ingestion-building and delta table creation. I know that having partition even increases the performance. Therefore, I am seeking a partitioning strategy on the delta tables.

Should I do it on the column with highest/lowest cardinality?

Highest cardinality - more partitions. But at what cost?

Lowest cardinality - lesser partitions

Also, I am ingesting data incrementally to managed delta tables. Hence, how does the code handle the new partitions from new data coming everyday?

Re: Seeking partition strategy

lbendlin — Thu, 08 Aug 2024 20:37:48 GMT

I found this article quite helpful

https://www.linkedin.com/pulse/50-shades-direct-lake-everything-you-need-know-new-power-nikola-ilic-we2if/

Re: Seeking partition strategy

smpa01 — Thu, 08 Aug 2024 21:50:03 GMT

This article is not relatable to this question.

Re: Seeking partition strategy

lbendlin — Thu, 08 Aug 2024 21:53:46 GMT

Delta Lake table optimization and V-Order - Microsoft Fabric | Microsoft Learn

Best practices — Delta Lake Documentation

Choose the right partition column

You can partition a Delta table by a column. The most commonly used partition column is date. Follow these two rules of thumb for deciding on what column to partition by:

If the cardinality of a column will be very high, do not use that column for partitioning. For example, if you partition by a column userId and if there can be 1M distinct user IDs, then that is a bad partitioning strategy.
Amount of data in each partition: You can partition by a column if you expect data in that partition to be at least 1 GB.

Re: Seeking partition strategy

Anonymous — Fri, 09 Aug 2024 02:43:23 GMT

Hi @smpa01 ,

Thanks for the reply from lbendlin .

Partitioning by columns with a low cardinality is often recommended.

High cardinality : more partitions result in better parallelism but at the cost of increased overhead of managing many small files and the performance degradation of too many small partitions.

Low cardinality : fewer partitions are easier to manage.

Here are some tests I've done on partitions that you can refer to. Here's my data:

Use the following code to add a new cell; save the DataFrame and partition the data by Year and Month:

orders_df.write.partitionBy("Year", "Month").mode("overwrite").parquet("Files/partitioned_data") print ("Transformed data saved!")

Check the Files folder to see if the partition folder was created successfully.

Use the following code to add a new cell to load a new data frame from the orders.parquet file:

orders_2021_df = spark.read.format("parquet").load("Files/partitioned_data/Year=2021/Month=*") display(orders_2021_df)

Note that the partitioned columns specified in the paths (Year and Month) are not included in the DataFrame.

When you append new data to a Delta table, Delta Lake automatically creates a new partition based on the specified partition column. If the partition already exists, the data is appended to the existing partition.

More information on partitioning can be found in the following documentation:

Adding and Deleting Partitions in Delta Lake tables | Delta Lake

If you have any other questions please feel free to contact me.

Best Regards,
Yang
Community Support Team

If there is any post helps, then please consider Accept it as the solution to help the other members find it more quickly.
If I misunderstand your needs or you still have problems on it, please feel free to let us know. Thanks a lot!

Re: Seeking partition strategy

smpa01 — Sat, 10 Aug 2024 00:40:29 GMT

thanks I have managed to create partitons on maged delta tables using both SQL and Delta table API.

But I see that you are using Dataframe API. Can you use partition by in this API when saveAsTable to Delta to create partitions within table (as SQL / Delta table API would do)

df.write.partitionBy('emp_id').mode('append').format('delta').saveAsTable('people')

@Anonymous

Also, do you know how can I verify the partitions on managed delta tables? I tried the following it did not work

//create table with partition spark.sql(f""" CREATE TABLE IF NOT EXISTS {table_name} ( {query_string} ) USING DELTA PARTITIONED BY ({partition_definition}) LOCATION '{table_location}' //verification spark.sql("SHOW PARTITIONS StagingLakehouse.tbl").show() //the above shows following AnalysisException: Table spark_catalog.StagingLakehouse.tbl does not support partition management.;

Re: Seeking partition strategy

frithjof_v — Sat, 10 Aug 2024 07:31:03 GMT

Here is some discussion about partitioning, I found it interesting: https://www.reddit.com/r/MicrosoftFabric/s/FF0ZV8JzzI

Re: Seeking partition strategy

Anonymous — Mon, 12 Aug 2024 06:19:30 GMT

Hi @smpa01 ,

Thanks for the reply from frithjof_v .

To verify partitions on a managed delta table, there are the following methods:

1. Delta Lake stores partitioned data in a nested catalog structure. You can navigate to where the table is stored and examine the catalog structure to view the partitions.

2. Check for the existence of partitions by using a SQL query.

SELECT DISTINCT emp_id FROM people

3. Use the Delta Lake API to check for partitions.

from delta.tables import DeltaTable delta_table = DeltaTable.forName(spark, "people") delta_table.toDF().select("emp_id ").distinct().show()

If you have any other questions please feel free to contact me.

Best Regards,
Yang
Community Support Team