Hello,
When looking into the optimal size of a partition, everyone mentions 1 GB per partition, but I am not sure whether they mean the sum of all files in a partition should be 1 GB, or that each individual file in the partition should not exceed 1 GB. For example, let's consider the following partition of a certain table in my lakehouse:
Another question is about the Parquet files themselves. I can see that multiple of them are written at the same second, for example the ones at 8/27/2024 7:53:03 AM. Do these correspond to new data, or are they different versions of the same data with extra rows appended, which would explain the slight increase in file size that can sometimes be observed?
Hi @mkj1213, in terms of size, 1 GB is optimal per file, not per partition. Having a 1 GB "limit" for a partition would not make sense, as the partition would be constrained by that limit. Please see Delta Lake Small File Compaction with OPTIMIZE | Delta Lake and look at the section "Delta Lake target file size".
In terms of Parquet files, they are immutable and cannot be modified once written. Any INSERTs or UPDATEs are handled by writing new Parquet files containing the relevant data, and the Delta log is then updated to reference these new files. DELETEs are handled separately in a "deletion vectors" file.
E.g. if you start with a new table and insert 3 rows, you'll have 1 Parquet file. If you then INSERT another row into the table, it will be added to the table but actually written as a new Parquet file. The table will now have 4 rows across 2 Parquet files.
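You can see this behaviour for yourself with a minimal sketch like the one below, runnable in a Fabric or Databricks notebook where `spark` is predefined (the table name demo_rows is just a placeholder for illustration):

```python
# Minimal sketch: appends create new Parquet files rather than modifying existing ones.
# "demo_rows" is a hypothetical table name.
df1 = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "val"])
df1.write.format("delta").mode("overwrite").saveAsTable("demo_rows")

# Appending a single row does not touch the existing Parquet file;
# Delta writes a brand new file and updates the transaction log.
df2 = spark.createDataFrame([(4, "d")], ["id", "val"])
df2.write.format("delta").mode("append").saveAsTable("demo_rows")

# numFiles should now report 2 files for the 4 rows.
spark.sql("DESCRIBE DETAIL demo_rows").select("numFiles", "sizeInBytes").show()
```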
Here is some discussion regarding partitioning and also the upcoming liquid clustering feature:
Thanks @frithjof_v for sharing. It seems very promising, as mentioned in the announcement from Databricks: Announcing General Availability of Liquid Clustering | Databricks Blog
For me, partitioning works well whenever I use PySpark to query the data. My table can be considered a large data table (currently around 776 billion rows). The problem I was trying to solve by trying different partitioning options was more related to querying the table from Power BI. In Power BI, I have around 1500 partition folders (corresponding to daily partitions for 4 years of data). Practically, what I am seeing is that Power BI is not able to work out which partition it should read from when a certain minute of a day is specified (via the filter pane). I had tried M dynamic parameters as well as AzureStorage.DataLake, with no observed performance enhancement.
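For context, the kind of daily partitioning described above could be set up along these lines; this is only a sketch, and the table name events, source table raw_events, and column names are assumptions rather than the actual schema from this thread:

```python
# Illustrative only: write a Delta table partitioned by day, so each day's data
# lands in its own folder (e.g. event_date=2024-08-27/). Names are placeholders.
from pyspark.sql import functions as F

df = spark.table("raw_events")  # hypothetical source table

(df.withColumn("event_date", F.to_date("event_timestamp"))
   .write.format("delta")
   .partitionBy("event_date")
   .mode("overwrite")
   .saveAsTable("events"))
```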
I must admit I have never worked with such large amounts of data (I am usually below 10 million rows), and I have never used partitioning in practice.
However, I have read about it so I can try to answer.
I'm trying to understand your scenario.
Are you using Power BI in import mode, and importing data from the Lakehouse to the Power BI model?
Do you want to import only specific time periods of data into your model, or do you want to import all the data into your model (I'm not sure if the latter would be feasible, given the size of your data)?
Or are you using Direct Lake mode?
https://www.sqlbi.com/blog/marco/2024/04/06/direct-lake-vs-import-mode-in-power-bi/
Have you partitioned the Lakehouse table on a column, e.g. a date column?
I think this blog article is interesting; I'm not sure if it is perfectly relevant to your case, but I find it worth reading:
https://blog.crossjoin.co.uk/2023/12/17/incremental-refresh-on-delta-tables-in-power-bi/
If you are using import mode, you can also consider connecting to the Lake part of the Lakehouse, or the SQL Analytics Endpoint.
Liquid clustering is going to be an interesting feature.
According to these docs, the recommendation is to have partitions which are 1GB or larger:
"Our recommendation is to use a column that would result in a partition of at least (or close to) 1 GB."
The docs provide a script to check partition sizes, I haven't tried it myself but you could check it out here:
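If you want a quick check directly in a notebook, a rough sketch like the following sums the Parquet file sizes per partition folder via Spark's Hadoop FileSystem API; the table path and partition column naming are assumptions you would need to adjust:

```python
# Rough sketch: sum Parquet file sizes per partition folder of a Delta table.
# The table path is a placeholder; replace it with your lakehouse table path.
from collections import defaultdict

table_path = "Tables/events"  # hypothetical lakehouse table path
jvm_path = spark._jvm.org.apache.hadoop.fs.Path
fs = jvm_path(table_path).getFileSystem(spark._jsc.hadoopConfiguration())

sizes = defaultdict(int)
for status in fs.listStatus(jvm_path(table_path)):
    # Partition folders look like "event_date=2024-08-27"
    if status.isDirectory() and "=" in status.getPath().getName():
        for f in fs.listStatus(status.getPath()):
            if f.getPath().getName().endswith(".parquet"):
                sizes[status.getPath().getName()] += f.getLen()

for partition, size in sorted(sizes.items()):
    print(f"{partition}: {size / 1024**3:.2f} GB")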
I'm wondering if UPDATE and MERGE will also be using deletion vectors in the future (if not already):
https://docs.delta.io/latest/delta-deletion-vectors.html
https://delta.io/blog/2023-07-05-deletion-vectors/
I guess deletion vectors improve the write performance, but have a negative effect on the read performance.
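For reference, deletion vectors are switched on per table via a table property; a minimal sketch, assuming a runtime and Delta version that supports the feature, and using a placeholder table name:

```python
# Sketch: enable deletion vectors so DELETE (and, where supported, UPDATE/MERGE)
# can avoid rewriting whole Parquet files. "events" is a placeholder table name.
spark.sql("""
    ALTER TABLE events
    SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'true')
""")
```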
Thanks for your reply @AndyDDC. I did run the OPTIMIZE command on my Delta table and I can see that a new file has been created that is approximately equal to the combined size of the smaller files in the partition folder. As an example:
The first file (the one created after optimization) was created on 9/1, while the other files, written as output of the transformation process, were written on August 27. One last question: do you recommend running VACUUM on the optimized partitions? Is that doable in PySpark, or should it be run from SQL?
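Both commands can be run from PySpark as well as from SQL; a minimal sketch using the DeltaTable API, with a placeholder table name and the default 7-day retention:

```python
# Sketch: run OPTIMIZE and VACUUM from PySpark (equivalently via spark.sql).
# "events" is a placeholder table name.
from delta.tables import DeltaTable

dt = DeltaTable.forName(spark, "events")

# Compact small Parquet files into larger ones (same effect as SQL: OPTIMIZE events)
dt.optimize().executeCompaction()

# Remove files no longer referenced by the Delta log once they are older than
# the retention period (default 168 hours / 7 days).
dt.vacuum()
```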
Hi @mkj1213 ,
1. When everyone mentions the size of 1 GB per partition, they typically mean that the sum of all files within a partition should be around 1 GB. For example, if you had two Parquet files in a partition, one being 500 MB and another being 600 MB, the total size of that partition would be 1.1 GB. This helps in achieving a balance between the number of partitions and the size of each partition, optimizing performance for both read and write operations.
---------------------------------------------------------------------------------------------------------------------------
update:
After reading Andy's response, I think I made a mistake.😢
Thank you once more for your response, @AndyDDC. I will continue to expand my knowledge on this subject.
When to partition tables on Azure Databricks - Azure Databricks | Microsoft Learn
Comprehensive Guide to Optimize Data Workloads | Databricks
2. From the picture you provided, it appears that multiple Parquet files were written in a very short period of time (specifically within a second or two). The size of each file varies, hovering around 700 MB.
However, in the Parquet format, once a file is written, it is not modified by appending more rows. So I presume these files contain new data. (Just speculating; you can check them further in a notebook.)
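One way to do that check, as a rough sketch (the file path is a placeholder; copy the actual path of one of the Parquet files from the table folder):

```python
# Sketch: read one specific Parquet file in a notebook to see which rows it contains.
single_file = "Tables/events/event_date=2024-08-27/part-00000-xxxx.snappy.parquet"
df = spark.read.parquet(single_file)
df.show(10)
print(df.count())
```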
Best Regards,
Gao
Community Support Team
Thanks for your reply. I am adding this comment about your point 2 above, just for future reference of anyone visiting this question. What I practically observed is that I found the same records (an ID related to a specific time) in two separate Parquet files that were written at the same moment. This was observed for some of the records and not all, so I am not sure whether data written in files at the same moment is new data.
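One way to investigate which file(s) a given record is actually read from is to expose the source file path per row; a hedged sketch, where the table name and the id filter are placeholders:

```python
# Sketch: show which Parquet file(s) each copy of a record is read from.
# "events" and the id value are placeholders.
from pyspark.sql.functions import input_file_name

(spark.table("events")
      .filter("id = 12345")
      .withColumn("source_file", input_file_name())
      .select("id", "source_file")
      .show(truncate=False))
```

Note that if the same record exists in more than one file on disk, the Delta log may only reference one of them as current, with the other kept as an older version until VACUUM removes it, so querying through the table (rather than listing the raw folder) shows what readers actually see.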