Jeanxyz
Power Participant

partitioning concept

I'm studying Apache Spark, and the concept of partitioning caught my interest. I would like to check whether my understanding below is correct:

Partitioning only works when writing data out as files, using the PySpark code below:

df.write \
  .mode("overwrite") \
  .partitionBy("department") \
  .parquet(output_path)

 

However, if I write data into tables, I am not allowed to partition the data. The code below would throw an error:


df.write.mode("overwrite").format("delta").partitionBy("department").saveAsTable("data_withoutpart")


So if my understanding is correct, partitioning is not useful in my case at all, since my data needs to be written into Delta tables in the end for use by Power BI reports.

 

Please correct me if I got it wrong or refer me to relevant tutorials.

 

 
 

6 REPLIES
Jeanxyz
Power Participant

You are right. I made a syntax mistake when writing partitioned tables. Below is the corrected version:

df2.write.mode("overwrite").format("delta").saveAsTable("data_withoutpart2", partitionBy="department")
Ugk161610
Continued Contributor

Hi @Jeanxyz ,

Your understanding is close, but there’s one important detail missing: you can partition Delta tables — it just depends on where and how the table is created. The error you’re seeing doesn’t mean Spark doesn’t support partitioning; it usually means the target table already exists without partitions, or Fabric created the table with a different structure than what your write command expects.

 

If the table already exists as a non-partitioned Delta table, Spark won’t let you overwrite it with a partitioned version. In that case, you’d have to drop the table first and then recreate it with partitionBy(). But writing a partitioned Delta table itself is completely supported.
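
For example, a minimal sketch of the drop-and-recreate approach (the table name here is just illustrative):

# Sketch: drop the existing non-partitioned table, then recreate it
# as a partitioned Delta table.
spark.sql("DROP TABLE IF EXISTS data_withoutpart")

df.write \
    .mode("overwrite") \
    .format("delta") \
    .partitionBy("department") \
    .saveAsTable("data_withoutpart")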

Also, even for Power BI scenarios, partitioning can help — especially when your dataset grows and you want faster refresh or more efficient incremental loads. Power BI doesn’t care whether the underlying data is partitioned; it only reads the Delta table through the SQL endpoint or Spark.

 

So in short:
Partitioning is definitely not useless for Delta tables. The issue comes from trying to overwrite an existing table with a different partition structure. If you create the Delta table fresh with partitionBy(), it will work normally.

 

Hope this helps. If so, please give a Kudos 👍 or mark as Accepted Solution ✔️

 

– Gopi Krishna

Ugk161610
Continued Contributor

Hi @Jeanxyz ,

Your understanding is close, but there’s an important detail to clarify. Partitioning does work when writing Delta tables — it doesn’t only apply to writing files. The issue you hit is just the syntax.

 

In PySpark, partitionBy() can either be called as a method on the writer or passed as a partitionBy keyword argument to saveAsTable(); both forms work. So your corrected syntax is the right one. When you write:

 

df.write \
.mode("overwrite") \
.format("delta") \
.partitionBy("department") \
.saveAsTable("data_with_partition")

 

Spark will create a fully valid partitioned Delta table, and this works perfectly fine with Power BI, Fabric Lakehouse, and SQL endpoint queries.

 

So partitioning is still useful even if your final destination is a Delta table. As long as your table has a natural grouping column with reasonable cardinality (like department, date, region, category, etc.), partitioning can improve performance for large datasets.
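
For instance, a query that filters on the partition column (illustrative names, assuming the table written above) only has to read the matching partition's files:

# Partition pruning: Spark reads only the department=HR directory
# instead of scanning the whole table.
hr_df = spark.table("data_with_partition").filter("department = 'HR'")
hr_df.show()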

 

Your updated example using:

 

saveAsTable("data_withoutpart2", partitionBy="department")

is also valid — it’s just an alternative syntax that Spark supports.

 

So yes, your data can be written as Delta and still be partitioned without any problem. You weren’t running into a conceptual limitation, just a small syntax detail.

 

Hope this helps. If it does, please give a Kudos 👍 or mark as Accepted Solution ✔️

 

– Gopi Krishna

Thanks for the clarification, @Ugk161610 .

One last question: if my Delta table is partitioned by 'department' but I'm running a merge query on 'employee_id', the partitioning won't help in that case. Is that correct?

 

Ugk161610
Continued Contributor

Hi @Jeanxyz ,

 

Yes, you’re absolutely right. If your Delta table is partitioned by department but your merge condition is based on employee_id, then the partitioning doesn’t really help during the merge. Spark will still need to scan all partitions to find the matching employee_id values, because the partition column isn’t part of the predicate.

 

Partitioning only speeds things up when your filter, join, or merge condition includes the partition column itself. For example, a merge or query that filters on department = 'HR' would benefit immediately, but a merge on employee_id won’t take advantage of department partitions.

 

So the general rule is:
Pick a partition column only if your downstream workloads actually filter or process data using that column. Otherwise, partitioning doesn’t provide much benefit.
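
As a rough sketch of the difference (names are illustrative, and updates stands in for a hypothetical DataFrame of changed rows):

from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "data_with_partition")

# Merge keyed only on employee_id: the partition column is not in the
# predicate, so Spark scans every department partition for matches.
target.alias("t").merge(
    updates.alias("s"),
    "t.employee_id = s.employee_id"
).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()

# If the source rows also carry the department, adding the partition
# column to the condition lets Delta prune to the matching partitions.
target.alias("t").merge(
    updates.alias("s"),
    "t.employee_id = s.employee_id AND t.department = s.department"
).whenMatchedUpdateAll().whenNotMatchedInsertAll().execute()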

 

Hope this helps — and glad the earlier correction was useful. If this answers your question, please mark it as Accepted Solution ✔️

 

– Gopi Krishna
