aa_tsl
Regular Visitor

Error/warnings during Delta table write in Spark Fabric Notebooks

Hello! I'm running into ShuffleMapTask "Pipe has no content" warnings that I believe could be slowing down a Delta table write operation in Fabric.

First, I read a CSV and do some data transformations and cleaning:

from pyspark.sql.functions import col
from pyspark.sql.types import StructType

billings_schema = StructType([
    # ...
])

billings_raw_df = spark.read \
    .option("header", "true") \
    .schema(billings_schema) \
    .csv(f"{datalake_url}/raw/{latest_file_name}")

billings_df = billings_raw_df.select(
    col("Transaction ID").alias("transaction_id"),
    # ...
    col("Pipeline Watermark").alias("pipeline_watermark")
).distinct()

Then I write that DataFrame into a Delta table that will live in Fabric's Lakehouse:

try:
    (billings_df
        .write
        .format("delta")
        .mode("overwrite")
        .option("overwriteSchema", "true")
        .saveAsTable(billings_table_name)
    )
except Exception as e:
    raise e

The write succeeds, but Spark logs warnings that look like this:

ShuffleMapTask
java.io.IOException
ExceptionFailure(java.io.IOException,
Pipe has no content; awaitReadable() returned false
/mnt/vegas/pipes/####.pipe, pos=#, blocksRead=#; bytesRead=#; availInPipe=#]
Vegas Service: … Abandoned pipe - unread by the time it was closed

Although the operation completes (with warnings), it takes more than a minute for a relatively small CSV with fewer than 2,000 rows, and I'm wondering whether the warnings indicate why it takes so long.

What I've tried so far:
* adding these settings:

 

 

spark.conf.set("spark.sql.shuffle.partitions", 200)
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")

 

 

* using partitionBy:

 

 

(billings_df
    .write
    .partitionBy("transaction_date")
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable(billings_table_name)
)

 

 

But I haven't had any luck so far. Has anyone else run into this error, or experienced slowness when writing Delta tables?

Note: the Fabric capacity SKU is F2.


7 REPLIES
Anonymous
Not applicable

Hi @aa_tsl,

I wanted to follow up since we haven't heard back from you regarding our last response. We hope your issue has been resolved.
If the community member's answer resolved your query, please mark it as "Accept as Solution" and select "Yes" if it was helpful.
If you need any further assistance, feel free to reach out.

Thank you,
Pavan.

Anonymous
Not applicable

Hi @aa_tsl,

I hope this information is helpful. Please let me know if you have any further questions or if you'd like to discuss this further. If this answers your question, kindly "Accept as Solution" and give it a 'Kudos' so others can find it easily.

Thank you,
Pavan.

Anonymous
Not applicable

Hi @aa_tsl,

I wanted to check if you had the opportunity to review the information provided. Please feel free to contact us if you have any further questions. If my response has addressed your query, please "Accept as Solution" and give a 'Kudos' so other members can easily find it.

Thank you,
Pavan.

Srisakthi
Super User

Hi @aa_tsl ,

 

Please check the following Spark configurations:

1. Ensure you are using the latest runtime (Runtime 1.3: Spark 3.5 and Delta 3.2).

2. Enable the Native Execution Engine (see the sketch below).

3. Check your Spark resource profile configs.

Ref - https://learn.microsoft.com/en-us/fabric/data-engineering/configure-resource-profile-configurations
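For point 2, the Native Execution Engine is normally turned on at the environment level or at session start. The following is a minimal per-session sketch, assuming the spark.native.enabled property described in the Fabric documentation (verify the exact property name for your runtime before relying on it):

%%configure
{
    "conf": {
        "spark.native.enabled": "true"
    }
}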

 

Based on my observations, writing to a Warehouse is comparatively slower than writing to a Lakehouse; loading data into a Lakehouse using Spark with the resource profile configs is faster.

 

Regards,

Srisakthi

aa_tsl
Regular Visitor

Hi @Anonymous, thanks for the reply.

 

I tried removing the .distinct() from the code, but it didn't have much effect 😕
It takes a lot of time for simple operations like this:

vendors_df = agents_df.filter(col("parent_id") == vendor_parent_id).withColumn(
    "role", lit("vendor")
)
vendors_ids = [
    row.agent_id for row in vendors_df.select("agent_id").collect()
]

advancing_vendors_df = agents_df.filter(
    col("parent_id") == advancing_vendor_parent_id
).withColumn("role", lit("advancing_vendor"))

advancing_vendors_ids = [
    row.agent_id for row in advancing_vendors_df.select("agent_id").collect()
]

It took 2 minutes to run for a <2k-row dataset in Fabric. When I run it locally (in a Docker container simulating a Spark environment), it takes seconds, which is much faster.

 

Would you recommend using Pandas for data manipulation instead? Or do you have other tips on how to make Spark more performant on Fabric?

Best,

Alex

Anonymous
Not applicable

Hi @aa_tsl,

Thank you for reaching out in Microsoft Community Forum.

The long execution time is likely due to Spark job overhead in Fabric's F2 SKU, especially when using .collect() on small datasets. Even simple operations can feel slow because of cluster orchestration time, not computation itself.

Please follow the steps below to resolve the issue:

1. Replace .collect() with .toPandas() for small datasets to reduce overhead (see the sketch below).

2. Keep transformations within Spark as long as possible (avoid switching to Python lists too early).

3. For <2000 rows, consider using Pandas instead of Spark; it's faster for small data.

4. Fabric's F2 SKU has higher latency for small jobs; if possible, test on F4 for better responsiveness.
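To illustrate step 1, here is a minimal sketch that reuses the DataFrame names from your earlier reply (agents_df, vendor_parent_id); it brings only the needed column to the driver once via .toPandas() instead of building a Python list with .collect():

from pyspark.sql.functions import col, lit

vendors_df = agents_df.filter(col("parent_id") == vendor_parent_id).withColumn(
    "role", lit("vendor")
)

# Pull a single small column to the driver as a pandas DataFrame,
# then convert it to a plain Python list.
vendors_ids = vendors_df.select("agent_id").toPandas()["agent_id"].tolist()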

Please continue using Microsoft Community Forum.

If this post helps resolve your issue, kindly consider marking it as "Accept as Solution" and give it a 'Kudos' to help others find it more easily.

Regards,
Pavan.

Anonymous
Not applicable

Hi @aa_tsl,

Thank you for reaching out in Microsoft Community Forum.

Please follow the steps below to resolve the issue:

1. Avoid using .distinct() on small datasets as it triggers expensive shuffling. Use .dropDuplicates() if deduplication is needed.

2. Do not use partitionBy for small datasets; let Spark handle partitioning automatically to reduce overhead.

3. Set spark.sql.shuffle.partitions to a lower value (e.g., 8) and use spark.sql.files.maxPartitionBytes to control partition sizes (see the sketch below).

4. Ensure sufficient resources (memory, CPU) in the Fabric F2 SKU cluster and set the log level to DEBUG for further insights into the issue.
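As a rough illustration of steps 3 and 4 (the values below are examples only; tune them for your data volume):

# Fewer shuffle partitions for a small dataset (step 3).
spark.conf.set("spark.sql.shuffle.partitions", "8")

# Keep input split sizes at Spark's 128 MB default (step 3); adjust the byte
# count to change how many input partitions get created.
spark.conf.set("spark.sql.files.maxPartitionBytes", str(128 * 1024 * 1024))

# More verbose logging for troubleshooting (step 4).
spark.sparkContext.setLogLevel("DEBUG")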

Please continue using Microsoft community forum.

If you found this post helpful, please consider marking it as "Accept as Solution" and give it a 'Kudos' to help other members find it more easily.

Regards,
Pavan.
