Hello! I'm having trouble with ShuffleMapTask "Pipe has no content" warnings that I believe could be slowing down a Delta table write operation in Fabric.
First, I read a CSV and do some data transformations and cleaning:
from pyspark.sql.functions import col
from pyspark.sql.types import StructType

billings_schema = StructType([
    # ...
])

billings_raw_df = spark.read \
    .option("header", "true") \
    .schema(billings_schema) \
    .csv(f"{datalake_url}/raw/{latest_file_name}")

billings_df = billings_raw_df.select(
    col("Transaction ID").alias("transaction_id"),
    # ...
    col("Pipeline Watermark").alias("pipeline_watermark")
).distinct()
The write succeeds, but Spark logs warnings that look like this:
ShuffleMapTask
java.io.IOException
ExceptionFailure(java.io.IOException,
Pipe has no content; awaitReadable() returned false
/mnt/vegas/pipes/####.pipe, pos=#, blocksRead=#; bytesRead=#; availInPipe=#]
Vegas Service: … Abandoned pipe - unread by the time it was closed
Then I write that DataFrame into a Delta table (which will live in Fabric's Lakehouse):
try:
    (billings_df
        .write
        .format("delta")
        .mode("overwrite")
        .option("overwriteSchema", "true")
        .saveAsTable(billings_table_name)
    )
except Exception as e:
    raise e
Although the operation completes (with the warnings above), it takes more than a minute for a relatively small CSV with fewer than 2,000 rows, and I'm wondering whether the warnings indicate why it takes so long.
What I've tried so far:
* adding these settings:
spark.conf.set("spark.sql.shuffle.partitions", 200)
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
* using partitionBy:
(billings_df
    .write
    .partitionBy("transaction_date")
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable(billings_table_name)
)
But I haven't had any luck so far. Has anyone experienced this error, or any slowness during Delta table writes?
Note: the Fabric capacity SKU is F2.
Hi @aa_tsl,
Thank you for reaching out on the Microsoft Community Forum.
The long execution time is likely due to Spark job overhead on Fabric's F2 SKU, especially when using .collect() on small datasets. Even simple operations can feel slow because of cluster orchestration time, not the computation itself.
Please follow the steps below to resolve the issue:
1. Replace .collect() with .toPandas() for small datasets to reduce overhead (see the sketch after this list).
2. Keep transformations within Spark as long as possible (avoid switching to Python lists too early).
3. For fewer than 2,000 rows, consider using Pandas instead of Spark; it's faster for small data.
4. Fabric's F2 SKU has higher latency for small jobs; if possible, test on F4 for better responsiveness.
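As a rough illustration of step 1, the sketch below swaps the per-row .collect() pattern from this thread for a single .toPandas() transfer. The vendors_df and agent_id names are borrowed from the later reply in this thread, and enabling Arrow is my assumption about where the speedup comes from:

spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")  # columnar transfer instead of pickled Rows

# Before: every row becomes a Python Row object on the driver
# vendors_ids = [row.agent_id for row in vendors_df.select("agent_id").collect()]

# After: one columnar transfer into pandas, then a plain Python list
vendors_ids = vendors_df.select("agent_id").toPandas()["agent_id"].tolist()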
Please continue using the Microsoft Community Forum.
If this post helps resolve your issue, kindly consider marking it as "Accept as Solution" and give it a 'Kudos' to help others find it more easily.
Regards,
Pavan.
Hi @aa_tsl,
Please check the following Spark configurations:
1. Ensure you are on the latest runtime (Runtime 1.3: Spark 3.5 and Delta 3.2).
2. Enable the Native Execution Engine.
3. Check your Spark resource profile configs (see the sketch after the reference link).
Ref - https://learn.microsoft.com/en-us/fabric/data-engineering/configure-resource-profile-configurations
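A minimal sketch of points 2 and 3, based on my reading of the linked docs for Runtime 1.3; both property names are assumptions to verify against your runtime version:

spark.conf.set("spark.native.enabled", "true")  # Native Execution Engine (assumed property name for Runtime 1.3)
spark.conf.set("spark.fabric.resourceProfile", "writeHeavy")  # write-optimized resource profile (assumed name from the docs)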
Based on my observations, writing to a Warehouse is comparatively slower than writing to a Lakehouse; loading data into the Lakehouse using Spark with the configured resource profiles is faster.
Regards,
Srisakthi
Hi @Anonymous, thanks for the reply.
I tried removing the .distinct() from the code, but it didn't have much effect 😕
It takes a lot of time for simple operations like this:
vendors_df = agents_df.filter(col("parent_id") == vendor_parent_id).withColumn(
    "role", lit("vendor")
)
vendors_ids = [
    row.agent_id for row in vendors_df.select("agent_id").collect()
]

advancing_vendors_df = agents_df.filter(
    col("parent_id") == advancing_vendor_parent_id
).withColumn("role", lit("advancing_vendor"))
advancing_vendors_ids = [
    row.agent_id for row in advancing_vendors_df.select("agent_id").collect()
]
It took 2 minutes to run for a <2k-row dataset in Fabric. When I run it locally (in a Docker container simulating a Spark environment), it takes seconds, much faster.
Would you recommend using Pandas for data manipulation instead? Or do you have other tips on how to make Spark more performant on Fabric?
Best,
Alex
Hi @aa_tsl,
Thank you for reaching out on the Microsoft Community Forum.
Please follow the steps below to resolve the issue:
1. Avoid using .distinct() on small datasets, as it triggers expensive shuffling. Use .dropDuplicates() if deduplication is needed.
2. Do not use partitionBy for small datasets; let Spark handle partitioning automatically to reduce overhead.
3. Set spark.sql.shuffle.partitions to a lower value (e.g., 8) and use spark.sql.files.maxPartitionBytes to control partition sizes (see the sketch after this list).
4. Ensure sufficient resources (memory, CPU) in the Fabric F2 SKU cluster and set the log level to DEBUG for further insight into the issue.
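A minimal sketch of items 3 and 4; the concrete values are illustrative choices for the thread's <2k-row dataset, not prescriptive:

spark.conf.set("spark.sql.shuffle.partitions", 8)  # far below the 200 default; tiny data needs few shuffle partitions
spark.conf.set("spark.sql.files.maxPartitionBytes", "64MB")  # cap input partition size (default is 128MB)
spark.sparkContext.setLogLevel("DEBUG")  # item 4: surface more detail around the pipe warnings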
Please continue using the Microsoft Community Forum.
If you found this post helpful, please consider marking it as "Accept as Solution" and giving it a 'Kudos' to help other members find it more easily.
Regards,
Pavan.