Hello! I'm having trouble with ShuffleMapTask "Pipe has no content" warnings that I believe could be slowing down a Delta table write operation in Fabric.
First, I read a CSV and do some data transformations and cleaning:
from pyspark.sql.functions import col
from pyspark.sql.types import StructType

billings_schema = StructType([
    # ...
])

billings_raw_df = spark.read \
    .option("header", "true") \
    .schema(billings_schema) \
    .csv(f"{datalake_url}/raw/{latest_file_name}")

billings_df = billings_raw_df.select(
    col("Transaction ID").alias("transaction_id"),
    # ...
    col("Pipeline Watermark").alias("pipeline_watermark")
).distinct()
The write succeeds, but Spark logs warnings that look like this:
ShuffleMapTask
java.io.IOException
ExceptionFailure(java.io.IOException,
Pipe has no content; awaitReadable() returned false
/mnt/vegas/pipes/####.pipe, pos=#, blocksRead=#; bytesRead=#; availInPipe=#]
Vegas Service: … Abandoned pipe - unread by the time it was closed
Then I write that DataFrame into a Delta table (which will live in Fabric's Lakehouse):
try:
    (billings_df
        .write
        .format("delta")
        .mode("overwrite")
        .option("overwriteSchema", "true")
        .saveAsTable(billings_table_name)
    )
except Exception as e:
    raise e
Although the operation completes (with the warnings above), it takes more than a minute for a relatively small CSV with fewer than 2,000 rows, and I'm wondering whether the warnings indicate why it takes so long.
What I've tried so far:
* adding these settings:
spark.conf.set("spark.sql.shuffle.partitions", 200)
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
* using partitionBy:
(billings_df
    .write
    .partitionBy("transaction_date")
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .saveAsTable(billings_table_name)
)
But I haven't had any luck so far. Has anyone experienced this error, or any slowness during Delta table writes?
Note: the Fabric capacity SKU is F2.
Hi @aa_tsl,
Thank you for reaching out on the Microsoft Community Forum.
The long execution time is likely due to Spark job overhead on Fabric's F2 SKU, especially when using .collect() on small datasets. Even simple operations can feel slow because of cluster orchestration time, not the computation itself.
Please follow the steps below to resolve the issue:
1. Replace .collect() with .toPandas() for small datasets to reduce overhead (see the sketch after this list).
2. Keep transformations within Spark as long as possible (avoid switching to Python lists too early).
3. For fewer than 2,000 rows, consider using Pandas instead of Spark; it's faster for small data.
4. Fabric's F2 SKU has higher latency for small jobs; if possible, test on F4 for better responsiveness.
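As a rough illustration of step 1, the sketch below swaps the per-row .collect() pattern from this thread for a single .toPandas() transfer. The vendors_df and agent_id names are borrowed from the later reply in this thread, and enabling Arrow is my assumption about where the speedup comes from:

spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")  # columnar transfer instead of pickled Rows

# Before: every row becomes a Python Row object on the driver
# vendors_ids = [row.agent_id for row in vendors_df.select("agent_id").collect()]

# After: one columnar transfer into pandas, then a plain Python list
vendors_ids = vendors_df.select("agent_id").toPandas()["agent_id"].tolist()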
Please continue using the Microsoft Community Forum.
If this post helps resolve your issue, kindly consider marking it as "Accept as Solution" and give it a 'Kudos' to help others find it more easily.
Regards,
Pavan.
Hi @aa_tsl,
Please check the following Spark configurations:
1. Ensure you are on the latest runtime (Runtime 1.3: Spark 3.5 and Delta 3.2).
2. Enable the Native Execution Engine.
3. Check your Spark resource profile configs (see the sketch after the reference link).
Ref - https://learn.microsoft.com/en-us/fabric/data-engineering/configure-resource-profile-configurations
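A minimal sketch of points 2 and 3, based on my reading of the linked docs for Runtime 1.3; both property names are assumptions to verify against your runtime version:

spark.conf.set("spark.native.enabled", "true")  # Native Execution Engine (assumed property name for Runtime 1.3)
spark.conf.set("spark.fabric.resourceProfile", "writeHeavy")  # write-optimized resource profile (assumed name from the docs)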
Based on my observations, writing to a Warehouse is comparatively slower than writing to a Lakehouse; loading data into the Lakehouse using Spark with the configured resource profiles is faster.
Regards,
Srisakthi
Hi @Anonymous, thanks for the reply.
I tried removing the .distinct() from the code, but it didn't have much effect 😕
It takes a lot of time for simple operations like this:
vendors_df = agents_df.filter(col("parent_id") == vendor_parent_id).withColumn(
    "role", lit("vendor")
)
vendors_ids = [
    row.agent_id for row in vendors_df.select("agent_id").collect()
]

advancing_vendors_df = agents_df.filter(
    col("parent_id") == advancing_vendor_parent_id
).withColumn("role", lit("advancing_vendor"))
advancing_vendors_ids = [
    row.agent_id for row in advancing_vendors_df.select("agent_id").collect()
]
It took 2 minutes to run for a <2k-row dataset in Fabric. When I run it locally (in a Docker container simulating a Spark environment), it takes seconds, much faster.
Would you recommend using Pandas for data manipulation instead? Or do you have other tips on how to make Spark more performant on Fabric?
Best,
Alex
Hi @aa_tsl,
Thank you for reaching out on the Microsoft Community Forum.
Please follow the steps below to resolve the issue:
1. Avoid using .distinct() on small datasets, as it triggers expensive shuffling. Use .dropDuplicates() if deduplication is needed.
2. Do not use partitionBy for small datasets; let Spark handle partitioning automatically to reduce overhead.
3. Set spark.sql.shuffle.partitions to a lower value (e.g., 8) and use spark.sql.files.maxPartitionBytes to control partition sizes (see the sketch after this list).
4. Ensure sufficient resources (memory, CPU) in the Fabric F2 SKU cluster and set the log level to DEBUG for further insight into the issue.
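A minimal sketch of items 3 and 4; the concrete values are illustrative choices for the thread's <2k-row dataset, not prescriptive:

spark.conf.set("spark.sql.shuffle.partitions", 8)  # far below the 200 default; tiny data needs few shuffle partitions
spark.conf.set("spark.sql.files.maxPartitionBytes", "64MB")  # cap input partition size (default is 128MB)
spark.sparkContext.setLogLevel("DEBUG")  # item 4: surface more detail around the pipe warnings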
Please continue using the Microsoft Community Forum.
If you found this post helpful, please consider marking it as "Accept as Solution" and giving it a 'Kudos' to help other members find it more easily.
Regards,
Pavan.