cfccai
Advocate II

Failed barrier ResultStage error when training an XGBoost model

Hello,

I came across an issue when using a PySpark notebook to train an XGBoost model.

Code snippet:

# imports assumed by this snippet
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType
from pyspark.ml.feature import VectorAssembler
from xgboost.spark import SparkXGBRegressor

# load sample data; cast the date parts to integers so they can go into the feature vector
data = spark.createDataFrame(df)
data = data.withColumn("Year", F.year("DateID").cast(IntegerType()))
data = data.withColumn("Month", F.month("DateID").cast(IntegerType()))

# assemble features into a single vector column
assembler = VectorAssembler(inputCols=["Year", "Month"], outputCol="features")
data = assembler.transform(data)

train, test = data.randomSplit([0.7, 0.3], seed=123)

# initialize the XGBoost regressor
xgb_regressor = SparkXGBRegressor(label_col="SalesOrderAmount", num_round=10)

# train the model
model_xgbregressor = xgb_regressor.fit(train)
 
When execution reaches the final statement, it reports this error:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe. : org.apache.spark.scheduler.BarrierJobRunWithDynamicAllocationException: [SPARK-24942]: Barrier execution mode does not support dynamic resource allocation for now. You can disable dynamic resource allocation by setting Spark conf "spark.dynamicAllocation.enabled" to "false".
 
So I went to the workspace settings and turned off dynamic allocation. In addition, I chose X-Large as the node size.
(screenshot: workspace Spark settings with dynamic allocation disabled and X-Large node size)
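For completeness, here is a minimal sketch of setting the same conf at the session level instead of in the workspace UI; it assumes the notebook supports the %%configure session magic and that the pool allows this conf to be overridden:

%%configure -f
{
    "conf": {
        "spark.dynamicAllocation.enabled": "false"
    }
}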

 

After rerunning the code, I got a different error message:
job aborted due to stage failure: could not recover from a failed barrier resultstage. most recent failure reason: stage failed because barrier task resulttask(13, 0) finished unsuccessfully. 
 
I'm lost here. Can someone help? Thanks a lot.
nilendraFabric
Continued Contributor

Hi @cfccai,

 

The error you are encountering, "failed barrier resultstage", when training an XGBoost model in PySpark is likely caused by a combination of issues related to Spark's barrier execution mode and resource allocation. Here's a detailed explanation:

Barrier Execution Mode Limitations:
XGBoost training in Spark uses barrier execution mode, which ensures that all tasks start simultaneously. However, this mode has strict requirements, such as sufficient resources to run all tasks concurrently. If these conditions are not met, the job fails with errors like "failed barrier resultstage" or "could not recover from a failed barrier ResultStage".
Dynamic Resource Allocation:
The initial error shows that dynamic resource allocation was enabled (spark.dynamicAllocation.enabled = true). Barrier execution mode does not support dynamic resource allocation because it requires a fixed number of resources to launch all tasks simultaneously. Disabling dynamic allocation was the correct step.
Insufficient Resources:
Even after disabling dynamic allocation, the second error indicates that there might not be enough resources (e.g., CPU cores or memory) to execute all tasks concurrently. Barrier tasks require all partitions to complete successfully, and any failure (e.g., due to insufficient resources or partition imbalance) will cause the stage to fail.
Partition Imbalance:
If some partitions are empty or unevenly distributed, barrier execution mode can fail. This is a common issue when using XGBoost with Spark: XGBoost repartitions the data automatically but may still end up with imbalances (a quick way to check parallelism and partition balance is sketched just below this list).
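A minimal diagnostic sketch, assuming spark is the active session and train is the training DataFrame from your snippet. Barrier mode needs one free task slot per XGBoost worker, and empty partitions show up as zero-row entries in the last output:

# how many task slots Spark sees by default on this pool
print("default parallelism:", spark.sparkContext.defaultParallelism)

# how the training data is split; zeros in the list below are empty partitions
print("train partitions:", train.rdd.getNumPartitions())
print("rows per partition:", train.rdd.glom().map(len).collect())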

 

 

Try adjusting the XGBoost parameters:

Reduce resource-intensive parameters such as max_depth or num_round. For instance:


xgb_regressor = SparkXGBRegressor(label_col="SalesOrderAmount", num_round=5, max_depth=3)


Set num_workers explicitly to control parallelism:


xgb_regressor = SparkXGBRegressor(label_col="SalesOrderAmount", num_round=10, num_workers=2)


If this post helps, please give it Kudos and consider accepting it as a solution to help other members find it more quickly.

MKinsight
Frequent Visitor

Has anyone figured this out?

I've hit exactly these two errors, in the same order. It seems like XGBoost should run on Fabric, yet it also seems like Fabric doesn't support it. The Microsoft Fabric docs don't list XGBoost models in the training guides, at least for now. There's SparkML, but it currently covers only simpler models, although it looks like SparkML will be the go-to library. We still need a working solution, please.

v-shex-msft
Community Support

Hi @cfccai,

It seems like you turned off the 'dynamic allocation' option, but the existing pool resources are not able to handle the current model. Have you tried reducing the amount of sample data (a quick sketch is below), or manually modifying the environment compute settings to give these operations more resources?
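For example, a minimal sketch of training on a fraction of the data to see whether the failure is resource-related (the 10% fraction is just an assumption):

train_small = train.sample(fraction=0.1, seed=123)  # try a small slice first
model_small = xgb_regressor.fit(train_small)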

Compute management in Fabric environments - Microsoft Fabric | Microsoft Learn

Spark pool node size:

Apache Spark compute for Data Engineering and Data Science - Microsoft Fabric | Microsoft Learn

Regards,

Xiaoxin Sheng

Community Support Team _ Xiaoxin
If this post helps, please consider accepting it as a solution to help other members find it more quickly.
