ca_solution
New Member

Microsoft Fabric - Default Environment - ML Model - Connection Refused error

I have been facing a connection refused error and don't understand the reason.
Any ideas about what's wrong here?

My Code:

from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from synapse.ml.lightgbm import LightGBMClassifier
from synapse.ml.train import ComputeModelStatistics
import mlflow
from pyspark.ml import Pipeline

# Start MLflow experiment
mlflow.set_experiment("HyperparameterTuning")

# Evaluator used by CrossValidator below (missing from the original
# snippet; the metric and column names here are assumptions)
evaluator = BinaryClassificationEvaluator(
    labelCol="label",
    rawPredictionCol="probability",
    metricName="areaUnderROC"
)

with mlflow.start_run() as run:
    # Set experiment tags to indicate hyperparameter tuning
    mlflow.set_tag("run_type", "hyperparameter_tuning")
    mlflow.set_tag("model_type", "LightGBM")
    mlflow.set_tag("experiment_purpose", "prediction_test")

    # Define the model
    lgbm = LightGBMClassifier(
        labelCol="label",
        featuresCol="features",
        featuresShapCol="shapValues",
        dataTransferMode="bulk",
        verbosity=1,
        boostingType="gbdt",
        maxBin=255,
        objective="binary"
    )

    # Define parameter grid
    paramGrid = ParamGridBuilder() \
        .addGrid(lgbm.numIterations, [50, 100, 200]) \
        .addGrid(lgbm.learningRate, [0.01, 0.05, 0.1]) \
        .addGrid(lgbm.numLeaves, [31, 64, 128]) \
        .addGrid(lgbm.isUnbalance, [True, False]) \
        .build()

    # Log the parameter grid as JSON
    mlflow.log_dict({"param_grid": [
        {"numIterations": [50, 100, 200]},
        {"learningRate": [0.01, 0.05, 0.1]},
        {"numLeaves": [31, 64, 128]},
        {"isUnbalance": [True, False]}
    ]}, "param_grid.json")

    # Set up CrossValidator
    cv = CrossValidator(
        estimator=lgbm,
        estimatorParamMaps=paramGrid,
        evaluator=evaluator,
        numFolds=3,
        parallelism=2
    )

    # Fit the model (final_train is assumed to be defined earlier)
    cvModel = cv.fit(final_train)

    # Log the best model
    mlflow.spark.log_model(cvModel.bestModel, "lightgbm_best_model")

    # Log best hyperparameters
    best_params = cvModel.bestModel.extractParamMap()
    best_params_dict = {param.name: best_params[param] for param in best_params}
    mlflow.log_params(best_params_dict)

    # Make predictions (final_test is assumed to be defined earlier)
    predictions = cvModel.bestModel.transform(final_test)

    # Calculate detailed statistics
    metrics_df = ComputeModelStatistics(
        evaluationMetric="classification",
        labelCol="label",
        scoredLabelsCol="prediction",
        scoresCol="probability"
    ).transform(predictions)

    metrics = metrics_df.first().asDict()

    # Log evaluation metrics
    mlflow.log_metrics({
        "Accuracy": metrics["accuracy"],
        "AUC": metrics["AUC"],
        "Precision": metrics["precision"],
        "Recall": metrics["recall"]
    })

    # Log confusion matrix
    confusion_matrix = metrics["confusion_matrix"].toArray().tolist()
    mlflow.log_dict({"confusion_matrix": confusion_matrix}, "confusion_matrix.json")

print("Hyperparameter Tuning Completed. Best Params and Metrics Logged to MLflow.")




Py4JJavaError: An error occurred while calling o37078.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 10 in stage 36.0 failed 4 times, most recent failure: Lost task 10.3 in stage 36.0 (TID 3493):
java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:412)
1 ACCEPTED SOLUTION
Poojara_D12
Super User

Hi @ca_solution 

You're encountering a Connection refused error during the execution of your PySpark code, which uses SynapseML's LightGBM classifier for hyperparameter tuning within a Spark environment. This error is not due to a coding mistake, but rather a network-level issue within the Spark cluster.

Specifically, SynapseML's LightGBM component relies on inter-node communication over TCP ports to coordinate training across distributed workers. If those ports are blocked, unavailable, or misconfigured, the Spark tasks will fail with a java.net.ConnectException, indicating they cannot reach each other. This can happen if the Spark environment (e.g., Azure Synapse, Databricks, HDInsight) has firewall restrictions, lacks sufficient resources (like CPU or memory), or is misconfigured for distributed LightGBM usage. It can also occur if too many parallel tasks are launched at once, exceeding what the cluster can handle, especially during cross-validation.

To resolve this, you should ensure that the environment allows node-to-node communication, possibly by opening the necessary ports (typically starting at 12400), reducing the level of parallelism, and reviewing the Spark cluster logs for more details. Ultimately, this is an infrastructure or environment issue that affects how the LightGBM workers connect across the distributed Spark cluster, and addressing it requires adjusting networking or resource configurations in your cluster setup.
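As a quick sanity check for the networking points above, here is a minimal stdlib sketch that probes whether a TCP port is reachable from the current node. This is illustrative only: "worker-host" is a placeholder, and 12400 reflects SynapseML's documented default LightGBM listen port (defaultListenPort).

```python
import socket

def port_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers ConnectionRefusedError, timeouts, unreachable hosts, etc.
        return False

# Example: probe the default LightGBM listen port on a worker node.
# Replace "worker-host" with an actual node address in your cluster:
# print(port_reachable("worker-host", 12400))
```

Running this from one node against another can tell you quickly whether the java.net.ConnectException is a firewall/port issue rather than a problem in the training code.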

 

Did I answer your question? Mark my post as a solution, this will help others!
If my response(s) assisted you in any way, don't forget to drop me a "Kudos"

Kind Regards,
Poojara - Proud to be a Super User
Data Analyst | MSBI Developer | Power BI Consultant
Consider Subscribing my YouTube for Beginners/Advance Concepts: https://youtube.com/@biconcepts?si=04iw9SYI2HN80HKS

View solution in original post

5 REPLIES
v-sdhruv
Community Support

Hi @ca_solution ,
Just wanted to check if you were able to review the suggestions provided.
If any of the responses have addressed your query, please accept it as a solution so other members can easily find it.
Thank you.

v-sdhruv
Community Support

Hi @ca_solution ,

Just wanted to check if you were able to review the suggestions provided.
Thank you @Poojara_D12 for your detailed explanation of the query. It's true that LightGBM in distributed mode can fail if ports between the nodes are not configured correctly, hence it's a setup issue.

If any of the responses have addressed your query, please accept it as a solution so other members can easily find it.
Thank you.

v-sdhruv
Community Support

Hi @ca_solution ,
Just wanted to check if you had the opportunity to review the solutions provided.
If the response has addressed your query, please accept it as a solution so other members can easily find it.
Thank You

v-sdhruv
Community Support

Hi @ca_solution ,

From the traceback, the error happens during the .fit() call, meaning Spark is trying to talk to a backend service or cluster node, and something isn't responding.

1. Make sure your Spark cluster or Synapse environment is up and fully initialized. If you're using Azure Synapse or a Spark pool, double-check that it's started and accepting jobs.
2. LightGBM in distributed mode can sometimes fail if ports between nodes aren't open.
3. You're using parallelism=2, which suggests multiple processes. It's possible that two processes are trying to bind to the same port or resource, causing a conflict.
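The port conflict in point 3 is easy to reproduce outside Spark. A minimal stdlib sketch, purely to illustrate the failure mode (not how SynapseML itself allocates ports):

```python
import socket

# Bind one listener to a port, then show that a second bind on the
# same address/port fails with OSError (errno EADDRINUSE).
first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
first.bind(("127.0.0.1", 0))          # the OS picks a free port
host, port = first.getsockname()

second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    second.bind((host, port))         # same port -> conflict
    conflict = False
except OSError:
    conflict = True
finally:
    second.close()
    first.close()

print(conflict)  # True: two sockets cannot bind the same port
```

Distributed LightGBM workers each need their own listen port, which is why reducing parallelism or widening the available port range can make this class of error go away.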

Try setting dataTransferMode="streaming" instead of "bulk"; sometimes that resolves strange socket issues.

Run a very small subset of your training data to see if the issue still occurs.

Hope this helps!

 
