ca_solution
New Member

Microsoft Fabric - Default Environment - ML Model - Connection Refused error

I have been facing a connection refused error and don't understand the reason.
Any ideas about what's wrong here?

My Code:

from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from synapse.ml.lightgbm import LightGBMClassifier
from synapse.ml.train import ComputeModelStatistics
import mlflow
from pyspark.ml import Pipeline

# Start MLflow experiment
mlflow.set_experiment("HyperparameterTuning")

# Evaluator used by CrossValidator below (missing from the original
# snippet; the metric and column names here are assumptions)
evaluator = BinaryClassificationEvaluator(
    labelCol="label",
    rawPredictionCol="probability",
    metricName="areaUnderROC"
)

with mlflow.start_run() as run:
    # Set experiment tags to indicate hyperparameter tuning
    mlflow.set_tag("run_type", "hyperparameter_tuning")
    mlflow.set_tag("model_type", "LightGBM")
    mlflow.set_tag("experiment_purpose", "prediction_test")

    # Define the model
    lgbm = LightGBMClassifier(
        labelCol="label",
        featuresCol="features",
        featuresShapCol="shapValues",
        dataTransferMode="bulk",
        verbosity=1,
        boostingType="gbdt",
        maxBin=255,
        objective="binary"
    )

    # Define parameter grid
    paramGrid = ParamGridBuilder() \
        .addGrid(lgbm.numIterations, [50, 100, 200]) \
        .addGrid(lgbm.learningRate, [0.01, 0.05, 0.1]) \
        .addGrid(lgbm.numLeaves, [31, 64, 128]) \
        .addGrid(lgbm.isUnbalance, [True, False]) \
        .build()

    # Log the parameter grid as JSON
    mlflow.log_dict({"param_grid": [
        {"numIterations": [50, 100, 200]},
        {"learningRate": [0.01, 0.05, 0.1]},
        {"numLeaves": [31, 64, 128]},
        {"isUnbalance": [True, False]}
    ]}, "param_grid.json")

    # Set up CrossValidator
    cv = CrossValidator(
        estimator=lgbm,
        estimatorParamMaps=paramGrid,
        evaluator=evaluator,
        numFolds=3,
        parallelism=2
    )

    # Fit the model (final_train is assumed to be defined earlier)
    cvModel = cv.fit(final_train)

    # Log the best model
    mlflow.spark.log_model(cvModel.bestModel, "lightgbm_best_model")

    # Log best hyperparameters
    best_params = cvModel.bestModel.extractParamMap()
    best_params_dict = {param.name: best_params[param] for param in best_params}
    mlflow.log_params(best_params_dict)

    # Make predictions (final_test is assumed to be defined earlier)
    predictions = cvModel.bestModel.transform(final_test)

    # Calculate detailed statistics
    metrics_df = ComputeModelStatistics(
        evaluationMetric="classification",
        labelCol="label",
        scoredLabelsCol="prediction",
        scoresCol="probability"
    ).transform(predictions)

    metrics = metrics_df.first().asDict()

    # Log evaluation metrics
    mlflow.log_metrics({
        "Accuracy": metrics["accuracy"],
        "AUC": metrics["AUC"],
        "Precision": metrics["precision"],
        "Recall": metrics["recall"]
    })

    # Log confusion matrix
    confusion_matrix = metrics["confusion_matrix"].toArray().tolist()
    mlflow.log_dict({"confusion_matrix": confusion_matrix}, "confusion_matrix.json")

print("Hyperparameter Tuning Completed. Best Params and Metrics Logged to MLflow.")




Py4JJavaError: An error occurred while calling o37078.fit.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 10 in stage 36.0 failed 4 times, most recent failure: Lost task 10.3 in stage 36.0 (TID 3493):
java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:412)
1 ACCEPTED SOLUTION
Poojara_D12
Super User

Hi @ca_solution 

You're encountering a Connection refused error during the execution of your PySpark code, which uses SynapseML's LightGBM classifier for hyperparameter tuning within a Spark environment. This error is not due to a coding mistake, but rather a network-level issue within the Spark cluster.

Specifically, SynapseML's LightGBM component relies on inter-node communication over TCP ports to coordinate training across distributed workers. If those ports are blocked, unavailable, or misconfigured, the Spark tasks will fail with a java.net.ConnectException, indicating they cannot reach each other. This can happen if the Spark environment (e.g., Azure Synapse, Databricks, HDInsight) has firewall restrictions, lacks sufficient resources (like CPU or memory), or is misconfigured for distributed LightGBM usage. It can also occur if too many parallel tasks are launched at once, exceeding what the cluster can handle, especially during cross-validation.

To resolve this, you should ensure that the environment allows node-to-node communication, possibly by opening the necessary ports (typically starting at 12400), reducing the level of parallelism, and reviewing the Spark cluster logs for more details. Ultimately, this is an infrastructure or environment issue that affects how the LightGBM workers connect across the distributed Spark cluster, and addressing it requires adjusting networking or resource configurations in your cluster setup.
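As a quick sanity check for the networking points above, here is a minimal stdlib sketch that probes whether a TCP port is reachable from the current node. This is illustrative only: "worker-host" is a placeholder, and 12400 reflects SynapseML's documented default LightGBM listen port (defaultListenPort).

```python
import socket

def port_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers ConnectionRefusedError, timeouts, unreachable hosts, etc.
        return False

# Example: probe the default LightGBM listen port on a worker node.
# Replace "worker-host" with an actual node address in your cluster:
# print(port_reachable("worker-host", 12400))
```

Running this from one node against another can tell you quickly whether the java.net.ConnectException is a firewall/port issue rather than a problem in the training code.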

 

Did I answer your question? Mark my post as a solution, this will help others!
If my response(s) assisted you in any way, don't forget to drop me a "Kudos"

Kind Regards,
Poojara - Proud to be a Super User
Data Analyst | MSBI Developer | Power BI Consultant
Consider Subscribing my YouTube for Beginners/Advance Concepts: https://youtube.com/@biconcepts?si=04iw9SYI2HN80HKS

View solution in original post

5 REPLIES
v-sdhruv
Community Support

Hi @ca_solution ,
Just wanted to check if you were able to review the suggestions provided.
If any of the responses have addressed your query, please accept it as a solution so other members can easily find it.
Thank you.

v-sdhruv
Community Support

Hi @ca_solution ,

Just wanted to check if you were able to review the suggestions provided.
Thank you @Poojara_D12 for your detailed explanation of the query. It's true that LightGBM in distributed mode can fail if ports between the nodes are not configured correctly, hence it's a setup issue.

If any of the responses have addressed your query, please accept it as a solution so other members can easily find it.
Thank you.

v-sdhruv
Community Support

Hi @ca_solution ,
Just wanted to check if you had the opportunity to review the solutions provided.
If the response has addressed your query, please accept it as a solution so other members can easily find it.
Thank You

v-sdhruv
Community Support

Hi @ca_solution ,

From the traceback, the error happens during the .fit() call, meaning Spark is trying to talk to a backend service or cluster node, and something isn't responding.

1. Make sure your Spark cluster or Synapse environment is up and fully initialized. If you're using Azure Synapse or a Spark pool, double-check that it's started and accepting jobs.
2. LightGBM in distributed mode can sometimes fail if ports between nodes aren't open.
3. You're using parallelism=2, which suggests multiple processes. It's possible that two processes are trying to bind to the same port or resource, causing a conflict.
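The port conflict in point 3 is easy to reproduce outside Spark. A minimal stdlib sketch, purely to illustrate the failure mode (not how SynapseML itself allocates ports):

```python
import socket

# Bind one listener to a port, then show that a second bind on the
# same address/port fails with OSError (errno EADDRINUSE).
first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
first.bind(("127.0.0.1", 0))          # the OS picks a free port
host, port = first.getsockname()

second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    second.bind((host, port))         # same port -> conflict
    conflict = False
except OSError:
    conflict = True
finally:
    second.close()
    first.close()

print(conflict)  # True: two sockets cannot bind the same port
```

Distributed LightGBM workers each need their own listen port, which is why reducing parallelism or widening the available port range can make this class of error go away.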

Try setting dataTransferMode="streaming" instead of "bulk"; sometimes that resolves strange socket issues.

Run a very small subset of your training data to see if the issue still occurs.

Hope this helps!

 
