Hi Fabric community,
I’m running a piece of code in a Fabric notebook that performs a few .collect() and .first() operations on a very small dataset (fewer than 5 columns). Intermittently, the job fails with a socket error, but a subsequent rerun often succeeds.
Retrying, or avoiding collect operations altogether, does help, but I wanted to understand the root cause of this error.
I found a Databricks Community post suggesting this can be caused by a Databricks runtime upgrade (Solved: [CANNOT_OPEN_SOCKET] Can not open socket — Databricks Community #134032).
Error message (representative):
PySparkRuntimeError: pyspark.errors.exceptions.base.PySparkRuntimeError: [CANNOT_OPEN_SOCKET] Can not open socket: ["tried to connect to ('123.4.5.67'), but an error occurred: [Errno 104] Connection reset by peer"]
Has anyone seen this behavior in Fabric notebooks and can suggest likely causes or mitigations?
Following up to confirm if the earlier responses addressed your query. If not, please share your questions and we’ll assist further.
Hi @navakanth_DE
We wanted to follow up to check if you’ve had an opportunity to review the previous responses. If you require further assistance, please don’t hesitate to let us know.
Hi @navakanth_DE,
@deborshi_nag - I don't think accepting that "this just happens" and advising users to "do less transformations" is an acceptable answer to this problem.
@navakanth_DE, I've never encountered this myself, but this is very much an issue if this is happening. I'd recommend opening a support ticket with Microsoft, that way they can dig into the telemetry from your tenant and get to the bottom of exactly what is going on, and if there is a bug in the platform, they can get it on the product team's roadmap to fix.
Proud to be a Super User!
Hey @tayloramy, I already raised a Microsoft support ticket about four months ago, but even they were not able to provide a proper solution for this.
The only recommendation they suggested was to apply a retry mechanism in the code when a socket error occurs.
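For anyone else hitting this, a retry mechanism along those lines might look roughly like the sketch below. The exception types and delays are assumptions, not anything Microsoft prescribed; in a Fabric notebook you would pass `PySparkRuntimeError` as the retryable type and wrap `df.collect()` or `df.first()` in the lambda:

```python
import time


def with_retry(action, retries=3, base_delay=2.0, retryable=(OSError,)):
    """Run `action` (e.g. lambda: df.collect()), retrying on socket-style errors.

    Backs off exponentially between attempts and re-raises after the
    final attempt so genuine failures still surface.
    """
    for attempt in range(retries + 1):
        try:
            return action()
        except retryable:
            if attempt == retries:
                raise
            # back off 2s, 4s, 8s, ... before re-running the Spark action
            time.sleep(base_delay * (2 ** attempt))


# Hypothetical usage in a Fabric notebook (df is a Spark DataFrame):
# from pyspark.errors import PySparkRuntimeError
# first_row = with_retry(lambda: df.first(), retryable=(PySparkRuntimeError,))
```

This doesn't fix the underlying socket reset, of course; it just papers over transient failures so the notebook run completes.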
Hello @tayloramy
To clarify, I’m not suggesting this behaviour is “acceptable” or expected from a user perspective, nor that the solution is simply to “do less work”.
The point I was making is about where the instability is introduced. In Fabric (and other managed Spark services), instability tends to surface specifically at action boundaries, where results are marshalled from the Spark driver back into the Python process over a socket.
Using more Spark transformations and fewer actions is not a general performance tip, but a way to reduce exposure to that driver‑to‑Python boundary. Each collect(), first(), or count() opens a new result channel; under capacity pressure or executor recycling, that channel can be reset even for very small datasets.
So the mitigation is not “do less transformations”, but batching result materialisation and being deliberate about when data is pulled into Python, until the underlying platform behaviour is improved.
Hello @navakanth_DE
The underlying reason for this error is that a Spark action such as .collect() or .first() forces Microsoft Fabric to send results back from the Spark driver and executors into the Python process, and that network connection is being reset while the action is completing. This is not related to dataset size or faulty code logic, and it can happen even when you are working with very small, simple datasets.
One contributing cause is Fabric’s use of managed, ephemeral compute with autosuspend and background rebalancing. Spark drivers and executors can be paused, recycled, or restarted due to capacity throttling or internal health checks, sometimes right in the middle of returning results to Python. When that happens, the driver‑to‑Python socket is dropped, which surfaces as a “connection reset by peer” error.
Another factor is running multiple small Spark actions in the same notebook. Each .collect() or .first() triggers a separate Spark job and opens a new result channel back to Python, increasing the number of round trips across that fragile boundary. Even though each action is cheap, the cumulative effect makes it more likely that one of those result transfers gets interrupted under capacity pressure.
You can reduce the likelihood of this issue by minimising round trips to Python and restructuring your code so fewer actions are executed overall. Push as much logic as possible into Spark transformations, and when you only need a small sample, prefer .take(n) or .limit(n) instead of repeated .collect() calls. If you do need data in Python, aim to do a single, controlled .collect() at the end rather than many small ones throughout the notebook.
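To make that concrete: instead of issuing `df.first()`, `df.count()`, and `df.collect()` as separate actions (three round trips across the driver-to-Python socket), you can materialise once and derive the small answers in plain Python. The helper below is a hypothetical, Spark-free sketch that just operates on the list of rows a single `collect()` returns:

```python
def summarize(rows, sample_size=5):
    """Derive several small results from one materialised result set.

    `rows` is the list returned by a single df.collect() (or df.take(n))
    call, so the driver-to-Python socket is crossed once instead of
    once per action.
    """
    return {
        "first": rows[0] if rows else None,  # replaces df.first()
        "count": len(rows),                  # replaces df.count() (only
                                             # accurate if the whole frame
                                             # was collected, not limit(n))
        "sample": rows[:sample_size],        # replaces repeated df.take(n)
    }


# Hypothetical usage in the notebook:
# result = summarize(df.limit(100).collect())
```

The point is not the helper itself but the shape of the code: one action at the end, everything after it in local Python.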
Source: this explanation was generated with Microsoft Copilot.
Hi @navakanth_DE,
I have not encountered this in my Fabric environments before. Can you tell us what your data sources are?
Is this an issue when reading the data from source, when writing it to a fabric datastore as a target, or is this an issue in processing the data once it's already been loaded to a spark dataframe?
Proud to be a Super User!
Hey @tayloramy ,
I'm getting this error while processing data from a DataFrame. For example, I read the data from the Delta lake and try to fetch a record using a collect/first call on the DataFrame.