navakanth_DE
Frequent Visitor

Notebook Error: CANNOT_OPEN_SOCKET in collect()

Hi Fabric community,

 

I’m running a piece of code in a Fabric notebook that performs a few .collect() and .first() operations on a very small dataset (fewer than 5 columns). Intermittently, the job fails with a socket error, but a subsequent run often succeeds.

Retrying or avoiding collect operations would help, but I wanted to understand the root cause of this error.

 

I found a Databricks Community post suggesting this can be caused by a Databricks runtime upgrade (Solved: [CANNOT_OPEN_SOCKET] Can not open socket — Databricks Community #134032)

 

Error message (representative):

PySparkRuntimeError: pyspark.errors.exceptions.base.PySparkRuntimeError: [CANNOT_OPEN_SOCKET] Can not open socket: ["tried to connect to ('123.4.5.67'), but an error occurred: [Errno 104] Connection reset by peer"]

 

Has anyone seen this behavior in Fabric notebooks and can suggest likely causes or mitigations?

ati_puri
Resolver I

Hi,

Generally it is not advised to use .collect() even when working with a small subset of data, because it pulls the entire result back into driver memory. The round trip between the driver and the workers while transferring the result also makes it slower than alternatives such as .take(). It is advisable to use .take(), bound results with .limit(), cache or persist results that are reused, and clear the Spark cache to release driver memory.
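To make that concrete, here is a minimal sketch (assuming a Fabric notebook where `spark` is predefined; spark.range() stands in for a real DataFrame):

```python
# Minimal sketch; `spark` is predefined in Fabric notebooks,
# and spark.range() stands in for your real DataFrame.
df = spark.range(100)

sample_rows = df.take(5)   # moves only 5 rows to the driver
small_df = df.limit(5)     # still a DataFrame; nothing crosses to Python yet

df.cache()                 # cache only when the same result is reused...
total = df.count()
first_row = df.first()
df.unpersist()             # ...and release the memory when you are done
```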
Thanks

Hi @navakanth_DE 

Following up to confirm if the earlier responses addressed your query. If not, please share your questions and we’ll assist further.

 

 

Hi @navakanth_DE 

We wanted to follow up to check if you’ve had an opportunity to review the previous responses. If you require further assistance, please don’t hesitate to let us know.

tayloramy
Super User

Hi @navakanth_DE

 

@deborshi_nag - I don't think accepting that "this just happens" and advising users to "do less transformations" is an acceptable answer to this problem. 

 

@navakanth_DE, I've never encountered this myself, but if it is happening, it is very much a platform issue. I'd recommend opening a support ticket with Microsoft so they can dig into the telemetry from your tenant and get to the bottom of exactly what is going on; if there is a bug in the platform, they can get it onto the product team's roadmap to fix.

 










Hey @tayloramy, I already raised a Microsoft support ticket about four months ago, but even they have not been able to provide a proper solution for this.

The only recommendation they made is to apply a retry mechanism in the code when a socket error occurs.
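For reference, that retry pattern might look something like the sketch below (a hedged illustration only; the helper name, attempt count, and backoff values are my own assumptions, not Microsoft's exact guidance):

```python
import time
from pyspark.errors import PySparkRuntimeError

def collect_with_retry(df, attempts=3, backoff_seconds=5):
    """Retry .collect() when the driver-to-Python socket is reset.

    Illustrative helper only; tune attempts/backoff for your workload.
    """
    for attempt in range(1, attempts + 1):
        try:
            return df.collect()
        except PySparkRuntimeError as exc:
            # Re-raise anything that is not the socket error, and the final failure.
            if "CANNOT_OPEN_SOCKET" not in str(exc) or attempt == attempts:
                raise
            time.sleep(backoff_seconds * attempt)  # simple linear backoff

# rows = collect_with_retry(df)
```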

Hello @tayloramy 

 

To clarify, I’m not suggesting this behaviour is “acceptable” or expected from a user perspective, nor that the solution is simply to “do less work”.

 

The point I was making is about where the instability is introduced. In Fabric (and other managed Spark services), instability tends to surface specifically at action boundaries, where results are marshalled from the Spark driver back into the Python process over a socket.

 

Using more Spark transformations and fewer actions is not just a general performance tip; it reduces exposure to that driver-to-Python boundary. Each collect(), first(), or count() opens a new result channel, and under capacity pressure or executor recycling that channel can be reset even for very small datasets.

 

So the mitigation is not "do less transformations" but to batch result materialisation and to be deliberate about when data is pulled into Python, until the underlying platform behaviour improves.
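To make "batching result materialisation" concrete, here is a sketch (spark.range() and the column names are stand-ins) that replaces three separate actions with a single one:

```python
from pyspark.sql import functions as F

df = spark.range(100)  # stand-in for a real DataFrame

# Three actions open three result channels back to Python:
#   n  = df.count()
#   lo = df.agg(F.min("id")).first()[0]
#   hi = df.agg(F.max("id")).first()[0]

# One action crosses the driver-to-Python boundary exactly once:
summary = df.agg(
    F.count("*").alias("n"),
    F.min("id").alias("lo"),
    F.max("id").alias("hi"),
).first()
n, lo, hi = summary["n"], summary["lo"], summary["hi"]
```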

 

I trust this will be helpful. If you found this guidance useful, you are welcome to acknowledge with a Kudos or by marking it as a Solution.
deborshi_nag
Resident Rockstar

Hello @navakanth_DE 

 

The underlying reason for this error is that a Spark action such as .collect() or .first() forces Microsoft Fabric to send results back from the Spark driver and executors into the Python process, and that network connection is being reset while the action is completing. This is not related to dataset size or faulty code logic, and it can happen even when you are working with very small, simple datasets.

 

One contributing cause is Fabric’s use of managed, ephemeral compute with autosuspend and background rebalancing. Spark drivers and executors can be paused, recycled, or restarted due to capacity throttling or internal health checks, sometimes right in the middle of returning results to Python. When that happens, the driver‑to‑Python socket is dropped, which surfaces as a “connection reset by peer” error.

 

Another factor is running multiple small Spark actions in the same notebook. Each .collect() or .first() triggers a separate Spark job and opens a new result channel back to Python, increasing the number of round trips across that fragile boundary. Even though each action is cheap, the cumulative effect makes it more likely that one of those result transfers gets interrupted under capacity pressure.

 

You can reduce the likelihood of this issue by minimising round trips to Python and restructuring your code so fewer actions are executed overall. Push as much logic as possible into Spark transformations, and when you only need a small sample, prefer .take(n) or .limit(n) instead of repeated .collect() calls. If you do need data in Python, aim to do a single, controlled .collect() at the end rather than many small ones throughout the notebook.
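As a sketch of that shape (the Delta table path and column names below are illustrative, not from the original post): keep the logic in lazy transformations, bound the result, and cross the boundary once at the end.

```python
from pyspark.sql import functions as F

# Illustrative path/columns; `spark` is predefined in Fabric notebooks.
df = spark.read.format("delta").load("Tables/my_table")

result = (
    df.filter(F.col("status") == "active")  # transformations stay lazy...
      .select("id", "status")
      .limit(100)                           # ...and bound what gets transferred
)

rows = result.collect()  # single, controlled action at the end
```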

 

This explanation was drafted with help from Microsoft Copilot.

 

I trust this will be helpful. If you found this guidance useful, you are welcome to acknowledge with a Kudos or by marking it as a Solution.
tayloramy
Super User

Hi @navakanth_DE

 

I have not encountered this in my Fabric environments before. Can you tell us what your data sources are?

 

Is this an issue when reading the data from the source, when writing it to a Fabric datastore as a target, or when processing the data once it has already been loaded into a Spark DataFrame?

 










Hey @tayloramy,

I'm getting this error while processing data from a DataFrame. For example, I read the data from a Delta Lake table and then try to fetch a record using a collect/first call on the DataFrame.
