
trutz
Advocate III

MS Fabric Notebook: Error when using dataframe.to_json() with Fabric Lakehouse storage

Hi all,

I'm getting the error below when using the pandas DataFrame method to_json() on a DataFrame with a size (dataframe.size) of 4263942.
Smaller DataFrames can be saved without issue.
The file path and name are just a standard abfss path, and the same path works fine with smaller files.

Any ideas how this could be solved? Is this an issue with the adlfs library used by pandas?
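The call looks roughly like this (a minimal sketch; the abfss path and the stand-in DataFrame below are hypothetical placeholders, not the actual values):

import pandas as pd

# Hypothetical abfss path into a Fabric Lakehouse Files folder; the real
# workspace, lakehouse and file names differ.
filenameActivity = (
    "abfss://<workspace>@onelake.dfs.fabric.microsoft.com/"
    "<lakehouse>.Lakehouse/Files/activity.json"
)

# Stand-in DataFrame; the real one has df.size of roughly 4.2 million elements.
df = pd.DataFrame({"id": range(1_000_000), "value": "x"})

# Smaller frames write fine; at this size the write fails on close with
# "RuntimeError: Failed to upload block ... Internal Server Error".
df.to_json(filenameActivity, orient="records", lines=True)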

 

RuntimeError                              Traceback (most recent call last)
Cell In[35], line 86
---> 86 df.to_json(filenameActivity,orient="records",lines=True)
     87 print("File has been saved as", filenameActivity)

File ~/cluster-env/trident_env/lib/python3.10/site-packages/pandas/core/generic.py:2650, in NDFrame.to_json(self, path_or_buf, orient, date_format, double_precision, force_ascii, date_unit, default_handler, lines, compression, index, indent, storage_options)
   2647 config.is_nonnegative_int(indent)
   2648 indent = indent or 0
-> 2650 return json.to_json(
   2651     path_or_buf=path_or_buf,
   2652     obj=self,
   2653     orient=orient,
   2654     date_format=date_format,
   2655     double_precision=double_precision,
   2656     force_ascii=force_ascii,
   2657     date_unit=date_unit,
   2658     default_handler=default_handler,
   2659     lines=lines,
   2660     compression=compression,
   2661     index=index,
   2662     indent=indent,
   2663     storage_options=storage_options,
   2664 )

File ~/cluster-env/trident_env/lib/python3.10/site-packages/pandas/io/json/_json.py:178, in to_json(path_or_buf, obj, orient, date_format, double_precision, force_ascii, date_unit, default_handler, lines, compression, index, indent, storage_options)
    174     s = convert_to_line_delimits(s)
    176 if path_or_buf is not None:
    177     # apply compression and byte/text conversion
--> 178     with get_handle(
    179         path_or_buf, "w", compression=compression, storage_options=storage_options
    180     ) as handles:
    181         handles.handle.write(s)
    182 else:

File ~/cluster-env/trident_env/lib/python3.10/site-packages/pandas/io/common.py:133, in IOHandles.__exit__(self, *args)
    132 def __exit__(self, *args: Any) -> None:
--> 133     self.close()

File ~/cluster-env/trident_env/lib/python3.10/site-packages/pandas/io/common.py:125, in IOHandles.close(self)
    123     self.created_handles.remove(self.handle)
    124 for handle in self.created_handles:
--> 125     handle.close()
    126 self.created_handles = []
    127 self.is_wrapped = False

File ~/cluster-env/trident_env/lib/python3.10/site-packages/adlfs/spec.py:1919, in AzureBlobFile.close(self)
   1917 """Close file and azure client."""
   1918 asyncio.run_coroutine_threadsafe(close_container_client(self), loop=self.loop)
-> 1919 super().close()

File ~/cluster-env/trident_env/lib/python3.10/site-packages/fsspec/spec.py:1789, in AbstractBufferedFile.close(self)
   1787 else:
   1788     if not self.forced:
-> 1789         self.flush(force=True)
   1791     if self.fs is not None:
   1792         self.fs.invalidate_cache(self.path)

File ~/cluster-env/trident_env/lib/python3.10/site-packages/fsspec/spec.py:1660, in AbstractBufferedFile.flush(self, force)
   1657         self.closed = True
   1658         raise
-> 1660 if self._upload_chunk(final=force) is not False:
   1661     self.offset += self.buffer.seek(0, 2)
   1662     self.buffer = io.BytesIO()

File ~/cluster-env/trident_env/lib/python3.10/site-packages/fsspec/asyn.py:115, in sync_wrapper.<locals>.wrapper(*args, **kwargs)
    112 @functools.wraps(func)
    113 def wrapper(*args, **kwargs):
    114     self = obj or args[0]
--> 115     return sync(self.loop, func, *args, **kwargs)

File ~/cluster-env/trident_env/lib/python3.10/site-packages/fsspec/asyn.py:100, in sync(loop, func, timeout, *args, **kwargs)
     98     raise FSTimeoutError from return_result
     99 elif isinstance(return_result, BaseException):
--> 100     raise return_result
    101 else:
    102     return return_result

File ~/cluster-env/trident_env/lib/python3.10/site-packages/fsspec/asyn.py:55, in _runner(event, coro, result, timeout)
     53     coro = asyncio.wait_for(coro, timeout=timeout)
     54 try:
---> 55     result[0] = await coro
     56 except Exception as ex:
     57     result[0] = ex

File ~/cluster-env/trident_env/lib/python3.10/site-packages/adlfs/spec.py:2083, in AzureBlobFile._async_upload_chunk(self, final, **kwargs)
   2079                 await bc.commit_block_list(
   2080                     block_list=block_list, metadata=self.metadata
   2081                 )
   2082         else:
-> 2083             raise RuntimeError(f"Failed to upload block{e}!") from e
   2084 elif self.mode == "ab":
   2085     async with self.container_client.get_blob_client(blob=self.blob) as bc:

RuntimeError: Failed to upload blockInternal Server Error

 

1 ACCEPTED SOLUTION

Hi,
I switched from JSON to Parquet and had no issue with storing the data. So for the moment I circumvented the issue.
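For anyone landing here with the same error, a minimal sketch of this workaround (the path and data below are hypothetical placeholders):

import pandas as pd

# Hypothetical abfss path into the Lakehouse Files area; adjust to your workspace.
target = (
    "abfss://<workspace>@onelake.dfs.fabric.microsoft.com/"
    "<lakehouse>.Lakehouse/Files/activity.parquet"
)

df = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})  # stand-in data

# Writing the same data as Parquet goes through the same fsspec/adlfs stack
# but did not trigger the "Failed to upload block" error in this case.
df.to_parquet(target, index=False)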


10 REPLIES
trutz
Advocate III

In the meantime I switched the method to dataframe.to_parquet() so I could continue with the data exploration. An update on the bug would be helpful nonetheless. Thanks for your help so far.

Hi @trutz ,

The internal team has told me that there is no such issue on their side. They were able to repro your scenario successfully using 5M records. I have attached the screenshots for your reference.
So, can you please confirm whether this is a consistent error or a one-time glitch thrown due to the Internal Server Error?


[Screenshot attached: vnikhilanmsft_0-1695897915398.png]

 

I received this error on consecutive days.
I used the orient="records" option. Maybe this has something to do with the error.
Maybe my JSON data contains a character that is not properly escaped by the pandas library and then leads to errors in the underlying libraries.
Unfortunately the data is sensitive, so I can't provide an example.
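One way to narrow this down without sharing the data is to split serialization from the upload (a sketch only; the path and stand-in DataFrame are placeholders, and it assumes fsspec/adlfs are available in the notebook, as the traceback suggests):

import fsspec
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "text": ["a", "b"]})  # stand-in for the real DataFrame

# Step 1: serialize in memory only. If this succeeds, pandas' JSON escaping
# is probably not the culprit.
json_str = df.to_json(orient="records", lines=True)

# Step 2: upload the already-serialized string through fsspec/adlfs directly.
# If this step fails with the same "Failed to upload block" error, the
# problem sits in the storage layer rather than in to_json() itself.
target = (
    "abfss://<workspace>@onelake.dfs.fabric.microsoft.com/"
    "<lakehouse>.Lakehouse/Files/activity.json"  # placeholder path
)
with fsspec.open(target, "w") as f:
    f.write(json_str)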

Hi @trutz ,

Apologies for the issue you are facing. To rule out a data issue, you can try validating the JSON to see whether the data is in proper JSON format. This would help us identify whether the issue comes from the data or from the adlfs package.
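For example, a quick validation sketch that keeps everything local (the stand-in DataFrame is hypothetical; replace it with the real one):

import json
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "text": ["line\nbreak", 'quote"inside']})  # stand-in data

# Serialize exactly as in the failing call, but keep the result in memory.
payload = df.to_json(orient="records", lines=True)

# With lines=True every line must be a standalone JSON document; parse each
# one to confirm pandas escaped the content correctly.
for i, line in enumerate(payload.splitlines()):
    try:
        json.loads(line)
    except json.JSONDecodeError as exc:
        print(f"Invalid JSON on record {i}: {exc}")
        break
else:
    print("All records parsed as valid JSON.")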

Hi @trutz ,
We haven't heard from you since the last response and were just checking back to see if you have a resolution yet. If you have found a resolution, please do share it with the community, as it can be helpful to others. If you have any questions relating to the current thread, please let us know and we will try our best to help you.



v-nikhilan-msft
Community Support

Hi @trutz ,
Thanks for using Microsoft Fabric Community.

This is a generic error that can occur in a variety of contexts with a DataFrame of that size (4263942). It typically indicates a problem with a large dataset. For example, it could occur if the dataset is too large to be processed by the computer, or if the dataset is corrupted.

 

A DataFrame of size 4263942 is approximately 4.1 gigabytes. This is a relatively large DataFrame, and it is possible that your computer does not have enough memory to convert it to a JSON string. The error also suggests that the JSON data is too large to be written to Fabric Lakehouse storage; the maximum size of a JSON file that can be written to Fabric Lakehouse storage is 4 MB.

 

The specific size limit for a DataFrame depends on the amount of available memory on the computer and the complexity of the Pandas operations being performed. However, it is generally recommended to avoid using DataFrames that are larger than a few gigabytes.

 

There are a few things you can do to try to fix the error:

 

1) Reduce the size of the DataFrame. You can do this by selecting a subset of the rows or columns in the DataFrame (see the sketch after this list).
2) Use a streaming JSON library. Streaming JSON libraries can convert DataFrames to JSON more efficiently than the built-in JSON library in Python.
3) Use a distributed computing framework. This allows you to distribute the workload across multiple computers.
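Along the lines of suggestion 1, here is a sketch that writes the DataFrame out in row chunks rather than as a single large file (the path, data, and chunk size are placeholders, not tested recommendations):

import pandas as pd

df = pd.DataFrame({"id": range(100_000), "value": "x"})  # stand-in for the large frame

chunk_rows = 25_000  # arbitrary; choose so each part stays comfortably small
base = (
    "abfss://<workspace>@onelake.dfs.fabric.microsoft.com/"
    "<lakehouse>.Lakehouse/Files/activity"  # placeholder path prefix
)

# Write several smaller JSON-lines files instead of one large upload.
for part, start in enumerate(range(0, len(df), chunk_rows)):
    chunk = df.iloc[start:start + chunk_rows]
    chunk.to_json(f"{base}_part{part:04d}.json", orient="records", lines=True)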

 

Hope this helps. If you have any further questions or requests, please do ask.

Hi @v-nikhilan-msft ,
Thanks for your help.
The computer in this case is the Fabric instance (Premium P1 node). There's not much to configure as it runs as Software as a Service.
The other JSON files that are successfully written to the Lakehouse storage are up to 119 MB in size, so the 4 MB limit you mentioned does not seem to apply.
I'll check out the streaming library approach.

Hi @trutz ,
I have reached the internal team for help regarding this issue. I will update you once I hear from them.
