Hi all,
I'm getting the error below when calling the pandas DataFrame method to_json() on a DataFrame whose .size is 4,263,942.
Smaller DataFrames can be saved without issue.
The file path and name form a standard abfss path, and the same path works fine for smaller files.
Any ideas how this could be solved? Is this an issue with the adlfs library that pandas uses under the hood?
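Here is roughly the shape of the failing call; the abfss path below is illustrative, not the real one:

import pandas as pd

# df is the DataFrame with .size == 4263942; the destination path is a placeholder
filenameActivity = "abfss://<workspace>@onelake.dfs.fabric.microsoft.com/<lakehouse>/Files/activity.json"
df.to_json(filenameActivity, orient="records", lines=True)
print("File has been saved as", filenameActivity)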
RuntimeError                              Traceback (most recent call last)
Cell In[35], line 86
---> 86 df.to_json(filenameActivity, orient="records", lines=True)
     87 print("File has been saved as", filenameActivity)

File ~/cluster-env/trident_env/lib/python3.10/site-packages/pandas/core/generic.py:2650, in NDFrame.to_json(self, path_or_buf, orient, date_format, double_precision, force_ascii, date_unit, default_handler, lines, compression, index, indent, storage_options)
   2647 config.is_nonnegative_int(indent)
   2648 indent = indent or 0
-> 2650 return json.to_json(
   2651     path_or_buf=path_or_buf,
   2652     obj=self,
   2653     orient=orient,
   2654     date_format=date_format,
   2655     double_precision=double_precision,
   2656     force_ascii=force_ascii,
   2657     date_unit=date_unit,
   2658     default_handler=default_handler,
   2659     lines=lines,
   2660     compression=compression,
   2661     index=index,
   2662     indent=indent,
   2663     storage_options=storage_options,
   2664 )

File ~/cluster-env/trident_env/lib/python3.10/site-packages/pandas/io/json/_json.py:178, in to_json(path_or_buf, obj, orient, date_format, double_precision, force_ascii, date_unit, default_handler, lines, compression, index, indent, storage_options)
    174     s = convert_to_line_delimits(s)
    176 if path_or_buf is not None:
    177     # apply compression and byte/text conversion
--> 178     with get_handle(
    179         path_or_buf, "w", compression=compression, storage_options=storage_options
    180     ) as handles:
    181         handles.handle.write(s)
    182 else:

File ~/cluster-env/trident_env/lib/python3.10/site-packages/pandas/io/common.py:133, in IOHandles.__exit__(self, *args)
    132 def __exit__(self, *args: Any) -> None:
--> 133     self.close()

File ~/cluster-env/trident_env/lib/python3.10/site-packages/pandas/io/common.py:125, in IOHandles.close(self)
    123     self.created_handles.remove(self.handle)
    124 for handle in self.created_handles:
--> 125     handle.close()
    126 self.created_handles = []
    127 self.is_wrapped = False

File ~/cluster-env/trident_env/lib/python3.10/site-packages/adlfs/spec.py:1919, in AzureBlobFile.close(self)
   1917 """Close file and azure client."""
   1918 asyncio.run_coroutine_threadsafe(close_container_client(self), loop=self.loop)
-> 1919 super().close()

File ~/cluster-env/trident_env/lib/python3.10/site-packages/fsspec/spec.py:1789, in AbstractBufferedFile.close(self)
   1787 else:
   1788     if not self.forced:
-> 1789         self.flush(force=True)
   1791 if self.fs is not None:
   1792     self.fs.invalidate_cache(self.path)

File ~/cluster-env/trident_env/lib/python3.10/site-packages/fsspec/spec.py:1660, in AbstractBufferedFile.flush(self, force)
   1657     self.closed = True
   1658     raise
-> 1660 if self._upload_chunk(final=force) is not False:
   1661     self.offset += self.buffer.seek(0, 2)
   1662     self.buffer = io.BytesIO()

File ~/cluster-env/trident_env/lib/python3.10/site-packages/fsspec/asyn.py:115, in sync_wrapper.<locals>.wrapper(*args, **kwargs)
    112 @functools.wraps(func)
    113 def wrapper(*args, **kwargs):
    114     self = obj or args[0]
--> 115     return sync(self.loop, func, *args, **kwargs)

File ~/cluster-env/trident_env/lib/python3.10/site-packages/fsspec/asyn.py:100, in sync(loop, func, timeout, *args, **kwargs)
     98     raise FSTimeoutError from return_result
     99 elif isinstance(return_result, BaseException):
--> 100     raise return_result
    101 else:
    102     return return_result

File ~/cluster-env/trident_env/lib/python3.10/site-packages/fsspec/asyn.py:55, in _runner(event, coro, result, timeout)
     53     coro = asyncio.wait_for(coro, timeout=timeout)
     54 try:
---> 55     result[0] = await coro
     56 except Exception as ex:
     57     result[0] = ex

File ~/cluster-env/trident_env/lib/python3.10/site-packages/adlfs/spec.py:2083, in AzureBlobFile._async_upload_chunk(self, final, **kwargs)
   2079     await bc.commit_block_list(
   2080         block_list=block_list, metadata=self.metadata
   2081     )
   2082 else:
-> 2083     raise RuntimeError(f"Failed to upload block{e}!") from e
   2084 elif self.mode == "ab":
   2085     async with self.container_client.get_blob_client(blob=self.blob) as bc:

RuntimeError: Failed to upload blockInternal Server Error
Solved! Go to Solution.
Hi,
I switched from JSON to Parquet and had no issue with storing the data. So for the moment I circumvented the issue.
In the meantime I switched the method to dataframe.to_parquet() so I could continue with the data exploration. An update on the bug would be helpful nonetheless. Thanks for your help so far.
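For completeness, the workaround is just the Parquet equivalent of the original call; the path shown is again only illustrative:

# Same Lakehouse destination as before, only the file format changed
filenameActivity = "abfss://<workspace>@onelake.dfs.fabric.microsoft.com/<lakehouse>/Files/activity.parquet"
df.to_parquet(filenameActivity)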
Hi @trutz,
The internal team has informed me that there is no such issue on their side. They were able to reproduce your scenario successfully using 5M records. I have attached screenshots for your reference.
Could you please confirm whether this is a consistent error or a one-time glitch caused by the Internal Server Error?
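For anyone who wants to attempt an independent repro, a synthetic DataFrame of comparable size could be built along these lines (purely illustrative; not the exact data or path used in the internal test):

import numpy as np
import pandas as pd

# Roughly 5 million rows of random data as a stand-in for the real dataset
rng = np.random.default_rng(0)
test_df = pd.DataFrame({
    "id": np.arange(5_000_000),
    "value": rng.random(5_000_000),
    "label": rng.choice(["a", "b", "c"], size=5_000_000),
})

# Placeholder Lakehouse path; replace with a real abfss URL before running
test_df.to_json(
    "abfss://<workspace>@onelake.dfs.fabric.microsoft.com/<lakehouse>/Files/repro.json",
    orient="records",
    lines=True,
)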
I received this error on consecutive days.
I used the orient="records" option. Maybe this has something to do with the error.
Maybe my JSON data contains a character that is not properly escaped by the pandas library and then leads to errors in the underlying libraries.
Unfortunately the data is sensitive, so I can't provide an example.
Hi @trutz,
Apologies for the issue you are facing. To rule out a data problem, you could try validating the JSON to see whether the data is in proper JSON format. This might help us identify whether the issue comes from the data or from the adlfs package.
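One possible way to do that check, sketched below under the assumption that the DataFrame is called df as in the traceback; it serializes locally first so the abfss upload path is taken out of the picture:

import json

# Serialize to a JSON Lines string in memory, without writing to abfss
jsonl_text = df.to_json(orient="records", lines=True)

# Parse every line back to confirm each record is valid JSON
for i, line in enumerate(jsonl_text.splitlines()):
    try:
        json.loads(line)
    except json.JSONDecodeError as err:
        print(f"Record {i} is not valid JSON: {err}")
        break
else:
    print("All records parsed back as valid JSON.")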
Hi @trutz,
We haven't heard from you since the last response and wanted to check whether you have found a resolution yet. If you have, please share it with the community, as it can be helpful to others. If you have any questions related to this thread, please let us know and we will try our best to help you.
Hi @trutz ,
Thanks for using Microsoft Fabric Community.
The RuntimeError you are seeing is a generic error that can occur in a variety of contexts. It typically indicates a problem with a large dataset: for example, the dataset may be too large to be processed, or it may be corrupted.
A DataFrame of size 4263942 is approximately 4.1 gigabytes. That is a relatively large DataFrame, and it is possible that the environment does not have enough memory to convert it to a JSON string. The error may also indicate that the JSON data is too large to be written to Fabric Lakehouse storage; the maximum size of a JSON file that can be written to Fabric Lakehouse storage is 4 MB.
The specific size limit for a DataFrame depends on the amount of available memory and the complexity of the pandas operations being performed. As a rule of thumb, it is best to avoid DataFrames larger than a few gigabytes.
There are a few things you can do to try to fix the error:
1) Reduce the size of the DataFrame. You can do this by selecting a subset of the rows or columns, or by writing the data out in smaller chunks, as shown in the sketch after this list.
2) Use a streaming JSON library. Streaming JSON libraries can convert DataFrames to JSON more efficiently than the built-in JSON library in Python.
3) Use a distributed computing framework. This allows you to distribute the workload across multiple computers.
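As an illustration of option 1, here is a minimal sketch that splits the DataFrame into smaller pieces and writes each piece as its own JSON Lines file; df, the chunk size, and the abfss base path are all assumptions:

# Write the data in row chunks so that no single upload is very large
base_path = "abfss://<workspace>@onelake.dfs.fabric.microsoft.com/<lakehouse>/Files/activity"
chunk_size = 100_000  # rows per output file; tune as needed

for start in range(0, len(df), chunk_size):
    chunk = df.iloc[start:start + chunk_size]
    chunk.to_json(
        f"{base_path}_part{start // chunk_size:04d}.json",
        orient="records",
        lines=True,
    )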
Hope this helps. If you have any further questions or requests, please do ask.
Hi @v-nikhilan-msft ,
Thanks for your help.
The computer in this case is the Fabric instance (Premium P1 node). There's not much to configure as it runs as Software as a Service.
The other JSON files that are successfully written to the Lakehouse storage are up to 119 MB in size, so the 4 MB limit you mentioned does not seem to apply.
I'll check out the streaming library approach.
Hi @trutz ,
I have reached out to the internal team for help regarding this issue. I will update you once I hear back from them.