zunigaw
New Member

Large Parquet files 2GB

Hi Community,

 

I’ve been working on a data pipeline where I extract Parquet files from an S3 shortcut connection into a PySpark notebook and transform them into Delta tables. The pipeline processes 12 Parquet files, each under 1.5GB in size, and everything was running smoothly (around 8 minutes for the entire process).

 

However, I recently encountered an issue after one of the Parquet files grew to 2.2GB. The process now crashes, even when I try processing this particular file separately as a test. I’m wondering whether Microsoft Fabric has a size limit for Parquet files or if this could be a timeout issue.

Interestingly, the problematic Parquet file can still be loaded into a Delta table manually using the "Load To Tables" option in the Amazon S3 shortcut, so the file itself doesn’t appear to be corrupted.


 

The error I am getting is the following: 

 Py4JJavaError: An error occurred while calling o5084.parquet. : Operation failed: "Internal Server Error", 500, HEAD, "path"/Order_Line_Item.parquet?upn=false&action=getStatus&timeout=90 

 

When I click the link included in the error message, I see this additional error:

{"error":{"code":"Unauthorized","message":"Authentication Failed with Bearer token is not present in the request"}}

 

  1. Has anyone experienced similar issues with large Parquet files in Microsoft Fabric or PySpark?
  2. Does Microsoft Fabric impose a size limit on Parquet files, or could this be related to authentication or request timeouts?

I really appreciate any help here.

Here is the code:


from pyspark.sql import SparkSession

# Spark session (in a Fabric notebook a session already exists, so these
# configs only take effect if the session is created here)
spark = SparkSession.builder\
    .appName("Read Parquet Files")\
    .config("spark.driver.memory", "16g")\
    .config("spark.executor.memory", "16g")\
    .config("spark.executor.cores", "4")\
    .getOrCreate()

files_to_read = ["Order_Line_Item.parquet", "Order.parquet", "Product.parquet", "Account.parquet", "Product_Allocation.parquet", "Fulfillment.parquet", "Bridge.parquet", "Opportunity.parquet", "abc.parquet", "abc2.parquet", "Region_Allocation_CO.parquet", "Region_Allocation_ST.parquet"]

dataframes = {}

# Read every Parquet file from the S3 shortcut (lazy - nothing is read yet)
for file_name in files_to_read:
    file_path = f"abfss://path/{file_name}"
    df = spark.read.parquet(file_path)
    dataframes[file_name] = df

# Write each DataFrame out as a Delta table
path = "abfss://path/Tables"
for file_name, df in dataframes.items():
    if file_name == "abc.parquet" or file_name == "abc2.parquet":
        output_file_name = "fact_" + file_name.replace('.parquet', '').lower()
    else:
        output_file_name = "dim_" + file_name.replace('.parquet', '').lower()
    output_path = f"{path}/{output_file_name}"
    df.write.format("delta").mode("overwrite").option("overwriteSchema", "true").save(output_path)

spark.stop()
1 ACCEPTED SOLUTION
v-hashadapu
Community Support

Hi @zunigaw , Thank you for reaching out to the Microsoft Community Forum.

 

This may not be because of a size limit, but rather due to how Spark handles execution and authentication. Microsoft Fabric uses time-limited tokens for accessing shortcut-linked storage (like S3), and Spark delays file access until an action is triggered. If this delay exceeds the token's validity, the job fails with an authentication error, which is exactly what you're seeing.

 

In notebooks, that token is scoped to the session and doesn’t auto-refresh once expired. You should force Spark to access the file immediately after reading, while the token is still valid. You can do this by caching and counting each DataFrame right after reading it, or by restructuring the loop so each file is read and written in the same pass, so the shortcut is accessed up front instead of failing later (see the sketch below).
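A minimal sketch of that pattern, reusing the placeholder abfss paths and the files_to_read list from the question (an illustration of the idea, not a definitive fix): each file is read, materialized with a count, and written to Delta in the same loop iteration, so the shortcut is hit while the token is still valid rather than deferring all reads to the end.

# Sketch only: read, materialize and write each file in one pass.
for file_name in files_to_read:
    df = spark.read.parquet(f"abfss://path/{file_name}")

    df.cache()
    row_count = df.count()  # action: forces Spark to read the Parquet file now
    print(f"{file_name}: {row_count} rows")

    prefix = "fact_" if file_name in ("abc.parquet", "abc2.parquet") else "dim_"
    table_name = prefix + file_name.replace(".parquet", "").lower()

    (df.write.format("delta")
       .mode("overwrite")
       .option("overwriteSchema", "true")
       .save(f"abfss://path/Tables/{table_name}"))

    df.unpersist()

Caching before the count keeps the data in memory for the subsequent write, so each file is only scanned once; and if the 2.2GB file still fails, the count surfaces the error at read time rather than partway through the writes.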

 

If this helped solve the issue, please consider marking it “Accept as Solution” and giving a ‘Kudos’ so others with similar queries may find it more easily. If not, please share the details, always happy to help.
Thank you.

