zunigaw
New Member

Large Parquet files 2GB

Hi Community,

 

I’ve been working on a data pipeline where I extract Parquet files from an S3 shortcut connection into a PySpark notebook and transform them into Delta tables. The pipeline processes 12 Parquet files, each under 1.5GB in size, and everything was running smoothly (around 8 minutes for the entire process).

 

However, I recently encountered an issue after one of the Parquet files grew to 2.2GB. The process now crashes, even when I try processing this particular file separately as a test. I’m wondering whether Microsoft Fabric has a size limit for Parquet files or if this could be a timeout issue.

Interestingly, the problematic Parquet file can still be loaded into a Delta table manually using the "Load To Tables" option in the Amazon S3 shortcut, so the file itself doesn’t appear to be corrupted.


 

The error I am getting is the following: 

 Py4JJavaError: An error occurred while calling o5084.parquet. : Operation failed: "Internal Server Error", 500, HEAD, "path"/Order_Line_Item.parquet?upn=false&action=getStatus&timeout=90 

 

When I click the link included in the error message, I see this additional error:

{"error":{"code":"Unauthorized","message":"Authentication Failed with Bearer token is not present in the request"}}

 

  1. Has anyone experienced similar issues with large Parquet files in Microsoft Fabric or PySpark?
  2. Does Microsoft Fabric impose a size limit on Parquet files, or could this be related to authentication or request timeouts?

I really appreciate any help here.

Here is the code:


from pyspark.sql import SparkSession

# Spark session (in a Fabric notebook a session already exists, so these
# configs only take effect if the session is created here)
spark = SparkSession.builder\
    .appName("Read Parquet Files")\
    .config("spark.driver.memory", "16g")\
    .config("spark.executor.memory", "16g")\
    .config("spark.executor.cores", "4")\
    .getOrCreate()

files_to_read = ["Order_Line_Item.parquet", "Order.parquet", "Product.parquet", "Account.parquet", "Product_Allocation.parquet", "Fulfillment.parquet", "Bridge.parquet", "Opportunity.parquet", "abc.parquet", "abc2.parquet", "Region_Allocation_CO.parquet", "Region_Allocation_ST.parquet"]

dataframes = {}

# Read every Parquet file from the S3 shortcut (lazy - nothing is read yet)
for file_name in files_to_read:
    file_path = f"abfss://path/{file_name}"
    df = spark.read.parquet(file_path)
    dataframes[file_name] = df

# Write each DataFrame out as a Delta table
path = "abfss://path/Tables"
for file_name, df in dataframes.items():
    if file_name == "abc.parquet" or file_name == "abc2.parquet":
        output_file_name = "fact_" + file_name.replace('.parquet', '').lower()
    else:
        output_file_name = "dim_" + file_name.replace('.parquet', '').lower()
    output_path = f"{path}/{output_file_name}"
    df.write.format("delta").mode("overwrite").option("overwriteSchema", "true").save(output_path)

spark.stop()
1 ACCEPTED SOLUTION
v-hashadapu
Community Support

Hi @zunigaw , Thank you for reaching out to the Microsoft Community Forum.

 

This may not be because of a size limit, but rather due to how Spark handles execution and authentication. Microsoft Fabric uses time-limited tokens for accessing shortcut-linked storage (like S3), and Spark delays file access until an action is triggered. If this delay exceeds the token's validity, the job fails with an authentication error, which is exactly what you're seeing.

 

In notebooks, that token is scoped to the session and doesn’t auto-refresh once expired. You should force Spark to access the file immediately after reading, while the token is still valid. You can do this by caching and counting each DataFrame right after reading it, or by restructuring the loop so each file is read and written in the same pass, so the shortcut is accessed up front instead of failing later (see the sketch below).
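A minimal sketch of that pattern, reusing the placeholder abfss paths and the files_to_read list from the question (an illustration of the idea, not a definitive fix): each file is read, materialized with a count, and written to Delta in the same loop iteration, so the shortcut is hit while the token is still valid rather than deferring all reads to the end.

# Sketch only: read, materialize and write each file in one pass.
for file_name in files_to_read:
    df = spark.read.parquet(f"abfss://path/{file_name}")

    df.cache()
    row_count = df.count()  # action: forces Spark to read the Parquet file now
    print(f"{file_name}: {row_count} rows")

    prefix = "fact_" if file_name in ("abc.parquet", "abc2.parquet") else "dim_"
    table_name = prefix + file_name.replace(".parquet", "").lower()

    (df.write.format("delta")
       .mode("overwrite")
       .option("overwriteSchema", "true")
       .save(f"abfss://path/Tables/{table_name}"))

    df.unpersist()

Caching before the count keeps the data in memory for the subsequent write, so each file is only scanned once; and if the 2.2GB file still fails, the count surfaces the error at read time rather than partway through the writes.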

 

If this helped solve the issue, please consider marking it “Accept as Solution” and giving a ‘Kudos’ so others with similar queries may find it more easily. If not, please share the details, always happy to help.
Thank you.

