Hi all,
I'm currently using a notebook connected to a lakehouse to build my bronze layer tables. I have a relatively large Parquet file (approx. 7GB) that cannot be read in. I've copied my three files from an S3 connection into my lakehouse: two are relatively small and the third is the 7GB one. The first two were being read in correctly but had data issues that prevented saving them as tables, which I resolved by adding the following configuration changes:
Hi @TNastyCodes ,
We’d like to follow up regarding the recent concern. Kindly confirm whether the issue has been resolved, or if further assistance is still required. We are available to support you and are committed to helping you reach a resolution.
Best Regards,
Chaithra E.
Hi @TNastyCodes ,
Check Connection and Confirm S3 Access: Verify that your Spark cluster can access the S3 bucket and has the necessary permissions by listing files or performing a simple file read operation.
Increase Resources with Memory and Executor Tuning for Spark: Adjust Spark's executor memory, driver memory, and partition settings to allocate sufficient resources for handling large files efficiently.
Optimize the Read: Set mergeSchema=false and ignoreMetadata=true to disable schema merging and metadata reading, which can cause issues when reading large or inconsistent Parquet files.
Check for File Corruption: Use pyarrow or another tool to verify the file's integrity and check whether the Parquet file is corrupted or unreadable outside of Spark.
Split the File: If all else fails and the file is too large to process in one go, consider splitting it into smaller chunks using tools like aws s3 cp, then processing each chunk separately (see the sketch below).
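One caveat on the last point: aws s3 cp copies whole objects, so actually splitting a Parquet file generally means rewriting it in smaller pieces. A minimal sketch using pyarrow, assuming the 7GB file has already been copied into the lakehouse Files area (the paths here are hypothetical placeholders):

import os
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical paths; adjust to your lakehouse Files mount
src = '/lakehouse/default/Files/bronze/large-file.parquet'
out_dir = '/lakehouse/default/Files/bronze/chunks'
os.makedirs(out_dir, exist_ok=True)

pf = pq.ParquetFile(src)
# Stream row batches so the full 7GB is never held in memory at once
for i, batch in enumerate(pf.iter_batches(batch_size=1_000_000)):
    pq.write_table(pa.Table.from_batches([batch]), f"{out_dir}/part-{i:04d}.parquet")

Each smaller part can then be read and appended to the bronze table separately.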
Hope this helps.
Thank you
Hi @v-echaithra ,
Thanks for your response! My Spark cluster does have access to the S3 location, since the other two files from the same source can be read in and saved as Delta tables after transformations. My logic already includes those two optimization options for the read as well, but it sadly still fails.
I'll try playing with the memory & executor tuning as well as attempting with pyarrow/pandas.
It might be ideal to just break the file down into smaller chunks, but how is that done w/ s3 cp?
Hi @TNastyCodes ,
1. Check File Integrity: Even though you’ve verified that the smaller files are working, there could still be issues with the 7GB file. You can use PyArrow to check the integrity of the file and ensure it's not corrupted.
Steps to Verify with PyArrow:
import pyarrow.parquet as pq

try:
    pq.read_table('s3://your-bucket/your-large-file.parquet')
    print("File is valid and readable.")
except Exception as e:
    print(f"Error reading file with PyArrow: {e}")
If this step fails, it may suggest file corruption, and you might need to try re-uploading the file or fixing it at the source.
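Note that pq.read_table materializes the entire 7GB file in memory, which can itself fail on a small node. A lighter-weight sketch that only reads the Parquet footer, assuming credentials for the bucket are available to pyarrow (bucket and key names are placeholders):

import pyarrow.parquet as pq
from pyarrow import fs

s3 = fs.S3FileSystem()  # picks up credentials from the environment
with s3.open_input_file('your-bucket/your-large-file.parquet') as f:
    pf = pq.ParquetFile(f)
    print(pf.metadata)       # row groups, row count, created-by
    print(pf.schema_arrow)   # column names and types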
2. Increase Spark Resources: Large files often require more memory and computing power to process. Consider tuning the Spark configurations related to memory and partitioning. Here are a few options to explore:
Increase Driver and Executor Memory:
spark.conf.set("spark.executor.memory", "16g")
spark.conf.set("spark.driver.memory", "16g")
spark.conf.set("spark.sql.shuffle.partitions", "200")
Increase Parallelism: If the file is large and you are reading it in one go, try splitting the data into smaller chunks for parallel processing:
spark.conf.set("spark.sql.files.maxPartitionBytes", "134217728")
spark.conf.set("spark.sql.files.openCostInBytes", "134217728")
Use More Executors: If your cluster has multiple executors, increase the number of executors and cores available for the job:
spark.conf.set("spark.executor.instances", "10")
spark.conf.set("spark.executor.cores", "4")
3. Optimize Parquet Settings:
You've already tried a few optimization options. Some other options to consider:
Increase parquet.block.size: This increases the block size for Parquet files, which can improve read performance for large files.
spark.conf.set("spark.sql.parquet.block.size", "134217728")  # Set block size to 128MB
Disable vectorizedReader: You already disabled the vectorizedReader option. It’s known that sometimes large files with complex schemas don’t work well with the vectorized reader, so disabling this is correct.
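For reference, that option is controlled by the following Spark SQL configuration key (shown here as a sketch of the setting being discussed):

spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")  # fall back to the non-vectorized Parquet reader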
4. Use Delta Lake for Large File Handling: Since you are using a lakehouse setup, try utilizing Delta Lake for more efficient file handling and processing. Delta provides ACID transactions and better handling of large files. To load a large file directly into Delta:
df = spark.read.parquet('s3://your-bucket/your-large-file.parquet')
df.write.format("delta").save('s3://your-bucket/delta-tables/')
If Delta is not an option, consider using Apache Iceberg as an alternative for handling large datasets in a more optimized way.
5. Consider File Compression: Parquet files are already compressed, but if the file size is causing memory issues, consider rewriting the data with a stronger codec such as gzip (snappy is usually the default). The codec is specified when writing; Spark detects it automatically on read:
df.write.option("compression", "gzip").parquet('s3://your-bucket/your-large-file-gzip/')
Thank you
Chaithra E.