Hi all,
I'm currently using a notebook connected to a lakehouse to build my bronze layer tables. I have a relatively large Parquet file (approx. 7GB) that cannot be read in. I've copied my three files from an S3 connection into my lakehouse: two are relatively small and the third is the 7GB one. The first two were being read in correctly but had data issues that prevented saving them as tables, which I resolved by adding the following configuration changes:
Hi @TNastyCodes ,
We’d like to follow up regarding the recent concern. Kindly confirm whether the issue has been resolved, or if further assistance is still required. We are available to support you and are committed to helping you reach a resolution.
Best Regards,
Chaithra E.
Hi @TNastyCodes ,
Check Connection and Confirm S3 Access: Verify that your Spark cluster can access the S3 bucket and has the necessary permissions by listing files or performing a simple file read operation.
Increase Resources with Memory and Executor Tuning for Spark: Adjust Spark's executor memory, driver memory, and partition settings to allocate sufficient resources for handling large files efficiently.
Optimize the Read: Set mergeSchema=false and ignoreMetadata=true to disable schema merging and metadata reading, which can cause issues when reading large or inconsistent Parquet files.
Check for File Corruption: Use pyarrow or another tool to verify the file's integrity and check whether the Parquet file is corrupted or unreadable outside of Spark.
Split the File: If all else fails and the file is too large to process in one go, consider splitting it into smaller chunks using tools like aws s3 cp, then processing each chunk separately (see the sketch below).
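One caveat on the last point: aws s3 cp copies whole objects, so actually splitting a Parquet file generally means rewriting it in smaller pieces. A minimal sketch using pyarrow, assuming the 7GB file has already been copied into the lakehouse Files area (the paths here are hypothetical placeholders):

import os
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical paths; adjust to your lakehouse Files mount
src = '/lakehouse/default/Files/bronze/large-file.parquet'
out_dir = '/lakehouse/default/Files/bronze/chunks'
os.makedirs(out_dir, exist_ok=True)

pf = pq.ParquetFile(src)
# Stream row batches so the full 7GB is never held in memory at once
for i, batch in enumerate(pf.iter_batches(batch_size=1_000_000)):
    pq.write_table(pa.Table.from_batches([batch]), f"{out_dir}/part-{i:04d}.parquet")

Each smaller part can then be read and appended to the bronze table separately.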
Hope this helps.
Thank you
Hi @v-echaithra ,
Thanks for your response! My Spark cluster does have access to the S3 location, since the other two files from the same source can be read in and saved as Delta tables after transformations. My logic already includes those two optimization options for the read as well, but it sadly still fails.
I'll try playing with the memory & executor tuning as well as attempting with pyarrow/pandas.
It might be ideal to just break the file down into smaller chunks, but how is that done w/ s3 cp?
Hi @TNastyCodes ,
1. Check File Integrity: Even though you’ve verified that the smaller files are working, there could still be issues with the 7GB file. You can use PyArrow to check the integrity of the file and ensure it's not corrupted.
Steps to Verify with PyArrow:
import pyarrow.parquet as pq

try:
    pq.read_table('s3://your-bucket/your-large-file.parquet')
    print("File is valid and readable.")
except Exception as e:
    print(f"Error reading file with PyArrow: {e}")
If this step fails, it may suggest file corruption, and you might need to try re-uploading the file or fixing it at the source.
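Note that pq.read_table materializes the entire 7GB file in memory, which can itself fail on a small node. A lighter-weight sketch that only reads the Parquet footer, assuming credentials for the bucket are available to pyarrow (bucket and key names are placeholders):

import pyarrow.parquet as pq
from pyarrow import fs

s3 = fs.S3FileSystem()  # picks up credentials from the environment
with s3.open_input_file('your-bucket/your-large-file.parquet') as f:
    pf = pq.ParquetFile(f)
    print(pf.metadata)       # row groups, row count, created-by
    print(pf.schema_arrow)   # column names and types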
2. Increase Spark Resources: Large files often require more memory and computing power to process. Consider tuning the Spark configurations related to memory and partitioning. Here are a few options to explore:
Increase Driver and Executor Memory:
spark.conf.set("spark.executor.memory", "16g")
spark.conf.set("spark.driver.memory", "16g")
spark.conf.set("spark.sql.shuffle.partitions", "200")
Increase Parallelism: If the file is large and you are reading it in one go, try splitting the data into smaller chunks for parallel processing:
spark.conf.set("spark.sql.files.maxPartitionBytes", "134217728")
spark.conf.set("spark.sql.files.openCostInBytes", "134217728")
Use More Executors: If your cluster has multiple executors, increase the number of executors and cores available for the job:
spark.conf.set("spark.executor.instances", "10")
spark.conf.set("spark.executor.cores", "4")
3. Optimize Parquet Settings:
You've already tried a few optimization options. Some other options to consider:
Increase parquet.block.size: This increases the block size for Parquet files, which can improve read performance for large files.
spark.conf.set("spark.sql.parquet.block.size", "134217728")  # Set block size to 128MB
Disable vectorizedReader: You already disabled the vectorizedReader option. It’s known that sometimes large files with complex schemas don’t work well with the vectorized reader, so disabling this is correct.
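For reference, that option is controlled by the following Spark SQL configuration key (shown here as a sketch of the setting being discussed):

spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")  # fall back to the non-vectorized Parquet reader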
4. Use Delta Lake for Large File Handling: Since you are using a lakehouse setup, try utilizing Delta Lake for more efficient file handling and processing. Delta provides ACID transactions and better handling of large files. To load a large file directly into Delta:
df = spark.read.parquet('s3://your-bucket/your-large-file.parquet')
df.write.format("delta").save('s3://your-bucket/delta-tables/')
If Delta is not an option, consider using Apache Iceberg as an alternative for handling large datasets in a more optimized way.
5. Consider File Compression: Parquet files are already compressed, but if the file size is causing memory issues, consider rewriting the data with a stronger codec such as gzip (snappy is usually the default). The codec is specified when writing; Spark detects it automatically on read:
df.write.option("compression", "gzip").parquet('s3://your-bucket/your-large-file-gzip/')
Thank you
Chaithra E.