Dear All,
I have a requirement to read Parquet files from a directory into a DataFrame to prepare data from the Bronze Lakehouse for the Silver Lakehouse. While reading the files, Spark throws this error:
org.apache.spark.SparkException: Parquet column cannot be converted in file
filepath/SRV0001148_20250819065539974.parquet. Column: [Syncxxx.xxx:ApplicationArea.xxx:CreationDateTime], Expected: string, Found: INT96.
#1) sample script:
#2) sample script:
Hello @Sureshmannem,
Thank you for reaching out to the Microsoft Fabric Community Forum.
I have reproduced your scenario in a Fabric Notebook and got the expected results. Below I’ll share the steps, the code I used, and screenshots of the outputs for clarity.
from datetime import datetime
from pyspark.sql import Row

# Sample rows with CreationDateTime pre-formatted as a millisecond-precision
# string, so Spark infers StringType rather than TimestampType.
data = [
    Row(ID="1", Name="Ganesh", CreationDateTime=datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f")[:-3]),
    Row(ID="2", Name="Ravi", CreationDateTime=datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f")[:-3])
]
df = spark.createDataFrame(data)
df.printSchema()
df.show(truncate=False)
Output (Screenshot 1 – Schema & Screenshot 2 – Data):
# Save the DataFrame as a managed Lakehouse table, then list tables to verify.
df.write.mode("overwrite").saveAsTable("DemoTable")
spark.catalog.listTables("default")
Output (Screenshot 3 – Table Catalog):
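As an extra check beyond the screenshots, the table can be read back to confirm the stored schema (a minimal sketch; DemoTable is the table created above):
# Read the table back and confirm CreationDateTime is stored as a string.
check_df = spark.sql("SELECT * FROM DemoTable")
check_df.printSchema()
check_df.show(truncate=False)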
With this approach, the table DemoTable was successfully created in the Lakehouse with the expected schema, and the data was retrieved correctly with CreationDateTime as a string. It worked in my case because I explicitly formatted the CreationDateTime column as a string before saving to the Lakehouse table. By default, Spark can infer a different data type (such as timestamp) depending on how the value is created; converting it to a string keeps the schema consistent and prevents mismatch issues.
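If the column has already been read as a timestamp, the same normalization can be done with an explicit cast before writing (a minimal sketch; the column name CreationDateTime is assumed):
from pyspark.sql.functions import col, date_format

# Cast a timestamp column to string before saving, so every write
# produces a consistent StringType column.
df_str = df.withColumn("CreationDateTime", col("CreationDateTime").cast("string"))
# Alternatively, pin an explicit format:
# df_str = df.withColumn("CreationDateTime", date_format(col("CreationDateTime"), "yyyy-MM-dd HH:mm:ss.SSS"))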
Best Regards,
Ganesh Singamshetty.
Hi Ganesh,
Thanks for your kind support and explanation.
My scenario is slightly different; I am sharing the sample script with the paths masked.
I need to read Parquet files stored in the Lakehouse into a DataFrame to prepare my data, and the issue occurs at the very first step:
source_df = spark.read.parquet("abfss://xxxxxx@onelake.dfs.fabric.microsoft.com/xxxx.Lakehouse/Files/xxxx/SRV0001148_*.parquet")
error: org.apache.spark.SparkException: Parquet column cannot be converted in file xxxxxxx Expected: string, Found: INT96.
I have tried defining my schema explicitly, but Spark still ignores it and only considers the schema from the Parquet files.
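For reference, the explicit-schema attempt looked roughly like this (column names are placeholders for the masked ones):
from pyspark.sql.types import StructType, StructField, StringType

# Explicit schema attempt. This still fails on INT96 columns: Parquet's
# legacy INT96 timestamp encoding cannot be converted to string at scan time.
schema = StructType([
    StructField("ID", StringType(), True),
    StructField("CreationDateTime", StringType(), True)
])
source_df = (
    spark.read
    .schema(schema)
    .parquet("abfss://xxxxxx@onelake.dfs.fabric.microsoft.com/xxxx.Lakehouse/Files/xxxx/SRV0001148_*.parquet")
)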
Dear Community,
Thank you for your continued support.
I’m happy to share that I’ve resolved the issue I was facing, and I’d like to outline the approach I followed in case it helps others encountering similar challenges.
The issue occurred when I attempted to load over 50 Parquet files into a single PySpark DataFrame using a wildcard path. PySpark resolved the schema from each file, but the files were inconsistent: some stored a particular attribute in Parquet's INT96 timestamp encoding, while others stored the same attribute as a string.
This led to data type conversion errors during the read operation.
To investigate further, I loaded each file individually into a DataFrame. This worked as expected, confirming that the wildcard-based bulk load was failing due to schema inference conflicts across files.
I modified my script to iterate through each file individually, applying the full processing logic per file. This approach bypasses the schema inference conflict and successfully loads and processes all files.
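For anyone who hits this later, here is a minimal sketch of the per-file approach under a few assumptions: mssparkutils is used to list the Lakehouse files, and the folder path, column name, and target table are placeholders:
from notebookutils import mssparkutils
from pyspark.sql.functions import col

# List the Parquet files in the Lakehouse folder.
files_dir = "abfss://xxxxxx@onelake.dfs.fabric.microsoft.com/xxxx.Lakehouse/Files/xxxx"
parquet_files = [f.path for f in mssparkutils.fs.ls(files_dir) if f.name.endswith(".parquet")]

result_df = None
for path in parquet_files:
    # Read one file at a time so each file's schema stands on its own.
    file_df = spark.read.parquet(path)
    # Normalize the problematic column to string regardless of how it was stored.
    file_df = file_df.withColumn("CreationDateTime", col("CreationDateTime").cast("string"))
    # unionByName tolerates column-order differences across files.
    result_df = file_df if result_df is None else result_df.unionByName(file_df, allowMissingColumns=True)

result_df.write.mode("overwrite").saveAsTable("SilverTable")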