Sureshmannem
Frequent Visitor

Reading Parquet files into a Spark DataFrame throws a data type error

Dear All,

 

I have a requirement to read Parquet files from a directory into a DataFrame to prepare data for the move from the Bronze Lakehouse to the Silver Lakehouse. While reading the files, it throws the following error message:

 org.apache.spark.SparkException: Parquet column cannot be converted in file

filepath/SRV0001148_20250819065539974.parquet. Column: [Syncxxx.xxx:ApplicationArea.xxx:CreationDateTime], Expected: string, Found: INT96.

 

#1) sample script:

from pyspark.sql import SparkSession
from pyspark.sql.types import *
source_df = spark.read.parquet("filepath/SRV0001148_*.parquet")
source_df.show()
 

#2) sample script:

from pyspark.sql import SparkSession
from pyspark.sql.types import *
 
schema = StructType([
    StructField("column1", StringType(), True),
    StructField("column2", StringType(), True)
])
 
source_df = spark.read.schema(schema).parquet("filepath/SRV0001148_*.parquet")
source_df.show()
 
Some of the files load correctly. I was looking for an approach to load the data with every attribute treated as a string, but it is not working, hence this request for support. If anyone is experiencing a similar issue, please share your insight; it would be a great help. Thanks in advance.
 
Regards,
Suresh
 
ACCEPTED SOLUTION
Sureshmannem
Frequent Visitor

Dear Community,

Thank you for your continued support.

I’m happy to share that I’ve resolved the issue I was facing, and I’d like to outline the approach I followed in case it helps others encountering similar challenges.

 

Initial Observation

The issue occurred when I attempted to load over 50 Parquet files into a single PySpark DataFrame using a wildcard path. PySpark inferred the schema from the data in each file, but inconsistencies arose—some files interpreted a particular attribute as an integer, while others treated the same attribute as a string.

This led to data type mismatch errors during the read operation.

 

Testing

To investigate further, I loaded each file individually into a DataFrame. This worked as expected, confirming that the wildcard-based bulk load was failing due to schema inference conflicts across files.
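
For context, a quick diagnostic along these lines makes the conflict visible by printing the schema Spark infers for each file individually (the folder path below is a placeholder, matching the masked paths above):

from notebookutils import mssparkutils

# List the Parquet files in the folder (placeholder path)
folder = "abfss://xxx@onelake.dfs.fabric.microsoft.com/xxx.Lakehouse/Files/xxx/"
parquet_files = [f.path for f in mssparkutils.fs.ls(folder) if f.path.endswith(".parquet")]

# Print the inferred type of every column, per file, to spot the files that disagree
for path in parquet_files:
    print(path, dict(spark.read.parquet(path).dtypes))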

 

Solution

I modified my script to iterate through each file individually, applying the full processing logic per file. This approach bypasses the schema inference conflict and successfully loads and processes all files.

 

# Step 1: Import required packages
from pyspark.sql import SparkSession
from pyspark.sql.types import *
import pandas as pd
from functools import reduce
from notebookutils import mssparkutils

 

# Step 2: Define the Lakehouse path
lakehouse_path = "abfss://xxx@onelake.dfs.fabric.microsoft.com/xxx.Lakehouse/xxx/"

 

# Step 3: List all Parquet files in the folder
file_list = mssparkutils.fs.ls(lakehouse_path)
parquet_files = [f.path for f in file_list if f.path.endswith(".parquet")]

 

# Step 4: Read schema reference file once
schema_df = spark.read.parquet("abfss://xxx@onelake.dfs.fabric.microsoft.com/xxx.Lakehouse/Files/xxx/xxx.parquet").toPandas()
schema_df = schema_df.head(0)  # Empty schema frame
schema_columns = schema_df.columns.tolist()

 

# Step 5: Define helper functions
def clean_column_name(col_name):
    for sep in ['@', ':']:
        if sep in col_name:
            col_name = col_name.split(sep)[-1]
    return col_name

 

def rename_columns(df, old_names, new_names):
    return reduce(
        lambda data, idx: data.withColumnRenamed(old_names[idx], new_names[idx]),
        range(len(old_names)),
        df
    )

 

# Step 6: Loop through each file and process
for file_path in parquet_files:
    print(f"Processing file: {file_path}")
   
    # Load the file into a Spark DataFrame
    source_df = spark.read.parquet(file_path)
   
    # Clean and rename columns
    old_columns = source_df.columns
    new_columns = [clean_column_name(col) for col in old_columns]
    source_df = rename_columns(source_df, old_columns, new_columns)
   
    # Convert to Pandas
    source_df = source_df.toPandas().reset_index(drop=True)
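    # Note: toPandas() collects this file's rows onto the driver, which is fine for small files but memory-heavy for very large ones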
   
    # Add missing columns
    for col in schema_columns:
        if col not in source_df.columns:
            source_df[col] = pd.NA
   
    # Reorder columns
    source_df = source_df[schema_columns]
   
    # Concatenate with empty schema and convert to string
    final_df = pd.concat([schema_df, source_df], ignore_index=True, sort=False).astype(str)
   
    # Convert back to Spark DataFrame
    final_spark_df = spark.createDataFrame(final_df)
   
    # Show preview (or write to staging)
    final_spark_df.show()
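
As a note on the design choice: the pandas round trip above (astype(str) after the concat) is what forces every attribute to string. If the files were too large to collect to the driver, a pure-Spark variant of the same idea should work: cast every column of each file to string and combine the per-file DataFrames with unionByName. A minimal sketch under that assumption, reusing clean_column_name and rename_columns from above (it skips the Step 4 reference-schema alignment):

from functools import reduce
from pyspark.sql.functions import col

frames = []
for file_path in parquet_files:
    df = spark.read.parquet(file_path)
    # Clean the column names, then cast every column to string
    df = rename_columns(df, df.columns, [clean_column_name(c) for c in df.columns])
    df = df.select([col(c).cast("string").alias(c) for c in df.columns])
    frames.append(df)

# Union all files; columns missing from some files are filled with nulls
combined_df = reduce(lambda a, b: a.unionByName(b, allowMissingColumns=True), frames)
combined_df.show()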
   


4 REPLIES
v-ssriganesh
Community Support

Hello @Sureshmannem,
Thank you for reaching out to the Microsoft Fabric Community Forum.

I have reproduced your scenario in a Fabric Notebook, and I got the expected results. Below I’ll share the steps, the code I used and screenshots of the outputs for clarity.

  • Created a DataFrame with sample data
from datetime import datetime
from pyspark.sql import Row

data = [
    Row(ID="1", Name="Ganesh", CreationDateTime=datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f")[:-3]),
    Row(ID="2", Name="Ravi",   CreationDateTime=datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f")[:-3])
]

df = spark.createDataFrame(data)
df.printSchema()
df.show(truncate=False)

 

Output (Screenshot 1 – Schema & Screenshot 2 – Data): [screenshots omitted]

 

  • Saved DataFrame as a Lakehouse table
df.write.mode("overwrite").saveAsTable("DemoTable")

 

  • Verified the table in catalog
spark.catalog.listTables("default")

 

Output (Screenshot 3 – Table Catalog): [screenshot omitted]

 

With this approach, the table DemoTable was successfully created in the Lakehouse with the expected schema, and the data was retrieved correctly with CreationDateTime as a string. It worked in my case because I explicitly formatted the CreationDateTime column as a string before saving to the Lakehouse table. By default, Spark can sometimes infer a different data type (such as timestamp) depending on how the value is created; converting it to a string ensures consistency and prevents schema mismatch issues.
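
If the source column already arrives as a timestamp rather than a string, a similar effect can be achieved by formatting it to a string before writing; a minimal sketch, assuming the column is named CreationDateTime:

from pyspark.sql.functions import col, date_format

# Format the timestamp column as a string before saving, so the written schema stays consistent
df_str = df.withColumn(
    "CreationDateTime",
    date_format(col("CreationDateTime"), "yyyy-MM-dd HH:mm:ss.SSS")
)
df_str.write.mode("overwrite").saveAsTable("DemoTable")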

 

Best Regards,
Ganesh singamshetty.



Hi Ganesh,

 

Thanks for your kind support and explanation.

My scenario is slightly different; I am sharing a sample script below with the paths masked.

 

I have a scenario where I read Parquet files stored in the Lakehouse into a DataFrame to prepare my data, and the issue happens at the very first step:

source_df = spark.read.parquet("abfss://xxxxxx@onelake.dfs.fabric.microsoft.com/xxxx.Lakehouse/Files/xxxx/SRV0001148_*.parquet")

 

error: org.apache.spark.SparkException: Parquet column cannot be converted in file xxxxxxx Expected: string, Found: INT96.

 

I have tried defining my schema explicitly, but Spark still ignores it and uses only the types recorded in the Parquet files.
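
The "Expected: string, Found: INT96" message suggests the read schema only projects the columns and cannot convert the INT96 (legacy Parquet timestamp) encoding to string during the scan, so the cast seems to have to happen after reading each file with its native types; a minimal sketch, using the masked path style from above:

from pyspark.sql.functions import col

# Read one file with its native types, then cast every column to string
df = spark.read.parquet("filepath/SRV0001148_20250819065539974.parquet")
df_str = df.select([col(c).cast("string").alias(c) for c in df.columns])
df_str.printSchema()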


 
