<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Re: Parquet files reading into spark data frame is throwing data type error in Data Engineering</title>
    <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Parquet-files-reading-into-spark-data-frame-is-throwing-data/m-p/4802229#M11842</link>
    <description>&lt;P&gt;Hi Ganesh,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks for your kind support and explanation.&lt;/P&gt;&lt;P&gt;My scenario is slightly different, so I am sharing the sample script with masking.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have a scenario where I read Parquet files stored in the Lakehouse into a DataFrame to prepare my data, and the issue happens at the very first step:&lt;/P&gt;&lt;P&gt;source_df = spark.read.parquet("abfss://xxxxxx@onelake.dfs.fabric.microsoft.com/xxxx.Lakehouse/Files/xxxx/SRV0001148_*.parquet")&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;error:&amp;nbsp;&lt;SPAN&gt;org.apache.spark.SparkException: Parquet column cannot be converted in file xxxxxxx. Expected: string, Found: INT96.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have tried defining my schema explicitly, but Spark still ignores it and takes the types only from the Parquet files.&lt;/P&gt;</description>
    <pubDate>Wed, 20 Aug 2025 11:24:58 GMT</pubDate>
    <dc:creator>Sureshmannem</dc:creator>
    <dc:date>2025-08-20T11:24:58Z</dc:date>
    <item>
      <title>Parquet files reading into spark data frame is throwing data type error</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Parquet-files-reading-into-spark-data-frame-is-throwing-data/m-p/4801853#M11829</link>
      <description>&lt;P&gt;Dear All,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have a requirement to read Parquet files from a directory into a DataFrame to prepare the data from the Bronze Lakehouse for the Silver Lakehouse. While reading the files, it throws this error message:&lt;/P&gt;&lt;P&gt;&lt;SPAN&gt;org.apache.spark.SparkException: Parquet column cannot be converted in file filepath/SRV0001148_20250819065539974.parquet. Column: [Syncxxx.xxx:ApplicationArea.xxx:CreationDateTime], Expected: string, Found: INT96.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;#1) sample script:&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;from pyspark.sql import SparkSession
from pyspark.sql.types import *

source_df = spark.read.parquet("filepath/SRV0001148_*.parquet")
source_df.show()&lt;/LI-CODE&gt;
&lt;P&gt;#2) sample script:&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;from pyspark.sql import SparkSession
from pyspark.sql.types import *

schema = StructType([
    StructField("column1", StringType(), True),
    StructField("column2", StringType(), True)
])

source_df = spark.read.schema(schema).parquet("filepath/SRV0001148_*.parquet")
source_df.show()&lt;/LI-CODE&gt;
&lt;P&gt;Some of the files are working. I was looking for an approach to load the data with every attribute treated as a string, but it is not working, hence this request for support. If anyone is experiencing a similar issue, please share your insight; it would be a great help. Thanks in advance.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Regards,&lt;/P&gt;&lt;P&gt;Suresh&lt;/P&gt;
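&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;(For context, a variant I sketched with placeholder column names: INT96 is Parquet’s legacy timestamp encoding, so declaring the affected column as a timestamp in the explicit schema, and casting it to string after the read, may avoid the conversion error for files encoded that way.)&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import *

# Hypothetical sketch: declare the INT96 column as a timestamp so the reader
# does not attempt an INT96-to-string conversion, then cast it afterwards.
schema = StructType([
    StructField("CreationDateTime", TimestampType(), True),
    StructField("column2", StringType(), True)
])

source_df = spark.read.schema(schema).parquet("filepath/SRV0001148_*.parquet")
source_df = source_df.withColumn("CreationDateTime", col("CreationDateTime").cast("string"))&lt;/LI-CODE&gt;</description>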
      <pubDate>Wed, 20 Aug 2025 06:57:43 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Parquet-files-reading-into-spark-data-frame-is-throwing-data/m-p/4801853#M11829</guid>
      <dc:creator>Sureshmannem</dc:creator>
      <dc:date>2025-08-20T06:57:43Z</dc:date>
    </item>
    <item>
      <title>Re: Parquet files reading into spark data frame is throwing data type error</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Parquet-files-reading-into-spark-data-frame-is-throwing-data/m-p/4802181#M11838</link>
      <description>&lt;P&gt;Hello &lt;a href="https://community.fabric.microsoft.com/t5/user/viewprofilepage/user-id/1296869"&gt;@Sureshmannem&lt;/a&gt;,&lt;BR /&gt;Thank you for reaching out to the Microsoft Fabric Community Forum. &lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;
&lt;P&gt;I have reproduced your scenario in a Fabric notebook and got the expected results. Below I’ll share the steps, the code I used, and screenshots of the outputs for clarity.&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Created a DataFrame with sample data&lt;/LI&gt;
&lt;/UL&gt;
&lt;LI-CODE lang="markup"&gt;from datetime import datetime
from pyspark.sql import Row

data = [
    Row(ID="1", Name="Ganesh", CreationDateTime=datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f")[:-3]),
    Row(ID="2", Name="Ravi",   CreationDateTime=datetime.now().strftime("%Y-%m-%d %H:%M:%S.%f")[:-3])
]

df = spark.createDataFrame(data)
df.printSchema()
df.show(truncate=False)&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Output (Screenshot 1 – Schema &amp;amp; Screenshot 2 – Data):&lt;/P&gt;
&lt;P&gt;&lt;img src="https://community.fabric.microsoft.com/t5/image/serverpage/image-id/1292467i3F5B2B786C60A463/image-dimensions/362x220?v=v2" width="362" height="220" alt="vssriganesh_0-1755686626562.png" /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Saved DataFrame as a Lakehouse table&lt;/LI&gt;
&lt;/UL&gt;
&lt;LI-CODE lang="markup"&gt;df.write.mode("overwrite").saveAsTable("DemoTable")&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Verified the table in catalog&lt;/LI&gt;
&lt;/UL&gt;
&lt;LI-CODE lang="markup"&gt;spark.catalog.listTables("default")&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Output (Screenshot 3 – Table Catalog):&lt;/P&gt;
&lt;P&gt;&lt;img src="https://community.fabric.microsoft.com/t5/image/serverpage/image-id/1292468i5E268DE59CD924AB/image-size/medium?v=v2&amp;amp;px=400" alt="vssriganesh_1-1755686691368.png" /&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;With this approach, the table DemoTable was successfully created in the Lakehouse with the expected schema, and the data was retrieved correctly with CreationDateTime as a string. It worked in my case because I explicitly formatted the CreationDateTime column as a string before saving to the Lakehouse table. By default, Spark can infer a different data type (such as timestamp) depending on how the value is created; converting it to a string ensures consistency and prevents schema mismatch issues.&lt;/P&gt;
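&lt;P&gt;(If the column already arrives as a timestamp, a minimal sketch of the same idea, with df standing in for your DataFrame, is to cast it to string before saving:)&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;from pyspark.sql.functions import col

# Cast the timestamp column to string before writing, so downstream
# readers see a consistent string type for CreationDateTime.
df = df.withColumn("CreationDateTime", col("CreationDateTime").cast("string"))
df.write.mode("overwrite").saveAsTable("DemoTable")&lt;/LI-CODE&gt;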
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Best Regards,&lt;BR /&gt;Ganesh Singamshetty&lt;/P&gt;</description>
      <pubDate>Wed, 20 Aug 2025 10:50:14 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Parquet-files-reading-into-spark-data-frame-is-throwing-data/m-p/4802181#M11838</guid>
      <dc:creator>v-ssriganesh</dc:creator>
      <dc:date>2025-08-20T10:50:14Z</dc:date>
    </item>
    <item>
      <title>Re: Parquet files reading into spark data frame is throwing data type error</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Parquet-files-reading-into-spark-data-frame-is-throwing-data/m-p/4802229#M11842</link>
      <description>&lt;P&gt;Hi Ganesh,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Thanks for your kind support and explanation.&lt;/P&gt;&lt;P&gt;My scenario is slightly different, so I am sharing the sample script with masking.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have a scenario where I read Parquet files stored in the Lakehouse into a DataFrame to prepare my data, and the issue happens at the very first step:&lt;/P&gt;&lt;P&gt;source_df = spark.read.parquet("abfss://xxxxxx@onelake.dfs.fabric.microsoft.com/xxxx.Lakehouse/Files/xxxx/SRV0001148_*.parquet")&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;error:&amp;nbsp;&lt;SPAN&gt;org.apache.spark.SparkException: Parquet column cannot be converted in file xxxxxxx. Expected: string, Found: INT96.&lt;/SPAN&gt;&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;I have tried defining my schema explicitly, but Spark still ignores it and takes the types only from the Parquet files.&lt;/P&gt;
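&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;(For reference, a rough sketch of the kind of workaround I am after, with the path masked: a user-supplied schema cannot convert Parquet’s physical INT96 type to string, so one option is to let Spark infer the types and cast every column afterwards. This works per file, or across files whose physical types are consistent.)&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;from pyspark.sql.functions import col

# Read with Spark's inferred types (INT96 maps to timestamp),
# then cast every column to string after the read.
df = spark.read.parquet("abfss://xxxxxx@onelake.dfs.fabric.microsoft.com/xxxx.Lakehouse/Files/xxxx/SRV0001148_*.parquet")
df = df.select([col(c).cast("string").alias(c) for c in df.columns])&lt;/LI-CODE&gt;</description>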
      <pubDate>Wed, 20 Aug 2025 11:24:58 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Parquet-files-reading-into-spark-data-frame-is-throwing-data/m-p/4802229#M11842</guid>
      <dc:creator>Sureshmannem</dc:creator>
      <dc:date>2025-08-20T11:24:58Z</dc:date>
    </item>
    <item>
      <title>Re: Parquet files reading into spark data frame is throwing data type error</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Parquet-files-reading-into-spark-data-frame-is-throwing-data/m-p/4802670#M11856</link>
      <description>&lt;P&gt;&lt;STRONG&gt;Dear Community,&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;Thank you for your continued support.&lt;/P&gt;&lt;P&gt;I’m happy to share that I’ve resolved the issue I was facing, and I’d like to outline the approach I followed in case it helps others encountering similar challenges.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;H3&gt;&lt;STRONG&gt;Initial Observation&lt;/STRONG&gt;&lt;/H3&gt;&lt;P&gt;The issue occurred when I attempted to load over 50 Parquet files into a single PySpark DataFrame using a wildcard path. PySpark inferred the schema from the data in each file, but inconsistencies arose: some files interpreted a particular attribute as an integer, while others treated the same attribute as a string.&lt;/P&gt;&lt;P&gt;This led to &lt;STRONG&gt;data type mismatch errors&lt;/STRONG&gt; during the read operation.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;H3&gt;&lt;STRONG&gt;Testing&lt;/STRONG&gt;&lt;/H3&gt;&lt;P&gt;To investigate further, I loaded each file individually into a DataFrame. This worked as expected, confirming that the wildcard-based bulk load was failing due to schema inference conflicts across files.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;H3&gt;&lt;STRONG&gt;Solution&lt;/STRONG&gt;&lt;/H3&gt;&lt;P&gt;I modified my script to &lt;STRONG&gt;iterate through each file individually&lt;/STRONG&gt;, applying the full processing logic per file. This approach bypasses the schema inference conflict and successfully loads and processes all files.&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;# Step 1: Import required packages
from pyspark.sql import SparkSession
from pyspark.sql.types import *
import pandas as pd
from functools import reduce
from notebookutils import mssparkutils

# Step 2: Define the Lakehouse path
lakehouse_path = "abfss://xxx@onelake.dfs.fabric.microsoft.com/xxx.Lakehouse/xxx/"

# Step 3: List all Parquet files in the folder
file_list = mssparkutils.fs.ls(lakehouse_path)
parquet_files = [f.path for f in file_list if f.path.endswith(".parquet")]

# Step 4: Read the schema reference file once
schema_df = spark.read.parquet("abfss://xxx@onelake.dfs.fabric.microsoft.com/xxx.Lakehouse/Files/xxx/xxx.parquet").toPandas()
schema_df = schema_df.head(0)  # Empty schema frame
schema_columns = schema_df.columns.tolist()

# Step 5: Define helper functions
def clean_column_name(col_name):
    # Keep only the part after the last '@' or ':' separator
    for sep in ['@', ':']:
        if sep in col_name:
            col_name = col_name.split(sep)[-1]
    return col_name

def rename_columns(df, old_names, new_names):
    return reduce(
        lambda data, idx: data.withColumnRenamed(old_names[idx], new_names[idx]),
        range(len(old_names)),
        df
    )

# Step 6: Loop through each file and process
for file_path in parquet_files:
    print(f"Processing file: {file_path}")

    # Load the file into a Spark DataFrame
    source_df = spark.read.parquet(file_path)

    # Clean and rename columns
    old_columns = source_df.columns
    new_columns = [clean_column_name(col) for col in old_columns]
    source_df = rename_columns(source_df, old_columns, new_columns)

    # Convert to pandas
    source_df = source_df.toPandas().reset_index(drop=True)

    # Add missing columns
    for col in schema_columns:
        if col not in source_df.columns:
            source_df[col] = pd.NA

    # Reorder columns
    source_df = source_df[schema_columns]

    # Concatenate with the empty schema frame and convert everything to string
    final_df = pd.concat([schema_df, source_df], ignore_index=True, sort=False).astype(str)

    # Convert back to a Spark DataFrame
    final_spark_df = spark.createDataFrame(final_df)

    # Show preview (or write to staging)
    final_spark_df.show()&lt;/LI-CODE&gt;
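&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;(As a final note: the loop above only previews each result with show(). A minimal sketch of the staging write it alludes to, with the table name as a hypothetical placeholder, would append each file’s normalized, all-string rows to a staging table:)&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;# Hypothetical staging write (table name is a placeholder): append each
# processed file's rows instead of only previewing them.
final_spark_df.write.mode("append").saveAsTable("staging_SRV0001148")&lt;/LI-CODE&gt;</description>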
      <pubDate>Wed, 20 Aug 2025 17:11:03 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Parquet-files-reading-into-spark-data-frame-is-throwing-data/m-p/4802670#M11856</guid>
      <dc:creator>Sureshmannem</dc:creator>
      <dc:date>2025-08-20T17:11:03Z</dc:date>
    </item>
    <item>
      <title>Re: Parquet files reading into spark data frame is throwing data type error</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Parquet-files-reading-into-spark-data-frame-is-throwing-data/m-p/4802684#M11857</link>
      <description>&lt;P&gt;&lt;STRONG&gt;Dear Community,&lt;/STRONG&gt;&lt;/P&gt;&lt;P&gt;Thank you for your continued support.&lt;/P&gt;&lt;P&gt;I’m happy to share that I’ve resolved the issue I was facing, and I’d like to outline the approach I followed in case it helps others encountering similar challenges.&lt;/P&gt;&lt;H3&gt;&lt;STRONG&gt;Initial Observation&lt;/STRONG&gt;&lt;/H3&gt;&lt;P&gt;The issue occurred when I attempted to load over 50 Parquet files into a single PySpark DataFrame using a wildcard path. PySpark inferred the schema from the data in each file, but inconsistencies arose: some files interpreted a particular attribute as an integer, while others treated the same attribute as a string.&lt;/P&gt;&lt;P&gt;This led to &lt;STRONG&gt;data type mismatch errors&lt;/STRONG&gt; during the read operation.&lt;/P&gt;&lt;H3&gt;&lt;STRONG&gt;Testing&lt;/STRONG&gt;&lt;/H3&gt;&lt;P&gt;To investigate further, I loaded each file individually into a DataFrame. This worked as expected, confirming that the wildcard-based bulk load was failing due to schema inference conflicts across files.&lt;/P&gt;&lt;H3&gt;&lt;STRONG&gt;Solution&lt;/STRONG&gt;&lt;/H3&gt;&lt;P&gt;I modified my script to &lt;STRONG&gt;iterate through each file individually&lt;/STRONG&gt;, applying the full processing logic per file. This approach bypasses the schema inference conflict and successfully loads and processes all files; the full script is in my previous reply.&lt;/P&gt;</description>
      <pubDate>Wed, 20 Aug 2025 17:16:35 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Parquet-files-reading-into-spark-data-frame-is-throwing-data/m-p/4802684#M11857</guid>
      <dc:creator>Sureshmannem</dc:creator>
      <dc:date>2025-08-20T17:16:35Z</dc:date>
    </item>
  </channel>
</rss>

