<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>topic Conversion of int column from Pandas to Spark fails in Data Engineering</title>
    <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Conversion-of-int-column-from-Pandas-to-Spark-fails/m-p/3850334#M586</link>
    <description>&lt;P&gt;For a data transformation task on Microsoft Fabric, I am using Pandas DataFrames (because of some missing features in the Spark version).&lt;/P&gt;&lt;P&gt;When trying to push the data to tables, I have to convert to Spark, which fails. The following code highlights the problem:&lt;/P&gt;&lt;DIV class=""&gt;&lt;SPAN class=""&gt;Python&lt;/SPAN&gt;&lt;/DIV&gt;&lt;PRE&gt;&lt;SPAN&gt;&lt;SPAN class=""&gt;import&lt;/SPAN&gt; numpy
&lt;SPAN class=""&gt;import&lt;/SPAN&gt; pandas &lt;SPAN class=""&gt;as&lt;/SPAN&gt; pd

df = pd.DataFrame([&lt;SPAN class=""&gt;'id'&lt;/SPAN&gt;] + [numpy.int64(i) &lt;SPAN class=""&gt;for&lt;/SPAN&gt; i &lt;SPAN class=""&gt;in&lt;/SPAN&gt; range(&lt;SPAN class=""&gt;100&lt;/SPAN&gt;)])
print(df.dtypes)
display(df)&lt;/SPAN&gt;&lt;/PRE&gt;&lt;P&gt;The result is:&lt;/P&gt;&lt;BLOCKQUOTE&gt;&lt;P&gt;/opt/spark/python/lib/pyspark.zip/pyspark/sql/pandas/conversion.py:428: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below: Expected bytes, got a 'numpy.int64' object Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;The code fails. If I remove the cast to int64, the error still appears, but the code is able to recover.&lt;/P&gt;&lt;P&gt;I found an older instance of the same bug here: &lt;A href="https://learn.microsoft.com/en-us/answers/questions/852534/arrow-optimization-in-python-notebook-fails" target="_blank" rel="nofollow noopener ugc"&gt;https://learn.microsoft.com/en-us/answers/questions/852534/arrow-optimization-in-python-notebook-fails&lt;/A&gt;&lt;/P&gt;&lt;P&gt;The accepted resolution in that thread does not resolve the problem for me. Any suggestions?&lt;/P&gt;</description>
    <pubDate>Thu, 18 Apr 2024 14:22:13 GMT</pubDate>
    <dc:creator>JoergNeulist</dc:creator>
    <dc:date>2024-04-18T14:22:13Z</dc:date>
    <item>
      <title>Conversion of int column from Pandas to Spark fails</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Conversion-of-int-column-from-Pandas-to-Spark-fails/m-p/3850334#M586</link>
      <description>&lt;P&gt;For a data transformation task on Microsoft Fabric, I am using Pandas DataFrames (because of some missing features in the Spark version).&lt;/P&gt;&lt;P&gt;When trying to push the data to tables, I have to convert to Spark, which fails. The following code highlights the problem:&lt;/P&gt;&lt;DIV class=""&gt;&lt;SPAN class=""&gt;Python&lt;/SPAN&gt;&lt;/DIV&gt;&lt;PRE&gt;&lt;SPAN&gt;&lt;SPAN class=""&gt;import&lt;/SPAN&gt; numpy
&lt;SPAN class=""&gt;import&lt;/SPAN&gt; pandas &lt;SPAN class=""&gt;as&lt;/SPAN&gt; pd

df = pd.DataFrame([&lt;SPAN class=""&gt;'id'&lt;/SPAN&gt;] + [numpy.int64(i) &lt;SPAN class=""&gt;for&lt;/SPAN&gt; i &lt;SPAN class=""&gt;in&lt;/SPAN&gt; range(&lt;SPAN class=""&gt;100&lt;/SPAN&gt;)])
print(df.dtypes)
display(df)&lt;/SPAN&gt;&lt;/PRE&gt;&lt;P&gt;The result is:&lt;/P&gt;&lt;BLOCKQUOTE&gt;&lt;P&gt;/opt/spark/python/lib/pyspark.zip/pyspark/sql/pandas/conversion.py:428: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below: Expected bytes, got a 'numpy.int64' object Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.&lt;/P&gt;&lt;/BLOCKQUOTE&gt;&lt;P&gt;The code fails. If I remove the cast to int64, the error still appears, but the code is able to recover.&lt;/P&gt;&lt;P&gt;I found an older instance of the same bug here: &lt;A href="https://learn.microsoft.com/en-us/answers/questions/852534/arrow-optimization-in-python-notebook-fails" target="_blank" rel="nofollow noopener ugc"&gt;https://learn.microsoft.com/en-us/answers/questions/852534/arrow-optimization-in-python-notebook-fails&lt;/A&gt;&lt;/P&gt;&lt;P&gt;The accepted resolution in that thread does not resolve the problem for me. Any suggestions?&lt;/P&gt;</description>
      <pubDate>Thu, 18 Apr 2024 14:22:13 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Conversion-of-int-column-from-Pandas-to-Spark-fails/m-p/3850334#M586</guid>
      <dc:creator>JoergNeulist</dc:creator>
      <dc:date>2024-04-18T14:22:13Z</dc:date>
    </item>
    <item>
      <title>Re: Conversion of int column from Pandas to Spark fails</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Conversion-of-int-column-from-Pandas-to-Spark-fails/m-p/3853385#M587</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.fabric.microsoft.com/t5/user/viewprofilepage/user-id/727538"&gt;@JoergNeulist&lt;/a&gt;,&lt;/P&gt;&lt;P&gt;&amp;nbsp;&lt;/P&gt;&lt;P&gt;Just to double-check: can't you use the astype conversion method for this?&lt;/P&gt;&lt;P&gt;&lt;A href="https://pandas.pydata.org/docs/reference/frame.html#conversion" target="_blank"&gt;https://pandas.pydata.org/docs/reference/frame.html#conversion&lt;/A&gt;&lt;/P&gt;&lt;P&gt;&lt;A href="https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html" target="_blank"&gt;https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 19 Apr 2024 10:28:31 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Conversion-of-int-column-from-Pandas-to-Spark-fails/m-p/3853385#M587</guid>
      <dc:creator>Expiscornovus</dc:creator>
      <dc:date>2024-04-19T10:28:31Z</dc:date>
    </item>
    <item>
      <title>Re: Conversion of int column from Pandas to Spark fails</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Conversion-of-int-column-from-Pandas-to-Spark-fails/m-p/3853400#M588</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.fabric.microsoft.com/t5/user/viewprofilepage/user-id/727538"&gt;@JoergNeulist&lt;/a&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Thanks for using Microsoft Fabric Community.&lt;/P&gt;
&lt;P&gt;I tried to reproduce the scenario above with the following code:&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;import numpy
import pandas as pd

data = [numpy.int64(i) for i in range(100)]
pandas_df = pd.DataFrame(data, columns=['id'])
print(pandas_df.dtypes)
spark_df = spark.createDataFrame(pandas_df)

# Print schema (data types)
print(spark_df.dtypes)

# Display DataFrame (depends on your notebook environment)
display(spark_df)&lt;/LI-CODE&gt;
&lt;P&gt;&lt;BR /&gt;Output:&lt;/P&gt;
&lt;P&gt;&lt;span class="lia-inline-image-display-wrapper lia-image-align-inline" image-alt="vcboorlamsft_0-1713522558173.png" style="width: 400px;"&gt;&lt;img src="https://community.fabric.microsoft.com/t5/image/serverpage/image-id/1082245i901BCDC3FB8E385E/image-size/medium?v=v2&amp;amp;px=400" role="button" title="vcboorlamsft_0-1713522558173.png" alt="vcboorlamsft_0-1713522558173.png" /&gt;&lt;/span&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Please try the above code and let me know if the issue still persists.&lt;/P&gt;
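&lt;P&gt;As a minimal sketch of why the two constructions may behave differently: in the code from the original question, the string 'id' is passed as the first data value rather than as a column name, so the single column mixes a string with integers. One plausible reading of the warning "Expected bytes, got a 'numpy.int64' object" is that Arrow infers a string type from the first value and then encounters an integer; this is an assumption, not something confirmed in this thread. The names mixed_df, typed_df and fixed_df are hypothetical and used only for illustration.&lt;/P&gt;

```python
import numpy
import pandas as pd

# Construction from the original question: 'id' becomes the first ROW,
# so the column holds one string followed by 100 integers (object dtype).
mixed_df = pd.DataFrame(['id'] + [numpy.int64(i) for i in range(100)])

# Construction from the repro above: 'id' is the column NAME,
# and the column holds only integers (int64 dtype).
typed_df = pd.DataFrame([numpy.int64(i) for i in range(100)], columns=['id'])

# Compare the inferred dtypes of the two frames.
print(mixed_df.dtypes)
print(typed_df.dtypes)

# Hypothetical cleanup for the mixed frame: drop the stray header row,
# name the column, and cast the remaining values to int64.
fixed_df = mixed_df.iloc[1:].reset_index(drop=True)
fixed_df.columns = ['id']
fixed_df = fixed_df.astype('int64')
print(fixed_df.dtypes)
```

&lt;P&gt;Checking the dtypes of the pandas DataFrame for object columns before calling spark.createDataFrame may help narrow down which column triggers the warning.&lt;/P&gt;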
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Hope this helps.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Thank you.&lt;/P&gt;</description>
      <pubDate>Fri, 19 Apr 2024 10:32:59 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Conversion-of-int-column-from-Pandas-to-Spark-fails/m-p/3853400#M588</guid>
      <dc:creator>v-cboorla-msft</dc:creator>
      <dc:date>2024-04-19T10:32:59Z</dc:date>
    </item>
    <item>
      <title>Re: Conversion of int column from Pandas to Spark fails</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Conversion-of-int-column-from-Pandas-to-Spark-fails/m-p/3859833#M589</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.fabric.microsoft.com/t5/user/viewprofilepage/user-id/727538"&gt;@JoergNeulist&lt;/a&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;We haven’t heard back from you on the last response and are just checking whether you have found a resolution yet. If you have, please share it with the community, as it can be helpful to others.&lt;BR /&gt;Otherwise, please respond with more details and we will try to help.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Thanks.&lt;/P&gt;</description>
      <pubDate>Mon, 22 Apr 2024 15:40:21 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Conversion-of-int-column-from-Pandas-to-Spark-fails/m-p/3859833#M589</guid>
      <dc:creator>v-cboorla-msft</dc:creator>
      <dc:date>2024-04-22T15:40:21Z</dc:date>
    </item>
    <item>
      <title>Re: Conversion of int column from Pandas to Spark fails</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Conversion-of-int-column-from-Pandas-to-Spark-fails/m-p/3862487#M590</link>
      <description>&lt;P&gt;Hi&amp;nbsp;&lt;a href="https://community.fabric.microsoft.com/t5/user/viewprofilepage/user-id/727538"&gt;@JoergNeulist&lt;/a&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;We haven’t heard back from you on the last response and are just checking whether you have found a resolution yet. If you have, please share it with the community, as it can be helpful to others.&lt;BR /&gt;If you have any questions relating to the current thread, please let us know and we will try our best to help you.&lt;BR /&gt;If you have a question about a different issue, please open a new thread.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Thanks.&lt;/P&gt;</description>
      <pubDate>Tue, 23 Apr 2024 13:30:09 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Conversion-of-int-column-from-Pandas-to-Spark-fails/m-p/3862487#M590</guid>
      <dc:creator>v-cboorla-msft</dc:creator>
      <dc:date>2024-04-23T13:30:09Z</dc:date>
    </item>
    <item>
      <title>Re: Conversion of int column from Pandas to Spark fails</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Conversion-of-int-column-from-Pandas-to-Spark-fails/m-p/3862781#M591</link>
      <description>&lt;P&gt;Thank you for the support! I have built a workaround now and haven't had time to check back yet.&lt;/P&gt;&lt;P&gt;The above solution is interesting! My actual code looks a bit different though, because the data is being read from a CSV file. But your version highlights that not all dataframes are created alike. I'll look into it when the problem comes up again!&lt;/P&gt;</description>
      <pubDate>Tue, 23 Apr 2024 14:57:45 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Conversion-of-int-column-from-Pandas-to-Spark-fails/m-p/3862781#M591</guid>
      <dc:creator>JoergNeulist</dc:creator>
      <dc:date>2024-04-23T14:57:45Z</dc:date>
    </item>
    <item>
      <title>Re: Conversion of int column from Pandas to Spark fails</title>
      <link>https://community.fabric.microsoft.com/t5/Data-Engineering/Conversion-of-int-column-from-Pandas-to-Spark-fails/m-p/3862789#M592</link>
      <description>&lt;P&gt;The underlying task here is converting a Pandas DF to a Spark DF. I'm not trying to convert column data types. The cast is there purely to highlight the problem.&lt;/P&gt;&lt;P&gt;The symptom seems to be that Spark is trying to use pyarrow to optimize the conversion, but there's something wrong with the Java dependencies.&lt;/P&gt;</description>
      <pubDate>Tue, 23 Apr 2024 15:00:47 GMT</pubDate>
      <guid>https://community.fabric.microsoft.com/t5/Data-Engineering/Conversion-of-int-column-from-Pandas-to-Spark-fails/m-p/3862789#M592</guid>
      <dc:creator>JoergNeulist</dc:creator>
      <dc:date>2024-04-23T15:00:47Z</dc:date>
    </item>
  </channel>
</rss>

