Conversion of int column from Pandas to Spark fails

JoergNeulist · ‎04-18-2024

For a data transformation task on Microsoft Fabric, I am using Pandas DataFrames (because of some missing features in the Spark version).

When trying to push the data to tables, I have to convert to Spark, which fails. The following code highlights the problem:

Python

import numpy
import pandas as pd

df = pd.DataFrame(['id'] + [numpy.int64(i) for i in range(100)])
print(df.dtypes)
display(df)

The result is:

/opt/spark/python/lib/pyspark.zip/pyspark/sql/pandas/conversion.py:428: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below: Expected bytes, got a 'numpy.int64' object Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.

The code fails. If I remove the cast to int64, the error still appears, but the code is able to recover.

I found an older instance of the same bug here: https://learn.microsoft.com/en-us/answers/questions/852534/arrow-optimization-in-python-notebook-fai...

The accepted resolution in that thread does not resolve the problem for me. Any suggestions?

v-cboorla-msft · ‎04-19-2024

Hi @JoergNeulist

Thanks for using Microsoft Fabric Community.

I tried to repro the above scenario with the below code:

import numpy
import pandas as pd

data = [numpy.int64(i) for i in range(100)]
pandas_df = pd.DataFrame(data, columns=['id'])
print(pandas_df.dtypes)
spark_df = spark.createDataFrame(pandas_df)

# Print schema (data types)
print(spark_df.dtypes)

# Display DataFrame (depends on your notebook environment)
display(spark_df)

Output:

Please try the above code and let me know if the issue still persists.

Hope this helps.

Thank you.

JoergNeulist · ‎04-23-2024

Thank you for the support! I have built a workaround now and haven't had time to check back yet.

The above solution is interesting! My actual code looks a bit different though, because the data is being read from a CSV file. But your version highlights that not all dataframes are created alike. I'll look into it when I run into it again!
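To make that difference concrete, here is a minimal sketch comparing the two constructions used in this thread. It only inspects the Pandas side and does not call Spark. The variable names `mixed` and `clean` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Construction from the original post: 'id' is the first data value,
# not a column name, so the single column mixes a str with integers.
mixed = pd.DataFrame(['id'] + [np.int64(i) for i in range(100)])

# Construction from the reply: 'id' is the column name and every
# value in the column is an integer.
clean = pd.DataFrame([np.int64(i) for i in range(100)], columns=['id'])

print(mixed.dtypes)   # a single column labelled 0
print(clean.dtypes)   # a single column labelled 'id'
print(mixed.shape, clean.shape)
```

The first frame has 101 rows in an object-dtype column, while the second has 100 rows in an int64 column. That difference may be what the Arrow conversion is tripping over, although the thread does not confirm it.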

v-cboorla-msft · ‎04-22-2024

Hi @JoergNeulist

We haven’t heard back from you since our last response and are just checking whether you have found a resolution. If you have, please share it with the community, as it can be helpful to others.
Otherwise, please reply with more details and we will try to help.

Thanks.

v-cboorla-msft · ‎04-23-2024

Hi @JoergNeulist

We haven’t heard back from you since our last response and are just checking whether you have found a resolution. If you have, please share it with the community, as it can be helpful to others.
If you have any questions relating to the current thread, please let us know and we will try our best to help you.
If you have a question on a different issue, we request that you open a new thread.

Thanks.

Expiscornovus · ‎04-19-2024

Hi @JoergNeulist,

Just to double-check: can't you use the astype conversion method for this?

https://pandas.pydata.org/docs/reference/frame.html#conversion

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html
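For reference, a minimal sketch of what `astype` can and cannot do here. The sample data and the names `df`, `bad` and `astype_failed` are made up for illustration. The assumption is a column that has ended up with object dtype.

```python
import pandas as pd

# An object-dtype column that happens to hold only integers.
df = pd.DataFrame({'id': pd.Series([1, 2, 3], dtype='object')})
df['id'] = df['id'].astype('int64')
print(df.dtypes)

# astype cannot help when a non-numeric value sits in the column:
# the string 'id' cannot be parsed as an integer.
bad = pd.Series(['id', 1, 2], dtype='object')
try:
    bad.astype('int64')
    astype_failed = False
except ValueError:
    astype_failed = True
print(astype_failed)
```

So `astype` is only useful once every value in the column is actually convertible to the target type.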

Happy to help out 🙂

JoergNeulist · ‎04-23-2024

The underlying task here is converting a Pandas DF to a Spark DF. I'm not trying to convert column data types. The cast is there purely to highlight the problem.

The symptom seems to be that Spark is trying to use pyarrow to optimize the conversion, but there's something wrong with the Java dependencies.
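The wording of the warning ("Expected bytes, got a 'numpy.int64' object") may also point to a column that mixes strings and integers rather than to a dependency problem, although that is an assumption and not something this thread confirms. A minimal sketch for checking a Pandas DataFrame before handing it to Spark follows. The helper `mixed_type_columns` and the `sample` data are hypothetical and are not part of Pandas or Spark.

```python
import pandas as pd

def mixed_type_columns(df: pd.DataFrame) -> dict:
    """Return {column: set of Python type names} for every object-dtype
    column whose non-null values have more than one Python type."""
    report = {}
    for name in df.columns:
        col = df[name]
        if col.dtype != object:
            continue  # numeric and other typed columns are uniform
        kinds = {type(v).__name__ for v in col.dropna()}
        if len(kinds) > 1:
            report[name] = kinds
    return report

# Hypothetical sample: column 'a' mixes a string with integers.
sample = pd.DataFrame({'a': ['id', 1, 2], 'b': [1, 2, 3], 'c': ['x', 'y', 'z']})
print(mixed_type_columns(sample))

# Once the report is empty, the conversion from the thread can be retried:
# spark_df = spark.createDataFrame(sample)  # assumes a notebook-provided `spark`
```

For data read from a CSV file, a report like this could show whether a stray header row or text value has ended up inside a column that is meant to be numeric.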
