Conversion of int column from Pandas to Spark fails

JoergNeulist — Thu, 18 Apr 2024 14:22:13 GMT

For a data transformation task on Microsoft Fabric, I am using Pandas DataFrames (because of some missing features in the Spark version).

When trying to push the data to tables, I have to convert to Spark, which fails. The following code highlights the problem:

Python

import numpy
import pandas as pd

df = pd.DataFrame(['id'] + [numpy.int64(i) for i in range(100)])
print(df.dtypes)
display(df)

The result is:

/opt/spark/python/lib/pyspark.zip/pyspark/sql/pandas/conversion.py:428: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
  Expected bytes, got a 'numpy.int64' object
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.

The code fails. If I remove the cast to int64, the error still appears, but the code is able to recover.

I found an older instance of the same bug here: https://learn.microsoft.com/en-us/answers/questions/852534/arrow-optimization-in-python-notebook-fails

The accepted resolution in that thread does not resolve the problem for me. Any suggestions?

Re: Conversion of int column from Pandas to Spark fails

Expiscornovus — Fri, 19 Apr 2024 10:28:31 GMT

Hi @JoergNeulist,

Just to double-check: can't you use the astype conversion method for this?

https://pandas.pydata.org/docs/reference/frame.html#conversion

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.astype.html
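
For illustration, here is a minimal sketch of what that could look like. It assumes a hypothetical frame whose id column holds integers but is stored with object dtype. Note that astype raises an error if the column contains a value that cannot be parsed as an integer, such as a stray string.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: integer values stored in an object-dtype column.
df = pd.DataFrame({"id": [np.int64(i) for i in range(5)]}, dtype=object)
print(df.dtypes)

# astype returns a new DataFrame; the original frame is left unchanged.
converted = df.astype({"id": "int64"})
print(converted.dtypes)
```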

Re: Conversion of int column from Pandas to Spark fails

v-cboorla-msft — Fri, 19 Apr 2024 10:32:59 GMT

Hi @JoergNeulist

Thanks for using Microsoft Fabric Community.

I tried to repro the above scenario with the below code:

import numpy
import pandas as pd

data = [numpy.int64(i) for i in range(100)]
pandas_df = pd.DataFrame(data, columns=['id'])
print(pandas_df.dtypes)

# `spark` is the SparkSession that the notebook provides
spark_df = spark.createDataFrame(pandas_df)

# Print schema (data types)
print(spark_df.dtypes)

# Display DataFrame (depends on your notebook environment)
display(spark_df)

Please try the above code and let me know if the issue still persists.
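
The difference from the snippet in the original post can be sketched with pandas alone, without a Spark session. The variable names below are only for illustration. In the original snippet, 'id' is the first element of the data list, so it ends up as a data row rather than a column name, and the column mixes a string with integers.

```python
import numpy as np
import pandas as pd

values = [np.int64(i) for i in range(100)]

# Original snippet: 'id' becomes the first data row, so the single column
# (named 0) mixes a str with integers and pandas stores it as object.
mixed_df = pd.DataFrame(["id"] + values)
print(mixed_df.dtypes)

# Repro version: 'id' is the column name and the data stays int64.
typed_df = pd.DataFrame(values, columns=["id"])
print(typed_df.dtypes)
```

An object column that holds both strings and integers is a plausible source of the "Expected bytes, got a 'numpy.int64' object" message, because Arrow has to pick a single type for the whole column.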

Hope this helps.

Thank you.

Re: Conversion of int column from Pandas to Spark fails

v-cboorla-msft — Mon, 22 Apr 2024 15:40:21 GMT

Hi @JoergNeulist

Thanks.

Re: Conversion of int column from Pandas to Spark fails

v-cboorla-msft — Tue, 23 Apr 2024 13:30:09 GMT

Hi @JoergNeulist

We haven’t heard from you since the last response and are just checking back to see if you have a resolution yet. If you have found one, please share it with the community, as it can be helpful to others.
If you have any questions relating to the current thread, please let us know and we will try our best to help you.
If you have a question on a different issue, we request that you open a new thread.

Thanks.

Re: Conversion of int column from Pandas to Spark fails

JoergNeulist — Tue, 23 Apr 2024 14:57:45 GMT

Thank you for the support! I have built a workaround now and haven't had time to check back yet.

The above solution is interesting! My actual code looks a bit different, though, because the data is being read from a CSV file. But your version highlights that not all dataframes are created alike. I'll look into it when the problem comes up again!
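
For the CSV case, one option might be to pin the column type at read time, so the frame handed to Spark has a known dtype. This is a minimal sketch, assuming a hypothetical file with an integer id column; the inline text stands in for the real file, and the column names are made up.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for the real file.
csv_text = "id,name\n1,a\n2,b\n3,c\n"

# Passing dtype makes read_csv fail early if a value in 'id' is not an
# integer, instead of silently producing an object column.
df = pd.read_csv(io.StringIO(csv_text), dtype={"id": "int64"})
print(df.dtypes)
```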

Re: Conversion of int column from Pandas to Spark fails

JoergNeulist — Tue, 23 Apr 2024 15:00:47 GMT

The underlying task here is converting a Pandas DF to a Spark DF. I'm not trying to convert column data types. The cast is there purely to highlight the problem.

The symptom seems to be that Spark is trying to use pyarrow to optimize the conversion, but there's something wrong with the Java dependencies.
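
One way to check whether the input frame itself, rather than the Arrow installation, triggers the fallback is to look for object columns that hold more than one Python type. The sketch below uses pandas only; the helper name mixed_type_columns is hypothetical and not part of pandas or PySpark.

```python
import numpy as np
import pandas as pd


def mixed_type_columns(df: pd.DataFrame) -> dict:
    """Return {column: set of type names} for object columns holding
    values of more than one Python type."""
    report = {}
    for name in df.columns:
        if df[name].dtype == object:
            type_names = {type(value).__name__ for value in df[name]}
            if len(type_names) > 1:
                report[name] = type_names
    return report


# The frame from the original snippet: a str followed by numpy integers.
df = pd.DataFrame(["id"] + [np.int64(i) for i in range(100)])
print(mixed_type_columns(df))
```

If the helper reports a column, cleaning or casting that column before calling spark.createDataFrame is probably worth trying before looking at the dependencies.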