I retrieve data from JDE as the source, and the tables contain date fields where the information is stored as Julian dates. Currently there are 8 different sources, with one notebook per source, all located in the same workspace, so the same function to convert Julian dates to standard dates is defined in all 8 notebooks. Since all notebooks use the same code, is it possible within the Fabric framework to create a reusable component, like a user-defined function, that contains the Julian-to-date transformation code? This function could then be called from all the notebooks, making the process more efficient to maintain.
Thanks
Solved! Go to Solution.
Hi @v-ssriganesh ,
Thanks for your response. I created a User data function as suggested; however, it did not work. Here is the code and scenario:
User data function:
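The function body itself did not survive the forum formatting. A minimal sketch of what it plausibly looks like, assuming the standard fabric.functions pattern and JDE's CYYDDD Julian layout (consistent with the sample call below, where '123241' yields 2023-08-29):

```python
# Hypothetical reconstruction of the User data function item.
# JDE CYYDDD layout: C = centuries past 1900, YY = two-digit year, DDD = day of year.
import datetime
import fabric.functions as fn

udf = fn.UserDataFunctions()

@udf.function()
def convert_julian_to_date(julian_date: str) -> str:
    julian = int(julian_date)
    year = 1900 + (julian // 100000) * 100 + (julian // 1000) % 100
    day_of_year = julian % 1000
    result = datetime.datetime(year, 1, 1) + datetime.timedelta(days=day_of_year - 1)
    return str(result)  # '123241' -> '2023-08-29 00:00:00'
```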
Code in the notebook:

```python
# Instantiate the function
data_functions = notebookutils.udf.getFunctions('data_functions')

# Test the function with a single value
data_functions.convert_julian_to_date('123241')
# Output: '2023-08-29 00:00:00'

# Call the function for a DataFrame column, and it fails
# with: TypeError: Column is not iterable
df_silver = df_silver.withColumn('request_date', data_functions.convert_julian_to_date(df_silver['request_date']))
```
```
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[46], line 1
----> 1 df_silver = df_silver.withColumn('request_date', data_functions.convert_julian_to_date(df_silver['request_date']))

File ~/cluster-env/clonedenv/lib/python3.10/site-packages/notebookutils/mssparkutils/handlers/udfHandler.py:95, in UDF.__create_dynamic_function.<locals>.dynamic_function(*args, **kwargs)
---> 95 result = self.__udf_handler.run(artifact_id, name, parameters, workspace_id, capacity_id)

File ~/cluster-env/clonedenv/lib/python3.10/site-packages/notebookutils/mssparkutils/handlers/udfHandler.py:27, in UdfHandler.run(self, artifact_id, function_name, parameters, workspace_id, capacity_id)
---> 27 return self.jvm.notebookutils.udf.run(artifact_id, function_name, parameters, workspace_id, capacity_id)

[py4j conversion frames elided: JavaMember._build_args / MapConverter / ListConverter try to iterate the Column argument]

File /opt/spark/python/lib/pyspark.zip/pyspark/sql/column.py:710, in Column.__iter__(self)
    709 def __iter__(self) -> None:
--> 710     raise TypeError("Column is not iterable")

TypeError: Column is not iterable
```
Hi @tinbaj,
Thank you for sharing the details and code. The TypeError: Column is not iterable occurs because the User Data Function (UDF) is being applied directly to a Spark DataFrame column, while the function expects a single string input. To fix this, you need to register the UDF with Spark so it can handle DataFrame columns.
Here’s how to resolve it: in your notebook, after instantiating the UDF, register it with Spark and apply it in your DataFrame code, as sketched below.
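The suggested snippet was lost in the thread formatting. Reconstructed from the register call quoted in the follow-up post (the withColumn application via expr is an assumption):

```python
# Register the Fabric function as a Spark UDF so Spark can apply it per row.
spark.udf.register("convert_julian_to_date", data_functions.convert_julian_to_date)

# Apply the registered function to the column through a SQL expression.
from pyspark.sql import functions as F
df_silver = df_silver.withColumn('request_date', F.expr("convert_julian_to_date(request_date)"))
```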
This way the UDF processes each row’s request_date value individually. Also verify that the request_date column in df_silver contains valid Julian date strings (e.g., '123241'); if the column has mixed or invalid data types, you may need to preprocess it so that all values are strings.
If this helps, please mark it as “Accept as solution” and feel free to give a “Kudos” to help others in the community as well.
Thank you.
Hi @v-ssriganesh ,
Thanks for your response. I am getting this error when I run the command to register the UDF:

```python
# Register the UDF
spark.udf.register("convert_julian_to_date", data_functions.convert_julian_to_date)
```
```
--> 612 self.sparkSession._jsparkSession.udf().registerPython(name, register_udf._judf)
    613 return return_udf

File /opt/spark/python/lib/pyspark.zip/pyspark/sql/udf.py:321, in UserDefinedFunction._judf(self)
--> 321 self._judf_placeholder = self._create_judf(self.func)

File /opt/spark/python/lib/pyspark.zip/pyspark/sql/udf.py:330, in UserDefinedFunction._create_judf(self, func)
--> 330 wrapped_func = _wrap_function(sc, func, self.returnType)

File /opt/spark/python/lib/pyspark.zip/pyspark/sql/udf.py:59, in _wrap_function(sc, func, returnType)
---> 59 pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)

File /opt/spark/python/lib/pyspark.zip/pyspark/rdd.py:5251, in _prepare_for_python_RDD(sc, command)
-> 5251 pickled_command = ser.dumps(command)

File /opt/spark/python/lib/pyspark.zip/pyspark/serializers.py:469, in CloudPickleSerializer.dumps(self, obj)
--> 469 raise pickle.PicklingError(msg)

PicklingError: Could not serialize object: PySparkRuntimeError: [CONTEXT_ONLY_VALID_ON_DRIVER] It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
```
Hi @tinbaj,
Thank you for providing the error details. The PicklingError: [CONTEXT_ONLY_VALID_ON_DRIVER] occurs because the User Data Function (UDF) is being serialized in a way that references the SparkContext, which isn't allowed in Spark's distributed environment. This is likely due to how the UDF is defined or accessed in your notebook.
To resolve this, try the following: instead of directly registering the UDF with spark.udf.register, use the Fabric UDF directly in the DataFrame operation, as Fabric’s UDFs are designed to work with Spark. Update your notebook code as sketched below.
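The suggested code itself was lost in the thread formatting; judging from the description and the error reported in the next post, it presumably applied the Fabric function directly inside withColumn, along these lines:

```python
# Presumed suggestion: call the Fabric UDF directly in the DataFrame operation.
# (The follow-up post shows this still raises PySparkTypeError: [NOT_ITERABLE].)
from pyspark.sql import functions as F

df_silver = df_silver.withColumn(
    'request_date',
    data_functions.convert_julian_to_date(F.col('request_date'))
)
```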
If the error persists, please share the exact code you ran and the full error output.
Please try these steps and let me know the outcome. If it resolves the issue, consider marking it as “Accept as solution” and giving a “Kudos” to help others in the community.
Thank you.
Hi @v-ssriganesh ,
Thanks for your response, but the suggested code did not fix the problem. I can confirm that the UDF uses only standard Python libraries and does not reference the Spark context.
I implemented the suggestion, and it failed again with the PySparkTypeError: [NOT_ITERABLE] Column is not iterable error.
Hello @tinbaj,
Thank you for the update and detailed feedback.
The PySparkTypeError: [NOT_ITERABLE] Column is not iterable error occurs because Fabric User Data Functions (UDFs) expect scalar inputs (e.g., strings, integers), but df_silver.request_date is a Spark DataFrame column, which isn’t directly compatible. The documentation you referenced correctly notes that UDFs don’t accept column objects as inputs, which explains this error.
To resolve this, the UDF would need to be registered with Spark so that each row’s request_date value is processed individually. Since you’ve confirmed the UDF works for a single input ('123241' returns '2023-08-29'), the issue is specific to the DataFrame application.
Additionally, check that the request_date column is a string type, as your UDF expects strings.
If this helps, please “Accept as solution” and give a “kudos” to assist other community members.
Thank you.
Hi @v-ssriganesh ,
Please see message 5 from this thread. We tried this a couple of days ago, and it doesn't work: when we try to register the User Data Function as a Spark UDF, it raises the SparkContext error.
Does this mean that we cannot use User Data Functions for transformations in DataFrames?
Thanks
Hello @tinbaj,
Thank you for your patience and for providing detailed feedback.
We recommend raising a support ticket with Microsoft Fabric support for deeper investigation, as the issue may be specific to your workspace or the UDF’s interaction with your Spark environment. You can explain all the troubleshooting steps you have taken to help them better understand the issue.
You can create a Microsoft support ticket with the help of the link below:
https://learn.microsoft.com/en-us/power-bi/support/create-support-ticket
If this information is helpful, consider marking it as “Accept as solution” and giving a “Kudos” to help others in the community.
Thank you.
Hello @tinbaj,
Could you please confirm if the issue has been resolved after raising a support case? If a solution has been found, it would be greatly appreciated if you could share your insights with the community. This would be helpful for other members who may encounter similar issues.
Thank you for your understanding and assistance.
Hello @tinbaj,
We are following up once again regarding your query. Could you please confirm if the issue has been resolved through the support ticket with Microsoft?
If the issue has been resolved, we kindly request you to share the resolution or key insights here to help others in the community. If we don’t hear back, we’ll go ahead and close this thread.
Should you need further assistance in the future, we encourage you to reach out via the Microsoft Fabric Community Forum and create a new thread. We’ll be happy to help.
Thank you for your understanding and participation.
Hi @v-ssriganesh ,
The ticket I raised with Microsoft did not provide any resolution to this issue. The associate classified this problem as more of a PySpark problem than a UDF issue. Please see below how the conversation with the Microsoft associate ended for this ticket.
As reported, you had a User Data Function (UDF) defined to convert dates from the Oracle database that are stored in Julian format into a date format.
As discussed, you created the Fabric user data function “convert_julian_to_date”. When you used “notebookutils.udf” to get and invoke the function, it processed successfully. This shows that the Fabric user data function “convert_julian_to_date” itself is working without issues.
Just to clarify, Fabric User data functions use the “fabric.functions” library to provide the functionality. And, as you did, you can retrieve and invoke the function via “notebookutils.udf”. To my knowledge, Fabric User data functions (the “fabric.functions” library) basically enable you to create user data functions in Python, without offering other methods (integration with third parties like PySpark) by default.
Apache Spark DataFrames are third-party and not supported by us, so I couldn’t provide the most accurate information for your other questions. I’d assume that you could call / invoke Fabric User data functions from Apache Spark DataFrames, but you might be using those PySpark APIs improperly.
I did some research on “PySparkTypeError: [NOT_ITERABLE] Column is not iterable”; it’d be more about the DataFrame and/or the withColumn() usage.
As for the “PySparkRuntimeError: [CONTEXT_ONLY_VALID_ON_DRIVER] It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063” raised when using spark.udf.register("convert_julian_to_date", data_functions.convert_julian_to_date): unfortunately, all I could do was check the Fabric user data function and give my assumptions about those PySpark errors. I hope this is helpful. To move forward, I’d suggest handling the transformation with PySpark-native methods or raising the PySpark-specific questions through the appropriate support channels.
Hello @tinbaj,
We appreciate your patience and sharing the update on the issue.
From what you've described, it looks like the Fabric User Data Function (UDF) itself is working as expected when used with notebookutils.udf. The issues seem to come up only when trying to use it inside PySpark operations like withColumn() or when attempting to register it with spark.udf.register.
Since PySpark is a third-party tool and isn't fully integrated with Fabric UDFs, this kind of limitation is expected for now. Currently, calling Fabric UDFs directly inside PySpark transformations or registering them as Spark SQL functions isn't supported.
If you still need to apply similar logic to your DataFrame, you might want to rewrite the function as a regular PySpark UDF (pyspark.sql.functions.udf() or pandas_udf) so it works smoothly within the PySpark context, as sketched below.
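As a concrete illustration, here is a minimal sketch of that approach. The conversion logic is an assumption based on JDE's CYYDDD Julian layout, chosen to be consistent with the sample in this thread ('123241' converts to 2023-08-29):

```python
import datetime

from pyspark.sql import functions as F
from pyspark.sql.types import DateType

def convert_julian_to_date(julian_date):
    # JDE CYYDDD layout: C = centuries past 1900, YY = two-digit year, DDD = day of year.
    if julian_date is None:
        return None
    julian = int(julian_date)
    year = 1900 + (julian // 100000) * 100 + (julian // 1000) % 100
    return datetime.date(year, 1, 1) + datetime.timedelta(days=julian % 1000 - 1)

# Register as a native PySpark UDF; the function uses only the standard
# library, so it serializes cleanly to the workers.
convert_julian_to_date_udf = F.udf(convert_julian_to_date, DateType())

df_silver = df_silver.withColumn('request_date', convert_julian_to_date_udf(F.col('request_date')))
```

To keep the logic reusable across the eight notebooks, the plain Python function can live in one shared notebook that each source notebook pulls in, for example via the %run magic.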
I totally understand this might not be the solution you were hoping for, but given the current capabilities of Fabric, using PySpark-native methods or checking with PySpark support channels would be the best way forward.
Thank you for your understanding.
Hello @tinbaj,
We are following up once again regarding your query. Could you please confirm if the issue has been resolved through the support ticket with Microsoft?
If the issue has been resolved, we kindly request you to share the resolution or key insights here to help others in the community. If we don’t hear back, we’ll go ahead and close this thread.
Should you need further assistance in the future, we encourage you to reach out via the Microsoft Fabric Community Forum and create a new thread. We’ll be happy to help.
Thank you for your understanding.
Hi @v-ssriganesh ,
As I explained in my previous response, and based on Microsoft's reply, User Data Functions cannot be used for transformations on DataFrame columns; PySpark's native mechanisms have to be used to transform the data instead. So I need to tweak my solution a bit and not use User Data Functions for Fabric, but instead use pyspark.udf to do the transformation.
I think I know the way ahead now. Thanks for your support and help. We can close the ticket now.
Hello @tinbaj,
Thank you for the update on the issue. Please continue to utilize the Microsoft Fabric Community Forum for further discussions and support.
Hello @tinbaj,
Thank you for reaching out with your query.
To streamline your Julian-to-standard date conversion across all eight notebooks, I recommend using Fabric User Data Functions (UDFs). You can create a single UDF in your Fabric workspace that defines the conversion logic and call it from all notebooks. This eliminates code duplication, simplifies maintenance, and ensures consistency across your JDE data sources. Simply create a UDF item, define the conversion function, and invoke it in each notebook, as illustrated below. For more details, check the Fabric User Data Functions documentation: Overview - Fabric User data functions (preview) - Microsoft Fabric | Microsoft Learn.
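For illustration, the per-notebook usage then reduces to binding the UDF item once and calling it; the item and function names here follow the examples earlier in this thread:

```python
# Bind the shared User data functions item by its workspace item name,
# then call the conversion function with a scalar value.
data_functions = notebookutils.udf.getFunctions('data_functions')
data_functions.convert_julian_to_date('123241')  # -> '2023-08-29 00:00:00'
```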
If this information is helpful, please “Accept as solution” and give a "kudos" to assist other community members in resolving similar issues more efficiently.
Thank you.