How can I use Runtime 1.3 (Spark 3.5, Delta 3.2) in a Notebook to manipulate Lakehouse tables containing datetime columns?
I am developing a conversion process for CSV files.
I stored data from a CSV file in the Lakehouse as a table without any type conversion. The data was then checked for types and required fields using a dataflow, and only the rows with no errors were stored in a new table. The data can be viewed without issues from the Lakehouse table browsing screen.
The problem occurs when I develop a process in a Notebook that retrieves this converted data and merges it with other data. When I try to retrieve and reference the data with the following code, an error occurs. In the example the error occurs when displaying the data, but the same error also occurs when executing an UPDATE statement.
Code
spark.conf.set("spark.sql.session.timeZone", "UTC")
spark.conf.set("spark.sql.parquet.datetimeRebaseModeInRead", "CORRECTED")
df = spark.sql("SELECT * FROM LakehouseName.sampletable LIMIT 1000")
display(df)
Error Message
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
Cell In[11], line 5
2 spark.conf.set("spark.sql.parquet.datetimeRebaseModeInRead", "CORRECTED")
4 df = spark.sql("SELECT * FROM LakehouseName.sampletable LIMIT 1000")
----> 5 display(df)
File ~/cluster-env/trident_env/lib/python3.11/site-packages/notebookutils/visualization/display.py:254, in display(data, summary)
251 success = False
252 log4jLogger \
253 .error(f"display failed with error, language: python, error: {err}, correlationId={correlation_id}")
--> 254 raise err
255 finally:
256 duration_ms = ceil((time.time() - start_time) * 1000)
File ~/cluster-env/trident_env/lib/python3.11/site-packages/notebookutils/visualization/display.py:246, in display(data, summary)
243 max_row_count = custom_options['maxRowCount']
244 return display_jupyter.display_without_spark(data, summary, max_row_count)
--> 246 return _display_with_spark(data, correlation_id, summary)
248 except ImportError:
249 success = False
File ~/cluster-env/trident_env/lib/python3.11/site-packages/notebookutils/visualization/display.py:273, in _display_with_spark(data, correlation_id, summary)
269 if is_ipython_enabled(runtime.host_nbutils_version):
270 # pylint: disable=C0415
271 from IPython.display import publish_display_data
272 display_result = json.loads(
--> 273 sc._jvm.display.getDisplayResultForIPython(
274 df._jdf, summary, correlation_id)
275 )
276 if configs.enable_wrangler_entry:
277 display_result[WRANGLER_ENTRY_CONTEXT_KEY] = get_wrangler_display_entry_context(data)
File ~/cluster-env/trident_env/lib/python3.11/site-packages/py4j/java_gateway.py:1322, in JavaMember.__call__(self, *args)
1316 command = proto.CALL_COMMAND_NAME +\
1317 self.command_header +\
1318 args_command +\
1319 proto.END_COMMAND_PART
1321 answer = self.gateway_client.send_command(command)
-> 1322 return_value = get_return_value(
1323 answer, self.gateway_client, self.target_id, self.name)
1325 for temp_arg in temp_args:
1326 if hasattr(temp_arg, "_detach"):
File /opt/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py:179, in capture_sql_exception.<locals>.deco(*a, **kw)
177 def deco(*a: Any, **kw: Any) -> Any:
178 try:
--> 179 return f(*a, **kw)
180 except Py4JJavaError as e:
181 converted = convert_exception(e.java_exception)
File ~/cluster-env/trident_env/lib/python3.11/site-packages/py4j/protocol.py:326, in get_return_value(answer, gateway_client, target_id, name)
324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
325 if answer[1] == REFERENCE_TYPE:
--> 326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
331 "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n".
332 format(target_id, ".", name, value))
Py4JJavaError: An error occurred while calling z:com.microsoft.spark.notebook.visualization.display.getDisplayResultForIPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 32.0 failed 4 times, most recent failure: Lost task 0.3 in stage 32.0 (TID 235) (vm-44704417 executor 1): java.lang.RuntimeException: Exception: VeloxUserError
Error Source: USER
Error Code: INVALID_ARGUMENT
Reason: Could not convert Timestamp(1732266818935202, 1732266818935202) to microseconds, Exception: VeloxUserError
Error Source: USER
Error Code: ARITHMETIC_ERROR
Reason: integer overflow: 1732266818935202 * 1000000
Retriable: False
Function: checkedMultiply
File: /__w/1/s/Velox/velox/common/base/CheckedArithmetic.h
Line: 51
I confirmed that the same error occurs when reading the table directly by specifying its ABFS path, as follows:
df = spark.read.format("delta").load("abfss://xxxxxxxxxx@onelake.dfs.fabric.microsoft.com/xxxxxxx/Tables/sampletable/")
display(df)
This error can be avoided by downgrading the runtime to 1.2 (Spark 3.4, Delta 2.4), but with future upgrades in mind I would like to get this working on the current runtime.
I appreciate your advice. I tried the PySpark query you provided, but I got an error message saying "Only UTC adjusted Timestamp is supported."
Py4JJavaError: An error occurred while calling z:com.microsoft.spark.notebook.visualization.display.getDisplayResultForIPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 23.0 failed 4 times, most recent failure: Lost task 0.3 in stage 23.0 (TID 127) (vm-cff19570 executor 1): org.apache.gluten.exception.GlutenException: java.lang.RuntimeException: Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: Only UTC adjusted Timestamp is supported.
Retriable: False
Expression: logicalType.TIMESTAMP.isAdjustedToUTC
Context: Split [Hive: abfss://xxxxxx@onelake.dfs.fabric.microsoft.com/xxxxxxx/Tables/table_name/xxxxxxxx.parquet 0 - 416531] Task Gluten_Stage_23_TID_127
Additional Context: Operator: TableScan[0] 0
Function: makeDecoder
File: /__w/1/s/Velox/velox/dwio/parquet/reader/PageReader.cpp
Line: 711
When I first encountered this error, I thought about changing Spark's timezone to UTC and added the following code:
spark.conf.set("spark.sql.session.timeZone", "UTC")
However, the same error occurs: "Only UTC adjusted Timestamp is supported."
In Runtime 1.2, querying Delta tables with PySpark also produced an error when running the query directly, but after setting the following before execution, the query worked perfectly.
spark.conf.set("spark.sql.session.timeZone", "UTC")
spark.conf.set("spark.sql.parquet.datetimeRebaseModeInRead", "CORRECTED")
If there are any parts of my explanation that are difficult to understand, I would appreciate it if you could point them out.
Best Regards,
Hi @n5s5n
Is it possible to identify which column in the Delta table might be causing this problem? What is the data type of that column?
Or could you provide some steps to reproduce this issue? I'd like to try to reproduce it.
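For example, something like the following could show the column data types (I'm reusing the table name from your code sample; please adjust it as needed):

# Print the schema to see which columns are timestamp typed
df = spark.sql("SELECT * FROM LakehouseName.sampletable LIMIT 0")
df.printSchema()

Selecting the columns one at a time, leaving out the timestamp column first, might also help narrow down which column triggers the error.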
Best Regards,
Jing
Hi @v-jingzhan-msft
Thank you for reaching out to me, and I apologize for replying in the wrong way earlier.
I'm using Dataflow Gen2 to insert data into a table. When setting up the destination, I configured the dataflow to create a new table and set the destination type to datetime.
This error occurs on a table that contains a column of type "timestamp (nullable = true)" in the Lakehouse.
Even when I run a PySpark query to create a table with a datetime-type column and then insert data into it from a dataflow, the result is the same (the query was executed on Runtime 1.3).
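A rough sketch of that table-creation step, with placeholder table and column names (the actual rows were then loaded through Dataflow Gen2, which I can't show as code):

# Create a Delta table with a datetime (timestamp) column; the data itself was
# inserted afterwards via Dataflow Gen2 with the destination type set to datetime.
spark.sql("""
    CREATE TABLE IF NOT EXISTS LakehouseName.sampletable (
        id INT,
        event_time TIMESTAMP
    )
    USING DELTA
""")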
Regarding the error with the datetime column, when the column contains null values, PySpark queries such as SELECT and UPDATE statements can be executed without any issues.
Best Regards,
n5s5n
Hi @n5s5n
I looked at several posts describing similar errors on other forums. Based on the error messages, my guess is that Spark 3.5 and Spark 3.4 have different precision or range limits when handling Timestamp data, which causes an overflow after the conversion (a quick sanity check of the numbers follows the quoted error below).
Error Source: USER
Error Code: INVALID_ARGUMENT
Reason: Could not convert Timestamp(1732266818935202, 1732266818935202) to microseconds, Exception: VeloxUserError
Error Source: USER
Error Code: ARITHMETIC_ERROR
Reason: integer overflow: 1732266818935202 * 1000000
Retriable: False
Function: checkedMultiply
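For reference, here is a quick back-of-the-envelope check outside of Spark, assuming the value in the error message is microseconds since the Unix epoch:

from datetime import datetime, timedelta, timezone

ts = 1732266818935202  # value taken from the error message

# Interpreted as microseconds since the Unix epoch this is a plausible recent date,
# which suggests the stored value is already in microseconds:
print(datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(microseconds=ts))
# -> 2024-11-22 09:13:38.935202+00:00

# Multiplying it by 1,000,000 again (the step shown in the error) exceeds the
# signed 64-bit integer range, which matches the checkedMultiply overflow:
print(ts * 1_000_000 > 2**63 - 1)  # True

If that is the case, the native reader may be treating a microsecond value as if it were seconds, which would fit the version-difference hypothesis above.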
Since the data displays correctly in the Lakehouse preview window, it appears to be stored without error. It seems the issue occurs during type conversion when the data is read.
You can try removing the following two configuration lines:
spark.conf.set("spark.sql.session.timeZone", "UTC")
spark.conf.set("spark.sql.parquet.datetimeRebaseModeInRead", "CORRECTED")
and querying the Delta table directly with PySpark:
df = spark.sql("SELECT * FROM LakehouseName.sampletable LIMIT 1000")
display(df)
Test whether this works; it might help identify whether these settings are causing the problem. If the issue persists, we may need to look into other aspects.
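If the plain query still fails, another thing worth trying as a diagnostic (the column name below is a placeholder; replace it with your actual timestamp column) is casting the timestamp column to a string in the query:

# Cast the suspect timestamp column to a string; if this succeeds while the
# plain SELECT fails, the failure is likely in the timestamp conversion on read.
df = spark.sql(
    "SELECT CAST(event_time AS STRING) AS event_time_str "
    "FROM LakehouseName.sampletable LIMIT 1000"
)
display(df)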
Best Regards,
Jing
Community Support Team