How can I use Runtime 1.3 (Spark 3.5, Delta 3.2) in a Notebook to manipulate Lakehouse tables containing datetime columns?
I am developing a conversion process for CSV files.
I stored data from a CSV file in the Lakehouse as a table without any type conversion. The data was then checked for types and required fields using a dataflow, and only the rows with no errors were stored in a new table. The data can be viewed without issues from the Lakehouse table browsing screen.
The problem occurs when I develop a process in a Notebook that retrieves this converted data and merges it with other data. When I try to retrieve and reference the data with the following code, an error occurs. In the example the error occurs when displaying the data, but the same error also occurs when executing an UPDATE statement.
Code
spark.conf.set("spark.sql.session.timeZone", "UTC")
spark.conf.set("spark.sql.parquet.datetimeRebaseModeInRead", "CORRECTED")
df = spark.sql("SELECT * FROM LakehouseName.sampletable LIMIT 1000")
display(df)
Error Message
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
Cell In[11], line 5
2 spark.conf.set("spark.sql.parquet.datetimeRebaseModeInRead", "CORRECTED")
4 df = spark.sql("SELECT * FROM LakehouseName.sampletable LIMIT 1000")
----> 5 display(df)
File ~/cluster-env/trident_env/lib/python3.11/site-packages/notebookutils/visualization/display.py:254, in display(data, summary)
251 success = False
252 log4jLogger \
253 .error(f"display failed with error, language: python, error: {err}, correlationId={correlation_id}")
--> 254 raise err
255 finally:
256 duration_ms = ceil((time.time() - start_time) * 1000)
File ~/cluster-env/trident_env/lib/python3.11/site-packages/notebookutils/visualization/display.py:246, in display(data, summary)
243 max_row_count = custom_options['maxRowCount']
244 return display_jupyter.display_without_spark(data, summary, max_row_count)
--> 246 return _display_with_spark(data, correlation_id, summary)
248 except ImportError:
249 success = False
File ~/cluster-env/trident_env/lib/python3.11/site-packages/notebookutils/visualization/display.py:273, in _display_with_spark(data, correlation_id, summary)
269 if is_ipython_enabled(runtime.host_nbutils_version):
270 # pylint: disable=C0415
271 from IPython.display import publish_display_data
272 display_result = json.loads(
--> 273 sc._jvm.display.getDisplayResultForIPython(
274 df._jdf, summary, correlation_id)
275 )
276 if configs.enable_wrangler_entry:
277 display_result[WRANGLER_ENTRY_CONTEXT_KEY] = get_wrangler_display_entry_context(data)
File ~/cluster-env/trident_env/lib/python3.11/site-packages/py4j/java_gateway.py:1322, in JavaMember.__call__(self, *args)
1316 command = proto.CALL_COMMAND_NAME +\
1317 self.command_header +\
1318 args_command +\
1319 proto.END_COMMAND_PART
1321 answer = self.gateway_client.send_command(command)
-> 1322 return_value = get_return_value(
1323 answer, self.gateway_client, self.target_id, self.name)
1325 for temp_arg in temp_args:
1326 if hasattr(temp_arg, "_detach"):
File /opt/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py:179, in capture_sql_exception.<locals>.deco(*a, **kw)
177 def deco(*a: Any, **kw: Any) -> Any:
178 try:
--> 179 return f(*a, **kw)
180 except Py4JJavaError as e:
181 converted = convert_exception(e.java_exception)
File ~/cluster-env/trident_env/lib/python3.11/site-packages/py4j/protocol.py:326, in get_return_value(answer, gateway_client, target_id, name)
324 value = OUTPUT_CONVERTER[type](answer[2:], gateway_client)
325 if answer[1] == REFERENCE_TYPE:
--> 326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
331 "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n".
332 format(target_id, ".", name, value))
Py4JJavaError: An error occurred while calling z:com.microsoft.spark.notebook.visualization.display.getDisplayResultForIPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 32.0 failed 4 times, most recent failure: Lost task 0.3 in stage 32.0 (TID 235) (vm-44704417 executor 1): java.lang.RuntimeException: Exception: VeloxUserError
Error Source: USER
Error Code: INVALID_ARGUMENT
Reason: Could not convert Timestamp(1732266818935202, 1732266818935202) to microseconds, Exception: VeloxUserError
Error Source: USER
Error Code: ARITHMETIC_ERROR
Reason: integer overflow: 1732266818935202 * 1000000
Retriable: False
Function: checkedMultiply
File: /__w/1/s/Velox/velox/common/base/CheckedArithmetic.h
Line: 51
I confirmed that the same error occurs when reading the table directly by specifying its ABFS path, as follows:
df = spark.read.format("delta").load("abfss://xxxxxxxxxx@onelake.dfs.fabric.microsoft.com/xxxxxxx/Tables/sampletable/")
display(df)
This error can be avoided by downgrading the runtime to 1.2 (Spark 3.4, Delta 2.4), but with future upgrades in mind I would like to get this working on the current runtime.
I appreciate your advice. I tried the PySpark query you provided, but I got an error message saying "Only UTC adjusted Timestamp is supported."
Py4JJavaError: An error occurred while calling z:com.microsoft.spark.notebook.visualization.display.getDisplayResultForIPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 23.0 failed 4 times, most recent failure: Lost task 0.3 in stage 23.0 (TID 127) (vm-cff19570 executor 1): org.apache.gluten.exception.GlutenException: java.lang.RuntimeException: Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: Only UTC adjusted Timestamp is supported.
Retriable: False
Expression: logicalType.TIMESTAMP.isAdjustedToUTC
Context: Split [Hive: abfss://xxxxxx@onelake.dfs.fabric.microsoft.com/xxxxxxx/Tables/table_name/xxxxxxxx.parquet 0 - 416531] Task Gluten_Stage_23_TID_127
Additional Context: Operator: TableScan[0] 0
Function: makeDecoder
File: /__w/1/s/Velox/velox/dwio/parquet/reader/PageReader.cpp
Line: 711
When I first encountered this error, I thought about changing Spark's timezone to UTC and added the following code:
spark.conf.set("spark.sql.session.timeZone", "UTC")
However, the same error occurs: "Only UTC adjusted Timestamp is supported."
In Runtime 1.2, querying Delta tables with PySpark also produced an error when running the query directly, but after setting the following before execution, the query worked perfectly.
spark.conf.set("spark.sql.session.timeZone", "UTC")
spark.conf.set("spark.sql.parquet.datetimeRebaseModeInRead", "CORRECTED")
If there are any parts of my explanation that are difficult to understand, I would appreciate it if you could point them out.
Best Regards,
Hi @n5s5n
Is it possible to identify which column in the Delta table might be causing this problem? What is the data type of that column?
Or could you provide some steps to reproduce this issue? I'd like to try to reproduce it.
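For example, something like the following could show the column data types (I'm reusing the table name from your code sample; please adjust it as needed):

# Print the schema to see which columns are timestamp typed
df = spark.sql("SELECT * FROM LakehouseName.sampletable LIMIT 0")
df.printSchema()

Selecting the columns one at a time, leaving out the timestamp column first, might also help narrow down which column triggers the error.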
Best Regards,
Jing
Hi @v-jingzhan-msft
Thank you for reaching out to me, and I apologize for replying in the wrong way earlier.
I'm using Dataflow Gen2 to insert data into a table. When setting up the destination, I configured the dataflow to create a new table and set the destination type to datetime.
This error occurs on a table that contains a column of type "timestamp (nullable = true)" in the Lakehouse.
Even when I run a PySpark query to create a table with a datetime-type column and then insert data into it from a dataflow, the result is the same (the query was executed on Runtime 1.3).
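A rough sketch of that table-creation step, with placeholder table and column names (the actual rows were then loaded through Dataflow Gen2, which I can't show as code):

# Create a Delta table with a datetime (timestamp) column; the data itself was
# inserted afterwards via Dataflow Gen2 with the destination type set to datetime.
spark.sql("""
    CREATE TABLE IF NOT EXISTS LakehouseName.sampletable (
        id INT,
        event_time TIMESTAMP
    )
    USING DELTA
""")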
Regarding the error with the datetime column, when the column contains null values, PySpark queries such as SELECT and UPDATE statements can be executed without any issues.
Best Regards,
n5s5n
Hi @n5s5n
I looked at several posts describing similar errors on other forums. Based on the error messages, my guess is that Spark 3.5 and Spark 3.4 have different precision or range limits when handling Timestamp data, which causes an overflow after the conversion (a quick sanity check of the numbers follows the quoted error below).
Error Source: USER
Error Code: INVALID_ARGUMENT
Reason: Could not convert Timestamp(1732266818935202, 1732266818935202) to microseconds, Exception: VeloxUserError
Error Source: USER
Error Code: ARITHMETIC_ERROR
Reason: integer overflow: 1732266818935202 * 1000000
Retriable: False
Function: checkedMultiply
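For reference, here is a quick back-of-the-envelope check outside of Spark, assuming the value in the error message is microseconds since the Unix epoch:

from datetime import datetime, timedelta, timezone

ts = 1732266818935202  # value taken from the error message

# Interpreted as microseconds since the Unix epoch this is a plausible recent date,
# which suggests the stored value is already in microseconds:
print(datetime(1970, 1, 1, tzinfo=timezone.utc) + timedelta(microseconds=ts))
# -> 2024-11-22 09:13:38.935202+00:00

# Multiplying it by 1,000,000 again (the step shown in the error) exceeds the
# signed 64-bit integer range, which matches the checkedMultiply overflow:
print(ts * 1_000_000 > 2**63 - 1)  # True

If that is the case, the native reader may be treating a microsecond value as if it were seconds, which would fit the version-difference hypothesis above.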
Since the data displays correctly in the Lakehouse preview window, it appears to be stored without error. It seems the issue occurs during type conversion when the data is read.
You can try removing the following two configuration lines:
spark.conf.set("spark.sql.session.timeZone", "UTC")
spark.conf.set("spark.sql.parquet.datetimeRebaseModeInRead", "CORRECTED")
and querying the Delta table directly with PySpark:
df = spark.sql("SELECT * FROM LakehouseName.sampletable LIMIT 1000")
display(df)
Test whether this works; it might help identify whether these settings are causing the problem. If the issue persists, we may need to look into other aspects.
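If the plain query still fails, another thing worth trying as a diagnostic (the column name below is a placeholder; replace it with your actual timestamp column) is casting the timestamp column to a string in the query:

# Cast the suspect timestamp column to a string; if this succeeds while the
# plain SELECT fails, the failure is likely in the timestamp conversion on read.
df = spark.sql(
    "SELECT CAST(event_time AS STRING) AS event_time_str "
    "FROM LakehouseName.sampletable LIMIT 1000"
)
display(df)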
Best Regards,
Jing
Community Support Team