Hi all,
I am using a Python Notebook that's being triggered by a data factory to update Delta tables each night. In short, the process is as follows:
1. A data factory orchestrates all processes. It knows which tables to process from an array parameter.
2. For each table, it triggers a notebook that queries the source MySQL database and writes the data to a parquet file in the Lakehouse (a rough sketch of this step follows below the list).
3. It then triggers another notebook for that specific file, which merges the data from the file into a Delta Lake table. This is a MERGE action, so it does not create a new Delta Lake table; it updates an existing one.
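For illustration, a minimal sketch of what the extraction notebook in step 2 might look like; the JDBC connection details, table name, and target path below are placeholders, not our actual values:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder connection details -- the real pipeline passes these in as parameters
jdbc_url = "jdbc:mysql://<host>:3306/<database>"
table_name = "my_source_table"

# Read the source table over JDBC and land it as a parquet file in the Lakehouse
df = (spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", table_name)
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .option("user", "<user>")
    .option("password", "<password>")
    .load())

df.write.mode("overwrite").parquet(f"Files/staging/{table_name}.parquet")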
We recently noticed that one of the source tables contained two new columns. We do not use explicit column mapping; we simply say that all columns in the source should be transferred to the target. I have updated the PySpark notebook with an ALTER TABLE statement so that new columns are added to the Delta Lake table whenever they appear in the source. You can find the full Python function that I use for this below.
However, when I query the table through the Lakehouse SQL endpoint, these two new columns, as well as the new records of data, do not show up in the query results. I can only see the records that were already in the table before we added the ALTER TABLE statement to the notebook. The weird thing is: when I query the exact same Delta Lake table in a PySpark notebook, I can see all the new records and the new columns as well.
Any idea what is causing this and how to fix it?
Thanks!
Python function:
from pyspark.sql import SparkSession
from delta.tables import DeltaTable

def upsert_to_delta(df_source, delta_path, name_of_destination, merge_columns, is_incremental):
    spark = SparkSession.builder.getOrCreate()

    # Load the Delta table. If it's not a Delta table yet, create it
    if DeltaTable.isDeltaTable(spark, delta_path):
        delta_table = DeltaTable.forPath(spark, delta_path)
        df_target = delta_table.toDF()
        # Check for new columns and their data types in the source
        new_columns = {col: df_source.schema[col].dataType.simpleString() for col in df_source.columns if col not in df_target.columns}
    else:
        df_source.write.format("delta").save(delta_path)
        delta_table = DeltaTable.forPath(spark, delta_path)
        new_columns = {}

    # Add new columns and their data types to the table, and insert NULL values
    for column, dataType in new_columns.items():
        spark.sql(f"ALTER TABLE delta.`{delta_path}` ADD COLUMNS ({column} {dataType})")
        spark.sql(f"UPDATE delta.`{delta_path}` SET {column} = NULL")

    # Define the merge operation
    merge_condition = " AND ".join([f"source.{col} = target.{col}" for col in merge_columns])
    # When matched and source is newer, update the target
    update_mapping = {col: f"source.{col}" for col in df_source.columns}
    # When not matched, insert the source row into the target
    insert_mapping = {col: f"source.{col}" for col in df_source.columns}

    # Execute the merge operation
    delta_table.alias('target') \
        .merge(df_source.alias('source'), merge_condition) \
        .whenMatchedUpdate(set = update_mapping) \
        .whenNotMatchedInsert(values = insert_mapping) \
        .execute()

    # If not incremental, delete rows in target that are not present in source
    if not is_incremental:
        # Find the rows to delete
        df_to_delete = delta_table.toDF().join(df_source, merge_columns, "left_anti")
        # Delete rows
        for row in df_to_delete.collect():
            delete_condition = " AND ".join([f"{col} = {repr(row[col])}" for col in merge_columns])
            delta_table.delete(delete_condition)
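For context, a call to this function looks roughly like the following; the parquet path, table path, and key column are illustrative placeholders:

# Illustrative call -- paths, destination name, and key columns are placeholders
df_source = spark.read.parquet("Files/staging/my_source_table.parquet")
upsert_to_delta(
    df_source=df_source,
    delta_path="Tables/my_source_table",
    name_of_destination="my_source_table",
    merge_columns=["id"],
    is_incremental=True,
)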
Accessing Delta tables is still in public preview. Currently, serverless synchronizes a Delta table with Spark at the time of creation, but it will not update the schema if columns are added later using the ALTER TABLE statement in Spark.
This is a public preview limitation. To resolve this issue, drop and re-create the Delta table in Spark (if possible) instead of altering it.
Serverless SQL pool self-help - Azure Synapse Analytics | Microsoft Learn
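If re-creating the table is feasible, a minimal sketch of that approach, reusing the df_source and delta_path names from the function above and using Delta's overwriteSchema option rather than an explicit DROP:

# Rewrite the table from the source DataFrame, replacing both data and schema
(df_source.write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")
    .save(delta_path))

Note that this rewrites the full table, so it only fits tables small enough to reload completely each night.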
Hi @Skoltn, unfortunately using ALTER statements with Lakehouse tables causes metadata sync issues with the Lakehouse SQL Endpoint (and the semantic models), which will cause any new columns to not be visible/queryable.
In your scenario, are you able to use mergeSchema to evolve the schema? Delta Lake Schema Evolution | Delta Lake
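For reference, a minimal sketch of what that could look like, assuming the upsert stays path-based as in the function above (df_source and delta_path refer to the same names used there):

# Schema evolution on append/overwrite writes uses the mergeSchema option
(df_source.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save(delta_path))

# For MERGE operations, the equivalent is the autoMerge session setting,
# which must be enabled before the merge runs
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")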
That's a pity. Is this a known problem that will be fixed in the future, or is it how the Lakehouse is supposed to work?
I switched to the ALTER statements in my notebook because using mergeSchema did not work at all; it was actually what I tried first. The source DataFrame contained two new columns, I wrote it into the target Delta table, and after querying the target again, the columns were not there at all.
Hi @Skoltn ,
Thanks for using Fabric Community.
As I understand the issue you're facing is that new columns and records added to a Delta Lake table through notebooks triggered by a Data Factory pipeline are not appearing when queried via the Lakehouse SQL endpoint, but they are visible in PySpark notebooks.
The Delta Lake table's internal metadata may not be fully updated. Can you please execute maintenance commands like OPTIMIZE or VACUUM and let me know if that helps?
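For reference, a minimal sketch of those commands against a path-based table (the path is a placeholder; VACUUM keeps the default 7-day retention):

# Compact small files in the table
spark.sql("OPTIMIZE delta.`Tables/my_source_table`")
# Remove data files no longer referenced by the table (default 7-day retention)
spark.sql("VACUUM delta.`Tables/my_source_table`")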
I am encountering the same issue: renaming an existing column in a Delta table is reflected in the PySpark notebook but does not update in the SQL endpoint. I have tried the VACUUM and OPTIMIZE commands, but they did not resolve the problem.
I have attempted to resolve this by dropping the table, deleting the associated folders, and recreating the Delta table, but the problem persists: column renames and new columns still do not show up in the SQL endpoint.
This seems to be a critical limitation, as basic operations like adding, renaming, or dropping columns in Delta tables are not syncing correctly with the SQL endpoint. This issue is severely impacting our project timeline, and we urgently need a workaround or a permanent solution.
@fabricator1 @Skoltn @Joosthvr @AndyDDC @Anonymous
Are you using the column name mapping feature? That may be a reason why the sync to SQL Analytics Endpoint fails.
Something like
ALTER TABLE table_name SET TBLPROPERTIES (
'delta.columnMapping.mode' = 'name',
'delta.minReaderVersion' = '2',
'delta.minWriterVersion' = '5')
In that case, I think the sync to the SQL Analytics Endpoint breaks somehow if you enable the column name mapping.
Or are you experiencing really slow updates in the SQL Analytics Endpoint?
You can try to create another Lakehouse in another Workspace, and use a shortcut to bring your original table into the new lakehouse. For some users, that has made the SQL Analytics Endpoint perform better.
One reason for a slow SQL Analytics Endpoint can be having many lakehouses in the same workspace.
Perhaps there is something to be gained from these docs:
https://learn.microsoft.com/en-us/fabric/data-warehouse/sql-analytics-endpoint-performance
Hi @Anonymous
Using the OPTIMIZE command helped with one of the two tables in which I'm experiencing this problem. The other table shows the columns in the column list in the Lakehouse explorer, but as soon as I open the preview of the Delta Table, it does not show the new columns or new records of data.
Maybe this has to do with the table size. The table for which OPTIMIZE did not help is quite a bit larger than the other one.
Hello @Skoltn,
We haven't heard from you since the last response and were just checking back to see if you have a resolution yet.
If you do have a resolution, please share it with the community, as it can be helpful to others.
Otherwise, respond back with more details and we will try to help.