Skoltn
Frequent Visitor

Lakehouse - Delta table columns not updating

Hi all,

 

I am using a Python notebook that is triggered by a Data Factory pipeline to update Delta tables each night. In short, the process is as follows:

1. A Data Factory pipeline orchestrates all processes. It knows which tables to process from an array parameter.

2. For each table, it triggers a notebook that queries the source MySQL database and writes the data to a Parquet file in the Lakehouse.

3. It then triggers another notebook for that file, which transfers the data from the file into a Delta Lake table. This is a MERGE operation, so it does not create a new Delta Lake table; it updates an existing one.

 

We recently noticed that one of the source tables contained two new columns. We do not use explicit column mapping; we simply say that all columns in the source should be transferred to the target. I have updated the PySpark notebook with an ALTER TABLE statement, so that new columns are added to the Delta Lake table whenever they appear in the source. You can find the full Python function I use for this below.

 

However, when I query the table through the Lakehouse SQL endpoint, these two new columns, as well as the new records of data, do not come up in the query. I can only see the records that were already in the table before we added the ALTER TABLE statement to the notebook. The weird thing is: when I query the exact same Delta Lake table in a PySpark notebook, I can see all the new records and the new columns as well.

 

Any idea what is causing this and how to fix it?

Thanks!

Python function:

 

from pyspark.sql import SparkSession
from delta.tables import DeltaTable

def upsert_to_delta(df_source, delta_path, name_of_destination, merge_columns, is_incremental):
    spark = SparkSession.builder.getOrCreate()

    # Load the Delta table. If it's not a delta table yet, create it
    if DeltaTable.isDeltaTable(spark, delta_path):
        delta_table = DeltaTable.forPath(spark, delta_path)
        df_target = delta_table.toDF()

        # Check for new columns and their data type in the source
        new_columns = {
            col: df_source.schema[col].dataType.simpleString()
            for col in df_source.columns
            if col not in df_target.columns
        }

    else:
        df_source.write.format("delta").save(delta_path)
        delta_table = DeltaTable.forPath(spark, delta_path)
        new_columns = {}

    # Add any new columns (with their data type) to the table; existing rows
    # get NULL for the new columns by default, so no separate UPDATE is needed
    for column, dataType in new_columns.items():
        spark.sql(f"ALTER TABLE delta.`{delta_path}` ADD COLUMNS (`{column}` {dataType})")

    # Define the merge operation
    merge_condition = " AND ".join([f"source.{col} = target.{col}" for col in merge_columns])

    # When matched and source is newer, update the target
    update_mapping = {col: f"source.{col}" for col in df_source.columns}

    # When not matched, insert the source row to target
    insert_mapping = {col: f"source.{col}" for col in df_source.columns}

    # Execute the merge operation
    delta_table.alias('target') \
        .merge(df_source.alias('source'), merge_condition) \
        .whenMatchedUpdate(set = update_mapping) \
        .whenNotMatchedInsert(values = insert_mapping) \
        .execute()

    # If not incremental, delete rows in target that are not present in source
    if not is_incremental:
        # Find the rows to delete
        df_to_delete = delta_table.toDF().join(df_source, merge_columns, "left_anti")

        # Delete matching rows one at a time; building the condition with repr()
        # assumes the merge keys are simple string or numeric values
        for row in df_to_delete.collect():
            delete_condition = " AND ".join([f"{col} = {repr(row[col])}" for col in merge_columns])
            delta_table.delete(delete_condition)
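
For context, this is roughly how the function is called from the ingestion notebook; the path, table name, and key column below are placeholders, not our real ones:

spark = SparkSession.builder.getOrCreate()

# Read the Parquet file written by the extraction notebook (placeholder path)
df_source = spark.read.parquet("Files/staging/customers.parquet")

upsert_to_delta(
    df_source=df_source,
    delta_path="Tables/customers",      # placeholder Delta table location in the Lakehouse
    name_of_destination="customers",
    merge_columns=["customer_id"],      # placeholder business key
    is_incremental=True
)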

 

Joosthvr
New Member

Access to Delta tables through the serverless SQL endpoint is still in public preview. Currently, serverless synchronizes a Delta table with Spark at the time of creation, but it will not update the schema if columns are added later using the ALTER TABLE statement in Spark.

This is a public preview limitation. To resolve the issue, drop and re-create the Delta table in Spark (if that is possible) instead of altering it.
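
A rough sketch of that workaround from a notebook cell, where spark is the session; the table name and backup path are placeholders, and the rebuild rewrites the whole table, so test it on a copy first:

# Materialize a backup copy first: dropping a managed table also deletes its data files
spark.read.table("my_table").write.format("delta").mode("overwrite").save("Files/backup/my_table")

# Drop the table, then recreate it from the backup so the endpoint picks up the full schema
spark.sql("DROP TABLE IF EXISTS my_table")
spark.read.format("delta").load("Files/backup/my_table").write.format("delta").saveAsTable("my_table")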

Serverless SQL pool self-help - Azure Synapse Analytics | Microsoft Learn

AndyDDC
Most Valuable Professional

Hi @Skoltn, unfortunately using ALTER statements with Lakehouse tables causes metadata sync issues with the Lakehouse SQL Endpoint (and the semantic models), which means any new columns will not be visible or queryable.

 

In your scenario, are you able to use mergeSchema to evolve the schema? Delta Lake Schema Evolution | Delta Lake
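
Something along these lines; a rough sketch using your variable names, not tested against your tables:

# For MERGE, schema evolution is controlled by a session config: set it before the merge
# so that new columns in the source are automatically added to the target table
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")

# For plain append/overwrite writes, the equivalent is the mergeSchema write option
df_source.write.format("delta").mode("append").option("mergeSchema", "true").save(delta_path)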

That's a pity. Is this a known problem that will be fixed in the future, or is it how the Lakehouse is supposed to work?

I switched to the ALTER statements in my notebook because using mergeSchema did not work at all. That is actually what I tried first: the source DataFrame contained two new columns, I wrote it into the target Delta table, and when I queried the target again, the columns were not there at all.

You can vote for my idea here: Add or remove column from Lakehouse table

Anonymous
Not applicable

Hi @Skoltn ,

Thanks for using Fabric Community.

As I understand the issue you're facing is that new columns and records added to a Delta Lake table through notebooks triggered by a Data Factory pipeline are not appearing when queried via the Lakehouse SQL endpoint, but they are visible in PySpark notebooks.

The Delta Lake table's internal metadata may not be fully updated. Can you please execute maintenance commands like OPTIMIZE or VACUUM?

Please try this and let me know.
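
For example, from a notebook cell (the table name is a placeholder):

# Compact small files and clean up old snapshots; neither command changes the table schema
spark.sql("OPTIMIZE my_lakehouse_table")
spark.sql("VACUUM my_lakehouse_table")  # default retention period is 7 days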

I am encountering the same issue: renaming an existing column in a Delta table is reflected in the PySpark notebook but does not update in the SQL Endpoint. I have tried using the VACUUM and OPTIMIZE commands, but they did not resolve the problem.

 

I have attempted to resolve this issue by dropping the table, deleting the associated folders, and recreating the Delta table, but the problem persists: column renames and new columns still do not show up in the SQL Endpoint.

 

This seems to be a critical limitation, as basic operations like adding, renaming, or dropping columns in Delta tables are not syncing correctly with the SQL endpoint. This issue is severely impacting our project timeline, and we urgently need a workaround or a permanent solution.

 

@fabricator1 @Skoltn @Joosthvr @AndyDDC @Anonymous

Are you using the column name mapping feature? That may be a reason why the sync to SQL Analytics Endpoint fails.

 

Something like 

 

ALTER TABLE table_name SET TBLPROPERTIES (
    'delta.columnMapping.mode' = 'name',
    'delta.minReaderVersion' = '2',
    'delta.minWriterVersion' = '5')

 

In that case, I think the sync to the SQL Analytics Endpoint breaks somehow if you enable the column name mapping.
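
You can check whether it is enabled on the table with something like this (the table name is a placeholder):

# Look for 'delta.columnMapping.mode' = 'name' in the output
spark.sql("SHOW TBLPROPERTIES my_table").show(truncate=False)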

 

Or are you experiencing really slow updates in the SQL Analytics Endpoint?

You can try to create another Lakehouse in another Workspace, and use a shortcut to bring your original table into the new lakehouse. For some users, that has made the SQL Analytics Endpoint perform better. 

One reason for a slow SQL Analytics Endpoint can be having many lakehouses in the same workspace.

 

Perhaps there is something to be gained from these docs:

https://learn.microsoft.com/en-us/fabric/data-warehouse/sql-analytics-endpoint-performance

Hi @Anonymous 

Using the OPTIMIZE command helped with one of the two tables in which I'm experiencing this problem. The other table shows the columns in the column list in the Lakehouse explorer, but as soon as I open the preview of the Delta Table, it does not show the new columns or new records of data.

 

Maybe this has to do with the table size. The table for which OPTIMIZE did not help is quite a bit larger than the other one.

Anonymous
Not applicable

Hello @Skoltn ,
We haven't heard from you since the last response and were just checking back to see whether you have a resolution yet.
If you have a resolution, please share it with the community, as it can be helpful to others.
Otherwise, we will respond with more details and try to help.

