Hi All,
Pulling my teeth out a bit trying to get my head around data ingestion.
I am new to Fabric and data engineering in general. I am trying to set up a metadata-driven pipeline that implements a 'watermark strategy'.
I have many questions on this, but my first one is this: in many YouTube videos and documentation articles I have read, people suggest using a date/time column as your 'watermark'.
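As I understand it, the pattern they describe is roughly the following (a simplified sketch; the table, column, and watermark values below are just placeholders, not my actual schema):

-- 1) Look up the last stored watermark for this table in the control table:
SELECT LastWatermark FROM dbo.WatermarkControl WHERE TableName = 'SalesOrder';

-- 2) Use that value in the source query so only rows changed since the last run are copied:
SELECT OrderID, CustomerID, Amount, ModifiedDate
FROM dbo.SalesOrder
WHERE ModifiedDate > '2024-01-01T00:00:00';  -- replaced each run with the stored watermark

-- 3) After a successful copy, update the control table with MAX(ModifiedDate) from this run.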
I have tried ingesting a table with a nominated column, but it never works. When I try to use the 'Upsert' feature it always results in this error:
"ErrorCode=FailedToUpsertDataIntoDeltaTable,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Hit an error when upsert data to table in Lakehouse. Error message: Could not load file or assembly 'System.Linq.Async, Version=6.0.0.0, Culture=neutral, PublicKeyToken=94bc3704cddfc263' or one of its dependencies. The system cannot find the file specified.,Source=Microsoft.DataTransfer.Connectors.LakehouseTableConnector,''Type=System.IO.FileNotFoundException,Message=Could not load file or assembly 'System.Linq.Async, Version=6.0.0.0, Culture=neutral, PublicKeyToken=94bc3704cddfc263' or one of its dependencies. The system cannot find the file specified.,Source=Microsoft.DataTransfer.Connectors.LakehouseTableConnector,'"
All research indicates this is a Fabric platform issue and nothing can be done from my end. Therefore I am forced to use the 'Append' table action.
When I run the pipeline after setting it up:
- It runs for the first time, data is ingested, and everything looks good.
- I make some test changes to historical data that has already been ingested into the Lakehouse tables.
- I run the pipeline a second time, and those changes are added as a new row, with the original row remaining present.
- As a result I end up with one additional, accurate row plus a previous row of stale, outdated and inaccurate data.
I cannot seem to find a way to make this work. I have spent three solid days trying to tinker with things. There is minimal documentation, and it often refers to articles from Data Factory prior to it being rolled into Fabric. YouTube videos are never relevant; they simply cover scenarios where only additional rows are added, or people working with .CSV files. I need to find articles that are relevant to actual enterprise-level scenarios, as opposed to .CSV files with a few movies listed in a small handful of rows and data that remains completely static.
My scenario is:
1) I have on-premises MS SQL servers that are connected to my Fabric workspace via a data gateway.
2) I am using SQL SELECT queries to extract only particular columns from my SQL tables.
3) I have a data warehouse that holds my watermark and metadata control tables.
4) I have a Lakehouse where my ingested data will sit.
5) I am trying to set this up with a data pipeline (avoiding Dataflow Gen2s in order to have as little resource impact on my source server as possible).
6) I do not want to overwrite the tables I ingest every time a schedule runs, and since I am seemingly unable to use the Upsert preview feature at the moment, I am forced to use the 'Append' table action.
I appear to need some SQL MERGE logic; however, as you are aware, the Lakehouse SQL endpoint is read-only, so as far as I can tell this is not really an option? I am aware that using a Warehouse would be a workaround for this, however I need my data in a Lakehouse so that I can leverage ML features down the track.
I am very new to this, and I am getting insanely frustrated trying to make sense of it all. I have spent week upon week making tiny amounts of progress. This solution is supposedly designed to make things easier for the end user, but I cannot see how it does. Any help would be appreciated; as you can likely tell, I am getting a little frustrated 😞
* Just to add, I am also aware of CDC implementation, however it would require my on-prem SQL server to become public facing and this cannot happen, so it is unfortunately not an option.
Hi @michaelgambling,
May I ask if you have resolved this issue? If so, can you please share the resolution steps here? This will help other community members with similar problems solve them faster.
If we don't hear back, we'll go ahead and close this thread. For any further discussions or questions, please start a new thread in the Microsoft Fabric Community Forum and we'll be happy to assist.
Thank you for being part of the Microsoft Fabric Community.
Hi @michaelgambling,
We would like to confirm whether our community member's answer resolves your query or if you need further help. If you still have any questions or need more support, please feel free to let us know. We are happy to help you.
Thank you for your patience; we look forward to hearing from you.
Best Regards,
Prashanth Are
@michaelgambling glad you found a workaround, though I am curious whether you have considered using a mirror for your on-prem SQL database and CDC in your data pipeline. I take it that your upsert strategy is aiming at an incremental data load, which technically should be easier to implement with CDC, as it is supposed to be natively supported for SQL databases.
Yes, I was aware of this setup, but unfortunately I believe I would need to expose my on-prem SQL servers to the public to use this method (from my understanding), which is not an option 😞
I believe that exposing the on-prem database to the public network is not necessary, as mirroring is supposed to work via the on-premises data gateway: Microsoft Fabric Mirrored Databases From SQL Server
Thanks to all who have written. Just to provide an update: I did use some merge logic, as @smeetsh recommended.
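For anyone finding this later, the remaining piece of the watermark pattern is advancing the stored watermark once the merge has succeeded. A minimal sketch of that step in a notebook cell (table and column names are just placeholders) could look like:

%%sql
-- Work out the next watermark from the staging data that was just merged (placeholder names)
SELECT MAX(ModifiedDate) AS NewWatermark
FROM mydata_raw;

The resulting value can then be written back to the watermark control table in the warehouse, for example from the pipeline with a Stored procedure or Script activity, so that the next run only pulls rows newer than it.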
Hi @michaelgambling,
We would like to confirm whether our community member's answer resolves your query or if you need further help. If you still have any questions or need more support, please feel free to let us know. We are happy to help you.
Thanks,
Prashanth
MS Fabric community support
I am having the same issue, consuming data from Oracle with Upsert selected:
ErrorCode=FailedToUpsertDataIntoDeltaTable,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Hit an error when upsert data to table in Lakehouse. Error message: Could not load file or assembly 'System.Linq.Async, Version=6.0.0.0, Culture=neutral, PublicKeyToken=94bc3704cddfc263' or one of its dependencies. The system cannot find the file specified.,Source=Microsoft.DataTransfer.Connectors.LakehouseTableConnector,''Type=System.IO.FileNotFoundException,Message=Could not load file or assembly 'System.Linq.Async, Version=6.0.0.0, Culture=neutral, PublicKeyToken=94bc3704cddfc263' or one of its dependencies. The system cannot find the file specified.,Source=Microsoft.DataTransfer.Connectors.LakehouseTableConnector,'
The exact same error.
Are you going via an on-premises data gateway? If so, it appears there is a known issue that is actively being worked on: https://support.fabric.microsoft.com/known-issues/
Hope this helps answer some questions.
The exact same thing is happening to me; in my case I am consuming data from an Oracle database.
When I run the pipeline the first time, it creates the table, but when I select Upsert and change my initial SQL to bring just the latest changes, I get: ErrorCode=FailedToUpsertDataIntoDeltaTable,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Hit an error when upsert data to table in Lakehouse. Error message: Could not load file or assembly 'System.Linq.Async, Version=6.0.0.0, Culture=neutral, PublicKeyToken=94bc3704cddfc263' or one of its dependencies. The system cannot find the file specified.,Source=Microsoft.DataTransfer.Connectors.LakehouseTableConnector,''Type=System.IO.FileNotFoundException,Message=Could not load file or assembly 'System.Linq.Async, Version=6.0.0.0, Culture=neutral, PublicKeyToken=94bc3704cddfc263' or one of its dependencies. The system cannot find the file specified.,Source=Microsoft.DataTransfer.Connectors.LakehouseTableConnector,'
Is it an option for you to use SQL and the gateway connector to get your data from the on-prem SQL server and load it into what I would call a raw table (for instance tablename: mydata_raw), and then use a merge in a notebook to merge your raw data with your existing data? Since you only ingest specific data from your SQL server, the overwrite of your "mydata_raw" table should not impact CUs too much.
The notebook code would look something like the below ("name" is just an example column):
%%sql
-- Merge the freshly loaded raw table into the target table:
-- update rows that already exist, insert rows that are new.
MERGE INTO mydata AS target
USING mydata_raw AS source
ON target.id = source.id          -- match on the key column
WHEN MATCHED THEN
  UPDATE SET target.name = source.name
WHEN NOT MATCHED THEN
  INSERT (id, name) VALUES (source.id, source.name);
Cheers
Hans