Hi All,
Pulling my teeth out a bit trying to get my head around data ingestion.
I am new to Fabric and data engineering in general. I am trying to set up a metadata-driven pipeline that implements a 'watermark strategy'.
I have many questions on this, but my first one is this: in many YouTube videos and documentation articles I have read, people suggest using a date/time column as your 'watermark'.
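As I understand it, the pattern they describe is roughly the following (a simplified sketch; the table, column, and watermark values below are just placeholders, not my actual schema):

-- 1) Look up the last stored watermark for this table in the control table:
SELECT LastWatermark FROM dbo.WatermarkControl WHERE TableName = 'SalesOrder';

-- 2) Use that value in the source query so only rows changed since the last run are copied:
SELECT OrderID, CustomerID, Amount, ModifiedDate
FROM dbo.SalesOrder
WHERE ModifiedDate > '2024-01-01T00:00:00';  -- replaced each run with the stored watermark

-- 3) After a successful copy, update the control table with MAX(ModifiedDate) from this run.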
I have tried ingesting a table with a nominated column, but it never works. When I try to use the 'Upsert' feature it always results in this error:
"ErrorCode=FailedToUpsertDataIntoDeltaTable,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Hit an error when upsert data to table in Lakehouse. Error message: Could not load file or assembly 'System.Linq.Async, Version=6.0.0.0, Culture=neutral, PublicKeyToken=94bc3704cddfc263' or one of its dependencies. The system cannot find the file specified.,Source=Microsoft.DataTransfer.Connectors.LakehouseTableConnector,''Type=System.IO.FileNotFoundException,Message=Could not load file or assembly 'System.Linq.Async, Version=6.0.0.0, Culture=neutral, PublicKeyToken=94bc3704cddfc263' or one of its dependencies. The system cannot find the file specified.,Source=Microsoft.DataTransfer.Connectors.LakehouseTableConnector,'"
All research indicates this is a Fabric platform issue and nothing can be done from my end. Therefore I am forced to use the 'Append' table action.
When I run the pipeline after setting it up:
- It runs for the first time, data is ingested, and everything looks good.
- I make some test changes to historical data that has already been ingested into the Lakehouse tables.
- I run the pipeline a second time, and those changes are added as a new row, with the original row remaining present.
- As a result I end up with one additional, accurate row plus a previous row of stale, outdated and inaccurate data.
I cannot seem to find a way to make this work. I have spent three solid days trying to tinker with things. There is minimal documentation, and it often refers to articles from Data Factory prior to it being rolled into Fabric. YouTube videos are never relevant; they simply cover scenarios where only additional rows are added, or people working with .CSV files. I need to find articles that are relevant to actual enterprise-level scenarios, as opposed to .CSV files with a few movies listed in a small handful of rows and data that remains completely static.
My scenario is:
1) I have on-premises MS SQL servers that are connected to my Fabric workspace via a data gateway.
2) I am using SQL SELECT queries to extract only particular columns from my SQL tables.
3) I have a data warehouse that holds my watermark and metadata control tables.
4) I have a Lakehouse where my ingested data will sit.
5) I am trying to set this up with a data pipeline (avoiding Dataflow Gen2s in order to have as little resource impact on my source server as possible).
6) I do not want to overwrite the tables I ingest every time a schedule runs, and since I am seemingly unable to use the Upsert preview feature at the moment, I am forced to use the 'Append' table action.
I appear to need some SQL MERGE logic; however, as you are aware, the Lakehouse SQL endpoint is read-only, so as far as I can tell this is not really an option? I am aware that using a Warehouse would be a workaround for this, however I need my data in a Lakehouse so that I can leverage ML features down the track.
I am very new to this, and I am getting insanely frustrated trying to make sense of it all. I have spent week upon week making tiny amounts of progress. This solution is supposedly designed to make things easier for the end user, but I cannot see how it does. Any help would be appreciated; as you can likely tell, I am getting a little frustrated 😞
* Just to add, I am also aware of CDC implementation, however it would require my on-prem SQL server to become public facing and this cannot happen, so it is unfortunately not an option.
Hi @michaelgambling,
May I ask if you have resolved this issue? If so, can you please share the resolution steps here? This will help other community members with similar problems solve them faster.
If we don't hear back, we'll go ahead and close this thread. For any further discussions or questions, please start a new thread in the Microsoft Fabric Community Forum and we'll be happy to assist.
Thank you for being part of the Microsoft Fabric Community.
Hi @michaelgambling,
We would like to confirm whether our community member's answer resolves your query or if you need further help. If you still have any questions or need more support, please feel free to let us know. We are happy to help you.
Thank you for your patience; we look forward to hearing from you.
Best Regards,
Prashanth Are
@michaelgambling glad you found a workaround, though I am curious whether you have considered using a mirror for your on-prem SQL database and CDC in your data pipeline. I take it that your upsert strategy is aiming at an incremental data load, which technically should be easier to implement with CDC, as it is supposed to be natively supported for SQL databases.
Yes, I was aware of this setup, but unfortunately I believe I would need to expose my on-prem SQL servers to the public to use this method (from my understanding), which is not an option 😞
I believe that exposing the on-prem database to the public network is not necessary, as mirroring is supposed to work via the on-premises data gateway: Microsoft Fabric Mirrored Databases From SQL Server
Thanks to all who have written. Just to provide an update: I did use some merge logic, as @smeetsh recommended.
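For anyone finding this later, the remaining piece of the watermark pattern is advancing the stored watermark once the merge has succeeded. A minimal sketch of that step in a notebook cell (table and column names are just placeholders) could look like:

%%sql
-- Work out the next watermark from the staging data that was just merged (placeholder names)
SELECT MAX(ModifiedDate) AS NewWatermark
FROM mydata_raw;

The resulting value can then be written back to the watermark control table in the warehouse, for example from the pipeline with a Stored procedure or Script activity, so that the next run only pulls rows newer than it.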
Hi @michaelgambling,
We would like to confirm whether our community member's answer resolves your query or if you need further help. If you still have any questions or need more support, please feel free to let us know. We are happy to help you.
Thanks,
Prashanth
MS Fabric community support
I am having the same issue, consuming data from Oracle with Upsert selected:
ErrorCode=FailedToUpsertDataIntoDeltaTable,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Hit an error when upsert data to table in Lakehouse. Error message: Could not load file or assembly 'System.Linq.Async, Version=6.0.0.0, Culture=neutral, PublicKeyToken=94bc3704cddfc263' or one of its dependencies. The system cannot find the file specified.,Source=Microsoft.DataTransfer.Connectors.LakehouseTableConnector,''Type=System.IO.FileNotFoundException,Message=Could not load file or assembly 'System.Linq.Async, Version=6.0.0.0, Culture=neutral, PublicKeyToken=94bc3704cddfc263' or one of its dependencies. The system cannot find the file specified.,Source=Microsoft.DataTransfer.Connectors.LakehouseTableConnector,'
The exact same error.
Are you going via an on-premises data gateway? If so, it appears there is a known issue that is actively being worked on: https://support.fabric.microsoft.com/known-issues/
Hope this helps answer some questions.
The exact same thing is happening to me; in my case I am consuming data from an Oracle database.
When I run the pipeline the first time, it creates the table, but when I select Upsert and change my initial SQL to bring just the latest changes, I get: ErrorCode=FailedToUpsertDataIntoDeltaTable,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Hit an error when upsert data to table in Lakehouse. Error message: Could not load file or assembly 'System.Linq.Async, Version=6.0.0.0, Culture=neutral, PublicKeyToken=94bc3704cddfc263' or one of its dependencies. The system cannot find the file specified.,Source=Microsoft.DataTransfer.Connectors.LakehouseTableConnector,''Type=System.IO.FileNotFoundException,Message=Could not load file or assembly 'System.Linq.Async, Version=6.0.0.0, Culture=neutral, PublicKeyToken=94bc3704cddfc263' or one of its dependencies. The system cannot find the file specified.,Source=Microsoft.DataTransfer.Connectors.LakehouseTableConnector,'
Is it an option for you to use SQL and the gateway connector to get your data from the on-prem SQL server and load it into what I would call a raw table (for instance tablename: mydata_raw), and then use a merge in a notebook to merge your raw data with your existing data? Since you only ingest specific data from your SQL server, the overwrite of your "mydata_raw" table should not impact CUs too much.
The notebook code would look something like the below ("name" is just an example column):
%%sql
-- Merge the freshly loaded raw table into the target table:
-- update rows that already exist, insert rows that are new.
MERGE INTO mydata AS target
USING mydata_raw AS source
ON target.id = source.id          -- match on the key column
WHEN MATCHED THEN
  UPDATE SET target.name = source.name
WHEN NOT MATCHED THEN
  INSERT (id, name) VALUES (source.id, source.name);
Cheers
Hans