The ultimate Microsoft Fabric, Power BI, Azure AI, and SQL learning event: Join us in Stockholm, September 24-27, 2024.
Save €200 with code MSCUST on top of early bird pricing!
Find everything you need to get certified on Fabric—skills challenges, live sessions, exam prep, role guidance, and more. Get started
Hi all,
I have a problem again with dataflow gen2.
I'm getting my data from on-prem sql server with dataflow, and because there is no Incremental refresh on gen2, I set standard scheduled refresh daily on dataflow.
But I realized that this refresh duplicates data on the lakehouse table.
Is it a bug or is there any option to prevent this I couldn't see?
Thanks
Solved! Go to Solution.
Hi!
My name is Jeroen Luitwieler, and I’m a Senior Product Manager on the Dataflow team. While this guide shows how to use an OData source, it’s just an example. You can use this method with your on-premises SQL Server without any issues. If you want to perform a full refresh of your data every time you refresh, you can change the current data destination setting from “append” to “replace”.
Hope this resolves your issue! 😁
Incremental updates is similar to implementing SCD type 2. This might be useful:
Hi @BeyzaKzlky
Thanks for using Fabric Community.
As I understand the above question, you are trying to achieve incremental load using Dataflow Gen2.
Incrementally amassing data in a data destination requires a technique to load only new or updated data into your data destination. This technique can be done by using a query to filter the data based on the data destination.
Please refer to this link for more information: https://learn.microsoft.com/en-us/fabric/data-factory/tutorial-setup-incremental-refresh-with-datafl...
Hope this helps. Please let us know if you have any further queries.
Hi @v-nikhilan-msft ,
Thanks for your response. I have seen the learning page which you have shared with me, but it's OData source. Unfortunately not proper for my case.
I'm getting data from on-prem sql server.
And also there is a data duplication problem with dataflow like I have mentioned above.
Hi!
My name is Jeroen Luitwieler, and I’m a Senior Product Manager on the Dataflow team. While this guide shows how to use an OData source, it’s just an example. You can use this method with your on-premises SQL Server without any issues. If you want to perform a full refresh of your data every time you refresh, you can change the current data destination setting from “append” to “replace”.
Hope this resolves your issue! 😁
I can confirm that it works - with that example i made incremental refreshes from on prem sql server: query warehouse for max date (or entry no, if Your tables have them) and filter source sql server, check that it works correctly and query folding works. I have also implemented pipelines and stored procedures to delete some data from warehouse and refresh a week or a month old of data.
Hi,
I double checked your metioned setting this morning. I'm already selecting replace for tables.
Have a nice day.
Hi @BeyzaKzlky
Apologies for the issue you have been facing. Are you still facing this issue?
If yes we will respond back with more details and try to help.
Thanks.
Hi,
Thank you very much for your response.
Yes I know the option which you have mentioned, but we cannot see that option since yesterday, also data destination selection is greyed out like below:
Also, we set scheduled refresh from the settings of dataflow:
and also I want to mention any other problem scheduled refresh is not triggered randomly.
Is there any limit to show data on data transformation in dataflow? It cuts the operation while merging (inner join) multiple tables after 2 million rows.
I will try the mentioned example.
Thanks for your help.
Hi @BeyzaKzlky !
It is correct what @fabricator1 mentioned and if a data destination is set you can edit it in the left bottom corner. However, the issue you are facing with not triggering a refresh if it was scheduled is quite interesting. Can you create a support ticket when you face this issue again so that our support team can look into it? There should also be no row limit for the operations that you are perfroming. You may want to consider to split the ingestion of the data and the merge in two separate queries so that the lakehouse/warehouse compute can engage and help you perform the inner join faster.
Add data destination is greyed out if the query already has a destination.
It is possible to edit the Destination settings by clicking on the settings icon (cog wheel) in the bottom right corner of the screen:
I don't have information to answer about the other questions, someone else answer please 🙂
I think you can use that link from @v-nikhilan-msft also with on-prem sql server.
That procedure is only to append rows to the Lakehouse table.
Not update (change) existing data in the Lakehouse table.
So it's called Incrementally amass data.
I think the main thinking behind that procedure is that you connect your Dataflow Gen2 to your Lakehouse table, to find out what is the newest row in your Lakehouse table. Then you use that information to filter your on-prem sql server data, so you ask the sql server to only return rows which are newer than your newest row in the Lakehouse. Then you append this new data from sql server into the Lakehouse.
When using that procedure, it shouldn't become duplicate rows, because your query to the SQL server is only asking for new rows.
There is an idea for Incremental refresh (which includes both append and updates to existing data), please vote:
Microsoft Idea
Join the community in Stockholm for expert Microsoft Fabric learning including a very exciting keynote from Arun Ulag, Corporate Vice President, Azure Data.
Check out the August 2024 Fabric update to learn about new features.
User | Count |
---|---|
3 | |
2 | |
2 | |
2 | |
1 |
User | Count |
---|---|
7 | |
4 | |
3 | |
3 | |
2 |